11 points

"Instead of writing captions, the team asked annotators to record 60- to 90-second verbal descriptions answering a list of questions about each image. They then transcribed the descriptions, which often stretched across several pages, and used other large language models to clean up, crunch down, and standardize them."

So those other LLMs are needed to train this one?

1 point

And a modern calculator has more computing power than the Apollo program… This is how tech works.

41 points

This reads like an ad. They claim to use 1000 times less data than proprietary models, except nobody knows how much data those proprietary models use or how big they actually are. Also, there's a giant asterisk here they fail to mention: Molmo outperforms the competition on visual benchmarks, not actual text chat.

14 points

Daaaang, Apache license AND open dataset + training tools.

68 points

"This kind of skill might help developers build AI agents that identify buttons or fields on a webpage to handle tasks like making a reservation at a restaurant."

… to improve the efficiency of click farms and to bypass CAPTCHAs.


Technology

!technology@lemmy.world

