Meta’s new AI model learns by watching videos

Meta’s AI researchers have released a new model that’s trained in much the same way as today’s large language models, but instead of learning from words, it learns from video.

Yann LeCun, who leads Meta’s FAIR (foundational AI research) group, has argued over the past year that children learn about the world so quickly because they take in enormous amounts of information through their eyes and ears. They learn what things in the world are called and how they work together. Current large language models (LLMs), such as OpenAI’s GPT-4 or Meta’s own Llama models, learn mainly by processing language: they try to learn about the world as it’s described on the internet. And that, LeCun argues, is why current LLMs aren’t moving very quickly toward artificial general intelligence (where AI is generally smarter than humans).

LLMs are typically trained on vast numbers of sentences or phrases in which some of the words are masked, forcing the model to find the best words to fill in the blanks. In doing so, the model learns which words are statistically most likely to come next in a sequence, and it gradually picks up a rudimentary sense of how the world works. It learns, for example, that when a car drives off a cliff it doesn’t just hang in the air—it drops very quickly to the rocks below.
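To make that masking idea concrete, here is a minimal, purely illustrative sketch in PyTorch. The tiny vocabulary, the single sentence, and the bag-of-words model are all invented for this example; real LLMs use transformer networks trained on internet-scale text, but the fill-in-the-blank objective is the same.

# Toy sketch of masked-word prediction (not Meta's code): a tiny model learns
# to fill in a blanked-out word from the rest of the sentence.
import torch
import torch.nn as nn

vocab = ["<mask>", "the", "car", "drives", "off", "a", "cliff", "and", "falls"]
tok = {w: i for i, w in enumerate(vocab)}

sentence = ["the", "car", "drives", "off", "a", "cliff", "and", "falls"]
ids = torch.tensor([tok[w] for w in sentence])

class TinyMaskedLM(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, x):
        # Average the embeddings of the partially masked sentence as "context",
        # then score every word in the vocabulary for the hidden slot.
        ctx = self.embed(x).mean(dim=0)
        return self.out(ctx)

model = TinyMaskedLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    pos = step % len(sentence)      # pick a position to hide
    masked = ids.clone()
    masked[pos] = tok["<mask>"]     # blank out the true word
    loss = loss_fn(model(masked).unsqueeze(0), ids[pos].unsqueeze(0))
    opt.zero_grad()
    loss.backward()
    opt.step()

Scaled up by many orders of magnitude, this fill-in-the-blank game is where an LLM’s statistical sense of “what comes next” comes from.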

LeCun believes that if LLMs and other AI models could use the same masking technique, but on video footage, they could learn more like babies do. LeCun’s new baby, and the embodiment of his theory, is a research model called Video Joint Embedding Predictive Architecture (V-JEPA). It learns by processing unlabeled video and figuring out what probably happened in a certain part of the screen during the few seconds it was blacked out.
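In very rough terms, that recipe can be sketched as below. This is a toy simplification, not the published V-JEPA architecture: the linear “encoders,” the random stand-in video patches, and the untrained target encoder are all assumptions made for illustration. The point it shows is that the model predicts the hidden region’s embedding rather than reconstructing its raw pixels.

# Highly simplified JEPA-style sketch (not the released V-JEPA): a predictor is
# trained to match the *embedding* of a masked chunk of video, not its pixels.
import torch
import torch.nn as nn

patch_dim, embed_dim, num_patches = 3 * 16 * 16, 64, 10   # toy sizes

encoder = nn.Linear(patch_dim, embed_dim)         # encodes the visible patches
target_encoder = nn.Linear(patch_dim, embed_dim)  # encodes the hidden patches (no gradients)
predictor = nn.Linear(embed_dim, embed_dim)       # guesses the hidden patches' embeddings
opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

for step in range(100):
    video_patches = torch.randn(num_patches, patch_dim)   # stand-in for real video patches
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[3:6] = True                                       # hide a contiguous span

    context = encoder(video_patches[~mask]).mean(dim=0)    # summary of what is visible
    with torch.no_grad():
        targets = target_encoder(video_patches[mask])      # hidden region, in embedding space

    preds = predictor(context).expand_as(targets)          # predict it from the visible context
    loss = nn.functional.mse_loss(preds, targets)
    opt.zero_grad()
    loss.backward()
    opt.step()

Predicting in embedding space rather than pixel space is what separates this family of joint embedding predictive models from generative video models.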

“V-JEPA is a step toward a more grounded understanding of the world so machines can achieve more generalized reasoning and planning,” said LeCun in a statement.

Note that V-JEPA isn’t a generative model. It doesn’t answer questions by generating video, but rather by describing concepts, like the relationship between two real-world objects. The Meta researchers say that V-JEPA, after pretraining using video masking, “excels at detecting and understanding highly detailed interactions between objects.”

Meta’s next step after V-JEPA is to add audio to the video, which would give the model a whole new dimension of data to learn from—just like a child watching a muted TV and then turning the sound up. The child would not only see how objects move but also hear people talking about them, for example. A model pretrained this way might learn that after a car speeds off a cliff, it not only rushes toward the ground but also makes a big sound upon landing.

“Our goal is to build advanced machine intelligence that can learn more like humans do,” LeCun said, “forming internal models of the world around them to learn, adapt, and forge plans efficiently in the service of completing complex tasks.”

The research could have big implications for both Meta and the broader AI ecosystem.

Meta has talked before about a “world model” in the context of its work on augmented reality glasses. The glasses would use such a model as the brain of an AI assistant that would, among other things, anticipate what digital content to show the user to help them get things done and have more fun. The model would, out of the box, have an audio-visual understanding of the world outside the glasses, but could then learn very quickly about the unique features of a user’s world through the device’s cameras and microphones.

V-JEPA might also lead to a change in the way AI models are trained, full stop. Current pretraining methods for foundation models require massive amounts of time and compute power (which has ecological implications), so at the moment developing foundation models is reserved for the rich. With more efficient training methods, that could change: smaller developers might be able to train larger and more capable models as training costs fell. That would also fit Meta’s strategy of releasing much of its research as open source rather than protecting it as valuable IP, as OpenAI and others do.

Meta says it’s releasing the V-JEPA model under a Creative Commons noncommercial license so that researchers can experiment with it and perhaps expand its capabilities.

https://www.fastcompany.com/91029951/meta-v-jepa-yann-lecun?partner=rss&utm_source=rss&utm_medium=feed&utm_campaign=rss+fastcompany&utm_content=rss
