The Blueprint

Non-fiction

Researchers at Meta Used Internet Videos to Build a “Robot Brain”

By Sophia Wang

April 19, 2026

At present, each new robot design requires a separate training dataset: learning material customized for its shape and tasks. These datasets are considered foundational to the development of robotics and AI. Smart navigators need one dataset; chatbots need another. It also means that a navigation model cannot control a robotic arm, and a robotic arm’s model cannot generate text. Without a way for models to generalize across tasks, training a single multi-purpose model can cost millions of dollars and years of work.

These two challenges are known as scalability and transferability, respectively. Quentin Garrido, a researcher at FAIR at Meta, and his team sought to address them by developing a model capable of “predicting the consequences of [its] actions” (Garrido et al. 2026). The team is working towards a single model, trained on a single dataset, that can generalize its knowledge across multiple domains. To get there, they turned to an existing ocean of data that can show a model motion, dogs, speech, and so much more: the Internet.

Choosing to use Internet videos meant that Garrido had to deal with a problem no other research team wanted to handle: teaching a model to understand the chaos and variety of online video. Videos on the Internet lack a set format. Content creators may film with a stationary camera, a moving camera, or even a body cam, and the subject matter varies just as widely. Clearing this hurdle alone would grant the entire AI industry a massive influx of new training data to scale its models on.

To help the model find the focus within these chaotic Internet videos, Garrido developed information regularizers, which carve out a region of interest for the model to concentrate on. Instead of overwhelming the model with the swing of a bat, the shattering of glass, the spilling of water, and a man’s shout all at once, the regularizer points the model specifically at the impact of a bat swung at a cup. In this way, the model observes frame by frame how movement and contact between the bat and cup result in shattered glass.
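
One common way to realize this kind of capacity limit, sketched below in PyTorch, is to add a penalty on the latent action itself so that it can only carry the dominant change rather than the whole scene. The function and parameter names here are illustrative assumptions; the paper’s actual regularizers may be formulated quite differently.

    import torch

    # A hedged sketch of an information regularizer: penalizing the size
    # of the latent action limits how much of the scene it can encode,
    # nudging the model to keep the dominant change (bat meets cup) and
    # drop the background clutter (the shout, the spill).
    def regularized_loss(pred_next, true_next, latent_action, beta=0.01):
        # Prediction term: how close is the predicted next frame to reality?
        prediction = torch.nn.functional.mse_loss(pred_next, true_next)
        # Information penalty: an L2 cost caps the latent action's
        # "bandwidth" so it cannot memorize everything in the frame.
        info_penalty = latent_action.pow(2).mean()
        return prediction + beta * info_penalty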

Contrary to common practice, Garrido chose not to label these videos, relying on a special kind of model instead. Labeling training data is the equivalent of holding a picture of a horse in front of the model and saying “horse” a thousand times. Besides being inefficient for the teacher, this approach has other drawbacks: it fails to help the model identify a horse in a more cluttered frame, and it makes it harder for the model to generalize its “horse knowledge” to similar-looking animals such as ponies and donkeys.

Garrido and his team chose a Latent Action Model (LAM) to sidestep this problem, because a LAM doesn’t need to be “told” there’s a horse through a labeled image; it “observes” a large range of horses and generalizes across them. When Garrido feeds the LAM two consecutive video frames (milliseconds apart), the LAM first attempts to determine the source of change between the two frames, known as the latent, or hidden, action. Using this latent action, the LAM then predicts the next possible frame: essentially, the consequence of the latent action.
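
The two-step loop described above (infer the hidden action between frames, then predict its consequence) can be sketched in a few lines of PyTorch. The module sizes and names below are illustrative assumptions, not the architecture from Garrido et al.

    import torch
    import torch.nn as nn

    class LatentActionModel(nn.Module):
        def __init__(self, frame_dim=512, action_dim=32):
            super().__init__()
            # Inverse model: infers the hidden ("latent") action that
            # explains the change between two consecutive frames.
            self.infer_action = nn.Sequential(
                nn.Linear(2 * frame_dim, 256), nn.ReLU(),
                nn.Linear(256, action_dim),
            )
            # Forward model: predicts the next frame embedding from the
            # current frame and the inferred latent action.
            self.predict_next = nn.Sequential(
                nn.Linear(frame_dim + action_dim, 256), nn.ReLU(),
                nn.Linear(256, frame_dim),
            )

        def forward(self, frame_t, frame_t1):
            # Step 1: which latent action turned frame_t into frame_t1?
            action = self.infer_action(torch.cat([frame_t, frame_t1], dim=-1))
            # Step 2: given that action, what should the next frame be?
            pred_t1 = self.predict_next(torch.cat([frame_t, action], dim=-1))
            return action, pred_t1

    model = LatentActionModel()
    frame_t, frame_t1 = torch.randn(8, 512), torch.randn(8, 512)  # frame embeddings
    action, pred_t1 = model(frame_t, frame_t1)
    loss = nn.functional.mse_loss(pred_t1, frame_t1)  # learn by predicting the next frame

No labels appear anywhere in this loop: the only training signal is how well the model predicts what it sees next.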

At the end of training, the LAM develops what researchers call a latent action space: a conceptual understanding of how actions produce change in the environment. Where other models may describe an arm throw simply as “throwing,” Garrido’s LAM can differentiate “chucking,” “passing,” and “lobbing.” The model has gained a more nuanced perception of movement than any other AI model today, and it is this core understanding that carries over to different applications, such as robotic manipulation and navigation. Having mastered the inner workings of movement, the LAM requires only a small amount of task-specific training on top to pick up a new skill.
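
Here is a minimal sketch of that transfer step, under the assumption that the pretrained LAM stays frozen and only a small adapter is trained: a tiny task-specific head maps latent actions to a robot’s real control commands. The class name and dimensions are hypothetical.

    import torch
    import torch.nn as nn

    class RobotArmHead(nn.Module):
        """Small task-specific adapter trained on a little robot data."""
        def __init__(self, action_dim=32, command_dim=7):  # e.g. a 7-joint arm
            super().__init__()
            self.to_command = nn.Linear(action_dim, command_dim)

        def forward(self, latent_action):
            # Translate the LAM's abstract action into concrete joint commands.
            return self.to_command(latent_action)

    # The pretrained LAM's weights stay frozen; only this head's few
    # parameters are updated, which is why adapting to a new robot
    # needs so little task-specific data.
    head = RobotArmHead()
    latent_action = torch.randn(8, 32)   # latent actions from the frozen LAM
    commands = head(latent_action)       # commands for the robotic arm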

The strength of the LAM is its versatility. Garrido and his team tested the LAM on the DROID robotic-arm dataset and the RECON navigation dataset, comparing its performance with specialized models. While the LAM was “not on par with” the top model for each task (such as V-JEPA 2 and NWM), Garrido observed that it achieved performance similar to comparable models (V-JEPA 2-AC). That Garrido’s LAM was trained entirely on Internet videos is a testament to the latent action space’s transferability. Possessing one such model would be the equivalent of having multiple specialized models.

Garrido had set out to tackle a problem the rest of the field had avoided, and in doing so built something that transcends any single robot or task. Perhaps, in the future, one could simply carry such a model around, ready to be deployed onto any robot or electronic device nearby to assist with precision and expertise.

The model that Garrido and his colleagues developed provides a stepping stone toward solving the issues of scalability and transferability. For this LAM, no video is incompatible or too expensive to label. Instead, every video reinforces the LAM’s physical understanding of the world. With that intuition, Garrido and his team’s model is ready for any task, transferring its understanding into practical action.


Works Cited

Garrido, Q., Nagarajan, T., Terver, B., Ballas, N., LeCun, Y., & Rabbat, M. (2026). Learning Latent Action World Models In The Wild. https://arxiv.org/abs/2601.05230