Google DeepMind’s robotics team is teaching robots to learn the way a new human hire would: by watching a video. In a new paper, the team demonstrates how Google’s RT-2 robots, equipped with the Gemini 1.5 Pro generative AI model, can absorb information from video to learn how to navigate a space and even carry out commands once they arrive at their destination.
The Gemini 1.5 Pro model’s long context window, which lets the AI process large amounts of information at once, is what makes it possible to onboard a robot like a new employee. The researchers film a video tour of a designated area, such as a home or office; the robot then watches the video and learns about its surroundings.
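For readers curious what “watching a video tour” looks like in code, here is a rough sketch using Google’s public google-generativeai Python SDK. This is not DeepMind’s robot stack; the file name, prompt, and question are placeholders of our own, but the upload-then-ask flow shows how a long-context multimodal model can take in an entire tour video and answer questions about the space.

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the tour footage via the File API; video must finish
# server-side processing before it can be used in a prompt.
tour = genai.upload_file(path="office_tour.mp4")  # placeholder file name
while tour.state.name == "PROCESSING":
    time.sleep(5)
    tour = genai.get_file(tour.name)

model = genai.GenerativeModel("gemini-1.5-pro")

# Ask a navigation-style question grounded in the tour video.
response = model.generate_content([
    tour,
    "You have just watched this tour of the building. "
    "Where would you go to find a whiteboard, and how would you "
    "get there from the main entrance?",
])
print(response.text)
```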
Armed with the details from those video tours, the robot can complete tasks based on its learned knowledge of the space, responding with both verbal and visual output. It’s an impressive demonstration of robots interacting with their environment in ways that mimic human behavior. You can see how it works in the video below, along with examples of the different tasks the robot can perform.
Limited context length makes it challenging for many AI models to remember environments. 🌐 Equipped with 1.5 Pro’s 1 million tokens of context length, our robots can use human-like instructions, video tours, and common sense to successfully navigate a space. pic.twitter.com/eIQbtjHCbW
July 11, 2024
Robot AI expertise
These demonstrations aren’t just one-off flukes, either. In practical tests, Gemini-powered robots operated within a 9,000-square-foot area and successfully followed more than 50 different user instructions with a 90 percent success rate. That level of accuracy opens up many potential applications for AI-powered robots in the real world, from simple chores around the home to more complex tasks at work.
That’s partly because one of the most notable aspects of the Gemini 1.5 Pro-powered system is its ability to handle multi-step tasks. DeepMind’s research found that the robots can work out how to answer a question, such as whether a specific drink is available, by navigating to a refrigerator, visually processing what’s inside, and then returning to report the answer.
Planning and executing that full sequence of actions demonstrates a level of understanding that goes well beyond the single-step commands most robots are limited to today.
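DeepMind hasn’t published its robot-control code, so the following is only an illustrative sketch of the plan-then-act idea. The Robot class and its methods are entirely hypothetical stand-ins, and the JSON-plan prompt is our own; only the Gemini SDK calls are real.

```python
import json
import google.generativeai as genai

class Robot:
    """Hypothetical stand-in for a real robot control stack (not a public API)."""
    def navigate_to(self, location: str) -> None:
        print(f"[robot] driving to {location}")
    def capture_image(self) -> bytes:
        # In a real system this would come from the onboard camera.
        return open("fridge.jpg", "rb").read()  # placeholder image
    def speak(self, text: str) -> None:
        print(f"[robot] says: {text}")

model = genai.GenerativeModel("gemini-1.5-pro")
robot = Robot()
answer = None

# Step 1: ask the model to plan an ordered sequence of actions.
plan = model.generate_content(
    'Plan the steps to answer: "Is there any cola left?" '
    'Reply only with a JSON list of actions, each {"action": ..., "target": ...}. '
    'Allowed actions: navigate_to, capture_image, speak.'
)
steps = json.loads(plan.text)

# Step 2: execute each step, feeding camera images back to the model when needed.
for step in steps:
    if step["action"] == "navigate_to":
        robot.navigate_to(step["target"])
    elif step["action"] == "capture_image":
        image = robot.capture_image()
        answer = model.generate_content([
            {"mime_type": "image/jpeg", "data": image},
            "Based on this photo of the fridge, is there any cola left?",
        ])
    elif step["action"] == "speak":
        robot.speak(answer.text if answer else step.get("target", ""))
```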
Don’t expect this robot to be on sale anytime soon, though. For one thing, it takes up to 30 seconds to process each instruction, which is much slower than doing it yourself in most cases. The chaos of real homes and offices will be much harder for a robot to navigate than a controlled environment, no matter how advanced its AI model.
Still, integrating AI models like Gemini 1.5 Pro into robotics is part of a larger leap forward in the field. Robots equipped with models like Gemini or its rivals could transform healthcare, shipping, and even cleaning tasks.