Forget Sora, this is the AI video that will blow your mind – and maybe even scare you

The development of humanoid robots has been moving at a snail’s pace for the better part of two decades, but it’s accelerating rapidly thanks to a collaboration between Figure AI and OpenAI, resulting in the most stunning piece of real humanoid robot video I’ve ever seen.

On Wednesday, robotics startup Figure AI released a video update (see below) of its Figure 01 robot running a new Visual Language Model (VLM) that has somehow transformed the bot from a rather uninteresting automaton into a full-fledged sci-fi bot approaching C-3PO-level capabilities.

In the video, Figure 01 stands behind a table set with a plate, an apple, and a cup. On the left is a dish drainer. A man stands in front of the robot and asks it: “Figure 01, what do you see now?”

After a few seconds, Figure 01 responds in a remarkably human-sounding voice (there is no face, just an animated light moving in sync with the voice), describing everything on the table and the details of the man standing in front of it.

“That’s cool,” I thought.

Then the man asks, “Hey, can I eat something?”

Figure 01 responds, “Sure,” and then, in a deft, fluid motion, picks up the apple and hands it to the man.

“Wow,” I thought.

The man then empties some crumpled trash from a bin in front of Figure 01 while asking, “Can you explain why you did what you just did while collecting this trash?”

Figure 01 wastes no time explaining its reasoning as it places the paper back in the bin: “So I gave you the apple because it’s the only edible item I could give you from the table.”

I thought, “This can’t be real.”

And yet it is, at least according to Figure AI.

Speech-to-speech

The company explained in a release that Figure 01 engages in “speech-to-speech” reasoning using OpenAI’s pre-trained multimodal model, VLM, to understand images and text, and relies on the entire voice conversation to formulate its responses. This is different from, for example, OpenAI’s GPT-4, which focuses on written instructions.
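To make that description concrete, here is a minimal, purely illustrative sketch of what such a speech-to-speech loop could look like. Every function in it (transcribe, multimodal_reason, synthesize_speech) is a hypothetical stub of my own, not Figure AI’s or OpenAI’s actual API.

```python
# Illustrative sketch of a speech-to-speech reasoning turn, as described above.
# All functions are hypothetical stubs with canned values, not real APIs.

def transcribe(audio: bytes) -> str:
    """Stub speech-to-text: pretend we heard the question from the video."""
    return "Hey, can I eat something?"

def multimodal_reason(image: bytes, history: list[dict]) -> tuple[str, str]:
    """Stub VLM call: reason over the camera image plus the whole conversation,
    returning a spoken reply and a high-level action for the body."""
    return ("Sure.", "pick_up(apple); hand_to(person)")

def synthesize_speech(text: str) -> bytes:
    """Stub text-to-speech."""
    return text.encode("utf-8")

def speech_to_speech_turn(audio_in: bytes, camera_image: bytes,
                          history: list[dict]) -> tuple[bytes, str]:
    """One conversational turn: hear, see, reason, then speak and act."""
    user_text = transcribe(audio_in)
    history.append({"role": "user", "text": user_text})

    reply_text, action = multimodal_reason(camera_image, history)
    history.append({"role": "robot", "text": reply_text})

    return synthesize_speech(reply_text), action

if __name__ == "__main__":
    audio, action = speech_to_speech_turn(b"...", b"...", [])
    print(audio.decode(), "| action:", action)
```

The point of the sketch is simply that the model’s input includes what the robot sees and the full spoken exchange, and its output includes something the body can act on, not just words.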

It also uses what the company calls “low-level learned bimanual manipulation.” The system links precise image calibrations (down to the pixel level) to its neural network to control movements. “These networks record on-board images at 10 Hz and generate 24-DOF actions (wrist postures and finger joint angles) at 200 Hz,” Figure AI wrote in a release.
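Those two numbers imply a two-rate loop: perception updates ten times a second, while the hands are commanded two hundred times a second, reusing the most recent image in between. Here is a rough, self-contained sketch of that structure, with dummy stand-ins for the camera and the learned policy; none of this is Figure AI’s actual code.

```python
import time

# Illustrative two-rate control loop matching the figures quoted above:
# onboard images at 10 Hz, 24-DOF actions (wrist poses plus finger joint
# angles) at 200 Hz. The policy below is a dummy stand-in for the learned
# manipulation network.

IMAGE_HZ = 10      # perception rate
ACTION_HZ = 200    # low-level action rate
DOF = 24           # wrist poses + finger joint angles

def capture_image() -> list[float]:
    """Stub camera read."""
    return [0.0] * 64

def policy(image: list[float]) -> list[float]:
    """Stub learned policy: maps the latest image to a 24-DOF action."""
    return [0.0] * DOF

def control_loop(duration_s: float = 0.1) -> None:
    latest_image = capture_image()
    for step in range(int(duration_s * ACTION_HZ)):
        # Refresh the image every ACTION_HZ / IMAGE_HZ = 20 action steps.
        if step % (ACTION_HZ // IMAGE_HZ) == 0:
            latest_image = capture_image()
        action = policy(latest_image)   # 24 joint targets per tick
        # send_to_actuators(action)     # hardware call omitted in this sketch
        time.sleep(1.0 / ACTION_HZ)

if __name__ == "__main__":
    control_loop()
```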

The company claims that all behavior in the video is based on system learning and is not teleoperated, meaning no one is puppeteering Figure 01 behind the scenes.

Without seeing Figure 01 in person and asking my own questions, it is difficult to verify these claims. It’s possible that this isn’t the first time Figure 01 has gone through this routine. It could have been the 100th time, which could explain the speed and fluidity.

Or maybe this is 100% real and in that case: wow. Just wow.
