Embodied AI

Best Papers on Reasoning in Embodied AI? A Reading List

Dive into the future of robotics with our curated reading list of the best research papers on reasoning in embodied AI. From foundational works to LLM-powered agents.

Dr. Alistair Finch

Senior AI researcher focusing on robot learning and human-robot interaction.

7 min read

For years, the dream of embodied AI felt like a two-part problem: we needed robots that could see the world and robots that could move within it. We poured immense effort into perception systems and motor control, and the results have been incredible. But a crucial piece was always waiting in the wings: how do we get robots to think?

We’re not just talking about executing a pre-programmed sequence. We’re talking about genuine reasoning—the ability to take a vague command like, "I’m cleaning up, can you help?", break it down into logical steps, understand the context of the environment, and adapt when things go wrong. This is the frontier of embodied intelligence, where perception and action are guided by cognition.

The field is moving at a breakneck pace, largely thanks to the seismic impact of large language and vision-language models. If you're looking to get up to speed or just want a curated list of the most pivotal research, you've come to the right place. This isn't an exhaustive library, but a reading list designed to walk you through the key ideas that are shaping the future of intelligent robots.

The Leap from Perception to Cognition

Early robotic systems were often reactive. They operated on tight perception-action loops: see a cup, execute the "grasp cup" policy. This is powerful but brittle. It fails when faced with ambiguity or tasks that require multiple, dependent steps. What if the cup is behind a book? What if the user says, "Get me something to drink from" instead?

To solve this, a robot needs to plan. It needs a mental model of the world and the ability to reason about cause and effect. The papers that follow represent the community's most exciting attempts to build this cognitive layer.

Laying the Groundwork: Foundational Ideas

Before we could have robots that reason with language, we needed to bridge the gap between abstract concepts (words, sentences) and the physical world (pixels, forces).

The Vision-Language Revolution (e.g., CLIP)

While not an embodied AI paper itself, OpenAI's CLIP (Contrastive Language-Image Pre-Training) from 2021 was a watershed moment. The paper demonstrated how to train a model on a massive dataset of images and their captions from the internet to learn a shared representation for both modalities.

For the first time, a model could understand that the pixels forming a picture of an apple were semantically close to the words "red fruit" or "a Granny Smith apple." This ability to connect open-ended language to visual concepts became a fundamental building block for almost all modern embodied reasoning systems. It gave robots a way to “see” the world through the lens of human language.
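To make that concrete, here is a minimal sketch of CLIP-style zero-shot matching, assuming the Hugging Face transformers wrappers and a standard public checkpoint; the image file and candidate captions are just placeholders:

```python
# A minimal sketch of CLIP-style zero-shot matching (model name and exact API
# follow common Hugging Face usage; the file and captions are placeholders).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")  # e.g., a photo of an apple on a counter
candidates = ["a red fruit", "a granny smith apple", "a coffee mug"]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-text similarity -> probabilities

for caption, p in zip(candidates, probs[0].tolist()):
    print(f"{caption}: {p:.2f}")
```

Because the image and every caption live in one shared embedding space, the same trick lets a robot relate open-ended language to whatever its camera is looking at.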

Grounding Language in What's Possible

A core challenge has always been affordance grounding. An affordance is what the environment offers an agent. A chair affords sitting; a knob affords turning. For a robot, a language command is useless unless it can be grounded in the affordances of its environment and its own physical capabilities. Telling a robot to “pick up the water” is pointless if it can only grasp solid objects. Early research focused on learning these affordances from scratch, which was slow and data-intensive. The revolution came when we found a way to let large models do the high-level reasoning, and use other systems to handle the grounding.
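As a toy illustration of what "grounding in affordances" means in practice (every name here is hypothetical), imagine filtering language-suggested actions by what the object and the robot's own skill set actually allow:

```python
# A toy illustration (all names hypothetical): an action is only worth
# considering if the object affords it AND the robot has the skill for it.
AFFORDANCES = {
    "apple":  {"grasp", "place"},
    "water":  {"pour"},           # a liquid: it cannot be grasped directly
    "drawer": {"open", "close"},
}

def is_feasible(action: str, target: str, robot_skills: set[str]) -> bool:
    """Check the command against both the object's affordances and the robot's skills."""
    return action in AFFORDANCES.get(target, set()) and action in robot_skills

robot_skills = {"grasp", "place", "open"}
print(is_feasible("grasp", "apple", robot_skills))  # True
print(is_feasible("grasp", "water", robot_skills))  # False: "pick up the water" is ungrounded
```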

The Main Event: When Large Models Met Robots

This is where things get really exciting. Researchers realized that the symbolic reasoning and world knowledge baked into Large Language Models (LLMs) could serve as a powerful planning engine.

SayCan: Grounding Language in Affordances

Perhaps the most iconic paper in this subfield is Google's 2022 work, "Do As I Can, Not As I Say." The insight behind SayCan is beautifully simple. When a robot is given a high-level instruction, like "I just worked out, can you bring me a healthy snack and a drink?", the system does two things:

  1. The "Say" part: An LLM (like GPT-3 or PaLM) is prompted to break down the goal into a list of possible next steps. For our example, it might suggest: "1. Go to the counter," "2. Pick up the apple," "3. Find a water bottle," "4. Bring the user a chocolate bar."
  2. The "Can" part: A set of pre-trained, low-level value functions (the affordance models) score each of these suggestions based on how likely the robot is to succeed at that action right now, in its current state. If there's no apple in sight, the score for "pick up the apple" will be very low.

By multiplying the probabilities from the LLM (what's useful) and the value functions (what's possible), the robot selects the best, most feasible next action. SayCan elegantly fused the common-sense knowledge of LLMs with the real-world grounding of robotic policies.
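Here is a schematic sketch of that scoring loop, not the paper's actual code: llm_log_prob and value_fn stand in for the LLM scoring and the pre-trained affordance/value functions.

```python
# A schematic sketch of SayCan-style action selection (not the paper's code):
# combine the LLM's usefulness score with an affordance/value score per skill.
import math

def select_next_skill(instruction, history, skills, llm_log_prob, value_fn, state):
    """Pick the skill maximizing p_LLM(skill | instruction, history) * value(skill, state)."""
    best_skill, best_score = None, -math.inf
    for skill in skills:
        say = math.exp(llm_log_prob(instruction, history, skill))  # "is it useful?"
        can = value_fn(skill, state)                                # "can I do it right now?"
        score = say * can
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill
```

The appeal of the design is that neither component has to be perfect: the LLM never needs to know where the apple is, and the value functions never need to understand what "healthy snack" means.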

PaLM-E: Weaving a World of Words and Pixels

While SayCan used an LLM as a separate “planner,” Google's PaLM-E (Embodied Multimodal Language Model), published in 2023, took a more integrated approach. The key idea was to treat robot sensor data—like images and joint states—as just another part of the language model's input.

Imagine a sentence: "The user asked me to [IMAGE_01] pick up the blue block. I should..." PaLM-E was trained to understand these “multimodal sentences,” where special tokens represent rich, continuous sensor data. By injecting the state of the world directly into the model's context, PaLM-E could generate textual plans that were inherently grounded in the here and now. It wasn't just a planner; it was an embodied reasoner, blurring the lines between language, vision, and action.
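A conceptual sketch of how such a multimodal sentence might be assembled is below; the encoder, projector, and tensor shapes are all assumptions for illustration, not PaLM-E's actual implementation.

```python
# A conceptual sketch of a PaLM-E-style "multimodal sentence" (shapes and modules
# are assumptions): sensor observations are projected into the LLM's token-embedding
# space and spliced into the text sequence wherever a placeholder token appears.
import torch

def build_multimodal_sequence(text_tokens, text_embed, image_features, projector,
                              image_token_id):
    """Replace each image placeholder token with projected image embeddings."""
    pieces = []
    img_iter = iter(image_features)              # one feature tensor per placeholder
    for tok in text_tokens:
        if tok == image_token_id:
            feats = next(img_iter)               # (num_patches, vision_dim)
            pieces.append(projector(feats))      # project to (num_patches, llm_dim)
        else:
            pieces.append(text_embed(torch.tensor([tok])))  # (1, llm_dim)
    return torch.cat(pieces, dim=0)              # the LLM consumes this like any prompt
```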

RT-2: Vision-Language Models as Robotic Brains

The next logical step was to close the loop entirely. If a model can read images and output text, why can't it output robot actions directly? That's the premise behind Google DeepMind's RT-2 (Robotics Transformer 2).

RT-2 is a Vision-Language-Action (VLA) model. It takes in an image and a text command, and its output is not a sentence, but the literal motor commands for the robot, emitted as a string of discretized action tokens (a termination flag, end-effector position and rotation deltas, and a gripper command). This is achieved by co-fine-tuning a powerful Vision-Language Model (VLM) on both internet-scale text/image data and robotic trajectory data. The result is stunning: the model exhibits emergent capabilities. Because it has learned concepts like “a container for trash” from the web, it can correctly respond to a command like “throw away this banana peel” even if it has never been explicitly trained on that specific task during its robotics training. It's a powerful demonstration of knowledge transfer from the web to the physical world.
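To give a feel for that action interface, here is a toy de-tokenizer in the spirit of RT-2's discretized actions; the bin count and the ranges below are illustrative assumptions, not the paper's exact values.

```python
# A toy sketch of an RT-2-style action interface (bin count and ranges are
# assumptions): the VLA model emits discrete tokens, which are de-tokenized
# into a continuous end-effector command for the low-level controller.
import numpy as np

NUM_BINS = 256  # each action dimension is discretized into this many bins
# [terminate, dx, dy, dz, droll, dpitch, dyaw, gripper]
LOW  = np.array([0, -0.1, -0.1, -0.1, -np.pi, -np.pi, -np.pi, 0.0])
HIGH = np.array([1,  0.1,  0.1,  0.1,  np.pi,  np.pi,  np.pi, 1.0])

def detokenize_action(action_tokens: list[int]) -> np.ndarray:
    """Map 8 discrete tokens back to a continuous action vector."""
    bins = np.array(action_tokens, dtype=np.float64)
    return LOW + (bins / (NUM_BINS - 1)) * (HIGH - LOW)

# e.g., tokens produced by the model for one control step
print(detokenize_action([0, 130, 120, 128, 128, 128, 128, 255]))
```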

The Frontier: Self-Correction and Long-Horizon Tasks

Following a single command is one thing. True intelligence requires persistence, error correction, and the ability to handle tasks that take dozens of steps.

Inner Monologue: Teaching Robots to Think Out Loud

What happens when a plan goes wrong? The 2022 paper "Inner Monologue: Embodied Reasoning through Planning with Language Models" tackles this head-on. The system uses an LLM not just to create an initial plan, but to actively reason about its execution.

The robot provides a continuous stream of feedback to the LLM: what it sees, what actions it just took, and whether they succeeded. The LLM then acts like an internal narrator, or “inner monologue.” If it sees that an action failed (e.g., “I tried to pick up the sponge, but my gripper missed”), it can update the plan on the fly (“Okay, I will try to grasp the sponge from a different angle”). This ability to self-correct using natural language feedback is a monumental step towards building robust, autonomous agents that don't just give up at the first sign of trouble.
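A schematic version of that loop might look like the sketch below; llm_propose_step, execute, and describe_scene are hypothetical stand-ins for the LLM, the low-level skills, and whatever perception module narrates the scene.

```python
# A schematic closed-loop "inner monologue" (the callables are hypothetical
# stand-ins): the LLM reads a running transcript of actions, outcomes, and scene
# descriptions, and proposes the next step, replanning when something fails.
def run_task(goal, llm_propose_step, execute, describe_scene, max_steps=20):
    transcript = [f"Human: {goal}", f"Scene: {describe_scene()}"]
    for _ in range(max_steps):
        step = llm_propose_step("\n".join(transcript))   # e.g., "pick up the sponge"
        if step == "done":
            break
        success = execute(step)                          # low-level skill execution
        outcome = "Success." if success else "Failed."
        transcript.append(f"Robot: {step} -> {outcome}")
        transcript.append(f"Scene: {describe_scene()}")  # fresh feedback for the LLM
    return transcript
```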

OK-Robot: Learning in the Wild

A final, crucial direction is moving beyond the lab. The OK-Robot project (Open-Knowledge Robot) explores how robots can carry out long-horizon, everyday tasks in real human environments like offices and homes. The key is stitching together pre-trained, open-knowledge components: vision-language models for finding objects, navigation over a map of the space, and pre-trained grasping models for manipulation. By pre-exploring a new environment and building a semantic map of objects and locations, the robot can then use the vision-language models we've discussed to look up targets and carry out commands like "bring me a snack from the kitchen." This line of work is critical for closing the gap between lab demos and real homes, and for creating robots that are genuinely useful in our own spaces.
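Here is a toy sketch of that "pre-explore, then query" pattern; the SemanticMap class and the embed function are hypothetical, but they capture the idea of retrieving a stored location by language similarity and handing it to navigation and grasping skills.

```python
# A toy sketch of the "pre-explore, then query" pattern (data structures and the
# embed() function are hypothetical): detected objects are stored with their map
# coordinates during a scan, then retrieved later by language similarity.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SemanticMap:
    def __init__(self, embed):
        self.embed = embed           # text/image -> vector, e.g. a CLIP-style encoder
        self.entries = []            # (label_embedding, (x, y, z)) pairs

    def add(self, label: str, position):
        self.entries.append((self.embed(label), position))

    def locate(self, query: str):
        """Return the stored position whose label best matches the query."""
        q = self.embed(query)
        return max(self.entries, key=lambda e: cosine(e[0], q))[1]

# e.g.: goal = semantic_map.locate("a snack from the kitchen"); navigate(goal); grasp()
```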

Where Do We Go From Here?

The trajectory is clear. We're moving from modular systems with separate planners and controllers towards unified, end-to-end models that take in raw sensor data and output motor commands. The knowledge distilled in massive web-scale models is giving robots an unprecedented level of common-sense reasoning and generalization.

Of course, huge challenges remain. Real-time performance, ensuring safety and predictability, and the insatiable need for high-quality robotic data are all active areas of research. But by building on the ideas in these papers, the field is steadily moving from robots that simply do to robots that truly understand. The next time you see a robot pause, it might not be stuck—it might just be thinking.
