The Allen Institute for AI (AI2) has released MolmoAct 7B, an open-source robotics AI model that enables robots to “reason in space” and “think” in three dimensions. The Action Reasoning Model challenges existing offerings from tech giants like Nvidia and Google by giving robots stronger spatial understanding, achieving a 72.1% task success rate in benchmark tests and outperforming models from Google, Microsoft, and Nvidia.
What makes it different: MolmoAct represents a significant departure from traditional vision-language-action (VLA) models by incorporating genuine 3D spatial reasoning capabilities.
- “MolmoAct has reasoning in 3D space capabilities versus traditional vision-language-action (VLA) models,” AI2 explained, noting that “most robotics models are VLAs that don’t think or reason in space, but MolmoAct has this capability, making it more performant and generalizable from an architectural standpoint.”
- The model is based on AI2’s open-source Molmo foundation and comes with an Apache 2.0 license, while its training datasets are licensed under CC BY-4.0.
How it works: MolmoAct processes physical environments through a sophisticated multi-step approach that translates visual input into actionable robotic movements.
- The model outputs “spatially grounded perception tokens” using a vector-quantized variational autoencoder—a system that converts data inputs like video into specialized tokens that differ from traditional text-based VLA inputs.
- These tokens enable the model to estimate distances between objects and predict sequences of “image-space” waypoints to create navigation paths.
- After establishing waypoints, MolmoAct outputs specific physical actions, such as dropping an arm by a few inches or stretching it out toward an object (see the sketch after this list).
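To make the flow in the bullets above concrete, here is a minimal, illustrative PyTorch sketch of the described pipeline: image → quantized perception tokens → image-space waypoints → incremental actions. The class names, layer sizes, and codebook size are assumptions for illustration, not AI2’s released architecture.

```python
# Hypothetical sketch of the three-stage flow described above:
# image -> spatially grounded perception tokens (VQ-style quantization)
#       -> image-space waypoints -> low-level robot actions.
# All module names, shapes, and codebook sizes are illustrative assumptions.
import torch
import torch.nn as nn


class PerceptionTokenizer(nn.Module):
    """Quantizes image features against a learned codebook (VQ-VAE style)."""

    def __init__(self, feat_dim=64, codebook_size=512):
        super().__init__()
        self.encoder = nn.Sequential(  # toy CNN encoder for a 3x128x128 frame
            nn.Conv2d(3, 32, 4, stride=4), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 4, stride=4),
        )
        self.codebook = nn.Embedding(codebook_size, feat_dim)

    def forward(self, image):
        feats = self.encoder(image)                      # (B, D, H', W')
        b, d, h, w = feats.shape
        flat = feats.permute(0, 2, 3, 1).reshape(-1, d)  # (B*H'*W', D)
        # Nearest codebook entry per spatial location -> discrete token ids.
        dists = torch.cdist(flat, self.codebook.weight)  # (N, codebook_size)
        return dists.argmin(dim=-1).view(b, h * w)       # "perception tokens"


class WaypointPlanner(nn.Module):
    """Maps perception tokens to K waypoints in normalized image space."""

    def __init__(self, codebook_size=512, num_waypoints=8):
        super().__init__()
        self.num_waypoints = num_waypoints
        self.embed = nn.Embedding(codebook_size, 64)
        self.head = nn.Linear(64, num_waypoints * 2)

    def forward(self, tokens):
        pooled = self.embed(tokens).mean(dim=1)          # crude token pooling
        coords = torch.sigmoid(self.head(pooled))        # (x, y) pairs in [0, 1]
        return coords.view(-1, self.num_waypoints, 2)


def waypoints_to_actions(waypoints):
    """Turns consecutive waypoints into incremental end-effector deltas."""
    return waypoints[:, 1:] - waypoints[:, :-1]          # per-step (dx, dy)


if __name__ == "__main__":
    image = torch.rand(1, 3, 128, 128)                   # stand-in camera frame
    tokens = PerceptionTokenizer()(image)
    waypoints = WaypointPlanner()(tokens)
    actions = waypoints_to_actions(waypoints)
    print(tokens.shape, waypoints.shape, actions.shape)
```

Keeping perception, waypoint planning, and action output as explicit stages mirrors what distinguishes this “action reasoning” flow from VLAs that map pixels and text straight to motor commands; the sketch only reproduces the data flow the article describes, not how the released model implements it.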
Real-world applications: AI2 positions MolmoAct as particularly valuable for home robotics, where environments are irregular and constantly changing.
- “MolmoAct could be applied anywhere a machine would need to reason about its physical surroundings,” the company said, adding that “we think about it mainly in a home setting because that’s where the greatest challenge lies for robotics.”
- Researchers demonstrated that the model can adapt to different robot embodiments—whether mechanical arms or humanoid robots—with minimal fine-tuning required.
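The article does not detail AI2’s fine-tuning recipe. As a rough, hedged illustration of what cross-embodiment adaptation can look like, the sketch below freezes a pretrained backbone and trains only a small action head on a handful of demonstrations from the new robot; every module name, dimension, and hyperparameter here is an assumption, not AI2’s published procedure.

```python
# Illustrative sketch of adapting a pretrained policy to a new robot embodiment
# by fine-tuning a small action head on a few demonstrations.
# `load_pretrained_backbone` and the data format are hypothetical placeholders.
import torch
import torch.nn as nn


def load_pretrained_backbone(feat_dim=256):
    """Stand-in for a frozen, pretrained perception/reasoning backbone."""
    backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, feat_dim), nn.ReLU())
    for p in backbone.parameters():
        p.requires_grad = False                       # keep the backbone frozen
    return backbone


class EmbodimentHead(nn.Module):
    """Small trainable head mapping shared features to this robot's action space."""

    def __init__(self, feat_dim=256, action_dim=7):   # e.g. a 7-DoF arm (assumption)
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, action_dim)
        )

    def forward(self, feats):
        return self.net(feats)


if __name__ == "__main__":
    backbone, head = load_pretrained_backbone(), EmbodimentHead()
    optim = torch.optim.Adam(head.parameters(), lr=1e-4)

    # Toy stand-in for a few teleoperated demonstrations on the new robot.
    frames = torch.rand(32, 3, 64, 64)
    expert_actions = torch.rand(32, 7)

    for step in range(100):                           # brief behavior-cloning loop
        pred = head(backbone(frames))
        loss = nn.functional.mse_loss(pred, expert_actions)
        optim.zero_grad()
        loss.backward()
        optim.step()
    print(f"final imitation loss: {loss.item():.4f}")
```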
What experts are saying: Industry professionals view AI2’s research as a meaningful step forward in physical AI development, though they emphasize room for continued improvement.
- Alan Fern, professor at Oregon State University College of Engineering, described the research as “a natural progression in enhancing VLMs for robotics and physical reasoning” while noting that “these benchmarks still fall short of capturing real-world complexity and remain relatively controlled and toyish in nature.”
- Daniel Maturana, co-founder of startup Gather AI, praised the model’s accessibility: “This is great news because developing and training these models is expensive, so this is a strong foundation to build on and fine-tune for other academic labs and even for dedicated hobbyists.”
Competitive landscape: MolmoAct enters a rapidly expanding physical AI market where major tech companies are racing to merge large language models with robotics capabilities.
- Google Research’s SayCan helps robots reason about tasks using LLMs to determine movement sequences, while Meta and New York University’s OK-Robot uses visual language models for movement planning and object manipulation.
- Nvidia, which has proclaimed physical AI as the next major trend, recently released several models including Cosmos-Transfer1 to accelerate robotic training.
- Hugging Face has democratized access with a $299 desktop robot designed to make robotics development more accessible.
Why this matters: The shift from manually coding robot movements to LLM-based decision-making represents a fundamental change in how robots interact with their physical environments.
- Before LLMs, scientists had to program every single robotic movement, creating significant work overhead and limiting flexibility in possible actions.
- LLM-based methods now allow robots to determine subsequent actions based on objects they interact with, moving the industry closer to achieving general physical intelligence.
- As Fern noted, “large physical intelligence models are still in their early stages and are much more ripe for rapid advancements, which makes this space particularly exciting.”