VLA Architecture Gains Momentum in Intelligent Driving, as Industry Debate Shifts from 'Architecture Wars' to Capability Convergence
The rapid pace of development in China’s intelligent vehicle industry is reshaping how core technologies are perceived. A framework once described as a “next-generation star” has, in less than a year, been labeled by some critics as a “simplified architecture.” That framework is VLA (Vision-Language-Action), now one of the most discussed approaches in assisted and automated driving.

From Robotics to Automobiles
The concept of VLA entered public discussion in July 2023, following DeepMind’s release of the RT-2 model for robotic control. Within just a few months, early autonomous driving developers adapted the VLA concept—originally designed for embodied intelligence—into the automotive domain, attracted by its potential to map raw perception directly to driving actions.
By 2025, multiple assisted-driving systems based on VLA principles had entered real-world deployment. VLA has since become one of the mainstream technical routes, though not the only one.

World Models and VLA: Less Different Than They Appear
At first glance, the two approaches—World Models and VLA—seem fundamentally opposed. World models emphasize reconstructing a digital replica of the physical environment, while VLA highlights end-to-end perception-to-action learning.
However, closer inspection reveals that both are, at their core, engineering implementations of the same paradigm: neural networks combined with reinforcement learning.
The difference lies more in emphasis—world models focus on explicit environment reconstruction, while VLA emphasizes action generation—but the underlying mechanics are remarkably similar.
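The shared paradigm can be made concrete with a toy sketch. The snippet below is illustrative only, not any vendor's actual code: all names (`Observation`, `world_model_step`, `vla_policy_step`) and the toy arithmetic inside them are hypothetical stand-ins. It shows how both routes reduce to learned functions over the same perception input, differing only in what the network is asked to output: a predicted environment state versus a direct driving action.

```python
# Hypothetical sketch: world model vs. VLA as two heads over the same input.
# In practice both would be trained neural networks; simple arithmetic
# stands in for the learned functions here.
from dataclasses import dataclass
from typing import List, Dict


@dataclass
class Observation:
    camera_features: List[float]  # stand-in for encoded camera input
    instruction: str              # e.g. a navigation hint or spoken command


def world_model_step(obs: Observation) -> List[float]:
    """World-model route: explicitly predict the next environment state,
    then plan against that reconstruction (planning step omitted)."""
    return [x * 0.9 for x in obs.camera_features]  # toy dynamics prediction


def vla_policy_step(obs: Observation) -> Dict[str, float]:
    """VLA route: map perception plus language directly to an action."""
    steer = sum(obs.camera_features) / max(len(obs.camera_features), 1)
    accel = -0.1 if "slow" in obs.instruction else 0.1
    return {"steer": steer, "accel": accel}
```

Both functions consume the same `Observation`; swapping one for the other changes the output head, not the underlying learn-from-reward machinery, which is the convergence the article describes.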

Practical Deployment: Li Auto’s VLA Driver Model
Among automakers, Li Auto is widely recognized as the first to deploy a VLA-based driver model at scale. Since its initial full rollout, the system has already undergone multiple iterations, with recent updates delivered via OTA 8.1.
According to real-world driving data, the VLA driver model demonstrates smoother motion control and more human-like driving logic. This improvement stems from several technical leaps:
- Scalability: Near-doubling of activated model parameters to around 4 billion.
- Performance: Trajectory output frequency increased to 10 Hz, significantly reducing latency.
- Reasoning: Stronger 3D spatial reasoning in traffic “negotiation” scenarios.
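The 10 Hz figure implies a hard latency budget: each planning cycle must produce a trajectory within 100 ms. A minimal sketch of such a fixed-rate loop is below; it assumes nothing about Li Auto's actual implementation, and `plan_trajectory` is a hypothetical placeholder for the model's trajectory head.

```python
# Illustrative fixed-rate control loop: emit one trajectory per 100 ms tick.
# plan_trajectory is a toy stand-in for a model's trajectory output head.
import time
from typing import List, Tuple

OUTPUT_HZ = 10                 # trajectory output frequency cited above
CYCLE_BUDGET_S = 1.0 / OUTPUT_HZ  # 100 ms per planning cycle


def plan_trajectory(tick: int) -> List[Tuple[float, float]]:
    """Toy placeholder: five (x, y) waypoints ahead of the vehicle."""
    return [(tick + 0.1 * i, 0.0) for i in range(5)]


def control_loop(cycles: int) -> int:
    """Run the planner at a fixed rate; return trajectories produced."""
    produced = 0
    next_deadline = time.monotonic()
    for tick in range(cycles):
        trajectory = plan_trajectory(tick)
        assert len(trajectory) == 5
        produced += 1
        # Sleep off whatever is left of this cycle's 100 ms budget.
        next_deadline += CYCLE_BUDGET_S
        remaining = next_deadline - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)
    return produced
```

Doubling the output rate halves this budget, which is why raising the frequency and growing the activated parameters (to around 4 billion) pull in opposite directions and both count as engineering leaps.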

Beyond Assisted Driving: Toward AI Agents
More advanced applications highlight VLA’s longer-term potential. In semi-closed environments such as industrial parks, the system can infer user intent without explicit navigation input, relying on semantic reasoning and long-term memory.
These capabilities point toward VLA evolving into an AI agent rather than a narrowly defined driving function—capable of learning, remembering, and adapting strategies based on changing conditions.

Convergence, Not Replacement
Industry observers increasingly argue that the future of assisted driving may not depend on replacing one architecture with another, but on deep optimization of existing frameworks.
VLA and world models appear to be converging toward a shared goal: scalable, generalizable intelligence for driving. The debate is gradually shifting away from “which architecture wins” toward how quickly real-world performance can improve under practical constraints.
