VLA Architecture Gains Momentum in Intelligent Driving, as Industry Debate Shifts from 'Architecture Wars' to Capability Convergence

The rapid pace of development in China’s intelligent vehicle industry is reshaping how core technologies are perceived. A framework once described as a “next-generation star” has, in less than a year, been labeled by some critics as a “simplified architecture.” That framework is VLA (Vision-Language-Action), now one of the most discussed approaches in assisted and automated driving.

[Image: VLA architecture concept visualization]

From Robotics to Automobiles

The concept of VLA entered public discussion in July 2023, when Google DeepMind released the RT-2 model for robotic control. Within a few months, early autonomous driving developers had adapted the approach, originally designed for embodied intelligence, to the automotive domain, drawn by its potential to map raw perception directly to driving actions.

By 2025, multiple assisted-driving systems based on VLA principles had entered real-world deployment. VLA has since become one of the mainstream technical routes, though not the only one.

[Image: DeepMind RT-2, from robotics to automotive]

World Models and VLA: Less Different Than They Appear

At first glance, the two approaches, world models and VLA, seem fundamentally opposed. World models emphasize reconstructing a digital replica of the physical environment, while VLA highlights end-to-end perception-to-action learning.

However, closer inspection reveals that both are, at their core, engineering implementations of the same paradigm: neural networks combined with reinforcement learning.

The difference lies in emphasis: world models foreground explicit environment reconstruction, while VLA foregrounds action generation. The underlying mechanics, however, are remarkably similar.
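
To make the comparison concrete, here is a minimal sketch of the two routes reduced to their skeletons, written in PyTorch-style Python with hypothetical module and dimension choices. The world-model policy acts on an explicitly reconstructed and imagined latent state; the VLA policy maps fused vision-language tokens straight to an action. Neither reflects any production stack, and both would be trained with the same neural-network-plus-reinforcement-learning machinery described above.

```python
import torch
import torch.nn as nn


class WorldModelPolicy(nn.Module):
    """World-model route: encode observations into an explicit latent state,
    imagine the next state, then act on the imagined state."""

    def __init__(self, obs_dim=512, latent_dim=256, action_dim=3):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, latent_dim)       # reconstruct environment state
        self.dynamics = nn.GRUCell(action_dim, latent_dim)  # predict the next latent state
        self.policy = nn.Linear(latent_dim, action_dim)     # act on the prediction

    def forward(self, obs, prev_action):
        z = torch.tanh(self.encoder(obs))
        z_next = self.dynamics(prev_action, z)  # one step of "imagination"
        return self.policy(z_next)


class VLAPolicy(nn.Module):
    """VLA route: fuse vision and language tokens, map them straight to an action."""

    def __init__(self, token_dim=256, action_dim=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(token_dim, action_dim)

    def forward(self, vision_tokens, language_tokens):
        tokens = torch.cat([vision_tokens, language_tokens], dim=1)
        fused = self.backbone(tokens)               # joint vision-language representation
        return self.action_head(fused.mean(dim=1)) # pooled features -> driving action
```

The intermediate structures differ (an explicit latent state versus fused tokens), but both pipelines end at the same kind of action head, which is the sense in which the two paradigms converge.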

[Image: Comparison between the world-model and VLA paradigms]

Practical Deployment: Li Auto’s VLA Driver Model

Among automakers, Li Auto is widely recognized as the first to deploy a VLA-based driver model at scale. Since its initial full rollout, the system has already undergone multiple iterations, with recent updates delivered via OTA 8.1.

According to real-world driving data, the VLA driver model demonstrates smoother motion control and more human-like driving logic. This improvement stems from several technical leaps:

  • Scalability: activated model parameters nearly doubled, to around 4 billion.
  • Performance: trajectory output frequency raised to 10 Hz, significantly reducing control latency (see the sketch after this list).
  • Reasoning: stronger 3D spatial reasoning in traffic “negotiation” scenarios.
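
The latency claim is easiest to see in a fixed-rate planning loop: at 10 Hz a fresh trajectory is issued every 100 ms, so the active command is never more than one period stale. The sketch below shows the loop shape only; `model`, `get_observation`, and `send_trajectory` are hypothetical stand-ins, not Li Auto's actual interfaces.

```python
import time

PLAN_HZ = 10              # trajectory output frequency cited above
PERIOD = 1.0 / PLAN_HZ    # 100 ms budget per planning cycle


def planning_loop(model, get_observation, send_trajectory):
    """Emit a fresh trajectory every PERIOD seconds (hypothetical interfaces)."""
    next_deadline = time.monotonic()
    while True:
        obs = get_observation()          # latest sensor snapshot
        trajectory = model.plan(obs)     # inference must fit inside the 100 ms budget
        send_trajectory(trajectory)      # hand off to vehicle control
        next_deadline += PERIOD
        # Sleep off whatever budget remains; max() guards against overruns.
        time.sleep(max(0.0, next_deadline - time.monotonic()))
```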

[Image: Li Auto OTA 8.1 driver model interface]

Beyond Assisted Driving: Toward AI Agents

More advanced applications highlight VLA’s longer-term potential. In semi-closed environments such as industrial parks, the system can infer user intent without explicit navigation input, relying on semantic reasoning and long-term memory.

These capabilities point toward VLA evolving into an AI agent rather than a narrowly defined driving function: a system capable of learning, remembering, and adapting its strategies as conditions change.
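
As an illustration of the mechanism rather than Li Auto's implementation, the toy sketch below pairs a long-term memory with a trivial inference rule: remember where trips in a given semi-closed area tend to end, then propose the most frequent destination when no navigation input is given. All names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class IntentMemory:
    """Toy long-term memory: count where trips in a given area tend to end,
    then propose the most frequent destination. Illustrative names only."""

    visits: Counter = field(default_factory=Counter)

    def remember(self, area: str, destination: str) -> None:
        self.visits[(area, destination)] += 1

    def infer_destination(self, area: str):
        candidates = {dest: n for (a, dest), n in self.visits.items() if a == area}
        if not candidates:
            return None  # no history: fall back to explicit navigation input
        return max(candidates, key=candidates.get)


memory = IntentMemory()
memory.remember("industrial_park_A", "building_7_entrance")
memory.remember("industrial_park_A", "building_7_entrance")
memory.remember("industrial_park_A", "visitor_lot")
print(memory.infer_destination("industrial_park_A"))  # -> building_7_entrance
```

A production agent would replace the frequency count with semantic reasoning over richer context, but the loop of remembering, inferring, and acting is the same.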

[Image: VLA AI agent intent inference logic]

Convergence, Not Replacement

Industry observers increasingly argue that the future of assisted driving may not depend on replacing one architecture with another, but on deep optimization of existing frameworks.

VLA and world models appear to be converging toward a shared goal: scalable, generalizable intelligence for driving. The debate is gradually shifting away from “which architecture wins” toward how quickly real-world performance can improve under practical constraints.

[Image: Capability convergence in the intelligent vehicle industry]