“We do not react to the world. We predict it — and then correct.” — Kenneth Craik, The Nature of Explanation (1943)[1]
You’ve never seen a shoelace before. No instructions. No demo. You’re asked to untie one.
You’d figure it out in seconds, by feel. An LLM fails. It can write paragraphs about shoelaces: physics, etymology, technique. It has read every book about sailing; it has never felt the wind. A world model reaches down, feels the tension, and reasons about what happens when you pull. It doesn’t describe the world. It simulates it.
That difference is the most important idea in AI right now.
What Is a World Model?
A world model is an agent’s internal simulation of how reality works — not a database of facts about it, but a functional model of its dynamics. When you catch a ball, you don’t query a physics textbook. You predict where the ball will be in 0.4 seconds and move your hand there.
Craik proposed this in 1943.[1] Schmidhuber formalized it for neural networks in 1990.[2] It took until 2018 for the hardware, data, and techniques to catch up.
The Blueprint: V, M, C
In 2018, David Ha and Schmidhuber published “World Models”[3], a paper with an interactive demo where you could watch an AI agent dream. Three modules, working in tight concert.
V — Vision compresses raw pixel observations into a compact latent vector z using a Variational Autoencoder. Not the full image — the essential abstraction of it.
M — Memory takes z and the last action and predicts a probability distribution over the next z using an MDN-RNN. Not one prediction — a distribution over possible futures. Uncertainty modeled explicitly.
C — Controller is a tiny linear network. Takes z and M’s hidden state h. Outputs the action. It is not intelligent. The intelligence lives in what it receives.
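To make the data flow concrete, here is a minimal PyTorch sketch of the three modules. The class names, layer sizes, and default dimensions are illustrative choices for this post, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class Vision(nn.Module):
    """V: compress a raw observation into a compact latent z (simplified VAE encoder)."""
    def __init__(self, z_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.mu = nn.LazyLinear(z_dim)       # infers its input size on the first forward pass
        self.logvar = nn.LazyLinear(z_dim)

    def forward(self, obs):                  # obs: (batch, 3, H, W)
        h = self.conv(obs)
        mu, logvar = self.mu(h), self.logvar(h)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterized sample of z


class Memory(nn.Module):
    """M: given (z, action), output a mixture-of-Gaussians distribution over the next z (MDN-RNN)."""
    def __init__(self, z_dim=32, action_dim=3, hidden_dim=256, n_mix=5):
        super().__init__()
        self.rnn = nn.LSTMCell(z_dim + action_dim, hidden_dim)
        self.mdn = nn.Linear(hidden_dim, n_mix * (2 * z_dim + 1))  # means, log-stds, mixture logits

    def forward(self, z, action, state=None):
        h, c = self.rnn(torch.cat([z, action], dim=-1), state)
        return self.mdn(h), (h, c)


class Controller(nn.Module):
    """C: a tiny linear policy over [z, h]. The intelligence lives in what it receives."""
    def __init__(self, z_dim=32, hidden_dim=256, action_dim=3):
        super().__init__()
        self.fc = nn.Linear(z_dim + hidden_dim, action_dim)

    def forward(self, z, h):
        return torch.tanh(self.fc(torch.cat([z, h], dim=-1)))
```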
The key move: C never trained in the real environment. It trained entirely inside M’s hallucinated dream rollouts, then transferred to reality. Car Racing. VizDoom. It worked.[3]
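In outline, the dream-training loop looks like the sketch below: the controller only ever sees latents produced by M, and only the controller’s parameters are optimized (the paper used CMA-ES for that outer search). The sample_next_z and dream_reward callables are hypothetical stand-ins for drawing the next latent from the MDN mixture and scoring it.

```python
import torch

def dream_rollout(memory, controller, sample_next_z, dream_reward, z0, steps=1000, hidden_dim=256):
    """Roll the controller forward entirely inside M's imagined latent dynamics.

    `sample_next_z` and `dream_reward` are hypothetical callables supplied by the caller:
    one draws the next latent from the MDN mixture, the other scores it. No real
    environment is touched anywhere in this loop.
    """
    z, state, total = z0, None, 0.0
    for _ in range(steps):
        h = state[0] if state is not None else torch.zeros(z0.shape[0], hidden_dim)
        action = controller(z, h)                     # C acts only on dreamed latents
        mdn_params, state = memory(z, action, state)
        z = sample_next_z(mdn_params)
        total += dream_reward(mdn_params)
    return total

# An evolutionary search (the paper used CMA-ES) tunes only the controller's parameters
# to maximize the return of these imagined rollouts before transfer to the real task.
```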
The lineage scaled. DreamerV3 (2025) mastered 150+ tasks with a single set of hyperparameters. DayDreamer put the same recipe on a real robot that learned to walk in an hour.[4] The ceiling wasn’t capability; it was generalization. You could scale the dream, but each dream stayed narrow.
Why LLMs Hit a Wall
Yann LeCun has been making the same argument since 2022,[5] sharpening it each year. Three pillars:
No grounding. Text is a lossy, secondary encoding of the world. “Fire” appears near “hot” in a corpus. The model has never felt heat. It cannot simulate what fire does to paper. The gap between map and territory isn’t an engineering problem — it’s a categorical one.
No planning. Planning requires simulating forward: if I take this action, what state will the world be in? Autoregressive token prediction doesn’t provide this. Chain-of-thought looks like reasoning; it’s still next-token prediction.[5] (A minimal sketch of what forward simulation buys you follows these three pillars.)
Scaling won’t fix it. A teenager learns to drive in about twenty hours: sensing inertia, feeling the brake. No LLM achieves this from text at any scale, because the information isn’t in the text. More tokens don’t move the grounding ceiling.[5]
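To make the second pillar concrete, here is a minimal sketch of planning by forward simulation, in a generic random-shooting style rather than anything LeCun specifically proposes. The world_model and reward_model callables are assumed learned components; the point is that the whole loop depends on a next-state function that an autoregressive token predictor does not expose.

```python
import torch

def plan_first_action(world_model, reward_model, state, horizon=10, n_candidates=256, action_dim=3):
    """Planning by forward simulation (generic random-shooting sketch).

    `world_model(state, action) -> next_state` and `reward_model(state) -> scalar` are
    assumed learned components supplied by the caller; they are not part of any real library.
    """
    best_return, best_action = -float("inf"), None
    for _ in range(n_candidates):
        actions = torch.rand(horizon, action_dim) * 2 - 1   # one imagined action sequence in [-1, 1]
        s, total = state, 0.0
        for a in actions:
            s = world_model(s, a)            # "if I take this action, what state will the world be in?"
            total += float(reward_model(s))
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action                       # execute the first action, then re-plan
```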
At VivaTech 2024, LeCun told PhD students directly: work on the architectures that overcome these weaknesses. In November 2025, he left Meta, raised $1B, and founded AMI Labs. The foundation: JEPA — Joint Embedding Predictive Architecture. Instead of predicting tokens, JEPA predicts abstract representations of the next state. Forced to learn structure, not surface.[6]
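In code terms, the contrast with token prediction looks roughly like the sketch below. This is a schematic reading of the joint-embedding idea, not Meta’s released implementation: the encoder, dimensions, EMA target, and MSE objective are stand-ins, and real JEPA variants differ, particularly in how they prevent representational collapse. The loss is computed between predicted and actual embeddings of the next state, never between generated and observed pixels or tokens.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class JEPASketch(nn.Module):
    """Schematic joint-embedding predictive setup: the error lives in representation space."""
    def __init__(self, encoder: nn.Module, dim: int = 256):
        super().__init__()
        self.context_encoder = encoder
        self.target_encoder = copy.deepcopy(encoder)   # typically updated by EMA, not by gradients
        self.predictor = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def loss(self, current_state: torch.Tensor, next_state: torch.Tensor) -> torch.Tensor:
        s_now = self.context_encoder(current_state)
        with torch.no_grad():                          # targets provide no gradient signal
            s_next = self.target_encoder(next_state)
        predicted = self.predictor(s_now)
        # Compare predicted vs. actual *embeddings* of the next state; no pixels or
        # tokens are ever reconstructed, so the model is pushed to learn structure.
        return F.mse_loss(predicted, s_next)
```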
State of the Art, 2025
The RL and video-prediction traditions have converged, and things are moving fast:[7]
- Genie 3 (DeepMind) — interactive environments from unlabeled video, no action labels. Photorealistic 720p at 24 FPS.[8]
- UniSim (ICLR 2024 Outstanding Paper) — an RL policy trained inside a video world model transferred to a real robot with 81% zero-shot success.[9]
- CausVid (CVPR 2025) — made video diffusion causal and autoregressive, enabling real-time action injection for the first time.[10]
- Self Forcing (NeurIPS 2025) — distilled 35 denoising steps down to 4, enabling real-time interactive video generation.[11]
- DreamGen (NVIDIA, 2025) — a humanoid learned 22 behaviors in unseen environments from one pick-and-place demonstration. No additional teleoperation.[12]
Honest caveat: general-purpose manipulation is still unsolved. Most production robotics in 2025 still runs on vision-language-action models (VLAs), not world models. Generation quality degrades over longer horizons. The specific demonstrations are remarkable; the general case is future work.
The Enterprise Problem
Enterprises don’t run on text. They run on transactions, approvals, state machines, and causal chains. An LLM that has read every procurement manual has no model of what happens when a purchase order hits a three-way match exception at quarter-close. It can describe the process. It cannot reason inside it.
The grounding gap is structural. More context, bigger models, better prompts — these extend the plateau. They don’t lift the ceiling.
What makes the enterprise particularly interesting: the decades of structured transactional data sitting in ERP records, financial flows, and supply chain events are not text. They are a causal record of how business processes actually work. That data, modeled correctly, is closer to a world-model substrate than any fine-tuned documentation corpus. The organisations that understand this earliest will have an AI advantage that is genuinely hard to replicate.
What Comes Next
- 2025–2027 — Simulators for training. World models as synthetic data generators, not primary inference engines. DreamGen’s approach scales well and sidesteps the hard real-time inference problem.
- 2027–2030 — Causal retrieval. JEPA-style architectures replace text-chunk RAG in enterprise AI: embed query → retrieve structural relationships → reason over causal graph (a rough sketch of that flow follows this list).
- Beyond — Grounded agents. A persistent, updatable internal model of an operating environment. Not AGI — something more tractable and more useful. AI that knows the difference between what it was told and what is true.
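As a loose sketch of what that second phase could look like, here is a retrieval step that returns a neighbourhood of a causal graph instead of text passages. Everything in it is hypothetical illustration rather than a description of any existing system: the node names echo the procurement example above, and embed_node, query_embedding, and the cosine scoring stand in for whatever representation model sits underneath.

```python
import networkx as nx
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# A toy causal graph over business events; edge direction reads "feeds into".
# Node names and the embed_node / query_embedding inputs are purely hypothetical.
process = nx.DiGraph()
process.add_edges_from([
    ("purchase_order", "goods_receipt"),
    ("goods_receipt", "three_way_match"),
    ("supplier_invoice", "three_way_match"),
    ("three_way_match", "payment_release"),
])

def causal_retrieve(query_embedding, graph, embed_node, k=3):
    """Return structure, not text chunks: the causes and effects around the best-matching nodes."""
    anchors = sorted(graph.nodes, key=lambda n: -cosine(query_embedding, embed_node(n)))[:k]
    relevant = set(anchors)
    for node in anchors:
        relevant |= nx.ancestors(graph, node) | nx.descendants(graph, node)
    return graph.subgraph(relevant)   # hand this causal neighbourhood to the reasoning step
```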
LLMs will power enterprise applications for years. But the frontier has moved. The shoelace problem is still unsolved. For the first time since 1943, we have the pieces to start solving it.
The author leads an enterprise AI team building LLM-powered copilots, evaluation frameworks, and agentic infrastructure. Previously: Generative AI at Mastercard, B2B Search at G2.
References
1. Craik, K. (1943). The Nature of Explanation. Cambridge University Press.
2. Schmidhuber, J. (1990). Making the World Differentiable. Technical Report FKI-126-90, TU Munich.
3. Ha, D., & Schmidhuber, J. (2018). World Models. arXiv:1803.10122.
4. Hafner, D., et al. (2023). Mastering Diverse Domains through World Models (DreamerV3). arXiv:2301.04104. Published in Nature, 2025.
5. LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. Meta AI / OpenReview.
6. Bardes, A., et al. (2024). V-JEPA: Revisiting Feature Prediction for Video Representations. arXiv:2404.08471.
7. Yin, H., & Xia, N. (2025). The Model That Dreams the World. MoE Capital.
8. Bruce, J., et al. (2024). Genie: Generative Interactive Environments. Google DeepMind.
9. Yang, S., et al. (2024). UniSim: Learning Interactive Real-World Simulators. ICLR 2024 Outstanding Paper.
10. Jain, A., et al. (2025). CausVid. CVPR 2025.
11. Self Forcing: Bridging Train-Test Discrepancy in Video Diffusion. NeurIPS 2025.
12. NVIDIA Research (2025). DreamGen: Unlocking Generalizable Robot Policy Learning via Video World Models.