When you plan to catch a ball, you don't try every possible arm movement and see what happens. Instead, you maintain an internal model of how the world works—how balls move through air, how your muscles affect your arm position, how objects behave when caught. You run mental simulations: 'If I move my arm there, the ball will land in my hand.' This internal model enables planning, imagination, and reasoning about the future without physical experimentation.
World Models in artificial intelligence aim to replicate this capability: learning internal simulators of the environment that enable agents to imagine future trajectories, evaluate potential actions, and plan effectively without executing actions in the real world. This represents a fundamental shift from reactive systems that map observations directly to actions toward systems that understand the dynamics of their environment.
The world model paradigm has emerged as a major research direction because it addresses core limitations of model-free approaches—sample inefficiency, lack of generalization, and inability to reason about novel situations. Systems with accurate world models could learn new tasks from imagination alone.
By the end of this page, you will understand what world models are and why they matter, the key technical approaches to learning world models, how world models enable model-based reinforcement learning and planning, recent breakthroughs in large-scale world models, and the fundamental challenges that remain in this exciting research direction.
A world model is a learned representation of environment dynamics—a model that predicts how the environment will change in response to actions. Formally, given the current state (or observation) and an action, the world model predicts the next state, and potentially the reward or other relevant quantities.
Formal Definition
In the context of reinforcement learning, where an agent interacts with a Markov Decision Process (MDP), a world model approximates the environment's transition dynamics (and often the reward function):

p̂_θ(s_{t+1}, r_t | s_t, a_t) ≈ p(s_{t+1}, r_t | s_t, a_t)
The world model can be deterministic or stochastic, and it can operate directly on raw observations or in a learned latent space; both distinctions are explored in the sections below.
Why World Models Matter
World models address several fundamental limitations of model-free reinforcement learning:
Sample Efficiency: Model-free methods learn directly from environment interactions, often requiring millions of samples. World models enable learning from imagined experience, dramatically reducing required real-world interactions.
Transfer and Generalization: A good world model captures invariant aspects of environment dynamics that transfer across tasks. An agent that understands physics can apply this knowledge to new goals.
Planning and Reasoning: With a world model, agents can mentally simulate action consequences, enabling look-ahead planning, counterfactual reasoning, and systematic exploration.
Safe Exploration: Agents can test dangerous or expensive actions in imagination before committing in the real world.
| Aspect | Model-Free | Model-Based with World Model |
|---|---|---|
| Learning approach | Direct policy/value from interactions | Learn dynamics, then plan/optimize |
| Sample efficiency | Low (millions of samples) | High (orders of magnitude fewer) |
| Computation at inference | Fast (forward pass) | Higher (planning in model) |
| Generalization | Task-specific | Potentially task-agnostic |
| Failure mode | Needs more data | Compounding model errors |
| Examples | DQN, PPO, SAC | Dreamer, MuZero, PlaNet |
Model-based methods trade interaction complexity for computational complexity. They require fewer real-world samples but more computation for learning the model and planning within it. As compute becomes cheaper while real-world interaction remains expensive (robotics, healthcare, autonomous driving), this tradeoff increasingly favors model-based approaches.
World models have evolved from simple forward models in state space to sophisticated latent-space predictors that can imagine complex environments. Understanding the architectural evolution helps contextualize current approaches.
Direct Pixel Prediction
The most straightforward approach predicts future observations (e.g., video frames) directly. Given current observations and action, predict the next observation.
Challenges:
- Pixel space is extremely high-dimensional, making prediction computationally expensive
- Most pixel-level detail is irrelevant to the agent's decisions
- Predictions tend to blur and drift over long horizons as small errors compound
Despite these challenges, video prediction models have shown impressive results, with recent large-scale models (Sora, Genie) demonstrating remarkably coherent long-horizon generation.
Recurrent State-Space Models
More powerful world models maintain a latent state that summarizes observation history and predicts forward in this abstract space:
This architecture, pioneered by systems like PlaNet and Dreamer, enables efficient planning in the compact latent space rather than expensive planning in observation space.
```python
# Conceptual World Model Architecture (Dreamer-style)
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple, Dict


class RecurrentStateSpaceModel(nn.Module):
    """
    A latent world model that maintains belief state and predicts dynamics.

    Components:
    - Encoder: observations → latent
    - RSSM: recurrent dynamics in latent space
    - Decoder: latent → observations
    - Reward model: latent → rewards
    """

    def __init__(
        self,
        obs_dim: int,
        action_dim: int,
        hidden_dim: int = 512,
        deterministic_dim: int = 256,
        stochastic_dim: int = 32,
    ):
        super().__init__()

        # Observation encoder
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
        )

        # Recurrent State-Space Model (RSSM) components
        # Deterministic state transition: input is previous stochastic latent + action
        self.gru = nn.GRUCell(stochastic_dim + action_dim, deterministic_dim)

        # Prior: p(z_t | h_t) - predictions without observation
        self.prior_net = nn.Sequential(
            nn.Linear(deterministic_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, stochastic_dim * 2),  # mean and std
        )

        # Posterior: q(z_t | h_t, o_t) - incorporates observation
        self.posterior_net = nn.Sequential(
            nn.Linear(deterministic_dim + hidden_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, stochastic_dim * 2),
        )

        # Observation decoder
        self.decoder = nn.Sequential(
            nn.Linear(deterministic_dim + stochastic_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, obs_dim),
        )

        # Reward predictor
        self.reward_model = nn.Sequential(
            nn.Linear(deterministic_dim + stochastic_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, 1),
        )

        self.deterministic_dim = deterministic_dim
        self.stochastic_dim = stochastic_dim

    def _sample_latent(
        self, stats: torch.Tensor
    ) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        """Sample from a Gaussian with the reparameterization trick."""
        mean, std = torch.chunk(stats, 2, dim=-1)
        std = F.softplus(std) + 0.1  # Ensure positive std
        # Reparameterization: z = mean + std * epsilon
        epsilon = torch.randn_like(std)
        sample = mean + std * epsilon
        return sample, (mean, std)

    def observe(
        self,
        obs: torch.Tensor,
        action: torch.Tensor,
        hidden: torch.Tensor,
        stochastic: torch.Tensor,
    ) -> Dict[str, torch.Tensor]:
        """
        Process an observation and update the world model state.
        Returns the posterior latent state (uses the observation).
        """
        # Encode observation
        obs_embed = self.encoder(obs)

        # Update deterministic state from the previous stochastic latent and action
        gru_input = torch.cat([stochastic, action], dim=-1)
        new_hidden = self.gru(gru_input, hidden)

        # Compute prior (without observation)
        prior_stats = self.prior_net(new_hidden)
        prior_sample, prior_dist = self._sample_latent(prior_stats)

        # Compute posterior (with observation)
        posterior_input = torch.cat([new_hidden, obs_embed], dim=-1)
        posterior_stats = self.posterior_net(posterior_input)
        posterior_sample, posterior_dist = self._sample_latent(posterior_stats)

        return {
            'hidden': new_hidden,
            'posterior': posterior_sample,
            'prior': prior_sample,
            'posterior_dist': posterior_dist,
            'prior_dist': prior_dist,
        }

    def imagine(
        self,
        action: torch.Tensor,
        hidden: torch.Tensor,
        stochastic: torch.Tensor,
    ) -> Dict[str, torch.Tensor]:
        """
        Imagine forward one step without an observation.
        Uses the prior distribution for the next stochastic state.
        """
        # Combine the previous stochastic latent and the action for the GRU input
        gru_input = torch.cat([stochastic, action], dim=-1)
        new_hidden = self.gru(gru_input, hidden)

        # Sample from the prior (no observation to condition on)
        prior_stats = self.prior_net(new_hidden)
        prior_sample, prior_dist = self._sample_latent(prior_stats)

        return {
            'hidden': new_hidden,
            'stochastic': prior_sample,
            'prior_dist': prior_dist,
        }

    def decode(self, hidden: torch.Tensor, stochastic: torch.Tensor) -> torch.Tensor:
        """Reconstruct the observation from the model state."""
        state = torch.cat([hidden, stochastic], dim=-1)
        return self.decoder(state)

    def predict_reward(self, hidden: torch.Tensor, stochastic: torch.Tensor) -> torch.Tensor:
        """Predict the reward from the model state."""
        state = torch.cat([hidden, stochastic], dim=-1)
        return self.reward_model(state)
```

Stochastic vs Deterministic Models
World models must handle environment stochasticity—the inherent randomness in dynamics. Two paradigms exist:
Deterministic Models: Predict a single next state: z' = f(z, a)
Stochastic Models: Predict a distribution over next states: p(z' | z, a)
Modern world models typically use stochastic components for representing uncertainty while maintaining deterministic recurrent states for stability. The RSSM (Recurrent State-Space Model) architecture combines both: a deterministic GRU state carries long-term information, while stochastic latents capture moment-to-moment uncertainty.
Discrete vs Continuous Latent Spaces
Recently, some world models have adopted discrete latent representations: vector-quantized codes, categorical latents as in DreamerV2, or token sequences consumed by Transformer dynamics models.
Discrete representations can be more robust, enable token-based sequence modeling (GPT-style world models), and connect to symbolic reasoning. MuZero and DreamerV3 use discrete or hybrid representations successfully.
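To make the discrete option concrete, here is a minimal sketch of a categorical latent head trained with a straight-through gradient estimator, in the spirit of DreamerV2's discrete latents. The class name, number of variables, and dimensions are illustrative, not the published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CategoricalLatentHead(nn.Module):
    """Discrete latent state: a vector of categorical variables (DreamerV2-style sketch)."""

    def __init__(self, in_dim: int, num_vars: int = 32, num_classes: int = 32):
        super().__init__()
        self.num_vars = num_vars
        self.num_classes = num_classes
        self.logits_net = nn.Linear(in_dim, num_vars * num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Predict logits for each categorical variable
        logits = self.logits_net(features).view(-1, self.num_vars, self.num_classes)
        probs = F.softmax(logits, dim=-1)

        # Sample one-hot vectors (the sampling step itself is non-differentiable)
        indices = torch.distributions.Categorical(probs=probs).sample()
        one_hot = F.one_hot(indices, self.num_classes).float()

        # Straight-through estimator: the forward pass uses the hard sample,
        # the backward pass uses the gradient of the probabilities
        sample = one_hot + probs - probs.detach()

        # Flatten back to a single latent vector for downstream networks
        return sample.view(-1, self.num_vars * self.num_classes)
```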
The quality of world model predictions depends critically on the learned representation. The latent space must capture aspects of the observation relevant for dynamics and decisions while ignoring irrelevant details. This connects world model learning to broader questions in representation learning and raises the question: what is the 'right' representation for a world model?
Once we have a learned world model, we can use it to plan—searching over possible action sequences to find those that lead to desirable outcomes. This enables agents to act intelligently on new tasks even without task-specific training, simply by specifying goals and planning toward them.
Planning Problem Formulation
Given:
- a learned world model that predicts next states (and rewards) from states and actions
- a reward function or learned reward predictor
- the current state s_0
- a planning horizon H
Find an action sequence (a_0, a_1, ..., a_{H-1}) that maximizes expected cumulative reward:
max_{a_{0:H-1}} E[ Σ_{t=0}^{H-1} γ^t r_t | world model ]
This is solved through search in the model rather than real-world execution.
Model Predictive Control (MPC)
A common planning approach in continuous control:
1. At each timestep, use the world model to simulate many candidate action sequences over a horizon H.
2. Score each sequence by its predicted cumulative reward.
3. Execute only the first action of the best sequence.
4. Observe the actual outcome and re-plan from the new state.
The 'model predictive' aspect refers to only executing the first action and then re-planning—this handles model errors by continuously correcting the plan based on actual observations.
Cross-Entropy Method (CEM)
A popular optimization method for MPC planning:
1. Sample a population of action sequences from a Gaussian distribution.
2. Evaluate each sequence by rolling it out in the world model and summing predicted rewards.
3. Keep the top-scoring 'elite' sequences.
4. Refit the Gaussian to the elites and repeat for a few iterations.
CEM is simple but effective, handling high-dimensional action spaces better than random search.
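As a concrete illustration, here is a minimal sketch of CEM-based MPC running entirely in a latent world model. It assumes an `imagine`/`predict_reward` interface like the `RecurrentStateSpaceModel` above and continuous actions clipped to [-1, 1]; the population sizes, horizon, and iteration counts are arbitrary placeholders.

```python
import torch


@torch.no_grad()
def cem_plan(world_model, hidden, stochastic, action_dim, horizon=12,
             num_candidates=500, num_elites=50, num_iterations=5):
    """Cross-Entropy Method planning in the world model's latent space.

    Returns the first action of the best sequence found (MPC: re-plan each step).
    """
    device = hidden.device
    mean = torch.zeros(horizon, action_dim, device=device)
    std = torch.ones(horizon, action_dim, device=device)

    for _ in range(num_iterations):
        # Sample candidate action sequences from the current Gaussian
        actions = mean + std * torch.randn(num_candidates, horizon, action_dim, device=device)
        actions = actions.clamp(-1.0, 1.0)

        # Roll every candidate forward in imagination and accumulate predicted reward
        h = hidden.expand(num_candidates, -1)
        z = stochastic.expand(num_candidates, -1)
        total_reward = torch.zeros(num_candidates, device=device)
        for t in range(horizon):
            step = world_model.imagine(actions[:, t], h, z)
            h, z = step['hidden'], step['stochastic']
            total_reward += world_model.predict_reward(h, z).squeeze(-1)

        # Keep the elite sequences and refit the sampling distribution to them
        elites = actions[total_reward.topk(num_elites).indices]
        mean = elites.mean(dim=0)
        std = elites.std(dim=0) + 1e-4

    # MPC: execute only the first action, then re-plan at the next step
    return mean[0]
```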
Monte Carlo Tree Search (MCTS)
For discrete action spaces or when exploration is critical, MCTS provides a principled approach:
1. Selection: descend the search tree, choosing actions that balance estimated value and exploration (e.g., via an upper confidence bound).
2. Expansion: add a new node by simulating an action with the world model.
3. Evaluation: estimate the new node's value, typically with a learned value function.
4. Backup: propagate the result up the tree to refine value estimates along the path.
MCTS with learned world models achieved superhuman performance in Go (AlphaGo Zero) and later in Atari (MuZero).
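The following is a schematic sketch of MCTS over a learned latent model with a small discrete action space. It is deliberately simplified relative to MuZero (no policy prior, no normalized value bounds), and `dynamics` and `value_fn` are assumed callables standing in for the learned model rather than a real library API.

```python
import math


class Node:
    def __init__(self, latent, reward=0.0):
        self.latent = latent          # latent state produced by the learned model
        self.reward = reward          # predicted reward for reaching this node
        self.children = {}            # action -> Node
        self.visit_count = 0
        self.value_sum = 0.0

    def value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0


def ucb_score(parent, child, c_puct=1.25):
    # Upper-confidence bound balancing the value estimate against exploration
    exploration = c_puct * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    return child.value() + exploration


def mcts_search(root_latent, dynamics, value_fn, num_actions, num_simulations=50, gamma=0.99):
    """dynamics(latent, action) -> (next_latent, reward); value_fn(latent) -> float."""
    root = Node(root_latent)

    for _ in range(num_simulations):
        node, path = root, [root]

        # Selection: descend through fully expanded nodes following UCB
        while len(node.children) == num_actions:
            _, node = max(node.children.items(),
                          key=lambda kv: ucb_score(path[-1], kv[1]))
            path.append(node)

        # Expansion: add one child using the learned dynamics model
        action = len(node.children)
        next_latent, reward = dynamics(node.latent, action)
        child = Node(next_latent, reward)
        node.children[action] = child
        path.append(child)

        # Evaluation: bootstrap the leaf with the learned value function
        value = value_fn(child.latent)

        # Backup: propagate the discounted value up the visited path
        for n in reversed(path):
            n.value_sum += value
            n.visit_count += 1
            value = n.reward + gamma * value

    # Act with the most-visited root action
    return max(root.children.items(), key=lambda kv: kv[1].visit_count)[0]
```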
Learning to Plan: Policy Optimization in Imagination
Rather than planning at test time, we can use the world model to train a policy:
1. Roll out imagined trajectories in the model, starting from states encountered in real experience.
2. Estimate returns along these imagined trajectories using the model's reward (and a learned value) predictions.
3. Update the policy to maximize imagined returns, backpropagating through the differentiable dynamics where possible.
This is the approach taken by Dreamer and related methods. Advantages include:
- Fast inference: acting requires only a policy forward pass, with no search at decision time
- The cost of planning is amortized into the learned policy during training
- Gradients can flow through the learned dynamics, providing a rich learning signal
The policy essentially 'compiles' extensive planning into learned behavior.
Behavior Learning in Latent Space
The most efficient approach learns behavior directly in the world model's latent space:
- Encode real observations into latent states once
- Roll out imagined trajectories purely in latent space
- Train the actor and critic on these latent trajectories, never decoding back to observations
This avoids expensive observation prediction during training—we only need latent dynamics, which are faster to compute.
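Here is a minimal sketch of one behavior-learning step in imagination, Dreamer-style, assuming the `RecurrentStateSpaceModel` interface above plus externally defined `actor` and `critic` networks. The plain discounted return stands in for the λ-returns used in the actual papers.

```python
import torch


def train_in_imagination(world_model, actor, critic, start_hidden, start_stochastic,
                         horizon=15, gamma=0.99):
    """One behavior-learning step on imagined latent trajectories (Dreamer-style sketch)."""
    h, z = start_hidden, start_stochastic
    states, rewards = [], []

    # Roll the policy forward purely in latent space -- no decoding to pixels
    for _ in range(horizon):
        state = torch.cat([h, z], dim=-1)
        action = torch.tanh(actor(state))              # continuous action in [-1, 1]
        step = world_model.imagine(action, h, z)
        h, z = step['hidden'], step['stochastic']
        states.append(torch.cat([h, z], dim=-1))
        rewards.append(world_model.predict_reward(h, z))

    # Simple discounted returns with a value bootstrap at the final imagined state
    # (the actual Dreamer papers use lambda-returns here)
    returns, running = [], critic(states[-1]).detach()
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    returns = torch.stack(returns)

    # Critic regresses onto returns; actor maximizes them by backprop through the model
    values = torch.stack([critic(s.detach()) for s in states])
    critic_loss = ((values - returns.detach()) ** 2).mean()
    actor_loss = -returns.mean()
    return actor_loss, critic_loss
```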
All planning methods face the fundamental challenge of model error. Small prediction errors compound over long rollouts, leading to unrealistic imagined trajectories. Strategies to mitigate this include: short planning horizons, ensembling multiple models, conservative uncertainty-aware planning, and model-free policy refinement. No approach completely solves this challenge yet.
The development of world models has been marked by several breakthrough systems, each introducing key innovations. Understanding these landmarks provides context for the current state of the field.
Ha & Schmidhuber's 'World Models' (2018)
This influential paper crystallized the modern world model paradigm:
- V: a variational autoencoder compresses each frame into a compact latent vector
- M: a recurrent network with a mixture-density output predicts how latents evolve over time
- C: a small linear controller, trained with an evolution strategy, selects actions from the latent state
Key insight: The 'large' complex model learns the world; the 'small' policy learns to act within it. Demonstrated that agents can learn largely from imagined experience, solving CarRacing from pixels with ~100x fewer real environment steps.
PlaNet (2019)
DeepMind's 'Learning Latent Dynamics for Planning from Pixels' introduced:
- The Recurrent State-Space Model (RSSM), combining deterministic and stochastic latent states
- Planning with the cross-entropy method directly in latent space
- Learning the model purely from pixels, using reconstruction and reward-prediction losses
PlaNet demonstrated strong results on DeepMind Control Suite tasks, establishing latent planning as a viable approach to image-based control.
Dreamer (2020) and DreamerV2/V3
Building on PlaNet, Dreamer introduced policy learning in imagination: an actor and critic are trained entirely on imagined latent trajectories, with value gradients backpropagated through the learned dynamics.
DreamerV2 (2021) added discrete latent representations for more stable learning. DreamerV3 (2023) achieved significant scaling, learning to play Minecraft from raw pixels—including the challenging 25-minute task of collecting a diamond.
DreamerV3's success on Minecraft demonstrated world models can handle:
- Extremely sparse rewards over very long horizons
- Procedurally generated, visually diverse open worlds
- Many domains with a single, fixed set of hyperparameters
MuZero (2020)
MuZero represents perhaps the most impressive model-based success story:
- It learns a model that predicts only the quantities needed for planning: reward, value, and policy
- It plans with MCTS in this learned abstract state space, never reconstructing observations
- It matched or exceeded prior state-of-the-art in Go, chess, shogi, and Atari without being given the rules
MuZero's key insight: the world model need only capture aspects relevant for decision-making. Predicting future values and rewards is more tractable than predicting future pixels, and sufficient for planning.
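The sketch below illustrates the idea of a value-equivalent model: three small learned functions (representation, dynamics, prediction) that never reconstruct observations. The layer sizes and MLP design are illustrative, not MuZero's published architecture.

```python
import torch
import torch.nn as nn


class ValueEquivalentModel(nn.Module):
    """MuZero-style model: predicts only what planning needs (reward, value, policy)."""

    def __init__(self, obs_dim: int, num_actions: int, latent_dim: int = 128):
        super().__init__()
        # h: observation -> abstract latent state
        self.representation = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        # g: (latent, action) -> next latent and predicted reward
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + num_actions, 256), nn.ReLU(),
            nn.Linear(256, latent_dim + 1))
        # f: latent -> policy logits and value estimate
        self.prediction = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, num_actions + 1))
        self.num_actions = num_actions
        self.latent_dim = latent_dim

    def initial_state(self, obs):
        return self.representation(obs)

    def step(self, latent, action_index):
        # One imagined step: no pixels predicted, only the next latent and reward
        action_onehot = nn.functional.one_hot(action_index, self.num_actions).float()
        out = self.dynamics(torch.cat([latent, action_onehot], dim=-1))
        return out[..., :self.latent_dim], out[..., self.latent_dim:]

    def predict(self, latent):
        # Policy and value targets for MCTS come from this head
        out = self.prediction(latent)
        return out[..., :self.num_actions], out[..., self.num_actions:]
```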
Video Prediction Models: Sora and Beyond
Recent advances in large-scale video diffusion models suggest a new paradigm for world models:
- Models trained on internet-scale video can generate coherent, minute-long clips
- Generated videos show emergent consistency in object permanence and 3D structure
- Text conditioning allows scenes and events to be specified in natural language
While not yet integrated into agentic systems, video diffusion models represent a possible future where world models emerge from large-scale pretraining, similar to how language understanding emerged in LLMs.
Genie (2024)
DeepMind's Genie learns world models from unlabeled internet videos:
- It infers a latent action space without any action labels
- From a single image prompt, it generates playable, action-controllable 2D environments
- Training uses large volumes of publicly available gameplay video
Genie suggests that foundation world models—trained on diverse video data—might provide general-purpose environment understanding transferable to downstream tasks.
| System | Year | Key Innovation | Domain |
|---|---|---|---|
| World Models | 2018 | VAE + RNN architecture, imagination training | CarRacing |
| PlaNet | 2019 | RSSM, CEM planning in latent space | DMC Suite |
| Dreamer | 2020 | Actor-critic in imagination | DMC Suite |
| MuZero | 2020 | Value-equivalent models, MCTS | Atari, Board Games |
| DreamerV3 | 2023 | Scalability, discrete latents | Minecraft (diamond) |
| Sora/Genie | 2024 | Internet-scale generative video, interactive environments | Open-ended video |
In just six years, world models have advanced from solving simple racing games to collecting diamonds in Minecraft and generating coherent minute-long videos. This trajectory suggests world models may be approaching a capability threshold where they become practical for real-world applications in robotics, scientific simulation, and interactive systems.
A crucial aspect of human world understanding is its compositional, object-centric nature. We represent the world not as a homogeneous field of pixels but as collections of distinct objects with properties and relations. Object-centric world models aim to capture this structure, decomposing scenes into discrete object representations and modeling dynamics at the object level.
Why Object-Centric?
Object-centric representations offer several advantages:
- Compositional generalization: novel scenes are new combinations of familiar objects
- Sample efficiency: dynamics learned for one object reuse across all objects
- Relational reasoning: interactions can be modeled explicitly between object pairs
- Interpretability: predictions can be inspected object by object
Unsupervised Object Discovery
A key challenge is discovering objects from raw observations without supervision:
MONet (Multi-Object Network): Iteratively segments scenes using attention, representing each segment with a separate VAE latent.
IODINE (Iterative Object Decomposition Inference Network): Refines object representations through iterative amortized inference.
Slot Attention: Uses attention mechanism with 'slots' that compete to represent different objects. Particularly successful and efficient, enabling object discovery from single images.
These methods can discover individual objects in complex scenes, enabling subsequent object-level reasoning.
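As an illustration of the competitive-attention idea, here is a compact Slot Attention sketch in the style of Locatello et al.; the hyperparameters (number of slots, iterations, dimensions) are placeholders.

```python
import torch
import torch.nn as nn


class SlotAttention(nn.Module):
    """Slots compete (softmax over slots) to explain input features."""

    def __init__(self, num_slots: int = 5, dim: int = 64, iters: int = 3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 2), nn.ReLU(), nn.Linear(dim * 2, dim))
        self.norm_inputs = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # inputs: [batch, num_features, dim] (e.g., a flattened CNN feature map)
        b, n, d = inputs.shape
        inputs = self.norm_inputs(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)

        # Initialize slots from a learned Gaussian
        slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(
            b, self.num_slots, d, device=inputs.device)

        for _ in range(self.iters):
            slots_prev = slots
            q = self.to_q(self.norm_slots(slots))

            # Softmax over *slots*: features are divided among competing slots
            attn = torch.softmax(torch.einsum('bid,bjd->bij', q, k) * self.scale, dim=1)
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)   # weighted mean per slot
            updates = torch.einsum('bij,bjd->bid', attn, v)

            # Recurrent slot update followed by a residual MLP
            slots = self.gru(updates.reshape(-1, d),
                             slots_prev.reshape(-1, d)).reshape(b, self.num_slots, d)
            slots = slots + self.mlp(self.norm_mlp(slots))

        return slots  # [batch, num_slots, dim]: one vector per discovered object
```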
Object-Centric Dynamics
Given object representations, we can learn dynamics at the object level:
Graph Neural Networks for Dynamics: Represent objects as nodes and potential interactions as edges. GNN message passing models how objects influence each other (a minimal sketch appears at the end of this subsection).
Relational World Models: Explicitly model relations between objects (touching, contains, supports) and how actions affect these relations. Enables more symbolic-like reasoning about object interactions.
Hierarchical Abstraction: Some approaches learn hierarchies of increasingly abstract representations, for example from raw pixels to object states to relations and events.
This enables reasoning at appropriate abstraction levels for different tasks.
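To make the GNN approach concrete, here is a minimal sketch of interaction-network-style message passing over object slots. It assumes object representations have already been extracted (e.g., by Slot Attention), and the dimensions and single round of message passing are illustrative.

```python
import torch
import torch.nn as nn


class ObjectDynamicsGNN(nn.Module):
    """Predicts each object's next state from pairwise interactions (one message-passing round)."""

    def __init__(self, slot_dim: int = 64, action_dim: int = 4, hidden_dim: int = 128):
        super().__init__()
        # Edge model: message from object j to object i
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * slot_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))
        # Node model: update each object from its state, incoming messages, and the action
        self.node_mlp = nn.Sequential(
            nn.Linear(slot_dim + hidden_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, slot_dim))

    def forward(self, slots: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # slots: [batch, num_objects, slot_dim]; action: [batch, action_dim]
        batch, num_objects, slot_dim = slots.shape

        # Build all ordered pairs (receiver i, sender j) of object states
        senders = slots.unsqueeze(1).expand(-1, num_objects, -1, -1)
        receivers = slots.unsqueeze(2).expand(-1, -1, num_objects, -1)
        pairs = torch.cat([receivers, senders], dim=-1)

        # Messages along every edge, summed per receiving object
        messages = self.edge_mlp(pairs).sum(dim=2)          # [batch, num_objects, hidden_dim]

        # Each object is updated from its own state, incoming messages, and the global action
        action_tiled = action.unsqueeze(1).expand(-1, num_objects, -1)
        node_input = torch.cat([slots, messages, action_tiled], dim=-1)
        return slots + self.node_mlp(node_input)             # residual next-state prediction
```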
Challenges in Object-Centric World Modeling
Object-centric learning faces a version of the 'binding problem' from cognitive science: how do we bind together features that belong to the same object? Slot Attention addresses this through competitive attention—features must compete for slot membership. This mirrors theories of attention in human perception as a binding mechanism.
Robotics presents perhaps the most compelling application domain for world models. Physical robot interaction is slow, expensive, and potentially dangerous. World models that enable learning in simulation and imagination could dramatically accelerate robot learning while improving safety.
The Sim-to-Real Challenge
A major challenge in robot world models is sim-to-real transfer—models trained in simulation often fail when deployed on real robots due to:
- Unmodeled physical effects such as friction, contact dynamics, and deformation
- Visual differences between rendered and real camera images
- Sensor noise, actuation delays, and calibration drift
Domain Randomization: Make models robust by training with randomized simulation parameters:
- Physical parameters: masses, friction coefficients, motor strengths, latencies
- Visual parameters: textures, lighting, camera pose and field of view
The hope is that the real world becomes 'just another sample' from the randomized distribution.
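A minimal sketch of what domain randomization looks like in practice: a fresh set of simulator parameters is drawn for every episode. The parameter names and ranges are invented for illustration, and the commented-out `make_sim_env` factory and rollout helper are hypothetical placeholders.

```python
import random


def sample_randomized_sim_params():
    """Draw one set of simulator parameters; names and ranges are purely illustrative."""
    return {
        "object_mass_kg": random.uniform(0.2, 2.0),
        "surface_friction": random.uniform(0.4, 1.2),
        "motor_latency_ms": random.uniform(0.0, 40.0),
        "camera_fov_deg": random.uniform(50.0, 70.0),
        "light_intensity": random.uniform(0.3, 1.5),
    }


# Usage sketch: rebuild the simulator with fresh parameters for every episode,
# so the learned world model never overfits to one particular physics setting.
# for episode in range(num_episodes):
#     env = make_sim_env(**sample_randomized_sim_params())   # hypothetical factory
#     trajectory = collect_trajectory(env, policy)           # hypothetical rollout helper
#     replay_buffer.add(trajectory)
```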
Learning from Real Data: Alternatively, learn world models directly from real robot experience. This sidesteps the sim-to-real gap entirely, but demands highly sample-efficient model learning and careful attention to safety during data collection.
Dexterous Manipulation
World models have shown particular promise for dexterous manipulation—complex object manipulation requiring fine motor control.
Recent work from OpenAI, DeepMind, and others has demonstrated sim-to-real transfer of complex manipulation skills, including Rubik's Cube manipulation, in-hand reorientation, and bimanual coordination.
Mobile Robots and Navigation
For mobile robots, world models capture:
- The spatial layout of the environment and which terrain is traversable
- The robot's own motion dynamics
- The likely behavior of other agents such as pedestrians and vehicles
Foundation Models for Robotics
Emerging work explores using large pretrained models as world models for robotics:
Systems like RT-2 and PaLM-E demonstrate that pretrained multimodal models contain substantial world knowledge applicable to robot control.
Some researchers argue that true world understanding requires embodiment—physical interaction with the world rather than just observation. If true, robot world models may need to be learned through physical experience rather than from internet-scale observation alone. This remains an open and fascinating question at the intersection of AI and cognitive science.
Despite impressive progress, world models face fundamental challenges that represent active research frontiers. Solving these could unlock transformative capabilities.
Long-Horizon Prediction and Compounding Errors
World models inevitably have prediction errors. Over long rollouts, these errors compound catastrophically: imagined trajectories drift away from anything the real environment would produce, and an agent optimizing against the model can learn to exploit its inaccuracies rather than solve the real task.
Current approaches mitigate but don't solve this:
- Keeping imagination horizons short
- Ensembling multiple models and penalizing their disagreement
- Conservative, uncertainty-aware planning
- Refining behavior with model-free learning on real experience
Fundamental solutions might require:
- Models that predict only decision-relevant quantities at an appropriate level of abstraction
- Calibrated uncertainty estimates that tell the planner when not to trust the model
- Mechanisms for the model to recognize situations outside its training experience
Generalization Across Environments
Most world models are trained on specific environments and don't transfer: a model of one Atari game says little about another, and dynamics learned for one robot rarely fit a different body.
The dream of foundation world models—trained on diverse data and applicable to novel environments—remains largely unrealized. Key questions: what training data is sufficient (passive video, or interaction?), how to condition predictions on actions when that data lacks action labels, and how to adapt a general model to a specific embodiment or task.
The Abstraction Problem
Humans reason at multiple levels of abstraction—from muscle movements to high-level plans. Most world models operate at a single fixed granularity. Key challenges: learning useful abstraction levels without supervision, deciding which level to plan at for a given task, and keeping predictions consistent across levels.
Open-World Generalization
Real-world agents must handle novelty—objects, situations, and dynamics never seen in training. A useful world model must recognize when it is outside its competence, fall back to cautious behavior, and update quickly from a handful of new observations.
Integration with Language and Reasoning
Current world models are largely non-linguistic. Integrating natural language could enable:
- Specifying goals, constraints, and hypotheticals in plain language
- Importing abstract knowledge from text into the model's predictions
- Explaining imagined outcomes and plans in a human-readable form
Vision-language models like Flamingo and GPT-4V demonstrate rich world knowledge in language form—connecting this to action remains a frontier.
Many believe that world models—systems that truly understand how the world works through learned internal simulators—represent a necessary component of artificial general intelligence. Current systems are impressive but narrow. The path to general-purpose world models that enable flexible reasoning, planning, and adaptation across domains remains one of AI's grand challenges.
We've explored the rich landscape of world models—from basic concepts to cutting-edge research. Let's consolidate the key insights:
- World models are learned simulators of environment dynamics that enable planning, imagination, and reasoning without real-world trial and error
- They trade real-world samples for computation, offering large gains in sample efficiency over model-free methods
- Latent-space architectures (RSSM), discrete representations, and value-equivalent models (MuZero) are the dominant technical approaches
- Landmark systems—World Models, PlaNet, Dreamer, MuZero, Genie—have progressed from toy racing games to Minecraft and open-ended video generation
- Compounding prediction errors, cross-environment generalization, abstraction, and integration with language remain open challenges
What's Next:
Having explored world models' approach to environment simulation and planning, we'll next examine AI Safety—the crucial research direction focused on ensuring AI systems, including those with capable world models, remain beneficial and aligned with human values as they become more powerful.
You now understand world models—from their theoretical foundations through architectural innovations to frontier research challenges. This foundation prepares you to engage with cutting-edge work on systems that truly understand and can reason about the physical world.