Yann LeCun has proposed JEPA as an architecture that makes predictions in latent space rather than pixel space. This in-depth analysis covers the mathematical foundations, the architectural evolution and the industrial consequences, including the billion-dollar founding of AMI Labs in March 2026.
Transformer-based large language models handle syntax and text generation with impressive accuracy. What they do not do, however, is causally understand the physical reality their symbols refer to. This gap manifests in the well-known phenomenon of hallucinations. When a model operates without grounding in physical laws, it produces content that is statistically coherent but factually impossible.
Simply scaling, adding more parameters and more data, does not resolve the causality problem. Yann LeCun therefore described autoregressive language models as a conceptual dead end on the road to Advanced Machine Intelligence (AMI). His alternative: world models that learn physical causalities, simulate the consequences of hypothetical actions and plan across multiple abstraction levels.
LeCun's position paper "A Path Towards Autonomous Machine Intelligence" describes a modular, fully differentiable architecture that integrates fast reactive behaviour (System-1 cognition) with analytical planning (System-2 cognition). The six modules communicate continuously with one another:
- **Configurator:** The executive centre of the system. It dynamically modulates the parameters, attention focus and information flows of all other modules depending on the current task or goal.
- **Perception module:** Receives raw, high-dimensional sensor data (visual, acoustic, tactile) and compresses it into a low-dimensional, task-relevant estimate of the current world state.
- **World model:** The core of intelligence and the primary domain of JEPA. It infers missing information and predicts plausible future states when a specific action sequence is executed, acting as the agent's internal simulator.
- **Cost module:** Comprises a hard-wired intrinsic cost block (analogous to pain) and a trainable critic that predicts future costs based on current observations and planned actions.
- **Actor:** Generates sequences of potential actions and optimises them using optimal control theory to identify the action sequence that minimises accumulated future costs.
- **Short-term memory:** Records the immediate history of perceptions, states and actions, providing the temporal context the world model needs to extrapolate trajectories.
Because all components are differentiable, error gradients can be propagated from the cost module back through the world model to the actor. This enables planning at inference time, where the model "thinks" intensively before every physical action rather than reactively executing a learned heuristic.
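This gradient-based planning loop can be sketched in a few lines. The toy below assumes a hand-written one-dimensional world model (next state = state + action) and a quadratic cost, with an analytic gradient in place of autodiff; it is an illustration of the principle, not Meta's implementation.

```python
# Toy sketch of inference-time planning: cost gradients flow back
# through a differentiable world model to optimise the action sequence.
# The linear transition and quadratic cost are illustrative assumptions.

def world_model(state, action):
    # Differentiable transition: next state after applying the action.
    return state + action

def cost(state, goal):
    # Intrinsic cost: squared distance to the goal state.
    return (state - goal) ** 2

def plan(start, goal, horizon=3, lr=0.1, steps=200):
    """Optimise an action sequence by propagating the cost gradient
    back through the (differentiable) world model to the actor."""
    actions = [0.0] * horizon
    for _ in range(steps):
        # Roll out the trajectory under the current plan.
        states = [start]
        for a in actions:
            states.append(world_model(states[-1], a))
        # Analytic gradient of the terminal cost w.r.t. each action:
        # d/da (s_T - goal)^2 = 2 * (s_T - goal) for this linear model.
        grad = 2.0 * (states[-1] - goal)
        actions = [a - lr * grad for a in actions]
    return actions

actions = plan(start=0.0, goal=3.0)
assert abs(sum(actions) - 3.0) < 1e-3  # the optimised plan reaches the goal
```

The model "thinks" by running this optimisation before committing to any physical action, rather than emitting a reactive policy output.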
The conceptual strength of JEPA derives directly from its mathematical foundation. An Energy-Based Model (EBM) defines a real-valued scalar energy function E(x, y), parametrised by the weights of a neural network. Compatible, physically plausible state pairs receive low energy values; incompatible configurations receive high energy values.
Prediction is not formulated as probabilistic sampling but as a minimisation problem: given a context x, the system searches for the output ŷ = argmin_y E(x, y), the candidate with the lowest energy.
Recent formal analyses reveal deep connections to theoretical physics. The energy function can be understood as the expression of a quasi-metric geometry representing the infimum of accumulated local work along a feasible trajectory. Critically, this function is asymmetric: E(x, y) does not equal E(y, x).
A glass can fall from a table and shatter, which is energetically plausible (low energy). The reverse process, where shards spontaneously reassemble into an intact glass, must receive an extremely high energy value. Probabilistic cosine similarities are symmetric and fail this fundamental requirement of physics modelling entirely.
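The glass example can be made concrete with a toy energy table. The values below are hand-coded assumptions standing in for what a trained EBM would learn; the point is that prediction reduces to argmin over candidates, and that the function need not be symmetric.

```python
# Minimal illustration (hypothetical energy values) of an asymmetric
# energy function: the forward transition is plausible, the reverse not.

def energy(x, y):
    # Hand-coded toy energies; a trained EBM would learn these.
    table = {
        ("glass_on_table", "shattered_glass"): 0.1,    # plausible
        ("shattered_glass", "glass_on_table"): 100.0,  # physically absurd
    }
    return table.get((x, y), 50.0)  # default for unlisted pairs

def predict(x, candidates):
    # Prediction as energy minimisation: y_hat = argmin_y E(x, y).
    return min(candidates, key=lambda y: energy(x, y))

futures = ["shattered_glass", "floating_glass"]
assert predict("glass_on_table", futures) == "shattered_glass"
# Asymmetry: E(x, y) != E(y, x) for the irreversible transition.
assert energy("glass_on_table", "shattered_glass") != \
       energy("shattered_glass", "glass_on_table")
```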
- **No partition function:** EBMs require no normalising sum over all possible outputs, which drastically simplifies computation and avoids intractability.
- **Multimodality:** EBMs can assign deep energy valleys to disjoint regions in data space, naturally capturing multiple equally valid futures.
- **Composability:** Energy functions can be combined additively (product of experts), offering immense flexibility when building hierarchical systems.
- **Numerical stability:** The logarithm and exponential functions that are ubiquitous in probabilistic modelling often cancel in EBMs, stabilising optimisation.
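Additive composition deserves a small sketch: summing two energies multiplies the corresponding unnormalised probabilities exp(-E), which is exactly the product-of-experts construction. The two expert functions below are hypothetical examples.

```python
# Sketch of additive energy composition (product of experts), assuming
# two hand-written expert energies: E = E1 + E2 corresponds to
# p ∝ exp(-E1) * exp(-E2).

def physics_expert(y):
    # Penalises states that violate a toy "objects don't float" rule.
    return 10.0 if y == "floating" else 0.0

def task_expert(y):
    # Penalises states far from the task goal "on_shelf".
    return 0.0 if y == "on_shelf" else 1.0

def combined_energy(y):
    # Summed energies: both experts must agree for a low total energy.
    return physics_expert(y) + task_expert(y)

candidates = ["floating", "on_floor", "on_shelf"]
assert min(candidates, key=combined_energy) == "on_shelf"
```

Because each expert stays a plain scalar function, new constraints can be stacked onto a hierarchy without renormalising anything.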
JEPA is the specific architectural instantiation that trains world models through self-supervised learning (SSL) without manually annotated labels. The key shift is that the loss function is applied in the abstract embedding space rather than in the data space (pixels, voxels).
Typically a powerful Vision Transformer (ViT). Takes the observable context (past video frames or visible image regions) and projects it into a dense, continuous representation space. The compression systematically eliminates unpredictable microstructural surface details.
Structurally identical to the context encoder. Processes the target state (the future or masked image regions) to generate the target representation. Receives no direct gradient updates but is updated as an exponentially moving average of the context encoder weights.
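The EMA update itself is a one-liner. The momentum value below is an assumption for illustration; published JEPA variants use values close to 1.

```python
# Sketch of the EMA target-encoder update (momentum 0.996 is an assumed
# illustrative value). No gradients flow into the target encoder; its
# weights only drift slowly toward the context encoder's weights.

def ema_update(target_w, context_w, momentum=0.996):
    return [momentum * t + (1.0 - momentum) * c
            for t, c in zip(target_w, context_w)]

target, context = [0.0, 0.0], [1.0, 1.0]
for _ in range(3):
    target = ema_update(target, context)
# After three updates the target has barely moved: a slowly changing,
# stable prediction target for the context encoder.
assert all(0.0 < t < 0.02 for t in target)
```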
Deliberately designed as a shallow network (lightweight MLP or a few transformer layers). Its sole task is to model the internal dynamics of the state transition, transforming the context representation into a prediction of the target representation.
A critical design element is the integration of a latent variable z into the predictor. Because the world is not fully deterministic, there is often a one-to-many mapping between a current state and possible futures. If the predictor received only the context as input, it would compute the statistical average of all possible futures, producing incoherent, blurred predictions.
The latent variable z provides the missing information that determines which of the many plausible futures actually occurs. At inference time, z acts as a control vector: by sampling different values, the world model can systematically simulate a range of alternative future scenarios for the cost module to evaluate.
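A toy predictor makes the role of z tangible. The mapping below is a hand-written stand-in for a learned predictor: the same ambiguous context yields different plausible futures depending on the sampled z, instead of a single blurred average.

```python
# Toy sketch of the latent variable z, assuming a hand-written
# predictor: z selects which plausible future of an ambiguous scene
# the world model simulates.

def predictor(context, z):
    futures = {
        0: context + "_rolls_left",
        1: context + "_rolls_right",
        2: context + "_stays_put",
    }
    return futures[z]

def simulate_futures(context, z_samples):
    # At inference time, sampling different z values lets the cost
    # module evaluate a range of alternative scenarios.
    return [predictor(context, z) for z in z_samples]

scenarios = simulate_futures("ball_on_slope", [0, 1, 2])
assert len(set(scenarios)) == 3  # three distinct futures, no averaged blur
```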
The fundamental weakness of self-supervised architectures is representational collapse: the network quickly finds a trivial global minimum where all inputs map to the same constant vector. JEPA addresses this with a combination of three methods:
| Method | Mechanism | Effect in JEPA |
|---|---|---|
| Exponential Moving Average (EMA) | Target encoder weights are updated as a moving average of the context encoder weights | Target representations change more slowly than context representations, stabilising the moving target |
| Latent Variable Regularisation | Penalises the information content of z via a regularisation term | Prevents the predictor from ignoring context and extracting all information exclusively from z |
| VICReg | Explicitly maximises variance of each embedding dimension and minimises covariance between dimensions | Ensures the embedding fills the space and no information bottleneck collapse occurs |
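The VICReg variance and covariance terms from the table can be sketched directly. This is a simplified pure-Python version for a tiny batch (the real loss also includes an invariance term and runs on GPU tensors); the numbers are illustrative.

```python
# Pure-Python sketch of VICReg's variance and covariance terms for a
# tiny batch of 2-D embeddings (simplified; invariance term omitted).
import math

def vicreg_var_cov(embeddings, target_std=1.0):
    n, d = len(embeddings), len(embeddings[0])
    means = [sum(e[j] for e in embeddings) / n for j in range(d)]
    centred = [[e[j] - means[j] for j in range(d)] for e in embeddings]
    # Variance term: hinge pushing each dimension's std above a target.
    var_loss = 0.0
    for j in range(d):
        std = math.sqrt(sum(c[j] ** 2 for c in centred) / (n - 1))
        var_loss += max(0.0, target_std - std)
    # Covariance term: squared off-diagonal covariances, decorrelating
    # embedding dimensions so information spreads across the space.
    cov_loss = 0.0
    for j in range(d):
        for k in range(d):
            if j != k:
                cov = sum(c[j] * c[k] for c in centred) / (n - 1)
                cov_loss += cov ** 2
    return var_loss / d, cov_loss / d

# A collapsed batch (all embeddings identical) is maximally penalised:
collapsed = [[0.5, 0.5]] * 4
var_loss, _ = vicreg_var_cov(collapsed)
assert abs(var_loss - 1.0) < 1e-9  # std is 0, hinge at its maximum
```

The hinge makes the trivial constant solution the most expensive one, which is exactly why the collapse minimum disappears.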
The versatility of the JEPA approach was evident in the rapid development of modality-specific variants between 2023 and 2026. Each evolutionary stage addressed specific sensory modalities and architectural bottlenecks.
The Image-based Joint Embedding Predictive Architecture was the first large-scale proof of concept. The challenge: forcing the model to learn global semantic concepts (a dog) without overfitting to local pixel textures (a specific colour variation in the fur).
I-JEPA resolves this through a multi-block masking strategy. A substantial area of the image is left unmasked as the "context block" and fed to the context encoder. Simultaneously, multiple "target blocks" at completely different locations are fully masked. The predictor must infer the semantics of these missing target blocks purely from the unmasked context, since the prediction happens in the abstract embedding space and focuses on large-scale causalities.
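The masking geometry can be sketched on a patch grid. Block sizes, counts and the 14x14 grid below are illustrative assumptions, not the paper's exact sampling distribution; the essential property shown is that target patches are removed from the context so nothing leaks.

```python
# Rough sketch of I-JEPA-style multi-block masking on a patch grid
# (grid size, block shapes and counts are illustrative assumptions).
import random

def sample_block(grid, h, w):
    # Pick the top-left corner of an h x w block of patch indices.
    top = random.randrange(grid - h + 1)
    left = random.randrange(grid - w + 1)
    return {(top + i, left + j) for i in range(h) for j in range(w)}

def multi_block_mask(grid=14, n_targets=4, seed=0):
    random.seed(seed)
    # Several smaller target blocks are masked out...
    targets = [sample_block(grid, 3, 3) for _ in range(n_targets)]
    # ...and one large context block is sampled, minus any target
    # patches, so the predictor can only infer the targets from
    # non-overlapping context.
    context = sample_block(grid, 10, 10) - set().union(*targets)
    return context, targets

context, targets = multi_block_mask()
for block in targets:
    assert context.isdisjoint(block)  # no leakage from target to context
```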
Efficiency gain: Pre-training a ViT-H/14 on ImageNet-1K required fewer than 1,200 GPU hours, more than 2.5 times faster than iBOT and over ten times more efficient than Masked Autoencoders (MAE).
Empirical research showed that the EMA mechanism does not prevent complete model collapse at every architectural configuration. C-JEPA (Contrastive-JEPA) integrates the VICReg strategy into the JEPA framework. By explicitly controlling the variance and covariance of embedding vectors across batches, C-JEPA achieves significantly faster convergence and higher performance on both linear probing and fine-tuning on ImageNet-1K.
The transition to video processing marked another shift: models had to learn not only spatial arrangements but also temporal causality, Newtonian mechanics and object permanence. V-JEPA applies the masking principle to the temporal dimension.
V-JEPA 2 (1.2 billion parameters) was pre-trained unsupervised on over one million hours of diverse video data and one million static images. The model learned an implicit understanding of physical reality: gravity, inertia, spatial occlusion and object manipulation.
The variant V-JEPA 2-AC (Action-Conditioned) was turned into a robotic world model through minimal fine-tuning on robot interaction data. The key breakthrough is zero-shot transferability: robots solved unknown manipulation tasks in unfamiliar environments without extensive retraining, because the world model enabled "thinking before acting", mentally simulating kinematic actions before activating the servomotors.
Three benchmarks released alongside V-JEPA 2 probe this physical understanding:

- **IntPhys 2:** Evaluates the ability to distinguish between physically realistic and impossible video scenarios, testing understanding of fundamental laws of nature.
- **MVPBench (Minimal Video Pairs):** Video question answering over minimally different video pairs, constructed to reveal whether models exploit statistical correlations (dataset shortcuts) or possess genuine physical understanding.
- **CausalVQA:** Focuses explicitly on cause-and-effect reasoning, anticipation of future events and counterfactual thinking (what would have happened if X had not occurred?).
The Vision-Language Joint Embedding Predictive Architecture (VL-JEPA) directly addresses the biggest criticism of established vision-language models such as GPT-4V or LLaVA: the massive inefficiencies of token generation. Classical VLMs convert visual inputs into embeddings, concatenate them with text queries and feed them into a language model that generates the answer token by token. Describing a simple 30-second video can require over 50 sequential forward passes through a massive LLM.
| Dimension | Classical Autoregressive VLMs | VL-JEPA (2026) |
|---|---|---|
| Architecture backbone | CLIP + LLM decoder | V-JEPA 2 (vision) + Llama-3 layers (predictor) |
| Learning objective | Reconstruction of text tokens | Prediction of continuous text embeddings |
| Inference latency | Extreme bottleneck from token-by-token generation | 2.85x speedup through selective decoding |
| Trainable parameters | Typically over 7 billion | 790 million (50 percent fewer than comparable models) |
| Data efficiency | High demand for text-image pairs | 43x more efficient (2.0bn samples vs. 86bn for the Perception Encoder) |
When VL-JEPA monitors a video feed, it produces a continuous stream of target embeddings. If the semantics of a scene do not change across dozens of frames (a stationary glass), the variance of the embeddings remains extremely low and the text decoder stays inactive. Only when a semantic break occurs (the glass tips over and spills water) does the embedding variance spike and the decoder is triggered for a single pass. This non-generative nature eliminates nearly 65 percent of redundant compute operations.
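The trigger logic can be sketched as a simple threshold on the change between consecutive frame embeddings. The threshold and the squared-distance metric below are assumptions for illustration, not VL-JEPA's published configuration.

```python
# Sketch of variance-triggered selective decoding (threshold and change
# metric are illustrative assumptions, not the published configuration).

def embedding_change(prev, curr):
    # Squared L2 distance between consecutive frame embeddings.
    return sum((p - c) ** 2 for p, c in zip(prev, curr))

def selective_decode(embeddings, threshold=0.5):
    """Run the (expensive) text decoder only when scene semantics
    shift; static stretches of video leave the decoder inactive."""
    decoded_at = []
    for t in range(1, len(embeddings)):
        if embedding_change(embeddings[t - 1], embeddings[t]) > threshold:
            decoded_at.append(t)  # placeholder for one decoder pass
    return decoded_at

# Static scene for 5 frames, then the glass tips over at frame 5:
stream = [[0.0, 0.0]] * 5 + [[3.0, 0.0]] * 5
assert selective_decode(stream) == [5]  # one decoder call for the clip
```

An autoregressive VLM would have paid for a full generation pass on every frame of the same static scene.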
Using a pure squared error (L2 loss) to minimise the distance between predicted and true text embeddings produced "blurred" representations, since L2 tends to pull towards the statistical mean across multimodal targets. Meta instead implemented contrastive InfoNCE loss functions and modified cosine distances in the latent space. Results were clear: InfoNCE improved VQA accuracy by 9.8 points and retrieval recall@1 by 18.6 points compared to L2, confirming that strong geometric alignment of embedding spaces is essential for complex reasoning tasks.
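An InfoNCE-style loss can be sketched in pure Python for a tiny batch. The temperature and dot-product similarity below are common defaults assumed for illustration; each predicted embedding is attracted to its own target and repelled from every other target in the batch, which is what prevents the mean-seeking behaviour of L2.

```python
# Pure-Python sketch of an InfoNCE-style contrastive loss over a tiny
# batch (temperature and similarity function are assumed defaults).
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(predicted, targets, temperature=0.1):
    """Cross-entropy where row i's positive is target i and every
    other target in the batch serves as a negative."""
    loss = 0.0
    for i, p in enumerate(predicted):
        logits = [dot(p, t) / temperature for t in targets]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)
    return loss / len(predicted)

aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
# A geometrically aligned batch scores far lower than a misaligned one:
assert info_nce(aligned, aligned) < info_nce(aligned, shuffled)
```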
The Text-Image Joint Embedding Predictive Architecture (TI-JEPA) focuses explicitly on the problem of cross-modal alignment, particularly for complex tasks such as multimodal sentiment analysis. The gap between the syntactic structure of text and the spatial arrangement of pixels is a significant challenge.
TI-JEPA integrates elaborate cross-attention mechanisms into the energy-based framework. The architecture freezes pre-trained text and image encoders to preserve their learned feature knowledge and prevent energy collapse. The freed compute capacity is used exclusively to optimise the cross-attention modules, which map cross-modal dependencies and generate a robust multimodal representation.
The ultimate goal of LeCun's vision is not to classify videos or texts but to create autonomous agents that orchestrate complex, multi-step action sequences in the physical and digital world. This is where the concept of Hierarchical JEPA (H-JEPA) applies.
Human action planning is intrinsically hierarchical: crossing a road involves abstract goals ("reach the other side"), intermediate planning levels ("wait for the green light") and microscopic control mechanisms ("contract the quadriceps"). H-JEPAs stack multiple JEPA modules on top of each other, with lower levels modelling fine-grained time steps and upper levels interpolating causalities across long time horizons.
Unlike LLMs, which can catastrophically derail when a step in the generation chain fails, EBM-based energy minimisation allows the evaluation of partial trajectories. If the energy of an intermediate state exceeds a threshold, the system initiates a course correction before the final action is executed.
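The partial-trajectory check can be sketched as an execution loop with an energy threshold. The state names, energy values and recovery plan below are hypothetical; a real H-JEPA would score latent states with its learned cost modules.

```python
# Sketch of partial-trajectory evaluation with course correction
# (toy state names and energy values; all quantities are hypothetical).

def execute_with_correction(trajectory, energy_of, replan, threshold=5.0):
    """Walk a planned trajectory, scoring each intermediate state; if a
    state's energy exceeds the threshold, replan from that point rather
    than derailing like a failed autoregressive generation chain."""
    executed = []
    for state in trajectory:
        if energy_of(state) > threshold:
            return executed + replan(state)  # course correction mid-plan
        executed.append(state)
    return executed

energies = {"pick": 1.0, "carry": 2.0, "drop_zone_blocked": 9.0, "place": 1.0}
plan = ["pick", "carry", "drop_zone_blocked", "place"]
result = execute_with_correction(
    plan,
    energy_of=lambda s: energies[s],
    replan=lambda s: ["reroute", "place"],  # hypothetical recovery plan
)
assert result == ["pick", "carry", "reroute", "place"]
```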
- **Smart homes:** Hierarchical planning coordinates thermostats, lighting and security systems, while energy minimisation optimises user comfort, energy efficiency and safety simultaneously.
- **Healthcare:** JEPA encodes patient symptoms and clinical knowledge in the latent space; strict EBM cost functions minimise the hallucination risk in treatment recommendations.
- **Manufacturing:** World models simulate the physical dynamics of production lines, and energy-based evaluation of partial trajectories prevents machine failures before they occur.
- **Finance:** Portfolio rebalancing is simulated as minimising financial risk across different temporal forecast horizons, evaluated through energy functions.
The escalating dispute over the future of AI architectures manifested in tectonic market shifts in late 2025 and early 2026. While Meta, Google and OpenAI continued to invest billions in giant data centres for massive LLMs, Yann LeCun left Meta to commercialise his theory of world models outside rigid corporate structures.
In March 2026, LeCun founded Advanced Machine Intelligence (AMI) Labs, headquartered in Paris. Co-funded by Cathay Innovation, Greycroft, Hiro Capital, HV Capital and Bezos Expeditions, AMI Labs positions itself as a conceptual and technological counterpoint to LLM-focused hyperscalers. The premise is unambiguous: autoregressive language models will never reach the level of human intelligence. True AGI requires world models that plan in latent space and understand physical causality.
LeCun is deliberately establishing a decentralised global network to access the talent pool outside the monopolistic structures of Silicon Valley. For the European technology sector, which had often fallen behind in generative foundation models, the Paris headquarters represents an opportunity to gain technological independence in the post-LLM era. The focus of AMI Labs on sectors where reliability, controllability and safety are critical prerequisites also aligns closely with the EU AI Act's requirements for high-risk AI applications.
The analysis of the Joint Embedding Predictive Architecture marks a meaningful turning point in AI research. The generative AI paradigm has demonstrated its capability in processing syntactic patterns but has systematically failed at the threshold of physical world understanding.
By consistently using Energy-Based Models in the abstract latent space, JEPA addresses the core problem of noise in sensory data streams. I-JEPA proved that semantic concepts can be learned far more efficiently when the constraint of pixel reconstruction is removed. V-JEPA 2 created world models that enable robots to perform zero-shot manipulations through physical reasoning before acting. VL-JEPA 2026 dismantled the autoregressive bottleneck through selective decoding, achieving a 2.85x speedup while halving trainable parameters and reducing data requirements by a factor of 43.
The founding of AMI Labs with over a billion US dollars in starting capital signals that the market anticipates the technological saturation of LLMs. For sectors with uncompromising requirements for causal precision, error correction, hierarchical planning and hallucination resistance, JEPA-based world models currently offer the most scientifically grounded path towards advanced, autonomous machine intelligence.
JEPA is an AI architecture that makes predictions in the abstract latent representation space rather than in the data pixel space. Instead of reconstructing every pixel of an image or frame, JEPA learns the semantic meaning of state transitions. The architecture was developed by Yann LeCun as the core building block for world models and autonomous machine intelligence, forming the technical foundation for models such as I-JEPA, V-JEPA 2 and VL-JEPA.
Energy-Based Models (EBMs) define a real-valued energy function that assigns low energy to compatible state pairs and high energy to incompatible configurations. Predictions are generated by minimising this function rather than by probabilistic sampling. EBMs form the mathematical foundation of JEPA and allow modelling of multimodal data distributions without the normalisation overhead of traditional generative models, correctly representing the physical irreversibility of real-world events.
V-JEPA 2 learns physical causality, gravity, inertia and object permanence through unsupervised training on over one million hours of video data. Classical video AI models generate pixels and model surface noise. V-JEPA 2 works in the latent space and enables zero-shot transferability to robot tasks the model was never explicitly trained on, because it has built genuine physical world understanding rather than surface-level pattern matching.
VL-JEPA achieves a 2.85x inference speedup through selective decoding: the text decoder is only activated when the semantic meaning of a scene actually changes. The model has 50 percent fewer trainable parameters than comparable models (790 million) and requires 43 times less training data. Instead of generating token by token, it predicts continuous semantic embedding vectors and eliminates nearly 65 percent of redundant compute operations.
AMI Labs (Advanced Machine Intelligence Labs) was founded in March 2026 by Yann LeCun in Paris and secured seed funding of 1.03 billion US dollars at a valuation of 3.5 billion US dollars. The company focuses on world models for industrial applications, robotics and healthcare, targeting sectors where reliability and safety are non-negotiable. For Europe, the Paris headquarters represents a real opportunity to regain technological independence in the post-LLM era of AI, aligned with EU AI Act requirements.
JEPA combines three strategies: first, the target encoder receives no direct gradient updates but is updated as an exponentially moving average of the context encoder (EMA), so target representations change more slowly than context representations. Second, regularisation of the latent variable z limits information entropy and prevents the predictor from ignoring context. Third, the VICReg method maximises the variance of each embedding dimension and minimises covariances, ensuring the embedding fills the space and no collapse can occur.