Yann LeCun has proposed JEPA as an architecture that makes predictions in latent space rather than pixel space. This in-depth analysis covers the mathematical foundations, the architectural evolution and the industrial consequences, including the billion-dollar founding of AMI Labs in March 2026.
Transformer-based large language models handle syntax and text generation with impressive accuracy. What they do not do, however, is causally understand the physical reality their symbols refer to. This gap manifests in the well-known phenomenon of hallucinations. When a model operates without grounding in physical laws, it produces content that is statistically coherent but factually impossible.
Simply scaling, adding more parameters and more data, does not resolve the causality problem. Yann LeCun therefore described autoregressive language models as a conceptual dead end on the road to Advanced Machine Intelligence (AMI). His alternative: world models that learn physical causalities, simulate the consequences of hypothetical actions and plan across multiple abstraction levels.
LeCun's position paper "A Path Towards Autonomous Machine Intelligence" describes a modular, fully differentiable architecture that integrates fast reactive behaviour (System-1 cognition) with analytical planning (System-2 cognition). The six modules communicate continuously with one another:
- **Configurator:** The executive centre of the system. It dynamically modulates the parameters, attention focus and information flows of all other modules depending on the current task or goal.
- **Perception module:** Receives raw, high-dimensional sensor data (visual, acoustic, tactile) and compresses it into a low-dimensional, task-relevant estimate of the current world state.
- **World model:** The core of intelligence and the primary domain of JEPA. It infers missing information and predicts plausible future states when a specific action sequence is executed, acting as the agent's internal simulator.
- **Cost module:** Comprises a hard-wired intrinsic cost block (analogous to pain) and a trainable critic that predicts future costs based on current observations and planned actions.
- **Actor:** Generates sequences of potential actions and optimises them using optimal control theory to identify the action sequence that minimises accumulated future costs.
- **Short-term memory:** Records the immediate history of perceptions, states and actions, providing the temporal context the world model needs to extrapolate trajectories.
Because all components are differentiable, error gradients can be propagated from the cost module back through the world model to the actor. This enables planning at inference time, where the model "thinks" intensively before every physical action rather than reactively executing a learned heuristic.
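This gradient-based planning loop can be sketched in a few lines. The toy below assumes a hand-written one-dimensional world model (next state = state + action) and a quadratic cost, with an analytic gradient in place of autodiff; it is an illustration of the principle, not Meta's implementation.

```python
# Toy sketch of inference-time planning: cost gradients flow back
# through a differentiable world model to optimise the action sequence.
# The linear transition and quadratic cost are illustrative assumptions.

def world_model(state, action):
    # Differentiable transition: next state after applying the action.
    return state + action

def cost(state, goal):
    # Intrinsic cost: squared distance to the goal state.
    return (state - goal) ** 2

def plan(start, goal, horizon=3, lr=0.1, steps=200):
    """Optimise an action sequence by propagating the cost gradient
    back through the (differentiable) world model to the actor."""
    actions = [0.0] * horizon
    for _ in range(steps):
        # Roll out the trajectory under the current plan.
        states = [start]
        for a in actions:
            states.append(world_model(states[-1], a))
        # Analytic gradient of the terminal cost w.r.t. each action:
        # d/da (s_T - goal)^2 = 2 * (s_T - goal) for this linear model.
        grad = 2.0 * (states[-1] - goal)
        actions = [a - lr * grad for a in actions]
    return actions

actions = plan(start=0.0, goal=3.0)
assert abs(sum(actions) - 3.0) < 1e-3  # the optimised plan reaches the goal
```

The model "thinks" by running this optimisation before committing to any physical action, rather than emitting a reactive policy output.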
The conceptual strength of JEPA derives directly from its mathematical foundation. An Energy-Based Model (EBM) defines a real-valued scalar energy function E(x, y), parametrised by the weights of a neural network. Compatible, physically plausible state pairs receive low energy values; incompatible configurations receive high energy values.
Prediction is not formulated as probabilistic sampling but as a minimisation problem: given a context x, the system searches for the output ŷ = argmin_y E(x, y), the candidate with the lowest energy.
Recent formal analyses reveal deep connections to theoretical physics. The energy function can be understood as the expression of a quasi-metric geometry representing the infimum of accumulated local work along a feasible trajectory. Critically, this function is asymmetric: E(x, y) does not equal E(y, x).
A glass can fall from a table and shatter, which is energetically plausible (low energy). The reverse process, where shards spontaneously reassemble into an intact glass, must receive an extremely high energy value. Probabilistic cosine similarities are symmetric and fail this fundamental requirement of physics modelling entirely.
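The glass example can be made concrete with a toy energy table. The values below are hand-coded assumptions standing in for what a trained EBM would learn; the point is that prediction reduces to argmin over candidates, and that the function need not be symmetric.

```python
# Minimal illustration (hypothetical energy values) of an asymmetric
# energy function: the forward transition is plausible, the reverse not.

def energy(x, y):
    # Hand-coded toy energies; a trained EBM would learn these.
    table = {
        ("glass_on_table", "shattered_glass"): 0.1,    # plausible
        ("shattered_glass", "glass_on_table"): 100.0,  # physically absurd
    }
    return table.get((x, y), 50.0)  # default for unlisted pairs

def predict(x, candidates):
    # Prediction as energy minimisation: y_hat = argmin_y E(x, y).
    return min(candidates, key=lambda y: energy(x, y))

futures = ["shattered_glass", "floating_glass"]
assert predict("glass_on_table", futures) == "shattered_glass"
# Asymmetry: E(x, y) != E(y, x) for the irreversible transition.
assert energy("glass_on_table", "shattered_glass") != \
       energy("shattered_glass", "glass_on_table")
```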
- **No partition function:** EBMs require no normalising sum over all possible outputs, which drastically simplifies computation and avoids intractability.
- **Multimodality:** EBMs can assign deep energy valleys to disjoint regions in data space, naturally capturing multiple equally valid futures.
- **Composability:** Energy functions can be combined additively (product of experts), offering immense flexibility when building hierarchical systems.
- **Numerical stability:** The logarithm and exponential functions that are ubiquitous in probabilistic modelling often cancel in EBMs, stabilising optimisation.
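Additive composition deserves a small sketch: summing two energies multiplies the corresponding unnormalised probabilities exp(-E), which is exactly the product-of-experts construction. The two expert functions below are hypothetical examples.

```python
# Sketch of additive energy composition (product of experts), assuming
# two hand-written expert energies: E = E1 + E2 corresponds to
# p ∝ exp(-E1) * exp(-E2).

def physics_expert(y):
    # Penalises states that violate a toy "objects don't float" rule.
    return 10.0 if y == "floating" else 0.0

def task_expert(y):
    # Penalises states far from the task goal "on_shelf".
    return 0.0 if y == "on_shelf" else 1.0

def combined_energy(y):
    # Summed energies: both experts must agree for a low total energy.
    return physics_expert(y) + task_expert(y)

candidates = ["floating", "on_floor", "on_shelf"]
assert min(candidates, key=combined_energy) == "on_shelf"
```

Because each expert stays a plain scalar function, new constraints can be stacked onto a hierarchy without renormalising anything.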
JEPA is the specific architectural instantiation that trains world models through self-supervised learning (SSL) without manually annotated labels. The key shift is that the loss function is applied in the abstract embedding space rather than in the data space (pixels, voxels).
Typically a powerful Vision Transformer (ViT). Takes the observable context (past video frames or visible image regions) and projects it into a dense, continuous representation space. The compression systematically eliminates unpredictable microstructural surface details.
Structurally identical to the context encoder. Processes the target state (the future or masked image regions) to generate the target representation. Receives no direct gradient updates but is updated as an exponentially moving average of the context encoder weights.
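The EMA update itself is a one-liner. The momentum value below is an assumption for illustration; published JEPA variants use values close to 1.

```python
# Sketch of the EMA target-encoder update (momentum 0.996 is an assumed
# illustrative value). No gradients flow into the target encoder; its
# weights only drift slowly toward the context encoder's weights.

def ema_update(target_w, context_w, momentum=0.996):
    return [momentum * t + (1.0 - momentum) * c
            for t, c in zip(target_w, context_w)]

target, context = [0.0, 0.0], [1.0, 1.0]
for _ in range(3):
    target = ema_update(target, context)
# After three updates the target has barely moved: a slowly changing,
# stable prediction target for the context encoder.
assert all(0.0 < t < 0.02 for t in target)
```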
Deliberately designed as a shallow network (lightweight MLP or a few transformer layers). Its sole task is to model the internal dynamics of the state transition, transforming the context representation into a prediction of the target representation.
A critical design element is the integration of a latent variable z into the predictor. Because the world is not fully deterministic, there is often a one-to-many mapping between a current state and possible futures. If the predictor received only the context as input, it would compute the statistical average of all possible futures, producing incoherent, blurred predictions.
The latent variable z provides the missing information that determines which of the many plausible futures actually occurs. At inference time, z acts as a control vector: by sampling different values, the world model can systematically simulate a range of alternative future scenarios for the cost module to evaluate.
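A toy predictor makes the role of z tangible. The mapping below is a hand-written stand-in for a learned predictor: the same ambiguous context yields different plausible futures depending on the sampled z, instead of a single blurred average.

```python
# Toy sketch of the latent variable z, assuming a hand-written
# predictor: z selects which plausible future of an ambiguous scene
# the world model simulates.

def predictor(context, z):
    futures = {
        0: context + "_rolls_left",
        1: context + "_rolls_right",
        2: context + "_stays_put",
    }
    return futures[z]

def simulate_futures(context, z_samples):
    # At inference time, sampling different z values lets the cost
    # module evaluate a range of alternative scenarios.
    return [predictor(context, z) for z in z_samples]

scenarios = simulate_futures("ball_on_slope", [0, 1, 2])
assert len(set(scenarios)) == 3  # three distinct futures, no averaged blur
```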
The fundamental weakness of self-supervised architectures is representational collapse: the network quickly finds a trivial global minimum where all inputs map to the same constant vector. JEPA addresses this with a combination of three methods:
| Method | Mechanism | Effect in JEPA |
|---|---|---|
| Exponential Moving Average (EMA) | Target encoder weights are updated as a moving average of the context encoder weights | Target representations change more slowly than context representations, stabilising the moving target |
| Latent Variable Regularisation | Penalises the information content of z via a regularisation term | Prevents the predictor from ignoring context and extracting all information exclusively from z |
| VICReg | Explicitly maximises variance of each embedding dimension and minimises covariance between dimensions | Ensures the embedding fills the space and no information bottleneck collapse occurs |
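The VICReg variance and covariance terms from the table can be sketched directly. This is a simplified pure-Python version for a tiny batch (the real loss also includes an invariance term and runs on GPU tensors); the numbers are illustrative.

```python
# Pure-Python sketch of VICReg's variance and covariance terms for a
# tiny batch of 2-D embeddings (simplified; invariance term omitted).
import math

def vicreg_var_cov(embeddings, target_std=1.0):
    n, d = len(embeddings), len(embeddings[0])
    means = [sum(e[j] for e in embeddings) / n for j in range(d)]
    centred = [[e[j] - means[j] for j in range(d)] for e in embeddings]
    # Variance term: hinge pushing each dimension's std above a target.
    var_loss = 0.0
    for j in range(d):
        std = math.sqrt(sum(c[j] ** 2 for c in centred) / (n - 1))
        var_loss += max(0.0, target_std - std)
    # Covariance term: squared off-diagonal covariances, decorrelating
    # embedding dimensions so information spreads across the space.
    cov_loss = 0.0
    for j in range(d):
        for k in range(d):
            if j != k:
                cov = sum(c[j] * c[k] for c in centred) / (n - 1)
                cov_loss += cov ** 2
    return var_loss / d, cov_loss / d

# A collapsed batch (all embeddings identical) is maximally penalised:
collapsed = [[0.5, 0.5]] * 4
var_loss, _ = vicreg_var_cov(collapsed)
assert abs(var_loss - 1.0) < 1e-9  # std is 0, hinge at its maximum
```

The hinge makes the trivial constant solution the most expensive one, which is exactly why the collapse minimum disappears.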
The versatility of the JEPA approach was evident in the rapid development of modality-specific variants between 2023 and 2026. Each evolutionary stage addressed specific sensory modalities and architectural bottlenecks.
The Image-based Joint Embedding Predictive Architecture was the first large-scale proof of concept. The challenge: forcing the model to learn global semantic concepts (a dog) without overfitting to local pixel textures (a specific colour variation in the fur).
I-JEPA resolves this through a multi-block masking strategy. A substantial area of the image is left unmasked as the "context block" and fed to the context encoder. Simultaneously, multiple "target blocks" at completely different locations are fully masked. The predictor must infer the semantics of these missing target blocks purely from the unmasked context, since the prediction happens in the abstract embedding space and focuses on large-scale causalities.
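The masking geometry can be sketched on a patch grid. Block sizes, counts and the 14x14 grid below are illustrative assumptions, not the paper's exact sampling distribution; the essential property shown is that target patches are removed from the context so nothing leaks.

```python
# Rough sketch of I-JEPA-style multi-block masking on a patch grid
# (grid size, block shapes and counts are illustrative assumptions).
import random

def sample_block(grid, h, w):
    # Pick the top-left corner of an h x w block of patch indices.
    top = random.randrange(grid - h + 1)
    left = random.randrange(grid - w + 1)
    return {(top + i, left + j) for i in range(h) for j in range(w)}

def multi_block_mask(grid=14, n_targets=4, seed=0):
    random.seed(seed)
    # Several smaller target blocks are masked out...
    targets = [sample_block(grid, 3, 3) for _ in range(n_targets)]
    # ...and one large context block is sampled, minus any target
    # patches, so the predictor can only infer the targets from
    # non-overlapping context.
    context = sample_block(grid, 10, 10) - set().union(*targets)
    return context, targets

context, targets = multi_block_mask()
for block in targets:
    assert context.isdisjoint(block)  # no leakage from target to context
```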
Efficiency gain: Pre-training a ViT-H/14 on ImageNet-1K required fewer than 1,200 GPU hours, more than 2.5 times faster than iBOT and over ten times more efficient than Masked Autoencoders (MAE).
Empirical research showed that the EMA mechanism does not prevent complete model collapse at every architectural configuration. C-JEPA (Contrastive-JEPA) integrates the VICReg strategy into the JEPA framework. By explicitly controlling the variance and covariance of embedding vectors across batches, C-JEPA achieves significantly faster convergence and higher performance on both linear probing and fine-tuning on ImageNet-1K.
The transition to video processing marked another shift: models had to learn not only spatial arrangements but also temporal causality, Newtonian mechanics and object permanence. V-JEPA applies the masking principle to the temporal dimension.
V-JEPA 2 (1.2 billion parameters) was pre-trained unsupervised on over one million hours of diverse video data and one million static images. The model learned an implicit understanding of physical reality: gravity, inertia, spatial occlusion and object manipulation.
The variant V-JEPA 2-AC (Action-Conditioned) was turned into a robotic world model through minimal fine-tuning on robot interaction data. The key breakthrough is zero-shot transferability: robots solved unknown manipulation tasks in unfamiliar environments without extensive retraining, because the world model enabled "thinking before acting", mentally simulating kinematic actions before activating the servomotors.
Three benchmarks released alongside V-JEPA 2 probe this physical understanding:

- **IntPhys 2:** Evaluates the ability to distinguish between physically realistic and impossible video scenarios, testing understanding of fundamental laws of nature.
- **MVPBench (Minimal Video Pairs):** Video question answering over minimally different video pairs, constructed to reveal whether models exploit statistical correlations (dataset shortcuts) or possess genuine physical understanding.
- **CausalVQA:** Focuses explicitly on cause-and-effect reasoning, anticipation of future events and counterfactual thinking (what would have happened if X had not occurred?).
The Vision-Language Joint Embedding Predictive Architecture (VL-JEPA) directly addresses the biggest criticism of established vision-language models such as GPT-4V or LLaVA: the massive inefficiencies of token generation. Classical VLMs convert visual inputs into embeddings, concatenate them with text queries and feed them into a language model that generates the answer token by token. Describing a simple 30-second video can require over 50 sequential forward passes through a massive LLM.
| Dimension | Classical Autoregressive VLMs | VL-JEPA (2026) |
|---|---|---|
| Architecture backbone | CLIP + LLM decoder | V-JEPA 2 (vision) + Llama-3 layers (predictor) |
| Learning objective | Reconstruction of text tokens | Prediction of continuous text embeddings |
| Inference latency | Extreme bottleneck from token-by-token generation | 2.85x speedup through selective decoding |
| Trainable parameters | Typically over 7 billion | 790 million (50 percent fewer than comparable models) |
| Data efficiency | High demand for text-image pairs | 43x more efficient (2.0bn samples vs. 86bn for the Perception Encoder) |
When VL-JEPA monitors a video feed, it produces a continuous stream of target embeddings. If the semantics of a scene do not change across dozens of frames (a stationary glass), the variance of the embeddings remains extremely low and the text decoder stays inactive. Only when a semantic break occurs (the glass tips over and spills water) does the embedding variance spike and the decoder is triggered for a single pass. This non-generative nature eliminates nearly 65 percent of redundant compute operations.
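The trigger logic can be sketched as a simple threshold on the change between consecutive frame embeddings. The threshold and the squared-distance metric below are assumptions for illustration, not VL-JEPA's published configuration.

```python
# Sketch of variance-triggered selective decoding (threshold and change
# metric are illustrative assumptions, not the published configuration).

def embedding_change(prev, curr):
    # Squared L2 distance between consecutive frame embeddings.
    return sum((p - c) ** 2 for p, c in zip(prev, curr))

def selective_decode(embeddings, threshold=0.5):
    """Run the (expensive) text decoder only when scene semantics
    shift; static stretches of video leave the decoder inactive."""
    decoded_at = []
    for t in range(1, len(embeddings)):
        if embedding_change(embeddings[t - 1], embeddings[t]) > threshold:
            decoded_at.append(t)  # placeholder for one decoder pass
    return decoded_at

# Static scene for 5 frames, then the glass tips over at frame 5:
stream = [[0.0, 0.0]] * 5 + [[3.0, 0.0]] * 5
assert selective_decode(stream) == [5]  # one decoder call for the clip
```

An autoregressive VLM would have paid for a full generation pass on every frame of the same static scene.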
Using a pure squared error (L2 loss) to minimise the distance between predicted and true text embeddings produced "blurred" representations, since L2 tends to pull towards the statistical mean across multimodal targets. Meta instead implemented contrastive InfoNCE loss functions and modified cosine distances in the latent space. Results were clear: InfoNCE improved VQA accuracy by 9.8 points and retrieval recall@1 by 18.6 points compared to L2, confirming that strong geometric alignment of embedding spaces is essential for complex reasoning tasks.
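An InfoNCE-style loss can be sketched in pure Python for a tiny batch. The temperature and dot-product similarity below are common defaults assumed for illustration; each predicted embedding is attracted to its own target and repelled from every other target in the batch, which is what prevents the mean-seeking behaviour of L2.

```python
# Pure-Python sketch of an InfoNCE-style contrastive loss over a tiny
# batch (temperature and similarity function are assumed defaults).
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(predicted, targets, temperature=0.1):
    """Cross-entropy where row i's positive is target i and every
    other target in the batch serves as a negative."""
    loss = 0.0
    for i, p in enumerate(predicted):
        logits = [dot(p, t) / temperature for t in targets]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)
    return loss / len(predicted)

aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
# A geometrically aligned batch scores far lower than a misaligned one:
assert info_nce(aligned, aligned) < info_nce(aligned, shuffled)
```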
The Text-Image Joint Embedding Predictive Architecture (TI-JEPA) focuses explicitly on the problem of cross-modal alignment, particularly for complex tasks such as multimodal sentiment analysis. The gap between the syntactic structure of text and the spatial arrangement of pixels is a significant challenge.
TI-JEPA integrates elaborate cross-attention mechanisms into the energy-based framework. The architecture freezes pre-trained text and image encoders to preserve their learned feature knowledge and prevent energy collapse. The freed compute capacity is used exclusively to optimise the cross-attention modules, which map cross-modal dependencies and generate a robust multimodal representation.
The ultimate goal of LeCun's vision is not to classify videos or texts but to create autonomous agents that orchestrate complex, multi-step action sequences in the physical and digital world. This is where the concept of Hierarchical JEPA (H-JEPA) applies.
Human action planning is intrinsically hierarchical: crossing a road involves abstract goals ("reach the other side"), intermediate planning levels ("wait for the green light") and microscopic control mechanisms ("contract the quadriceps"). H-JEPAs stack multiple JEPA modules on top of each other, with lower levels modelling fine-grained time steps and upper levels interpolating causalities across long time horizons.
Unlike LLMs, which can catastrophically derail when a step in the generation chain fails, EBM-based energy minimisation allows the evaluation of partial trajectories. If the energy of an intermediate state exceeds a threshold, the system initiates a course correction before the final action is executed.
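The partial-trajectory check can be sketched as an execution loop with an energy threshold. The state names, energy values and recovery plan below are hypothetical; a real H-JEPA would score latent states with its learned cost modules.

```python
# Sketch of partial-trajectory evaluation with course correction
# (toy state names and energy values; all quantities are hypothetical).

def execute_with_correction(trajectory, energy_of, replan, threshold=5.0):
    """Walk a planned trajectory, scoring each intermediate state; if a
    state's energy exceeds the threshold, replan from that point rather
    than derailing like a failed autoregressive generation chain."""
    executed = []
    for state in trajectory:
        if energy_of(state) > threshold:
            return executed + replan(state)  # course correction mid-plan
        executed.append(state)
    return executed

energies = {"pick": 1.0, "carry": 2.0, "drop_zone_blocked": 9.0, "place": 1.0}
plan = ["pick", "carry", "drop_zone_blocked", "place"]
result = execute_with_correction(
    plan,
    energy_of=lambda s: energies[s],
    replan=lambda s: ["reroute", "place"],  # hypothetical recovery plan
)
assert result == ["pick", "carry", "reroute", "place"]
```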
- **Smart homes:** Hierarchical planning coordinates thermostats, lighting and security systems, while energy minimisation optimises user comfort, energy efficiency and safety simultaneously.
- **Healthcare:** JEPA encodes patient symptoms and clinical knowledge in the latent space; strict EBM cost functions minimise the hallucination risk in treatment recommendations.
- **Manufacturing:** World models simulate the physical dynamics of production lines, and energy-based evaluation of partial trajectories prevents machine failures before they occur.
- **Finance:** Portfolio rebalancing is simulated as minimising financial risk across different temporal forecast horizons, evaluated through energy functions.
The escalating dispute over the future of AI architectures manifested in tectonic market shifts in late 2025 and early 2026. While Meta, Google and OpenAI continued to invest billions in giant data centres for massive LLMs, Yann LeCun left Meta to commercialise his theory of world models outside rigid corporate structures.
In March 2026, LeCun founded Advanced Machine Intelligence (AMI) Labs, headquartered in Paris. Co-funded by Cathay Innovation, Greycroft, Hiro Capital, HV Capital and Bezos Expeditions, AMI Labs positions itself as a conceptual and technological counterpoint to LLM-focused hyperscalers. The premise is unambiguous: autoregressive language models will never reach the level of human intelligence. True AGI requires world models that plan in latent space and understand physical causality.
LeCun is deliberately establishing a decentralised global network to access the talent pool outside the monopolistic structures of Silicon Valley. For the European technology sector, which had often fallen behind in generative foundation models, the Paris headquarters represents an opportunity to gain technological independence in the post-LLM era. The focus of AMI Labs on sectors where reliability, controllability and safety are critical prerequisites also aligns closely with the EU AI Act's requirements for high-risk AI applications.
The analysis of the Joint Embedding Predictive Architecture marks a meaningful turning point in AI research. The generative AI paradigm has demonstrated its capability in processing syntactic patterns but has systematically failed at the threshold of physical world understanding.
By consistently using Energy-Based Models in the abstract latent space, JEPA addresses the core problem of noise in sensory data streams. I-JEPA proved that semantic concepts can be learned far more efficiently when the constraint of pixel reconstruction is removed. V-JEPA 2 created world models that enable robots to perform zero-shot manipulations through physical reasoning before acting. VL-JEPA 2026 dismantled the autoregressive bottleneck through selective decoding, achieving a 2.85x speedup while halving trainable parameters and reducing data requirements by a factor of 43.
The founding of AMI Labs with over a billion US dollars in starting capital signals that the market anticipates the technological saturation of LLMs. For sectors with uncompromising requirements for causal precision, error correction, hierarchical planning and hallucination resistance, JEPA-based world models currently offer the most scientifically grounded path towards advanced, autonomous machine intelligence.
JEPA is an AI architecture that makes predictions in the abstract latent representation space rather than in the data pixel space. Instead of reconstructing every pixel of an image or frame, JEPA learns the semantic meaning of state transitions. The architecture was developed by Yann LeCun as the core building block for world models and autonomous machine intelligence, forming the technical foundation for models such as I-JEPA, V-JEPA 2 and VL-JEPA.
Energy-Based Models (EBMs) define a real-valued energy function that assigns low energy to compatible state pairs and high energy to incompatible configurations. Predictions are generated by minimising this function rather than by probabilistic sampling. EBMs form the mathematical foundation of JEPA and allow modelling of multimodal data distributions without the normalisation overhead of traditional generative models, correctly representing the physical irreversibility of real-world events.
V-JEPA 2 learns physical causality, gravity, inertia and object permanence through unsupervised training on over one million hours of video data. Classical video AI models generate pixels and model surface noise. V-JEPA 2 works in the latent space and enables zero-shot transferability to robot tasks the model was never explicitly trained on, because it has built genuine physical world understanding rather than surface-level pattern matching.
VL-JEPA achieves a 2.85x inference speedup through selective decoding: the text decoder is only activated when the semantic meaning of a scene actually changes. The model has 50 percent fewer trainable parameters than comparable models (790 million) and requires 43 times less training data. Instead of generating token by token, it predicts continuous semantic embedding vectors and eliminates nearly 65 percent of redundant compute operations.
AMI Labs (Advanced Machine Intelligence Labs) was founded in March 2026 by Yann LeCun in Paris and secured seed funding of 1.03 billion US dollars at a valuation of 3.5 billion US dollars. The company focuses on world models for industrial applications, robotics and healthcare, targeting sectors where reliability and safety are non-negotiable. For Europe, the Paris headquarters represents a real opportunity to regain technological independence in the post-LLM era of AI, aligned with EU AI Act requirements.
JEPA combines three strategies: first, the target encoder receives no direct gradient updates but is updated as an exponentially moving average of the context encoder (EMA), so target representations change more slowly than context representations. Second, regularisation of the latent variable z limits information entropy and prevents the predictor from ignoring context. Third, the VICReg method maximises the variance of each embedding dimension and minimises covariances, ensuring the embedding fills the space and no collapse can occur.