I-JEPA: Advancing Human-Like AI Through Predictive World Models

As the field of AI strives toward more human-like intelligence, researchers are looking beyond today’s dominant large language models (LLMs). While powerful, autoregressive LLMs—trained to predict the next word or token—face fundamental challenges in reasoning, factuality, and efficiency. In contrast, a new model from Meta AI, called I-JEPA (Image-based Joint Embedding Predictive Architecture), takes a different approach, inspired by Chief AI Scientist Yann LeCun’s vision. I-JEPA aims to learn internal “world models” that capture common-sense knowledge and enable more flexible, efficient, and semantically grounded reasoning.

Why Move Beyond Autoregressive LLMs?

Yann LeCun has argued that LLMs represent a detour from building truly intelligent systems. Autoregressive text generation tends to accumulate errors and struggles to guarantee factuality. Models that must produce their entire reasoning process as output are also computationally inefficient; humans, by contrast, reason silently and share only their conclusions. Furthermore, LLMs' Transformer-based architectures spend the same computational effort on every token, whether that token is easy to predict or requires complex reasoning.

For AI to become more robust and adaptive, LeCun suggests that models need to develop internal representations—world models—that understand how the world works. Such models would reason internally, guide decision-making, and produce results more efficiently, just as humans do when they think through a problem before speaking.

Figure: The Image-based Joint-Embedding Predictive Architecture (I-JEPA) uses a single context block to predict the representations of several target blocks originating from the same image. The context encoder is a Vision Transformer (ViT) that processes only the visible context patches. The predictor is a narrow ViT that takes the context encoder's output and, conditioned on positional tokens for the target, predicts the representation of a target block at a specific location. The target representations are the outputs of the target encoder, whose weights are updated at each iteration as an exponential moving average of the context-encoder weights.
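To make the last sentence of the caption concrete, below is a minimal sketch of how such an exponential-moving-average update could be written in PyTorch. The function name, argument names, and the momentum value are illustrative assumptions, not the official I-JEPA code:

```python
# Hypothetical sketch of the EMA rule described in the caption above:
# the target encoder is never trained by gradients; its weights simply
# track the context encoder's weights with momentum.
import torch

@torch.no_grad()
def ema_update(target_encoder, context_encoder, momentum=0.996):
    """Nudge each target-encoder parameter toward its context-encoder counterpart."""
    for p_tgt, p_ctx in zip(target_encoder.parameters(),
                            context_encoder.parameters()):
        p_tgt.mul_(momentum).add_(p_ctx, alpha=1.0 - momentum)
```

Calling something like ema_update(target_encoder, context_encoder) once per iteration keeps the prediction targets stable while still letting them improve slowly as the context encoder learns.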

A New Approach: I-JEPA

I-JEPA implements a key piece of this broader vision. Instead of predicting pixels or next tokens directly, I-JEPA focuses on learning high-level, semantic representations of images without relying on hand-crafted data augmentations. The approach is built on a Joint Embedding Predictive Architecture (JEPA):

  1. Predictive Learning in Embedding Space: Rather than reconstructing raw pixels for masked regions (as in generative masked autoencoders), I-JEPA predicts abstract, high-level representations of these regions. This is crucial: by applying the loss in embedding space rather than pixel space, the model discards low-level, irrelevant details and focuses on semantic concepts like object parts, poses, and relationships.
  2. Context and Target Blocks: I-JEPA splits an image into patches and samples several large “target blocks” that are masked out. Another large “context block” provides partial but informative visual input. The model then predicts the representations of these missing target blocks from the visible context. Crucially, these target blocks are sizable and semantically meaningful, guiding the model to capture higher-level information rather than trivial details.
  3. A Predictor Network for Structured Understanding: The I-JEPA architecture includes a context encoder, a target encoder, and a predictor. The target encoder processes the full image once, and targets are formed by masking its output representations. The predictor is then conditioned on positional embeddings corresponding to the missing target blocks, modeling spatial uncertainty and reasoning about what should appear in those regions. This setup helps I-JEPA develop a primitive world model, enabling it to form coherent guesses about unseen object parts (a minimal training-step sketch follows this list).
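To tie these three pieces together, here is a minimal, hedged sketch of what one training step could look like in PyTorch. The module interfaces (context_encoder, target_encoder, predictor), the sample_masks helper, and the use of a mean-squared-error loss between embeddings are assumptions made for illustration; they are not the official I-JEPA implementation:

```python
# Illustrative JEPA-style training step. All interfaces here (sample_masks,
# patch_mask=..., target_positions=...) are hypothetical stand-ins, but the
# structure mirrors the components described in the list above.
import torch
import torch.nn.functional as F

def jepa_training_step(images, context_encoder, target_encoder, predictor,
                       sample_masks, optimizer, momentum=0.996):
    # Sample one context block and several target blocks as boolean patch masks.
    context_mask, target_masks = sample_masks(images)

    # Target representations: the target encoder sees the full image once;
    # each target is the subset of its output patches inside a target block.
    with torch.no_grad():
        full_repr = target_encoder(images)                  # [B, num_patches, D]
        targets = [full_repr[:, m] for m in target_masks]

    # Context representation: the context encoder processes only visible patches.
    context_repr = context_encoder(images, patch_mask=context_mask)

    # Predict each target block in embedding space, conditioned on the positions
    # of its masked patches, and compare predictions with the target embeddings.
    loss = 0.0
    for mask, target in zip(target_masks, targets):
        pred = predictor(context_repr, target_positions=mask)
        loss = loss + F.mse_loss(pred, target)

    # Gradient step updates the context encoder and the predictor only.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The target encoder follows via EMA, as in the earlier ema_update sketch.
    with torch.no_grad():
        for p_tgt, p_ctx in zip(target_encoder.parameters(),
                                context_encoder.parameters()):
            p_tgt.mul_(momentum).add_(p_ctx, alpha=1.0 - momentum)

    return loss.detach()
```

Because the loss compares embeddings rather than pixels, low-level appearance details never enter the objective, which is exactly the point of item 1 above.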

Strong Performance and Efficiency

I-JEPA has demonstrated impressive results:

  • Computational Efficiency: Training I-JEPA is far more efficient than many alternatives. For instance, a 632M-parameter I-JEPA vision transformer (ViT) model can be trained with 16 A100 GPUs in under 72 hours. This is significantly faster than other widely used computer vision models that may require two to ten times more GPU-hours.
  • Robust Semantic Representations: I-JEPA achieves strong off-the-shelf results on multiple tasks, outperforming pixel-reconstruction methods and matching or surpassing methods that rely on hand-crafted view augmentations. For example, on low-shot ImageNet classification (using just 1% of the labels), I-JEPA's representations excel. This indicates that the model captures semantic, high-level concepts in a way that supports efficient adaptation to new tasks (see the probe sketch after this list).
  • Versatility Across Tasks: While many self-supervised methods skew either toward global semantic features or fine-grained local details, I-JEPA manages both. It can handle semantic classification tasks as well as object counting and depth prediction, surpassing invariance-based methods on these more local tasks. This balance underscores I-JEPA’s capacity to learn features that are truly general-purpose.
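As a concrete example of what evaluating frozen representations typically looks like, the sketch below trains only a linear classifier on top of average-pooled features, a common "off-the-shelf" protocol. The encoder interface, embedding dimension, and hyperparameters are assumptions for illustration and may differ from the paper's exact evaluation setup:

```python
# Hypothetical linear-probe evaluation on frozen features; the encoder
# interface and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

def linear_probe(encoder, train_loader, num_classes, embed_dim,
                 epochs=10, lr=1e-3, device="cuda"):
    encoder.eval()                                       # backbone stays frozen
    head = nn.Linear(embed_dim, num_classes).to(device)
    optimizer = torch.optim.AdamW(head.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = encoder(images).mean(dim=1)      # average-pool patch tokens
            loss = criterion(head(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return head
```

Only the small linear head is trained here, so strong accuracy in this setting indicates that the frozen features themselves already encode the relevant semantics.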

Visualizing the Internal World Model

Qualitative analyses confirm that I-JEPA understands the spatial and semantic layout of scenes. When its internal representations are mapped back to pixel space using a specialized decoder, the model predicts high-level image components, like the back of a dog's head or the correct pose of a building, without overfitting to low-level minutiae. It captures the essence of what is missing, demonstrating a form of structured reasoning rather than mere pattern completion.

Toward Richer Modalities and General Intelligence

I-JEPA is one building block toward the vision LeCun outlined: learning world models that can perform planning, reasoning, and prediction without exhaustive labeled data. Future work includes extending the JEPA framework to richer domains, such as video, audio, or text, to predict long-range spatial and temporal events. With such advancements, AI could gain a deeper, more human-like understanding of the world, adapting to novel situations and mastering new tasks with minimal supervision.

Conclusion

By predicting abstract, high-level representations rather than raw inputs, I-JEPA steps closer to human-like intelligence. It avoids the limitations of purely autoregressive models, learns efficient and semantic features, and generalizes broadly to diverse tasks. While it does not yet fulfill the entire vision of autonomous machine intelligence, I-JEPA clearly illustrates how joint-embedding predictive architectures can serve as a critical stepping stone on the path toward AI that thinks before it speaks.

Paper: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

Models: I-JEPA on Hugging Face


