Emergent Reality Simulation with OpenAI Sora Text-to-Video
Generated with Sora

Looking past the amazing visuals Sora teased us with earlier this week, there is a groundbreaking, nascent capability to simulate reality. Objects interact as they should in the physical world, meaning Sora has a generalized understanding of some aspects of how the world works. It's not yet perfect, but it does offer a tantalizing preview of what is to come.

Why This is Such a Big Deal

Prior to generative AI, computer imagery required recreating the world through 3D modeling. This is an extremely labor-intensive process, as each object in a scene must be placed there by an artist, and it is computationally intensive because the behavior of light and fluids has to be brute-force calculated by the computer.

Sora does not run fluid dynamics to simulate waves or ray-tracing to simulate light refraction. It has learned an intuitive sense of how those things behave, and it uses that understanding to create beautiful scenes.

Computer-generated graphics have never quite matched reality, which leads to the uncanny valley effect when trying to recreate human faces. Sora-generated faces, by contrast, are indistinguishable from reality.

Combining Diffusion and Transformers

Prior to Sora, image and video generation AI models relied almost exclusively on a process called diffusion. The algorithm starts with a random noise image and then, through a series of learned steps, gradually removes the noise while adding detail, transforming the noise into a coherent image that matches a given prompt or specification. This approach has proven remarkably effective for generating high-quality images and very short video clips with little movement.
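
To make this concrete, here is a minimal Python sketch of a standard DDPM-style reverse diffusion loop. It illustrates the general technique rather than Sora's actual code, and the `model` argument is a hypothetical network assumed to predict the noise present in the image at each step.

```python
# Minimal sketch of reverse diffusion sampling (DDPM-style), for illustration only.
import torch

def sample(model, shape, num_steps=50, device="cpu"):
    """Generate an image by starting from pure noise and iteratively denoising it.

    `model(x, t)` is assumed to predict the noise contained in x at step t.
    """
    x = torch.randn(shape, device=device)           # start from random noise
    betas = torch.linspace(1e-4, 0.02, num_steps)   # simple linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    for t in reversed(range(num_steps)):
        predicted_noise = model(x, t)               # learned denoising step
        # Remove the predicted noise (simplified DDPM update rule).
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * predicted_noise) \
            / torch.sqrt(alphas[t])
        if t > 0:
            # Re-inject a small amount of noise on all but the final step.
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x
```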

Sora diverges from these traditional approaches through its ability to generate high-fidelity videos up to a minute long, across a range of durations, resolutions, and aspect ratios.

Sora combines the diffusion architecture of image generation models with the transformer architecture of large language models. In an LLM, text is represented as chunks called tokens. In Sora, visual data is broken up into spacetime segments called patches. However, Sora doesn't create videos by guessing what comes next frame by frame. Instead, it works in a latent space of these abstract sketches (patches) to plan out the video and then translates that plan into the actual video frames. Working in latent space lets Sora plan better because it is not bogged down by the details from the start.
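
As a rough illustration of what spacetime patches look like in practice (my own sketch, not OpenAI's implementation), here is how a latent video tensor can be carved into token-like patches for a transformer. The patch sizes and tensor shapes below are arbitrary choices for the example.

```python
# Illustrative only: split a latent video into flattened "spacetime patches".
import torch

def to_spacetime_patches(latent_video, patch_t=2, patch_h=4, patch_w=4):
    """Split a latent video of shape (frames, channels, height, width)
    into a sequence of flattened spacetime patches."""
    f, c, h, w = latent_video.shape
    blocks = latent_video.reshape(
        f // patch_t, patch_t,
        c,
        h // patch_h, patch_h,
        w // patch_w, patch_w,
    )
    # Group each (patch_t x patch_h x patch_w) block into one token-like vector.
    blocks = blocks.permute(0, 3, 5, 1, 2, 4, 6)
    return blocks.reshape(-1, patch_t * c * patch_h * patch_w)

# Example: a 16-frame latent clip becomes a sequence of 512 patch "tokens".
latent = torch.randn(16, 8, 32, 32)
tokens = to_spacetime_patches(latent)
print(tokens.shape)  # torch.Size([512, 256])
```

Each row of the resulting matrix plays the role a text token plays in an LLM: a small, self-contained chunk of the video that the transformer can attend over.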

Sora also uses highly descriptive synthetic text captions for each frame, which helps it maintain long-range coherence and temporal consistency in its subjects.
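
The recaptioning idea can be sketched as follows. `VisionCaptioner` and `describe` are hypothetical stand-ins, since OpenAI has not published the details of its captioning pipeline; the point is simply that rich per-frame captions get merged into one consistent description for training.

```python
# Hypothetical sketch of synthetic recaptioning, not OpenAI's actual pipeline.
from typing import List, Protocol

class VisionCaptioner(Protocol):
    def describe(self, frame) -> str: ...

def build_synthetic_caption(frames: List, captioner: VisionCaptioner) -> str:
    """Caption each frame in rich detail, then merge the results so the video
    model trains against one consistent description of the clip."""
    per_frame = [captioner.describe(frame) for frame in frames]
    return " ".join(per_frame)
```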

Limitations

Sora's understanding of physics is fragile. While it mostly exhibits object permanence, it doesn't fully grasp cause and effect, leading to some reality-bending visuals, as seen below.


It also struggles with some specific interactions, like glass shattering.

The Future

These are early first steps; think of this as a GPT-2 or GPT-3 moment, but you can already begin to extrapolate what Sora will be capable of once it reaches GPT-4 levels.
