
At the beginning of 2024, Sora emerged out of nowhere, shocking the world. Sora boldly claims to be “a video generation model that simulates the world.” Some pessimistically predict that many traditional fields might be upended, with computer graphics, short videos, and film entertainment being among the most vulnerable. As OpenAI has revealed more technical details, many videos generated by Sora that exhibit physical paradoxes have spread online.

Here, I offer an explanation of the current technical shortcomings in Sora’s approach based on perspectives from modern mathematics, particularly from the field of global differential geometry. My hope is to offer a modest contribution that sparks further ideas, thereby broadening the thinking of AI researchers and engineers and promoting further advancements. In this explanation, I mainly use manifold embedding theory, catastrophe theory (critical state theory), characteristic class theory of fiber bundles, the heat diffusion equation, and the regularity theory of optimal transport equations (Monge–Ampère equation).

The Manifold Distribution Principle

In the field of deep learning, a natural dataset is regarded as a probability distribution on a manifold. This is known as the manifold distribution principle. We regard an observed sample as a point in the original data space. A large number of samples form a dense point cloud in the original data space that lies near a certain low-dimensional manifold; this manifold is called the data manifold. The distribution of the point cloud on the data manifold is not uniform but follows specific distribution rules, which are represented as a data probability distribution.

Naturally, we then ask the following questions: 1. Why is the data point cloud low-dimensional rather than filling the entire original data space? 2. Why is the collection of points a manifold, i.e., locally continuous and smooth?

The answer to the first question is: because natural phenomena adhere to numerous natural laws, these constraints reduce the dimensionality of the data sample point cloud, making it impossible to fill the entire space. For example, consider the dataset consisting of all natural human face photos. Each sample is an image, and the number of pixels multiplied by 3 gives the dimension of the original image space. Any point in the original image space represents an image, but only very few images are human face images that fall on the face image manifold. Therefore, the face image manifold cannot fill the entire original image space.

Human faces need to satisfy many natural physiological rules. Each rule reduces the dimensionality of the data manifold. For example, bilateral symmetry almost halves the number of independent pixels; the presence of five facial features with fixed geometric and textural regions, and the similar shapes of these features with few parameters, further reduces the dimensionality. Ultimately, the genes controlling human faces are very limited, so the dimensionality of the face image manifold is far lower than the number of image pixels.

Similarly, consider the steady-state temperature distribution over a planar region. According to the physical theory of heat diffusion, a steady-state temperature function satisfies the classical Laplace equation and is uniquely determined by its boundary values. If we have n sampling points inside the region and m sampling points on the boundary, then each observed temperature function is represented as a vector of dimension n + m, i.e., the original data space has dimension n + m; however, the actual manifold’s dimension is that of the boundary function, which is m. This shows that the data manifold formed by observation samples that satisfy physical laws is far lower in dimensionality than the original data space.
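The dimension count in the temperature example can be checked numerically. The following is a minimal sketch, assuming a square 10-by-10 grid (64 interior and 36 boundary points) and the standard five-point discretization of the Laplace equation; the function name laplace_samples and the random boundary values are illustrative choices, not part of the original argument.

```python
import numpy as np

def laplace_samples(N=10, num_samples=200, seed=0):
    """Sample discrete steady-state temperature fields on an N-by-N grid.

    Each sample is a vector holding all interior and boundary values.
    Interior values are obtained by solving the five-point discrete
    Laplace equation for randomly chosen boundary values.
    """
    rng = np.random.default_rng(seed)
    is_boundary = np.zeros((N, N), dtype=bool)
    is_boundary[0, :] = is_boundary[-1, :] = True
    is_boundary[:, 0] = is_boundary[:, -1] = True

    # Number interior and boundary points separately.
    n_int = int((~is_boundary).sum())
    n_bnd = int(is_boundary.sum())
    index = np.zeros((N, N), dtype=int)
    index[~is_boundary] = np.arange(n_int)
    index[is_boundary] = np.arange(n_bnd)

    # Assemble A u_int = B u_bnd from 4*u_ij - (sum of 4 neighbors) = 0.
    A = np.zeros((n_int, n_int))
    B = np.zeros((n_int, n_bnd))
    for i in range(1, N - 1):
        for j in range(1, N - 1):
            row = index[i, j]
            A[row, row] = 4.0
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                p, q = i + di, j + dj
                if is_boundary[p, q]:
                    B[row, index[p, q]] += 1.0
                else:
                    A[row, index[p, q]] -= 1.0

    boundary = rng.normal(size=(num_samples, n_bnd))
    interior = np.linalg.solve(A, B @ boundary.T).T
    samples = np.hstack([interior, boundary])
    return samples, n_int, n_bnd

samples, n_int, n_bnd = laplace_samples()
print("ambient dimension:", samples.shape[1])
print("boundary points:  ", n_bnd)
print("rank of samples:  ", np.linalg.matrix_rank(samples))
```

Because the interior values are a linear function of the boundary values, the samples should span a subspace whose dimension equals the number of boundary points, far below the ambient dimension. In this linear toy case the data manifold is simply a linear subspace.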

The answer to the second question is: in most cases, physical systems are well-posed, but in critical states, physical systems undergo sudden changes (described by catastrophe theory or critical state theory). Physical laws are mostly described by systems of partial differential equations. The solutions of these equations are controlled by the initial and boundary values. The system being well-posed means that, due to physical constraints such as energy conservation, mass conservation, and the speed limit of energy transfer (less than the speed of light), the solution changes gradually as the initial and boundary values change gradually. In the regularity theory of partial differential equations, this means that the Sobolev norm of the boundary values controls the Sobolev norm of the solution. We can regard the solution as a point on the data manifold and the boundary values as its corresponding local coordinates (that is, the corresponding latent feature vector in the latent space).

The mapping from the data manifold to the latent space is called the encoding map, and the mapping from the latent space to the data manifold is called the decoding map. The regularity theory ensures that both the encoding and decoding maps are continuous, even smooth, and the uniqueness of the solution ensures that these maps are topologically or differentiably equivalent. The boundary values can be arbitrarily perturbed locally, meaning that the latent variables have an open Euclidean disc neighborhood. This implies that the observed samples that satisfy specific physical laws form a data manifold.

Fig. 1. Sora encodes videos into the latent space, then segments them into spatiotemporal patches, which are called time-space tokens. (openai.com)

As shown in Fig. 1, Sora’s training set is composed of short videos, with each sample being a short video. Similar short videos form a data manifold. Sora encodes them into a latent space for dimensionality reduction, and then in the latent space, it segments the latent feature vectors into patches, which, with the addition of temporal order, form spatiotemporal patches, or time-space tokens. The concept of spatiotemporal information is critical here, as each token records both the temporal frame number (time) and the spatial row and column indices (space) of the current frame.
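The patching step can be sketched as a reshape of the latent video. This is only an illustration under assumed values: the latent shape (frames, height, width, channels), the patch sizes, and the function name spacetime_tokens are hypothetical, since the actual sizes used by Sora are not given here.

```python
import numpy as np

def spacetime_tokens(latent, pt=2, ph=4, pw=4):
    """Cut a latent video of shape (T, H, W, C) into spacetime patches.

    Returns
      tokens:    (num_tokens, pt*ph*pw*C) flattened patches
      positions: (num_tokens, 3) integer (time, row, column) index per token
    """
    T, H, W, C = latent.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("latent size must be divisible by the patch size")
    nt, nh, nw = T // pt, H // ph, W // pw

    # Split each axis into (block index, offset inside block).
    x = latent.reshape(nt, pt, nh, ph, nw, pw, C)
    # Bring the three block indices to the front.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    tokens = x.reshape(nt * nh * nw, pt * ph * pw * C)

    grid = np.meshgrid(np.arange(nt), np.arange(nh), np.arange(nw), indexing="ij")
    positions = np.stack(grid, axis=-1).reshape(-1, 3)
    return tokens, positions

latent = np.random.default_rng(0).normal(size=(8, 16, 16, 4))
tokens, positions = spacetime_tokens(latent)
print(tokens.shape, positions.shape)
```

Each row of positions records where the corresponding token sits in time and space, which is the information a Transformer needs in order to relate tokens to one another.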

Transformation of Probability Distributions

We can further ask the following question:

  3. How can the probability distribution on the data manifold be represented?

The answer to the third question is: by using transport maps to transform the data probability distribution into a Gaussian distribution that a computer can generate. This transport map can be applied in either the original data space or the latent space. Common transport transforms include the optimal transport transform and heat diffusion. We can explain this from the perspective of fluid dynamics. Imagine the entire latent space is a water tank filled with a certain solvent, with its density representing the probability density. We disturb the water tank, causing the fluid to flow so that the density of the solvent changes. We compute the flow direction and speed of each water molecule so that the entropy of the probability density continually increases, eventually resulting in a Gaussian distribution.

For example, consider the distribution of face images, where each water molecule corresponds to a face image. By continuously adding noise to a face image, we obtain a series of images that ends in pure white noise. This series represents the trajectory of the water molecule. Eventually, every face image turns into white noise, and the resulting white noise images follow a Gaussian distribution. This process is known as Langevin dynamics.
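The noising process can be illustrated in one dimension. The sketch below assumes a variance-preserving update with a linear noise schedule, a common choice in diffusion models but an assumption here; the schedule parameters and the two-mode toy data stand in for real images.

```python
import numpy as np

def forward_noise(x0, num_steps=1000, beta_min=1e-4, beta_max=0.02, seed=0):
    """Gradually replace the signal in x0 by Gaussian noise.

    Each step shrinks the current value slightly and adds a small amount
    of fresh Gaussian noise (assumed linear schedule from beta_min to beta_max).
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for beta in np.linspace(beta_min, beta_max, num_steps):
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.normal(size=x.shape)
    return x

def excess_kurtosis(x):
    """Zero for a Gaussian; strongly negative for a two-peak distribution."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3.0

# Toy "data distribution": two narrow modes at -2 and +2.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2.0, 0.1, 5000), rng.normal(2.0, 0.1, 5000)])
noised = forward_noise(data)

print("before:", data.mean(), data.std(), excess_kurtosis(data))
print("after: ", noised.mean(), noised.std(), excess_kurtosis(noised))
```

After enough steps the two modes are no longer distinguishable, and the samples should be close to a standard Gaussian (mean near 0, standard deviation near 1, excess kurtosis near 0).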

Conversely, given a white noise image, if we trace the water molecule trajectories back to their source, we recover a face image. This is the principle behind diffusion models. Alternatively, one can directly solve for a homeomorphism from the latent space to itself using optimal transport theory, transforming the data distribution into a Gaussian distribution, which requires solving the Monge–Ampère equation. Thus, all the information of the data distribution is contained in the transport map, which is expressed by a deep network.
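In one dimension the optimal transport map for the quadratic cost has a closed form: it is the monotone map that sends each quantile of the data distribution to the same quantile of the target. The sketch below uses this fact to transport an empirical distribution to a standard Gaussian; the function name and the exponential toy data are illustrative, and the high-dimensional case requires solving the Monge–Ampère equation instead.

```python
import numpy as np
from statistics import NormalDist

def ot_map_to_gaussian(samples):
    """1D optimal transport (quadratic cost) from an empirical
    distribution to N(0, 1), evaluated at the sample points.

    The map is the monotone rearrangement: the sample with empirical
    quantile q is sent to the Gaussian quantile q.
    """
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    ranks = np.argsort(np.argsort(samples))   # rank of each sample, 0..n-1
    quantiles = (ranks + 0.5) / n             # strictly inside (0, 1)
    std_normal = NormalDist()
    return np.array([std_normal.inv_cdf(q) for q in quantiles])

rng = np.random.default_rng(0)
data = rng.exponential(scale=1.0, size=2000)  # skewed toy data
z = ot_map_to_gaussian(data)
print("transported mean/std:", z.mean(), z.std())
```

The map preserves the ordering of the samples, which is the one-dimensional counterpart of being the gradient of a convex potential.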

Fig. 2. Sora uses a diffusion model to generate data spatiotemporal tokens from white noise spatiotemporal tokens. (openai.com)

As shown in Fig. 2, in the latent space Sora transforms the probability distribution of the data tokens into a Gaussian distribution via a diffusion process (Langevin dynamics—gradually adding noise to each token), and then uses the inverse transform to convert the white noise tokens in the latent space back into data tokens.

The Enhancement by Large Language Models

Sora integrates the large language model ChatGPT, which greatly improves the performance of the system. Firstly, Sora’s training samples are (text, video) pairs. Some videos have captions that are too brief or have no caption at all, so Sora employs the re-captioning technique introduced in DALL·E 3.

Sora’s training set includes some high-quality samples (short videos with highly descriptive captions), which are used to train the data manifold of short videos (including the spatiotemporal token manifold), with each manifold being identified by its captions. For poor-quality short videos that lack captions or have ambiguous ones, Sora encodes them into the latent space and searches for latent feature vectors that are close to those of high-quality videos, then copies the captions from the high-quality videos to the poor ones. In this way, Sora can add highly descriptive captions to all the training video data, thereby improving the quality of the training set and further enhancing system performance.

At the same time, the large language model can expand the user’s input prompts, making them more precise and descriptive, so that the generated videos better match the user’s needs. This gives Sora an extra boost. However, Sora still has many shortcomings, which we can analyze through the following examples.

The Contradiction Between Correlation and Causality

ChatGPT breaks down sentences into tokens and then uses a Transformer to learn the probability distribution of connections among tokens in context. Similarly, Sora breaks down videos into spatiotemporal tokens and then learns the probability distribution of connections among tokens in context. Based on this probability distribution, it generates tokens from white noise, connects them, and decodes them into short videos.

Each token represents a local region in an image or video, and the stitching together of different local regions becomes the key issue. Sora learns each token relatively independently, expressing the spatial relationships between tokens using probabilities derived from the training set. As a result, it is unable to precisely express the spatiotemporal causal relationships between tokens.

Video 1. Video of an elderly lady blowing out birthday candles generated by Sora. (openai.com)

As shown in Video 1, every frame in the video generated by Sora is extraordinarily realistic, yet when the elderly lady blows out the birthday candles, the flame remains completely still. If we zoom into the region corresponding to each token, we see a gorgeous, realistic picture with very smooth and natural transitions between tokens. However, when there is a causal relationship between tokens that are far apart—for example, the air blown affecting the flickering of the flame—the physical causality between the two tokens is not reflected.

This means that a Transformer, which is used to express the statistical correlations between tokens, cannot precisely express physical causality. Although Transformers can manipulate natural language to some extent, natural language cannot accurately express physical laws, which are currently expressed precisely only by partial differential equations. This reflects a certain limitation of world models based on probability.

The Contradiction Between Local Plausibility and Global Absurdity

Currently, Sora does a reasonable job stitching together adjacent tokens, but the overall assembled video may exhibit various paradoxes. This indicates a gap between local stitching and global expansion.

Video 2. The “Ghost Chair” video generated by Sora. (openai.com)

If we observe the “Ghost Chair” video and limit our view to a local area in the center of the screen, the video appears entirely reasonable. A careful examination of the transitions between different token regions shows very continuous and smooth connections. However, the entire chair appears to be mysteriously suspended in mid-air, which contradicts everyday experience.

This kind of “locally plausible, globally absurd” video generation suggests that the Transformer has learned the local connection probabilities between tokens but lacks a broader, global understanding of the spatiotemporal context. In this video, the global concept comes from the gravitational field in physics, which is omnipresent though not apparent locally.

Video 3. The “Quadruped Ant” video generated by Sora. (openai.com)

Another example is the video of “quadruped ants” generated by Sora. The ants move vividly, almost like flowing clouds. Locally, the movements are very smooth and natural, making one wonder if such quadruped ants might exist on some planet. However, globally, there are no quadruped ants in the natural world on Earth. Here, local plausibility does not guarantee global plausibility, as the global perspective comes from biological facts.

Video 4. The “Contradictory Treadmill” video generated by Sora. (openai.com)

Similarly, in Sora’s “Contradictory Treadmill” video, if we examine each local region, the images appear reasonable, and the connections between tokens seem natural. However, the overall video is absurd: the treadmill moves in the opposite direction to the runner. This global perspective contradicts the facts of ergonomics.

These examples indicate that although the current Transformer can learn local contexts, it is unable to learn a more global context. This global context may be the gravitational field in physics, ergonomics, or biological species classification. This global perspective is precisely what Professor Zhu Songchun described as the “dark matter” of the AI world. Although each training sample video implicitly expresses a global perspective, the tokenization process fragments this global view, retaining only the limited connection probabilities between neighboring tokens, leading to locally plausible yet globally absurd results.

Modern global differential geometry places great emphasis on the contradiction between the local and the global, and has developed various theoretical tools to describe it. For instance, one can always construct smooth frame fields locally on a smooth manifold, but in general these cannot be extended globally; the obstruction is given by the characteristic classes of the corresponding fiber bundle. On complex manifolds, one can construct meromorphic functions locally, but these local functions cannot always be pieced together into a global meromorphic function. The discrepancy between local existence and global existence is precisely characterized by sheaf cohomology.

Many physical theories are expressed in terms of the characteristic class theories of specific fiber bundles, such as the theory of topological insulators. This type of mathematical theory, which is easy to construct locally but encounters substantial difficulties when extended globally, is in fact a crystallization of humanity’s profound exploration of nature. Such global topological and geometrical perspectives have not yet been extended into the AI field. If Transformers could learn these global obstructions in context on their own, AI would be much more effective in exploring the natural world.

The Absence of Critical States

The vast majority of physical processes in nature alternate between steady states and critical states. In steady states, the system parameters change slowly, making it easy to obtain observational data; in critical states (catastrophic changes), the system suddenly shifts, catching one off guard, making it very hard to capture observational data. Consequently, critical state samples are very scarce and almost of zero measure in the training set.

As a result, the data manifold learned by the Sora system is almost entirely composed of samples from steady states. In physical processes, critical state samples are mostly distributed along the boundaries of the data manifold. Therefore, during the generation process, Sora very easily generates video segments corresponding to steady states, but it often skips the critical states. Yet in human perception, the most crucial observations are precisely those critical states that occur with almost zero probability.

Video 5. Video of juice splashing generated by Sora. (openai.com)

In the juice splashing video generated by Sora, there are two stable states: the state in which the cup is upright and the state in which the juice has already splashed out. However, the most critical state, the process of the juice spilling from the cup, is not generated. Although it only lasts a few frames, this process is extremely important for human perception of the entire event. Sora’s failure to generate images of the critical state may be due to the following reasons:

Different steady states of a physical process generate different connected components of the data manifold, and the critical state samples lie near the boundaries of these components, in the region between two steady-state manifolds. The heat diffusion process blurs these boundaries, so the model mixes neighboring modes and generates videos with ambiguous transitions. In other words, the near-critical state corresponds to the boundary of the data manifold, and learning should preserve this boundary rather than cause mode confusion.

As shown in Fig. 3, we trained an encoder-decoder using MNIST and plotted the latent space distribution of the dataset. The 10 handwritten digits correspond to 10 clusters, each cluster representing a mode, i.e., a connected component of the data manifold. The boundaries of the clusters are the boundaries of the support of the data’s latent space distribution. We generated 100 sample points in the latent space and decoded them into 100 images of handwritten digits. If a sample point falls well within a cluster, the generated image is very clear; if it falls in the boundary region between clusters, the generated image is very blurry, often a fusion of two handwritten digits. Therefore, identifying the boundaries of the data manifold is very important for recognizing critical states.

The popular diffusion model currently employed by Sora inevitably smooths the boundaries of the steady state data manifold when computing the transport map, thus confusing different modes and directly skipping the generation of critical state images. Consequently, the video appears to abruptly jump from one state to another, with the most critical transitional process missing, leading to a physical absurdity.
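The boundary-blurring effect can be seen in a one-dimensional toy model. The sketch below assumes two steady-state modes supported on [-2, -1] and [1, 2], with an empty gap between them, and models diffusion as additive Gaussian noise (the solution of the heat equation is the data density convolved with a Gaussian kernel). The function name gap_mass and the interval choices are illustrative.

```python
import numpy as np

def gap_mass(sigma, num_samples=100_000, seed=0):
    """Fraction of noised samples that land in the gap (-1, 1).

    The clean data has two modes, uniform on [-2, -1] and on [1, 2],
    so the gap between them carries no probability mass at all.
    """
    rng = np.random.default_rng(seed)
    sign = rng.choice([-1.0, 1.0], size=num_samples)
    x = sign * rng.uniform(1.0, 2.0, size=num_samples)
    noisy = x + sigma * rng.normal(size=num_samples)
    return float(np.mean(np.abs(noisy) < 1.0))

for sigma in (0.0, 0.1, 0.5, 1.0):
    print(f"sigma={sigma}: mass in gap = {gap_mass(sigma):.3f}")
```

Without noise the gap is empty. As the noise level grows, more and more probability mass leaks into the region between the modes, so a model trained on the smoothed density no longer sees a sharp boundary between the two steady states.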

Video 6. Video of puppies generated by Sora. (openai.com)

Video 6 illustrates another scenario where errors occur due to crossing the manifold boundaries. In the video, a group of puppies is frolicking, sometimes blocking each other, sometimes scattering apart. At one moment in the video, three puppies in the frame suddenly become four. Our explanation is as follows: images of four puppies constitute one manifold (or connected component), while images of three puppies constitute another component. At the boundary of the four-puppy image manifold, there is a critical event: when the four puppies overlap, only three are visible in the image.

Sora’s diffusion model does not recognize the boundaries of the manifold; instead, it crosses these boundaries, jumping between the three-puppy image manifold and the four-puppy image manifold. The correct approach should be to first identify the boundaries of the manifold, and then, in situations where physical crossing is impossible (such as three versus four), to reflect back onto the original manifold at the boundary.

Fig. 4. The optimal transport map based on geometric methods can precisely detect the boundaries of the data manifold and accurately capture the critical states.

The shortcomings of the diffusion model can be overcome by the optimal transport model based on geometric methods. As shown in Fig. 4, suppose we compute the optimal transport map from a uniform distribution inside a disc to a uniform distribution within a seahorse-shaped region on the right. According to Brenier’s theorem, the optimal transport map is given by the gradient of a convex potential function, and this potential function satisfies the Monge–Ampère equation. The potential function is not differentiable everywhere; the set where it is continuous but not differentiable projects onto the singular set (the black curve) in the disc. Regular points map to regular points in the target region, while the singular set maps to the boundary of the target region, with each singular point mapping simultaneously to two boundary points, one on each side.

When we cross the singular set, it means we have crossed between two steady states, and a critical (catastrophic) event must occur, namely a physical event where the steady state is broken. Thus, precisely identifying the singular set of the transport map and detecting the critical (catastrophic) states is fundamentally important for modeling the physical world.
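A one-dimensional analogue makes the singular set concrete. The sketch below assumes a source that is uniform on [-1, 1] and a target that is uniform on the two intervals [-2, -1] and [1, 2]; in one dimension the optimal map simply matches sorted source points to sorted target points. For this toy case the convex potential is u(x) = x²/2 + |x|, which is not differentiable at x = 0. The function names are illustrative and the example is not the two-dimensional computation of Fig. 4.

```python
import numpy as np

def monotone_transport(source, target):
    """1D optimal transport between equal-size point sets:
    the i-th smallest source point goes to the i-th smallest target point."""
    order = np.argsort(source)
    T = np.empty_like(source)
    T[order] = np.sort(target)
    return T

def find_singular_point(source, T):
    """Locate the largest jump of the transport map.

    Returns the source location of the jump and the two target
    values on either side of it.
    """
    order = np.argsort(source)
    xs, ys = source[order], T[order]
    k = int(np.argmax(np.diff(ys)))
    return 0.5 * (xs[k] + xs[k + 1]), ys[k], ys[k + 1]

source = np.linspace(-1.0, 1.0, 2000)
target = np.concatenate([np.linspace(-2.0, -1.0, 1000),
                         np.linspace(1.0, 2.0, 1000)])
T = monotone_transport(source, target)
x_sing, left, right = find_singular_point(source, T)
print("singular point:", x_sing, "maps to boundary points", left, "and", right)
```

The map is continuous everywhere except at one point, and that point is sent to the two boundary points of the target that face the gap. Detecting this jump identifies where one steady state ends and the other begins, without placing any mass inside the gap.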

Summary

In summary, although Sora claims to be “a video generation model that simulates the world,” its current technical approach cannot correctly simulate the physical laws of the world.

Firstly, using statistical correlations based on probability cannot precisely express the causality of physical laws; the contextual correlations of natural language do not reach the precision of partial differential equations. Secondly, although Transformers can learn the connection probabilities between adjacent spatiotemporal tokens, they cannot judge global plausibility. Global plausibility requires higher-level mathematical theoretical perspectives or deeper, more implicit backgrounds in natural and human sciences—perspectives that current Transformers are incapable of truly grasping.

In addition, Sora neglects the most critical aspect of physical processes: the critical (catastrophic) state. On one hand, this is because critical state samples are scarce; on the other hand, the diffusion model blurs the boundaries of the steady state data manifold, eliminating the existence of critical states and causing jumps between different steady states in the generated videos.

In contrast, the optimal transport theory framework based on geometric methods can precisely detect the boundaries of the steady state data manifold, thereby emphasizing the generation of critical state events and avoiding crossovers between different steady states, thus being closer to physical reality.

Currently, the data-driven world simulation models represented by Sora and the world simulation models based on first-principles physical laws and partial differential equations have entered into a fierce battle. This may be a great turning point in human history. I hope that young readers will actively join the torrent of our times and use their intelligence and talent to promote the development of technology and society!

Reference

  • Gu Xianfeng - A Geometric Explanation of Sora’s Physical Paradoxes (顾险峰 - Sora物理悖谬的几何解释)