无界 | The OpenAI Sora Video Generation Model

Digest   Science   2024-03-03 06:55   United Kingdom
 







无界 | Art and Technology

The OpenAI Sora Video Generation Model






 





 

    SUMMARY




On February 16, 2024, OpenAI published a research report on its video generation model Sora, marking a significant advance in the field of Artificial Intelligence Generated Content (AIGC).

Sora can currently create high-quality videos of up to one minute from a user's text prompt. According to the test results published in the report, Sora demonstrates an exceptional ability to follow written instructions and sets new industry benchmarks in several key areas, achieving state-of-the-art (SOTA) results in video fidelity, length, stability, consistency, resolution, and text comprehension. Sora also exhibits emergent abilities, such as an understanding of physical effects and causal relationships. The model is currently in a testing phase, with the goal of improving it through feedback from visual artists, designers, and filmmakers, for example by reducing potential bias and the generation of harmful content.



    CHALLENGES AND OPPORTUNITIES




The rapid development of sophisticated models such as ChatGPT and Sora has contributed substantially to productivity and technological progress, and the rise of large AI models has created enormous demand for computing power. According to a February 8 report in The Wall Street Journal, Sam Altman plans to raise 5 to 7 trillion US dollars to reshape the structure of the semiconductor industry, including expanding the global infrastructure and supply chains for chips, energy, and data centers.

On February 28, researchers from Microsoft Research and Lehigh University published "A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models". By analyzing Sora's technical report and reverse-engineering the model, the review offers a comprehensive look at large vision models and their potential impact and significance across many fields. It argues that Sora's influence extends well beyond video creation itself, offering transformative potential for tasks ranging from automated content generation to complex decision-making, with possible applications in key areas such as filmmaking, education, gaming, healthcare, and robotics.

These advances, however, have also raised concerns about misuse, including fake news, privacy violations, and ethical dilemmas. For creators, Sora will lower the barrier to video production: with effective use of AI tools, a concept can be turned into visual content quickly. Competition in content creation will intensify accordingly. And while cultural and legal disputes over originality remain unresolved, artists have already begun experimenting with the technology. In early 2024, Gagosian's Beverly Hills gallery exhibited a series of prints that director and artist Bennett Miller created with the DALL·E image generator. Through these works, Miller connects the advent of artificial intelligence to the history of the photographic image, raising questions about the contingency of perception.

Bennett Miller solo exhibition. Photo: Jeff McLane


Sora's release has sparked wide-ranging discussion about creative freedom, employment, economic structures, and how artistic value is judged. Challenges and opportunities coexist in this transformation, and society as a whole will have to adapt to, and help chart, the course of this emerging technological revolution.



    SORA FULL REPORT 




Video Generation Models as World Simulators


We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.



This technical report focuses on (1) our method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) qualitative evaluation of Sora’s capabilities and limitations. Model and implementation details are not included in this report.


 

Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks, generative adversarial networks, autoregressive transformers, and diffusion models. These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a generalist model of visual data - it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high definition video.



Turning Visual Data into Patches


We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data. The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text—code, math and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data. We find that patches are a highly-scalable and effective representation for training generative models on diverse types of videos and images.






At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space, and subsequently decomposing the representation into spacetime patches.

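The report withholds implementation details, but the two-step idea (compress, then patchify) can be sketched in code. Below is a minimal PyTorch sketch of the patchify step, where the patch sizes, channel count, and tensor layout are assumptions, not Sora's actual configuration:

```python
import torch

def spacetime_patchify(latents: torch.Tensor,
                       pt: int = 2, ph: int = 2, pw: int = 2) -> torch.Tensor:
    """Split a compressed latent video into a sequence of spacetime patches.

    latents: (C, T, H, W) output of a video compression network.
    Returns: (N, C * pt * ph * pw), where N = (T/pt) * (H/ph) * (W/pw).
    Patch sizes are illustrative; the report does not specify them.
    """
    c, t, h, w = latents.shape
    x = latents.reshape(c, t // pt, pt, h // ph, ph, w // pw, pw)
    x = x.permute(1, 3, 5, 0, 2, 4, 6)      # (T', H', W', C, pt, ph, pw)
    return x.reshape(-1, c * pt * ph * pw)  # flatten grid into a token sequence

latents = torch.randn(4, 16, 32, 32)        # hypothetical 4-channel latent video
tokens = spacetime_patchify(latents)
print(tokens.shape)                         # torch.Size([2048, 32])
```

Each row of `tokens` plays the role a text token plays in an LLM: a unit the transformer attends over, here indexed by a position in spacetime.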


Video Compression Network


We train a network that reduces the dimensionality of visual data. This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.

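Sora's compression architecture is not disclosed. As a minimal sketch, assuming a convolutional autoencoder in the spirit of those used by latent diffusion models (layer choices and compression ratios below are illustrative):

```python
import torch.nn as nn

class VideoAutoencoder(nn.Module):
    """Illustrative stand-in for Sora's undisclosed compression network.

    Compresses time by 2x and each spatial dimension by 4x; assumes T is even
    and H, W are divisible by 4.
    """
    def __init__(self, in_ch: int = 3, latent_ch: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(in_ch, 64, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, latent_ch, kernel_size=3, stride=(1, 2, 2), padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_ch, 64, kernel_size=(3, 4, 4),
                               stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(64, in_ch, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, video):          # video: (B, 3, T, H, W)
        z = self.encoder(video)        # latent, compressed in time and space
        return self.decoder(z), z      # reconstruction and latent
```

The decoder half corresponds to the separate model the report mentions for mapping generated latents back to pixel space.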

 

Spacetime Latent Patches


Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens. This scheme works for images too since images are just videos with a single frame. Our patch-based representation enables Sora to train on videos and images of variable resolutions, durations and aspect ratios. At inference time, we can control the size of generated videos by arranging randomly-initialized patches in an appropriately-sized grid.

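A hypothetical illustration of that last point: the output size is set simply by how many randomly initialized patch tokens are laid out in the grid, and a grid whose temporal extent is one also covers the image case described later in the report. All sizes below are invented for the example:

```python
import torch

def init_noise_tokens(t_patches: int, h_patches: int, w_patches: int,
                      dim: int = 1024) -> torch.Tensor:
    """Lay out random noise tokens in a (T, H, W) patch grid.

    Denoising this sequence (together with positional information for the
    grid) would yield a video of the corresponding size; the token dimension
    and grid shapes here are illustrative, not Sora's.
    """
    return torch.randn(t_patches * h_patches * w_patches, dim)

landscape = init_noise_tokens(30, 34, 60)  # wide video
portrait = init_noise_tokens(30, 60, 34)   # tall video
image = init_noise_tokens(1, 64, 64)       # temporal extent of one frame: an image
```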


Scaling Transformers for Video Generation


Sora is a diffusion model; given input noisy patches (and conditioning information like text prompts), it's trained to predict the original "clean" patches. Importantly, Sora is a diffusion transformer. Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling, computer vision, and image generation.

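Model and training details are not included in the report, but the stated objective (predict the original clean patches from noisy ones) can be sketched. The noising schedule, tensor shapes, and `model` interface below are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, clean_tokens, text_emb, optimizer):
    """One illustrative denoising training step on patch tokens.

    clean_tokens: (B, N, D) spacetime patch tokens for a batch of videos.
    model: any transformer mapping (noisy tokens, noise level, text embedding)
    to predicted clean tokens; its architecture is not specified in the report.
    """
    b = clean_tokens.shape[0]
    t = torch.rand(b, device=clean_tokens.device)   # per-sample noise level
    noise = torch.randn_like(clean_tokens)
    alpha = (1.0 - t).view(b, 1, 1)
    # Simple linear-interpolation noising; Sora's actual schedule is unknown.
    noisy = alpha * clean_tokens + (1.0 - alpha) * noise
    pred = model(noisy, t, text_emb)                # predict the "clean" patches
    loss = F.mse_loss(pred, clean_tokens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```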



In this work, we find that diffusion transformers scale effectively as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as training progresses. Sample quality improves markedly as training compute increases.



 

Variable Durations, Resolutions, Aspect Ratios


Past approaches to image and video generation typically resize, crop or trim videos to a standard size—e.g., 4 second videos at 256x256 resolution. We find that instead training on data at its native size provides several benefits.



Sampling Flexibility


Sora can sample widescreen 1920x1080p videos, vertical 1080x1920 videos and everything in between. This lets Sora create content for different devices directly at their native aspect ratios. It also lets us quickly prototype content at lower sizes before generating at full resolution - all with the same model.



Improved Framing and Composition


We empirically find that training on videos at their native aspect ratios improves composition and framing. We compare Sora against a version of our model that crops all training videos to be square, which is common practice when training generative models. The model trained on square crops (left) sometimes generates videos where the subject is only partially in view. In comparison, videos from Sora (right) have improved framing.



Language Understanding


Training text-to-video generation systems requires a large amount of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3 to videos. We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos.

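A hypothetical sketch of that pipeline, where `captioner` and its `describe` method are stand-ins (the report does not describe the captioner model):

```python
# Hypothetical re-captioning pipeline: a descriptive captioner labels every
# training clip, and the captions become the text conditioning for training.
def recaption_dataset(videos, captioner):
    """Return (video, detailed_caption) training pairs."""
    return [(video, captioner.describe(video)) for video in videos]
```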

 

Similar to DALL·E 3, we also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model. This enables Sora to generate high quality videos that accurately follow user prompts.

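The report does not describe how this expansion is implemented. A hypothetical sketch using the OpenAI chat completions API, where the model name and system instructions are assumptions:

```python
from openai import OpenAI

client = OpenAI()

def expand_prompt(short_prompt: str) -> str:
    """Turn a short user prompt into a long, detailed caption (illustrative)."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed; the report does not name the model used
        messages=[
            {"role": "system",
             "content": "Rewrite the user's idea as a long, highly detailed "
                        "video caption describing subjects, motion, setting, "
                        "lighting and camera work."},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content
```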



Prompting with Images and Videos


All of the results above and in our landing page show text-to-video samples. But Sora can also be prompted with other inputs, such as pre-existing images or video. This capability enables Sora to perform a wide range of image and video editing tasks—creating perfectly looping video, animating static images, extending videos forwards or backwards in time, etc.


 

Animating DALL·E Images


Sora is capable of generating videos provided an image and prompt as input. Below we show example videos generated based on DALL·E 2 and DALL·E 3 images.




Extending Generated Videos


Sora is also capable of extending videos, either forward or backward in time. Below are four videos that were all extended backward in time starting from a segment of a generated video. As a result, each of the four videos starts different from the others, yet all four videos lead to the same ending.


We can use this method to extend a video both forward and backward to produce a seamless infinite loop.

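The report does not explain how extension works internally. One plausible reading is diffusion inpainting along the time axis: hold the existing segment's latents fixed, place fresh noise before (or after) them, and denoise only the new region. The sketch below assumes that approach; `denoise` and its mask convention are hypothetical:

```python
import torch

def extend_backward(segment_latents, n_new, denoise, prompt_emb):
    """Illustrative backward extension of a clip via temporal inpainting.

    segment_latents: (C, T, H, W) latents of the existing generated segment.
    """
    c, t, h, w = segment_latents.shape
    prefix = torch.randn(c, n_new, h, w)                   # fresh noise to fill in
    full = torch.cat([prefix, segment_latents], dim=1)     # noise, then known clip
    mask = torch.cat([torch.ones(n_new), torch.zeros(t)])  # 1 = generate here
    return denoise(full, mask=mask, cond=prompt_emb)
```

Running this with several random prefixes would give clips that start differently but share the same ending, as in the examples above.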

 

Video-to-video Editing


Diffusion models have enabled a plethora of methods for editing images and videos from text prompts. Below we apply one of these methods, SDEdit, to Sora. This technique enables Sora to transform the styles and environments of input videos zero-shot.

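SDEdit's core idea is brief enough to sketch: partially noise the input video's latents, then run the reverse diffusion process under the new prompt, so coarse structure and motion survive while style and environment change. The strength value, the linear noising blend, and the `denoise` callable below are illustrative simplifications:

```python
import torch

def sdedit(video_latents, new_prompt_emb, denoise, strength=0.6):
    """Illustrative SDEdit-style restyling of an existing video.

    strength: how far toward pure noise to push the input (0 = unchanged,
    1 = generate from scratch).
    """
    noise = torch.randn_like(video_latents)
    noisy = (1.0 - strength) * video_latents + strength * noise
    # Denoising from this intermediate point keeps layout, replaces appearance.
    return denoise(noisy, t_start=strength, cond=new_prompt_emb)
```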


Connecting Videos


We can also use Sora to gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. In the examples below, the videos in the center interpolate between the corresponding videos on the left and right.

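One simple way to picture the interpolation (not necessarily Sora's actual method) is a time-varying blend of the two videos' latents that the model then denoises; `denoise` is again a stand-in:

```python
import torch

def connect_videos(latents_a, latents_b, denoise, prompt_emb):
    """Illustrative transition: weight shifts from video A to B over time.

    latents_a, latents_b: (C, T, H, W) latents of the two input videos.
    """
    t_len = latents_a.shape[1]
    w = torch.linspace(0, 1, t_len).view(1, t_len, 1, 1)  # 0 at start, 1 at end
    blended = (1 - w) * latents_a + w * latents_b
    return denoise(blended, cond=prompt_emb)              # smooth over the seam
```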



Image Generation Capabilities


Sora is also capable of generating images. We do this by arranging patches of Gaussian noise in a spatial grid with a temporal extent of one frame. The model can generate images of variable sizes—up to 2048x2048 resolution.




Emerging Simulation Capabilities


We find that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people, animals and environments from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, etc. - they are purely phenomena of scale.


 

3D Consistency. Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space.



Long-range Coherence and Object Permanence. A significant challenge for video generation systems has been maintaining temporal consistency when sampling long videos. We find that Sora is often, though not always, able to effectively model both short- and long-range dependencies. For example, our model can persist people, animals and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video.




Interacting with the World. Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks.


 

Simulating Digital Worlds: Sora is also able to simulate artificial processes–one example is video games. Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity. These capabilities can be elicited zero-shot by prompting Sora with captions mentioning “Minecraft.”

These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world, and the objects, animals and people that live within them.




Discussion 

Sora currently exhibits numerous limitations as a simulator. For example, it does not accurately model the physics of many basic interactions, like glass shattering. Other interactions, like eating food, do not always yield correct changes in object state. We enumerate other common failure modes of the model - such as incoherencies that develop in long duration samples or spontaneous appearances of objects - in our landing page.

We believe the capabilities Sora has today demonstrate that continued scaling of video models is a promising path towards the development of capable simulators of the physical and digital world, and the objects, animals and people that live within them.




Link to the original Sora report:

https://openai.com/research/video-generation-models-as-world-simulators

A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

https://arxiv.org/pdf/2402.17177.pdf




Text and editing: 润卿Gabby




Swanfall Art
Creative Talents Projects, sponsored by Swanfall Limited