Everything You Need To Know About OpenAI's Sora Video Generator

OpenAI, the maker of products like ChatGPT and the Dall-E image engine, has set the internet on fire with Sora. It's a text-to-video model: give it a few lines of text describing a scene, and it generates a matching video. At the moment, Sora can create videos up to one minute long. Of course, Sora is not the first product of its kind. Late in 2022, Meta introduced Make-A-Video, a generative AI tool capable of producing short video clips from text prompts.

Aside from making surreal and stylized videos, Meta's tool can also add motion to static images and create different variations of an input video. However, the videos produced by Make-A-Video weren't what most would call high quality, even though the company says they are made using highly complex technology. What sets OpenAI's Sora apart is the almost photorealistic nature of the videos it can produce.

More importantly, the company claims that the videos generated by Sora not only follow what the user has asked for in the text prompt but also reflect how the items described actually look and behave in the real world. That is particularly evident in some of the videos OpenAI has released so far, including clips depicting California during the historic gold rush, a woman walking down a city street at night, a drone view of Big Sur, and a close-up of a woman's eye.

Is Sora available for you?

Sora is currently exclusive to OpenAI red teamers, who are assessing critical areas for potential harm or risks. The company has also granted access to a select group of visual artists, designers, and filmmakers to gather feedback on how to enhance the model and iron out its flaws. The FAQ section on the official OpenAI community forum confirms that Sora is not widely available, and as of now, there is no waitlist for accessing it or getting hold of its APIs.

However, there are scammy ads and posts out there claiming they can get you access to Sora without any special prerequisites. All such ads and posts are categorically misleading, and you should not engage with them or click any links they offer. The best way forward is to keep an eye on the official OpenAI blog and developer forum, or to follow top OpenAI executives such as Sam Altman and Greg Brockman on social media.

The obvious reason OpenAI is keeping Sora limited to a small group of testers is to find flaws in the system and fix them before a wider release. For a tool like Sora, the stakes are even higher, given the photorealistic nature of the content it can produce and the high risk of deepfakes and other deceptive media. "We're being careful about deployment here and making sure we have all our bases covered before we put this in the hands of the general public," an OpenAI scientist told MIT Technology Review.

How Sora stands out

So far, the only other text-to-video AI tool on the market that has really made an impression is Runway Gen 2. It can not only produce videos but also animate stills, modify clips, and even restyle an original video using the aesthetics of another image. The capabilities are certainly diverse, but the results exhibit tell-tale signs of AI processing, such as highly visible motion artifacts, fuzzy edges, and shimmer.

Sora, by contrast, is said to be capable of depicting specific kinds of motion, handling multiple characters, and replicating niche camera movements and perspectives. It also shows a better overall grasp of language nuances and of how they map onto the physical world.

Internet sleuths have also highlighted how the videos generated by Sora show an impressive grasp of geometry, dimensions, and 3D reconstruction. Some have pointed out how well Sora preserves the coherence of an object even in motion, giving a special shoutout to the AI-generated video of a blue bird that OpenAI shared on its official website. The detailing is extremely impressive, the motion looks fluid, and there are hardly any blurry artifacts, especially in videos depicting realistic settings.

A brief technical overview

Sora follows a similar trajectory to the large language models (LLMs) powering text-based products such as ChatGPT. Where LLMs use tokens (small chunks of text, such as words or word fragments) as the morsels of data they train on and process, Sora relies on patches. "At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space, and subsequently decomposing the representation into spacetime patches," OpenAI explains.
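To make the patch idea concrete, here is a minimal sketch in Python of how a raw video tensor could be decomposed into flat spacetime patches. This is an illustration, not OpenAI's pipeline: the patch sizes are arbitrary, and the latent-space compression step is skipped so the example stays self-contained.

```python
import numpy as np

def video_to_spacetime_patches(video, t_patch=4, h_patch=16, w_patch=16):
    """Cut a video tensor into flat 'spacetime patches'.

    video: array of shape (frames, height, width, channels).
    In Sora, the video would first be compressed into a
    lower-dimensional latent space; here we patch raw pixels.
    """
    T, H, W, C = video.shape
    # Trim so each dimension divides evenly into patches.
    T, H, W = T - T % t_patch, H - H % h_patch, W - W % w_patch
    video = video[:T, :H, :W]
    # Carve the clip into a grid of t_patch x h_patch x w_patch blocks.
    grid = video.reshape(T // t_patch, t_patch,
                         H // h_patch, h_patch,
                         W // w_patch, w_patch, C)
    grid = grid.transpose(0, 2, 4, 1, 3, 5, 6)
    # Flatten each block into one vector: the analogue of a text token.
    return grid.reshape(-1, t_patch * h_patch * w_patch * C)

# A 16-frame, 64x64 RGB clip becomes a sequence of 64 patch "tokens."
clip = np.random.rand(16, 64, 64, 3).astype(np.float32)
print(video_to_spacetime_patches(clip).shape)  # (64, 3072)
```

Just as a sentence becomes a sequence of tokens, the clip becomes a sequence of patches the model can attend over, which is what lets the same transformer machinery carry over from text to video.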

At its heart, Sora is a diffusion model: it is trained to take noisy input data (patches, in this case) and progressively strip the noise away until clean patches, and ultimately the final video, emerge. The underlying architecture is still the transformer, rather than the GAN-based approach of earlier text-to-video models. In a nutshell, Sora is a hybrid, or as OpenAI likes to call it, a diffusion transformer.
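As a toy illustration of that diffusion objective, the training step below adds noise to clean patches and scores a model on how well it predicts the noise. The stand-in "denoiser" is a placeholder for Sora's transformer, and the noise schedule is a generic assumption, not OpenAI's.

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_training_step(denoiser, clean_patches):
    """One conceptual diffusion training step on spacetime patches."""
    num, dim = clean_patches.shape
    # Pick a random noise level t in (0, 1) and mix noise into the data.
    t = rng.uniform(0.01, 0.99)
    noise = rng.standard_normal((num, dim))
    noisy = np.sqrt(1 - t) * clean_patches + np.sqrt(t) * noise
    # The model is trained to recover the added noise; this
    # mean-squared error is what gradient descent would minimize.
    predicted_noise = denoiser(noisy, t)
    return np.mean((predicted_noise - noise) ** 2)

# Untrained stand-in "model" that just guesses zeros: its expected
# loss is ~1.0, the baseline a real denoiser learns to beat.
untrained_denoiser = lambda noisy, t: np.zeros_like(noisy)
patches = rng.standard_normal((64, 3072))
print(diffusion_training_step(untrained_denoiser, patches))
```

At generation time, the process runs in reverse: starting from pure noise, the trained denoiser is applied repeatedly until coherent patches, and therefore coherent frames, emerge.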

Sora also tackles some extremely challenging aspects of AI video generation, especially context-aware frame generation in 3D space from static as well as moving perspectives. The model can preserve the identity of people, animals, and objects as they move through a three-dimensional scene, even when they are temporarily occluded or leave the frame. It can also capture various angles of a single character in one shot while keeping their visual portrayal consistent throughout the video. The camera transitions and revolves smoothly, letting the people and elements in the scene move naturally through the three-dimensional environment.

The challenges ahead

Sora can sample videos at up to full-HD resolution (1920 x 1080 pixels) in both portrait and landscape orientations, which means it can generate videos tailored for phones as well as the larger screens of computers and tablets. It can also extend a video both forward and backward in time to create seamless, infinitely looping clips, while varying settings such as resolution and aspect ratio along the way. Of course, it's not without its faults. OpenAI says Sora can falter at understanding the physics of certain situations, such as breaking a glass with liquid inside.

On a similar note, it can struggle with depicting a state of change: a man taking a bite out of a burger, for example, may not leave a bite mark. Sora can also lose track of subjects over both short and long time spans within a scene. "If someone goes out of view for a long time, they won't come back. The model kind of forgets that they were supposed to be there," OpenAI's Tim Brooks explained to MIT Technology Review.

Then comes the question of reality. With AI-generated videos making their way to platforms such as YouTube, Instagram, and TikTok, what is going to stop bad actors from passing them off as real? Thankfully, OpenAI has a plan. All Sora-generated videos will carry a signature in their metadata. In addition, OpenAI has developed a detector that works in the same vein as the one it built for the Dall-E text-to-image engine.
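OpenAI hasn't detailed the exact format of that signature, though for Dall-E 3 images it uses C2PA provenance metadata, so something similar is plausible here. As a rough sketch under that assumption, here is how one might surface whatever tags a video container carries using the standard ffprobe tool; a robust check would verify a signed C2PA manifest rather than trust plain tags, which can be stripped or forged:

```python
import json
import subprocess

def inspect_video_metadata(path):
    """Dump a video file's container-level metadata tags with ffprobe.

    This is only a surface-level look; provenance tags can be removed
    or faked, which is why detector models are still needed.
    """
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", path],
        capture_output=True, text=True, check=True)
    tags = json.loads(result.stdout).get("format", {}).get("tags", {})
    for key, value in tags.items():
        print(f"{key}: {value}")
    return tags

# Usage: inspect_video_metadata("suspected_sora_clip.mp4")
```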

Bigger questions about Sora

The biggest unanswered question is what data OpenAI used to train Sora. In 2023, OpenAI extended its partnership with Shutterstock to license the latter's media library for six years, but the company has not yet said whether that material is what Sora was trained on.

Then there is the confluence of AI, fair use, and labor security. We have already seen instances where AI tools have jeopardized the jobs of real humans, and the SAG-AFTRA strikes proved to be a watershed moment in that tussle. However, that hasn't stopped controversial use cases. For instance, Marvel was criticized for using AI-generated art in the opening credits of its TV show "Secret Invasion."

Finally, we have the question of cost. Compared to generating text, images, or audio, video is an entirely different ball game because of the computing power it demands. That raises the question of whether Sora will be a standard offering with a premium ChatGPT subscription or whether it will spawn a wallet-breaking payment tier for enthusiasts swayed by all the possibilities.