Sora is a text-to-video model developed by OpenAI. The model generates short video clips based on user prompts, and can also extend existing short videos. Sora was released publicly for ChatGPT Plus and ChatGPT Pro users in December 2024.
History
Several other text-to-video generating models had been created prior to Sora, including Meta's Make-A-Video, Runway's Gen-2, and Google's Lumiere, the last of which is also still in its research phase. OpenAI, the company behind Sora, had released DALL·E 3, the third of its DALL-E text-to-image models, in September 2023.
The team that developed Sora named it after the Japanese word for sky to signify its "limitless creative potential".
On February 15, 2024, OpenAI first previewed Sora by releasing multiple high-definition video clips that the model had created, including an SUV driving down a mountain road, an animation of a "short fluffy monster" next to a candle, two people walking through Tokyo in the snow, and fake historical footage of the California gold rush. OpenAI stated that the model was able to generate videos up to one minute long. The company then shared a technical report, which highlighted the methods used to train the model. OpenAI CEO Sam Altman also posted a series of tweets, responding to Twitter users' prompts with Sora-generated videos of the prompts.
In November 2024, an API key for Sora access was leaked by a group of testers on Hugging Face, who posted a manifesto protesting that Sora was being used for "art washing". OpenAI revoked the access three hours after the leak was made public, and said in a statement that "hundreds of artists" had shaped the development, and that "participation is voluntary."
On December 9, 2024, OpenAI made Sora available to the public for ChatGPT Pro and ChatGPT Plus users. Prior to this, the company had provided limited access to a small "red team", including experts in misinformation and bias, to perform adversarial testing on the model. The company also shared Sora with a small group of creative professionals, including video makers and artists, to seek feedback on its usefulness in creative fields.
Capabilities and limitations
A video generated by Sora of someone lying in a bed with a cat on it, containing several mistakes
The technology behind Sora is an adaptation of the technology behind DALL-E 3. According to OpenAI, Sora is a diffusion transformer – a denoising latent diffusion model with one Transformer as the denoiser. A video is generated in latent space by denoising 3D "patches", then transformed to standard space by a video decompressor. Re-captioning is used to augment training data, by using a video-to-text model to create detailed captions on videos. OpenAI trained the model using publicly available videos as well as copyrighted videos licensed for the purpose, but did not reveal the number or the exact source of the videos.
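The idea of denoising 3D "patches" in latent space can be illustrated with a toy sketch. The code below is not OpenAI's implementation, whose details have not been published; every name in it (`patchify`, `unpatchify`, `toy_denoiser`, `generate`) is hypothetical, the latent is a small grid of scalars rather than a learned representation, and the denoiser is a placeholder that simply shrinks values where a real system would apply a trained Transformer. It only shows the data flow: start from noise, cut the latent video into spacetime patches, refine the patch tokens over several steps, and reassemble them.

```python
import random


def patchify(latent, pt, ph, pw):
    """Split a latent video indexed [t][y][x] into flattened 3D patches (tokens)."""
    T, H, W = len(latent), len(latent[0]), len(latent[0][0])
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "patch size must divide dims"
    tokens = []
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                tokens.append([latent[t][y][x]
                               for t in range(t0, t0 + pt)
                               for y in range(y0, y0 + ph)
                               for x in range(x0, x0 + pw)])
    return tokens


def unpatchify(tokens, T, H, W, pt, ph, pw):
    """Inverse of patchify: write each token back to its place in the latent."""
    latent = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    index = 0
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                values = iter(tokens[index])
                index += 1
                for t in range(t0, t0 + pt):
                    for y in range(y0, y0 + ph):
                        for x in range(x0, x0 + pw):
                            latent[t][y][x] = next(values)
    return latent


def toy_denoiser(tokens):
    """Placeholder for the Transformer denoiser (assumption): halves every value."""
    return [[0.5 * value for value in token] for token in tokens]


def generate(T=4, H=4, W=4, pt=2, ph=2, pw=2, steps=8, seed=0):
    """Start from Gaussian noise and refine the patch tokens step by step."""
    rng = random.Random(seed)
    latent = [[[rng.gauss(0.0, 1.0) for _ in range(W)]
               for _ in range(H)] for _ in range(T)]
    tokens = patchify(latent, pt, ph, pw)
    for _ in range(steps):
        tokens = toy_denoiser(tokens)
    # A real system would now pass the latent through a video decompressor
    # (decoder) to obtain pixels; that stage is omitted here.
    return unpatchify(tokens, T, H, W, pt, ph, pw)
```

With the default arguments, a 4×4×4 latent is cut into eight patches of eight values each, so the denoiser sees a sequence of eight tokens regardless of how the video was laid out in time and space.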
Upon its release, OpenAI acknowledged some of Sora's shortcomings, including difficulty simulating complex physics, understanding causality, and differentiating left from right. One example shows a group of wolf pups seemingly multiplying and converging, creating a hard-to-follow scenario. OpenAI also stated that, in adherence to the company's existing safety practices, Sora would restrict text prompts for sexual, violent, hateful, or celebrity imagery, as well as content featuring pre-existing intellectual property.
Tim Brooks, a researcher on Sora, stated that the model figured out how to create 3D graphics from its dataset alone, while Bill Peebles, also a Sora researcher, said that the model automatically created different video angles without being prompted.
According to OpenAI, Sora-generated videos are tagged with C2PA metadata to indicate that they were AI-generated.
Reception
Will Douglas Heaven of the ''MIT Technology Review'' called the demonstration videos "impressive", but noted that they must have been cherry-picked and may not be representative of Sora's typical output.
American academic Oren Etzioni expressed concerns over the technology's ability to create online disinformation for political campaigns.
For ''Wired'', Steven Levy similarly wrote that it had the potential to become "a misinformation train wreck" and opined that its preview clips were "impressive" but "not perfect" and that it "show[ed] an emergent grasp of cinematic grammar" due to its unprompted shot changes. Levy added, "[It] will be a very long time, if ever, before text-to-video threatens actual filmmaking."
Lisa Lacy of CNET called its example videos "remarkably realistic – except perhaps when a human face appears close up or when sea creatures are swimming".
Filmmaker Tyler Perry announced he would be putting a planned $800 million expansion of his Atlanta studio on hold, expressing concern about Sora's potential impact on the film industry.
See also
* Dream Machine (text-to-video model)