LAION (acronym for Large-scale Artificial Intelligence Open Network) is a German non-profit which makes open-source artificial intelligence models and datasets. It is best known for releasing several large datasets of images and captions scraped from the web, which have been used to train high-profile text-to-image models, including Stable Diffusion and Imagen.

In February 2023, LAION was named as a non-party in the Getty Images lawsuit against Stability AI, the company behind Stable Diffusion. In April 2023, LAION was directly sued by a German photographer who wanted his images removed from the training set. In September 2024, the Regional Court of Hamburg dismissed the lawsuit, in what was described as a "landmark ruling on TDM [text and data mining] exceptions for AI training data" in Germany and the EU more generally. On April 15, 2023, LAION and contributors publicly released an open-source AI assistant chatbot called OpenAssistant.


Image datasets

LAION has publicly released a number of large datasets of image-caption pairs which have been widely used by AI researchers. The data is derived from Common Crawl, a dataset of scraped web pages. The developers searched the crawled HTML for <img> tags and treated their alt attributes as captions. They used CLIP to identify and discard images whose content did not appear to match their captions. LAION does not host the scraped images themselves; rather, the datasets contain URLs pointing to images, which researchers must download themselves.

The first such dataset, LAION-400M, was released in August 2021 and consisted of 400 million image-caption pairs. The pairs were extracted from a random subset of web pages scraped by Common Crawl between 2014 and 2021. It was an attempt to recreate the process used by OpenAI to collect the 400 million image-caption pairs used to train the CLIP model; the company had chosen to open-source the model's code and weights, but not its training dataset. Imagen, a text-to-image model announced by Google Brain in 2022, was trained on LAION-400M in combination with private internal datasets.

A successor of more than 5 billion pairs, LAION-5B, was released in March 2022. At the time of its release, it was the largest freely available dataset of image-caption pairs in existence. Its creation was funded by Doodlebot, Hugging Face and Stability AI, the AI company behind the funding of the Stable Diffusion text-to-image model, which was trained on it.
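
The extraction and filtering steps described in this section can be illustrated with a minimal Python sketch. It is not LAION's actual pipeline: the names extract_pairs, filter_pairs, embed_image and embed_text are hypothetical, the embedding functions stand in for a CLIP image encoder and text encoder, and the similarity threshold is a placeholder that the caller must choose. The sketch only shows the general idea of reading alt text from image tags and keeping pairs whose image and caption embeddings are similar enough.

```python
from html.parser import HTMLParser
from math import sqrt


class ImgAltExtractor(HTMLParser):
    """Collect (image URL, alt text) pairs from <img> tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        # A missing or valueless alt attribute is treated as "no caption".
        alt = (attr.get("alt") or "").strip()
        if src and alt:
            self.pairs.append((src, alt))


def extract_pairs(html):
    """Return every (src, alt) pair with a non-empty alt text."""
    parser = ImgAltExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_pairs(pairs, embed_image, embed_text, threshold):
    """Keep pairs whose image and caption embeddings are similar enough.

    embed_image and embed_text are hypothetical stand-ins for a CLIP image
    encoder and text encoder; threshold is an assumed cut-off.
    """
    kept = []
    for url, caption in pairs:
        if cosine(embed_image(url), embed_text(caption)) >= threshold:
            kept.append((url, caption))
    return kept
```

Note that the output is a list of URL and caption pairs rather than image files, which mirrors the way the datasets point to images instead of hosting them.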


Criticism

Several studies show that LAION-5B contains problematic image-text pairs depicting rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content. An investigation by Bayerischer Rundfunk showed that LAION's datasets, hosted on Hugging Face, contain large amounts of private and sensitive data harvested from public websites.

In December 2023, the Stanford Internet Observatory released a report on LAION-5B that found 3,226 suspected instances of links to child sexual abuse material, 1,008 of which were externally validated. In response, LAION temporarily removed LAION-5B and LAION-400M, citing its "zero tolerance policy for illegal content" and "an abundance of caution". In August 2024, LAION released a cleaned dataset called Re-LAION-5B.


OpenAssistant

OpenAssistant is an open-source, chat-based artificial intelligence (AI) assistant that understands tasks and can interact with third-party systems and retrieve information dynamically to carry them out. The project is developed by a group of volunteers in collaboration with LAION. One of its development goals is free access to large language models that can be run locally on consumer hardware. The project is backed by a worldwide crowdsourcing effort involving over 13,500 volunteers, who have created 600,000 human-generated data points.


See also

* Artificial intelligence and copyright

