
Tatoeba is a free collection of example sentences with translations geared towards foreign language learners. Its name comes from the Japanese phrase "tatoeba" (例えば), meaning "for example". It is written and maintained by a community of volunteers through a model of open collaboration. Individual contributors are known as Tatoebans. It is hosted by Association Tatoeba, a French non-profit organization funded through donations.

As of November 2022, the Tatoeba Corpus has over 10,800,000 sentences in 420 languages, 55 of which have 10,000 or more sentences. About 1 million sentences have audio recordings. The sentences are linked to one another in a graph, which makes it possible to find translations across languages. As of November 2022, the Tatoeba Graph lists over 21,800,000 links between sentences, and 237 language pairs have over 10,000 translated sentences.


History

In 2006, Trang Ho was frustrated that German bilingual dictionaries, unlike some of their Japanese counterparts, didn't feature full-text search of usage examples with translations. It led her to imagine her ideal dictionary and to build a prototype hosted on SourceForge under the name "multilangdict." The main focus was already the crowdsourcing of translated sentences: "A Wikipedia type of thing, except people add sentences, not articles." Alongside her studies at the University of Technology of Compiègne, Trang Ho gradually improved her website with a few classmates. She rebuilt the project from scratch twice and rebranded it as Tatoeba.

In September 2007, about 150,000 English-Japanese sentence pairs from the Tanaka Corpus (a public-domain compilation released in 2001 by Hyogo University professor Yasuhito Tanaka and maintained by Jim Breen and Paul Blay) were imported into the Tatoeba Corpus. In December 2008, Trang Ho released the first version of the current codebase, built around a more flexible data model. The following month, the website moved to the tatoeba.org domain.

Over the 2009-2010 academic year, Allan Simon, then a student at SUPINFO, became a core developer of Tatoeba. Together with Trang Ho and other young developers, he made Tatoeba more social, adding sentence lists, user profiles, private messaging, and a Facebook-inspired Wall. They also introduced significant features like sentence linking, tagging, and "translation of translation" search. In November 2010, Tatoeba passed the 600,000-sentence mark. Within a year, the number of sentences added daily had increased almost 50-fold.

Between 2014 and 2016, a new team of developers formed around Trang Ho. They mentored students at the Google Summer of Code 2014 and added features to improve corpus quality. Over the 2018-2020 period, support from the Mozilla Foundation as part of the Common Voice project allowed Tatoeba to make its platform more open and user-friendly.


Openness


Reading

Users, even those who are not registered, can search for words in any language to retrieve sentences that use them. Each sentence in the Tatoeba Corpus is displayed next to its likely translations in other languages; translations and "translations of translations" are differentiated. Sentences are tagged for content such as subject matter, dialect, or vulgarity. Each sentence also has its own comment thread, where other users can leave feedback, corrections, and cultural notes. Sentences can be browsed by language, tag, and other criteria.
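
The distinction between translations and "translations of translations" can be pictured as a walk over a graph of linked sentences: direct translations are one link away, and indirect ones are two links away. The following Python sketch illustrates the idea only. The sample data, the function names `build_graph` and `translations`, and the choice to treat links as bidirectional are all assumptions made for this example, not a description of how Tatoeba is implemented.

```python
from collections import defaultdict

# Hypothetical toy data: sentence id -> (language code, text).
SENTENCES = {
    1: ("eng", "For example."),
    2: ("jpn", "例えば。"),
    3: ("fra", "Par exemple."),
    4: ("deu", "Zum Beispiel."),
}

# Hypothetical translation links between sentence ids.
LINKS = [(1, 2), (1, 3), (3, 4)]


def build_graph(links):
    """Return an adjacency map that treats every link as bidirectional."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def translations(graph, sentence_id):
    """Return (direct, indirect) sets of sentence ids for one sentence.

    Direct translations are one link away; indirect ones
    ("translations of translations") are exactly two links away.
    """
    direct = set(graph.get(sentence_id, ()))
    indirect = set()
    for neighbor in direct:
        indirect |= graph.get(neighbor, set())
    # Drop the sentence itself and anything already counted as direct.
    indirect -= direct | {sentence_id}
    return direct, indirect


graph = build_graph(LINKS)
direct, indirect = translations(graph, 1)
for label, ids in (("direct", direct), ("indirect", indirect)):
    for sid in sorted(ids):
        lang, text = SENTENCES[sid]
        print(f"{label}: [{lang}] {text}")
```

In this toy graph, the German sentence is only linked to the French one, so it shows up as an indirect translation of the English sentence rather than a direct one.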


Editing

Registered users can add new sentences or translate or proofread existing ones, even if their target language is not their native tongue. However, users are encouraged to add original sentences or translations in their native or strongest language. Users can freely edit their sentences, "adopt" and correct sentences without an owner, and comment on others' sentences. Advanced contributors, a rank above ordinary contributors, can tag, link, and unlink sentences. Corpus maintainers, a rank above advanced contributors, can untag and delete sentences. They can also modify owned sentences, though they typically do so only if the owner fails to respond to a request to make the change.


Operation

Tatoeba received a grant from Mozilla Drumbeat in December 2010. Some work on the Tatoeba infrastructure was sponsored by the 2014 edition of Google Summer of Code. In May 2018, the project received a $25,000 grant from the Mozilla Open Source Support (MOSS) program, followed by a $15,000 MOSS grant in August 2019.


Access to content


Content licensing

By default, the sentences of the Tatoeba Corpus are published under a Creative Commons Attribution 2.0 license, freeing them for academic and other use. Users can also contribute sentences under Creative Commons Zero, though translations of those sentences currently can't share the same license. Audio recordings of the sentences use the speaker's choice of license, such as CC BY 4.0, BY-SA, BY-NC, or no public license at all.


Offline use

Visitors can download tab-delimited sentence pairs from the Tatoeba website, ready for import into Anki and similar spaced repetition software.
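
As a rough illustration of working with such a download, the Python sketch below reads tab-delimited sentence pairs into front/back flashcard tuples. The two-column layout (source sentence, then translation), the sample data, and the function name `read_pairs` are assumptions made for this example; the actual columns of a given download may differ and should be checked first.

```python
import csv
import io

# Hypothetical sample: one sentence pair per line, source text then translation.
SAMPLE_TSV = "Thank you.\tMerci.\nGood morning.\tBonjour.\n"


def read_pairs(handle):
    """Yield (front, back) card tuples from a tab-delimited text stream."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield row[0].strip(), row[1].strip()


cards = list(read_pairs(io.StringIO(SAMPLE_TSV)))
for front, back in cards:
    print(f"{front} -> {back}")
```

To read a real file instead of the in-memory sample, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`.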


Related projects


Second-language acquisition

The JMdict Japanese-English dictionary selects its example sentences from the Tatoeba Corpus. OpenRussian is a free Russian dictionary built primarily from the content of Wiktionary and Tatoeba. GoodExample tries to automatically extract a diverse set of high-quality example sentences from the English Tatoeba Corpus. Reverso uses Tatoeba parallel corpora in its bilingual concordancer. Charles Kelly and Paul Raine, both EFL teachers in Japan, have developed language learning activities based on sentences curated from the Tatoeba Corpus. Clozemaster is a language self-study program that generates gamified cloze tests from Tatoeba sentence pairs. Some Anki users share flashcards that were created using Tatoeba.

Tatoeba datasets can power incidental learning experiences that blend the acquisition of a foreign language with the user's everyday activities like web browsing or book reading. A team at MIT Media Lab used example sentences from Tatoeba in WordSense, a mixed reality platform that enables "serendipitous language learning in the wild." More recently, Japanese researchers implemented a Tatoeba search feature in an integrated writing assistance environment.


Regional or minority languages

Some digital language activists contribute to open collaborative projects like Tatoeba, Wikipedia, and Common Voice to promote their minority language in digital spaces. Regional languages like Kabyle, Catalan, or Basque can each count more than a hundred members on Tatoeba.


Constructed languages

Selected content from Tatoeba in Esperanto is available in the multilingual DVD ''Esperanto Elektronike'' published by E@I. As of November 2022, Esperanto is Tatoeba's fifth pivot language, with over 330,000 sentences translated into at least two languages. Other constructed languages like Toki Pona, Interlingua, Klingon, Lojban, and Ido also have a significant footprint.


Language technology

From 2008 to 2011, Francis Bond used the Tatoeba Corpus for his research on the Japanese language. Since 2013, Jörg Tiedemann has been spreading Tatoeba parallel corpora more widely in the machine translation community by sharing them on the OPUS repository and organizing the "Tatoeba Translation Challenge". With the rise of deep learning, researchers increasingly use Tatoeba's data sets to train and evaluate their massively multilingual models in tasks like machine translation, language identification, semantic search, and speech recognition.


See also

* Phrase book
* Parallel text
* Common Voice
* Lingua Libre
* Wiktionary


External links

* Video of Trang Ho introducing Tatoeba at MozFest 2019
* Tatoeba's statistics
* Tatoeba Translation Challenge