Tatoeba is a free collection of example sentences with translations, geared towards foreign language learners. Its name comes from the Japanese phrase "tatoeba" (例えば), meaning "for example". It is written and maintained by a community of volunteers through a model of open collaboration. Individual contributors are known as Tatoebans. It is hosted by Association Tatoeba, a French non-profit organization funded through donations.
As of November 2022, the Tatoeba Corpus has over 10,800,000 sentences in 420 languages. 55 of these languages have 10,000 or more sentences. About 1 million sentences have audio recordings.
The sentences are interrelated within a graph, facilitating translations in different languages. As of November 2022, the Tatoeba Graph lists over 21,800,000 links between sentences. 237 language pairs have over 10,000 translated sentences.
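The idea of sentences linked in a graph can be illustrated with a minimal sketch. It assumes a toy in-memory structure: the sentence ids, the sample links, and the helper names `build_graph` and `translations` are all hypothetical and do not reflect Tatoeba's actual storage or code. It shows how direct translations (linked sentences) can be told apart from "translations of translations" (sentences two links away).

```python
from collections import defaultdict

# Hypothetical toy data: sentence id -> (language code, text).
sentences = {
    1: ("eng", "For example."),
    2: ("jpn", "例えば。"),
    3: ("fra", "Par exemple."),
    4: ("deu", "Zum Beispiel."),
}

# Illustrative undirected links between sentence ids (not real corpus links).
links = [(1, 2), (1, 3), (3, 4)]


def build_graph(pairs):
    """Build an adjacency map from undirected (id, id) links."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def translations(graph, sentence_id):
    """Return (direct, indirect) translation ids for one sentence.

    Direct translations are one link away; indirect ones are two links
    away and are neither the sentence itself nor a direct translation.
    """
    direct = set(graph.get(sentence_id, set()))
    indirect = set()
    for neighbor in direct:
        indirect |= graph.get(neighbor, set())
    indirect -= direct | {sentence_id}
    return direct, indirect


graph = build_graph(links)
direct, indirect = translations(graph, 2)
print(direct, indirect)  # {1} {3}
```

In this sketch the Japanese sentence is linked only to the English one, so the French sentence is reached through English as an indirect translation, while the German sentence (three links away) is not reported at all.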
History
In 2006, Trang Ho was frustrated that, unlike some of their Japanese counterparts, German bilingual dictionaries didn't feature full-text search of usage examples with translations. This led her to imagine her ideal dictionary and to build a prototype hosted on SourceForge under the name "multilangdict." The main focus was already the crowdsourcing of translated sentences: "A Wikipedia type of thing, except people add sentences, not articles."
Alongside her studies at the University of Technology of Compiègne, Trang Ho gradually improved her website with a few classmates. She rebuilt the project from scratch twice and rebranded it as Tatoeba. In September 2007, about 150,000 English-Japanese sentence pairs from the Tanaka Corpus (a public-domain compilation released in 2001 by Hyogo University professor Yasuhito Tanaka and maintained by Jim Breen and Paul Blay) were imported into the Tatoeba Corpus. In December 2008, Trang Ho released the first version of the current codebase, built around a more flexible data model. The following month, the website moved to the tatoeba.org domain.
Over the 2009-2010 academic year, Allan Simon, then a student at SUPINFO, became a core developer of Tatoeba. Together with Trang Ho and other young developers, he made Tatoeba more social by adding sentence lists, user profiles, private messaging, and a Facebook-inspired Wall. They also introduced significant features like sentence linking, tagging, and "translation of translation" search. In November 2010, Tatoeba passed the 600,000 sentences mark. Within a year, the number of sentences added daily had increased almost 50-fold.
Between 2014 and 2016, a new team of developers formed around Trang Ho. They mentored students during Google Summer of Code 2014 and added features to improve corpus quality.
Over the 2018-2020 period, support from the Mozilla Foundation as part of the Common Voice project allowed Tatoeba to make its platform more open and user-friendly.
Openness
Reading
Users, even those who are not registered, can search for words in any language to retrieve sentences that use them. Each sentence in the Tatoeba Corpus is displayed next to its likely translations in other languages; translations and "translations of translations" are differentiated. Sentences are tagged for content such as subject matter, dialect, or vulgarity. Each sentence also has its own comment thread for feedback, corrections, and cultural notes from other users. Sentences can be browsed by language, tag, and other criteria.
Editing
Registered users can add new sentences or translate or proofread existing ones, even if their target language is not their native tongue. However, users are encouraged to add original sentences or translations in their native or strongest language.
Users can freely edit their sentences, "adopt" and correct sentences without an owner, and comment on others' sentences. Advanced contributors, a rank above ordinary contributors, can tag, link, and unlink sentences. Corpus maintainers, a rank above advanced contributors, can untag and delete sentences. They can also modify owned sentences, though they typically do so only if the owner fails to respond to a request to make the change.
Operation
Tatoeba received a grant from Mozilla Drumbeat in December 2010.
Some work on the Tatoeba infrastructure was sponsored by the 2014 edition of Google Summer of Code.
In May 2018, Tatoeba received a $25,000 Mozilla Open Source Support (MOSS) program grant.
In August 2019, it received a further $15,000 MOSS program grant.
Access to content
Content licensing
By default, the sentences of the Tatoeba Corpus are published under a Creative Commons Attribution 2.0 license, freeing them for academic and other use. Users can also contribute sentences under Creative Commons Zero, though translations of those sentences currently can't share the same license.
Audio recordings of the sentences use the speaker's choice of license, such as CC BY 4.0, BY-SA, BY-NC, or no public license at all.
Offline use
At the Tatoeba website, visitors can download tab-delimited sentence pairs ready for import into Anki and similar spaced repetition software.
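As a rough illustration of working with such a download, the sketch below reads tab-delimited sentence pairs with Python's standard `csv` module. The two-column layout (source text, then translation), the sample data, and the helper name `read_pairs` are assumptions made for this example; the actual export may carry additional columns such as sentence ids, so the column indexes would need adjusting.

```python
import csv
import io

# Hypothetical sample in an assumed two-column layout: source<TAB>translation.
SAMPLE = "Thank you.\tMerci.\nFor example.\tPar exemple.\n"


def read_pairs(handle):
    """Read (source, translation) tuples from a tab-delimited text stream."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        pairs.append((row[0].strip(), row[1].strip()))
    return pairs


pairs = read_pairs(io.StringIO(SAMPLE))
print(pairs)  # [('Thank you.', 'Merci.'), ('For example.', 'Par exemple.')]
```

To read a downloaded file instead of the in-memory sample, the same function can be given a file object opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the file was saved.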
Related projects
Second-language acquisition
The JMdict Japanese-English dictionary selects its example sentences from the Tatoeba Corpus. OpenRussian is a free Russian dictionary built primarily from the content of Wiktionary and Tatoeba. GoodExample tries to automatically extract a diverse set of high-quality example sentences from the English Tatoeba Corpus. Reverso uses Tatoeba parallel corpora in its bilingual concordancer.
Charles Kelly and Paul Raine, both EFL teachers in Japan, have developed language learning activities based on sentences curated from the Tatoeba Corpus. Clozemaster is a language self-study program that generates gamified cloze tests from Tatoeba sentence pairs. Some Anki users share flashcards that were created using Tatoeba.
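A cloze test blanks out one word of a sentence and asks the learner to supply it. The sketch below is a minimal illustration of that idea only; it is not Clozemaster's method, and the function name `make_cloze` and the sample sentence are hypothetical. It also assumes a language that separates words with spaces, since it relies on regular-expression word boundaries.

```python
import re


def make_cloze(sentence, target_word, blank="_____"):
    """Replace the first whole-word occurrence of target_word with a blank.

    Returns None when the word does not occur in the sentence.
    """
    pattern = re.compile(r"\b" + re.escape(target_word) + r"\b", re.IGNORECASE)
    cloze, count = pattern.subn(blank, sentence, count=1)
    return cloze if count else None


prompt = make_cloze("I drink coffee every morning.", "coffee")
print(prompt)  # I drink _____ every morning.
```

With a sentence pair, the translation could be shown alongside the blanked sentence as a hint, which is one plausible way a pair-based exercise might be presented.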
Tatoeba datasets can power incidental learning experiences that blend the acquisition of a foreign language with the user's everyday activities like web browsing or book reading. A team at MIT Media Lab used example sentences from Tatoeba in WordSense, a mixed reality platform that enables "serendipitous language learning in the wild." More recently, Japanese researchers implemented a Tatoeba search feature in an integrated writing assistance environment.
Regional or minority languages
Some language digital activists contribute to open collaborative projects like Tatoeba, Wikipedia, and Common Voice to promote their minority language in digital spaces. Regional languages such as Kabyle, Catalan, or Basque can each count more than a hundred members on Tatoeba.
Constructed languages
Selected content from Tatoeba in Esperanto is available on the multilingual DVD Esperanto Elektronike, published by E@I. As of November 2022, Esperanto is Tatoeba's fifth pivot language, with over 330,000 sentences translated into at least two languages.
Other constructed languages like Toki Pona, Interlingua, Klingon, Lojban, and Ido also have a significant footprint.
Language technology
From 2008 to 2011, Francis Bond used the Tatoeba Corpus for his research on the Japanese language.
Since 2013, Jörg Tiedemann has been spreading Tatoeba parallel corpora more widely in the machine translation community by sharing them on the OPUS repository and organizing the "Tatoeba Translation Challenge". With the rise of deep learning, researchers increasingly use Tatoeba's data sets to train and evaluate their massively multilingual models in tasks like machine translation, language identification, semantic search, and speech recognition.
See also
* Phrase book
* Parallel text
* Common Voice
* Lingua Libre
* Wiktionary
External links
* Video of Trang Ho introducing Tatoeba at MozFest 2019
* Tatoeba's statistics
* Tatoeba Translation Challenge