A parallel text is a text placed alongside its translation or translations. Parallel text alignment is the identification of the corresponding sentences in both halves of the parallel text. The Loeb Classical Library and the Clay Sanskrit Library are two examples of dual-language series of texts. Reference Bibles may contain the original languages and a translation, or several translations by themselves, for ease of comparison and study; Origen's ''Hexapla'' (Greek for "sixfold") placed six versions of the Old Testament side by side. A famous example is the Rosetta Stone, whose discovery allowed the decipherment of the Ancient Egyptian language to begin.

Large collections of parallel texts are called parallel corpora (see text corpus). Alignments of parallel corpora at sentence level are a prerequisite for many areas of linguistic research. During translation, sentences can be split, merged, deleted, inserted or reordered by the translator, which makes alignment a non-trivial task. Parallel texts may also be used in language education.
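Because a translator may split, merge, delete or insert sentences, an aligner has to consider more than one-to-one pairings. The following is a minimal sketch of that idea in Python, not a published algorithm: dynamic programming chooses among 1-1, 1-2, 2-1, 1-0 and 0-1 groupings using a simple character-length cost. The names `align` and `length_cost`, the cost function and the penalty values are all illustrative assumptions, and reordered sentences are not handled.

```python
from typing import List, Optional, Tuple

# Allowed groupings (source count, target count) with a fixed penalty each.
# The penalty values are illustrative assumptions, not tuned parameters.
BEAD_PENALTY = {(1, 1): 0.0, (1, 2): 1.0, (2, 1): 1.0, (1, 0): 3.0, (0, 1): 3.0}

Bead = Tuple[List[str], List[str]]


def length_cost(src: List[str], tgt: List[str]) -> float:
    """Relative difference in character length between two sentence groups."""
    a = sum(len(s) for s in src)
    b = sum(len(t) for t in tgt)
    return abs(a - b) / max(a, b, 1)


def align(src: List[str], tgt: List[str]) -> List[Bead]:
    """Return the monotone alignment with the lowest total cost."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    # best[i][j]: lowest cost of aligning the first i source sentences
    # with the first j target sentences.
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[Tuple[int, int]]]] = [
        [None] * (m + 1) for _ in range(n + 1)
    ]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for (di, dj), penalty in BEAD_PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cost = best[i][j] + penalty + length_cost(src[i:ni], tgt[j:nj])
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the chosen groupings.
    beads: List[Bead] = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Each element of the returned list pairs a group of source sentences with a group of target sentences; an empty group on one side marks a deletion or an insertion.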

Types of parallel corpora

Parallel corpora can be classified into four main categories:
* A ''parallel corpus'' contains translations of the same document in two or more languages, aligned at least at the sentence level. These tend to be rarer than less-comparable corpora.
* A ''noisy parallel corpus'' contains bilingual sentences that are not perfectly aligned or have poor quality translations. Nevertheless, most of its contents are bilingual translations of a specific document.
* A ''comparable corpus'' is built from non-sentence-aligned and untranslated bilingual documents, but the documents are topic-aligned.
* A ''quasi-comparable corpus'' includes very heterogeneous and non-parallel bilingual documents that may or may not be topic-aligned.

Noise in corpora

Large corpora used as training sets for machine translation algorithms are usually extracted from large bodies of similar sources, such as databases of news articles written in the first and second languages describing similar events. However, extracted fragments may be noisy, with extra elements inserted in each corpus. Extraction techniques can differentiate between bilingual elements represented in both corpora and monolingual elements represented in only one corpus in order to extract cleaner parallel fragments of bilingual elements. Comparable corpora are used to directly obtain knowledge for translation purposes. High-quality parallel data is difficult to obtain, however, especially for under-resourced languages.
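The distinction between bilingual and monolingual elements can be illustrated with a toy example. The sketch below assumes a small hand-written English-to-French lexicon (`LEXICON`) and a hypothetical helper `split_elements`; it is not a published extraction method, and a real system would rely on a much larger learned or curated lexicon.

```python
from typing import Dict, List, Set

# Toy English -> French lexicon, invented for illustration only.
LEXICON: Dict[str, Set[str]] = {
    "president": {"président"},
    "visited": {"visité"},
    "paris": {"paris"},
    "monday": {"lundi"},
}


def split_elements(
    src: List[str], tgt: List[str], lexicon: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """Separate tokens that have a counterpart on the other side from those that do not."""
    tgt_set = set(tgt)
    matched_tgt: Set[str] = set()
    src_bilingual: List[str] = []
    src_monolingual: List[str] = []
    for word in src:
        # Translations of this source word that actually occur in the target.
        hits = lexicon.get(word, set()) & tgt_set
        if hits:
            src_bilingual.append(word)
            matched_tgt |= hits
        else:
            src_monolingual.append(word)
    return {
        "src_bilingual": src_bilingual,
        "tgt_bilingual": [w for w in tgt if w in matched_tgt],
        "src_monolingual": src_monolingual,
        "tgt_monolingual": [w for w in tgt if w not in matched_tgt],
    }


result = split_elements(
    "the president visited paris amid heavy rain".split(),
    "lundi le président a visité paris".split(),
    LEXICON,
)
```

Here `result["src_bilingual"]` and `result["tgt_bilingual"]` hold the cleaner parallel fragment, while the two monolingual lists hold the material present in only one text. Function words such as "the" end up in the monolingual lists only because the toy lexicon omits them.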

Bitext

In the field of translation studies, a bitext is a merged document composed of both source- and target-language versions of a given text. Bitexts are generated by a piece of software called an ''alignment tool'', or a ''bitext tool'', which automatically aligns the original and translated versions of the same text. The tool generally matches these two texts sentence by sentence. A collection of bitexts is called a ''bitext database'' or a ''bilingual corpus'', and can be consulted with a search tool.

Bitexts and translation memories

''Bitexts'' have some similarities with translation memories. The most salient difference is that a translation memory loses the original context, while a bitext retains the original sentence order. That said, some implementations of translation memory, such as Translation Memory eXchange (TMX), a standard XML format for exchanging translation memories between computer-assisted translation (CAT) programs, allow preserving the original order of sentences.

Bitexts are designed to be consulted by a human translator, not by a machine. As such, small alignment errors or minor discrepancies that would cause a translation memory to fail are of no importance. In the 1988 article that introduced the term, Brian Harris also posited that a bitext represents how translators hold their source and target texts together in their mental working memories as they progress. However, this hypothesis has not been followed up.

Online bitexts and translation memories may also be called online bilingual concordances. Several are available on the public Web, including Linguee, Reverso, and TradooIT.
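To make the TMX remark concrete, the sketch below reads a small hand-written TMX document with Python's standard `xml.etree.ElementTree` module and returns the translation units in document order. The sample content, the header attribute values and the function name `read_tmx` are assumptions for illustration; only the basic `tu`/`tuv`/`seg` structure is used, and real TMX files carry more metadata and inline markup.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-written sample; the header values are placeholders.
SAMPLE = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0.1" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="fr"><seg>Merci.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx(text: str, src: str = "en", tgt: str = "fr") -> List[Tuple[str, str]]:
    """Return (source, target) pairs in the order the units appear in the file."""
    root = ET.fromstring(text)
    pairs: List[Tuple[str, str]] = []
    for unit in root.iter("tu"):
        segments = {}
        for variant in unit.findall("tuv"):
            seg = variant.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline elements.
                segments[variant.get(XML_LANG)] = "".join(seg.itertext())
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs


pairs = read_tmx(SAMPLE)
```

Because the units are read in document order, a TMX file that was written sentence by sentence gives back the original sequence, which is the property the paragraph above describes.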

External links

Parallel corpora

* The JRC-Acquis Multilingual Parallel Corpus of the total body of European Union (EU) law, the ''Acquis Communautaire'', with 231 language pairs
* European Parliament Proceedings Parallel Corpus 1996–2011
* The Opus project aims at collecting freely available parallel corpora
* http://www.linguateca.pt/COMPARA/ COMPARA – Portuguese/English parallel corpora
* TERMSEARCH – English/Russian/French parallel corpora (major international treaties, conventions, agreements, etc.)
* TradooIT – English/French/Spanish – free online tools
* Nunavut Hansard – English/Inuktitut parallel corpus
* ParaSol – A parallel corpus of Slavic and other languages
* Glosbe: Multilanguage parallel corpora with online search interface
* InterCorp: A multilingual parallel corpus, 40 languages aligned with Czech, online search interface
* myCAT – Olanto concordancer (open source AGPL) with online search on JRC and UNO corpus
* TAUS with online search interface
* linguatools multilingual parallel corpora, online search interface
* EUR-Lex Corpus – corpus built up of the EUR-Lex database, which consists of European Union law and other public documents of the European Union
* Language Grid – Multilingual service platform that includes parallel text services

Documentation

* https://web.archive.org/web/20060913013656/https://www.cs.unt.edu/~rada/wpt/ Proceedings of the 2003 Workshop on Building and Using Parallel Texts
* Proceedings of the 2005 Workshop on Building and Using Parallel Texts

Alignment tools

* Uplug – tools for processing parallel corpora (2003)
* An implementation of the Gale and Church sentence alignment algorithm (2005)
* The Hunalign sentence aligner (2005)
* Champollion (2006)
* mALIGNa (2008–2020)
* Gargantua sentence aligner (2010)
* Bleualign – machine translation based sentence alignment (2010)
* YASA (2013)
* Hierarchical alignment tool (HAT) (2018)
* Vecalign sentence alignment algorithm (2019)
* Web Alignment Tool at University of Grenoble