
A parallel text is a text placed alongside its translation or translations.
Parallel text alignment is the identification of the corresponding sentences in both halves of the parallel text. The Loeb Classical Library and the Clay Sanskrit Library are two examples of dual-language series of texts. Reference Bibles may contain the original languages and a translation, or several translations by themselves, for ease of comparison and study; Origen's Hexapla (Greek for "sixfold") placed six versions of the Old Testament side by side. A famous example is the Rosetta Stone, whose discovery allowed the decipherment of the Ancient Egyptian language to begin.
Large collections of parallel texts are called parallel corpora (see text corpus). Alignments of parallel corpora at sentence level are a prerequisite for many areas of linguistic research.
During translation, sentences can be split, merged, deleted, inserted or reordered by the translator. This makes alignment a non-trivial task.
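The difficulty can be illustrated with a minimal sketch of length-based sentence alignment by dynamic programming. It is loosely in the spirit of length-based aligners, not a faithful implementation of any published algorithm: the bead types, the penalty values in `BEAD_PENALTY` and the `length_mismatch` cost are illustrative assumptions. The sketch handles splits, merges, deletions and insertions, but it assumes sentence order is preserved, so it cannot recover reordered sentences.

```python
from typing import List, Tuple

# Allowed alignment "beads" as (source sentences, target sentences) consumed,
# each with an illustrative penalty. The values are assumptions, not tuned.
BEAD_PENALTY = {
    (1, 1): 0.0,  # one-to-one translation
    (1, 0): 3.0,  # source sentence deleted by the translator
    (0, 1): 3.0,  # target sentence inserted by the translator
    (2, 1): 1.0,  # two source sentences merged into one
    (1, 2): 1.0,  # one source sentence split into two
}

Bead = Tuple[List[str], List[str]]


def length_mismatch(src_chars: int, tgt_chars: int) -> float:
    """Relative difference in character length, between 0 and 1."""
    total = src_chars + tgt_chars
    return abs(src_chars - tgt_chars) / total if total else 0.0


def align(src: List[str], tgt: List[str]) -> List[Bead]:
    """Return the lowest-cost monotone alignment of two sentence lists."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEAD_PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_chars = sum(len(s) for s in src[i:ni])
                tgt_chars = sum(len(t) for t in tgt[j:nj])
                candidate = (cost[i][j] + penalty
                             + length_mismatch(src_chars, tgt_chars))
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the beads in order.
    beads: List[Bead] = []
    i, j = n, m
    while i > 0 or j > 0:
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    beads.reverse()
    return beads


english = ["The cat sat.", "It was tired.", "Then it slept for a long time."]
french = ["Le chat s'est assis, il était fatigué.", "Puis il a dormi longtemps."]
for source_group, target_group in align(english, french):
    print(source_group, "<->", target_group)
```

With these assumed penalties, the first two English sentences are grouped with the first French sentence, reflecting a merge made by the translator.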
Parallel texts may be used in language education.
Types of parallel corpora
Parallel corpora can be classified into four main categories:
* A ''parallel corpus'' contains translations of the same document in two or more languages, aligned at least at the sentence level. These tend to be rarer than less-comparable corpora.
* A ''noisy parallel corpus'' contains bilingual sentences that are not perfectly aligned or have poor quality translations. Nevertheless, most of its contents are bilingual translations of a specific document.
* A ''comparable corpus'' is built from non-sentence-aligned and untranslated bilingual documents, but the documents are topic-aligned.
* A ''quasi-comparable corpus'' includes very heterogeneous and non-parallel bilingual documents that may or may not be topic-aligned.
Noise in corpora
Large corpora used as training sets for machine translation algorithms are usually extracted from large bodies of similar sources, such as databases of news articles written in the first and second languages describing similar events.
However, extracted fragments may be noisy, with extra elements inserted in each corpus. Extraction techniques can differentiate between bilingual elements represented in both corpora and monolingual elements represented in only one corpus in order to extract cleaner parallel fragments of bilingual elements. Comparable corpora are used to directly obtain knowledge for translation purposes. High-quality parallel data is difficult to obtain, however, especially for under-resourced languages.
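The distinction between bilingual and monolingual elements can be sketched with a toy example. The code below assumes a small hand-written English-to-German lexicon (`LEXICON`, hypothetical) and whitespace-tokenised, lower-cased input; real extraction systems typically rely on statistically learned lexicons and more robust scoring. A token counts as bilingual if the lexicon links it to a token on the other side, and the longest unbroken run of bilingual tokens on each side is kept as the cleaner fragment.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon: English word -> possible German translations.
LEXICON: Dict[str, Set[str]] = {
    "the": {"der", "die", "das"},
    "government": {"regierung"},
    "signed": {"unterzeichnete"},
    "agreement": {"abkommen"},
}


def mark_bilingual(src: List[str], tgt: List[str],
                   lexicon: Dict[str, Set[str]]) -> Tuple[List[bool], List[bool]]:
    """Flag each token as bilingual (True) or monolingual (False)."""
    tgt_vocab = set(tgt)
    # A source token is bilingual if one of its translations occurs in tgt.
    src_flags = [bool(lexicon.get(word, set()) & tgt_vocab) for word in src]
    # A target token is bilingual if some source token translates to it.
    covered: Set[str] = set()
    for word in src:
        covered |= lexicon.get(word, set())
    tgt_flags = [word in covered for word in tgt]
    return src_flags, tgt_flags


def longest_run(tokens: List[str], flags: List[bool]) -> List[str]:
    """Return the longest contiguous stretch of tokens flagged True."""
    best_start, best_len, start = 0, 0, None
    for idx, flag in enumerate(flags + [False]):  # sentinel closes last run
        if flag and start is None:
            start = idx
        elif not flag and start is not None:
            if idx - start > best_len:
                best_start, best_len = start, idx - start
            start = None
    return tokens[best_start:best_start + best_len]


def extract_fragment(src: List[str], tgt: List[str],
                     lexicon: Dict[str, Set[str]]) -> Tuple[List[str], List[str]]:
    src_flags, tgt_flags = mark_bilingual(src, tgt, lexicon)
    return longest_run(src, src_flags), longest_run(tgt, tgt_flags)


english = "the government signed the agreement on monday".split()
german = "die regierung unterzeichnete das abkommen trotz protesten".split()
print(extract_fragment(english, german, LEXICON))
```

Here "on monday" and "trotz protesten" appear on only one side, so they are treated as monolingual noise and dropped from the extracted fragment.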
Bitext
In the field of translation studies, a bitext is a merged document composed of both source- and target-language versions of a given text.
Bitexts are generated by a piece of software called an ''alignment tool'', or a ''bitext tool'', which automatically aligns the original and translated versions of the same text. The tool generally matches these two texts sentence by sentence. A collection of bitexts is called a ''bitext database'' or a ''bilingual corpus'', and can be consulted with a search tool.
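As a rough illustration of how a bitext database might be consulted, the sketch below stores aligned sentence pairs in memory and returns every pair in which a search term occurs on either side. The `Segment` type and the `concordance` function are hypothetical names chosen for this example; production search tools usually add indexing, tokenisation and ranking.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    """One aligned sentence pair of a bitext."""
    source: str
    target: str


def concordance(bitext: List[Segment], term: str) -> List[Segment]:
    """Return aligned pairs containing the term, ignoring case."""
    needle = term.casefold()
    return [
        seg for seg in bitext
        if needle in seg.source.casefold() or needle in seg.target.casefold()
    ]


bitext = [
    Segment("The house is old.", "La maison est vieille."),
    Segment("She sold the house.", "Elle a vendu la maison."),
    Segment("The garden is large.", "Le jardin est grand."),
]
for seg in concordance(bitext, "house"):
    print(seg.source, "|", seg.target)
```

Because the pairs are kept in document order, each hit is shown together with its translation, and its neighbours remain available as context.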
Bitexts and translation memories
''Bitexts'' have some similarities with translation memories. The most salient difference is that a translation memory loses the original context, while a bitext retains the original sentence order. That said, some implementations of translation memory, such as Translation Memory eXchange (TMX), a standard XML format for exchanging translation memories between computer-assisted translation (CAT) programs, allow preserving the original order of sentences.
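A minimal sketch of a TMX-style document, built with Python's standard `xml.etree.ElementTree` module, shows one way sentence order can survive: translation units are simply written in document order. The element names (`tmx`, `header`, `body`, `tu`, `tuv`, `seg`) and header attribute names follow the author's reading of TMX 1.4; the attribute values are placeholders and the output has not been validated against the official DTD. The function name `build_tmx` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

# Qualified name that ElementTree serialises as the attribute xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs: Iterable[Tuple[str, str]],
              src_lang: str = "en", tgt_lang: str = "fr") -> ET.Element:
    """Build a TMX-style tree with one translation unit per aligned pair."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    # Header values below are placeholders chosen for this example.
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:  # units are appended in document order
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx


root = build_tmx([("Hello.", "Bonjour."), ("Thank you.", "Merci.")])
print(ET.tostring(root, encoding="unicode"))
```

Reading the `tu` elements back in file order then reproduces the original sequence of sentences, which is what distinguishes this use from an unordered translation memory.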
Bitexts are designed to be consulted by a human translator, not by a machine. As such, small alignment errors or minor discrepancies that would cause a translation memory to fail are of no importance.
In his original 1988 article, Harris also posited that bitext represents how translators hold their source and target texts together in their mental working memories as they progress. However, this hypothesis has not been followed up.
Online bitexts and translation memories may also be called online bilingual concordances. Several are available on the public Web, including Linguee, Reverso, and Tradooit.
See also
* Bilingual inscription
* Computer-assisted reviewing
* Example-based machine translation
* Natural language processing
* Polyglot (book)
* Ruby character
* Statistical machine translation
External links
Parallel corpora
* The JRC-Acquis Multilingual Parallel Corpus of the total body of European Union (EU) law: ''Acquis Communautaire'' with 231 language pairs.
* European Parliament Proceedings Parallel Corpus 1996–2011
* The Opus project aims at collecting freely available parallel corpora
* COMPARA – Portuguese/English parallel corpora (http://www.linguateca.pt/COMPARA/)
* TERMSEARCH – English/Russian/French parallel corpora (major international treaties, conventions, agreements, etc.)
* TradooIT – English/French/Spanish – Free Online tools
* Nunavut Hansard – English/Inuktitut parallel corpus
* ParaSol – A parallel corpus of Slavic and other languages
* Glosbe: Multilanguage parallel corpora with online search interface
* InterCorp: A multilingual parallel corpus, 40 languages aligned with Czech, online search interface
* myCAT – Olanto concordancer (open source AGPL) with online search on JCR and UNO corpus
* TAUS with online search interface
* linguatools multilingual parallel corpora, online search interface
* EUR-Lex Corpus – corpus built from the EUR-Lex database, consisting of European Union law and other public documents of the European Union
* Language Grid – Multilingual service platform that includes parallel text services
Documentation
* Proceedings of the 2003 Workshop on Building and Using Parallel Texts (https://web.archive.org/web/20060913013656/https://www.cs.unt.edu/~rada/wpt/)
* Proceedings of the 2005 Workshop on Building and Using Parallel Texts
Alignment tools
* Uplug – tools for processing parallel corpora (2003)
* An implementation of the Gale and Church sentence alignment algorithm (2005)
* The Hunalign sentence aligner (2005)
* Champollion (2006)
* mALIGNa (2008–2020)
* Gargantua sentence aligner (2010)
* Bleualign – machine translation based sentence alignment (2010)
* YASA (2013)
* Hierarchical alignment tool (HAT) (2018)
* Vecalign sentence alignment algorithm (2019)
* Web Alignment Tool at University of Grenoble