A parallel text is a text placed alongside its translation or translations. Parallel text alignment is the identification of the corresponding sentences in both halves of the parallel text. The Loeb Classical Library and the Clay Sanskrit Library are two examples of dual-language series of texts. Reference Bibles may contain the original languages and a translation, or several translations by themselves, for ease of comparison and study; Origen's Hexapla (Greek for "sixfold") placed six versions of the Old Testament side by side. A famous example is the Rosetta Stone, whose discovery allowed the decipherment of the Ancient Egyptian language to begin.

Large collections of parallel texts are called parallel corpora (see text corpus). Alignments of parallel corpora at sentence level are a prerequisite for many areas of linguistic research. During translation, sentences can be split, merged, deleted, inserted or reordered by the translator. This makes alignment a non-trivial task. Parallel texts may be used in language education.
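
The alignment task described above can be illustrated with a minimal sketch of length-based sentence alignment by dynamic programming. It is written in the spirit of the Gale and Church approach but is not that algorithm: the cost is a plain character-length difference plus fixed penalties, and the names `BEADS`, `bead_cost` and `align` as well as the penalty values are assumptions made for illustration. The sketch handles splits, merges, deletions and insertions, but not reordering.

```python
from typing import Dict, List, Tuple

# Allowed alignment units ("beads") as (source sentences, target sentences),
# each with a hypothetical fixed penalty; 1-1 is the default and costs nothing extra.
BEADS: Dict[Tuple[int, int], float] = {
    (1, 1): 0.0,   # one sentence translated as one sentence
    (1, 0): 10.0,  # source sentence deleted in translation
    (0, 1): 10.0,  # target sentence inserted by the translator
    (2, 1): 3.0,   # two source sentences merged into one
    (1, 2): 3.0,   # one source sentence split into two
}


def bead_cost(src: List[str], tgt: List[str], penalty: float) -> float:
    """Simplified cost: bead penalty plus difference in total character length."""
    src_len = sum(len(s) for s in src)
    tgt_len = sum(len(t) for t in tgt)
    return penalty + abs(src_len - tgt_len)


def align(source: List[str], target: List[str]) -> List[Tuple[List[int], List[int]]]:
    """Return the cheapest sequence of beads as (source indices, target indices)."""
    n, m = len(source), len(target)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = cost[i][j] + bead_cost(source[i:ni], target[j:nj], penalty)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the chosen beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


source = ["The cat sat.", "It was tired and it slept all day long."]
target = ["Le chat s'est assis.", "Il était fatigué.", "Il a dormi toute la journée."]
for src_ids, tgt_ids in align(source, target):
    print(src_ids, "<->", tgt_ids)
```

In this example the second source sentence is paired with the last two target sentences, because treating it as a split is cheaper than any combination of one-to-one pairs and insertions.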


Types of parallel corpora

Parallel corpora can be classified into four main categories:
* A ''parallel corpus'' contains translations of the same document in two or more languages, aligned at least at the sentence level. These tend to be rarer than less-comparable corpora.
* A ''noisy parallel corpus'' contains bilingual sentences that are not perfectly aligned or have poor quality translations. Nevertheless, most of its contents are bilingual translations of a specific document.
* A ''comparable corpus'' is built from non-sentence-aligned and untranslated bilingual documents, but the documents are topic-aligned.
* A ''quasi-comparable corpus'' includes very heterogeneous and non-parallel bilingual documents that may or may not be topic-aligned.


Noise in corpora

Large corpora used as training sets for machine translation algorithms are usually extracted from large bodies of similar sources, such as databases of news articles written in the first and second languages describing similar events. However, extracted fragments may be noisy, with extra elements inserted in each corpus. Extraction techniques can differentiate between bilingual elements represented in both corpora and monolingual elements represented in only one corpus in order to extract cleaner parallel fragments of bilingual elements. Comparable corpora are used to directly obtain knowledge for translation purposes. High-quality parallel data is difficult to obtain, however, especially for under-resourced languages.
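
The distinction between bilingual and monolingual elements can be illustrated with a toy sketch. It assumes a small hand-written English-to-French lexicon (`LEXICON`, hypothetical) and whitespace tokenization; the function name `split_elements` is also an assumption. Real extraction systems typically rely on statistical lexicons and more careful scoring, so this is only meant to show the idea.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon mapping English words to possible French translations.
LEXICON: Dict[str, Set[str]] = {
    "the": {"le", "la"},
    "minister": {"ministre"},
    "signed": {"signé"},
    "treaty": {"traité"},
}


def split_elements(
    src_tokens: List[str], tgt_tokens: List[str], lexicon: Dict[str, Set[str]]
) -> Tuple[List[str], List[str]]:
    """Keep only tokens that have a counterpart on the other side."""
    tgt_vocab = set(tgt_tokens)
    # Source tokens with at least one known translation present in the target.
    bilingual_src = [w for w in src_tokens if lexicon.get(w, set()) & tgt_vocab]
    # All translations licensed by the source tokens.
    licensed: Set[str] = set()
    for w in src_tokens:
        licensed |= lexicon.get(w, set())
    # Target tokens that translate some source token.
    bilingual_tgt = [w for w in tgt_tokens if w in licensed]
    return bilingual_src, bilingual_tgt


src = "the minister signed the treaty yesterday in paris".split()
tgt = "le ministre a signé le traité malgré les critiques".split()
clean_src, clean_tgt = split_elements(src, tgt, LEXICON)
print(clean_src)
print(clean_tgt)
```

Tokens such as "yesterday" or "malgré" have no counterpart in the other fragment and are treated as monolingual noise, leaving a cleaner parallel fragment.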


Bitext

In the field of translation studies, a bitext is a merged document composed of both source- and target-language versions of a given text. Bitexts are generated by a piece of software called an ''alignment tool'', or a ''bitext tool'', which automatically aligns the original and translated versions of the same text. The tool generally matches these two texts sentence by sentence. A collection of bitexts is called a ''bitext database'' or a ''bilingual corpus'', and can be consulted with a search tool.
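
As a rough illustration, a bitext can be modeled as an ordered list of aligned segment pairs that a search tool consults. The class name `Bitext`, its `search` method and the sample sentences below are assumptions for this sketch, not the interface of any particular bitext tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Bitext:
    """Aligned (source, target) segments kept in original document order."""

    segments: List[Tuple[str, str]]

    def search(self, term: str) -> List[Tuple[int, str, str]]:
        """Return (position, source, target) for segments whose source contains term."""
        needle = term.lower()
        return [
            (i, src, tgt)
            for i, (src, tgt) in enumerate(self.segments)
            if needle in src.lower()
        ]


bitext = Bitext(
    segments=[
        ("The agreement enters into force today.", "L'accord entre en vigueur aujourd'hui."),
        ("Both parties were present.", "Les deux parties étaient présentes."),
        ("The Agreement may be renewed.", "L'accord peut être renouvelé."),
    ]
)
for position, src, tgt in bitext.search("agreement"):
    print(position, src, "|", tgt)
```

Because the segments stay in document order, each hit can be shown together with its neighbouring segments, which is the context a human translator usually wants.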


Bitexts and translation memories

Bitexts have some similarities with translation memories. The most salient difference is that a translation memory loses the original context, while a bitext retains the original sentence order. That said, some implementations of translation memory, such as Translation Memory eXchange (TMX), a standard XML format for exchanging translation memories between computer-assisted translation (CAT) programs, allow preserving the original order of sentences.

Bitexts are designed to be consulted by a human translator, not by a machine. As such, small alignment errors or minor discrepancies that would cause a translation memory to fail are of no importance. In his original 1988 article, Harris also posited that bitext represents how translators hold their source and target texts together in their mental working memories as they progress. However, this hypothesis has not been followed up.

Online bitexts and translation memories may also be called online bilingual concordances. Several are available on the public Web, including Linguee, Reverso, and TradooIT.
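
A minimal sketch of exporting aligned sentence pairs to a TMX-style document with Python's standard `xml.etree.ElementTree` follows. Writing the translation units in document order is one plausible way to keep the original sentence order and is an assumption of this sketch; the function name `to_tmx`, the header values and the sample sentences are likewise illustrative only.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Qualified name that ElementTree serializes as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def to_tmx(pairs: List[Tuple[str, str]], src_lang: str = "en", tgt_lang: str = "fr") -> str:
    """Serialize aligned pairs as translation units, in their original order."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(
        tmx,
        "header",
        {
            "creationtool": "example-sketch",  # illustrative values only
            "creationtoolversion": "0.1",
            "segtype": "sentence",
            "o-tmf": "none",
            "adminlang": "en",
            "srclang": src_lang,
            "datatype": "plaintext",
        },
    )
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per aligned pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


pairs = [
    ("Good morning.", "Bonjour."),
    ("See you tomorrow.", "À demain."),
]
document = to_tmx(pairs)
print(document)
```

Reading the file back and iterating over the `tu` elements yields the pairs in the order they were written, so the sentence order of the source text can be recovered.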


See also

* Bilingual inscription
* Computer-assisted reviewing
* Example-based machine translation
* Natural language processing
* Polyglot (book)
* Ruby character
* Statistical machine translation


External links


Parallel corpora


* The JRC-Acquis Multilingual Parallel Corpus of the total body of European Union (EU) law (''Acquis Communautaire''), with 231 language pairs
* European Parliament Proceedings Parallel Corpus 1996–2011
* The Opus project, which aims at collecting freely available parallel corpora
* COMPARA – Portuguese/English parallel corpora (http://www.linguateca.pt/COMPARA/)
* TERMSEARCH – English/Russian/French parallel corpora (major international treaties, conventions, agreements, etc.)
* TradooIT – English/French/Spanish – free online tools
* Nunavut Hansard – English/Inuktitut parallel corpus
* ParaSol – a parallel corpus of Slavic and other languages
* Glosbe – multilanguage parallel corpora with online search interface
* InterCorp – a multilingual parallel corpus of 40 languages aligned with Czech, with online search interface
* myCAT – Olanto concordancer (open source, AGPL) with online search on JRC and UNO corpora
* TAUS – with online search interface
* linguatools – multilingual parallel corpora with online search interface
* EUR-Lex Corpus – corpus built from the EUR-Lex database, which consists of European Union law and other public documents of the European Union
* Language Grid – multilingual service platform that includes parallel text services


Documentation



* Proceedings of the 2003 Workshop on Building and Using Parallel Texts (https://web.archive.org/web/20060913013656/https://www.cs.unt.edu/~rada/wpt/)
* Proceedings of the 2005 Workshop on Building and Using Parallel Texts


Alignment tools




* Uplug – tools for processing parallel corpora (2003)
* An implementation of the Gale and Church sentence alignment algorithm (2005)
* The Hunalign sentence aligner (2005)
* Champollion (2006)
* mALIGNa (2008–2020)
* Gargantua sentence aligner (2010)
* Bleualign – machine translation based sentence alignment (2010)
* YASA (2013)
* Hierarchical alignment tool (HAT) (2018)
* Vecalign sentence alignment algorithm (2019)
* Web Alignment Tool at University of Grenoble