Apertium

Apertium is a free/open-source rule-based machine translation platform. It is free software, released under the terms of the GNU General Public License.


Overview

Apertium is a transfer-based machine translation system. It uses finite-state transducers for all of its lexical transformations, and Constraint Grammar taggers as well as hidden Markov models or perceptrons for part-of-speech tagging (word category disambiguation). A structural transfer component is responsible for word movement and agreement; most Apertium language pairs so far have used "chunking" or shallow transfer rules, though newer pairs use (possibly recursive) rules defined in a context-free grammar.

Many existing machine translation systems are commercial or use proprietary technologies, which makes them very hard to adapt to new uses. Apertium's code and data are free software and use a language-independent specification, which makes it easier to contribute to Apertium, makes development more efficient, and supports the project's overall growth. As of December 2020, Apertium had released 51 stable language pairs, delivering fast translation with reasonably intelligible results (errors are easily corrected). Being an open-source project, Apertium provides tools for potential developers to build their own language pair and contribute to the project.
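The finite-state transducers mentioned above map surface forms to lexical forms (a lemma plus tags). As a rough illustration only, the following Python sketch builds a tiny nondeterministic transducer. The class `ToyTransducer`, its methods and the tag names are hypothetical and are not part of lttoolbox or HFST; real tools compile dictionaries into far more compact transducers.

```python
from collections import defaultdict


class ToyTransducer:
    """A tiny nondeterministic finite-state transducer (illustrative only)."""

    def __init__(self):
        # (state, input symbol) -> list of (output string, next state)
        self.transitions = defaultdict(list)
        self.finals = set()
        self.next_state = 1  # state 0 is the start state

    def add_path(self, surface, analysis):
        """Add one surface form -> analysis pair, one input symbol per arc."""
        state = 0
        for i, ch in enumerate(surface):
            # Emit the whole analysis on the last arc and nothing before it.
            out = analysis if i == len(surface) - 1 else ""
            new = self.next_state
            self.next_state += 1
            self.transitions[(state, ch)].append((out, new))
            state = new
        self.finals.add(state)

    def lookup(self, surface):
        """Return every analysis whose path consumes the whole surface form."""
        configs = [(0, "")]
        for ch in surface:
            configs = [
                (nxt, acc + out)
                for state, acc in configs
                for out, nxt in self.transitions.get((state, ch), [])
            ]
        return sorted(acc for state, acc in configs if state in self.finals)


# Hypothetical entries: "walks" is ambiguous between a noun and a verb reading.
fst = ToyTransducer()
fst.add_path("walks", "walk<n><pl>")
fst.add_path("walks", "walk<vblex><pres><p3><sg>")
fst.add_path("cat", "cat<n><sg>")
print(fst.lookup("walks"))
```

An ambiguous form such as "walks" yields more than one analysis; choosing among them is the job of the tagger described above.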


History

Apertium originated as one of the machine translation engines in the OpenTrad project, which was funded by the Spanish government and developed by the Transducens research group at the Universitat d'Alacant. It was originally designed to translate between closely related languages, although it has since been expanded to handle more divergent language pairs. To create a new machine translation system, one only has to develop linguistic data (dictionaries, rules) in well-specified XML formats.

Language data developed for it (in collaboration with the Universidade de Vigo, the Universitat Politècnica de Catalunya and the Universitat Pompeu Fabra) currently supports, in stable versions, the Arabic, Aragonese, Asturian, Basque, Belarusian, Breton, Bulgarian, Catalan, Crimean Tatar, Danish, English, Esperanto, French, Galician, Hindi, Icelandic, Indonesian, Italian, Kazakh, Macedonian, Malaysian, Maltese, Northern Sami, Norwegian (Bokmål and Nynorsk), Occitan, Polish, Portuguese, Romanian, Russian, Sardinian, Serbo-Croatian, Silesian, Slovene, Spanish, Swedish, Tatar, Ukrainian, Urdu, and Welsh languages. A full list is available below.

Several companies are also involved in the development of Apertium, including Prompsit Language Engineering, Imaxin Software and Eleka Ingeniaritza Linguistikoa. The project took part in the 2009, 2010, 2011, 2012, 2013 and 2014 editions of Google Summer of Code and the 2010, 2011, 2012, 2013, 2014, 2015, 2016 and 2017 editions of Google Code-In.


Translation methodology

This is an overall, step-by-step view of how Apertium works. The steps below show how Apertium translates a source-language text (the text we want to translate) into a target-language text (the translated text).

# Source-language text is passed into Apertium for translation.
# The ''deformatter'' removes formatting markup (HTML, RTF, etc.) that should be kept in place but not translated.
# The ''morphological analyser'' segments the text (expanding elisions, marking set phrases, etc.) and looks up segments in the language dictionaries, returning dictionary forms and tags for all matches. In pairs that involve agglutinative morphology, including a number of Turkic languages, a Helsinki Finite State Transducer (HFST) is used. Otherwise, an Apertium-specific finite-state transducer system called ''lttoolbox'' is used.
# The ''morphological disambiguator'' resolves ambiguous segments (i.e., when there is more than one match) by choosing one match. Together, the ''morphological analyser'' and the ''morphological disambiguator'' form the ''part-of-speech tagger''. Apertium uses Constraint Grammar rules (with the vislcg3 parser) for most of its language pairs.
# ''Retokenisation'' uses a finite-state transducer to match sequences of lexical units and may reorder or translate tags (often used to turn idiomatic expressions into something closer to the target-language grammar).
# ''Lexical transfer'' looks up disambiguated source-language basewords to find their target-language equivalents (i.e., mapping source language to target language). For lexical transfer, Apertium uses an XML-based dictionary format called ''bidix''.
# ''Lexical selection'' chooses between alternative translations when the source-text word has alternative meanings. Apertium uses a specific XML-based technology, ''apertium-lex-tools'', to perform lexical selection.
# ''Structural transfer'', whose rules are written in an XML format that allows complex structural transfer rules, can consist of one-step chunking transfer, three-step chunking transfer or a CFG-based transfer module. The chunking modules flag grammatical differences between the source language and target language (e.g. gender or number agreement) by creating a sequence of chunks containing markers for this. They then reorder or modify chunks in order to produce a grammatical translation in the target language. The newer CFG-based module matches input sequences into possible parse trees, selects the best-ranking one and applies transformation rules to the tree.
# The ''morphological generator'' uses the tags to deliver the correct target-language surface form. The morphological generator is a morphological transducer, just like the morphological analyser. A morphological transducer both analyses and generates forms.
# The ''post-generator'' makes any necessary orthographic changes due to the contact of words (e.g. elisions).
# The ''reformatter'' restores the formatting markup (HTML, RTF, etc.) that the deformatter removed earlier.
# Apertium delivers the target-language translation.
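
The core of the steps above (analysis, disambiguation, lexical transfer, structural transfer and generation) can be sketched as a chain of small functions. This is a toy illustration in Python under strong assumptions, not Apertium's implementation: all dictionary entries, tag names, function names and the single disambiguation and agreement rule are invented for the example, and the deformatter, retokenisation, lexical selection, post-generator and reformatter are left out.

```python
# Toy data; every entry below is invented for illustration.
ANALYSES = {
    "cats": [("cat", ("n", "pl"))],
    "sleep": [("sleep", ("n", "sg")), ("sleep", ("vblex", "pres"))],
}
BIDIX = {("cat", "n"): "gato", ("sleep", "vblex"): "dormir"}
GENERATION = {
    ("gato", ("n", "pl")): "gatos",
    ("dormir", ("vblex", "pres", "pl")): "duermen",
}


def analyse(text):
    """Morphological analysis: list every dictionary reading per token."""
    return [ANALYSES.get(tok, [(tok, ("unknown",))]) for tok in text.split()]


def disambiguate(readings_per_token):
    """Pick one reading per token: prefer a verb right after a noun."""
    chosen = []
    for readings in readings_per_token:
        pick = readings[0]
        if chosen and chosen[-1][1][0] == "n":
            verbs = [r for r in readings if r[1][0] == "vblex"]
            pick = verbs[0] if verbs else pick
        chosen.append(pick)
    return chosen


def lexical_transfer(units):
    """Replace each source lemma with its target lemma, keeping the tags."""
    return [(BIDIX.get((lemma, tags[0]), lemma), tags) for lemma, tags in units]


def structural_transfer(units):
    """Agreement: a verb copies the number tag of the noun before it."""
    out = []
    for lemma, tags in units:
        if tags[0] == "vblex" and out and out[-1][1][0] == "n":
            tags = tags + (out[-1][1][-1],)
        out.append((lemma, tags))
    return out


def generate(units):
    """Morphological generation: look up the target surface form.

    Units with no generation entry are marked with '#' in this toy.
    """
    return " ".join(GENERATION.get(u, "#" + u[0]) for u in units)


def translate(text):
    """Run the toy stages in pipeline order."""
    units = disambiguate(analyse(text))
    return generate(structural_transfer(lexical_transfer(units)))


print(translate("cats sleep"))  # prints "gatos duermen"
```

Each stage only consumes the output of the previous one, which mirrors the pipeline design: a stage can be swapped (for example, a different disambiguator) without touching the others.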


Supported languages

The following 108 pairs and 50 languages and language varieties are supported by Apertium.

# Afrikaans to Dutch
# Arabic to Maltese
# Aragonese to Catalan
# Aragonese to Spanish
# Arpitan (Franco-Provençal) to French
# Basque to English
# Basque to Spanish
# Belarusian to Russian
# Breton to French
# Bulgarian to Macedonian
# Catalan to Aragonese
# Catalan to English
# Catalan to Esperanto
# Catalan to French
# Catalan to Italian
# Catalan to Occitan
# Catalan to Aranese
# Catalan to Portuguese
# Catalan to Brazilian Portuguese
# Catalan to European Portuguese (traditional spelling)
# Catalan to Romanian
# Catalan to Sardinian
# Catalan to Spanish
# Crimean Tatar to Turkish
# Danish to Norwegian (Bokmål)
# Danish to Norwegian (Nynorsk)
# Danish to Swedish
# Dutch to Afrikaans
# English to Catalan
# English to Valencian
# English to Esperanto
# English to Galician
# English to Serbo-Croatian
# English to Spanish
# Esperanto to English
# French to Arpitan (Franco-Provençal)
# French to Catalan
# French to Esperanto
# French to Occitan
# French to Gascon
# French to Spanish
# Galician to English
# Galician to Portuguese
# Galician to Spanish
# Hindi to Urdu
# Icelandic to English
# Icelandic to Swedish
# Indonesian to Malay
# Italian to Catalan
# Italian to Sardinian
# Italian to Spanish
# Kazakh to Tatar
# Macedonian to Bulgarian
# Macedonian to English
# Malay to Indonesian
# Maltese to Arabic
# Northern Sámi to Norwegian (Bokmål)
# Norwegian (Bokmål) to Danish
# Norwegian (Bokmål) to Norwegian (Nynorsk)
# Norwegian (Bokmål) to East Norwegian, vi→vi
# Norwegian (Bokmål) to Swedish
# Norwegian (Nynorsk) to Danish
# Norwegian (Nynorsk) to Norwegian (Bokmål)
# Norwegian (Nynorsk) to East Norwegian, vi→vi
# Norwegian (Nynorsk) to Swedish
# East Norwegian, vi→vi to Norwegian (Nynorsk)
# Occitan to Catalan
# Occitan to French
# Occitan to Spanish
# Aranese to Catalan
# Aranese to Spanish
# Gascon to French
# Polish to Silesian
# Portuguese to Catalan
# Portuguese to Galician
# Portuguese to Spanish
# Romanian to Catalan
# Romanian to Spanish
# Russian to Belarusian
# Russian to Ukrainian
# Sardinian to Italian
# Serbo-Croatian to English
# Serbo-Croatian to Macedonian
# Serbo-Croatian to Slovenian
# Silesian to Polish
# Slovenian to Serbo-Croatian
# Spanish to Aragonese
# Spanish to Asturian
# Spanish to Catalan
# Spanish to Valencian
# Spanish to English
# Spanish to Esperanto
# Spanish to French
# Spanish to Galician
# Spanish to Italian
# Spanish to Occitan
# Spanish to Aranese
# Spanish to Portuguese
# Spanish to Brazilian Portuguese
# Swedish to Danish
# Swedish to Icelandic
# Swedish to Norwegian (Bokmål)
# Swedish to Norwegian (Nynorsk)
# Tatar to Kazakh
# Turkish to Crimean Tatar
# Ukrainian to Russian
# Urdu to Hindi
# Welsh to English


See also

* Babel Fish (discontinued; redirects to main Yahoo! site)
* Comparison of machine translation applications
* Jollo (discontinued)
* Microsoft Translator
* Moses
* OpenLogos
* SYSTRAN
* Yandex.Translate




External links


* Apertium home
* Apertium Wiki
* OpenTrad


End-user services and software

(All services are based on the Apertium engine)


Online translation websites


* Apertium Translation home
* Prompsit Translator
* PoliTraductor Translator
* Universitat d'Alacant Translator
* Universitat Oberta de Catalunya Translator (archived 2016-01-17 at https://web.archive.org/web/20160117192333/http://apertium.uoc.edu/)


Offline applications


* Apertium Caffeine
* Apertium Android
* Apertium OmegaT