The Tekstaro de Esperanto (''Corpus of Esperanto'') is a text corpus of the Esperanto language: a large collection of very diverse texts for linguistic research on Esperanto. The corpus contains texts totaling 5,177,208 words. It is searchable by regular expressions, including custom search terms that are lexical (e.g., sequences of Esperanto letters) and grammatical (e.g., active participial suffixes, passive participial suffixes, and adjectival suffixes).
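As a rough illustration of what a grammatical regular-expression search can look like, the following minimal sketch uses Python's standard `re` module to find words that carry the Esperanto active participial suffixes (-ant-, -int-, -ont-) or passive participial suffixes (-at-, -it-, -ot-). The names `LETTER`, `ACTIVE_PARTICIPLE`, and `PASSIVE_PARTICIPLE` and the sample sentence are assumptions made for this example; they are not the Tekstaro's own search syntax or data.

```python
import re

# Lowercase letters of the Esperanto alphabet (no q, w, x, y).
# Assumption for this sketch: input is lowercase apart from sentence-initial capitals.
LETTER = "[abcĉdefgĝhĥijĵklmnoprsŝtuŭvz]"

# Hypothetical shorthand patterns, NOT the Tekstaro's custom search terms.
# Stem + participial suffix + word-class ending (-a, -o, -e),
# optionally followed by plural -j and accusative -n.
ACTIVE_PARTICIPLE = re.compile(rf"\b{LETTER}+[aio]nt[aoe]j?n?\b")
PASSIVE_PARTICIPLE = re.compile(rf"\b{LETTER}+[aio]t[aoe]j?n?\b")

# Made-up sample sentence: "The reading boy saw the written letter."
sample = "La leganta knabo vidis la skribitan leteron."

active = ACTIVE_PARTICIPLE.findall(sample)
passive = PASSIVE_PARTICIPLE.findall(sample)

print(active)   # words matching the active participle pattern
print(passive)  # words matching the passive participle pattern
```

These patterns look only at surface spelling, so an ordinary word whose stem happens to end in a matching letter sequence (for example ''soldato'', "soldier") would also match the passive pattern. A search over a real corpus would need extra filtering or grammatical annotation to exclude such cases.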
History
In 2002 the Esperantic Studies Foundation (ESF) started the project to support linguistic study of Esperanto. ESF hired Bertilo Wennergren to plan and create the first phase of the project, which finished at the end of April 2003. Wennergren was aided by Ilona Koutny, Jouko Lindstedt, Carlo Minnaja, Christopher Gledhill, and Mauro La Torre.
In 2006, planning began for the Parola tekstaro de Esperanto (''Speech corpus of Esperanto'').
External links
* Interview with Bertilo Wennergren about the Tekstaro de Esperanto in ''Libera Folio''