Bulgarian National Corpus
The Bulgarian National Corpus (BulNC) is a large representative corpus of Bulgarian comprising about 200,000 texts and amounting to over 1 billion words. History The Bulgarian National corpus is created at the Institute for Bulgarian Language „Prof. L. Andreychin” by research associates from the Department of Computational Linguistics and the Department of Bulgarian Lexicology and Lexicography. BulNC incorporates several individual electronic corpora, developed in the period 2001-2009 for the purposes of the two departments. The corpus is constantly enlarged with new texts. Contents The Bulgarian National corpus consists of a monolingual (Bulgarian) part and 47 parallel corpora. The Bulgarian part includes about 1.2 billion words in over 240 000 text samples. The materials in the Corpus reflect the state of the Bulgarian language (mainly in its written form) from the middle of 20th century (1945) until present. It also includes parallel corpora of various size for 47 foreig ... [...More Info...]       [...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]   |
|
Text Corpus
In linguistics, a corpus (plural ''corpora'') or text corpus is a language resource consisting of a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, they are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory. In search technology, a corpus is the collection of documents which is being searched. Overview A corpus may contain texts in a single language (''monolingual corpus'') or text data in multiple languages (''multilingual corpus''). In order to make the corpora more useful for doing linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or ''POS-tagging'', in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of ''tags''. Another example is indicating the lemma (b ... [...More Info...]       [...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]   |
|
Corpus Linguistics
Corpus linguistics is the study of a language as that language is expressed in its text corpus (plural ''corpora''), its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field—the natural context ("realia") of that language—with minimal experimental interference. The text-corpus method uses the body of texts written in any natural language to derive the set of abstract rules which govern that language. Those results can be used to explore the relationships between that subject language and other languages which have undergone a similar analysis. The first such corpora were manually derived from source texts, but now that work is automated. Corpora have not only been used for linguistics research, they have also been used to compile dictionaries (starting with ''The American Heritage Dictionary of the English Language'' in 1969) and grammar guides, such as '' A Comprehensive Grammar ... [...More Info...]       [...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]   |
|
Computational Linguistics
Computational linguistics is an Interdisciplinarity, interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics draws upon linguistics, computer science, artificial intelligence, mathematics, logic, philosophy, cognitive science, cognitive psychology, psycholinguistics, anthropology and neuroscience, among others. Sub-fields and related areas Traditionally, computational linguistics emerged as an area of artificial intelligence performed by computer scientists who had specialized in the application of computers to the processing of a natural language. With the formation of the Association for Computational Linguistics (ACL) and the establishment of independent conference series, the field consolidated during the 1970s and 1980s. The Association for Computational Linguistics defines computational linguistics as: The term "comp ... [...More Info...]       [...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]   |
|
BulPosCor
The Bulgarian Part of Speech-annotated Corpus (BulPosCor) (in Bulgarian: Български Пос анотиран корпус (БулПосКор)) is a morphologically annotated general monolingual corpus of written language where each item in a text is assigned a grammatical tag. BulPosCor is created by thDepartment of Computational Linguisticsat the Institute for Bulgarian Language of the Bulgarian Academy of Sciences and consists of 174 697 lexical items. BulPosCor has been compiled from the Structured "Brown" Corpus of Bulgarian by sampling 300+ word-excerpts (expanded to sentence boundary) from the original BCB files in such a way as to preserve the BCB overall structure. The annotation process consists of a primary stage of automatically assigning tags from the Bulgarian Grammar Dictionary and a stage of manual resolving of morphological ambiguities. The disambiguated corpus consists of 174,697 lexical units. Access [...More Info...]       [...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]   |
|
BulSemCor
The Bulgarian Sense-annotated Corpus (BulSemCor) ( bg, Български семантично анотиран корпус (БулСемКор)) is a structured corpus of Bulgarian texts in which each lexical item is assigned a sense tag. BulSemCor was created by the Department of Computational Linguistics at the Institute for Bulgarian Language of the Bulgarian Academy of Sciences. Structure BulSemCor was created as part of a nationally funded project titled "BulNet – A lexico-semantic network for the Bulgarian Language" (2005–2010). It follows the general methodology of SemCor combined with some specific principles. The corpus for annotation consists of 101,791 tokens covering an excerpt from the Bulgarian "Brown" Corpus modelled on the Brown Corpus. An important feature of BulSemCor is that the samples a ... [...More Info...]       [...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]   |
|
Bulgarian Language
Bulgarian (, ; bg, label=none, български, bălgarski, ) is an Eastern South Slavic language spoken in Southeastern Europe, primarily in Bulgaria. It is the language of the Bulgarians. Along with the closely related Macedonian language (collectively forming the East South Slavic languages), it is a member of the Balkan sprachbund and South Slavic dialect continuum of the Indo-European language family. The two languages have several characteristics that set them apart from all other Slavic languages, including the elimination of case declension, the development of a suffixed definite article, and the lack of a verb infinitive. They retain and have further developed the Proto-Slavic verb system (albeit analytically). One such major development is the innovation of evidential verb forms to encode for the source of information: witnessed, inferred, or reported. It is the official language of Bulgaria, and since 2007 has been among the official languages of ... [...More Info...]       [...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]   |