The Croatian Language Corpus (CLC; Croatian: Hrvatski jezični korpus, HJK) is a corpus of Croatian compiled at the Institute of Croatian Language and Linguistics (IHJJ).
Background
The CLC was initially funded as a sub-project of the research program ''Riznica'' (''Croatian Language Repository'') by the Ministry of Science, Education, and Sports of the Republic of Croatia (MZOŠ) (project no. 0212010) from May 2005. In a second development phase, since 2007, the further extension and development of the CLC has been embedded within the research program ''The Croatian Language Repository'' (CLR), which was funded by the MZOŠ (cf. Ćavar and Brozović Rončević, 2012). Because the CLR is a research program (PI Dunja Brozović Rončević) comprising numerous independent research projects that make use of the CLC, the corpus is mainly developed as a by-product of those projects. Currently, Dunja Brozović Rončević and Damir Ćavar are in charge of the corpus development.
Goals
One of the main goals of the CLC project is to create a publicly available Croatian corpus that is annotated on multiple levels, i.e. lemmatized, morphologically segmented and morpho-syntactically annotated, phonemically transcribed and syllabified, and syntactically parsed. While the current version of the corpus provides resources from the Croatian language standard, several corpora from different development phases of Croatian are being created as well, including digitizations of manuscripts and Croatian dictionaries.
Format and Availability
From the outset, the collected and digitized texts in the CLC were annotated using the Text Encoding Initiative (TEI) P5 XML standard. Currently, approximately 90 million tokens are available in the TEI P5 XML format. The corpus can be accessed online via the Philologic interface (see The ARTFL Project, Department of Romance Languages and Literatures, The University of Chicago). It is virtualized into various sub-corpora, and individual or specific definitions of sub-corpora can be provided on demand.
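As an illustration of how word-level annotation in a TEI P5 document might be read, the following minimal Python sketch extracts word forms together with their lemma and morpho-syntactic tag. The sample document, the function name `read_tokens`, the use of the `lemma` and `msd` attributes on `<w>` elements, and the tag values are assumptions made for this example; the actual markup used in the CLC is not described here and may differ.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample: the element layout and tag values are illustrative
# assumptions, not taken from the CLC itself.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s>
      <w lemma="pas" msd="Ncmsn">Pas</w>
      <w lemma="spavati" msd="Vmr3s">spava</w>
      <pc>.</pc>
    </s>
  </p></body></text>
</TEI>"""


def read_tokens(tei_xml):
    """Return (form, lemma, msd) tuples for every <w> element, in document order."""
    root = ET.fromstring(tei_xml)
    tokens = []
    for w in root.iterfind(".//tei:w", TEI_NS):
        # Missing attributes are returned as None rather than raising an error.
        tokens.append((w.text or "", w.get("lemma"), w.get("msd")))
    return tokens


tokens = read_tokens(SAMPLE)
for form, lemma, msd in tokens:
    print(form, lemma, msd)
```

Punctuation marked with `<pc>` is skipped in this sketch, so the result counts word tokens only; a token count that includes punctuation would need to iterate over both element types.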
Content
The CLC is assembled from selected Croatian texts covering various functional domains and genres. It includes literature and other written sources from the period in which the standardization of Croatian began to take its final shape, i.e. from the second half of the 19th century on.
The CLC consists of:
* fundamental Croatian literature (e.g. novels, short stories, drama, poetry)
* non-fiction
* scientific publications from various domains and university textbooks
* school books
* translated literature from outstanding Croatian translators
* online journals and newspapers
* books from the pre-standardization period of Croatian that are adapted to present-day standard Croatian
Cooperation
The realization of the CLC was made possible in cooperation with:
* Školska knjiga d.d.
* Croatian Academy of Sciences and Arts (HAZU)
* Stoljeća hrvatske književnosti, Matica hrvatska
References
External links
* Croatian Language Corpus (CLC) website and Philologic interface
* ''Croatian National Corpus'', another Croatian corpus, by the Institute of Linguistics of the Faculty of Humanities and Social Sciences, University of Zagreb