The Linguistic Data Consortium (LDC) is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for linguistics research and development purposes. The University of Pennsylvania is the LDC's host institution. The LDC was founded in 1992 with a grant from the US Defense Advanced Research Projects Agency (DARPA), and is partly supported by grant IRI-9528587 from the Information and Intelligent Systems division of the National Science Foundation. The director of the LDC is Mark Liberman.

The LDC subsumed the earlier ACL Data Collection Initiative. Part of the motivation for its founding was to support the benchmark-oriented methodology of DARPA's Human Language Technology program. Previously, John R. Pierce directed the committee that produced the ALPAC report (1966), which caused a severe decrease in funding for linguistic AI for about 10 years. Charles Wayne restarted funding in speech and language in the mid-1980s. To avoid the criticisms raised in the ALPAC report, the program needed a way to demonstrate objective progress, which led to the benchmark-oriented methodology: DARPA would propose specific, quantifiable and testable score targets on benchmarks, and the funded teams would attempt to reach them.

By 1993, it was noted that the data needed for training and benchmarking models had grown so large that "Not even the largest companies can easily afford enough of the needed data... Researchers at smaller companies and in universities risk being frozen out of the process almost entirely." [Liberman, M. and Godfrey, J. (1993). The Linguistic Data Consortium. In Chen, Keh-Jiann, Chu-Ren Huang, Proc. ROCLing Computational Linguistics Conference VI, Nantou, Taiwan, September. Association for Computational Linguistics and Chinese Language Processing (ACLCLP).] The LDC provided a central location for creating and dispensing such data. There is a membership fee that has been increased once since its founding.


See also

* Corpus linguistics
* Cross-Linguistic Linked Data (CLLD) – project coordinating over a dozen linguistics databases; hosted by the Max Planck Institute (Germany)
* Language Grid – a platform for language resources, operated by NPO Language Grid Association, primarily active in Asia
* Machine translation
* Natural language processing
* Speech technology
* ACL Data Collection Initiative
* Charles Lynn Wayne


References


External links


LDC Website