The Linguistic Data Consortium (LDC) is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for linguistics research and development purposes. The University of Pennsylvania is the LDC's host institution. The LDC was founded in 1992 with a grant from the US Defense Advanced Research Projects Agency (DARPA), and is partly supported by grant IRI-9528587 from the Information and Intelligent Systems division of the National Science Foundation. The director of the LDC is Mark Liberman. The LDC subsumed the previous ACL Data Collection Initiative.
Part of the motivation for founding the LDC was to support the benchmark-oriented methodology of DARPA's Human Language Technology program. Previously, John R. Pierce directed the committee that produced the ALPAC report (1966), which caused a severe decrease in funding for linguistic AI for about 10 years. Later, Charles Wayne restarted funding in speech and language in the mid-1980s. To avoid the criticisms raised in the ALPAC report, DARPA needed a way to demonstrate objective progress, which led to the benchmark-oriented methodology: DARPA would propose specific, quantifiable and testable score targets on benchmarks, and the funded teams would attempt to reach those targets.
It was noted that by 1993 the data needed for training and benchmarking the models had grown large enough that "Not even the largest companies can easily afford enough of [the needed] data... Researchers at smaller companies and in universities risk being frozen out of the process almost entirely." [Liberman, M. and Godfrey, J. (1993). The Linguistic Data Consortium. In Chen, Keh-Jiann, Chu-Ren Huang, Proc. ROCLing Computational Linguistics Conference VI, Nantou, Taiwan, September. Association for Computational Linguistics and Chinese Language Processing (ACLCLP).] The LDC provided a central location for creating and distributing such data. Membership requires a fee, which has been increased once since the LDC's founding.
See also
* Corpus linguistics
* Cross-Linguistic Linked Data (CLLD) – project coordinating over a dozen linguistics databases; hosted by the Max Planck Institute (Germany)
* Language Grid – a platform for language resources, operated by NPO Language Grid Association, primarily active in Asia
* Machine translation
* Natural language processing
* Speech technology
* ACL Data Collection Initiative
* Charles Lynn Wayne
References
External links
LDC Website