Parsed Corpus
   HOME

TheInfoList



OR:

In
linguistics Linguistics is the scientific study of language. The areas of linguistic analysis are syntax (rules governing the structure of sentences), semantics (meaning), Morphology (linguistics), morphology (structure of words), phonetics (speech sounds ...
, a treebank is a parsed
text corpus In linguistics and natural language processing, a corpus (: corpora) or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated. Annotated, they have been used in corp ...
that annotates
syntactic In linguistics, syntax ( ) is the study of how words and morphemes combine to form larger units such as phrases and sentences. Central concerns of syntax include word order, grammatical relations, hierarchical sentence structure (constituency ...
or
semantic Semantics is the study of linguistic Meaning (philosophy), meaning. It examines what meaning is, how words get their meaning, and how the meaning of a complex expression depends on its parts. Part of this process involves the distinction betwee ...
sentence structure. The construction of parsed corpora in the early 1990s revolutionized
computational linguistics Computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics ...
, which benefitted from large-scale
empirical data Empirical evidence is evidence obtained through sense experience or experimental procedure. It is of central importance to the sciences and plays a role in various other fields, like epistemology and law. There is no general agreement on how the ...
.


Etymology

The term ''treebank'' was coined by linguist
Geoffrey Leech Geoffrey Neil Leech FBA (16 January 1936 – 19 August 2014) was a specialist in English language and linguistics. He was the author, co-author, or editor of more than 30 books and more than 120 published papers. His main academic interests we ...
in the 1980s, by analogy to other repositories such as a
seedbank A seed bank (also seed banks, seeds bank or seed vault) stores seeds to preserve genetic diversity; hence it is a type of gene bank. There are many reasons to store seeds. One is to preserve the genes that plant breeders need to increase yield, ...
or
bloodbank A blood bank is a center where blood gathered as a result of blood donation is stored and preserved for later use in blood transfusion. The term "blood bank" typically refers to a department of a hospital usually within a clinical pathology labora ...
. This is because both syntactic and semantic structure are commonly represented compositionally as a
tree structure A tree structure, tree diagram, or tree model is a way of representing the hierarchical nature of a structure in a graphical form. It is named a "tree structure" because the classic representation resembles a tree, although the chart is gen ...
. The term ''parsed corpus'' is often used interchangeably with the term treebank, with the emphasis on the primacy of sentences rather than trees.


Construction

Treebanks are often created on top of a corpus that has already been annotated with part-of-speech tags. In turn, treebanks are sometimes enhanced with
semantic Semantics is the study of linguistic Meaning (philosophy), meaning. It examines what meaning is, how words get their meaning, and how the meaning of a complex expression depends on its parts. Part of this process involves the distinction betwee ...
or other linguistic information. Treebanks can be created completely manually, where linguists annotate each sentence with syntactic structure, or semi-automatically, where a
parser Parsing, syntax analysis, or syntactic analysis is a process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar by breaking it into parts. The term '' ...
assigns some syntactic structure which linguists then check and, if necessary, correct. In practice, fully checking and completing the parsing of natural language corpora is a labour-intensive project that can take teams of graduate linguists several years. The level of annotation detail and the breadth of the linguistic sample determine the difficulty of the task and the length of time required to build a treebank. Some treebanks follow a specific linguistic theory in their syntactic annotation (e.g. th
BulTreeBank
follows
HPSG Head-driven phrase structure grammar (HPSG) is a highly lexicalized, constraint-based grammar developed by Carl Pollard and Ivan Sag. It is a type of phrase structure grammar, as opposed to a dependency grammar, and it is the immediate successor t ...
) but most try to be less theory-specific. However, two main groups can be distinguished: treebanks that annotate phrase structure (for example th
Penn Treebank
o

and those that annotate dependency structure (for example th
Prague Dependency Treebank
or th
Quranic Arabic Dependency Treebank
. It is important to clarify the distinction between the formal representation and the file format used to store the annotated data. Treebanks are necessarily constructed according to a particular grammar. The same grammar may be implemented by different file formats. For example, the syntactic analysis for ''John loves Mary'', shown in the figure on the right/above, may be represented by simple labelled brackets in a text file, like this (following th
Penn Treebank
notation): (S (NP (NNP John)) (VP (VPZ loves) (NP (NNP Mary))) (. .)) This type of representation is popular because it is light on resources, and the tree structure is relatively easy to read without software tools. However, as corpora become increasingly complex, other file formats may be preferred. Alternatives include treebank-specific
XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing data. It defines a set of rules for encoding electronic document, documents in a format that is both human-readable and Machine-r ...
schemes, numbered indentation and various types of standoff notation.


Applications

From a
computational linguistics Computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics ...
perspective, treebanks have been used to engineer state-of-the-art natural language processing systems such as part-of-speech taggers,
parser Parsing, syntax analysis, or syntactic analysis is a process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar by breaking it into parts. The term '' ...
s, semantic analyzers and machine translation systems. Most computational systems utilize gold-standard treebank data. However, an automatically parsed corpus that is not corrected by human linguists can still be useful. It can provide evidence of rule frequency for a parser. A parser may be improved by applying it to large amounts of text and gathering rule frequencies. However, it should be obvious that only by a process of correcting and completing a corpus by hand is it possible then to identify rules absent from the parser knowledge base. In addition, frequencies are likely to be more accurate. In
corpus linguistics Corpus linguistics is an empirical method for the study of language by way of a text corpus (plural ''corpora''). Corpora are balanced, often stratified collections of authentic, "real world", text of speech or writing that aim to represent a giv ...
, treebanks are used to study syntactic phenomena (for example, diachronic corpora can be used to study the time course of syntactic change). Once parsed, a corpus will contain frequency evidence showing how common different grammatical structures are in use. Treebanks also provide evidence of coverage and support the discovery of new, unanticipated, grammatical phenomena. Another use of treebanks in
theoretical linguistics Theoretical linguistics is a term in linguistics that, like the related term general linguistics, can be understood in different ways. Both can be taken as a reference to the theory of language, or the branch of linguistics that inquires into the ...
and
psycholinguistics Psycholinguistics or psychology of language is the study of the interrelation between linguistic factors and psychological aspects. The discipline is mainly concerned with the mechanisms by which language is processed and represented in the mind ...
is interaction evidence. A completed treebank can help linguists carry out experiments as to how the decision to use one grammatical construction tends to influence the decision to form others, and to try to understand how speakers and writers make decisions as they form sentences. Interaction research is particularly fruitful as further layers of annotation, e.g. semantic, pragmatic, are added to a corpus. It is then possible to evaluate the impact of non-syntactic phenomena on grammatical choices. In linguistics research, annotated treebank data has been used in syntactic research to test linguistic theories of sentence structure against large quantities of naturally occurring examples.


Semantic treebanks

A semantic treebank is a collection of natural language sentences annotated with a meaning representation. These resources use a formal representation of each sentence's semantic structure. Semantic treebanks vary in the depth of their semantic representation. A notable example of deep semantic annotation is th
Groningen Meaning Bank
developed at the
University of Groningen The University of Groningen (abbreviated as UG; , abbreviated as RUG) is a Public university#Continental Europe, public research university of more than 30,000 students in the city of Groningen (city), Groningen, Netherlands. Founded in 1614, th ...
and annotated using
Discourse Representation Theory In formal linguistics, discourse representation theory (DRT) is a framework for exploring meaning under a formal semantics approach. One of the main differences between DRT-style approaches and traditional Montagovian approaches is that DRT inc ...
. An example of a shallow semantic treebank is
PropBank PropBank is a corpus that is annotated with verbal propositions and their arguments—a "proposition bank". Although "PropBank" refers to a specific corpus produced by Martha Palmer ''et al.'', the term ''propbank'' is also coming to be used as ...
, which provides annotation of verbal propositions and their arguments, without attempting to represent every word in the corpus in
logical form In logic, the logical form of a statement is a precisely specified semantic version of that statement in a formal system. Informally, the logical form attempts to formalize a possibly ambiguous statement into a statement with a precise, unamb ...
.


Syntactic treebanks

Many syntactic treebanks have been developed for a wide variety of languages: To facilitate the further researches between multilingual tasks, some researchers discussed the universal annotation scheme for cross-languages. In this way, people try to utilize or merge the advantages of different treebanks corpora. For instance, The universal annotation approach for dependency treebanks; and the universal annotation approach for phrase structure treebanks.


Search tools

One of the key ways to extract evidence from a treebank is through search tools. Search tools for parsed corpora typically depend on the annotation scheme that was applied to the corpus. User interfaces range in sophistication from expression-based query systems aimed at computer programmers to full exploration environments aimed at general linguists. Wallis (2008) discusses the principles of searching treebanks in detail and reviews the state of the art around that time.Wallis, Sean (2008). Searching treebanks and other structured corpora. Chapter 34 in Lüdeling, A. & Kytö, M. (ed.) ''Corpus Linguistics: An International Handbook.'' Handbücher zur Sprache und Kommunikationswissenschaft series. Berlin: Mouton de Gruyter.


See also

*
Text corpus In linguistics and natural language processing, a corpus (: corpora) or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated. Annotated, they have been used in corp ...
*
Phrase structure grammar The term phrase structure grammar was originally introduced by Noam Chomsky as the term for grammar studied previously by Emil Post and Axel Thue ( Post canonical systems). Some authors, however, reserve the term for more restricted grammars in t ...
*
Dependency grammar Dependency grammar (DG) is a class of modern Grammar, grammatical theories that are all based on the dependency relation (as opposed to the ''constituency relation'' of Phrase structure grammar, phrase structure) and that can be traced back prima ...
*
Parsing Parsing, syntax analysis, or syntactic analysis is a process of analyzing a String (computer science), string of Symbol (formal), symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal gramm ...
*
Part-of-speech tagging In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging, is the process of marking up a word in a text ( corpus) as corresponding to a particular part of speech, based on both its defini ...


References

{{Natural language processing Computational linguistics Corpus linguistics Syntax Semantics Test items