Parsing
   HOME

TheInfoList



OR:

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language,
computer languages A computer language is a formal language used to communicate with a computer. Types of computer languages include: * Construction language – all forms of communication by which a human can specify an executable problem solution to a comput ...
or data structures, conforming to the rules of a
formal grammar In formal language theory, a grammar (when the context is not given, often called a formal grammar for clarity) describes how to form strings from a language's alphabet that are valid according to the language's syntax. A grammar does not describe ...
. The term ''parsing'' comes from Latin ''pars'' (''orationis''), meaning part (of speech). The term has slightly different meanings in different branches of
linguistics Linguistics is the science, scientific study of human language. It is called a scientific study because it entails a comprehensive, systematic, objective, and precise analysis of all aspects of language, particularly its nature and structure ...
and
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includi ...
. Traditional sentence parsing is often performed as a method of understanding the exact meaning of a sentence or word, sometimes with the aid of devices such as
sentence diagram A sentence diagram is a pictorial representation of the grammatical structure of a sentence. The term "sentence diagram" is used more when teaching written language, where sentences are ''diagrammed''. The model shows the relations between words ...
s. It usually emphasizes the importance of grammatical divisions such as subject and predicate. Within computational linguistics the term is used to refer to the formal analysis by a computer of a sentence or other string of words into its constituents, resulting in a parse tree showing their syntactic relation to each other, which may also contain semantic and other information (
p-values In null-hypothesis significance testing, the ''p''-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. A very small ''p''-value means ...
). Some parsing algorithms may generate a ''parse forest'' or list of parse trees for a
syntactically ambiguous Syntactic ambiguity, also called structural ambiguity, amphiboly or amphibology, is a situation where a sentence may be interpreted in more than one way due to ambiguous sentence structure. Syntactic ambiguity arises not from the range of mean ...
input. The term is also used in psycholinguistics when describing language comprehension. In this context, parsing refers to the way that human beings analyze a sentence or phrase (in spoken language or text) "in terms of grammatical constituents, identifying the parts of speech, syntactic relations, etc." This term is especially common when discussing which linguistic cues help speakers interpret garden-path sentences. Within computer science, the term is used in the analysis of
computer languages A computer language is a formal language used to communicate with a computer. Types of computer languages include: * Construction language – all forms of communication by which a human can specify an executable problem solution to a comput ...
, referring to the syntactic analysis of the input code into its component parts in order to facilitate the writing of
compilers In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs that ...
and
interpreters Interpreting is a translational activity in which one produces a first and final target-language output on the basis of a one-time exposure to an expression in a source language. The most common two modes of interpreting are simultaneous interp ...
. The term may also be used to describe a split or separation.


Human languages


Traditional methods

The traditional grammatical exercise of parsing, sometimes known as ''clause analysis'', involves breaking down a text into its component parts of speech with an explanation of the form, function, and syntactic relationship of each part. This is determined in large part from study of the language's conjugations and
declensions In linguistics, declension (verb: ''to decline'') is the changing of the form of a word, generally to express its syntactic function in the sentence, by way of some inflection. Declensions may apply to nouns, pronouns, adjectives, adverbs, and a ...
, which can be quite intricate for heavily inflected languages. To parse a phrase such as 'man bites dog' involves noting that the singular noun 'man' is the subject of the sentence, the verb 'bites' is the third person singular of the present tense of the verb 'to bite', and the singular noun 'dog' is the object of the sentence. Techniques such as
sentence diagram A sentence diagram is a pictorial representation of the grammatical structure of a sentence. The term "sentence diagram" is used more when teaching written language, where sentences are ''diagrammed''. The model shows the relations between words ...
s are sometimes used to indicate relation between elements in the sentence. Parsing was formerly central to the teaching of grammar throughout the English-speaking world, and widely regarded as basic to the use and understanding of written language. However, the general teaching of such techniques is no longer current.


Computational methods

In some machine translation and natural language processing systems, written texts in human languages are parsed by computer programs. Human sentences are not easily parsed by programs, as there is substantial ambiguity in the structure of human language, whose usage is to convey meaning (or
semantics Semantics (from grc, σημαντικός ''sēmantikós'', "significant") is the study of reference, meaning, or truth. The term can be used to refer to subfields of several distinct disciplines, including philosophy, linguistics and comp ...
) amongst a potentially unlimited range of possibilities but only some of which are germane to the particular case. So an utterance "Man bites dog" versus "Dog bites man" is definite on one detail but in another language might appear as "Man dog bites" with a reliance on the larger context to distinguish between those two possibilities, if indeed that difference was of concern. It is difficult to prepare formal rules to describe informal behaviour even though it is clear that some rules are being followed. In order to parse natural language data, researchers must first agree on the
grammar In linguistics, the grammar of a natural language is its set of structural constraints on speakers' or writers' composition of clauses, phrases, and words. The term can also refer to the study of such constraints, a field that includes domain ...
to be used. The choice of syntax is affected by both
linguistic Linguistics is the scientific study of human language. It is called a scientific study because it entails a comprehensive, systematic, objective, and precise analysis of all aspects of language, particularly its nature and structure. Linguis ...
and computational concerns; for instance some parsing systems use lexical functional grammar, but in general, parsing for grammars of this type is known to be
NP-complete In computational complexity theory, a problem is NP-complete when: # it is a problem for which the correctness of each solution can be verified quickly (namely, in polynomial time) and a brute-force search algorithm can find a solution by trying ...
. Head-driven phrase structure grammar is another linguistic formalism which has been popular in the parsing community, but other research efforts have focused on less complex formalisms such as the one used in the Penn
Treebank In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empiri ...
.
Shallow parsing Shallow parsing (also chunking or light parsing) is an analysis of a sentence which first identifies constituent parts of sentences (nouns, verbs, adjectives, etc.) and then links them to higher order units that have discrete grammatical meanings ...
aims to find only the boundaries of major constituents such as noun phrases. Another popular strategy for avoiding linguistic controversy is dependency grammar parsing. Most modern parsers are at least partly statistical; that is, they rely on a
corpus Corpus is Latin for "body". It may refer to: Linguistics * Text corpus, in linguistics, a large and structured set of texts * Speech corpus, in linguistics, a large set of speech audio files * Corpus linguistics, a branch of linguistics Music * ...
of training data which has already been annotated (parsed by hand). This approach allows the system to gather information about the frequency with which various constructions occur in specific contexts. ''(See
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
.)'' Approaches which have been used include straightforward PCFGs (probabilistic context-free grammars), maximum entropy, and
neural net Artificial neural networks (ANNs), usually simply called neural networks (NNs) or neural nets, are computing systems inspired by the biological neural networks that constitute animal brains. An ANN is based on a collection of connected units ...
s. Most of the more successful systems use ''lexical'' statistics (that is, they consider the identities of the words involved, as well as their part of speech). However such systems are vulnerable to overfitting and require some kind of
smoothing In statistics and image processing, to smooth a data set is to create an approximating function that attempts to capture important patterns in the data, while leaving out noise or other fine-scale structures/rapid phenomena. In smoothing, the dat ...
to be effective. Parsing algorithms for natural language cannot rely on the grammar having 'nice' properties as with manually designed grammars for programming languages. As mentioned earlier some grammar formalisms are very difficult to parse computationally; in general, even if the desired structure is not context-free, some kind of context-free approximation to the grammar is used to perform a first pass. Algorithms which use context-free grammars often rely on some variant of the
CYK algorithm In computer science, the Cocke–Younger–Kasami algorithm (alternatively called CYK, or CKY) is a parsing algorithm for context-free grammars published by Itiroo Sakai in 1961. The algorithm is named after some of its rediscoverers: John Cocke, ...
, usually with some heuristic to prune away unlikely analyses to save time. ''(See
chart parsing In computer science, a chart parser is a type of parser suitable for ambiguous grammars (including grammars of natural languages). It uses the dynamic programming approach—partial hypothesized results are stored in a structure called a chart a ...
.)'' However some systems trade speed for accuracy using, e.g., linear-time versions of the shift-reduce algorithm. A somewhat recent development has been parse reranking in which the parser proposes some large number of analyses, and a more complex system selects the best option. In
natural language understanding Natural-language understanding (NLU) or natural-language interpretation (NLI) is a subtopic of natural-language processing in artificial intelligence that deals with machine reading comprehension. Natural-language understanding is considered an A ...
applications, semantic parsers convert the text into a representation of its meaning.


Psycholinguistics

In psycholinguistics, parsing involves not just the assignment of words to categories (formation of ontological insights), but the evaluation of the meaning of a sentence according to the rules of syntax drawn by inferences made from each word in the sentence (known as connotation). This normally occurs as words are being heard or read. Consequently, psycholinguistic models of parsing are of necessity ''incremental'', meaning that they build up an interpretation as the sentence is being processed, which is normally expressed in terms of a partial syntactic structure. Creation of initially wrong structures occurs when interpreting
garden-path sentence A garden-path sentence is a grammatically correct sentence that starts in such a way that a reader's most likely interpretation will be incorrect; the reader is lured into a parse that turns out to be a dead end or yields a clearly unintended me ...
s.


Discourse analysis

Discourse analysis Discourse analysis (DA), or discourse studies, is an approach to the analysis of written, vocal, or sign language use, or any significant semiotic event. The objects of discourse Analysis (discourse, writing, conversation, communicative event) ...
examines ways to analyze language use and semiotic events. Persuasive language may be called rhetoric.


Computer languages


Parser

A parser is a software component that takes input data (frequently text) and builds a data structure – often some kind of parse tree,
abstract syntax tree In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of the abstract syntactic structure of text (often source code) written in a formal language. Each node of the tree denotes a construct occurr ...
or other hierarchical structure, giving a structural representation of the input while checking for correct syntax. The parsing may be preceded or followed by other steps, or these may be combined into a single step. The parser is often preceded by a separate lexical analyser, which creates tokens from the sequence of input characters; alternatively, these can be combined in
scannerless parsing In computer science, scannerless parsing (also called lexerless parsing) performs tokenization (breaking a stream of characters into words) and parsing (arranging the words into phrases) in a single step, rather than breaking it up into a pipeli ...
. Parsers may be programmed by hand or may be automatically or semi-automatically generated by a
parser generator In computer science, a compiler-compiler or compiler generator is a programming tool that creates a parser, interpreter, or compiler from some form of formal description of a programming language and machine. The most common type of compiler- ...
. Parsing is complementary to templating, which produces formatted ''output.'' These may be applied to different domains, but often appear together, such as the scanf/
printf The printf format string is a control parameter used by a class of functions in the input/output libraries of C and many other programming languages. The string is written in a simple template language: characters are usually copied literal ...
pair, or the input (front end parsing) and output (back end code generation) stages of a compiler. The input to a parser is often text in some
computer language A computer language is a formal language used to communicate with a computer. Types of computer languages include: * Construction language – all forms of communication by which a human can specify an executable problem solution to a comput ...
, but may also be text in a natural language or less structured textual data, in which case generally only certain parts of the text are extracted, rather than a parse tree being constructed. Parsers range from very simple functions such as scanf, to complex programs such as the frontend of a
C++ compiler This page is intended to list all current compilers, compiler generators, interpreters, translators, tool foundations, assemblers, automatable command line interfaces ( shells), etc. Ada Compilers ALGOL 60 compilers ALGOL 68 compilers cf. ...
or the
HTML The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaSc ...
parser of a web browser. An important class of simple parsing is done using
regular expression A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" ...
s, in which a group of regular expressions defines a
regular language In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science (as opposed to ...
and a regular expression engine automatically generating a parser for that language, allowing
pattern matching In computer science, pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact: "either it will or will not be ...
and extraction of text. In other contexts regular expressions are instead used prior to parsing, as the lexing step whose output is then used by the parser. The use of parsers varies by input. In the case of data languages, a parser is often found as the file reading facility of a program, such as reading in HTML or XML text; these examples are markup languages. In the case of
programming language A programming language is a system of notation for writing computer programs. Most programming languages are text-based formal languages, but they may also be graphical. They are a kind of computer language. The description of a programming ...
s, a parser is a component of a
compiler In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs tha ...
or interpreter, which parses the
source code In computing, source code, or simply code, is any collection of code, with or without comments, written using a human-readable programming language, usually as plain text. The source code of a program is specially designed to facilitate the w ...
of a
computer programming language A programming language is a system of notation for writing computer programs. Most programming languages are text-based formal languages, but they may also be graphical. They are a kind of computer language. The description of a programming l ...
to create some form of internal representation; the parser is a key step in the
compiler frontend In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs that ...
. Programming languages tend to be specified in terms of a
deterministic context-free grammar In formal grammar theory, the deterministic context-free grammars (DCFGs) are a proper subset of the context-free grammars. They are the subset of context-free grammars that can be derived from deterministic pushdown automata, and they generate the ...
because fast and efficient parsers can be written for them. For compilers, the parsing itself can be done in one pass or multiple passes – see
one-pass compiler In computer programming, a one-pass compiler is a compiler that passes through the parts of each compilation unit only once, immediately translating each part into its final machine code. This is in contrast to a multi-pass compiler which conver ...
and multi-pass compiler. The implied disadvantages of a one-pass compiler can largely be overcome by adding fix-ups, where provision is made for code relocation during the forward pass, and the fix-ups are applied backwards when the current program segment has been recognized as having been completed. An example where such a fix-up mechanism would be useful would be a forward GOTO statement, where the target of the GOTO is unknown until the program segment is completed. In this case, the application of the fix-up would be delayed until the target of the GOTO was recognized. Conversely, a backward GOTO does not require a fix-up, as the location will already be known. Context-free grammars are limited in the extent to which they can express all of the requirements of a language. Informally, the reason is that the memory of such a language is limited. The grammar cannot remember the presence of a construct over an arbitrarily long input; this is necessary for a language in which, for example, a name must be declared before it may be referenced. More powerful grammars that can express this constraint, however, cannot be parsed efficiently. Thus, it is a common strategy to create a relaxed parser for a context-free grammar which accepts a superset of the desired language constructs (that is, it accepts some invalid constructs); later, the unwanted constructs can be filtered out at the semantic analysis (contextual analysis) step. For example, in
Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (pro ...
the following is syntactically valid code: x = 1 print(x) The following code, however, is syntactically valid in terms of the context-free grammar, yielding a syntax tree with the same structure as the previous, but violates the semantic rule requiring variables to be initialized before use: x = 1 print(y)


Overview of process

The following example demonstrates the common case of parsing a computer language with two levels of grammar: lexical and syntactic. The first stage is the token generation, or
lexical analysis In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of ''lexical tokens'' ( strings with an assigned and thus identified ...
, by which the input character stream is split into meaningful symbols defined by a grammar of
regular expression A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" ...
s. For example, a calculator program would look at an input such as "12 * (3 + 4)^2" and split it into the tokens 12, *, (, 3, +, 4, ), ^, 2, each of which is a meaningful symbol in the context of an arithmetic expression. The lexer would contain rules to tell it that the characters *, +, ^, ( and ) mark the start of a new token, so meaningless tokens like "12*" or "(3" will not be generated. The next stage is parsing or syntactic analysis, which is checking that the tokens form an allowable expression. This is usually done with reference to a context-free grammar which recursively defines components that can make up an expression and the order in which they must appear. However, not all rules defining programming languages can be expressed by context-free grammars alone, for example type validity and proper declaration of identifiers. These rules can be formally expressed with
attribute grammar An attribute grammar is a formal way to supplement a formal grammar with semantic information processing. Semantic information is stored in attributes associated with terminal and nonterminal symbols of the grammar. The values of attributes are resu ...
s. The final phase is
semantic parsing Semantic parsing is the task of converting a natural language utterance to a logical form: a machine-understandable representation of its meaning. Semantic parsing can thus be understood as extracting the precise meaning of an utterance. Application ...
or analysis, which is working out the implications of the expression just validated and taking the appropriate action. In the case of a calculator or interpreter, the action is to evaluate the expression or program; a compiler, on the other hand, would generate some kind of code. Attribute grammars can also be used to define these actions.


Types of parsers

The ''task'' of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar. This can be done in essentially two ways: *
Top-down parsing Top-down parsing in computer science is a parsing strategy where one first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar. LL parsers are a type of parser that uses a top- ...
- Top-down parsing can be viewed as an attempt to find left-most derivations of an input-stream by searching for parse trees using a top-down expansion of the given
formal grammar In formal language theory, a grammar (when the context is not given, often called a formal grammar for clarity) describes how to form strings from a language's alphabet that are valid according to the language's syntax. A grammar does not describe ...
rules. Tokens are consumed from left to right. Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand-sides of grammar rules.Aho, A.V., Sethi, R. and Ullman, J.D. (1986) " Compilers: principles, techniques, and tools." '' Addison-Wesley Longman Publishing Co., Inc. Boston, MA, USA. '' This is known as the primordial soup approach. Very similar to sentence diagramming, primordial soup breaks down the constituencies of sentences. *
Bottom-up parsing In computer science, parsing reveals the grammatical structure of linear input text, as a first step in working out its meaning. Bottom-up parsing recognizes the text's lowest-level small details first, before its mid-level structures, and leavin ...
- A parser can start with the input and attempt to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on.
LR parser In computer science, LR parsers are a type of bottom-up parser that analyse deterministic context-free languages in linear time. There are several variants of LR parsers: SLR parsers, LALR parsers, Canonical LR(1) parsers, Minimal LR(1) parse ...
s are examples of bottom-up parsers. Another term used for this type of parser is Shift-Reduce parsing.
LL parser In computer science, an LL parser (Left-to-right, leftmost derivation) is a top-down parser for a restricted context-free language. It parses the input from Left to right, performing Leftmost derivation of the sentence. An LL parser is called a ...
s and
recursive-descent parser In computer science, a recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures (or a non-recursive equivalent) where each such procedure implements one of the nonterminals of the grammar. Thus t ...
are examples of top-down parsers that cannot accommodate left recursive production rules. Although it has been believed that simple implementations of top-down parsing cannot accommodate direct and indirect left-recursion and may require exponential time and space complexity while parsing ambiguous context-free grammars, more sophisticated algorithms for top-down parsing have been created by Frost, Hafiz, and CallaghanFrost, R., Hafiz, R. and Callaghan, P. (2007)
Modular and Efficient Top-Down Parsing for Ambiguous Left-Recursive Grammars
." ''10th International Workshop on Parsing Technologies (IWPT), ACL-SIGPARSE '', Pages: 109 - 120, June 2007, Prague.
Frost, R., Hafiz, R. and Callaghan, P. (2008)
Parser Combinators for Ambiguous Left-Recursive Grammars
" '' 10th International Symposium on Practical Aspects of Declarative Languages (PADL), ACM-SIGPLAN '', Volume 4902/2008, Pages: 167 - 181, January 2008, San Francisco.
which accommodate ambiguity and
left recursion In the formal language theory of computer science, left recursion is a special case of recursion where a string is recognized as part of a language by the fact that it decomposes into a string from that same language (on the left) and a suffix (on ...
in polynomial time and which generate polynomial-size representations of the potentially exponential number of parse trees. Their algorithm is able to produce both left-most and right-most derivations of an input with regard to a given context-free grammar. An important distinction with regard to parsers is whether a parser generates a ''leftmost derivation'' or a ''rightmost derivation'' (see context-free grammar). LL parsers will generate a leftmost
derivation Derivation may refer to: Language * Morphological derivation, a word-formation process * Parse tree or concrete syntax tree, representing a string's syntax in formal grammars Law * Derivative work, in copyright law * Derivation proceeding, a proc ...
and LR parsers will generate a rightmost derivation (although usually in reverse). Some ' algorithms have been designed for
visual programming languages In computing, a visual programming language (visual programming system, VPL, or, VPS) is any programming language that lets users create programs by manipulating program elements ''graphically'' rather than by specifying them ''textually''. A VPL ...
. Parsers for visual languages are sometimes based on
graph grammar In computer science, graph transformation, or graph rewriting, concerns the technique of creating a new graph out of an original graph algorithmically. It has numerous applications, ranging from software engineering (software construction and also ...
s. Adaptive parsing algorithms have been used to construct "self-extending"
natural language user interface Natural-language user interface (LUI or NLUI) is a type of computer human interface where linguistic phenomena such as verbs, phrases and clauses act as UI controls for creating, selecting and modifying data in software applications. In interface d ...
s.


Implementation

The simplest parser APIs read the entire input file, do some intermediate computation, and then write the entire output file. (Such as in-memory multi-pass compilers). Those simple parsers won't work when there isn't enough memory to store the entire input file or the entire output file. They also won't work for never-ending streams of data from the real world. Some alternative API approaches for parsing such data: * push parsers that call the registered handlers (callbacks) as soon as the parser detects relevant tokens in the input stream (such as
Expat An expatriate (often shortened to expat) is a person who resides outside their native country. In common usage, the term often refers to educated professionals, skilled workers, or artists taking positions outside their home country, either ...
) * pull parsers * incremental parsers (such as incremental
chart parser In computer science, a chart parser is a type of parser suitable for ambiguous grammars (including grammars of natural languages). It uses the dynamic programming approach—partial hypothesized results are stored in a structure called a chart and ...
s) that, as the text of the file is edited by a user, does not need to completely re-parse the entire file. * Active vs passive parsers Song-Chun Zhu
"Classic Parsing Algorithms"


Parser development software

Some of the well known parser development tools include the following: *
ANTLR In computer-based language recognition, ANTLR (pronounced ''antler''), or ANother Tool for Language Recognition, is a parser generator that uses LL(*) for parsing. ANTLR is the successor to the Purdue Compiler Construction Tool Set (PCCTS), firs ...
* Bison *
Coco/R Coco/R is a compiler generator that takes wirth syntax notationIn the manual, however, it is referred as L-attributed Extended Backus–Naur Form syntax (EBNF). grammars of a source language and generates a scanner and a parser for that lang ...
* Definite clause grammar *
GOLD Gold is a chemical element with the symbol Au (from la, aurum) and atomic number 79. This makes it one of the higher atomic number elements that occur naturally. It is a bright, slightly orange-yellow, dense, soft, malleable, and ductile me ...
*
JavaCC JavaCC (Java Compiler Compiler) is an open-source software, open-source parser generator and Lexical analysis, lexical analyzer generator written in the Java (programming language), Java programming language. JavaCC is similar to yacc in that it ...
*
Lemon The lemon (''Citrus limon'') is a species of small evergreen trees in the flowering plant family Rutaceae, native to Asia, primarily Northeast India (Assam), Northern Myanmar or China. The tree's ellipsoidal yellow fruit is used for culin ...
*
Lex Lex or LEX may refer to: Arts and entertainment * ''Lex'', a daily featured column in the ''Financial Times'' Games * Lex, the mascot of the word-forming puzzle video game ''Bookworm'' * Lex, the protagonist of the word-forming puzzle video ga ...
* LuZc *
Parboiled Parboiling (or leaching) is the partial or semi boiling of food as the first step in cooking. The word is from the Old French 'parboillir' (to boil thoroughly) but by mistaken association with 'part' it has acquired its current meaning. The wo ...
*
Parsec The parsec (symbol: pc) is a unit of length used to measure the large distances to astronomical objects outside the Solar System, approximately equal to or (au), i.e. . The parsec unit is obtained by the use of parallax and trigonometry, an ...
*
Ragel Ragel is a finite-state machine compiler and a parser generator. Initially Ragel supported output for C, C++ and Assembly source code, and was expanded to support several other languages including Objective C, D, Go, Ruby, and Java. Additiona ...
*
Spirit Parser Framework The Spirit Parser Framework is an object oriented recursive descent parser generator framework implemented using template metaprogramming techniques. Expression templates allow users to approximate the syntax of extended Backus–Naur form (EBNF) ...
*
Syntax Definition Formalism The Syntax Definition Formalism (SDF) is a metasyntax used to define context-free grammars: that is, a formal way to describe formal languages. It can express the entire range of context-free grammars. Its current version is SDF3. A parser and pars ...
* SYNTAX * XPL * Yacc * PackCC


Lookahead

Lookahead establishes the maximum incoming tokens that a parser can use to decide which rule it should use. Lookahead is especially relevant to LL, LR, and
LALR parser In computer science, an LALR parser or Look-Ahead LR parser is a simplified version of a canonical LR parser, to parse a text according to a set of production rules specified by a formal grammar for a computer language. ("LR" means left-to-rig ...
s, where it is often explicitly indicated by affixing the lookahead to the algorithm name in parentheses, such as LALR(1). Most
programming language A programming language is a system of notation for writing computer programs. Most programming languages are text-based formal languages, but they may also be graphical. They are a kind of computer language. The description of a programming ...
s, the primary target of parsers, are carefully defined in such a way that a parser with limited lookahead, typically one, can parse them, because parsers with limited lookahead are often more efficient. One important change to this trend came in 1990 when
Terence Parr Terence John Parr (born 1964 in Los Angeles) is a professor of computer science at the University of San Francisco. He is best known for his ANTLR parser generator and contributions to parsing theory. He also developed the StringTemplate engin ...
created
ANTLR In computer-based language recognition, ANTLR (pronounced ''antler''), or ANother Tool for Language Recognition, is a parser generator that uses LL(*) for parsing. ANTLR is the successor to the Purdue Compiler Construction Tool Set (PCCTS), firs ...
for his Ph.D. thesis, a
parser generator In computer science, a compiler-compiler or compiler generator is a programming tool that creates a parser, interpreter, or compiler from some form of formal description of a programming language and machine. The most common type of compiler- ...
for efficient LL(''k'') parsers, where ''k'' is any fixed value. LR parsers typically have only a few actions after seeing each token. They are shift (add this token to the stack for later reduction), reduce (pop tokens from the stack and form a syntactic construct), end, error (no known rule applies) or conflict (does not know whether to shift or reduce). Lookahead has two advantages. * It helps the parser take the correct action in case of conflicts. For example, parsing the if statement in the case of an else clause. * It eliminates many duplicate states and eases the burden of an extra stack. A C language non-lookahead parser will have around 10,000 states. A lookahead parser will have around 300 states. Example: Parsing the Expression {, class="toccolours" , colspan=3 , Set of expression parsing rules (called grammar) is as follows, , - , Rule1: , , E → E + E , , style="padding-left:1em" , Expression is the sum of two expressions. , - , Rule2: , , E → E * E , , style="padding-left:1em" , Expression is the product of two expressions. , - , Rule3: , , E → number , , style="padding-left:1em" , Expression is a simple number , - , Rule4: , , colspan=2 , + has less precedence than * Most programming languages (except for a few such as APL and Smalltalk) and algebraic formulas give higher precedence to multiplication than addition, in which case the correct interpretation of the example above is . Note that Rule4 above is a semantic rule. It is possible to rewrite the grammar to incorporate this into the syntax. However, not all such rules can be translated into syntax. ; Simple non-lookahead parser actions Initially Input = , +, 2, *, 3# Shift "1" onto stack from input (in anticipation of rule3). Input = , 2, *, 3Stack = # Reduces "1" to expression "E" based on rule3. Stack = # Shift "+" onto stack from input (in anticipation of rule1). Input = , *, 3Stack =
, + The comma is a punctuation mark that appears in several variants in different languages. It has the same shape as an apostrophe or single closing quotation mark () in many typefaces, but it differs from them in being placed on the baseline o ...
# Shift "2" onto stack from input (in anticipation of rule3). Input = , 3Stack =
, +, 2 The comma is a punctuation mark that appears in several variants in different languages. It has the same shape as an apostrophe or single closing quotation mark () in many typefaces, but it differs from them in being placed on the baseline o ...
# Reduce stack element "2" to Expression "E" based on rule3. Stack =
, +, E The comma is a punctuation mark that appears in several variants in different languages. It has the same shape as an apostrophe or single closing quotation mark () in many typefaces, but it differs from them in being placed on the baseline o ...
# Reduce stack items
, +, E The comma is a punctuation mark that appears in several variants in different languages. It has the same shape as an apostrophe or single closing quotation mark () in many typefaces, but it differs from them in being placed on the baseline o ...
and new input "E" to "E" based on rule1. Stack = # Shift "*" onto stack from input (in anticipation of rule2). Input = Stack = ,*# Shift "3" onto stack from input (in anticipation of rule3). Input = [] (empty) Stack = [E, *, 3] # Reduce stack element "3" to expression "E" based on rule3. Stack = [E, *, E] # Reduce stack items [E, *, E] and new input "E" to "E" based on rule2. Stack = The parse tree and resulting code from it is not correct according to language semantics. To correctly parse without lookahead, there are three solutions: * The user has to enclose expressions within parentheses. This often is not a viable solution. * The parser needs to have more logic to backtrack and retry whenever a rule is violated or not complete. The similar method is followed in LL parsers. * Alternatively, the parser or grammar needs to have extra logic to delay reduction and reduce only when it is absolutely sure which rule to reduce first. This method is used in LR parsers. This correctly parses the expression but with many more states and increased stack depth. ; Lookahead parser actions # Shift 1 onto stack on input 1 in anticipation of rule3. It does not reduce immediately. # Reduce stack item 1 to simple Expression on input + based on rule3. The lookahead is +, so we are on path to E +, so we can reduce the stack to E. # Shift + onto stack on input + in anticipation of rule1. # Shift 2 onto stack on input 2 in anticipation of rule3. # Reduce stack item 2 to Expression on input * based on rule3. The lookahead * expects only E before it. # Now stack has E + E and still the input is *. It has two choices now, either to shift based on rule2 or reduction based on rule1. Since * has higher precedence than + based on rule4, we shift * onto stack in anticipation of rule2. # Shift 3 onto stack on input 3 in anticipation of rule3. # Reduce stack item 3 to Expression after seeing end of input based on rule3. # Reduce stack items E * E to E based on rule2. # Reduce stack items E + E to E based on rule1. The parse tree generated is correct and simply than non-lookahead parsers. This is the strategy followed in
LALR parser In computer science, an LALR parser or Look-Ahead LR parser is a simplified version of a canonical LR parser, to parse a text according to a set of production rules specified by a formal grammar for a computer language. ("LR" means left-to-rig ...
s.


See also

*
Backtracking Backtracking is a class of algorithms for finding solutions to some computational problems, notably constraint satisfaction problems, that incrementally builds candidates to the solutions, and abandons a candidate ("backtracks") as soon as it d ...
*
Chart parser In computer science, a chart parser is a type of parser suitable for ambiguous grammars (including grammars of natural languages). It uses the dynamic programming approach—partial hypothesized results are stored in a structure called a chart and ...
* Compiler-compiler * Deterministic parsing * Generating strings *
Grammar checker A grammar checker, in computing terms, is a program, or part of a program, that attempts to verify written text for grammatical correctness. Grammar checkers are most often implemented as a feature of a larger program, such as a word processor, b ...
*
LALR parser In computer science, an LALR parser or Look-Ahead LR parser is a simplified version of a canonical LR parser, to parse a text according to a set of production rules specified by a formal grammar for a computer language. ("LR" means left-to-rig ...
*
Lexical analysis In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of ''lexical tokens'' ( strings with an assigned and thus identified ...
* Pratt parser *
Shallow parsing Shallow parsing (also chunking or light parsing) is an analysis of a sentence which first identifies constituent parts of sentences (nouns, verbs, adjectives, etc.) and then links them to higher order units that have discrete grammatical meanings ...
* Left corner parser *
Parsing expression grammar In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 200 ...
*
DMS Software Reengineering Toolkit The DMS Software Reengineering Toolkit is a proprietary set of program transformation tools available for automating custom source program analysis, modification, translation or generation of software systems for arbitrary mixtures of source langu ...
*
Program transformation A program transformation is any operation that takes a computer program and generates another program. In many cases the transformed program is required to be semantically equivalent to the original, relative to a particular formal semantics and ...
* Source code generation * Inverse parser


References


Further reading

* Chapman, Nigel P.
''LR Parsing: Theory and Practice''
Cambridge University Press Cambridge University Press is the university press of the University of Cambridge. Granted letters patent by King Henry VIII in 1534, it is the oldest university press in the world. It is also the King's Printer. Cambridge University Pre ...
, 1987. * Grune, Dick; Jacobs, Ceriel J.H.
''Parsing Techniques - A Practical Guide''
Vrije Universiteit Amsterdam, Amsterdam, The Netherlands. Originally published by Ellis Horwood, Chichester, England, 1990;


External links


The Lemon LALR Parser Generator

Stanford Parser
The Stanford Parser
Turin University Parser
Natural language parser for the Italian, open source, developed in Common Lisp by Leonardo Lesmo, University of Torino, Italy.

{{Strings Algorithms on strings Compiler construction