Scannerless parsing
   HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, scannerless parsing (also called lexerless parsing) performs tokenization (breaking a stream of characters into words) and parsing (arranging the words into phrases) in a single step, rather than breaking it up into a pipeline of a lexer followed by a
parser Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term ''parsing'' comes from Lat ...
, executing concurrently. A language grammar is scannerless if it uses a single formalism to express both the lexical (word level) and phrase level structure of the language. Dividing processing into a lexer followed by a parser is more modular; scannerless parsing is primarily used when a clear lexer–parser distinction is unneeded or unwanted. Examples of when this is appropriate include TeX, most
wiki A wiki ( ) is an online hypertext publication collaboratively edited and managed by its own audience, using a web browser. A typical wiki contains multiple pages for the subjects or scope of the project, and could be either open to the pub ...
grammars,
makefile In software development, Make is a build automation tool that automatically builds executable programs and libraries from source code by reading files called ''Makefiles'' which specify how to derive the target program. Though integrated ...
s, simple application-specific
scripting language A scripting language or script language is a programming language that is used to manipulate, customize, and automate the facilities of an existing system. Scripting languages are usually interpreted at runtime rather than compiled. A scripting ...
s, and Raku.


Advantages

* Only one
metalanguage In logic and linguistics, a metalanguage is a language used to describe another language, often called the ''object language''. Expressions in a metalanguage are often distinguished from those in the object language by the use of italics, quota ...
is needed * Non-regular lexical structure is handled easily * "Token classification" is unneeded which removes the need for design accommodations such as " the lexer hack" and language
reserved word In a computer language, a reserved word (also known as a reserved identifier) is a word that cannot be used as an identifier, such as the name of a variable, function, or label – it is "reserved from use". This is a syntactic definition, and a re ...
s (such as "while" in C) * Grammars can be compositional (can be merged without human intervention)


Disadvantages

* Since the lexical scanning and syntactic parsing are combined, the resulting parser tends to be more complicated and thus harder to understand and debug. The same will hold for the associated grammar, if a grammar is used to generate the parser. * The resulting parser tends to be significantly less efficient than a lexer-parser pipeline with regard to both time and memory.


Implementations


SGLR
is a parser for the modular Syntax Definition Formalism SDF, and is part of the ASF+SDF Meta-Environment and the Stratego/XT program transformation system.
JSGLR
a pure Java implementation of SGLR, also based on SDF. * TXL supports character-level parsing.
dparser
generates ANSI C code for scannerless
GLR parser A GLR parser (GLR standing for "Generalized LR", where L stands for "left-to-right" and R stands for "rightmost (derivation)") is an extension of an LR parser algorithm to handle non-deterministic and ambiguous grammars. The theoretical foundat ...
s. * Spirit allows for both scannerless and scanner-based parsing. * SBP is a scannerless parser for boolean grammars (a superset of context-free grammars), written in Java.
Laja
is a two phase scannerless parser generator with support for mapping the grammar rules into objects, written in Java. * The Raku Grammars feature of the general purpose programming language Raku.
PyParsing
is a scannerless parser written in pure Python. *
META II META II is a domain-specific programming language for writing compilers. It was created in 1963–1964 by Dewey Val Schorre at UCLA. META II uses what Schorre called syntax equations. Its operation is simply explained as: Each ''syntax equation' ...
Has built in token parsers functions. * TREE-META Like META II also is scannerless having builtin lexer functions. * CWIC Compiler for Writing and Implementing Compilers. Has token rules as part of its language. Rules in CWIC were compiled into boolean functions returning success or failure.


Notes

* This is because parsing at the character level makes the language recognized by the parser a single
context-free language In formal language theory, a context-free language (CFL) is a language generated by a context-free grammar (CFG). Context-free languages have many applications in programming languages, in particular, most arithmetic expressions are generated by ...
defined on characters, as opposed to a context-free language of sequences of strings in
regular language In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science (as opposed to ...
s. Some lexerless parsers handle the entire class of context-free languages, which is closed under composition.


References


Further reading

* {{Parsers Parsing algorithms