In
computer science
Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includin ...
, scannerless parsing (also called lexerless parsing) performs tokenization (breaking a stream of characters into words) and parsing (arranging the words into phrases) in a single step, rather than breaking it up into a
pipeline of a
lexer
In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of ''lexical tokens'' (strings with an assigned and thus identified m ...
followed by a
parser
Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term ''parsing'' comes from Lati ...
, executing
concurrently. A
language grammar is scannerless if it uses a single formalism to express both the
lexical (word level) and phrase level structure of the language.
Dividing processing into a lexer followed by a parser is more modular; scannerless parsing is primarily used when a clear lexer–parser distinction is unneeded or unwanted. Examples of when this is appropriate include
TeX, most
wiki
A wiki ( ) is an online hypertext publication collaboratively edited and managed by its own audience, using a web browser. A typical wiki contains multiple pages for the subjects or scope of the project, and could be either open to the pu ...
grammars,
makefile
In software development, Make is a build automation tool that automatically builds executable programs and libraries from source code by reading files called ''Makefiles'' which specify how to derive the target program. Though integrated deve ...
s, simple application-specific
scripting language
A scripting language or script language is a programming language that is used to manipulate, customize, and automate the facilities of an existing system. Scripting languages are usually interpreted at runtime rather than compiled.
A scripti ...
s, and
Raku.
Advantages
* Only one
metalanguage
In logic and linguistics, a metalanguage is a language used to describe another language, often called the ''object language''. Expressions in a metalanguage are often distinguished from those in the object language by the use of italics, quota ...
is needed
* Non-regular lexical structure is handled easily
* "Token classification" is unneeded which removes the need for design accommodations such as "
the lexer hack In computer programming, the lexer hack is a common solution to the problems in parsing ANSI C, due to the reference grammar being context-sensitive. In C, classifying a sequence of characters as a variable name or a type name requires contextual i ...
" and language
reserved word
In a computer language, a reserved word (also known as a reserved identifier) is a word that cannot be used as an identifier, such as the name of a variable, function, or label – it is "reserved from use". This is a syntactic definition, and a r ...
s (such as "while" in
C)
* Grammars can be compositional (can be merged without human intervention)
Disadvantages
* Since the lexical scanning and syntactic parsing are combined, the resulting parser tends to be more complicated and thus harder to understand and debug. The same will hold for the associated grammar, if a grammar is used to generate the parser.
* The resulting parser tends to be significantly less efficient than a lexer-parser pipeline with regard to both time and memory.
Implementations
SGLRis a parser for the modular Syntax Definition Formalism
SDF, and is part of the
ASF+SDF
ASF may refer to:
Arts and entertainment
* Alabama Shakespeare Festival, a drama festival
* '' Asimov's Science Fiction'', a U.S.-based English-language science fiction magazine containing SF stories
Science and technology
Biological
* ...
Meta-Environment and the
Stratego/XT program transformation system.
JSGLR a pure Java implementation of SGLR, also based on
SDF.
*
TXL supports character-level parsing.
dparsergenerates ANSI C code for scannerless
GLR parser
A GLR parser (GLR standing for "Generalized LR", where L stands for "left-to-right" and R stands for "rightmost (derivation)") is an extension of an LR parser algorithm to handle non-deterministic and ambiguous grammars. The theoretical foundatio ...
s.
*
Spirit allows for both scannerless and scanner-based parsing.
*
SBP is a scannerless parser for
boolean grammars (a superset of context-free grammars), written in Java.
Lajais a two phase scannerless parser generator with support for mapping the grammar rules into objects, written in Java.
* The
Raku Grammars feature of the general purpose programming language
Raku.
PyParsingis a scannerless parser written in pure Python.
*
META II Has built in token parsers functions.
*
TREE-META Like META II also is scannerless having builtin lexer functions.
*
CWIC Compiler for Writing and Implementing Compilers. Has token rules as part of its language. Rules in CWIC were compiled into boolean functions returning success or failure.
Notes
* This is because parsing at the character level makes the language recognized by the parser a single
context-free language
In formal language theory, a context-free language (CFL) is a language generated by a context-free grammar (CFG).
Context-free languages have many applications in programming languages, in particular, most arithmetic expressions are generated by ...
defined on characters, as opposed to a context-free language of sequences of strings in
regular language
In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science (as opposed to ...
s. Some lexerless parsers handle the entire class of context-free languages, which is closed under composition.
References
Further reading
*
{{Parsers
Parsing algorithms