Lexer Hack
   HOME

TheInfoList



OR:

In
computer programming Computer programming or coding is the composition of sequences of instructions, called computer program, programs, that computers can follow to perform tasks. It involves designing and implementing algorithms, step-by-step specifications of proc ...
, the lexer hack is a solution to
parsing Parsing, syntax analysis, or syntactic analysis is a process of analyzing a String (computer science), string of Symbol (formal), symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal gramm ...
context-sensitive grammars such as C, where classifying a sequence of characters as a variable name or a type name requires contextual information, by feeding contextual information backwards from the parser to the lexer. Lexer hack is frowned upon in modern compilers as it creates tight coupling between otherwise largely independent steps in the compilation process. Instead, identifier-like tokens are tokenized as identifiers and later disambiguated by the parser, allowing cleaner
separation of concerns In computer science, separation of concerns (sometimes abbreviated as SoC) is a design principle for separating a computer program into distinct sections. Each section addresses a separate '' concern'', a set of information that affects the code o ...
.


Problem

The fundamental problem is differentiating types from other identifiers. In the following example, the lexical class of A cannot be determined without further contextual information: A * B; This code could either be multiplication or a declaration, depending on context. In more detail, in a
compiler In computing, a compiler is a computer program that Translator (computing), translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primaril ...
, the lexer performs one of the earliest stages of converting the
source code In computing, source code, or simply code or source, is a plain text computer program written in a programming language. A programmer writes the human readable source code to control the behavior of a computer. Since a computer, at base, only ...
to a program. It scans the text to extract meaningful ''tokens'', such as words, numbers, and strings. The parser analyzes sequences of tokens attempting to match them to syntax rules representing language structures, such as loops and variable declarations. A problem occurs here if a single sequence of tokens can ambiguously match more than one syntax rule. This ambiguity can happen in C if the lexer does not distinguish between variable and
typedef typedef is a reserved keyword in the programming languages C, C++, and Objective-C. It is used to create an additional name (''alias'') for another data type, but does not create a new type, except in the obscure case of a qualified typedef of ...
identifiers. For example, in the C expression: A * B; the lexer may find these tokens: # ?? 'A' # operator '*' # identifier 'B' # punctuation ';' Depending on whether A is a typedef name or not it may be desirable to tokenize A as either an ''identifier'' or a ''type'' so the parser does not have to handle an ambiguous parse. This grammatical ambiguity is known as the "typedef-name: identifier" problem, due to the name of the problematic production rule.


The hack solution

The solution generally consists of feeding information from the semantic
symbol table In computer science, a symbol table is a data structure used by a language translator such as a compiler or interpreter, where each identifier, symbol, constant, procedure and function in a program's source code is associated with information ...
back into the lexer. That is, rather than functioning as a pure one-way
pipeline A pipeline is a system of Pipe (fluid conveyance), pipes for long-distance transportation of a liquid or gas, typically to a market area for consumption. The latest data from 2014 gives a total of slightly less than of pipeline in 120 countries ...
from the lexer to the parser, there is a backchannel from semantic analysis back to the lexer. This mixing of parsing and semantic analysis is generally regarded as inelegant, which is why it is called a "
hack Hack may refer to: Arts, entertainment, and media Games * Hack (Unix video game), ''Hack'' (Unix video game), a 1984 roguelike video game * .hack (video game series), ''.hack'' (video game series), a series of video games by the multimedia fran ...
". Without added context, the lexer cannot distinguish type identifiers from other identifiers because all identifiers have the same format. With the hack in the above example, when the lexer finds the identifier ''A'' it should be able to classify the token as a type identifier. The rules of the language would be clarified by specifying that typecasts require a type identifier and the ambiguity disappears. The problem also exists in C++ and parsers can use the same hack.


Alternative solutions

This problem does not arise (and hence needs no "hack" in order to solve) when using
lexerless parsing In computer science, scannerless parsing (also called lexerless parsing) performs tokenization (breaking a stream of characters into words) and parsing (arranging the words into phrases) in a single step, rather than breaking it up into a pipeline ...
techniques, as these are intrinsically contextual. These are generally seen as less elegant designs, however, because they lack the modularity of having a concurrent lexer and parser in a pipeline. Some parser generators, such as the byacc-derived BtYacc ("Backtracking Yacc"), give the generated parser the ability to try multiple attempts to parse the tokens. In the problem described here, if an attempt fails because of semantic information about the identifier, it can backtrack and attempt other rules. The
Clang Clang () is a compiler front end for the programming languages C, C++, Objective-C, Objective-C++, and the software frameworks OpenMP, OpenCL, RenderScript, CUDA, SYCL, and HIP. It acts as a drop-in replacement for the GNU Compiler ...
parser handles the situation in a completely different way, namely by using a non-reference lexical grammar. Clang's lexer does not attempt to differentiate between type names and variable names: it simply reports the current token as an identifier. The parser then uses Clang's semantic analysis library to determine the nature of the identifier. This allows a simpler and more maintainable architecture than The Lexer Hack. This is also the approach used in most other modern languages, which do not distinguish different classes of identifiers in the lexical grammar, but instead defer them to the parsing or semantic analysis phase, when sufficient information is available.


See also

*
Dangling else The dangling else is a problem in programming of parser generators in which an optional else clause in an if–then(–else) statement can make nested conditional statements ambiguous. Formally, the reference context-free grammar of the language ...
*
Most vexing parse The most vexing parse is a counterintuitive form of syntactic ambiguous grammar, ambiguity resolution in the C++ programming language. In certain situations, the C++ grammar cannot distinguish between the Initialization (programming), creation of ...


References


Citations

* http://www.cs.berkeley.edu/~smcpeak/elkhound/sources/elkhound/index.html * http://cs.nyu.edu/rgrimm/papers/pldi06.pdf * http://cens.ioc.ee/local/man/CompaqCompilers/ladebug/ladebug-manual-details.html
DOI.org


* http://groups.google.com/group/comp.compilers/browse_frm/thread/db7f68e9d8b49002/fa20bf5de9c73472?lnk=st&q=%2B%22the+lexer+hack%22&rnum=1&hl=en#fa20bf5de9c73472 {{DEFAULTSORT:Lexer hack, The C (programming language) C++ Parsing Articles with example C code