Rule-based programs
A rule-based program, performing lexical tokenization, is called a ''tokenizer'', or ''scanner'', although ''scanner'' is also a term for the first stage of a lexer. A lexer forms the first phase of a compiler frontend in processing. Analysis generally occurs in one pass. Lexers and parsers are most often used for compilers, but can be used for other computer language tools, such as prettyprinters or linters. Lexing can be divided into two stages: the ''scanning'', which segments the input string into syntactic units called ''lexemes'' and categorizes these into token classes, and the ''evaluating'', which converts lexemes into processed values. Lexers are generally quite simple, with most of the complexity deferred to the parser or to later phases of analysis.
Disambiguation of "lexeme"
What is called a "lexeme" in rule-based language processing (a string of characters matched as an instance of a token) is not the same as a lexeme in linguistics, where the term denotes an abstract unit of lexical meaning underlying a set of inflected word forms.
Lexical token and lexical tokenization
A ''lexical token'' is a string with an assigned and thus identified meaning. It is structured as a pair consisting of a ''token name'' and an optional ''token value''. The token name is a category of a rule-based lexical unit. Common token names are:
{| class="wikitable"
! Token name !! Explanation !! Sample token values
|-
| identifier || Names assigned by the programmer. || <code>x</code>, <code>color</code>, <code>UP</code>
|-
| keyword || Reserved words of the language. || <code>if</code>, <code>while</code>, <code>return</code>
|-
| separator/punctuator || Punctuation characters and paired delimiters. || <code>}</code>, <code>(</code>, <code>;</code>
|-
| operator || Symbols that operate on arguments and produce results. || <code>+</code>, <code><</code>, <code>=</code>
|-
| literal || Numeric, logical, textual, and reference literals. || <code>true</code>, <code>6.02e23</code>, <code>"music"</code>
|-
| comment || Line or block comments. Usually discarded. || <code>/* Retrieves user data */</code>, <code>// must be negative</code>
|-
| whitespace || Groups of non-printable characters. Usually discarded. ||
|}
For example, lexical analysis of the expression <code>x = a + b * 2;</code> yields the following sequence of tokens: <code>[(identifier, x), (operator, =), (identifier, a), (operator, +), (identifier, b), (operator, *), (literal, 2), (separator, ;)]</code>
A token name is what might be termed a part of speech in linguistics.
''Lexical tokenization'' is the conversion of a raw text into (semantically or syntactically) meaningful lexical tokens, belonging to categories defined by a "lexer" program, such as identifiers, operators, grouping symbols, and data types. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.
For example, in the text string:
: The quick brown fox jumps over the lazy dog
the string is not implicitly segmented on spaces, as a natural language speaker would do. The raw input, the 43 characters, must be explicitly split into the 9 tokens with a given space delimiter (i.e., matching the string <code>" "</code> or regular expression <code>/\s/</code>).
When a token class represents more than one possible lexeme, the lexer often saves enough information to reproduce the original lexeme, so that it can be used in semantic analysis. The parser typically retrieves this information from the lexer and stores it in the abstract syntax tree. This is necessary in order to avoid information loss in the case where numbers may also be valid identifiers.
Tokens are identified based on the specific rules of the lexer. Some methods used to identify tokens include regular expressions, specific sequences of characters termed a flag, specific separating characters called delimiters, and explicit definition by a dictionary. Special characters, including punctuation characters, are commonly used by lexers to identify tokens because of their natural use in written and programming languages. A lexical analyzer generally does nothing with combinations of tokens, a task left for a parser. For example, a typical lexical analyzer recognizes parentheses as tokens but does nothing to ensure that each "(" is matched with a ")".
When a lexer feeds tokens to the parser, the representation used is typically an enumerated type, which is a list of number representations. For example, "Identifier" can be represented with 0, "Assignment operator" with 1, "Addition operator" with 2, etc.
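A minimal sketch of such an enumeration, written here in Python with illustrative token names that are not taken from any particular compiler, might look like this:
<syntaxhighlight lang="python">
from enum import IntEnum

class TokenType(IntEnum):
    """Hypothetical token categories, numbered as the parser would receive them."""
    IDENTIFIER = 0
    ASSIGNMENT_OPERATOR = 1
    ADDITION_OPERATOR = 2
    LITERAL = 3
    SEPARATOR = 4

# A lexer can then hand the parser (type, value) pairs such as:
token = (TokenType.IDENTIFIER, "net_worth_future")
</syntaxhighlight>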
Tokens are often defined by regular expressions, which are understood by a lexical analyzer generator such as lex, or handcoded equivalent finite-state automata. The lexical analyzer (generated automatically by a tool like lex or hand-crafted) reads in a stream of characters, identifies the lexemes in the stream, and categorizes them into tokens. This is termed ''tokenizing''. If the lexer finds an invalid token, it will report an error.
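The following Python sketch imitates what such a lexer does, assuming a small made-up lexical specification (the token names and patterns are illustrative, not those of any real language): each rule is a regular expression, the input is matched rule by rule, and an invalid character is reported as an error.
<syntaxhighlight lang="python">
import re

# Hypothetical lexical specification: one regular expression per token class.
TOKEN_SPECIFICATION = [
    ("NUMBER",     r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR",   r"[+\-*/=]"),
    ("SEPARATOR",  r"[();]"),
    ("WHITESPACE", r"\s+"),
]
MASTER_PATTERN = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPECIFICATION)
)

def tokenize(text):
    """Yield (token name, lexeme) pairs; whitespace is discarded."""
    position = 0
    while position < len(text):
        match = MASTER_PATTERN.match(text, position)
        if match is None:
            raise SyntaxError(f"invalid token at position {position}: {text[position]!r}")
        if match.lastgroup != "WHITESPACE":
            yield match.lastgroup, match.group()
        position = match.end()

print(list(tokenize("x = a + b * 2;")))
# [('IDENTIFIER', 'x'), ('OPERATOR', '='), ('IDENTIFIER', 'a'), ('OPERATOR', '+'),
#  ('IDENTIFIER', 'b'), ('OPERATOR', '*'), ('NUMBER', '2'), ('SEPARATOR', ';')]
</syntaxhighlight>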
Following tokenizing is parsing. From there, the interpreted data may be loaded into data structures for general use, interpretation, or compiling.
Lexical grammar
The specification of a programming language often includes a set of rules, the lexical grammar, which defines the lexical syntax. The lexical syntax is usually a regular language, with the grammar rules consisting of regular expressions; they define the set of possible character sequences (lexemes) of a token. A lexer recognizes strings, and for each kind of string found, the lexical program takes an action, most simply producing a token.
Two important common lexical categories are white space and comments. These are also defined in the grammar and processed by the lexer but may be discarded (not producing any tokens) and considered ''non-significant'', at most separating two tokens (as in <code>if x</code> instead of <code>ifx</code>). There are two important exceptions to this. First, in off-side rule languages that delimit blocks with indenting, initial whitespace is significant, as it determines block structure, and is generally handled at the lexer level; see phrase structure, below. Secondly, in some uses of lexers, comments and whitespace must be preserved – for example, a prettyprinter also needs to output the comments and some debugging tools may provide messages to the programmer showing the original source code. In the 1960s, notably for ALGOL, whitespace and comments were eliminated as part of the line reconstruction phase (the initial phase of the compiler frontend), but this separate phase has been eliminated and these are now handled by the lexer.
Details
Scanner
The first stage, the ''scanner'', is usually based on a finite-state machine (FSM). It has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles (individual instances of these character sequences are termed lexemes). For example, an ''integer'' lexeme may contain any sequence of numerical digit characters. In many cases, the first non-whitespace character can be used to deduce the kind of token that follows and subsequent input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token (this is termed the ''maximal munch'', or ''longest match'', rule). In some languages, the lexeme creation rules are more complex and may involve backtracking over previously read characters. For example, in C, one 'L' character is not enough to distinguish between an identifier that begins with 'L' and a wide-character string literal.
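A single state of such a hand-written scanner can be sketched as follows, assuming an ''integer'' token is defined as one or more decimal digits; the loop keeps consuming characters until one falls outside the acceptable set (maximal munch):
<syntaxhighlight lang="python">
def scan_integer(text, start):
    """Consume digit characters from position start until a non-digit appears.

    Returns the integer lexeme and the position just past it. This is a sketch of
    one scanner state only, not a complete lexer.
    """
    position = start
    while position < len(text) and text[position].isdigit():
        position += 1
    return text[start:position], position

print(scan_integer("42+x", 0))  # ('42', 2): the '+' ends the integer lexeme
</syntaxhighlight>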
Evaluator
A lexeme, however, is only a string of characters known to be of a certain kind (e.g., a string literal, a sequence of letters). In order to construct a token, the lexical analyzer needs a second stage, the ''evaluator'', which goes over the characters of the lexeme to produce a ''value''. The lexeme's type combined with its value is what properly constitutes a token, which can be given to a parser. Some tokens such as parentheses do not really have values, and so the evaluator function for these can return nothing: only the type is needed. Similarly, sometimes evaluators can suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments. The evaluators for identifiers are usually simple (literally representing the identifier), but may include some unstropping. The evaluators for integer literals may pass the string on (deferring evaluation to the semantic analysis phase), or may perform evaluation themselves, which can be involved for different bases or floating point numbers. For a simple quoted string literal, the evaluator needs to remove only the quotes, but the evaluator for an escaped string literal incorporates a lexer, which unescapes the escape sequences.
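A sketch of such an evaluator, assuming token names like those in the earlier sketch and a deliberately simplified treatment of escape sequences, might be:
<syntaxhighlight lang="python">
def evaluate(token_name, lexeme):
    """Produce a token value from a lexeme (simplified sketch)."""
    if token_name == "NUMBER":
        return int(lexeme)            # a real evaluator may also handle bases, floats, ...
    if token_name == "STRING":
        # Strip the quotes and unescape a couple of common sequences.
        return lexeme[1:-1].replace('\\"', '"').replace("\\n", "\n")
    if token_name == "IDENTIFIER":
        return lexeme                 # usually just the identifier itself
    return None                       # separators, operators: the type alone suffices

token = ("NUMBER", evaluate("NUMBER", "2"))   # ('NUMBER', 2)
</syntaxhighlight>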
For example, in the source code of a computer program, the string
: <code>net_worth_future = (assets - liabilities);</code>
might be converted into the following lexical token stream; whitespace is suppressed and special characters have no value:
IDENTIFIER net_worth_future
EQUALS
OPEN_PARENTHESIS
IDENTIFIER assets
MINUS
IDENTIFIER liabilities
CLOSE_PARENTHESIS
SEMICOLON
Lexers may be written by hand. This is practical if the list of tokens is small, but lexers generated by automated tooling as part of a compiler-compiler toolchain are more practical for a larger number of potential tokens. These tools generally accept regular expressions that describe the tokens allowed in the input stream. Each regular expression is associated with a production rule in the lexical grammar of the programming language that evaluates the lexemes matching the regular expression. These tools may generate source code that can be compiled and executed or construct a state transition table for a finite-state machine (which is plugged into template code for compiling and executing).
Regular expressions compactly represent patterns that the characters in lexemes might follow. For example, for an English-based language, an IDENTIFIER token might be any English alphabetic character or an underscore, followed by any number of instances of ASCII alphanumeric characters and/or underscores. This could be represented compactly by the string <code>[a-zA-Z_][a-zA-Z_0-9]*</code>. This means "any character a-z, A-Z or _, followed by 0 or more of a-z, A-Z, _ or 0-9".
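For illustration, this pattern can be checked against candidate lexemes directly (a sketch; <code>re.fullmatch</code> requires the whole string to match):
<syntaxhighlight lang="python">
import re

IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")

print(bool(IDENTIFIER.fullmatch("net_worth_future")))  # True
print(bool(IDENTIFIER.fullmatch("2fast")))             # False: may not start with a digit
</syntaxhighlight>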
Regular expressions and the finite-state machines they generate are not powerful enough to handle recursive patterns, such as "''n'' opening parentheses, followed by a statement, followed by ''n'' closing parentheses." They are unable to keep count, and verify that ''n'' is the same on both sides, unless a finite set of permissible values exists for ''n''. It takes a full parser to recognize such patterns in their full generality. A parser can push parentheses on a stack and then try to pop them off and see if the stack is empty at the end (see example in the ''Structure and Interpretation of Computer Programs'' book).
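A sketch of that parser-level check, operating on an already tokenized sequence, could be:
<syntaxhighlight lang="python">
def parentheses_balanced(tokens):
    """Use a stack to verify nesting, which a finite automaton cannot do for unbounded depth."""
    stack = []
    for token in tokens:
        if token == "(":
            stack.append(token)
        elif token == ")":
            if not stack:
                return False        # a ')' with no matching '('
            stack.pop()
    return not stack                # every '(' must have been closed

print(parentheses_balanced(list("((x))")))  # True
print(parentheses_balanced(list("(()")))    # False
</syntaxhighlight>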
Obstacles
Typically, lexical tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a "word". Often, a tokenizer relies on simple heuristics, for example:
* Punctuation and whitespace may or may not be included in the resulting list of tokens.
* All contiguous strings of alphabetic characters are part of one token; likewise with numbers.
* Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters.
In languages that use inter-word spaces (such as most that use the Latin alphabet, and most programming languages), this approach is fairly straightforward. However, even here there are many edge cases such as contractions, hyphenated words, emoticons, and larger constructs such as URIs (which for some purposes may count as single tokens). A classic example is "New York-based", which a naive tokenizer may break at the space even though the better break is (arguably) at the hyphen.
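A sketch of such a naive heuristic, splitting on whitespace and peeling off punctuation, shows the problem directly (the rule used here is an assumption for illustration):
<syntaxhighlight lang="python">
import re

def naive_tokenize(text):
    """Treat runs of word characters as tokens and any other non-space character as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenize("New York-based"))
# ['New', 'York', '-', 'based']: the multiword name "New York" is split at the space
</syntaxhighlight>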
Tokenization is particularly difficult for languages written in ''scriptio continua'', which exhibit no word boundaries, such as Ancient Greek, Chinese, or Thai. Agglutinative languages, such as Korean, also make tokenization tasks complicated.
Some ways to address the more difficult problems include developing more complex heuristics, querying a table of common special cases, or fitting the tokens to a language model that identifies collocations in a later processing step.
Lexer generator
Lexers are often generated by a ''lexer generator'', analogous to parser generators, and such tools often come together. The most established is lex, paired with the yacc parser generator, or rather some of their many reimplementations, like flex (often paired with GNU Bison). These generators are a form of domain-specific language, taking in a lexical specification – generally regular expressions with some markup – and emitting a lexer.
These tools yield very fast development, which is very important in early development, both to get a working lexer and because a language specification may change often. Further, they often provide advanced features, such as pre- and post-conditions, which are hard to program by hand. However, an automatically generated lexer may lack flexibility, and thus may require some manual modification, or a fully hand-written lexer.
Lexer performance is a concern, and optimizing is worthwhile, more so in stable languages where the lexer runs very often (such as C or HTML). lex/flex-generated lexers are reasonably fast, but improvements of two to three times are possible using more tuned generators. Hand-written lexers are sometimes used, but modern lexer generators produce faster lexers than most hand-coded ones. The lex/flex family of generators uses a table-driven approach which is much less efficient than the directly coded approach. With the latter approach the generator produces an engine that directly jumps to follow-up states via goto statements. Tools like re2c have proven to produce engines that are between two and three times faster than flex produced engines. It is in general difficult to hand-write analyzers that perform better than engines generated by these latter tools.
Phrase structure
Lexical analysis mainly segments the input stream of characters into tokens, simply grouping the characters into pieces and categorizing them. However, the lexing may be significantly more complex; most simply, lexers may omit tokens or insert added tokens. Omitting tokens, notably whitespace and comments, is very common when these are not needed by the compiler. Less commonly, added tokens may be inserted. This is done mainly to group tokens into statements, or statements into blocks, to simplify the parser.
Line continuation
Line continuation is a feature of some languages where a newline is normally a statement terminator. Most often, ending a line with a backslash (immediately followed by a newline) results in the line being ''continued'' – the following line is ''joined'' to the prior line. This is generally done in the lexer: the backslash and newline are discarded, rather than the newline being tokenized. Examples include bash, other shell scripts and Python.
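A minimal sketch of the effect, treating the whole source as a string rather than working character by character as a real lexer would:
<syntaxhighlight lang="python">
def join_continued_lines(source):
    """Discard backslash-newline pairs so a continued statement is lexed as one logical line."""
    return source.replace("\\\n", "")

print(join_continued_lines("total = assets - \\\n        liabilities"))
# total = assets -         liabilities
</syntaxhighlight>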
Semicolon insertion
Many languages use the semicolon as a statement terminator. Most often this is mandatory, but in some languages the semicolon is optional in many contexts. This is mainly done at the lexer level, where the lexer outputs a semicolon into the token stream, despite one not being present in the input character stream, and is termed ''semicolon insertion'' or ''automatic semicolon insertion''. In these cases, semicolons are part of the formal phrase grammar of the language, but may not be found in input text, as they can be inserted by the lexer. Optional semicolons or other terminators or separators are also sometimes handled at the parser level, notably in the case of trailing commas or semicolons.
Semicolon insertion is a feature of BCPL and its distant descendant Go, though it is absent in B or C. Semicolon insertion is present in JavaScript, though the rules are somewhat complex and much-criticized; to avoid bugs, some recommend always using semicolons, while others use initial semicolons, termed defensive semicolons, at the start of potentially ambiguous statements.
Semicolon insertion (in languages with semicolon-terminated statements) and line continuation (in languages with newline-terminated statements) can be seen as complementary: Semicolon insertion adds a token even though newlines generally do ''not'' generate tokens, while line continuation prevents a token from being generated even though newlines generally ''do'' generate tokens.
Off-side rule
The off-side rule (blocks determined by indenting) can be implemented in the lexer, as in Python, where increasing the indenting results in the lexer emitting an INDENT token and decreasing the indenting results in the lexer emitting one or more DEDENT tokens. These tokens correspond to the opening brace <code>{</code> and closing brace <code>}</code> in languages that use braces for blocks, and mean that the phrase grammar does not depend on whether braces or indenting are used. This requires that the lexer hold state, namely a stack of indent levels, so that it can detect changes in indenting, and thus the lexical grammar is not context-free: INDENT and DEDENT depend on the contextual information of the prior indent levels.
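A simplified sketch of this bookkeeping (spaces only, with no handling of tabs, blank lines, or inconsistent dedents) illustrates the stack of indent levels:
<syntaxhighlight lang="python">
def indent_tokens(lines):
    """Emit INDENT/DEDENT tokens from leading spaces, keeping a stack of indent widths."""
    levels = [0]                                    # the lexer's stack of indent levels
    for line in lines:
        width = len(line) - len(line.lstrip(" "))
        if width > levels[-1]:
            levels.append(width)
            yield "INDENT"
        while width < levels[-1]:
            levels.pop()
            yield "DEDENT"
        yield ("LINE", line.strip())

print(list(indent_tokens(["if x:", "    y = 1", "z = 2"])))
# [('LINE', 'if x:'), 'INDENT', ('LINE', 'y = 1'), 'DEDENT', ('LINE', 'z = 2')]
</syntaxhighlight>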
Context-sensitive lexing
Generally lexical grammars are context-free, or almost so, and thus require no looking back or ahead, or backtracking, which allows a simple, clean, and efficient implementation. This also allows simple one-way communication from lexer to parser, without needing any information flowing back to the lexer.
There are exceptions, however. Simple examples include semicolon insertion in Go, which requires looking back one token; concatenation of consecutive string literals in Python, which requires holding one token in a buffer before emitting it (to see if the next token is another string literal); and the off-side rule in Python, which requires maintaining a count of indent level (indeed, a stack of each indent level). These examples all only require lexical context, and while they complicate a lexer somewhat, they are invisible to the parser and later phases.
A more complex example is the lexer hack in C, where the token class of a sequence of characters cannot be determined until the semantic analysis phase, since typedef names and variable names are lexically identical but constitute different token classes. Thus in the hack, the lexer calls the semantic analyzer (say, symbol table) and checks if the sequence requires a typedef name. In this case, information must flow back not from the parser only, but from the semantic analyzer back to the lexer, which complicates design.
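A sketch of the idea, with a hypothetical set of already-declared typedef names standing in for the symbol table:
<syntaxhighlight lang="python">
# Names recorded by the semantic analyzer as typedefs seen so far (hypothetical examples).
typedef_names = {"size_t", "uint32_t"}

def classify_name(lexeme):
    """Return TYPE_NAME for names declared via typedef, IDENTIFIER otherwise."""
    return ("TYPE_NAME" if lexeme in typedef_names else "IDENTIFIER", lexeme)

print(classify_name("uint32_t"))  # ('TYPE_NAME', 'uint32_t')
print(classify_name("counter"))   # ('IDENTIFIER', 'counter')
</syntaxhighlight>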
See also
* Lexical frequency analysis
* Lexicalization
* Lexical semantics
* List of parser generators
{{Natural Language Processing}}
[[Category:Compiler construction]]
[[Category:Programming language implementation]]
[[Category:Parsing]]