In
computer programming
Computer programming is the process of performing a particular computation (or more generally, accomplishing a specific computing result), usually by designing and building an executable computer program. Programming involves tasks such as anal ...
and
computer science
Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includin ...
, "maximal munch" or "longest match" is the principle that when creating some construct, as much of the available input as possible should be consumed.
The earliest known use of this term is by R.G.G. Cattell in his PhD thesis on automatic derivation of
code generators for
compilers
In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs that ...
.
Application
For instance, the
lexical syntax of many
programming languages
A programming language is a system of notation for writing computer program, computer programs. Most programming languages are text-based formal languages, but they may also be visual programming language, graphical. They are a kind of computer ...
requires that
tokens be built from the maximum possible number of characters from the input stream. This is done to resolve the problem of inherent ambiguity in commonly used
regular expression
A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" ...
s such as
-z
(one or more lower-case letters).
The term is also used in
compilers
In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs that ...
in the
instruction selection __NOTOC__
In computer science, ''instruction selection'' is the stage of a compiler backend that transforms its middle-level intermediate representation (IR) into a low-level IR. In a typical compiler, instruction selection precedes both instruction ...
stage to describe a method of "tiling" — determining how a structured tree representing a program in an
intermediate language
An intermediate representation (IR) is the data structure or code used internally by a compiler or virtual machine to represent source code. An IR is designed to be conducive to further processing, such as optimization and translation. A "good" ...
should be converted into linear
machine code
In computer programming, machine code is any low-level programming language, consisting of machine language instructions, which are used to control a computer's central processing unit (CPU). Each instruction causes the CPU to perform a ver ...
. An entire subtree might be converted into just one machine instruction, and the problem is how to split the tree into non-overlapping "tiles", each representing one machine instruction. An effective strategy is simply to make a tile of the largest subtree possible at any given point, which is called "maximal munch".
Drawbacks
In some situations, "maximal munch" leads to undesirable or unintuitive outcomes. For instance, in the
C programming language, the statement
x=y/*z;
(without any whitespace) will probably lead to a syntax error, since the
/*
character sequence initiates a (unintended) comment that is either unterminated or terminated by the end token
*/
of some later, unrelated ''actual'' comment (comments in C do not nest). What was actually meant in the statement was to assign to the variable
x
the result of dividing the value in
y
by the value obtained by dereferencing
pointer
Pointer may refer to:
Places
* Pointer, Kentucky
* Pointers, New Jersey
* Pointers Airport, Wasco County, Oregon, United States
* The Pointers, a pair of rocks off Antarctica
People with the name
* Pointer (surname), a surname (including a list ...
z
; this would be valid code. It can be stated by making use of whitespace, or using
x=y/(*z);
.
Another example, in
C++, uses the "angle bracket" characters
<
and
>
in the syntax for
template specialization
Generic programming is a style of computer programming in which algorithms are written in terms of types ''to-be-specified-later'' that are then ''instantiated'' when needed for specific types provided as parameters. This approach, pioneered by ...
, but two consecutive
>
characters are interpreted as the
right-shift operator
>>
. Prior to C++11, the following code would produce a parse error, because the right-shift operator token is encountered instead of two right-angle-bracket tokens:
std::vector> my_mat_11; //Incorrect in C++03, correct in C++11.
std::vector > my_mat_03; //Correct in either C++03 or C++11.
The
C++11
C++11 is a version of the ISO/ IEC 14882 standard for the C++ programming language. C++11 replaced the prior version of the C++ standard, called C++03, and was later replaced by C++14. The name follows the tradition of naming language versio ...
standard adopted in August 2011
amended the grammar so that a right-shift token is accepted as synonymous with a pair of right angle brackets (as in
Java
Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mo ...
), which complicates the grammar but allows the continued use of the maximal munch principle. An exception to the maximal munch rule had to be added anyway to deal with the sequence
<::
which can appear in templates. In that case, unless the sequence is followed by
:
or
>
the character
<
is interpreted as its own token instead of part of the token
<:
.
Alternatives
Programming languages researchers have also responded by replacing or supplementing the principle of maximal munch with other lexical disambiguation tactics. One approach is to utilize "follow restrictions", which instead of directly taking the longest match will put some restrictions on what characters can ''follow'' a valid match. For example, stipulating that strings matching
-z
cannot be followed by an alphabetic character achieves the same effect as maximal munch with that regular expression. (In the context of regular expressions, the maximal munch principle is referred to as ''greediness'' and contrasted with ''
laziness
Laziness (also known as indolence) is disinclination to activity or exertion despite having the ability to act or to
exert oneself. It is often used as a pejorative; terms for a person seen to be lazy
include " couch potato", "slacker", and " ...
''.) Another approach is to keep the principle of maximal munch but make it subordinate to some other principle, such as context (''e.g.'', the right-shift token in Java would not be matched in the context of a
generics
Generic or generics may refer to:
In business
* Generic term, a common name used for a range or class of similar things not protected by trademark
* Generic brand, a brand for a product that does not have an associated brand or trademark, other ...
expression, where it is syntactically invalid).
[Van Wyk ''et al.'', 63.]
References
Bibliography
*
*
*
*
* {{Cite book
, last=Van Wyk
, first=Eric
, last2=Schwerdfeger
, first2=August
, title=Context-Aware Scanning for Parsing Extensible Languages
, journal=GPCE '07: Proceedings of the 6th International Conference on Generative Programming and Component Engineering
, year=2007
, pages=63–72
, publisher=ACM
, location=New York
, doi=10.1145/1289971.1289983, isbn=9781595938558
Compilers