Parser generator
   HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, a compiler-compiler or compiler generator is a programming tool that creates a
parser Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term ''parsing'' comes from Lat ...
, interpreter, or
compiler In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs tha ...
from some form of formal description of a
programming language A programming language is a system of notation for writing computer programs. Most programming languages are text-based formal languages, but they may also be graphical. They are a kind of computer language. The description of a programming ...
and machine. The most common type of compiler-compiler is more precisely called a parser generator. It only handles syntactic analysis. The input of a parser generator is a
grammar In linguistics, the grammar of a natural language is its set of structural constraints on speakers' or writers' composition of clauses, phrases, and words. The term can also refer to the study of such constraints, a field that includes doma ...
file, typically written in
Backus–Naur form In computer science, Backus–Naur form () or Backus normal form (BNF) is a metasyntax notation for context-free grammars, often used to describe the syntax of languages used in computing, such as computer programming languages, document format ...
(BNF) or extended Backus–Naur form (EBNF) that defines the
syntax In linguistics, syntax () is the study of how words and morphemes combine to form larger units such as phrases and sentences. Central concerns of syntax include word order, grammatical relations, hierarchical sentence structure ( constituenc ...
of a target programming language. The output is the
source code In computing, source code, or simply code, is any collection of code, with or without comments, written using a human-readable programming language, usually as plain text. The source code of a program is specially designed to facilitate the ...
of a parser for the programming language. The output of the (compiled) parser source code is a parser. It may be either standalone or embedded. This parser takes as an input the source code of the target programming language source and performs some action or outputs an
abstract syntax tree In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of the abstract syntactic structure of text (often source code) written in a formal language. Each node of the tree denotes a construct occurr ...
(AST). Parser generators do not handle the
semantics Semantics (from grc, σημαντικός ''sēmantikós'', "significant") is the study of reference, meaning, or truth. The term can be used to refer to subfields of several distinct disciplines, including philosophy, linguistics and comput ...
of the AST, or the generation of machine code for the target machine."A Syntax Directed Compiler for ALGOL 60" Edgar T. Irons, Communications of the ACM Volume 4 Issue 1, Jan. 1961. A metacompiler is a software development tool used mainly in the construction of
compiler In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs tha ...
s, translators, and
interpreters Interpreting is a translational activity in which one produces a first and final target-language output on the basis of a one-time exposure to an expression in a source language. The most common two modes of interpreting are simultaneous interp ...
for other programming languages. The input to a metacompiler is a
computer program A computer program is a sequence or set of instructions in a programming language for a computer to Execution (computing), execute. Computer programs are one component of software, which also includes software documentation, documentation and oth ...
written in a specialized programming
metalanguage In logic and linguistics, a metalanguage is a language used to describe another language, often called the ''object language''. Expressions in a metalanguage are often distinguished from those in the object language by the use of italics, quota ...
designed mainly for the purpose of constructing compilers. The language of the compiler produced is called the object language. The minimal input producing a compiler is a metaprogram specifying the object language grammar and
semantic Semantics (from grc, σημαντικός ''sēmantikós'', "significant") is the study of reference, meaning, or truth. The term can be used to refer to subfields of several distinct disciplines, including philosophy, linguistics and comput ...
transformations into an object program.


Variants

A typical parser generator associates executable code with each of the rules of the grammar that should be executed when these rules are applied by the parser. These pieces of code are sometimes referred to as semantic action routines since they define the semantics of the syntactic structure that is analyzed by the parser. Depending upon the type of parser that should be generated, these routines may construct a
parse tree A parse tree or parsing tree or derivation tree or concrete syntax tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. The term ''parse tree'' itself is used primarily in co ...
(or
abstract syntax tree In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of the abstract syntactic structure of text (often source code) written in a formal language. Each node of the tree denotes a construct occurr ...
), or generate executable code directly. One of the earliest (1964), surprisingly powerful, versions of compiler-compilers is
META II META II is a domain-specific programming language for writing compilers. It was created in 1963–1964 by Dewey Val Schorre at UCLA. META II uses what Schorre called syntax equations. Its operation is simply explained as: Each ''syntax equation' ...
, which accepted an analytical grammar with output facilities that produce stack machine code, and is able to compile its own source code and other languages. Among the earliest programs of the original
Unix Unix (; trademarked as UNIX) is a family of multitasking, multiuser computer operating systems that derive from the original AT&T Unix, whose development started in 1969 at the Bell Labs research center by Ken Thompson, Dennis Ritchie, ...
versions being built at
Bell Labs Nokia Bell Labs, originally named Bell Telephone Laboratories (1925–1984), then AT&T Bell Laboratories (1984–1996) and Bell Labs Innovations (1996–2007), is an American industrial research and scientific development company owned by mul ...
was the two-part lex and
yacc Yacc (Yet Another Compiler-Compiler) is a computer program for the Unix operating system developed by Stephen C. Johnson. It is a Look Ahead Left-to-Right Rightmost Derivation (LALR) parser generator, generating a LALR parser (the part of a co ...
system, which was normally used to output
C programming language ''The C Programming Language'' (sometimes termed ''K&R'', after its authors' initials) is a computer programming book written by Brian Kernighan and Dennis Ritchie, the latter of whom originally designed and implemented the language, as well a ...
code, but had a flexible output system that could be used for everything from programming languages to text file conversion. Their modern
GNU GNU () is an extensive collection of free software (383 packages as of January 2022), which can be used as an operating system or can be used in parts with other operating systems. The use of the completed GNU tools led to the family of operat ...
versions are
flex Flex or FLEX may refer to: Computing * Flex (language), developed by Alan Kay * FLEX (operating system), a single-tasking operating system for the Motorola 6800 * FlexOS, an operating system developed by Digital Research * FLEX (protocol), a com ...
and
bison Bison are large bovines in the genus ''Bison'' (Greek: "wild ox" (bison)) within the tribe Bovini. Two extant and numerous extinct species are recognised. Of the two surviving species, the American bison, ''B. bison'', found only in North A ...
. Some experimental compiler-compilers take as input a formal description of programming language semantics, typically using
denotational semantics In computer science, denotational semantics (initially known as mathematical semantics or Scott–Strachey semantics) is an approach of formalizing the meanings of programming languages by constructing mathematical objects (called ''denotations' ...
. This approach is often called 'semantics-based compiling', and was pioneered by Peter Mosses' Semantic Implementation System (SIS) in 1978. However, both the generated compiler and the code it produced were inefficient in time and space. No production compilers are currently built in this way, but research continues. The Production Quality Compiler-Compiler ( PQCC) project at
Carnegie Mellon University Carnegie Mellon University (CMU) is a private research university in Pittsburgh, Pennsylvania. One of its predecessors was established in 1900 by Andrew Carnegie as the Carnegie Technical Schools; it became the Carnegie Institute of Technology ...
does not formalize semantics, but does have a semi-formal framework for machine description. Compiler-compilers exist in many flavors, including bottom-up rewrite machine generators (se
JBurg
used to tile syntax trees according to a rewrite grammar for code generation, and attribute grammar parser generators (e.g. ANTLR can be used for simultaneous type checking, constant propagation, and more during the parsing stage).


Metacompilers

As a metacompiler's
metalanguage In logic and linguistics, a metalanguage is a language used to describe another language, often called the ''object language''. Expressions in a metalanguage are often distinguished from those in the object language by the use of italics, quota ...
will usually be a powerful string and symbol processing language, they often have strong applications for general-purpose applications, including generating a wide range of other software engineering and analysis tools. Besides being useful for
domain-specific language A domain-specific language (DSL) is a computer language specialized to a particular application domain. This is in contrast to a general-purpose language (GPL), which is broadly applicable across domains. There are a wide variety of DSLs, ranging ...
development, a metacompiler is a prime example of a domain-specific language, designed for the domain of compiler writing. A metacompiler is a metaprogram ''usually written in its own metalanguage'' or an existing computer programming language. The process of a metacompiler, written in its own metalanguage, compiling itself is equivalent to self-hosting compiler. Most common compilers written today are self-hosting compilers. Self-hosting is a powerful tool, of many metacompilers, allowing the easy extension of their own metaprogramming metalanguage. The feature that separates a metacompiler apart from other compiler compilers is that it takes as input a specialized metaprogramming language that describes all aspects of the compiler's operation. A metaprogram produced by a metacompiler is as complete a program as a
program Program, programme, programmer, or programming may refer to: Business and management * Program management, the process of managing several related projects * Time management * Program, a part of planning Arts and entertainment Audio * Programm ...
written in C++,
BASIC BASIC (Beginners' All-purpose Symbolic Instruction Code) is a family of general-purpose, high-level programming languages designed for ease of use. The original version was created by John G. Kemeny and Thomas E. Kurtz at Dartmouth College ...
or any other general
programming language A programming language is a system of notation for writing computer programs. Most programming languages are text-based formal languages, but they may also be graphical. They are a kind of computer language. The description of a programming ...
. The metaprogramming
metalanguage In logic and linguistics, a metalanguage is a language used to describe another language, often called the ''object language''. Expressions in a metalanguage are often distinguished from those in the object language by the use of italics, quota ...
is a powerful attribute allowing easier development of computer programming languages and other computer tools. Command line processors, text string transforming and analysis are easily coded using metaprogramming metalanguages of metacompilers. A full featured development package includes a linker and a run time support
library A library is a collection of materials, books or media that are accessible for use and not just for display purposes. A library provides physical (hard copies) or digital access (soft copies) materials, and may be a physical location or a vi ...
. Usually, a machine-oriented system programming language, such as C or C++, is needed to write the support library. A library consisting of support functions needed for the compiling process usually completes the full metacompiler package.


The meaning of metacompiler

In computer science, the prefix ''
meta Meta (from the Greek μετά, '' meta'', meaning "after" or "beyond") is a prefix meaning "more comprehensive" or "transcending". In modern nomenclature, ''meta''- can also serve as a prefix meaning self-referential, as a field of study or end ...
'' is commonly used to mean ''about (its own category)''. For example,
metadata Metadata is "data that provides information about other data", but not the content of the data, such as the text of a message or the image itself. There are many distinct types of metadata, including: * Descriptive metadata – the descriptive ...
are data that describe other data. A language that is used to describe other languages is a
metalanguage In logic and linguistics, a metalanguage is a language used to describe another language, often called the ''object language''. Expressions in a metalanguage are often distinguished from those in the object language by the use of italics, quota ...
. Meta may also mean ''on a higher level of abstraction''. A
metalanguage In logic and linguistics, a metalanguage is a language used to describe another language, often called the ''object language''. Expressions in a metalanguage are often distinguished from those in the object language by the use of italics, quota ...
operates on a higher level of abstraction in order to describe properties of a language.
Backus–Naur form In computer science, Backus–Naur form () or Backus normal form (BNF) is a metasyntax notation for context-free grammars, often used to describe the syntax of languages used in computing, such as computer programming languages, document format ...
(BNF) is a formal
metalanguage In logic and linguistics, a metalanguage is a language used to describe another language, often called the ''object language''. Expressions in a metalanguage are often distinguished from those in the object language by the use of italics, quota ...
originally used to define
ALGOL 60 ALGOL 60 (short for ''Algorithmic Language 1960'') is a member of the ALGOL family of computer programming languages. It followed on from ALGOL 58 which had introduced code blocks and the begin and end pairs for delimiting them, representing a ...
. BNF is a weak
metalanguage In logic and linguistics, a metalanguage is a language used to describe another language, often called the ''object language''. Expressions in a metalanguage are often distinguished from those in the object language by the use of italics, quota ...
, for it describes only the
syntax In linguistics, syntax () is the study of how words and morphemes combine to form larger units such as phrases and sentences. Central concerns of syntax include word order, grammatical relations, hierarchical sentence structure ( constituenc ...
and says nothing about the
semantics Semantics (from grc, σημαντικός ''sēmantikós'', "significant") is the study of reference, meaning, or truth. The term can be used to refer to subfields of several distinct disciplines, including philosophy, linguistics and comput ...
or meaning. Metaprogramming is the writing of
computer program A computer program is a sequence or set of instructions in a programming language for a computer to Execution (computing), execute. Computer programs are one component of software, which also includes software documentation, documentation and oth ...
s with the ability to treat
program Program, programme, programmer, or programming may refer to: Business and management * Program management, the process of managing several related projects * Time management * Program, a part of planning Arts and entertainment Audio * Programm ...
s as their data. A metacompiler takes as input a metaprogram written in a specialized metalanguages (a higher level abstraction) specifically designed for the purpose of metaprogramming. The output is an executable object program. An analogy can be drawn: That as a ''C++'' compiler takes as input a ''C++'' programming language program, a ''meta''compiler takes as input a ''meta''programming
metalanguage In logic and linguistics, a metalanguage is a language used to describe another language, often called the ''object language''. Expressions in a metalanguage are often distinguished from those in the object language by the use of italics, quota ...
program.


Forth metacompiler

Many advocates of the language
Forth Forth or FORTH may refer to: Arts and entertainment * ''forth'' magazine, an Internet magazine * ''Forth'' (album), by The Verve, 2008 * ''Forth'', a 2011 album by Proto-Kaw * Radio Forth, a group of independent local radio stations in Scotla ...
call the process of creating a new implementation of Forth a meta-compilation and that it constitutes a metacompiler. The Forth definition of metacompiler is: :"A metacompiler is a compiler which processes its own source code, resulting in an executable version of itself." This Forth use of the term metacompiler is disputed in mainstream computer science. See
Forth (programming language) Forth is a procedural, stack-oriented programming language and interactive environment designed by Charles H. "Chuck" Moore and first used by other programmers in 1970. Although not an acronym, the language's name in its early years was ofte ...
and History of compiler construction. The actual Forth process of compiling itself is a combination of a Forth being a self-hosting
extensible programming Extensible programming is a term used in computer science to describe a style of computer programming that focuses on mechanisms to extend the programming language, compiler and runtime environment. Extensible programming languages, supporting this ...
language and sometimes
cross compilation A cross compiler is a compiler capable of creating executable code for a platform other than the one on which the compiler is running. For example, a compiler that runs on a PC but generates code that runs on an Android smartphone is a cross ...
, long established terminology in computer science. Metacompilers are a general compiler writing system. Besides the Forth metacompiler concept being indistinguishable from self-hosting and extensible language. The actual process acts at a lower level defining a minimum subset of forth ''words'', that can be used to define additional forth words, A full Forth implementation can then be defined from the base set. This sounds like a bootstrap process. The problem is that almost every general purpose language compiler also fits the Forth metacompiler description. : When (self-hosting compiler) X processes its own source code, resulting in an executable version of itself, X is a metacompiler. Just replace X with any common language, C, C++,
Pascal Pascal, Pascal's or PASCAL may refer to: People and fictional characters * Pascal (given name), including a list of people with the name * Pascal (surname), including a list of people and fictional characters with the name ** Blaise Pascal, Frenc ...
,
COBOL COBOL (; an acronym for "common business-oriented language") is a compiled English-like computer programming language designed for business use. It is an imperative, procedural and, since 2002, object-oriented language. COBOL is primarily u ...
, Fortran,
Ada Ada may refer to: Places Africa * Ada Foah, a town in Ghana * Ada (Ghana parliament constituency) * Ada, Osun, a town in Nigeria Asia * Ada, Urmia, a village in West Azerbaijan Province, Iran * Ada, Karaman, a village in Karaman Province, T ...
,
Modula-2 Modula-2 is a structured, procedural programming language developed between 1977 and 1985/8 by Niklaus Wirth at ETH Zurich. It was created as the language for the operating system and application software of the Lilith personal workstation. It ...
, etc. And X would be a metacompiler according to the Forth usage of metacompiler. A metacompiler operates at an abstraction level above the compiler it compiles. It only operates at the same (self-hosting compiler) level when compiling itself. One has to see the problem with this definition of metacompiler. It can be applied to most any language. However, on examining the concept of programming in Forth, adding new words to the dictionary, extending the language in this way is metaprogramming. It is this metaprogramming in Forth that makes it a metacompiler. Programming in Forth is adding new words to the language. Changing the language in this way is metaprogramming. Forth is a metacompiler because Forth is a language specifically designed for metaprogramming. Programming in Forth is extending Forth adding words to the Forth vocabulary creates a new Forth
dialect The term dialect (from Latin , , from the Ancient Greek word , 'discourse', from , 'through' and , 'I speak') can refer to either of two distinctly different types of linguistic phenomena: One usage refers to a variety of a language that is ...
. Forth is a specialized metacompiler for Forth language dialects.


History

The first compiler-compiler to use that name was written by Tony Brooker in 1960 and was used to create compilers for the
Atlas An atlas is a collection of maps; it is typically a bundle of maps of Earth or of a region of Earth. Atlases have traditionally been bound into book form, but today many atlases are in multimedia formats. In addition to presenting geogra ...
computer at the
University of Manchester The University of Manchester is a public university, public research university in Manchester, England. The main campus is south of Manchester city centre, Manchester City Centre on Wilmslow Road, Oxford Road. The university owns and operates majo ...
, including the Atlas Autocode compiler. The early history of metacompilers is closely tied with the history of SIG/PLAN Working group 1 on Syntax Driven Compilers. The group was started primarily through the effort of Howard Metcalfe in the Los Angeles area. In the fall of 1962 Howard Metcalfe designed two compiler-writing interpreters. One used a bottom-to-top analysis technique based on a method described by Ledley and Wilson. The other used a top-to-bottom approach based on a work by Glennie to generate random English sentences from a context-free grammar. At the same time, Val Schorre described two "meta machines", one generative and one analytic. The generative machine was implemented and produced random algebraic expressions. Meta I the first metacompiler was implemented by Schorre on an IBM 1401 at UCLA in January 1963. His original interpreters and metamachines were written directly in a pseudo-machine language.
META II META II is a domain-specific programming language for writing compilers. It was created in 1963–1964 by Dewey Val Schorre at UCLA. META II uses what Schorre called syntax equations. Its operation is simply explained as: Each ''syntax equation' ...
, however, was written in a higher-level metalanguage able to describe its own compilation into the pseudo-machine language. Lee Schmidt at Bolt, Beranek, and Newman wrote a metacompiler in March 1963 that utilized a CRT display on the time-sharing PDP-l. This compiler produced actual machine code rather than interpretive code and was partially bootstrapped from Meta I. Schorre bootstrapped Meta II from Meta I during the Spring of 1963. The paper on the refined metacompiler system presented at the 1964 Philadelphia ACM conference is the first paper on a metacompiler available as a general reference. The syntax and implementation technique of Schorre's system laid the foundation for most of the systems that followed. The system was implemented on a small 1401, and was used to implement a small
ALGOL ALGOL (; short for "Algorithmic Language") is a family of imperative computer programming languages originally developed in 1958. ALGOL heavily influenced many other languages and was the standard method for algorithm description used by the ...
-like language. Many similar systems immediately followed. Roger Rutman of AC Delco developed and implemented LOGIK, a language for logical design simulation, on the IBM 7090 in January 1964. This compiler used an algorithm that produced efficient code for Boolean expressions. Another paper in the 1964 ACM proceedings describe
Meta III
developed b
Schneider
and Johnson at UCLA for the IBM 7090. Meta III represents an attempt to produce efficient machine code, for a large class of languages. Meta III was implemented completely in assembly language. Two compilers were written in Meta III, CODOL, a compiler-writing demonstration compiler, and PUREGOL, a dialect of ALGOL 60. (It was pure gall to call it ALGOL). Late in 1964, Lee Schmidt bootstrapped the metacompiler EQGEN, from the PDP-l to the Beckman 420. EQGEN was a logic equation generating language. In 1964, System Development Corporation began a major effort in the development of metacompilers. This effort includes powerful metacompilers, Bookl, and Book2 written in
Lisp A lisp is a speech impairment in which a person misarticulates sibilants (, , , , , , , ). These misarticulations often result in unclear speech. Types * A frontal lisp occurs when the tongue is placed anterior to the target. Interdental lispin ...
which have extensive tree-searching and backup ability. An outgrowth of one of the Q-32 systems at SDC is Meta 5. The Meta 5 system incorporates backup of the input stream and enough other facilities to parse any context-sensitive language. This system was successfully released to a wide number of users and had many string-manipulation applications other than compiling. It has many elaborate push-down stacks, attribute setting and testing facilities, and output mechanisms. That Meta 5 successfully translates
JOVIAL JOVIAL is a high-level programming language based on ALGOL 58, specialized for developing embedded systems (specialized computer systems designed to perform one or a few dedicated functions, usually embedded as part of a larger, more complete dev ...
programs to
PL/I PL/I (Programming Language One, pronounced and sometimes written PL/1) is a procedural, imperative computer programming language developed and published by IBM. It is designed for scientific, engineering, business and system programming. I ...
programs demonstrates its power and flexibility. Robert McClure at
Texas Instruments Texas Instruments Incorporated (TI) is an American technology company headquartered in Dallas, Texas, that designs and manufactures semiconductors and various integrated circuits, which it sells to electronics designers and manufacturers globa ...
invented a compiler-compiler called TMG (presented in 1965). TMG was used to create early compilers for programming languages like B,
PL/I PL/I (Programming Language One, pronounced and sometimes written PL/1) is a procedural, imperative computer programming language developed and published by IBM. It is designed for scientific, engineering, business and system programming. I ...
and
ALTRAN ALTRAN (ALgebraic TRANslator) is a programming language for the formal manipulation of rational functions of several variables with integer coefficients. It was developed at Bell Labs in 1960s. ALTRAN is a FORTRAN version of ALPAK rational algebr ...
. Together with metacompiler of Val Schorre, it was an early inspiration for the last chapter of
Donald Knuth Donald Ervin Knuth ( ; born January 10, 1938) is an American computer scientist, mathematician, and professor emeritus at Stanford University. He is the 1974 recipient of the ACM Turing Award, informally considered the Nobel Prize of computer sc ...
's ''
The Art of Computer Programming ''The Art of Computer Programming'' (''TAOCP'') is a comprehensive monograph written by the computer scientist Donald Knuth presenting programming algorithms and their analysis. Volumes 1–5 are intended to represent the central core of com ...
''. The LOT system was developed during 1966 at Stanford Research Institute and was modeled very closely after Meta II. It had new special-purpose constructs allowing it to generate a compiler which could in turn, compile a subset of PL/I. This system had extensive statistic-gathering facilities and was used to study the characteristics of top-down analysis. SIMPLE is a specialized translator system designed to aid the writing of pre-processors for PL/I, SIMPLE, written in PL/I, is composed of three components: An executive, a syntax analyzer and a semantic constructor. The TREE-META compiler was developed at Stanford Research Institute in Menlo Park, California. April 1968. The early metacompiler history is well documented in th
TREE META manual.
TREE META paralleled some of the SDC developments. Unlike earlier metacompilers it separated the semantics processing from the syntax processing. The syntax rules contained
tree In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, including only woody plants with secondary growth, plants that are ...
building operations that combined recognized language elements with tree nodes. The tree structure representation of the input was then processed by a simple form of unparse rules. The unparse rules used node recognition and attribute testing that when matched resulted in the associated action being performed. In addition like tree element could also be tested in an unparse rule. Unparse rules were also a recursive language being able to call unparse rules passing elements of thee tree before the action of the unparse rule was performed. The concept of the metamachine originally put forth by Glennie is so simple that three hardware versions have been designed and one actually implemented. The latter at Washington University in St. Louis. This machine was built from macro-modular components and has for instructions the codes described by Schorre. CWIC (Compiler for Writing and Implementing Compilers) is the last known Schorre metacompiler. It was developed at Systems Development Corporation by Erwin Book, Dewey Val Schorre and Steven J. Sherman With the full power of (lisp 2) a list processing language optimizing algorithms could operate on syntax generated lists and trees before code generation. CWIC also had a symbol table built into the language. With the resurgence of domain-specific languages and the need for parser generators which are easy to use, easy to understand, and easy to maintain, metacompilers are becoming a valuable tool for advanced software engineering projects. Other examples of parser generators in the yacc vein are ANTLR, Coco/R, CUP,
GNU Bison GNU Bison, commonly known as Bison, is a parser generator that is part of the GNU Project. Bison reads a specification in the BNF notation (a context-free language), warns about any parsing ambiguities, and generates a parser that reads sequen ...
, Eli, FSL, SableCC, SID (Syntax Improving Device),J. M. Foster,
A syntax improving program
." ''The Computer Journal'' 11:1:31-34, 1968
and
JavaCC JavaCC (Java Compiler Compiler) is an open-source software, open-source parser generator and Lexical analysis, lexical analyzer generator written in the Java (programming language), Java programming language. JavaCC is similar to yacc in that it ...
. While useful, pure parser generators only address the parsing part of the problem of building a compiler. Tools with broader scope, such as PQCC, Coco/R and
DMS Software Reengineering Toolkit The DMS Software Reengineering Toolkit is a proprietary set of program transformation tools available for automating custom source program analysis, modification, translation or generation of software systems for arbitrary mixtures of source langu ...
provide considerable support for more difficult post-parsing activities such as semantic analysis, code optimization and generation.


Schorre metalanguages

The earliest Schorre metacompilers, META I and META II, were developed by D. Val Schorre at UCLA. Other Schorre based metacompilers followed. Each adding improvements to language analysis and/or code generation. In programming it is common to use the programming language name to refer to both the compiler and the programming language, the context distinguishing the meaning. A C++ program is compiled using a C++ compiler. That also applies in the following. For example, META II is both the compiler and the language. The metalanguages in the Schorre line of metacompilers are functional programming languages that use top down grammar analyzing syntax equations having embedded output transformation constructs. A syntax equation:
 = ;
is a compiled ''test'' function returning ''success'' or ''failure''. is the function name. is a form of logical expression consisting of tests that may be grouped, have alternates, and output productions. A ''test'' is like a ''bool'' in other languages, ''success'' being ''true'' and ''failure'' being ''false''. Defining a programming language analytically top down is natural. For example, a program could be defined as:
 program = $declaration;
Defining a program as a sequence of zero or more declaration(s). In the Schorre META ''X'' languages there is a driving rule. The program rule above is an example of a driving rule. The program rule is a ''test'' function that calls declaration, a ''test'' rule, that returns ''success'' or ''failure''. The $ loop operator repeatedly calling declaration until ''failure'' is returned. The $ operator is always successful, even when there are zero declaration. Above program would always return success. (In CWIC a long fail can bypass declaration. A long-fail is part of the backtracking system of CWIC) The character sets of these early compilers were limited. The character / was used for the alternant (or) operator. "A or B" is written as A / B. Parentheses ( ) are used for grouping. A (B / C) Describes a construct of A followed by B or C. As a boolean expression it would be A ''and'' (B ''or'' C) A sequence X Y has an implied X and Y meaning. ( ) are grouping and / the or operator. The order of evaluation is always left to right as an input character sequence is being specified by the ordering of the tests. Special operator words whose first character is a "." are used for clarity. .EMPTY is used as the last alternate when no previous alternant need be present. X (A / B / .EMPTY) Indicates that X is optionally followed by A or B. This is a specific characteristic of these metalanguages being programming languages. Backtracking is avoided by the above. Other compiler constructor systems may have declared the three possible sequences and left it up to the parser to figure it out. The characteristics of the metaprogramming metalanguages above are common to all Schorre metacompilers and those derived from them.


META I

META I was a hand compiled metacompiler used to compile META II. Little else is known of META I except that the initial compilation of META II produced nearly identical code to that of the hand coded META I compiler.


META II

Each rule consists optionally of tests, operators, and output productions. A rule attempts to match some part of the input program source character stream returning success or failure. On success the input is advanced over matched characters. On failure the input is not advanced. Output productions produced a form of assembly code directly from a syntax rule.


TREE-META

TREE-META introduced tree building operators :<''node_name''> and ''<''number''>'' moving the output production transforms to unparsed rules. The tree building operators were used in the grammar rules directly transforming the input into an
abstract syntax tree In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of the abstract syntactic structure of text (often source code) written in a formal language. Each node of the tree denotes a construct occurr ...
. Unparse rules are also test functions that matched tree patterns. Unparse rules are called from a grammar rule when an
abstract syntax tree In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of the abstract syntactic structure of text (often source code) written in a formal language. Each node of the tree denotes a construct occurr ...
is to be transformed into output code. The building of an
abstract syntax tree In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of the abstract syntactic structure of text (often source code) written in a formal language. Each node of the tree denotes a construct occurr ...
and unparse rules allowed local optimizations to be performed by analyzing the parse tree. Moving of output productions to the unparse rules made a clear separation of grammar analysis and code production. This made the programming easier to read and understand.


CWIC

In 1968–1970, Erwin Book, Dewey Val Schorre, and Steven J. Sherman developed CWIC. (Compiler for Writing and Implementing Compilers) at
System Development Corporation System Development Corporation (SDC) was a computer software company based in Santa Monica, California. Founded in 1955, it is considered the first company of its kind. History SDC began as the systems engineering group for the SAGE air-defens ...
br>Charles Babbage Institute Center for the History of Information Technology (Box 12, folder 21)
CWIC is a compiler development system composed of three special-purpose, domain specific, languages, each intended to permit the description of certain aspects of translation in a straight forward manner. The syntax language is used to describe the recognition of source text and the construction from it to an intermediate
tree In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, including only woody plants with secondary growth, plants that are ...
structure. The generator language is used to describe the transformation of the
tree In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, including only woody plants with secondary growth, plants that are ...
into appropriate object language. The syntax language follows Dewey Val Schorre's previous line of metacompilers. It most resembles TREE-META having
tree In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, including only woody plants with secondary growth, plants that are ...
building operators in the syntax language. The unparse rules of TREE-META are extended to work with the object based generator language based on LISP 2. CWIC includes three languages: * Syntax: Transforms the source program input, into list structures using grammar transformation formula. A parsed expression structure is passed to a generator by placement of a generator call in a rule. A
tree In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, including only woody plants with secondary growth, plants that are ...
is represented by a list whose first element is a node object. The language has operators, < and >, specifically for making lists. The colon : operator is used to create node objects. :ADD creates an ADD node. The exclamation ! operator combines a number of parsed entries with a node to make a
tree In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, including only woody plants with secondary growth, plants that are ...
. Trees created by syntax rules are passed to generator functions, returning success or failure. The syntax language is very close to TREE-META. Both use a colon to create a node. CWIC's
tree In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, including only woody plants with secondary growth, plants that are ...
building exclamation ! functions the same as TREE-META's number> * Generator: a named series of transforming rules, each consisting of an unparse, pattern matching, rule. and an output production written in a LISP 2 like language. the translation was to IBM 360 binary machine code. Other facilities of the generator language generalized output. * MOL-360: an independent mid level implementation language for the IBM System/360 family of computers developed in 1968 and used for writing the underlying support library.


Generators language

Generators Language had semantics similar to
Lisp A lisp is a speech impairment in which a person misarticulates sibilants (, , , , , , , ). These misarticulations often result in unclear speech. Types * A frontal lisp occurs when the tongue is placed anterior to the target. Interdental lispin ...
. The parse
tree In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, including only woody plants with secondary growth, plants that are ...
was thought of as a recursive list. The general form of a Generator Language function is:
  function-name(first-unparse_rule) => first-production_code_generator
           (second-unparse_rule) => second-production_code_generator
           (third-unparse_rule) => third-production_code_generator
               ...
The code to process a given
tree In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, including only woody plants with secondary growth, plants that are ...
included the features of a general purpose programming language, plus a form: <stuff>, which would emit (stuff) onto the output file. A generator call may be used in the unparse_rule. The generator is passed the element of unparse_rule pattern in which it is placed and its return values are listed in (). For example:
  expr_gen(ADD xpr_gen(x),expr_gen(y) =>
                                <AR + (x*16)+y;>
                                releasereg(y);
                                return x;
               (SUB xpr_gen(x),expr_gen(y)=>
                                <SR + (x*16)+y;>
                                releasereg(y);
                                return x;
               (MUL xpr_gen(x),expr_gen(y)=>
                              .
                              .
                              .
               (x)=>     r1 = getreg();
                            load(r1, x);
                            return r1;
...
That is, if the parse
tree In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, including only woody plants with secondary growth, plants that are ...
looks like (ADD something1>,, expr_gen(x) would be called with and return x. A variable in the unparse rule is a local variable that can be used in the production_code_generator. expr_gen(y) is called with and returns y. Here is a generator call in an unparse rule is passed the element in the position it occupies. Hopefully in the above x and y will be registers on return. The last transforms is intended to load an atomic into a register and return the register. The first production would be used to generate the 360 "AR" (Add Register) instruction with the appropriate values in general registers. The above example is only a part of a generator. Every generator expression evaluates to a value that con then be further processed. The last transform could just as well have been written as:
               (x)=>     return load(getreg(), x);
In this case load returns its first parameter, the register returned by getreg(). the functions load and getreg are other CWIC generators.


CWIC addressed

domain-specific language A domain-specific language (DSL) is a computer language specialized to a particular application domain. This is in contrast to a general-purpose language (GPL), which is broadly applicable across domains. There are a wide variety of DSLs, ranging ...
s before the term
domain-specific language A domain-specific language (DSL) is a computer language specialized to a particular application domain. This is in contrast to a general-purpose language (GPL), which is broadly applicable across domains. There are a wide variety of DSLs, ranging ...
existed

From the authors of CWIC: "A metacompiler assists the task of compiler-building by automating its non creative aspects, those aspects that are the same regardless of the language which the produced compiler is to translate. This makes possible the design of languages which are appropriate to the specification of a particular problem. It reduces the cost of producing processors for such languages to a point where it becomes economically feasible to begin the solution of a problem with language design."


Examples

* ANTLR
BNFC
*
GNU Bison GNU Bison, commonly known as Bison, is a parser generator that is part of the GNU Project. Bison reads a specification in the BNF notation (a context-free language), warns about any parsing ambiguities, and generates a parser that reads sequen ...
* Coco/R, Coco-2 *
DMS Software Reengineering Toolkit The DMS Software Reengineering Toolkit is a proprietary set of program transformation tools available for automating custom source program analysis, modification, translation or generation of software systems for arbitrary mixtures of source langu ...
, a program transformation system with parser generators * Epsilon Grammar Studio *
Lemon The lemon (''Citrus limon'') is a species of small evergreen trees in the flowering plant family Rutaceae, native to Asia, primarily Northeast India (Assam), Northern Myanmar or China. The tree's ellipsoidal yellow fruit is used for culin ...
parser generator * LRStar: LR(*) parser generator *
META II META II is a domain-specific programming language for writing compilers. It was created in 1963–1964 by Dewey Val Schorre at UCLA. META II uses what Schorre called syntax equations. Its operation is simply explained as: Each ''syntax equation' ...
* parboiled, a Java library for building parsers. * Packrat parser * PackCC, a packrat parser with
left recursion In the formal language theory of computer science, left recursion is a special case of recursion where a string is recognized as part of a language by the fact that it decomposes into a string from that same language (on the left) and a suffix (on ...
support. * PQCC, a compiler-compiler that is more than a parser generator. * Syntax Improving Device (SID) *
SYNTAX In linguistics, syntax () is the study of how words and morphemes combine to form larger units such as phrases and sentences. Central concerns of syntax include word order, grammatical relations, hierarchical sentence structure ( constituenc ...
, an integrated toolset for compiler construction. * TREE-META *
Yacc Yacc (Yet Another Compiler-Compiler) is a computer program for the Unix operating system developed by Stephen C. Johnson. It is a Look Ahead Left-to-Right Rightmost Derivation (LALR) parser generator, generating a LALR parser (the part of a co ...
*
Xtext Xtext is an open-source software framework for developing programming languages and domain-specific languages (DSLs). Unlike standard parser generators, Xtext generates not only a parser, but also a class model for the abstract syntax tree, as ...
* XPL *
JavaCC JavaCC (Java Compiler Compiler) is an open-source software, open-source parser generator and Lexical analysis, lexical analyzer generator written in the Java (programming language), Java programming language. JavaCC is similar to yacc in that it ...


See also

*
Parsing expression grammar In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 200 ...
* LL parser *
LR parser In computer science, LR parsers are a type of bottom-up parser that analyse deterministic context-free languages in linear time. There are several variants of LR parsers: SLR parsers, LALR parsers, Canonical LR(1) parsers, Minimal LR(1) parse ...
*
Simple LR parser In computer science, a Simple LR or SLR parser is a type of LR parser with small parse tables and a relatively simple parser generator algorithm. As with other types of LR(1) parser, an SLR parser is quite efficient at finding the single correct ...
*
LALR parser In computer science, an LALR parser or Look-Ahead LR parser is a simplified version of a canonical LR parser, to parse a text according to a set of production rules specified by a formal grammar for a computer language. ("LR" means left-to-rig ...
*
GLR parser A GLR parser (GLR standing for "Generalized LR", where L stands for "left-to-right" and R stands for "rightmost (derivation)") is an extension of an LR parser algorithm to handle non-deterministic and ambiguous grammars. The theoretical foundat ...
* Domain analysis *
Domain-specific language A domain-specific language (DSL) is a computer language specialized to a particular application domain. This is in contrast to a general-purpose language (GPL), which is broadly applicable across domains. There are a wide variety of DSLs, ranging ...
* History of compiler construction ** History of compiler construction#Self-hosting compilers *
Metacompilation Metacompilation is a computation which involves metasystem transitions (MST) from a computing machine ''M'' to a metamachine ''M' '' which controls, analyzes and imitates the work of ''M''. Semantics-based program transformation, such as partial e ...
*
Program transformation A program transformation is any operation that takes a computer program and generates another program. In many cases the transformed program is required to be semantically equivalent to the original, relative to a particular formal semantics and ...


References and notes


Further reading

* * * * *


External links


Computer50.org
Brooker Autocodes
Catalog.compilertools.net
The Catalog of Compiler Construction Tools
Labraj.uni-mb.si
Lisa
Skenz.it
Jflex and Cup resources {{Parsers Parsing Compiler construction Metaprogramming Pattern matching programming languages Program transformation tools Extensible syntax programming languages Domain-specific programming languages Program analysis Software design Compiler theory