
In
computer science
Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includin ...
, an abstract syntax tree (AST), or just syntax tree, is a
tree
In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, including only woody plants with secondary growth, plants that are ...
representation of the
abstract syntactic structure of text (often
source code
In computing, source code, or simply code, is any collection of code, with or without comment (computer programming), comments, written using a human-readable programming language, usually as plain text. The source code of a Computer program, p ...
) written in a
formal language
In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules.
The alphabet of a formal language consists of s ...
. Each node of the tree denotes a construct occurring in the text.
The syntax is "abstract" in the sense that it does not represent every detail appearing in the real syntax, but rather just the structural or content-related details. For instance, grouping
parentheses are implicit in the tree structure, so these do not have to be represented as separate nodes. Likewise, a syntactic construct like an if-condition-then statement may be denoted by means of a single node with three branches.
This distinguishes abstract syntax trees from concrete syntax trees, traditionally designated
parse trees. Parse trees are typically built by a
parser during the source code translation and
compiling process. Once built, additional information is added to the AST by means of subsequent processing, e.g.,
contextual analysis.
Abstract syntax trees are also used in
program analysis and
program transformation systems.
Application in compilers
Abstract syntax trees are
data structures widely used in
compilers to represent the structure of program code. An AST is usually the result of the
syntax analysis phase of a compiler. It often serves as an intermediate representation of the program through several stages that the compiler requires, and has a strong impact on the final output of the compiler.
Motivation
An AST has several properties that aid the further steps of the compilation process:
* An AST can be edited and enhanced with information such as properties and annotations for every element it contains. Such editing and annotation is impossible with the source code of a program, since it would imply changing it.
* Compared to the
source code
In computing, source code, or simply code, is any collection of code, with or without comment (computer programming), comments, written using a human-readable programming language, usually as plain text. The source code of a Computer program, p ...
, an AST does not include inessential punctuation and delimiters (braces, semicolons, parentheses, etc.).
* An AST usually contains extra information about the program, due to the consecutive stages of analysis by the compiler. For example, it may store the position of each element in the source code, allowing the compiler to print useful error messages.
ASTs are needed because of the inherent nature of programming languages and their documentation. Languages are often
ambiguous by nature. In order to avoid this ambiguity, programming languages are often specified as a
context-free grammar
In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules are of the form
:A\ \to\ \alpha
with A a ''single'' nonterminal symbol, and \alpha a string of terminals and/or nonterminals (\alpha can be ...
(CFG). However, there are often aspects of programming languages that a CFG can't express, but are part of the language and are documented in its specification. These are details that require a context to determine their validity and behaviour. For example, if a language allows new types to be declared, a CFG cannot predict the names of such types nor the way in which they should be used. Even if a language has a predefined set of types, enforcing proper usage usually requires some context. Another example is
duck typing
Duck typing in computer programming is an application of the duck test—"If it walks like a duck and it quacks like a duck, then it must be a duck"—to determine whether an object can be used for a particular purpose. With nominative t ...
, where the type of an element can change depending on context.
Operator overloading is yet another case where correct usage and final function are context-dependent.
Design
The design of an AST is often closely linked with the design of a compiler and its expected features.
Core requirements include the following:
* Variable types must be preserved, as well as the location of each declaration in source code.
* The order of executable statements must be explicitly represented and well defined.
* Left and right components of binary operations must be stored and correctly identified.
* Identifiers and their assigned values must be stored for assignment statements.
These requirements can be used to design the data structure for the AST.
Some operations will always require two elements, such as the two terms for addition. However, some language constructs require an arbitrarily large number of children, such as argument lists passed to programs from the
command shell. As a result, an AST used to represent code written in such a language has to also be flexible enough to allow for quick addition of an unknown quantity of children.
To support compiler verification it should be possible to unparse an AST into source code form. The source code produced should be sufficiently similar to the original in appearance and identical in execution, upon recompilation.
The AST is used intensively during
semantic analysis, where the compiler checks for correct usage of the elements of the program and the language. The compiler also generates
symbol tables based on the AST during semantic analysis. A complete traversal of the tree allows verification of the correctness of the program.
After verifying correctness, the AST serves as the base for code generation. The AST is often used to generate an intermediate representation (IR), sometimes called an
intermediate language, for the code generation.
Other usages
AST differencing
AST differencing, or for short tree differencing, consists of computing the list of differences between two ASTs. This list of differences is typically called an edit script. The edit script directly refers to the AST of the code. For instance, an edit action may the addition of a new AST node representing a function. The problem of computing good AST edit scripts is hard, with two main challenges: handling move actions, and scaling to fine-grained ASTs with thousands of nodes.
Structured merge
With proper AST differencing, ASTs can be used for merging versions and branches, this is known as
structured merge. The advantage is to avoid spurious conflicts due to lines and formatting.
Clone detection
An AST is a powerful abstraction to perform code
clone detection.
See also
*
Abstract semantic graph (ASG), also called ''term graph''
*
Composite pattern In software engineering, the composite pattern is a partitioning design pattern. The composite pattern describes a group of objects that are treated the same way as a single instance of the same type of object. The intent of a composite is to "comp ...
*
Control-flow graph
In computer science, a control-flow graph (CFG) is a representation, using graph notation, of all paths that might be traversed through a program during its execution. The control-flow graph was discovered by Frances E. Allen, who noted tha ...
*
Directed acyclic graph (DAG)
*
Document Object Model
The Document Object Model (DOM) is a cross-platform and language-independent interface that treats an XML or HTML document as a tree structure wherein each node is an object representing a part of the document. The DOM represents a docum ...
(DOM)
*
Expression tree
*
Extended Backus–Naur form
*
Lisp
A lisp is a speech impairment in which a person misarticulates sibilants (, , , , , , , ). These misarticulations often result in unclear speech.
Types
* A frontal lisp occurs when the tongue is placed anterior to the target. Interdental lispi ...
, a family of languages written in trees, with macros to manipulate code trees
*
Parse tree, also known as ''concrete syntax tree''
*
Semantic resolution tree
A semantic resolution tree is a tree used for the definition of the semantics of a programming language
A programming language is a system of notation for writing computer programs. Most programming languages are text-based formal languages, ...
(SRT)
*
Shunting-yard algorithm
*
Symbol table
*
TreeDL
Tree Description Language (TreeDL) is a computer language for description of strictly-typed tree data structures and operations on them. The main use of TreeDL is in the development of language-oriented tools (compilers, translators, etc.) for the ...
References
Further reading
* (overview of AST implementation in various language families)
*
*
*
External links
AST View an
Eclipse plugin to
visualize
''Visualize'' is a video release by Def Leppard. A compilation of promo videos, interviews, and concert footage. On DVD, it is bundled with '' Video Archive''. It won a 1993 Metal Edge Readers' Choice Award for "Best Home Video."Metal Edge, June ...
a
Java
Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mo ...
abstract syntax tree
*
*
eli project Abstract Syntax Tree
Unparsing
In computing, an unparser is a system that constructs a set of characters or image components from a given parse tree.''Software Science and Engineering'' edited by Ikuo Nakata 1991 page 168
An unparser is in effect the reverse of a traditional ...
*
* (
OMG
OMG may refer to:
* Oh my God (sometimes also Oh my Goodness or Oh my Gosh), a common abbreviation, often used in SMS messages and Internet communication
Acronyms
* OMG is the IATA code for Omega Airport, Omega, Namibia
* Operational manoeuvre ...
standard).
JavaParser The JavaParser library provides you with an Abstract Syntax Tree of your Java code. The AST structure then allows you to work with your Java code in an easy programmatic way.
Spoon A library to analyze, transform, rewrite, and transpile Java source code. It parses source files to build a well-designed AST with powerful analysis and transformation API.
{{DEFAULTSORT:Abstract Syntax Tree
Trees (data structures)
Formal languages