
In
computer science, an abstract syntax tree (AST), or just syntax tree, is a
tree representation of the
abstract syntactic structure of text (often
source code) written in a
formal language. Each node of the tree denotes a construct occurring in the text.
The syntax is "abstract" in the sense that it does not represent every detail appearing in the real syntax, but rather just the structural or content-related details. For instance, grouping
parentheses are implicit in the tree structure, so these do not have to be represented as separate nodes. Likewise, a syntactic construct like an if-condition-then statement may be denoted by means of a single node with three branches.
This distinguishes abstract syntax trees from concrete syntax trees, traditionally designated
parse tree
A parse tree or parsing tree or derivation tree or concrete syntax tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. The term ''parse tree'' itself is used primarily in co ...
s. Parse trees are typically built by a
parser during the source code translation and
compiling
In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs that ...
process. Once built, additional information is added to the AST by means of subsequent processing, e.g.,
contextual analysis.
Abstract syntax trees are also used in
program analysis and
program transformation systems.
Application in compilers
Abstract syntax trees are
data structures
In computer science, a data structure is a data organization, management, and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, a ...
widely used in
compilers to represent the structure of program code. An AST is usually the result of the
syntax analysis
In linguistics, syntax () is the study of how words and morphemes combine to form larger units such as phrases and sentences. Central concerns of syntax include word order, grammatical relations, hierarchical sentence structure (constituency), ...
phase of a compiler. It often serves as an intermediate representation of the program through several stages that the compiler requires, and has a strong impact on the final output of the compiler.
Motivation
An AST has several properties that aid the further steps of the compilation process:
* An AST can be edited and enhanced with information such as properties and annotations for every element it contains. Such editing and annotation is impossible with the source code of a program, since it would imply changing it.
* Compared to the
source code, an AST does not include inessential punctuation and delimiters (braces, semicolons, parentheses, etc.).
* An AST usually contains extra information about the program, due to the consecutive stages of analysis by the compiler. For example, it may store the position of each element in the source code, allowing the compiler to print useful error messages.
ASTs are needed because of the inherent nature of programming languages and their documentation. Languages are often
ambiguous
Ambiguity is the type of meaning (linguistics), meaning in which a phrase, statement or resolution is not explicitly defined, making several interpretations wikt:plausible#Adjective, plausible. A common aspect of ambiguity is uncertainty. It ...
by nature. In order to avoid this ambiguity, programming languages are often specified as a
context-free grammar
In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules are of the form
:A\ \to\ \alpha
with A a ''single'' nonterminal symbol, and \alpha a string of terminals and/or nonterminals (\alpha can be empt ...
(CFG). However, there are often aspects of programming languages that a CFG can't express, but are part of the language and are documented in its specification. These are details that require a context to determine their validity and behaviour. For example, if a language allows new types to be declared, a CFG cannot predict the names of such types nor the way in which they should be used. Even if a language has a predefined set of types, enforcing proper usage usually requires some context. Another example is
duck typing, where the type of an element can change depending on context.
Operator overloading is yet another case where correct usage and final function are context-dependent.
Design
The design of an AST is often closely linked with the design of a compiler and its expected features.
Core requirements include the following:
* Variable types must be preserved, as well as the location of each declaration in source code.
* The order of executable statements must be explicitly represented and well defined.
* Left and right components of binary operations must be stored and correctly identified.
* Identifiers and their assigned values must be stored for assignment statements.
These requirements can be used to design the data structure for the AST.
Some operations will always require two elements, such as the two terms for addition. However, some language constructs require an arbitrarily large number of children, such as argument lists passed to programs from the
command shell. As a result, an AST used to represent code written in such a language has to also be flexible enough to allow for quick addition of an unknown quantity of children.
To support compiler verification it should be possible to unparse an AST into source code form. The source code produced should be sufficiently similar to the original in appearance and identical in execution, upon recompilation.
The AST is used intensively during
semantic analysis, where the compiler checks for correct usage of the elements of the program and the language. The compiler also generates
symbol tables based on the AST during semantic analysis. A complete traversal of the tree allows verification of the correctness of the program.
After verifying correctness, the AST serves as the base for code generation. The AST is often used to generate an intermediate representation (IR), sometimes called an
intermediate language, for the code generation.
Other usages
AST differencing
AST differencing, or for short tree differencing, consists of computing the list of differences between two ASTs. This list of differences is typically called an edit script. The edit script directly refers to the AST of the code. For instance, an edit action may the addition of a new AST node representing a function. The problem of computing good AST edit scripts is hard, with two main challenges: handling move actions, and scaling to fine-grained ASTs with thousands of nodes.
Structured merge
With proper AST differencing, ASTs can be used for merging versions and branches, this is known as
structured merge. The advantage is to avoid spurious conflicts due to lines and formatting.
Clone detection
An AST is a powerful abstraction to perform code
clone detection.
See also
*
Abstract semantic graph (ASG), also called ''term graph''
*
Composite pattern
*
Control-flow graph
*
Directed acyclic graph (DAG)
*
Document Object Model (DOM)
*
Expression tree
*
Extended Backus–Naur form
*
Lisp
A lisp is a speech impairment in which a person misarticulates sibilants (, , , , , , , ). These misarticulations often result in unclear speech.
Types
* A frontal lisp occurs when the tongue is placed anterior to the target. Interdental lisping ...
, a family of languages written in trees, with macros to manipulate code trees
*
Parse tree
A parse tree or parsing tree or derivation tree or concrete syntax tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. The term ''parse tree'' itself is used primarily in co ...
, also known as ''concrete syntax tree''
*
Semantic resolution tree
A semantic resolution tree is a tree used for the definition of the semantics of a programming language
A programming language is a system of notation for writing computer programs. Most programming languages are text-based formal languages, ...
(SRT)
*
Shunting-yard algorithm
*
Symbol table
*
TreeDL
Tree Description Language (TreeDL) is a computer language for description of strictly-typed tree data structures and operations on them. The main use of TreeDL is in the development of language-oriented tools (compilers, translators, etc.) for the ...
References
Further reading
* (overview of AST implementation in various language families)
*
*
*
External links
AST View an
Eclipse
An eclipse is an astronomical event that occurs when an astronomical object or spacecraft is temporarily obscured, by passing into the shadow of another body or by having another body pass between it and the viewer. This alignment of three ce ...
plugin to
visualize
''Visualize'' is a video release by Def Leppard. A compilation of promo videos, interviews, and concert footage. On DVD, it is bundled with '' Video Archive''. It won a 1993 Metal Edge Readers' Choice Award for "Best Home Video."Metal Edge, June ...
a
Java abstract syntax tree
*
*
eli project Abstract Syntax Tree
Unparsing
In computing, an unparser is a system that constructs a set of characters or image components from a given parse tree.''Software Science and Engineering'' edited by Ikuo Nakata 1991 page 168
An unparser is in effect the reverse of a traditional ...
*
* (
OMG standard).
JavaParser The JavaParser library provides you with an Abstract Syntax Tree of your Java code. The AST structure then allows you to work with your Java code in an easy programmatic way.
Spoon A library to analyze, transform, rewrite, and transpile Java source code. It parses source files to build a well-designed AST with powerful analysis and transformation API.
{{DEFAULTSORT:Abstract Syntax Tree
Trees (data structures)
Formal languages