HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includi ...
, an abstract syntax tree (AST), or just syntax tree, is a
tree In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, including only woody plants with secondary growth, plants that are ...
representation of the abstract syntactic structure of text (often
source code In computing, source code, or simply code, is any collection of code, with or without comments, written using a human-readable programming language, usually as plain text. The source code of a program is specially designed to facilitate the w ...
) written in a
formal language In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules. The alphabet of a formal language consists of sy ...
. Each node of the tree denotes a construct occurring in the text. The syntax is "abstract" in the sense that it does not represent every detail appearing in the real syntax, but rather just the structural or content-related details. For instance, grouping
parentheses A bracket is either of two tall fore- or back-facing punctuation marks commonly used to isolate a segment of text or data from its surroundings. Typically deployed in symmetric pairs, an individual bracket may be identified as a 'left' or 'r ...
are implicit in the tree structure, so these do not have to be represented as separate nodes. Likewise, a syntactic construct like an if-condition-then statement may be denoted by means of a single node with three branches. This distinguishes abstract syntax trees from concrete syntax trees, traditionally designated parse trees. Parse trees are typically built by a
parser Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term ''parsing'' comes from Lat ...
during the source code translation and
compiling In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs that ...
process. Once built, additional information is added to the AST by means of subsequent processing, e.g., contextual analysis. Abstract syntax trees are also used in
program analysis In computer science, program analysis is the process of automatically analyzing the behavior of computer programs regarding a property such as correctness, robustness, safety and liveness. Program analysis focuses on two major areas: program op ...
and
program transformation A program transformation is any operation that takes a computer program and generates another program. In many cases the transformed program is required to be semantically equivalent to the original, relative to a particular formal semantics and ...
systems.


Application in compilers

Abstract syntax trees are
data structures In computer science, a data structure is a data organization, management, and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, a ...
widely used in
compilers In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs that ...
to represent the structure of program code. An AST is usually the result of the syntax analysis phase of a compiler. It often serves as an intermediate representation of the program through several stages that the compiler requires, and has a strong impact on the final output of the compiler.


Motivation

An AST has several properties that aid the further steps of the compilation process: * An AST can be edited and enhanced with information such as properties and annotations for every element it contains. Such editing and annotation is impossible with the source code of a program, since it would imply changing it. * Compared to the
source code In computing, source code, or simply code, is any collection of code, with or without comments, written using a human-readable programming language, usually as plain text. The source code of a program is specially designed to facilitate the w ...
, an AST does not include inessential punctuation and delimiters (braces, semicolons, parentheses, etc.). * An AST usually contains extra information about the program, due to the consecutive stages of analysis by the compiler. For example, it may store the position of each element in the source code, allowing the compiler to print useful error messages. ASTs are needed because of the inherent nature of programming languages and their documentation. Languages are often ambiguous by nature. In order to avoid this ambiguity, programming languages are often specified as a context-free grammar (CFG). However, there are often aspects of programming languages that a CFG can't express, but are part of the language and are documented in its specification. These are details that require a context to determine their validity and behaviour. For example, if a language allows new types to be declared, a CFG cannot predict the names of such types nor the way in which they should be used. Even if a language has a predefined set of types, enforcing proper usage usually requires some context. Another example is
duck typing Duck typing in computer programming is an application of the duck test—"If it walks like a duck and it quacks like a duck, then it must be a duck"—to determine whether an object can be used for a particular purpose. With nominative ty ...
, where the type of an element can change depending on context.
Operator overloading In computer programming, operator overloading, sometimes termed ''operator ad hoc polymorphism'', is a specific case of polymorphism, where different operators have different implementations depending on their arguments. Operator overloading i ...
is yet another case where correct usage and final function are context-dependent.


Design

The design of an AST is often closely linked with the design of a compiler and its expected features. Core requirements include the following: * Variable types must be preserved, as well as the location of each declaration in source code. * The order of executable statements must be explicitly represented and well defined. * Left and right components of binary operations must be stored and correctly identified. * Identifiers and their assigned values must be stored for assignment statements. These requirements can be used to design the data structure for the AST. Some operations will always require two elements, such as the two terms for addition. However, some language constructs require an arbitrarily large number of children, such as argument lists passed to programs from the
command shell In computing, a shell is a computer program that exposes an operating system's services to a human user or other programs. In general, operating system shells use either a command-line interface (CLI) or graphical user interface (GUI), depending ...
. As a result, an AST used to represent code written in such a language has to also be flexible enough to allow for quick addition of an unknown quantity of children. To support compiler verification it should be possible to unparse an AST into source code form. The source code produced should be sufficiently similar to the original in appearance and identical in execution, upon recompilation. The AST is used intensively during semantic analysis, where the compiler checks for correct usage of the elements of the program and the language. The compiler also generates
symbol table In computer science, a symbol table is a data structure used by a language translator such as a compiler or interpreter, where each identifier (or symbols), constants, procedures and functions in a program's source code is associated with info ...
s based on the AST during semantic analysis. A complete traversal of the tree allows verification of the correctness of the program. After verifying correctness, the AST serves as the base for code generation. The AST is often used to generate an intermediate representation (IR), sometimes called an
intermediate language An intermediate representation (IR) is the data structure or code used internally by a compiler or virtual machine to represent source code. An IR is designed to be conducive to further processing, such as optimization and translation. A " ...
, for the code generation.


Other usages


AST differencing

AST differencing, or for short tree differencing, consists of computing the list of differences between two ASTs. This list of differences is typically called an edit script. The edit script directly refers to the AST of the code. For instance, an edit action may the addition of a new AST node representing a function. The problem of computing good AST edit scripts is hard, with two main challenges: handling move actions, and scaling to fine-grained ASTs with thousands of nodes.


Structured merge

With proper AST differencing, ASTs can be used for merging versions and branches, this is known as structured merge. The advantage is to avoid spurious conflicts due to lines and formatting.


Clone detection

An AST is a powerful abstraction to perform code
clone detection In computer programming, duplicate code is a sequence of source code that occurs more than once, either within a program or across different programs owned or maintained by the same entity. Duplicate code is generally considered undesirable for a n ...
.


See also

*
Abstract semantic graph In computer science, an abstract semantic graph (ASG) or term graph is a form of abstract syntax in which an expression of a formal or programming language is represented by a graph whose vertices are the expression's subterms. An ASG is at a hig ...
(ASG), also called ''term graph'' *
Composite pattern In software engineering, the composite pattern is a partitioning design pattern. The composite pattern describes a group of objects that are treated the same way as a single instance of the same type of object. The intent of a composite is to "comp ...
*
Control-flow graph In computer science, a control-flow graph (CFG) is a representation, using graph notation, of all paths that might be traversed through a program during its execution. The control-flow graph was discovered by Frances E. Allen, who noted that ...
*
Directed acyclic graph In mathematics, particularly graph theory, and computer science, a directed acyclic graph (DAG) is a directed graph with no directed cycles. That is, it consists of vertices and edges (also called ''arcs''), with each edge directed from one v ...
(DAG) *
Document Object Model The Document Object Model (DOM) is a cross-platform and language-independent interface that treats an XML or HTML document as a tree structure wherein each node is an object representing a part of the document. The DOM represents a document wi ...
(DOM) *
Expression tree A binary expression tree is a specific kind of a binary tree used to represent expressions. Two common types of expressions that a binary expression tree can represent are algebraic and boolean. These trees can represent expressions that contain ...
*
Extended Backus–Naur form In computer science, extended Backus–Naur form (EBNF) is a family of metasyntax notations, any of which can be used to express a context-free grammar. EBNF is used to make a formal description of a formal language such as a computer programm ...
* Lisp, a family of languages written in trees, with macros to manipulate code trees * Parse tree, also known as ''concrete syntax tree'' * Semantic resolution tree (SRT) *
Shunting-yard algorithm In computer science, the shunting yard algorithm is a method for parsing arithmetical or logical expressions, or a combination of both, specified in infix notation. It can produce either a postfix notation string, also known as Reverse Polish n ...
*
Symbol table In computer science, a symbol table is a data structure used by a language translator such as a compiler or interpreter, where each identifier (or symbols), constants, procedures and functions in a program's source code is associated with info ...
* TreeDL


References


Further reading

* (overview of AST implementation in various language families) * * *


External links


AST View
an Eclipse plugin to visualize a
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mos ...
abstract syntax tree * *
eli project
Abstract Syntax Tree Unparsing * * ( OMG standard).
JavaParser
The JavaParser library provides you with an Abstract Syntax Tree of your Java code. The AST structure then allows you to work with your Java code in an easy programmatic way.
Spoon
A library to analyze, transform, rewrite, and transpile Java source code. It parses source files to build a well-designed AST with powerful analysis and transformation API. {{DEFAULTSORT:Abstract Syntax Tree Trees (data structures) Formal languages