In computer science, the syntax of a computer language is the set of rules that defines the combinations of symbols that are considered to be correctly structured statements or expressions in that language. This applies both to programming languages, where the document represents source code, and to markup languages, where the document represents data.

The syntax of a language defines its surface form. Text-based computer languages are based on sequences of characters, while visual programming languages are based on the spatial layout and connections between symbols (which may be textual or graphical). Documents that are syntactically invalid are said to have a syntax error. When designing the syntax of a language, a designer might start by writing down examples of both legal and illegal strings, before trying to figure out the general rules from these examples.

Syntax therefore refers to the ''form'' of the code, and is contrasted with semantics, the ''meaning''. In processing computer languages, semantic processing generally comes after syntactic processing; however, in some cases, semantic processing is necessary for complete syntactic analysis, and these are done together or concurrently. In a compiler, the syntactic analysis comprises the frontend, while the semantic analysis comprises the backend (and middle end, if this phase is distinguished).

Levels of syntax

Computer language syntax is generally distinguished into three levels:

* Words: the lexical level, determining how characters form tokens;
* Phrases: the grammar level, narrowly speaking, determining how tokens form phrases;
* Context: determining what objects or variable names refer to, whether types are valid, etc.

Distinguishing in this way yields modularity, allowing each level to be described and processed separately and often independently.

First, a lexer turns the linear sequence of characters into a linear sequence of tokens; this is known as "lexical analysis" or "lexing". Second, the parser turns the linear sequence of tokens into a hierarchical syntax tree; this is known as "parsing" narrowly speaking. This ensures that the sequence of tokens conforms to the formal grammar of the programming language. The parsing stage itself can be divided into two parts: the parse tree, or "concrete syntax tree", which is determined by the grammar but is generally far too detailed for practical use, and the abstract syntax tree (AST), which simplifies this into a usable form. Third, the contextual analysis resolves names and checks types. The AST and contextual analysis steps can be considered a form of semantic analysis, as they add meaning and interpretation to the syntax, or alternatively as informal, manual implementations of syntactical rules that would be difficult or awkward to describe or implement formally.

This modularity is sometimes possible, but in many real-world languages an earlier step depends on a later step; for example, the lexer hack in C exists because tokenization depends on context. Even in these cases, syntactical analysis is often seen as approximating this ideal model.

The levels generally correspond to levels in the Chomsky hierarchy. Words are in a regular language, specified in the lexical grammar, which is a Type-3 grammar, generally given as regular expressions. Phrases are in a context-free language (CFL), generally a deterministic context-free language (DCFL), specified in a phrase structure grammar, which is a Type-2 grammar, generally given as production rules in Backus–Naur form (BNF). Phrase grammars are often specified in much more constrained grammars than full context-free grammars, in order to make them easier to parse; while the LR parser can parse any DCFL in linear time, the simple LALR parser and even simpler LL parser are more efficient, but can only parse grammars whose production rules are constrained. In principle, contextual structure can be described by a context-sensitive grammar, and automatically analyzed by means such as attribute grammars, though, in general, this step is done manually, via name resolution rules and type checking, and implemented via a symbol table which stores names and types for each scope.

Tools have been written that automatically generate a lexer from a lexical specification written in regular expressions and a parser from the phrase grammar written in BNF: this allows one to use declarative programming, rather than needing procedural or functional programming. A notable example is the lex-yacc pair. These automatically produce a ''concrete'' syntax tree; the parser writer must then manually write code describing how this is converted to an ''abstract'' syntax tree. Contextual analysis is also generally implemented manually. Despite the existence of these automatic tools, parsing is often implemented manually, for various reasons: perhaps the phrase structure is not context-free, or an alternative implementation improves performance or error-reporting, or allows the grammar to be changed more easily. Parsers are often written in functional languages, such as Haskell, or in scripting languages, such as Python or Perl, or in C or C++.
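The three levels can be illustrated with a minimal hand-written sketch in Python for a toy language of additions such as x + 1. The token names, the functions lex, parse and check_names, and the tuple-based tree are assumptions made for this illustration only; they are not taken from any particular compiler.

```python
import re

# Lexical level: token categories given as regular expressions (illustrative names).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("PLUS", r"\+"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC))


def lex(text):
    """Turn a string of characters into a list of (kind, value) tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"lexical error at position {pos}: {text[pos]!r}")
        if match.lastgroup != "SKIP":          # whitespace separates tokens but is dropped
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse(tokens):
    """Phrase level: expr = operand ('+' operand)*; returns a nested-tuple tree."""
    def operand(i):
        if i < len(tokens) and tokens[i][0] in ("NUMBER", "NAME"):
            return tokens[i], i + 1
        raise SyntaxError(f"parsing error: operand expected at token {i}")

    tree, i = operand(0)
    while i < len(tokens):
        if tokens[i][0] != "PLUS":
            raise SyntaxError(f"parsing error: '+' expected at token {i}")
        right, i = operand(i + 1)
        tree = ("ADD", tree, right)
    return tree


def check_names(tree, symbol_table):
    """Context level: every NAME must be declared in the symbol table."""
    if tree[0] == "ADD":
        check_names(tree[1], symbol_table)
        check_names(tree[2], symbol_table)
    elif tree[0] == "NAME" and tree[1] not in symbol_table:
        raise NameError(f"undeclared name: {tree[1]}")


tree = parse(lex("x + 1"))
check_names(tree, {"x": "int"})   # passes: x is declared in the symbol table
```

Each stage fails independently: "x $ 1" is rejected by the lexer, "x + + 1" by the parser, and "x + y" only by the contextual check when y is missing from the symbol table.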

Examples of errors

As an example, (add 1 1) is a syntactically valid Lisp program (assuming the 'add' function exists, else name resolution fails), adding 1 and 1. However, the following are invalid:

 (_ 1 1)    lexical error: '_' is not valid
 (add 1 1   parsing error: missing closing ')'

The lexer is unable to identify the first error: all it knows is that, after producing the token LEFT_PAREN for '(', the remainder of the program is invalid, since no word rule begins with '_'. The second error is detected at the parsing stage: the parser has identified the "list" production rule due to the '(' token (as the only match), and thus can give an error message; in general it may be ambiguous.

Type errors and undeclared variable errors are sometimes considered to be syntax errors when they are detected at compile-time (which is usually the case when compiling statically typed languages), though it is common to classify these kinds of error as semantic errors instead.<ref>Semantic Errors in Java</ref>

As an example, the Python code 'a' + 1 contains a type error because it adds a string literal to an integer literal. Type errors of this kind can be detected at compile-time: they can be detected during parsing (phrase analysis) if the compiler uses separate rules that allow "integerLiteral + integerLiteral" but not "stringLiteral + integerLiteral", though it is more likely that the compiler will use a parsing rule that allows all expressions of the form "LiteralOrIdentifier + LiteralOrIdentifier", in which case the error will be detected during contextual analysis (when type checking occurs). In some cases this validation is not done by the compiler, and these errors are only detected at runtime.

In a dynamically typed language, where type can only be determined at runtime, many type errors can only be detected at runtime. For example, the Python code a + b is syntactically valid at the phrase level, but the correctness of the types of a and b can only be determined at runtime, as variables do not have types in Python, only values do. Whereas there is disagreement about whether a type error detected by the compiler should be called a syntax error (rather than a static semantic error), type errors which can only be detected at program execution time are always regarded as semantic rather than syntax errors.
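The difference between an error found while parsing and one found only at execution can be observed with Python's standard ast module. The helper classify below is a hypothetical name introduced for this sketch; it parses a single expression first and evaluates it only if parsing succeeds.

```python
import ast


def classify(source):
    """Report at which stage a one-expression Python program fails (illustrative helper)."""
    try:
        tree = ast.parse(source, mode="eval")            # lexing and parsing only
    except SyntaxError:
        return "syntax error"
    try:
        eval(compile(tree, "<example>", "eval"), {})     # execution with no user-defined names
    except TypeError:
        return "type error at runtime"
    except NameError:
        return "name error at runtime"
    return "ok"


print(classify("(1 + 1"))    # unbalanced parenthesis: rejected by the parser
print(classify("'a' + 1"))   # parses, but fails when the addition is executed
print(classify("a + b"))     # parses, but a and b are not defined
```

In this sketch, 'a' + 1 passes the parsing step and is reported as a type error only when evaluated, which matches the description of Python as checking types at runtime.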

Syntax definition

The syntax of textual programming languages is usually defined using a combination of regular expressions (for lexical structure) and Backus–Naur form (a metalanguage for grammatical structure) to inductively specify syntactic categories (''nonterminals'') and ''terminal'' symbols. Syntactic categories are defined by rules called ''productions'', which specify the values that belong to a particular syntactic category. Terminal symbols are the concrete characters or strings of characters (for example keywords such as ''define'', ''if'', ''let'', or ''void'') from which syntactically valid programs are constructed.

Syntax can be divided into context-free syntax and context-sensitive syntax. Context-free syntax consists of rules expressed in the metalanguage of the programming language; these rules are not constrained by the context surrounding or referring to that part of the syntax, whereas context-sensitive rules are.

A language can have different equivalent grammars, such as equivalent regular expressions (at the lexical level), or different phrase rules which generate the same language. Using a broader category of grammars, such as LR grammars, can allow shorter or simpler grammars compared with more restricted categories, such as LL grammars, which may require longer grammars with more rules. Different but equivalent phrase grammars yield different parse trees, though the underlying language (set of valid documents) is the same.
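As a small illustration of equivalent grammars producing different parse trees, the following Python sketch parses comma-separated lists under two assumed grammars: a left-recursive one (list = item | list ',' item) and a right-recursive one (list = item | item ',' list). The function names and the tuple representation of trees are choices made for this example.

```python
def parse_left(source):
    """list = item | list ',' item   (yields a left-nested tree)."""
    items = source.split(",")
    tree = items[0]
    for item in items[1:]:
        tree = (tree, ",", item)
    return tree


def parse_right(source):
    """list = item | item ',' list   (yields a right-nested tree)."""
    head, comma, rest = source.partition(",")
    if not comma:
        return head
    return (head, ",", parse_right(rest))


def leaves(tree):
    """Flatten a parse tree back into the sequence of terminals it covers."""
    if isinstance(tree, tuple):
        return [leaf for child in tree for leaf in leaves(child)]
    return [tree]


left = parse_left("a,b,c")     # (("a", ",", "b"), ",", "c")
right = parse_right("a,b,c")   # ("a", ",", ("b", ",", "c"))
```

The two trees differ in shape, yet flattening either one with leaves gives back the same sequence of terminals, so both grammars accept the same document.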

Example: Lisp S-expressions

Below is a simple grammar, defined using the notation of regular expressions and Extended Backus–Naur form. It describes the syntax of S-expressions, a data syntax of the programming language Lisp, and defines productions for the syntactic categories ''expression'', ''atom'', ''number'', ''symbol'', and ''list'':

 expression = atom | list
 atom       = number | symbol
 number     = [+-]?['0'-'9']+
 symbol     = ['A'-'Z''a'-'z'].*
 list       = '(', expression*, ')'

This grammar specifies the following:

* an ''expression'' is either an ''atom'' or a ''list'';
* an ''atom'' is either a ''number'' or a ''symbol'';
* a ''number'' is an unbroken sequence of one or more decimal digits, optionally preceded by a plus or minus sign;
* a ''symbol'' is a letter followed by zero or more of any characters (excluding whitespace); and
* a ''list'' is a matched pair of parentheses, with zero or more ''expressions'' inside it.

Here the decimal digits, upper- and lower-case characters, and parentheses are terminal symbols.

The following are examples of well-formed token sequences in this grammar: '12345', '()', '(A B C232 (1))'
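A recursive-descent parser for this grammar can be sketched in a few dozen lines of Python. The sketch below is one possible hand-written implementation, not a reference one; the names tokenize, parse_expression and parse are invented for the example, and it makes one simplifying assumption beyond the grammar as written: a symbol ends at whitespace or at a parenthesis, so that '(A B C232 (1))' splits into the expected tokens.

```python
import re

# Lexical rules: parentheses, numbers, and symbols, each preceded by optional whitespace.
# Assumption for this sketch: a symbol stops at whitespace or a parenthesis.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<paren>[()])|(?P<number>[+-]?[0-9]+)|(?P<symbol>[A-Za-z][^\s()]*))"
)


def tokenize(text):
    """Lexical level: split the text into (kind, value) tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"lexical error at position {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


def parse_expression(tokens, i=0):
    """expression = atom | list; returns (tree, index of the next unread token)."""
    if i >= len(tokens):
        raise SyntaxError("parsing error: unexpected end of input")
    kind, value = tokens[i]
    if kind == "number":
        return int(value), i + 1
    if kind == "symbol":
        return value, i + 1
    if value == "(":
        # list = '(', expression*, ')'
        items, i = [], i + 1
        while i < len(tokens) and tokens[i][1] != ")":
            item, i = parse_expression(tokens, i)
            items.append(item)
        if i >= len(tokens):
            raise SyntaxError("parsing error: missing closing ')'")
        return items, i + 1
    raise SyntaxError("parsing error: unexpected ')'")


def parse(text):
    """Parse a complete S-expression, rejecting trailing tokens."""
    tokens = tokenize(text)
    tree, i = parse_expression(tokens)
    if i != len(tokens):
        raise SyntaxError("parsing error: trailing tokens")
    return tree


print(parse("(A B C232 (1))"))
```

Lists are represented as Python lists, numbers as integers and symbols as strings, so the nesting of the result mirrors the nesting of the parentheses in the input.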

Complex grammars

The grammar needed to specify a programming language can be classified by its position in the Chomsky hierarchy. The phrase grammar of most programming languages can be specified using a Type-2 grammar, i.e., they are context-free grammars, though the overall syntax is context-sensitive (due to variable declarations and nested scopes), hence Type-1. However, there are exceptions, and for some languages the phrase grammar is Type-0 (Turing-complete).

In some languages like Perl and Lisp the specification (or implementation) of the language allows constructs that execute during the parsing phase. Furthermore, these languages have constructs that allow the programmer to alter the behavior of the parser. This combination effectively blurs the distinction between parsing and execution, and makes syntax analysis an undecidable problem in these languages, meaning that the parsing phase may not finish. For example, in Perl it is possible to execute code during parsing using a BEGIN statement, and Perl function prototypes may alter the syntactic interpretation, and possibly even the syntactic validity, of the remaining code. Colloquially this is referred to as "only Perl can parse Perl" (because code must be executed during parsing, and can modify the grammar), or more strongly "even Perl cannot parse Perl" (because it is undecidable). Similarly, Lisp macros introduced by the defmacro syntax also execute during parsing, meaning that a Lisp compiler must have an entire Lisp run-time system present. In contrast, C macros are merely textual replacements performed by the preprocessor, and do not require code execution.

Syntax versus semantics

The syntax of a language describes the form of a valid program, but does not provide any information about the meaning of the program or the results of executing that program. The meaning given to a combination of symbols is handled by semantics (either formal or hard-coded in a reference implementation). Valid syntax must be established before semantics can make meaning out of it. Not all syntactically correct programs are semantically correct. Many syntactically correct programs are nonetheless ill-formed, per the language's rules, and may (depending on the language specification and the soundness of the implementation) result in an error on translation or execution. In some cases, such programs may exhibit undefined behavior. Even when a program is well-defined within a language, it may still have a meaning that is not intended by the person who wrote it.

Using natural language as an example, it may not be possible to assign a meaning to a grammatically correct sentence, or the sentence may be false:

* "Colorless green ideas sleep furiously." is grammatically well formed but has no generally accepted meaning.
* "John is a married bachelor." is grammatically well formed but expresses a meaning that cannot be true.

The following C language fragment is syntactically correct, but performs an operation that is not semantically defined (because p is a null pointer, the operations p->real and p->im have no meaning):

 complex *p = NULL;
 double abs_p = sqrt(p->real * p->real + p->im * p->im);

As a simpler example,

 int x;
 printf("%d", x);

is syntactically valid, but not semantically defined, as it uses an uninitialized variable. Even though compilers for some programming languages (e.g., Java and C#) would detect uninitialized variable errors of this kind, they should be regarded as semantic errors rather than syntax errors.<ref>Issue of syntax or semantics?</ref>
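A loosely analogous situation can be reproduced in Python, offered here only as an illustration and not as an equivalent of the C case: a function that reads a local variable before assigning it is accepted by the parser, and the problem surfaces only when the function is called. The names SOURCE, show and run are invented for this sketch.

```python
import ast

SOURCE = """
def show():
    print(x)   # x is read before the assignment below has run
    x = 1
"""

tree = ast.parse(SOURCE)                      # succeeds: the text is syntactically valid
namespace = {}
exec(compile(tree, "<example>", "exec"), namespace)   # defines show() without calling it


def run():
    """Call show() and report whether the semantic problem appears at execution."""
    try:
        namespace["show"]()
    except UnboundLocalError as error:
        return f"semantic error: {error}"
    return "ok"


print(run())
```

Unlike C, Python defines the outcome (an UnboundLocalError is raised), but the classification is the same: the program has valid form and the fault lies in its meaning.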

References

External links

* Various syntactic constructs used in computer programming languages
* Python error “ImportError: No module named” Why? How? Command-Line [Solved2021], archived 2021-10-09 at https://web.archive.org/web/20211009062019/https://usingpython.shodkk.com/python-error-importerror-no-module-named-why-how-command-line-solved2021/