
In
computer programming
Computer programming or coding is the composition of sequences of instructions, called computer program, programs, that computers can follow to perform tasks. It involves designing and implementing algorithms, step-by-step specifications of proc ...
, an S-expression (or symbolic expression, abbreviated as sexpr or sexp) is an expression in a like-named notation for nested
list
A list is a Set (mathematics), set of discrete items of information collected and set forth in some format for utility, entertainment, or other purposes. A list may be memorialized in any number of ways, including existing only in the mind of t ...
(
tree
In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, e.g., including only woody plants with secondary growth, only ...
-structured) data. S-expressions were invented for, and popularized by, the programming language
Lisp
Lisp (historically LISP, an abbreviation of "list processing") is a family of programming languages with a long history and a distinctive, fully parenthesized Polish notation#Explanation, prefix notation.
Originally specified in the late 1950s, ...
, which uses them for
source code
In computing, source code, or simply code or source, is a plain text computer program written in a programming language. A programmer writes the human readable source code to control the behavior of a computer.
Since a computer, at base, only ...
as well as data.
Characteristics
In the usual parenthesized
syntax
In linguistics, syntax ( ) is the study of how words and morphemes combine to form larger units such as phrases and sentences. Central concerns of syntax include word order, grammatical relations, hierarchical sentence structure (constituenc ...
of Lisp, an S-expression is classically defined
[John McCarthy (1960/2006)]
Recursive functions of symbolic expressions
. Originally published in Communications of the ACM
''Communications of the ACM'' (''CACM'') is the monthly journal of the Association for Computing Machinery (ACM).
History
It was established in 1958, with Saul Rosen as its first managing editor. It is sent to all ACM members.
Articles are i ...
. as
# an atom of the form
''x''
, or
# an
expression of the form
(''x'' . ''y'')
where ''x'' and ''y'' are S-expressions.
This definition reflects LISP's representation of a list as a series of "cells", each one an
ordered pair
In mathematics, an ordered pair, denoted (''a'', ''b''), is a pair of objects in which their order is significant. The ordered pair (''a'', ''b'') is different from the ordered pair (''b'', ''a''), unless ''a'' = ''b''. In contrast, the '' unord ...
. In plain lists, ''y'' points to the next cell (if any), thus forming a
list
A list is a Set (mathematics), set of discrete items of information collected and set forth in some format for utility, entertainment, or other purposes. A list may be memorialized in any number of ways, including existing only in the mind of t ...
. The
recursive clause of the definition means that both this representation and the S-expression notation can represent any
binary tree
In computer science, a binary tree is a tree data structure in which each node has at most two children, referred to as the ''left child'' and the ''right child''. That is, it is a ''k''-ary tree with . A recursive definition using set theor ...
. However, the representation can in principle allow
circular references, in which case the structure is not a tree at all, but a
cyclic graph, and cannot be represented in classical S-expression notation
unless a convention for cross-reference is provided (analogous to SQL
foreign key
A foreign key is a set of attributes in a table that refers to the primary key of another table, linking these two tables. In the context of relational databases, a foreign key is subject to an inclusion dependency constraint that the tuples ...
s,
SGML
The Standard Generalized Markup Language (SGML; International Organization for Standardization, ISO 8879:1986) is a standard for defining generalized markup languages for documents. ISO 8879 Annex A.1 states that generalized markup is "based on t ...
/
XML
Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing data. It defines a set of rules for encoding electronic document, documents in a format that is both human-readable and Machine-r ...
IDREFs, etc.). Modern Lisp dialects such as
Common Lisp
Common Lisp (CL) is a dialect of the Lisp programming language, published in American National Standards Institute (ANSI) standard document ''ANSI INCITS 226-1994 (S2018)'' (formerly ''X3.226-1994 (R1999)''). The Common Lisp HyperSpec, a hyperli ...
and
Scheme provide such syntax via ''datum labels'', with which objects can be marked, which can then recur elsewhere, indicating shared rather than duplicated structure, enabling the
reader or
printer to detect and thus trigger evaluation or display of cycles without infinitely recursing
:
#n=(''x'' ''y'' . #n#)
The definition of an atom varies per context; in the original definition by
John McCarthy,
it was assumed that there existed "an infinite set of distinguishable
atomic symbols" represented as "strings of capital
Latin letters and digits with single embedded blanks" (a subset of
character string
In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed (after creation). ...
and numeric
literals).
Most modern sexpr notations allow more general quoted strings (for example including punctuation or full
Unicode
Unicode or ''The Unicode Standard'' or TUS is a character encoding standard maintained by the Unicode Consortium designed to support the use of text in all of the world's writing systems that can be digitized. Version 16.0 defines 154,998 Char ...
), and use an abbreviated notation to represent lists with more than 2 members, so that
:
(''x'' ''y'' ''z'')
stands for
:
(''x'' . (''y'' . (''z'' . NIL)))
NIL
is the special end-of-list
object (alternatively written
()
, which is the only representation in
Scheme).
In the Lisp family of programming languages, S-expressions are used to represent both source code and data. Other uses of S-expressions are in Lisp-derived languages such as
DSSSL, and as
mark-up in
communication protocol
A communication protocol is a system of rules that allows two or more entities of a communications system to transmit information via any variation of a physical quantity. The protocol defines the rules, syntax, semantics (computer science), sem ...
s like
IMAP and
John McCarthy's
CBCL. It is also used as text representation of
WebAssembly. The details of the syntax and supported
data type
In computer science and computer programming, a data type (or simply type) is a collection or grouping of data values, usually specified by a set of possible values, a set of allowed operations on these values, and/or a representation of these ...
s vary in the different languages, but the most common feature among these languages is the use of S-expressions and prefix notation.
Datatypes and syntax
There are many variants of the S-expression format, supporting a variety of different syntaxes for different datatypes. The most widely supported are:
* ''Lists and pairs'':
(1 () (2 . 3) (4))
* ''Symbols'':
with-hyphen
?@!$
, a symbol with spaces,
* ''Strings'':
"Hello, world!"
* ''Integers'':
-9876543210
* ''Floating-point numbers'':
-0.0
6.28318
6.022e23
The character
#
is often used to prefix extensions to the syntax, e.g.
#x10
for hexadecimal integers, or
#\C
for characters.
Use in Lisp
When representing source code in Lisp, the first element of an S-expression is commonly an operator or function name and any remaining elements are treated as arguments. This is called "prefix notation" or "
Polish notation
Polish notation (PN), also known as normal Polish notation (NPN), Łukasiewicz notation, Warsaw notation, Polish prefix notation, Eastern Notation or simply prefix notation, is a mathematical notation in which Operation (mathematics), operator ...
". As an example, the
Boolean expression written
4 (2 + 2)
in
C, is represented as
(= 4 (+ 2 2))
in Lisp's s-expr-based prefix notation.
As noted above, the precise definition of "atom" varies across LISP-like languages. A quoted string can typically contain anything but a quote, while
an unquoted identifier atom can typically contain anything but quotes, whitespace characters, parentheses, brackets, braces, backslashes, and semicolons. In either case, a prohibited character can typically be included by escaping it with a preceding backslash.
Unicode
Unicode or ''The Unicode Standard'' or TUS is a character encoding standard maintained by the Unicode Consortium designed to support the use of text in all of the world's writing systems that can be digitized. Version 16.0 defines 154,998 Char ...
support varies.
The recursive case of the s-expr definition is traditionally implemented using
cons cells.
S-expressions were originally intended only for data to be manipulated by
M-expressions, but the first implementation of Lisp was an interpreter of S-expression encodings of M-expressions, and Lisp programmers soon became accustomed to using S-expressions for both code and data.
This means that Lisp is
homoiconic; that is, the primary representation of programs is also a data structure in a primitive type of the language itself.
Nested lists can be written as S-expressions:
((milk juice) (honey marmalade))
is a two-element S-expression whose elements are also two-element S-expressions. The whitespace-separated notation used in Lisp (and this article) is typical. Line breaks (newline characters) usually qualify as separators. This is a simple
context-free grammar
In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules
can be applied to a nonterminal symbol regardless of its context.
In particular, in a context-free grammar, each production rule is of the fo ...
for a tiny subset of English written as an S-expression (Gazdar/Melish, Natural Language Processing in Lisp), where S=sentence, NP=Noun Phrase, VP=Verb Phrase, V=Verb:
(((S) (NP VP))
((VP) (V))
((VP) (V NP))
((V) died)
((V) employed)
((NP) nurses)
((NP) patients)
((NP) Medicenter)
((NP) "Dr Chan"))
Program code can be written in S-expressions, usually using prefix notation. Example in
Common Lisp
Common Lisp (CL) is a dialect of the Lisp programming language, published in American National Standards Institute (ANSI) standard document ''ANSI INCITS 226-1994 (S2018)'' (formerly ''X3.226-1994 (R1999)''). The Common Lisp HyperSpec, a hyperli ...
:
(defun factorial (x)
(if (zerop x)
1
(* x (factorial (- x 1)))))
S-expressions can be read in Lisp using the function READ. READ reads the textual representation of an S-expression and returns Lisp data. The function PRINT can be used to output an S-expression. The output then can be read with the function READ, when all printed data objects have a readable representation. Lisp has readable representations for numbers, strings, symbols, lists and many other data types. Program code can be formatted as pretty printed S-expressions using the function PPRINT (note: with two Ps, short for ''pretty''-print).
Lisp programs are valid S-expressions, but not all S-expressions are valid Lisp programs.
(1.0 + 3.1)
is a valid S-expression, but not a valid Lisp program, since Lisp uses prefix notation and a floating point number (here 1.0) is not valid as an operation (the first element of the expression).
An S-expression preceded by a single quotation mark, as in
'x
, is
syntactic sugar
In computer science, syntactic sugar is syntax within a programming language that is designed to make things easier to read or to express. It makes the language "sweeter" for human use: things can be expressed more clearly, more concisely, or in an ...
for a
quoted S-expression, in this case
(quote x)
.
Relation to XML
S-expressions are often compared to
XML
Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing data. It defines a set of rules for encoding electronic document, documents in a format that is both human-readable and Machine-r ...
: one key difference is that S-expressions have just one form of containment, the dotted pair, while XML tags can contain simple attributes, other tags, or
CDATA, each using different syntax. Another is that S-expressions do not define a reference mechanism, whereas XML provides a notion of unique identifiers and references to them. For simple use cases, S-expressions are simpler than XML, but for more advanced use cases, XML has a query language called
XPath
XPath (XML Path Language) is an expression language designed to support the query or transformation of XML documents. It was defined by the World Wide Web Consortium (W3C) in 1999, and can be used to compute values (e.g., strings, numbers, or ...
which many tools and third party libraries use to simplify the handling of XML data.
Standardization
Standards for some Lisp-derived programming languages include a specification for their S-expression syntax. These include
Common Lisp
Common Lisp (CL) is a dialect of the Lisp programming language, published in American National Standards Institute (ANSI) standard document ''ANSI INCITS 226-1994 (S2018)'' (formerly ''X3.226-1994 (R1999)''). The Common Lisp HyperSpec, a hyperli ...
(ANSI standard document ANSI INCITS 226-1994 (R2004)),
Scheme (R5RS and
R6RS), and
ISLISP.
In May 1997,
Ron Rivest
Ronald Linn Rivest (;
born May 6, 1947) is an American cryptographer and computer scientist whose work has spanned the fields of algorithms and combinatorics, cryptography, machine learning, and election integrity.
He is an Institute Profess ...
submitted an
Internet Draft to be considered for publication as an
RFC. The draft defined a syntax based on Lisp S-expressions but intended for general-purpose data storage and exchange (similar to
XML
Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing data. It defines a set of rules for encoding electronic document, documents in a format that is both human-readable and Machine-r ...
) rather than specifically for programming. It was never approved as an RFC, but it has since been cited and used by other RFCs (e.g. RFC 2693) and several other publications. It was originally intended for use in
SPKI.
Rivest's format defines an S-expression as being either an octet-string (a series of
byte
The byte is a unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable un ...
s) or a finite list of other S-expressions. It describes three interchange formats for expressing this structure. One is the "advanced transport", which is very flexible in terms of formatting, and is syntactically similar to Lisp-style expressions, but they are not identical. The advanced transport, for example, allows octet-strings to be represented verbatim (the string's length followed by a colon and the entire raw string), a quoted form allowing escape characters,
hexadecimal
Hexadecimal (also known as base-16 or simply hex) is a Numeral system#Positional systems in detail, positional numeral system that represents numbers using a radix (base) of sixteen. Unlike the decimal system representing numbers using ten symbo ...
,
Base64
In computer programming, Base64 is a group of binary-to-text encoding schemes that transforms binary data into a sequence of printable characters, limited to a set of 64 unique characters. More specifically, the source binary data is taken 6 bits ...
, or placed directly as a "token" if it meets certain conditions. (Rivest's tokens differ from Lisp tokens in that the former are just for convenience and aesthetics, and treated exactly like other strings, while the latter have specific syntactical meaning.)
Rivest's draft defines a
canonical representation "for digital signature purposes". It is intended to be compact, easier to parse, and unique for any abstract S-expression. It only allows verbatim strings, and prohibits whitespace as formatting outside strings. Finally there is the "basic transport representation", which is either the canonical form or the same encoded as Base64 and surrounded by
braces, the latter intended to safely transport a canonically encoded S-expression in a system which might change spacing (e.g. an email system which has 80-character-wide lines and wraps anything longer than that).
This format has not been widely adapted for use outside of SPKI (some of the users being
GnuPG, libgcrypt,
Nettle
Nettle refers to plants with stinging hairs, particularly those of the genus '' Urtica''. It can also refer to plants which resemble ''Urtica'' species in appearance but do not have stinging hairs. Plants called "nettle" include:
* ball nettle ...
, and
GNU lsh). Rivest's S-expressions web page provides
C source code for a parser and generator (available under the
MIT license
The MIT License is a permissive software license originating at the Massachusetts Institute of Technology (MIT) in the late 1980s. As a permissive license, it puts very few restrictions on reuse and therefore has high license compatibility.
Unl ...
), which could be adapted and embedded into other programs.
In addition, there are no restrictions on independently implementing the format.
See also
*
cons
*
CAR and CDR
*
Fexpr
*
Lambda calculus
In mathematical logic, the lambda calculus (also written as ''λ''-calculus) is a formal system for expressing computability, computation based on function Abstraction (computer science), abstraction and function application, application using var ...
*
M-expression
*
Canonical S-expressions
*
Comparison of data serialization formats
References
External links
sfsexpthe small, fast S-expression library for C/C++ on GitHub
minilisp by Léon Bottou.
S-expressions on Rosettacodehas implementations of readers and writers in many languages.
{{Lisp programming language
Lisp (programming language)
Data serialization formats