x = "foo", where "foo" is a string literal with the value foo.
Syntax
Bracketed delimiters
Most modern programming languages use bracket delimiters (also called balanced delimiters) to specify string literals. Double quotation marks are the most common quoting delimiters: "Hi There!" An empty string is written as a pair of quotes with no characters in between: "" Some languages either allow or mandate single quotation marks instead of double quotation marks (the string must begin and end with the same kind of quotation mark, and the type of quotation mark may or may not give slightly different semantics): 'Hi There!' These quotation marks are ''unpaired'' (the same character is used as an opener and a closer).
Paired delimiters
A number of languages provide for paired delimiters, where the opening and closing delimiters are different. These often also allow nested strings, so delimiters can be embedded as long as they are paired, but they still result in delimiter collision when embedding an unpaired closing delimiter. Examples include PostScript, which uses parentheses, as in (The quick (brown fox)), and m4, which uses the backtick (`) as the opening delimiter and the apostrophe (') as the closing delimiter. Tcl allows both quotes (for interpolated strings) and braces (for raw strings), as in "The quick brown fox" or {The quick {brown fox}}; this derives from the single quotations in Unix shells and the use of braces in C for compound statements, since blocks of code are in Tcl syntactically the same thing as string literals. That the delimiters are paired is essential for making this feasible.
Whitespace delimiters
String literals might be ended by newlines. There might also be special syntax for multi-line strings.
No delimiters
Some programming languages, such as Perl and PHP, allow string literals without any delimiters in some contexts. In Perl, for example, a bareword to the left of the => operator, such as red, green, or blue, is a string literal even though it is unquoted.
Declarative notation
In the original FORTRAN programming language (for example), string literals were written in so-called ''Hollerith'' notation, where a decimal count of the number of characters was followed by the letter H, and then the characters of the string, as in 5HHELLO.
Constructor functions
C++ has two styles of string, one inherited from C (delimited by "), and the safer std::string in the C++ Standard Library. The std::string class is frequently used in the same way a string literal would be used in other languages, and is often preferred to C-style strings for its greater flexibility and safety. But it comes with a performance penalty for string literals, as std::string usually allocates memory dynamically, and must copy the C-style string literal to it at run time.
Before C++11, there was no literal for C++ strings (C++11 allows "this is a C++ string"s with the s at the end of the literal), so the normal constructor syntax was used, for example:
* std::string str = "initializer syntax";
* std::string str("converting constructor syntax");
* std::string str = std::string("explicit constructor syntax");
all of which have the same interpretation. Since C++11, there is also new constructor syntax:
* std::string str{"uniform initializer syntax"};
* auto str = "constexpr literal syntax"s;
Delimiter collision
When using quoting, if one wishes to represent the delimiter itself in a string literal, one runs into the problem of ''delimiter collision''. For example, if the delimiter is a double quote, one cannot represent a double quote itself by the literal """, as the second quote is interpreted as the end of the string literal, not as the value of the string, and similarly one cannot write "This is "in quotes", but invalid.", as the middle quoted portion is instead interpreted as outside of quotes. There are various solutions, the most general-purpose of which is using escape sequences, such as "\"" or "This is \"in quotes\" and properly escaped.", but there are many other solutions.
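The collision and its escape-sequence fix can be checked with a short sketch that hands each candidate literal to Python's own parser. The helper name parses_as_string is hypothetical and exists only for this illustration; ast.literal_eval is part of the Python standard library.

```python
import ast


def parses_as_string(source: str) -> bool:
    """Return True if `source` is a valid Python literal that evaluates to a str."""
    try:
        return isinstance(ast.literal_eval(source), str)
    except (SyntaxError, ValueError):
        return False


# Source text of three candidate literals. They are written with single
# quotes here so that the double quotes inside are ordinary characters.
collision = '"This is "in quotes", but invalid."'
escaped = '"This is \\"in quotes\\" and properly escaped."'
lone_quote = '"\\""'

print(parses_as_string(collision))   # the inner quote ends the literal early
print(parses_as_string(escaped))
print(ast.literal_eval(lone_quote))
```

The first candidate is rejected because the parser treats the second quotation mark as the end of the literal; the escaped forms are accepted.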
Paired quotes, such as braces in Tcl, allow nested strings, such as {foo {bar} zork}, but do not otherwise solve the problem of delimiter collision, since an unbalanced closing delimiter cannot simply be included, as in {}}.
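A simplified reader for brace-delimited strings shows why balanced nesting works while a stray closing brace does not. This is only a sketch of the idea, not Tcl's actual parser (which also handles backslashes, among other things), and the function name read_braced is an assumption made for the example.

```python
def read_braced(source: str) -> str:
    """Return the contents of a brace-delimited string starting at source[0].

    Nested braces are allowed as long as they are balanced.
    """
    if not source.startswith("{"):
        raise ValueError("expected an opening brace")
    depth = 0
    for index, char in enumerate(source):
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                # The matching closer ends the literal; anything after it
                # is not part of the string.
                return source[1:index]
    raise ValueError("unbalanced braces")
```

With this reader, a balanced nested brace pair is kept as part of the value, whereas an extra closing brace ends the literal at the first point where the depth returns to zero and is left outside the string.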
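Doubling up
A number of languages, including Pascal, avoid delimiter collision by doubling up on the quotation marks that are intended to be part of the string literal itself.
Dual quoting
Some languages, such as Fortran, allow more than one quoting delimiter; in the case of two possible delimiters, this is known as ''dual quoting''.
D supports a few quoting delimiters, with such strings starting with q" plus an opening delimiter and ending with the respective closing delimiter plus " (any of () <> {} or [] can serve as the delimiter pair). D also supports here document-style strings via similar syntax.
Multiple quoting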
A further extension is the use of ''multiple quoting'', which allows the author to choose which characters should specify the bounds of a string literal. For example, in C++11, raw strings begin with R"delimiter( and end with )delimiter". The delimiter can be from zero to 16 characters long and may contain any member of the basic source character set except whitespace characters, parentheses, or backslash. A variant of multiple quoting is the use of here document-style strings. Lua uses [[ and ]] to delimit literal strings (initial newline stripped, otherwise raw), but the opening brackets can include any number of equal signs, and only closing brackets with the same number of signs close the string, as in [==[ ... ]==]. Multiple quoting is also useful with regular expressions: in the sed substitution command s/regex/replacement/ the default slash / delimiters can be replaced by another character, as in s,regex,replacement,.
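The bracket scheme with a variable number of equal signs, as used by Lua's long brackets, can be sketched in a few lines. The function name read_long_bracket is hypothetical, and the sketch covers only the rules stated above: matching levels and the stripped initial newline.

```python
import re


def read_long_bracket(source: str) -> str:
    """Read a literal delimited by [=*[ ... ]=*] with matching numbers of '='."""
    opener = re.match(r"\[(=*)\[", source)
    if opener is None:
        raise ValueError("expected an opening long bracket")
    # Only a closing bracket with the same number of '=' ends the literal.
    closer = "]" + opener.group(1) + "]"
    start = opener.end()
    end = source.find(closer, start)
    if end == -1:
        raise ValueError("no closing bracket with the same number of '='")
    body = source[start:end]
    # A newline directly after the opening bracket is dropped.
    return body[1:] if body.startswith("\n") else body
```

Because the author picks the number of equal signs, a level can always be chosen that does not occur in the text, so closing brackets of other levels can appear inside the literal unescaped.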
Constructor functions
Another option, which is rarely used in modern languages, is to use a function to construct a string, rather than representing it via a literal. This is generally not used in modern languages because the computation is done at run time, rather than at parse time. For example, early forms of BASIC provided the CHR$ function, which returns a string containing the character corresponding to its argument. In C, a similar facility is available via sprintf and the %c "character" format specifier, though in the presence of other workarounds this is generally not used. A similar technique can be used in C++ with the std::string stringification operator.
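As a rough Python analogue of the two techniques above, the built-in chr function plays the role of a character-constructing function, and the %c conversion of Python's printf-style formatting plays the role of the format specifier. The example sentences are made up for illustration.

```python
QUOTE = chr(34)  # 34 is the ASCII code of the double quotation mark

# Constructor-function style: the quote is produced by a function call.
sentence = "I said, " + QUOTE + "can you hear me?" + QUOTE

# Format-specifier style: %c converts an integer code to a character.
via_format = "This is %cin quotes.%c" % (34, 34)

print(sentence)
print(via_format)
```

Neither literal contains a quotation mark of its own, which is the point of the technique; the cost is that the string is assembled at run time.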
Escape sequences
Escape sequences are a general technique for representing characters that are otherwise difficult to represent directly, including delimiters, nonprinting characters (such as backspaces), newlines, and whitespace characters (which are otherwise impossible to distinguish visually), and have a long history. They are accordingly widely used in string literals, and adding an escape sequence (either to a single character or throughout a string) is known as escaping. One character is chosen as a prefix to give encodings for characters that are difficult or impossible to include directly. Most commonly this is the backslash; the backslash itself can be encoded as a double backslash \\, and for delimited strings the delimiter itself can be encoded by escaping, say by \" for ". A regular expression for such escaped strings can be given as follows:
"(\\.|[^\\"])*"
meaning "a quote; followed by zero or more of either an escaped character (backslash followed by something, possibly backslash or quote), or a non-escape, non-quote character; ending in a quote". The only issue is distinguishing the terminating quote from a quote preceded by a backslash, which may itself be escaped. Multiple characters can follow the backslash, such as \uFFFF, depending on the escaping scheme.
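The pattern described above (a quote, then any run of escaped or ordinary characters, then a closing quote) can be tried out with Python's re module. The sample text is invented for the illustration.

```python
import re

# A quote, then any run of escaped characters (backslash plus one character)
# or plain characters that are neither backslash nor quote, then a quote.
ESCAPED_STRING = re.compile(r'"(\\.|[^\\"])*"')

text = r'say "a \"quoted\" word" and "C:\\" then stop'
literals = [match.group(0) for match in ESCAPED_STRING.finditer(text)]

for literal in literals:
    print(literal)
```

The second literal ends in an escaped backslash followed by the real closing quote, which is the case the pattern has to get right: the backslash pair is consumed as one escaped character, so the quote after it is seen as the terminator.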
An escaped string must then itself be lexically analyzed, converting the escaped string into the unescaped string that it represents. This is done during the evaluation phase of the overall lexing of the computer language: the evaluator of the lexer of the overall language executes its own lexer for escaped string literals.
Among other things, it must be possible to encode the character that normally terminates the string constant, and there must be some way to specify the escape character itself. Escape sequences are not always pretty or easy to use, so many compilers also offer other means of solving the common problems. Escape sequences, however, solve every delimiter problem, and most compilers interpret them. When an escape character is inside a string literal, it means "this is the start of the escape sequence". Every escape sequence specifies one character which is to be placed directly into the string. The actual number of characters required in an escape sequence varies. The escape character itself (the Esc key at the top left of the keyboard) is translated by the editor, so it cannot be typed directly into a string; the backslash is used to represent the escape character in a string literal.
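The second pass described above, turning the escaped text into the value it represents, might look like the following sketch. It assumes a deliberately small escape set (\n, \t, \\, \" and \uXXXX); real languages define more, and the names unescape and SIMPLE_ESCAPES are made up for this example.

```python
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}


def unescape(body: str) -> str:
    """Convert the text between the quotes of a literal into its value."""
    result = []
    chars = iter(body)
    for char in chars:
        if char != "\\":
            result.append(char)
            continue
        # A backslash starts an escape sequence; read the character after it.
        code = next(chars, None)
        if code is None:
            raise ValueError("escape character at end of string")
        if code == "u":
            # \uXXXX: exactly four hexadecimal digits follow.
            digits = "".join(next(chars, "") for _ in range(4))
            if len(digits) != 4:
                raise ValueError("truncated \\u escape")
            result.append(chr(int(digits, 16)))
        elif code in SIMPLE_ESCAPES:
            result.append(SIMPLE_ESCAPES[code])
        else:
            raise ValueError(f"unknown escape \\{code}")
    return "".join(result)
```

Each escape sequence, whatever its length in the source, contributes exactly one character to the result, which matches the description above.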
Nested escaping
When code in one programming language is embedded inside another, embedded strings may require multiple levels of escaping. This is particularly common in regular expressions and SQL queries within other languages, or other languages inside shell scripts. This double-escaping is often difficult to read and author. Incorrect quoting of nested strings can present a security vulnerability. Untrusted data, as in the data fields of an SQL query, should be passed through prepared statements to prevent a code injection attack.
Raw strings
A few languages provide a method of specifying that a literal is to be processed without any language-specific interpretation. This avoids the need for escaping, and yields more legible strings. Raw strings are particularly useful when a common character needs to be escaped, notably in regular expressions (nested as string literals), where backslash \ is widely used, and in DOS/Windows paths, where backslash is used as a path separator. The profusion of backslashes is known as leaning toothpick syndrome. An extreme case arises when the two are combined: Uniform Naming Convention (UNC) paths begin with \\, and thus an escaped regular expression matching a UNC name begins with 8 backslashes, "\\\\\\\\", due to needing to escape both the string and the regular expression. Using raw strings reduces this to 4 (escaping in the regular expression), as in C# @"\\\\".
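The same count can be reproduced in Python, whose r prefix plays the role that @-quoting plays in the C# example. The path used here is a made-up sample.

```python
import re

escaped_pattern = "\\\\\\\\"  # 8 backslashes in the source, 4 in the regex
raw_pattern = r"\\\\"         # 4 backslashes in the source, 4 in the regex

unc_path = r"\\server\share"  # hypothetical UNC path: starts with 2 backslashes

match = re.match(raw_pattern, unc_path)
print(escaped_pattern == raw_pattern)
print(match.group(0))
```

Both spellings produce the same four-character pattern, which the regular expression engine in turn reads as two literal backslashes.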
In XML documents, CDATA sections allow characters such as & and < to be used without being interpreted as markup.
Multiline string literals
In many languages, string literals can contain literal newlines, spanning several lines. Alternatively, newlines can be escaped, most often as \n. Unix shells additionally provide ''here documents'', introduced by <<END, where END can be any word, and the closing delimiter is END on a line by itself, serving as a content boundary; the << is due to redirecting stdin from the literal. Because the delimiter is arbitrary, these also avoid the problem of delimiter collision. They also allow initial tabs to be stripped via the variant syntax <<-END, though leading spaces are not stripped. The same syntax has since been adopted for multiline string literals in a number of languages, most notably Perl; these are also referred to as ''here documents'' and retain the syntax, despite being strings and not involving redirection. As with other string literals, these can sometimes have different behavior specified, such as variable interpolation.
Python, whose usual string literals do not allow literal newlines, instead has a special form of string, designed for multiline literals, called ''triple quoting''. These use a tripled delimiter, either ''' or """. These literals are especially used for inline documentation, known as docstrings. In Tcl, leading and trailing newlines can be stripped via string trim, while string map can be used to strip indentation.
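A brief Python sketch of triple quoting follows, with indentation stripped by textwrap.dedent from the standard library. The function name usage and the help text are invented for the example.

```python
import textwrap


def usage() -> str:
    """Return a short help text (this docstring is itself triple-quoted)."""
    # The backslash after the opening quotes suppresses the first newline.
    text = """\
        Usage: tool [options]
          -h  show help
    """
    # dedent removes the leading whitespace common to all non-blank lines.
    return textwrap.dedent(text)


print(usage())
```

Stripping the common indentation lets the literal follow the indentation of the surrounding code without that indentation ending up in the value.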
String literal concatenation
A few languages provide string literal concatenation, where adjacent string literals are implicitly joined into a single literal at compile time. This is a feature of C, C++, D, Ruby, and Python, which copied it from C. Notably, this concatenation happens at compile time, during lexical analysis, and is contrasted with both run-time string concatenation (generally with the + operator) and concatenation during constant folding, which occurs at compile time, but in a later phase (after phrase analysis or "parsing"). Most languages, such as C#, Java and Perl, do not support implicit string literal concatenation, and instead require explicit concatenation, such as with the + operator (this is also possible in D and Python, but illegal in C/C++ – see below); in this case concatenation may happen at compile time, via constant folding, or may be deferred to run time.
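The difference between implicit concatenation and an explicit operator can be observed in Python by inspecting the syntax tree that the standard ast module produces. This is a sketch of the observable effect only; it says nothing about the exact compiler phase in which a given implementation does the joining.

```python
import ast

# Adjacent literals are joined before the expression is evaluated.
implicit = "Hello, " "world"

# In the parsed tree the adjacent literals are already one constant, while
# the explicit form is still an operator with two operands.
implicit_tree = ast.parse('"Hello, " "world"', mode="eval").body
explicit_tree = ast.parse('"Hello, " + "world"', mode="eval").body

print(type(implicit_tree).__name__, type(explicit_tree).__name__)
```

The explicit form may still be folded into a single constant later by the compiler, which is the constant folding case mentioned above.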
Motivation
In C, where the concept and term originate, string literal concatenation was introduced for two reasons:
* To allow long strings to span multiple lines with proper indentation, in contrast to line continuation, which destroys the indentation scheme; and
* To allow the construction of string literals by macros (via stringification).
In C and C++, string literals cannot instead be joined with the + operator, because they have array type, char [n] (C) or const char [n] (C++), which cannot be added; this is not a restriction in most other languages.
This is particularly important when used in combination with the C preprocessor, to allow strings to be computed following preprocessing, particularly in macros. As a simple example:
char *file_and_message = __FILE__ ": message";
will (if the file is called a.c) expand to:
char *file_and_message = "a.c" ": message";
which is then concatenated, being equivalent to:
char *file_and_message = "a.c: message";
A common use case is in constructing printf or scanf format strings, where format specifiers are given by macros.
A more complex example uses stringification of integers (by the preprocessor) to define a macro that expands to a sequence of string literals, which are then concatenated to a single string literal with the file name and line number:
#define STRINGIFY(x) #x
#define TOSTRING(x) STRINGIFY(x)
#define AT __FILE__ ":" TOSTRING(__LINE__)
Beyond syntactic requirements of C/C++, implicit concatenation is a form of syntactic sugar, making it simpler to split string literals across several lines, avoiding the need for line continuation (via backslashes) and allowing one to add comments to parts of strings. For example, in Python, one can comment a regular expression in this way:
re.compile("[A-Za-z_]"       # letter or underscore
           "[A-Za-z0-9_]*"   # letter, digit or underscore
          )
Problems
Implicit string concatenation is not required by modern compilers, which implement constant folding, and causes hard-to-spot errors due to unintentional concatenation from omitting a comma, particularly in vertical lists of strings, as in:
l = ['foo',
     'bar'
     'zork']
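The effect of the missing comma is easy to demonstrate in Python. The variable names intended and accidental are chosen only for this illustration.

```python
intended = ['foo',
            'bar',
            'zork']

# The comma after 'bar' is missing, so 'bar' and 'zork' merge silently.
accidental = ['foo',
              'bar'
              'zork']

print(len(intended), len(accidental))
print(accidental)
```

No error or warning is raised by the language itself: the list simply ends up one element shorter, with the two neighbouring strings fused.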
Accordingly, it is not used in most languages, and it has been proposed for deprecation from D and Python. However, removing the feature breaks backwards compatibility, and replacing it with a concatenation operator introduces issues of precedence – string literal concatenation occurs during lexing, prior to operator evaluation, but concatenation via an explicit operator occurs at the same time as other operators, hence precedence is an issue, potentially requiring parentheses to ensure desired evaluation order.
A subtler issue is that in C and C++, there are different types of string literals, and concatenation of these has implementation-defined behavior, which poses a potential security risk.
Different kinds of strings
Some languages provide more than one kind of literal, which have different behavior. This is particularly used to indicate raw strings (no escaping), or to disable or enable variable interpolation, but has other uses, such as distinguishing character sets. Most often this is done by changing the quoting character or adding a prefix or suffix. This is comparable to prefixes and suffixes to integer literals, such as to indicate hexadecimal numbers or long integers.
One of the oldest examples is in shell scripts, where single quotes indicate a raw string or "literal string", while double quotes have escape sequences and variable interpolation.
For example, in Python, raw strings are preceded by an r or R – compare 'C:\\Windows' with r'C:\Windows' (though a Python raw string cannot end in an odd number of backslashes). Python 2 also distinguishes two types of strings: 8-bit ASCII ("bytes") strings (the default), explicitly indicated with a b or B prefix, and Unicode strings, indicated with a u or U prefix, while in Python 3 strings are Unicode by default and bytes are a separate bytes type that, when initialized with quotes, must be prefixed with a b.
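The Python 3 prefixes described above can be compared side by side in a short sketch; the sample word is arbitrary.

```python
escaped = 'C:\\Windows'   # backslash written as an escape sequence
raw = r'C:\Windows'       # raw string: backslash is an ordinary character

text = 'caf\u00e9'        # str: a sequence of Unicode code points
data = b'caf\xc3\xa9'     # bytes: a sequence of 8-bit values

print(escaped == raw)
print(len(text), len(data))
print(text.encode('utf-8') == data)
```

The two path spellings denote the same ten characters, whereas the str and bytes values are different types that are related only through an explicit encoding.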
C#'s notation for raw strings is called @-quoting.
@"C:\Foo\Bar\Baz\"
While this disables escaping, it allows double-up quotes, which allow one to represent quotes within the string:
@"I said, ""Hello there."""
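The doubled-quote convention shown above can be sketched as a pair of helper functions. The names encode_doubled and decode_doubled are hypothetical, and the sketch models only the doubling rule, not the @ prefix or any other part of C# syntax.

```python
def encode_doubled(value: str, quote: str = '"') -> str:
    """Write `value` as a literal in which embedded quotes are doubled."""
    return quote + value.replace(quote, quote * 2) + quote


def decode_doubled(literal: str, quote: str = '"') -> str:
    """Recover the value of a literal produced by `encode_doubled`."""
    if len(literal) < 2 or not (literal.startswith(quote) and literal.endswith(quote)):
        raise ValueError("literal must start and end with the quote character")
    return literal[1:-1].replace(quote * 2, quote)


literal = encode_doubled('I said, "Hello there."')
print(literal)
print(decode_doubled(literal))
```

Passing a different quote argument, such as an apostrophe, gives the same scheme for languages that delimit strings with single quotes.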
C++11 allows raw strings, Unicode strings (UTF-8, UTF-16, and UTF-32), and wide character strings, determined by prefixes. It also adds literals for the existing C++ string, which is generally preferred to the existing C-style strings.
In Tcl, brace-delimited strings are literal, while quote-delimited strings have escaping and interpolation.
Perl has a wide variety of strings, which are more formally considered operators, and are known as quote and quote-like operators. These include both a usual syntax (fixed delimiters) and a generic syntax, which allows a choice of delimiters; these include:
'' "" `` // m// qr// s/// y///
q qq qx qw m qr s tr y
REXX uses suffix characters to specify characters or strings using their hexadecimal or binary code. For example,
'20'x
"0010 0000"b
"00100000"b
all yield the space character, avoiding the function call X2C(20).
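Python has no such suffixes, but the conversions that the three literals above denote can be spelled out with standard library calls, as a rough illustration of what the notation computes.

```python
# Hexadecimal code 20 and binary code 0010 0000 both denote the space.
from_hex = bytes.fromhex("20").decode("ascii")
from_binary_spaced = chr(int("0010 0000".replace(" ", ""), 2))
from_binary = chr(int("00100000", 2))

print(repr(from_hex), repr(from_binary_spaced), repr(from_binary))
```

Here the conversion is an explicit run-time call, which is what the suffix notation avoids.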
String interpolation
In some languages, string literals may contain placeholders referring to variables or expressions in the current context, which are evaluated (usually at run time). This is referred to as ''variable interpolation'', or more generally string interpolation. Languages that support interpolation generally distinguish string literals that are interpolated from ones that are not. For example, in sh-compatible Unix shells (as well as Perl and Ruby), double-quoted (quotation-delimited, ") strings are interpolated, while single-quoted (apostrophe-delimited, ') strings are not. Non-interpolated string literals are sometimes referred to as "raw strings", but this is distinct from "raw string" in the sense of escaping. For example, in Python, a string prefixed with r or R has no escaping or interpolation, a normal string (no prefix) has escaping but no interpolation, and a string prefixed with f or F has escaping and interpolation.
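The three Python cases can be placed next to each other; the variable name and its value are arbitrary.

```python
name = "Nancy"

raw = r"{name}\n"        # no escaping, no interpolation
normal = "{name}\n"      # escaping, no interpolation
formatted = f"{name}\n"  # escaping and interpolation

print(repr(raw))
print(repr(normal))
print(repr(formatted))
```

Only the f-prefixed literal replaces the placeholder, and only the raw literal keeps the backslash and the letter n as two separate characters.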
For example, the following Perl code:
$name = "Nancy";
$greeting = "Hello World";
print "$name said $greeting to the crowd of people.";
produces the output:
Nancy said Hello World to the crowd of people.
In this case, the metacharacter ($) (not to be confused with the sigil in the variable assignment statement) is interpreted to indicate variable interpolation, and requires some escaping if it needs to be output literally.
This should be contrasted with the printf function, which produces the same output using notation such as:
printf "%s said %s to the crowd of people.", $name, $greeting;
but does not perform interpolation: the %s is a placeholder in a printf format string, but the variables themselves are outside the string.
This is contrasted with "raw" strings:
print '$name said $greeting to the crowd of people.';
which produce output like:
$name said $greeting to the crowd of people.
Here the $ characters are not metacharacters, and are not interpreted to have any meaning other than plain text.
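The same three-way contrast can be written in Python, where an f-string stands in for Perl's interpolating double-quoted string and the % operator stands in for printf. This is an analogy rather than a translation of the Perl semantics.

```python
name = "Nancy"
greeting = "Hello World"

# Interpolation: the variable names appear inside the literal itself.
interpolated = f"{name} said {greeting} to the crowd of people."

# printf-style formatting: %s is only a placeholder; the values are
# supplied outside the string.
formatted = "%s said %s to the crowd of people." % (name, greeting)

# An ordinary literal: the braces and names are plain text.
plain = "{name} said {greeting} to the crowd of people."

print(interpolated)
print(formatted)
print(plain)
```

The first two produce the same sentence by different routes, while the third is printed exactly as written.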
Embedding source code in string literals
Languages that lack flexibility in specifying string literals make it particularly cumbersome to write programming code that generates other programming code. This is particularly true when the generation language is the same or similar to the output language.
For example:
* writing code to produce quines;
* generating an output language from within a web template;
* using XSLT to generate XSLT, or SQL to generate more SQL;
* generating a PostScript representation of a document for printing purposes, from within a document-processing application written in C or some other language;
* writing shaders.
Nevertheless, some languages are particularly well-adapted to produce this sort of self-similar output, especially those that support multiple options for avoiding delimiter collision.
Using string literals as code that generates other code may have adverse security implications, especially if the output is based at least partially on untrusted user input. This is particularly acute in the case of Web-based applications, where malicious users can take advantage of such weaknesses to subvert the operation of the application, for example by mounting an SQL injection attack.
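The risk can be made concrete with a small sketch using Python's standard sqlite3 module and an in-memory database. The table, its rows, and the attacker input are all invented for the illustration.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE users (name TEXT, secret TEXT)")
connection.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)

untrusted = "nobody' OR '1'='1"  # attacker-controlled input

# Unsafe: the input is pasted into the SQL text, so its quote characters
# end the string literal early and the rest is read as SQL.
unsafe_sql = "SELECT secret FROM users WHERE name = '" + untrusted + "'"
leaked = connection.execute(unsafe_sql).fetchall()

# Safer: a parameterized query keeps the input as data, not as SQL text.
safe = connection.execute(
    "SELECT secret FROM users WHERE name = ?", (untrusted,)
).fetchall()

print(len(leaked), len(safe))
```

In the unsafe query the embedded apostrophes collide with the SQL string delimiters, so every row matches; the parameterized query finds no user with that literal name.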
See also
* Character literal
* XML Literals
* Sigil (computer programming)