Escape sequences in C

Escape sequences are used in the programming languages C and C++, and their design was copied in many other languages such as Java, PHP, C#, etc. An escape sequence is a sequence of characters that does not represent itself when used inside a character or string literal, but is translated into another character or a sequence of characters that may be difficult or impossible to represent directly. In C, all escape sequences consist of two or more characters, the first of which is the backslash, \ (called the "escape character"); the remaining characters determine the interpretation of the escape sequence. For example, \n is an escape sequence that denotes a newline character.


Motivation

Suppose we want to print out Hello, on one line, followed by world! on the next line. One could attempt to represent the string to be printed as a single literal as follows:

    #include <stdio.h>
    int main() {
        printf("Hello,
    world!");
    }

This is not valid in C, since a string literal may not span multiple logical source lines. This can be worked around by printing the newline character using its numerical value (0x0A in ASCII):

    #include <stdio.h>
    int main() {
        printf("Hello,%cworld!", 0x0A);
    }

This instructs the program to print Hello, followed by the byte whose numerical value is 0x0A, followed by world!. While this will indeed work when the machine uses the ASCII encoding, it will not work on systems that use other encodings, which have a different numerical value for the newline character. It is also not a good solution because it still does not allow a newline character to be represented inside a literal, and instead takes advantage of the semantics of printf.

In order to solve these problems and ensure maximum portability between systems, C interprets \n inside a literal as a newline character, whatever that may be on the target system:

    #include <stdio.h>
    int main() {
        printf("Hello,\nworld!");
    }

In this code, the escape sequence \n does not stand for a backslash followed by the letter n, because the backslash causes an "escape" from the normal way characters are interpreted by the compiler. After seeing the backslash, the compiler expects another character to complete the escape sequence, and then translates the escape sequence into the bytes it is intended to represent. Thus, "Hello,\nworld!" represents a string with an embedded newline, regardless of whether it is used inside printf or anywhere else.

This raises the issue of how to represent an actual backslash inside a literal. This is done by using the escape sequence \\, as seen in the next section.

Some languages don't have escape sequences, for example Pascal. Instead, a command that includes a newline is used (writeln includes a newline, write excludes it).

    writeln('Hello');
    write('world!');


Table of escape sequences

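The following escape sequences are defined in standard C. This table also shows the values they map to in ASCII. However, these escape sequences can be used on any system with a C compiler, and may map to different values if the system does not use a character encoding based on ASCII.

    Escape sequence       Hex value in ASCII   Character represented
    \a                    07                   Alert (beep, bell)
    \b                    08                   Backspace
    \e (Note 1)           1B                   Escape character
    \f                    0C                   Form feed (page break)
    \n                    0A                   Newline (line feed)
    \r                    0D                   Carriage return
    \t                    09                   Horizontal tab
    \v                    0B                   Vertical tab
    \\                    5C                   Backslash
    \'                    27                   Apostrophe (single quotation mark)
    \"                    22                   Double quotation mark
    \?                    3F                   Question mark
    \nnn (Note 2)         any                  The byte whose value is nnn, read as an octal number
    \xhh…                 any                  The byte whose value is hh…, read as a hexadecimal number
    \uhhhh (Note 3)       none                 Unicode code point below 10000 hexadecimal
    \Uhhhhhhhh (Note 4)   none                 Unicode code point, where h is a hexadecimal digit

Note 1. Common non-standard code; see the Notes section below.
Note 2. There may be one, two, or three octal numerals n present; see the Notes section below.
Note 3. \u takes 4 hexadecimal digits h; see the Notes section below.
Note 4. \U takes 8 hexadecimal digits h; see the Notes section below.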


Notes

\n produces one byte, despite the fact that the platform may use more than one byte to denote a newline, such as the DOS/Windows CRLF sequence, 0x0D 0x0A. The translation from 0x0A to 0x0D 0x0A on DOS and Windows occurs when the byte is written out to a file opened in text mode or to the console, and the inverse translation is done when text files are read.

A hex escape sequence must have at least one hex digit following \x, with no upper bound; it continues for as many hex digits as there are. Thus, for example, \xABCDEFG denotes the byte with the numerical value ABCDEF (hexadecimal), followed by the letter G, which is not a hex digit. However, if the resulting integer value is too large to fit in a single byte, the actual numerical value assigned is implementation-defined, and compilers typically diagnose the out-of-range escape. Most platforms have 8-bit char types, which limits a useful hex escape sequence to two hex digits. However, hex escape sequences longer than two hex digits might be useful inside a wide character or wide string literal (prefixed with L):

    char s1[] = "\x12";       // single char with value 0x12 (18 in decimal)
    char s2[] = "\x1234";     // single char with implementation-defined value, unless char is long enough
    wchar_t s3[] = L"\x1234"; // single wchar_t with value 0x1234, provided wchar_t is long enough (16 bits suffices)

An octal escape sequence consists of \ followed by one, two, or three octal digits. The octal escape sequence ends when it either contains three octal digits already, or the next character is not an octal digit. For example, \11 is a single octal escape sequence denoting a byte with numerical value 9 (11 in octal), rather than the escape sequence \1 followed by the digit 1. However, \1111 is the octal escape sequence \111 followed by the digit 1. In order to denote the byte with numerical value 1, followed by the digit 1, one could use "\1""1", since C automatically concatenates adjacent string literals. Note that some three-digit octal escape sequences may be too large to fit in a single byte; this results in an implementation-defined value for the byte actually produced. The escape sequence \0 is a commonly used octal escape sequence, which denotes the null character, with value zero.


Non-standard escape sequences

A sequence such as \z is not a valid escape sequence according to the C standard as it is not found in the table above. The C standard requires such "invalid" escape sequences to be diagnosed (i.e., the compiler must issue a diagnostic message). Notwithstanding this fact, some compilers may define additional escape sequences, with implementation-defined semantics. An example is the \e escape sequence, which has 1B as the hexadecimal value in ASCII, represents the escape character, and is supported in GCC, clang and tcc. It was not, however, added to the C standard repertoire, because it has no meaningful equivalent in some character sets (such as EBCDIC).


Universal character names

From the C99 standard, C has also supported escape sequences that denote Unicode code points in string literals. Such escape sequences are called universal character names, and have the form \uhhhh or \Uhhhhhhhh, where h stands for a hex digit. Unlike the other escape sequences considered, a universal character name may expand into more than one code unit.

The sequence \uhhhh denotes the code point hhhh, interpreted as a hexadecimal number. The sequence \Uhhhhhhhh denotes the code point hhhhhhhh, interpreted as a hexadecimal number. (Therefore, code points located at U+10000 or higher must be denoted with the \U syntax, whereas lower code points may use \u or \U.) The code point is converted into a sequence of code units in the encoding of the destination type on the target system. For example (where the encoding is UTF-8, and UTF-16 for wchar_t):

    char s1[] = "\xC0";        // A single byte with the value 0xC0, not valid UTF-8
    char s2[] = "\u00C0";      // Two bytes with values 0xC3, 0x80, the UTF-8 encoding of U+00C0
    wchar_t s3[] = L"\xC0";    // A single wchar_t with the value 0x00C0
    wchar_t s4[] = L"\u00C0";  // A single wchar_t with the value 0x00C0

A code point greater than U+FFFF may be represented by a single wchar_t if the UTF-32 encoding is used, or two if UTF-16 is used.

Importantly, the universal character name \u00C0 always denotes the character "À", regardless of what kind of string literal it is used in, or the encoding in use. The octal and hex escape sequences always denote certain sequences of numerical values, regardless of encoding. Therefore, universal character names are complementary to octal and hex escape sequences; while octal and hex escape sequences represent code units, universal character names represent code points, which may be thought of as "logical" characters.


See also

* Escape sequence
* Digraph


Further reading

* ISO/IEC 9899:1999, Programming languages — C
* Lafore, Robert (2001). Object-Oriented Programming in Turbo C++ (1st ed.). Galgotia Publications. ISBN 978-8-18562322-1.