The byte-order mark (BOM) is a particular usage of the special Unicode character code U+FEFF ZERO WIDTH NO-BREAK SPACE, whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text:
* the byte order, or endianness, of the text stream in the cases of 16-bit and 32-bit encodings;
* the fact that the text stream's encoding is Unicode, to a high level of confidence;
* which Unicode character encoding is used.
BOM use is optional. Its presence interferes with the use of UTF-8 by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream.
Unicode can be encoded in units of 8-bit, 16-bit, or 32-bit integers. For the 16- and 32-bit representations, a computer receiving text from arbitrary sources needs to know which byte order the integers are encoded in. The BOM is encoded in the same scheme as the rest of the document and becomes a noncharacter Unicode code point if its bytes are swapped. Hence, the process accessing the text can examine these first few bytes to determine the endianness, without requiring some contract or metadata outside of the text stream itself. Generally the receiving computer will swap the bytes to its own endianness, if necessary, and then no longer needs the BOM for processing.
The byte sequence of the BOM differs per Unicode encoding (including ones outside the Unicode standard such as UTF-7, see table below), and none of the sequences is likely to appear at the start of text streams stored in other encodings. Therefore, placing an encoded BOM at the start of a text stream can indicate that the text is Unicode and identify the encoding scheme used. This use of the BOM is called a "Unicode signature".
Usage
The BOM is, simply, the Unicode codepoint U+FEFF, encoded in the current encoding. A text file beginning with the bytes FE FF suggests that the file is encoded in big-endian UTF-16.
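As a minimal sketch of "U+FEFF encoded in the current encoding", the following Python snippet encodes that single code point with several of Python's built-in codecs and prints the resulting bytes. The helper name `bom_bytes` is hypothetical, and the explicit-endian codec names are used so that Python does not prepend a BOM of its own.

```python
# Encode the single code point U+FEFF in several encodings and show the bytes.
BOM = "\ufeff"

def bom_bytes(encoding: str) -> str:
    """Return the byte sequence of U+FEFF in `encoding` as spaced uppercase hex."""
    return BOM.encode(encoding).hex(" ").upper()

if __name__ == "__main__":
    for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
        print(name, bom_bytes(name))
```

For `utf-16-be` this yields `FE FF`, matching the big-endian UTF-16 example above.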
The name ZWNBSP (zero width no-break space) should be used if U+FEFF appears in the middle of a data stream. Unicode says that in this position it should be interpreted as a normal codepoint (namely a word joiner), not as a BOM. Since Unicode 3.2, this usage has been deprecated in favor of U+2060 WORD JOINER.
The Unicode 1.0 name for this codepoint is also BYTE ORDER MARK.
UTF-8
The UTF-8 representation of the BOM is the (hexadecimal) byte sequence EF BB BF.
The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. UTF-8 always has the same byte order, so the BOM's only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work. The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."
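The round-tripping concern can be sketched in Python, whose standard `utf-8-sig` codec drops one leading EF BB BF on decoding and writes one on encoding. The function names `split_utf8_signature` and `join_utf8_signature` are hypothetical; this is one way to preserve a signature across processing, not a prescribed method.

```python
import codecs
from typing import Tuple

def split_utf8_signature(data: bytes) -> Tuple[bool, str]:
    """Return (had_bom, text) for UTF-8 input, with any signature removed from text."""
    had_bom = data.startswith(codecs.BOM_UTF8)   # EF BB BF
    text = data.decode("utf-8-sig")              # strips one leading BOM if present
    return had_bom, text

def join_utf8_signature(had_bom: bool, text: str) -> bytes:
    """Re-encode text, restoring the signature only if the input had one."""
    return text.encode("utf-8-sig" if had_bom else "utf-8")
```

Remembering whether the signature was present lets a tool work on clean text internally while writing back exactly the bytes a BOM-dependent consumer expects.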
An example of not following this recommendation is the IETF Syslog protocol, which requires text to be in UTF-8 and also requires the BOM.
Not using a BOM allows text to be backwards-compatible with software designed for extended ASCII. For instance, many programming languages permit non-ASCII bytes in string literals but not at the start of the file.
A BOM is unnecessary for detecting UTF-8 encoding. UTF-8 is a sparse encoding: a large fraction of possible byte combinations do not result in valid UTF-8 text. Binary data and text in any other encoding are likely to contain byte sequences that are invalid as UTF-8, so existence of such invalid sequences indicates the file is not UTF-8, while lack of invalid sequences is a very strong indication the text ''is'' UTF-8. Practically the only exception is text containing only ASCII-range bytes, as this may be a non-ASCII 7-bit encoding, but this is unlikely in any modern data and even then the difference from ASCII is minor (such as changing '\' to '¥').
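The validity heuristic described above can be sketched in a few lines of Python. The function name `classify` and its three return labels are hypothetical, and the sketch assumes the whole byte string is available (a buffer cut in the middle of a multi-byte character would be reported as invalid).

```python
def classify(data: bytes) -> str:
    """Classify bytes as 'ascii', 'utf-8', or 'not utf-8' using strict decoding."""
    if data.isascii():
        # Only ASCII-range bytes: valid UTF-8, but could also be another 7-bit encoding.
        return "ascii"
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # At least one byte sequence is invalid as UTF-8.
        return "not utf-8"
    # Non-ASCII bytes are present and every sequence is well-formed UTF-8.
    return "utf-8"
```

For example, the word "naïve" encoded as Windows-1252 contains the lone byte EF followed by an ASCII letter, which is not a well-formed UTF-8 sequence, so it is rejected.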
Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad (prior to Windows 10 Build 1903), treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII.
Windows PowerShell (up to 5.1) will add a BOM when it saves UTF-8 XML documents. However, PowerShell Core 6 added a utf8NoBOM value for the -Encoding switch on some cmdlets so that documents can be saved without a BOM.
Google Docs also adds a BOM when converting a document to a plain text file for download.
UTF-16
In UTF-16, a BOM (U+FEFF) may be placed as the first bytes of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream. If an attempt is made to read this stream with the wrong endianness, the bytes will be swapped, thus delivering the character U+FFFE, which is defined by Unicode as a "noncharacter" that should never appear in the text.
* If the 16-bit units are represented in big-endian byte order ("UTF-16BE"), the BOM is the (hexadecimal) byte sequence FE FF
* If the 16-bit units use little-endian order ("UTF-16LE"), the BOM is the (hexadecimal) byte sequence FF FE
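The two cases above, and the effect of reading with the wrong byte order, can be illustrated with a short Python sketch. The function name `utf16_order_from_bom` is hypothetical; the constants come from Python's standard `codecs` module.

```python
import codecs
from typing import Optional

def utf16_order_from_bom(data: bytes) -> Optional[str]:
    """Return 'utf-16-be' or 'utf-16-le' based on a leading BOM, else None."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    return None

# A little-endian stream that starts with a BOM.
sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# Deliberately decoding with the wrong byte order turns the BOM into U+FFFE.
wrong = sample.decode("utf-16-be")

if __name__ == "__main__":
    print(utf16_order_from_bom(sample), hex(ord(wrong[0])))
```

Note that the explicit `utf-16-le` and `utf-16-be` codecs leave the BOM in the decoded text as a leading U+FEFF, so a caller that uses this sketch still has to drop the first character itself.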
For the IANA registered charsets UTF-16BE and UTF-16LE, a byte-order mark should not be used because the names of these character sets already determine the byte order.
If there is no BOM, it is possible to guess whether the text is UTF-16 and its byte order by searching for ASCII characters (i.e. a 0 byte adjacent to a byte in the 0x20-0x7E range, also 0x0A and 0x0D for LF and CR). A large number (i.e. far higher than random chance) in the same order is a very good indication of UTF-16, and whether the 0 is in the even or odd bytes indicates the byte order. However, this can result in ''both'' false positives and false negatives.
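A rough Python sketch of this guessing approach follows. The function name `guess_utf16_order` and the 0.5 threshold are assumptions chosen for illustration; a real detector would need tuning and would still be subject to the false positives and false negatives noted above.

```python
from typing import Optional

def _is_ascii_text(b: int) -> bool:
    """True for printable ASCII (0x20-0x7E), LF (0x0A), or CR (0x0D)."""
    return 0x20 <= b <= 0x7E or b in (0x0A, 0x0D)

def guess_utf16_order(data: bytes, threshold: float = 0.5) -> Optional[str]:
    """Guess UTF-16 byte order from where zero bytes sit next to ASCII bytes."""
    pairs = len(data) // 2
    if pairs == 0:
        return None
    be_hits = le_hits = 0
    for i in range(0, pairs * 2, 2):
        first, second = data[i], data[i + 1]
        if first == 0 and _is_ascii_text(second):
            be_hits += 1   # 00 xx: zero in the even position, big-endian
        elif second == 0 and _is_ascii_text(first):
            le_hits += 1   # xx 00: zero in the odd position, little-endian
    if be_hits / pairs >= threshold and be_hits > le_hits:
        return "utf-16-be"
    if le_hits / pairs >= threshold and le_hits > be_hits:
        return "utf-16-le"
    return None
```

Text with few ASCII characters, such as UTF-16 text written entirely in Chinese characters, contains almost no zero bytes, so this sketch returns `None` for it: one of the false negatives the paragraph warns about.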
Clause D98 of conformance (section 3.10) of the Unicode standard states, "The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian." Whether or not a higher-level protocol is in force is open to interpretation. Files local to a computer for which the native byte ordering is little-endian, for example, might be argued to be encoded as UTF-16LE implicitly. Therefore, the presumption of big-endian is widely ignored. The W3C/WHATWG encoding standard used in HTML5 specifies that content labelled either "utf-16" or "utf-16le" is to be interpreted as little-endian "to deal with deployed content". However, if a byte-order mark is present, then that BOM is to be treated as "more authoritative than anything else".
UTF-32
Although a BOM could be used with UTF-32, this encoding is rarely used for transmission. Otherwise the same rules as for UTF-16 are applicable.
The BOM for little-endian UTF-32 is the same pattern as a little-endian UTF-16 BOM followed by a UTF-16 NUL character, an unusual example of the BOM being the same pattern in two different encodings. Programmers using the BOM to identify the encoding will have to decide whether UTF-32 or UTF-16 with a NUL first character is more likely.
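The ambiguity can be made concrete with a small BOM sniffer in Python. The function name `sniff_bom` and the decision to prefer UTF-32 when the bytes FF FE 00 00 are seen are assumptions of this sketch; a program expecting UTF-16 text that may begin with NUL would need the opposite preference.

```python
import codecs
from typing import Optional

# Longer signatures are tested first: FF FE 00 00 (UTF-32LE) begins with
# FF FE (UTF-16LE), so checking UTF-16LE first would hide every UTF-32LE BOM.
_SIGNATURES = (
    (codecs.BOM_UTF32_BE, "utf-32-be"),   # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),   # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),           # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),   # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),   # FF FE
)

def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name
    return None
```

With this ordering, a UTF-16LE stream consisting of a BOM followed by a NUL character is reported as `utf-32-le`, which is exactly the trade-off the paragraph describes.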
Byte-order marks by encoding
This table illustrates how the BOM is represented as a byte sequence in various encodings and how those sequences might appear in a text editor that is interpreting each byte as a legacy encoding (Windows-1252 and caret notation for the C0 controls):
See also
* Left-to-right mark
* Arabic Presentation Forms-B, block to which code point U+FEFF belongs
External links
* Unicode FAQ: ''UTF-8, UTF-16, UTF-32 & BOM''
* The Unicode Standard, chapter 2.6 ''Encoding Schemes''
* The Unicode Standard, chapter 2.13 ''Special Characters and Noncharacters'', section ''Byte Order Mark (BOM)''
* The Unicode Standard, chapter 16.8 ''Specials'', section ''Byte Order Mark (BOM): U+FEFF''
* ''The byte-order mark (BOM) in HTML''