HOME

TheInfoList



OR:

The byte order mark (BOM) is a particular usage of the special
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, wh ...
character, , whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text: * The byte order, or
endianness In computing, endianness, also known as byte sex, is the order or sequence of bytes of a word of digital data in computer memory. Endianness is primarily expressed as big-endian (BE) or little-endian (LE). A big-endian system stores the mos ...
, of the text stream in the cases of 16-bit and 32-bit encodings; * The fact that the text stream's encoding is Unicode, to a high level of confidence; * Which Unicode character encoding is used. BOM use is optional. Its presence interferes with the use of
UTF-8 UTF-8 is a variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit''. UTF-8 is capable of ...
by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream. Unicode can be encoded in units of 8-bit, 16-bit, or 32-bit integers. For the 16- and 32-bit representations, a computer receiving text from arbitrary sources needs to know which byte order the integers are encoded in. The BOM is encoded in the same scheme as the rest of the document and becomes a Unicode code point if its bytes are swapped. Hence, the process accessing the text can examine these first few bytes to determine the endianness, without requiring some contract or metadata outside of the text stream itself. Generally the receiving computer will swap the bytes to its own endianness, if necessary, and would no longer need the BOM for processing. The byte sequence of the BOM differs per Unicode encoding (including ones outside the Unicode standard such as
UTF-7 UTF-7 (7- bit Unicode Transformation Format) is an obsolete variable-length character encoding for representing Unicode text using a stream of ASCII characters. It was originally intended to provide a means of encoding Unicode text for use in In ...
, see table below), and none of the sequences is likely to appear at the start of text streams stored in other encodings. Therefore, placing an encoded BOM at the start of a text stream can indicate that the text is Unicode and identify the encoding scheme used. This use of the BOM character is called a "Unicode signature".


Usage

If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a " zero-width non-breaking space" (inhibits line-breaking between word-glyphs). In Unicode 3.2, this usage is deprecated in favor of the "
Word Joiner The word joiner (WJ) is a format character in Unicode used to indicate that word separation should not occur at a position, when using scripts such as Arabic that do not use explicit spacing. It is encoded since Unicode version 3.2 (released i ...
" character, U+2060. This allows U+FEFF to be used only as a BOM.


UTF-8

The
UTF-8 UTF-8 is a variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit''. UTF-8 is capable of ...
representation of the BOM is the ( hexadecimal) byte sequence 0xEF,0xBB,0xBF. The Unicode Standard permits the BOM in
UTF-8 UTF-8 is a variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode'' (or ''Universal Coded Character Set'') ''Transformation Format 8-bit''. UTF-8 is capable of ...
, but does not require or recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work. The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature." An example of not following this recommendation is the IETF
Syslog In computing, syslog is a standard for message logging. It allows separation of the software that generates messages, the system that stores them, and the software that reports and analyzes them. Each message is labeled with a facility code, i ...
protocol which requires text to be in UTF-8 and also requires the BOM. Not using a BOM allows text to be backwards-compatible with some software that is not Unicode-aware. Examples include programming languages that permit non-
ASCII ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because ...
bytes in
string literal A string literal or anonymous string is a string value in the source code of a computer program. Modern programming languages commonly use a quoted sequence of characters, formally " bracketed delimiters", as in x = "foo", where "foo" is a string ...
s but not at the start of the file. UTF-8 is a sparse encoding in the sense that a large fraction of possible byte combinations do not result in valid UTF-8 text. Binary data and text in any other encoding are likely to contain byte sequences that are invalid as UTF-8. Practically the only exceptions to that are when the text consists purely of ASCII-range bytes. Because all modern encodings use ASCII-range bytes to represent ASCII characters, ASCII-only text can be safely interpreted as UTF-8 regardless of what encoding was intended by the system that emitted the bytes. Because of these considerations, heuristic analysis can detect with high confidence whether UTF-8 is in use, without requiring a BOM.
Microsoft Microsoft Corporation is an American multinational technology corporation producing computer software, consumer electronics, personal computers, and related services headquartered at the Microsoft Redmond campus located in Redmond, Washin ...
compilers and interpreters, and many pieces of software on Microsoft Windows such as
Notepad A notebook (also known as a notepad, writing pad, drawing pad, or legal pad) is a book or stack of paper pages that are often ruled and used for purposes such as note-taking, journaling or other writing, drawing, or scrapbooking. History ...
treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII.
Windows PowerShell PowerShell is a task automation and configuration management program from Microsoft, consisting of a command-line shell and the associated scripting language. Initially a Windows component only, known as Windows PowerShell, it was made open-so ...
(up to 5.1) will add a BOM when it saves UTF-8 XML documents. However, PowerShell Core 6 has added a -Encoding switch on some cmdlets called utf8NoBOM so that document can be saved without BOM. Google Docs also adds a BOM when converting a document to a
plain text In computing, plain text is a loose term for data (e.g. file contents) that represent only characters of readable material but not its graphical representation nor other objects (floating-point numbers, images, etc.). It may also include a limit ...
file for download.


UTF-16

In
UTF-16 UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as cod ...
, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit
code unit Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values that ...
of the file or stream. If an attempt is made to read this stream with the wrong endianness, the bytes will be swapped, thus delivering the character U+FFFE, which is defined by Unicode as a " " that should never appear in the text. * If the 16-bit units are represented in
big-endian In computing, endianness, also known as byte sex, is the order or sequence of bytes of a word of digital data in computer memory. Endianness is primarily expressed as big-endian (BE) or little-endian (LE). A big-endian system stores the most sig ...
byte order, the BOM will appear in the sequence of bytes as 0xFE 0xFF * If the 16-bit units use
little-endian In computing, endianness, also known as byte sex, is the order or sequence of bytes of a word of digital data in computer memory. Endianness is primarily expressed as big-endian (BE) or little-endian (LE). A big-endian system stores the most si ...
order, the BOM will appear in the sequence of bytes as 0xFF 0xFE Neither of these sequences is valid UTF-8, so their presence indicates that the file is not encoded in UTF-8. For the IANA registered charsets UTF-16BE and UTF-16LE, a byte order mark should not be used because the names of these character sets already determine the byte order. If encountered anywhere in such a text stream, U+FEFF is to be interpreted as a "zero width no-break space". If there is no BOM, it is possible to guess whether the text is UTF-16 and its byte order by searching for ASCII characters (i.e. a 0 byte adjacent to a byte in the 0x20-0x7E range, also 0x0A and 0x0D for CR and LF). A large number (i.e. far higher than random chance) in the same order is a very good indication of UTF-16 and whether the 0 is in the even or odd bytes indicates the byte order. However, this can result in ''both'' false positives and false negatives. Clause D98 of conformance (section 3.10) of the Unicode standard states, "The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian." Whether or not a higher-level protocol is in force is open to interpretation. Files local to a computer for which the native byte ordering is little-endian, for example, might be argued to be encoded as UTF-16LE implicitly. Therefore, the presumption of big-endian is widely ignored. The W3C/
WHATWG The Web Hypertext Application Technology Working Group (WHATWG) is a community of people interested in evolving HTML and related technologies. The WHATWG was founded by individuals from Apple Inc., the Mozilla Foundation and Opera Software, l ...
encoding standard used in HTML5 specifies that content labelled either "utf-16" or "utf-16le" are to be interpreted as little-endian "to deal with deployed content". However, if a byte-order mark is present, then that BOM is to be treated as "more authoritative than anything else". Programs that interpret UTF-16 as a byte-based encoding may display a garbled mess of characters, but ASCII characters would be recognizable because the low byte of the UTF-16 representation is the same as the ASCII code and therefore would be displayed the same. The upper byte of 0 may be displayed as nothing, white space, a period, or some other unvarying glyph.


UTF-32

Although a BOM could be used with
UTF-32 UTF-32 (32- bit Unicode Transformation Format) is a fixed-length encoding used to encode Unicode code points that uses exactly 32 bits (four bytes) per code point (but a number of leading bits must be zero as there are far fewer than 232 Unicode ...
, this encoding is rarely used for transmission. Otherwise the same rules as for
UTF-16 UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as cod ...
are applicable. The BOM for little-endian UTF-32 is the same pattern as a little-endian UTF-16 BOM followed by a NUL character, an unusual example of the BOM being the same pattern in two different encodings. Programmers using the BOM to identify the encoding will have to decide whether UTF-32 or a NUL first character is more likely.


Byte order marks by encoding

This table illustrates how the BOM character is represented as a byte sequence in various encodings and how those sequences might appear in a text editor that is interpreting each byte as a legacy encoding (
CP1252 Windows-1252 or CP-1252 (code page 1252) is a single-byte character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and many European languages including Spanish, French, and German. It i ...
and
caret notation Caret notation is a notation for control characters in ASCII. The notation assigns to control-code 1, sequentially through the alphabet to assigned to control-code 26 (0x1A). For the control-codes outside of the range 1–26, the ...
for the
C0 controls The C0 and C1 control code or control character sets define control codes for use in text by computer systems that use ASCII and derivatives of ASCII. The codes represent non-printable character, additional information about the text, such as t ...
):


See also

*
Left-to-right mark The left-to-right mark (LRM) is a control character (an invisible formatting character) used in computerized typesetting (including word processing in a program like Microsoft Word) of text containing a mix of left-to-right scripts (such as Latin ...
*
Arabic Presentation Forms-B Arabic Presentation Forms-B is a Unicode block encoding spacing forms of Arabic diacritics, and contextual letter forms. The special codepoint ZWNBSP is also here, which is only meant for a byte order mark The byte order mark (BOM) is a parti ...
, block to which code point U+FEFF belongs


References


External links


Unicode FAQ: ''UTF-8, UTF-16, UTF-32 & BOM''

The Unicode Standard, chapter 2.6 ''Encoding Schemes''

The Unicode Standard, chapter 2.13 ''Special Characters and Noncharacters'', section ''Byte Order Mark (BOM)''

The Unicode Standard, chapter 16.8 ''Specials'', section ''Byte Order Mark (BOM): U+FEFF''
{{DEFAULTSORT:Byte Order Mark Unicode special code points