The bfloat16 (Brain Floating Point) floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. This format is a truncated (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32) with the intent of accelerating machine learning and near-sensor computing. It preserves the approximate dynamic range of 32-bit floating-point numbers by retaining 8 exponent bits, but supports only an 8-bit precision rather than the 24-bit significand of the binary32 format. More so than single-precision 32-bit floating-point numbers, bfloat16 numbers are unsuitable for integer calculations, but this is not their intended use. Bfloat16 is used to reduce the storage requirements and increase the calculation speed of machine learning algorithms.

The bfloat16 format was developed by Google Brain, an artificial intelligence research group at Google. It is utilized in Intel AI processors such as the Nervana NNP-L1000, in Xeon processors (AVX-512 BF16 extensions) and Intel FPGAs, and in Google Cloud TPUs and TensorFlow. ARMv8.6-A, AMD ROCm, and CUDA also support the bfloat16 format. On these platforms, bfloat16 may also be used in mixed-precision arithmetic, where bfloat16 numbers may be operated on and expanded to wider data types.
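
As an illustration of such mixed-precision use, the following Python sketch (not from the original article; it uses only the standard struct module) stores values as bfloat16 bit patterns and widens them back to binary32 for arithmetic:

import struct

def to_bf16(x):
    # Narrow to bfloat16 by keeping the top 16 bits of the binary32 encoding.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 16

def from_bf16(bits):
    # Widen back to binary32 by appending 16 zero bits; widening is exact.
    (x,) = struct.unpack(">f", struct.pack(">I", bits << 16))
    return x

# Operands are stored narrow; the multiply-accumulate runs in a wider type.
weights = [to_bf16(w) for w in (0.5, -1.25, 3.0)]
inputs  = [to_bf16(v) for v in (2.0, 4.0, 0.5)]
acc = 0.0  # wide accumulator
for w, v in zip(weights, inputs):
    acc += from_bf16(w) * from_bf16(v)
print(acc)  # -2.5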


bfloat16 floating-point format

bfloat16 has the following format:

* Sign bit: 1 bit
* Exponent width: 8 bits
* Significand precision: 8 bits (7 explicitly stored), as opposed to 24 bits in a classical single-precision floating-point format

The bfloat16 format, being a truncated IEEE 754 single-precision 32-bit float, allows for fast conversion to and from an IEEE 754 single-precision 32-bit float; in conversion to the bfloat16 format, the exponent bits are preserved while the significand field can be reduced by truncation (thus corresponding to round toward zero), ignoring the NaN special case. Preserving the exponent bits maintains the 32-bit float's range of approximately 10^−38 to approximately 3 × 10^38.
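
A minimal sketch of this narrowing conversion in Python (an illustration, not part of the original article): truncation implements the round-toward-zero behavior described above, while the round-to-nearest-even variant shown for comparison is the rounding typically used by hardware conversion instructions. Both assume the input is not NaN, matching the simplification above.

import struct

def f32_bits(x):
    # Reinterpret a float as its 32-bit IEEE 754 single-precision pattern.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

def bf16_truncate(x):
    # Round toward zero: simply drop the low 16 bits.
    return f32_bits(x) >> 16

def bf16_round_nearest_even(x):
    # Round half to even; a NaN input would need special-casing here.
    bits = f32_bits(x)
    bias = 0x7FFF + ((bits >> 16) & 1)
    return (bits + bias) >> 16

print(hex(bf16_truncate(1.0)))                   # 0x3f80
print(hex(bf16_round_nearest_even(3.14159265)))  # 0x4049, i.e. 3.140625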


Contrast with bfloat16 and single precision

The bits are laid out as follows:

bfloat16:         S EEEEEEEE FFFFFFF                           (1 + 8 + 7 = 16 bits)
IEEE 754 single:  S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF           (1 + 8 + 23 = 32 bits)

Legend: S = sign bit, E = exponent bit, F = fraction (significand) bit


Exponent encoding

The bfloat16 binary floating-point exponent is encoded using an offset-binary representation, with the zero offset being 127; this is also known as the exponent bias in the IEEE 754 standard.

* Emin = 01hex − 7Fhex = −126
* Emax = FEhex − 7Fhex = 127
* Exponent bias = 7Fhex = 127

Thus, in order to get the true exponent as defined by the offset-binary representation, the offset of 127 has to be subtracted from the value of the exponent field. The minimum and maximum values of the exponent field (00hex and FFhex) are interpreted specially, as in the IEEE 754 standard formats. The minimum positive normal value is 2^−126 ≈ 1.18 × 10^−38 and the minimum positive (subnormal) value is 2^(−126−7) = 2^−133 ≈ 9.2 × 10^−41.
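
To make the bias concrete, here is a small Python decoder (an illustrative sketch, not from the original article) that extracts the three fields and applies the rules above, including the special interpretations of exponent fields 00hex and FFhex:

def decode_bf16(bits):
    sign = -1.0 if (bits >> 15) & 1 else 1.0
    exp  = (bits >> 7) & 0xFF        # biased exponent field
    frac = bits & 0x7F               # 7 explicitly stored significand bits
    if exp == 0x00:                  # zero and subnormals: no implicit leading 1
        return sign * (frac / 128) * 2.0 ** -126
    if exp == 0xFF:                  # infinities and NaNs
        return sign * float("inf") if frac == 0 else float("nan")
    return sign * (1 + frac / 128) * 2.0 ** (exp - 127)  # subtract the bias

print(decode_bf16(0x3F80))   # 1.0  (exponent field 7Fhex, true exponent 0)
print(decode_bf16(0x0080))   # 2**-126, the minimum positive normal value
print(decode_bf16(0x0001))   # 2**-133, the minimum positive subnormal value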


Encoding of special values


Positive and negative infinity

Just as in IEEE 754, positive and negative infinity are represented with their corresponding sign bits, all 8 exponent bits set (FFhex) and all significand bits zero. Explicitly,

val    s_exponent_signcnd
+inf = 0_11111111_0000000
-inf = 1_11111111_0000000


Not a Number

Just as in IEEE 754, NaN values are represented with either sign bit, all 8 exponent bits set (FFhex) and not all significand bits zero. Explicitly,

val    s_exponent_signcnd
+NaN = 0_11111111_klmnopq
-NaN = 1_11111111_klmnopq

where at least one of k, l, m, n, o, p, or q is 1. As with IEEE 754, NaN values can be quiet or signaling, although there are no known uses of signaling bfloat16 NaNs as of September 2018.
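
The following Python sketch (illustrative, not from the original article) classifies a bfloat16 bit pattern according to these rules; distinguishing quiet from signaling NaNs by the most significant stored significand bit follows the usual IEEE 754 convention:

def classify_bf16(bits):
    exp  = (bits >> 7) & 0xFF
    frac = bits & 0x7F
    if exp != 0xFF:
        return "finite"
    if frac == 0:
        return "-infinity" if (bits >> 15) & 1 else "+infinity"
    # Top stored significand bit set => quiet NaN, otherwise signaling NaN.
    return "qNaN" if frac & 0x40 else "sNaN"

print(classify_bf16(0x7F80))   # +infinity
print(classify_bf16(0xFFC1))   # qNaN
print(classify_bf16(0xFF81))   # sNaN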


Range and precision

Bfloat16 is designed to maintain the number range from the 32-bit IEEE 754 single-precision floating-point format (binary32), while reducing the precision from 24 bits to 8 bits. This means that the precision is between two and three decimal digits, and bfloat16 can represent finite values up to about 3.4 × 10^38.
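
These figures follow directly from the field widths, as this short Python check (illustrative, not from the original article) shows:

import math

# An 8-bit significand (7 stored + 1 implicit) resolves about
# log10(2^8) decimal digits.
print(8 * math.log10(2))                    # 2.408..., i.e. 2-3 digits

# Largest finite value: exponent field FEhex (true exponent 127)
# with all seven stored significand bits set.
max_bf16 = (2**8 - 1) * 2.0**-7 * 2.0**127
print(max_bf16)                             # about 3.39e38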


Examples

These examples are given in bit representation, in hexadecimal and binary, of the floating-point value. This includes the sign, (biased) exponent, and significand.

3f80 = 0 01111111 0000000 = 1
c000 = 1 10000000 0000000 = −2
7f7f = 0 11111110 1111111 = (2^8 − 1) × 2^−7 × 2^127 ≈ 3.38953139 × 10^38 (max finite positive value in bfloat16 precision)
0080 = 0 00000001 0000000 = 2^−126 ≈ 1.175494351 × 10^−38 (min normalized positive value in bfloat16 precision and single-precision floating point)

The maximum positive finite value of a normal bfloat16 number is 3.38953139 × 10^38, slightly below (2^24 − 1) × 2^−23 × 2^127 = 3.402823466 × 10^38, the max finite positive value representable in single precision.


Zeros and infinities

0000 = 0 00000000 0000000 = 0
8000 = 1 00000000 0000000 = −0

7f80 = 0 11111111 0000000 = infinity
ff80 = 1 11111111 0000000 = −infinity


Special values

4049 = 0 10000000 1001001 = 3.140625 ≈ π (pi)
3eab = 0 01111101 0101011 = 0.333984375 ≈ 1/3


NaNs

ffc1 = x 11111111 1000001 => qNaN
ff81 = x 11111111 0000001 => sNaN


See also

* Half-precision floating-point format: 16-bit float with 1-bit sign, 5-bit exponent, and 11-bit significand, as defined by IEEE 754
* ISO/IEC 10967, Language Independent Arithmetic
* Primitive data type
* Minifloat
* Google Brain

