Minifloat
In computing, minifloats are floating-point values represented with very few bits. Predictably, they are not well suited for general-purpose numerical calculations. They are used for special purposes, most often in computer graphics, where iterations are small and precision has aesthetic effects. Machine learning also uses similar formats like bfloat16. Additionally, they are frequently encountered as a pedagogical tool in computer-science courses to demonstrate the properties and structures of floating-point arithmetic and IEEE 754 numbers.

Minifloats with 16 bits are half-precision numbers (as opposed to single and double precision). There are also minifloats with 8 bits or even fewer.

Minifloats can be designed following the principles of the IEEE 754 standard. In this case they must obey the (not explicitly written) rules for the frontier between subnormal and normal numbers and must have special patterns for infinity and NaN. Normalized numbers are stored with a biased exponent. The new revision of the standard, IEEE 754-2008, has 16-bit binary minifloats.

The Radeon R300 and R420 GPUs used an "fp24" floating-point format with 7 bits of exponent and 16 bits (+1 implicit) of mantissa. "Full Precision" in Direct3D 9.0 is a proprietary 24-bit floating-point format. Microsoft's D3D9 (Shader Model 2.0) graphics API initially supported both FP24 (as in ATI's R300 chip) and FP32 (as in Nvidia's NV30 chip) as "Full Precision", as well as FP16 as "Partial Precision" for vertex and pixel shader calculations performed by the graphics hardware.


Notation

A minifloat is usually described using a tuple of four numbers, (''S'', ''E'', ''M'', ''B''):

* ''S'' is the length of the sign field. It is usually either 0 or 1.
* ''E'' is the length of the exponent field.
* ''M'' is the length of the mantissa (significand) field.
* ''B'' is the exponent bias.

A minifloat format denoted by (''S'', ''E'', ''M'', ''B'') is, therefore, ''S'' + ''E'' + ''M'' bits long. In computer graphics minifloats are sometimes used to represent only integral values. If at the same time subnormal values should exist, the least subnormal number has to be 1. The bias value would be ''B'' = 1 − ''M'' in this case, assuming two special exponent values are used per IEEE. The (''S'', ''E'', ''M'', ''B'') notation can be converted to a (''B'', ''P'', ''L'', ''U'') format (base, precision, least and greatest normalized exponent) as (2, ''M'' + 1, 1 − ''B'', 2^''E'' − 2 − ''B'') (with IEEE use of exponents).


Example

In this example, a minifloat in 1 byte (8 bits) with 1 sign bit, 4 exponent bits and 3 significand bits (in short, a 1.4.3.−2 minifloat) is used to represent integral values. All IEEE 754 principles should be valid. The only free value is the exponent bias, which we define as −2 for integers. The unknown exponent is called ''x'' for the moment. Numbers in a different base are marked with a subscript, for example, 101₂ = 5. The bit patterns have spaces to visualize their parts.


Representation of zero

0 0000 000 = 0


Subnormal numbers

The significand is extended with "0.":

0 0000 001 = 0.001₂ × 2^''x'' = 0.125 × 2^''x'' = 1 (least subnormal number)
...
0 0000 111 = 0.111₂ × 2^''x'' = 0.875 × 2^''x'' = 7 (greatest subnormal number)


Normalized numbers

The significand is extended with "1.":

0 0001 000 = 1.000₂ × 2^''x'' = 1 × 2^''x'' = 8 (least normalized number)
0 0001 001 = 1.001₂ × 2^''x'' = 1.125 × 2^''x'' = 9
...
0 0010 000 = 1.000₂ × 2^(''x''+1) = 1 × 2^(''x''+1) = 16
0 0010 001 = 1.001₂ × 2^(''x''+1) = 1.125 × 2^(''x''+1) = 18
...
0 1110 000 = 1.000₂ × 2^(''x''+13) = 1.000 × 2^(''x''+13) = 65536
0 1110 001 = 1.001₂ × 2^(''x''+13) = 1.125 × 2^(''x''+13) = 73728
...
0 1110 110 = 1.110₂ × 2^(''x''+13) = 1.750 × 2^(''x''+13) = 114688
0 1110 111 = 1.111₂ × 2^(''x''+13) = 1.875 × 2^(''x''+13) = 122880 (greatest normalized number)
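The decoding rules above can be written directly in Python. This is a sketch for the 1.4.3.−2 example format (the function name is invented for illustration); it handles the subnormal, normalized, and special exponent patterns:

```python
def decode_143(byte):
    """Decode an 8-bit 1.4.3.-2 minifloat (returns a float, inf, or nan)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF            # 4 exponent bits
    man = byte & 0x7                   # 3 significand bits
    bias = -2
    if exp == 0:                       # subnormal: significand is 0.mmm
        return sign * (man / 8) * 2.0 ** (1 - bias)
    if exp == 15:                      # all-ones exponent: Inf or NaN
        return sign * float("inf") if man == 0 else float("nan")
    return sign * (1 + man / 8) * 2.0 ** (exp - bias)  # normalized: 1.mmm

print(decode_143(0b0_0000_001))   # 1.0      (least subnormal)
print(decode_143(0b0_0001_000))   # 8.0      (least normalized)
print(decode_143(0b0_0001_001))   # 9.0
print(decode_143(0b0_1110_111))   # 122880.0 (greatest normalized)
```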


Infinity

0 1111 000 = +infinity
1 1111 000 = −infinity

If the exponent field were not treated specially, the value would be

0 1111 000 = 1.000₂ × 2^(''x''+14) = 2^17 = 131072


Not a number

''x'' 1111 ''yyy'' = NaN (if ''yyy'' ≠ 000)

Without the IEEE 754 special handling of the largest exponent, the greatest possible value would be

0 1111 111 = 1.111₂ × 2^(''x''+14) = 1.875 × 2^17 = 245760


Value of the bias

If the least subnormal value (second line above) should be 1, the value of ''x'' has to be ''x'' = 3. Therefore, the bias has to be −2; that is, every stored exponent has to be increased by 2 (equivalently, the bias of −2 has to be subtracted) to get the numerical exponent.
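The derivation is a few lines of plain arithmetic, sketched here as a check:

```python
# Check of the bias derivation for the 1.4.3 example (plain arithmetic):
M = 3                                  # significand bits in the example
least_subnormal_significand = 2 ** -M  # 0.001_2 = 0.125
x = 3                                  # required so that 0.125 * 2**x == 1
assert least_subnormal_significand * 2 ** x == 1

# IEEE-style, the actual exponent is (stored exponent) - bias, and the
# least normalized (and subnormal) exponent corresponds to stored value 1:
bias = 1 - x
print(bias)   # -2
```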


Table of values

This is a chart of all possible values with bias 1 when treating the float similarly to an IEEE float.
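Such a chart can also be reproduced programmatically. The following sketch (written for the 1.4.3.−2 example of this article) enumerates all 256 bit patterns, inlining the decoding rules from the previous sections, and tallies the distinct categories:

```python
import math

values = []        # one entry per non-NaN bit pattern
nan_count = 0
for byte in range(256):
    sign = -1.0 if byte >> 7 else 1.0
    exp, man = (byte >> 3) & 0xF, byte & 0x7
    if exp == 15 and man != 0:
        nan_count += 1                                # NaN patterns
    elif exp == 15:
        values.append(sign * math.inf)                # +/- infinity
    elif exp == 0:
        values.append(sign * (man / 8) * 2.0 ** 3)    # subnormal, 2**(1 - bias)
    else:
        values.append(sign * (1 + man / 8) * 2.0 ** (exp + 2))  # exp - bias

print(nan_count)      # 14 bit patterns are NaN
print(len(values))    # 242 values (counting +0 and -0 separately)
print(max(v for v in values if math.isfinite(v)))     # 122880.0
```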


Properties of this example

Integral minifloats in 1 byte have a greater range of ±122 880 than a two's-complement integer with a range of −128 to +127. The greater range is compensated by poor precision, because there are only 4 significand bits (3 stored plus 1 implicit), equivalent to slightly more than one decimal digit. They also have greater range than half-precision minifloats, whose range is ±65 504, again compensated by the lack of fractions and poor precision.

There are only 242 different values (if +0 and −0 are regarded as different), because 14 of the bit patterns represent NaNs.

The values between 0 and 16 have the same bit pattern as minifloat or two's-complement integer. The first pattern with a different value is 00010001, which is 18 as a minifloat and 17 as a two's-complement integer. This coincidence does not occur at all with negative values, because this minifloat is a signed-magnitude format.

The (vertical) real line on the right shows clearly the varying density of the floating-point values – a property common to any floating-point system. This varying density results in a curve similar to the exponential function. Although the curve may appear smooth, it is not: the graph actually consists of distinct points that lie on line segments with discrete slopes. The value of the exponent bits determines the absolute precision of the mantissa bits, and it is this precision that determines the slope of each linear segment.


Arithmetic


Addition

The graphic demonstrates the addition of even smaller (1.3.2.3)-minifloats with 6 bits. This floating-point system follows the rules of IEEE 754 exactly. NaN as an operand always produces NaN results. Inf − Inf and (−Inf) + Inf result in NaN too (green area). Inf can be increased or decreased by finite values without change. Sums with finite operands can give an infinite result (e.g. 14.0 + 3.0 = +Inf as a result is the cyan area, −Inf is the magenta area). The range of the finite operands is filled with the curves ''x'' + ''y'' = ''c'', where ''c'' is always one of the representable float values (blue and red for positive and negative results respectively).
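The overflow behaviour can be simulated with a simplified sketch: decode both (1.3.2.3) operands to ordinary floats, add exactly, and round back to the nearest representable value, overflowing to infinity beyond the largest finite value, 14.0. The function names are invented, and the rounding here picks the smaller magnitude on ties rather than implementing full IEEE round-to-nearest-even:

```python
import math

def finite_values_1323():
    """All non-negative finite values of the (1,3,2,3) format."""
    vals = [(m / 4) * 2.0 ** -2 for m in range(4)]   # subnormals: exp = 1 - bias
    for e in range(1, 7):                            # stored exponents 1..6
        vals += [(1 + m / 4) * 2.0 ** (e - 3) for m in range(4)]
    return vals

POS = sorted(set(finite_values_1323()))   # 0.0 .. 14.0

def round_1323(x):
    """Round a real number to a (1,3,2,3) value (simplified rules)."""
    sign = math.copysign(1.0, x)
    x = abs(x)
    if x > POS[-1]:                # beyond 14.0: overflow to infinity
        return sign * math.inf     # (simplified: no near-overflow rounding check)
    return sign * min(POS, key=lambda v: abs(v - x))

print(round_1323(14.0 + 3.0))   # inf  (the cyan area of the addition chart)
print(round_1323(2.5 + 4.5))    # 7.0  (exactly representable sum)
```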


Subtraction, multiplication and division

The other arithmetic operations can be illustrated similarly:

* image:MinifloatSubtraction_1_3_2_3_72.png, Subtraction
* image:MinifloatMultiplication_1_3_2_3_72.png, Multiplication
* image:MinifloatDivision_1_3_2_3_72.png, Division


In embedded devices

Minifloats are also commonly used in embedded devices, especially on microcontrollers where floating-point may need to be emulated in software. To speed up the computation, the mantissa typically occupies exactly half of the bits, so the register boundary automatically separates the parts without shifting.


See also

* Fixed-point arithmetic *
Binary scaling In computing, fixed-point is a method of representing fractional (non-integer) numbers by storing a fixed number of digits of their fractional part. Dollar amounts, for example, are often stored with exactly two fractional digits, represent ...
*
Half-precision floating-point format In computing, half precision (sometimes called FP16) is a binary floating-point computer number format that occupies 16 bits (two bytes in modern computers) in computer memory. It is intended for storage of floating-point values in applications ...
* bfloat16 floating-point format * G.711 A-Law *
Fixed float {{Tone, date=December 2022 In computing, fixed float describes a method of representing real numbers in a way that number and decimal point value is stored at different location or bytes in a memory allocated to variable unlike floating point. In a ...



Further reading


Khronos Vulkan unsigned 11-bit floating point format



External links


OpenGL half float pixel