Binary representation of floating-point numbers (#737)

* Add "Binary representation of floating point numbers" section.

* Adding a half-precision explanatory picture.

* Binary representation of the floating-point numbers.
Oleksii Trekhleb 2021-07-16 11:51:53 +02:00 committed by GitHub
parent 433515f1b2
commit b2d1ec83f0
9 changed files with 397 additions and 0 deletions

@ -68,6 +68,7 @@ a set of rules that precisely define a sequence of operations.
* **Math**
* `B` [Bit Manipulation](src/algorithms/math/bits) - set/get/update/clear bits, multiplication/division by two, make negative etc.
* `B` [Binary Floating Point](src/algorithms/math/binary-floating-point) - binary representation of the floating-point numbers.
* `B` [Factorial](src/algorithms/math/factorial)
* `B` [Fibonacci Number](src/algorithms/math/fibonacci) - classic and closed-form versions
* `B` [Prime Factors](src/algorithms/math/prime-factors) - finding prime factors and counting them using Hardy-Ramanujan's theorem

@ -0,0 +1,93 @@
# Binary representation of floating-point numbers
Have you ever wondered how computers store floating-point numbers like `3.1415` (𝝿) or `9.109 × 10⁻³¹` (the mass of the electron in kg) in memory, which consists of a finite number of ones and zeroes (aka bits)?
It seems pretty straightforward for integers (e.g. `17`). Let's say we have 16 bits (2 bytes) to store the number. In 16 bits we may store the integers in a range of `[0, 65535]`:
```text
(0000000000000000)₂ = (0)₁₀
(0000000000010001)₂ =
(1 × 2⁴) +
(0 × 2³) +
(0 × 2²) +
(0 × 2¹) +
(1 × 2⁰) = (17)₁₀
(1111111111111111)₂ =
(1 × 2¹⁵) +
(1 × 2¹⁴) +
(1 × 2¹³) +
(1 × 2¹²) +
(1 × 2¹¹) +
(1 × 2¹⁰) +
(1 × 2⁹) +
(1 × 2⁸) +
(1 × 2⁷) +
(1 × 2⁶) +
(1 × 2⁵) +
(1 × 2⁴) +
(1 × 2³) +
(1 × 2²) +
(1 × 2¹) +
(1 × 2⁰) = (65535)₁₀
```
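The same sum can be sketched in a few lines of JavaScript. The helper name `unsignedBitsToInteger` is made up for this illustration and is not part of the repository:

```javascript
// Hypothetical helper: sums bit × 2^position for a string of '0'/'1' characters.
function unsignedBitsToInteger(bitString) {
  return bitString.split('').reduce(
    (valueSoFar, bit, index) => {
      // The leftmost bit has the highest power of two.
      const power = bitString.length - index - 1;
      return valueSoFar + Number(bit) * 2 ** power;
    },
    0,
  );
}

console.log(unsignedBitsToInteger('0000000000000000')); // 0
console.log(unsignedBitsToInteger('0000000000010001')); // 17
console.log(unsignedBitsToInteger('1111111111111111')); // 65535

// The built-in parseInt with radix 2 gives the same result.
console.log(parseInt('0000000000010001', 2)); // 17
```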
If we need a signed integer we may use [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement) and shift the range of `[0, 65535]` towards the negative numbers. In this case, our 16 bits would represent the numbers in a range of `[-32768, +32767]`.
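As a small sketch of that shift, we may look at the same 16 bits through an unsigned and a signed typed-array view. The variable names below are chosen just for this illustration:

```javascript
// The same three 16-bit patterns, stored once in a shared buffer.
const unsignedView = new Uint16Array([
  0b1111111111111111,
  0b1000000000000000,
  0b0111111111111111,
]);

// Int16Array interprets the very same bits as two's complement numbers.
const signedView = new Int16Array(unsignedView.buffer);

console.log(Array.from(unsignedView)); // [65535, 32768, 32767]
console.log(Array.from(signedView)); // [-1, -32768, 32767]
```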
As you might have noticed, this approach won't allow you to represent numbers like `-27.15625` (the digits after the decimal point are simply dropped).
We're not the first ones who have noticed this issue though. About 36 years ago some smart folks overcame this limitation by introducing the [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) standard for floating-point arithmetic.
The IEEE 754 standard describes the way (the framework) of using those 16 bits (or 32, or 64 bits) to store the numbers of wider range, including the small floating numbers (smaller than 1 and closer to 0).
To get the idea behind the standard we might recall the [scientific notation](https://en.wikipedia.org/wiki/Scientific_notation) - a way of expressing numbers that are too large or too small (usually would result in a long string of digits) to be conveniently written in decimal form.
![Scientific number notation](images/03-scientific-notation.png)
As you may see from the image, the number representation might be split into three parts:
- **sign**
- **fraction (aka significand)** - the valuable digits (the meaning, the payload) of the number
- **exponent** - controls how far and in which direction to move the decimal point in the fraction
The **base** part we may omit by just agreeing on what it will be equal to. In our case, we'll be using `2` as a base.
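To get a feeling for these three parts with a base of `2`, here is a minimal sketch. The function `toBinaryScientific` is hypothetical, and it is only meant for finite, non-zero numbers:

```javascript
// Hypothetical helper: splits a finite, non-zero number into
// sign × significand × 2^exponent, where 1 <= significand < 2.
function toBinaryScientific(value) {
  const sign = value < 0 ? -1 : 1;
  const magnitude = Math.abs(value);
  const exponent = Math.floor(Math.log2(magnitude));
  const significand = magnitude / 2 ** exponent;
  return { sign, significand, exponent };
}

const parts = toBinaryScientific(-27.15625);
console.log(parts); // { sign: -1, significand: 1.697265625, exponent: 4 }

// Putting the parts back together gives the original number.
console.log(parts.sign * parts.significand * 2 ** parts.exponent); // -27.15625
```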
Instead of using all 16 bits (or 32 bits, or 64 bits) to store the fraction of the number, we may share the bits and store a sign, exponent, and fraction at the same time. Depending on the number of bits that we're going to use to store the number we end up with the following splits:
| Floating-point format | Total bits | Sign bits | Exponent bits | Fraction bits | Base |
| :-------------------- | :--------: | :-------: | :-----------: | :--------------: | :--: |
| [Half-precision](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) | 16 | 1 | 5 | 10 | 2 |
| [Single-precision](https://en.wikipedia.org/wiki/Single-precision_floating-point_format) | 32 | 1 | 8 | 23 | 2 |
| [Double-precision](https://en.wikipedia.org/wiki/Double-precision_floating-point_format) | 64 | 1 | 11 | 52 | 2 |
With this approach, the number of bits for the fraction has been reduced (e.g. for a 16-bit number it was reduced from 16 bits to 10 bits). It means that the fraction can take a narrower range of values now (losing some precision). However, since we also have an exponent part, it will actually increase the ultimate number range and also allow us to describe the numbers between 0 and 1 (if the exponent is negative).
> For example, a signed 32-bit integer variable has a maximum value of 2³¹ − 1 = 2,147,483,647, whereas an IEEE 754 32-bit base-2 floating-point variable has a maximum value of ≈ 3.4028235 × 10³⁸.
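The trade-off between range and precision can be checked with a short sketch. It relies on the standard `Math.fround()` function, which rounds a number to the nearest single-precision value. The variable names are made up for this illustration:

```javascript
// Largest signed 32-bit integer.
const maxInt32 = 2 ** 31 - 1;

// Largest finite single-precision float:
// all 23 fraction bits set, with the largest regular exponent (127).
const maxFloat32 = (2 - 2 ** -23) * 2 ** 127;

console.log(maxInt32); // 2147483647
console.log(maxFloat32 > maxInt32); // true

// The value survives rounding to single precision, so it is a valid 32-bit float.
console.log(Math.fround(maxFloat32) === maxFloat32); // true

// The price is precision: 2^24 + 1 does not fit into the 32-bit float anymore.
console.log(Math.fround(16777217)); // 16777216
```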
To make it possible to have a negative exponent, the IEEE 754 standard uses the [biased exponent](https://en.wikipedia.org/wiki/Exponent_bias). The idea is simple - subtract the bias from the stored exponent value to get the actual exponent, which may be negative. For example, if the exponent has 5 bits, it may take values from the range of `[0, 31]` (all values are non-negative here). But if we subtract the value of `15` from it, the range becomes `[-15, 16]`. The number `15` is called the bias, and it is calculated by the following formula:
```
exponent_bias = 2 ^ (k − 1) − 1
k - number of exponent bits
```
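As a quick check, the formula may be applied to the exponent sizes of the three formats from the table above. The function name `exponentBias` is used here just for illustration:

```javascript
// exponent_bias = 2^(k − 1) − 1, where k is the number of exponent bits.
function exponentBias(exponentBitsCount) {
  return 2 ** (exponentBitsCount - 1) - 1;
}

[5, 8, 11].forEach((exponentBitsCount) => {
  const bias = exponentBias(exponentBitsCount);
  const maxStoredExponent = 2 ** exponentBitsCount - 1;
  // Subtracting the bias shifts [0, maxStoredExponent] towards the negative numbers.
  console.log(
    `k = ${exponentBitsCount}: bias = ${bias}, range = [${-bias}, ${maxStoredExponent - bias}]`,
  );
});
// k = 5: bias = 15, range = [-15, 16]
// k = 8: bias = 127, range = [-127, 128]
// k = 11: bias = 1023, range = [-1023, 1024]
```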
I've tried to describe the logic behind converting floating-point numbers from the binary format back to the decimal format in the image below. Hopefully, it will give you a better understanding of how the IEEE 754 standard works. A 16-bit number is used here for simplicity, but the same approach works for 32-bit and 64-bit numbers as well.
![Half-precision floating point number format explained in one picture](images/02-half-precision-floating-point-number-explained.png)
> Check out the [interactive version of this diagram](https://trekhleb.dev/blog/2021/binary-floating-point/) to play around with setting bits on and off, and seeing how it would influence the final result.
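The same conversion may also be followed step by step in code. Below is a minimal sketch for one half-precision bit pattern (the one that encodes `-27.15625`). It assumes a regular (normalized) number, and the variable names are chosen just for this walk-through:

```javascript
// 1 sign bit, 5 exponent bits, 10 fraction bits.
const bits = '1100111011001010';

const signBit = Number(bits[0]); // 1
const exponentField = parseInt(bits.slice(1, 6), 2); // (10011)₂ = 19
const fractionField = parseInt(bits.slice(6), 2); // (1011001010)₂ = 714

const sign = (-1) ** signBit; // -1
const exponent = exponentField - 15; // 19 - 15 = 4 (15 is the half-precision bias)
const fraction = fractionField / 2 ** 10; // 714 / 1024 = 0.697265625

// The leading 1 of the significand is implicit and is not stored in the bits.
const value = sign * 2 ** exponent * (1 + fraction);

console.log(value); // -27.15625
```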
## Code examples
- See [bitsToFloat.js](bitsToFloat.js) for an example of how to convert an array of bits to a floating-point number (the example is a bit artificial, but it still gives an overview of how the conversion works)
- See [floatAsBinaryString.js](floatAsBinaryString.js) for an example of how to see the actual binary representation of a floating-point number in JavaScript
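If you just want to peek at the bits without importing anything, a self-contained sketch with the standard `DataView` API might look like this. It is a simplified single-precision variant and not the code from the files above:

```javascript
const dataView = new DataView(new ArrayBuffer(4));

// Write the number as a 32-bit float (big-endian by default)...
dataView.setFloat32(0, -27.15625);

// ...and read the same 4 bytes back as an unsigned integer to see the raw bits.
const bitString = dataView.getUint32(0).toString(2).padStart(32, '0');
console.log(bitString); // 11000001110110010100000000000000

// The opposite direction: write the bits, read them back as a float.
dataView.setUint32(0, parseInt(bitString, 2));
console.log(dataView.getFloat32(0)); // -27.15625
```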
## References
You might also want to check out the following resources to get a deeper understanding of the binary representation of floating-point numbers:
- [Here is what you need to know about JavaScript's Number type](https://indepth.dev/posts/1139/here-is-what-you-need-to-know-about-javascripts-number-type)
- [Float Exposed](https://float.exposed/)
- [IEEE754 Visualization](https://bartaz.github.io/ieee754-visualization/)

@ -0,0 +1,32 @@
import { testCases16Bits, testCases32Bits, testCases64Bits } from '../testCases';
import { bitsToFloat16, bitsToFloat32, bitsToFloat64 } from '../bitsToFloat';
describe('bitsToFloat16', () => {
it('should convert floating point binary bits to floating point decimal number', () => {
for (let testCaseIndex = 0; testCaseIndex < testCases16Bits.length; testCaseIndex += 1) {
const [decimal, binary] = testCases16Bits[testCaseIndex];
const bits = binary.split('').map((bitString) => parseInt(bitString, 10));
expect(bitsToFloat16(bits)).toBeCloseTo(decimal, 4);
}
});
});
describe('bitsToFloat32', () => {
it('should convert floating point binary bits to floating point decimal number', () => {
for (let testCaseIndex = 0; testCaseIndex < testCases32Bits.length; testCaseIndex += 1) {
const [decimal, binary] = testCases32Bits[testCaseIndex];
const bits = binary.split('').map((bitString) => parseInt(bitString, 10));
expect(bitsToFloat32(bits)).toBeCloseTo(decimal, 7);
}
});
});
describe('bitsToFloat64', () => {
it('should convert floating point binary bits to floating point decimal number', () => {
for (let testCaseIndex = 0; testCaseIndex < testCases64Bits.length; testCaseIndex += 1) {
const [decimal, binary] = testCases64Bits[testCaseIndex];
const bits = binary.split('').map((bitString) => parseInt(bitString, 10));
expect(bitsToFloat64(bits)).toBeCloseTo(decimal, 14);
}
});
});

@ -0,0 +1,20 @@
import { floatAs32BinaryString, floatAs64BinaryString } from '../floatAsBinaryString';
import { testCases32Bits, testCases64Bits } from '../testCases';
describe('floatAs32Binary', () => {
it('should create a binary representation of the floating numbers', () => {
for (let testCaseIndex = 0; testCaseIndex < testCases32Bits.length; testCaseIndex += 1) {
const [decimal, binary] = testCases32Bits[testCaseIndex];
expect(floatAs32BinaryString(decimal)).toBe(binary);
}
});
});
describe('floatAs64Binary', () => {
it('should create a binary representation of the floating numbers', () => {
for (let testCaseIndex = 0; testCaseIndex < testCases64Bits.length; testCaseIndex += 1) {
const [decimal, binary] = testCases64Bits[testCaseIndex];
expect(floatAs64BinaryString(decimal)).toBe(binary);
}
});
});

@ -0,0 +1,119 @@
/**
* Sequence of 0s and 1s.
* @typedef {number[]} Bits
*/
/**
* @typedef {{
* signBitsCount: number,
* exponentBitsCount: number,
* fractionBitsCount: number,
* }} PrecisionConfig
*/
/**
* @typedef {{
* half: PrecisionConfig,
* single: PrecisionConfig,
* double: PrecisionConfig
* }} PrecisionConfigs
*/
/**
* sign bit
* exponent bits
* fraction bits
*
* X XXXXX XXXXXXXXXX
*
* @type {PrecisionConfigs}
*/
const precisionConfigs = {
// @see: https://en.wikipedia.org/wiki/Half-precision_floating-point_format
half: {
signBitsCount: 1,
exponentBitsCount: 5,
fractionBitsCount: 10,
},
// @see: https://en.wikipedia.org/wiki/Single-precision_floating-point_format
single: {
signBitsCount: 1,
exponentBitsCount: 8,
fractionBitsCount: 23,
},
// @see: https://en.wikipedia.org/wiki/Double-precision_floating-point_format
double: {
signBitsCount: 1,
exponentBitsCount: 11,
fractionBitsCount: 52,
},
};
/**
* Converts the binary representation of the floating point number to decimal float number.
*
* @param {Bits} bits - sequence of bits that represents the floating point number.
* @param {PrecisionConfig} precisionConfig - half/single/double precision config.
* @return {number} - floating point number decoded from its binary representation.
*/
function bitsToFloat(bits, precisionConfig) {
const { signBitsCount, exponentBitsCount } = precisionConfig;
// Figuring out the sign.
const sign = (-1) ** bits[0]; // (-1)^1 = -1, (-1)^0 = 1
// Calculating the exponent value.
const exponentBias = 2 ** (exponentBitsCount - 1) - 1;
const exponentBits = bits.slice(signBitsCount, signBitsCount + exponentBitsCount);
const exponentUnbiased = exponentBits.reduce(
(exponentSoFar, currentBit, bitIndex) => {
const bitPowerOfTwo = 2 ** (exponentBitsCount - bitIndex - 1);
return exponentSoFar + currentBit * bitPowerOfTwo;
},
0,
);
const exponent = exponentUnbiased - exponentBias;
// Calculating the fraction value.
const fractionBits = bits.slice(signBitsCount + exponentBitsCount);
const fraction = fractionBits.reduce(
(fractionSoFar, currentBit, bitIndex) => {
const bitPowerOfTwo = 2 ** -(bitIndex + 1);
return fractionSoFar + currentBit * bitPowerOfTwo;
},
0,
);
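// Putting all parts together to calculate the final number.
// A stored exponent of all zeros (exponentUnbiased holds the stored value here)
// encodes zero and subnormal numbers, which have no implicit leading 1.
if (exponentUnbiased === 0) {
return sign * (2 ** (1 - exponentBias)) * fraction;
}
return sign * (2 ** exponent) * (1 + fraction);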
}
/**
* Converts the 16-bit binary representation of the floating point number to decimal float number.
*
* @param {Bits} bits - sequence of bits that represents the floating point number.
* @return {number} - floating point number decoded from its binary representation.
*/
export function bitsToFloat16(bits) {
return bitsToFloat(bits, precisionConfigs.half);
}
/**
* Converts the 32-bit binary representation of the floating point number to decimal float number.
*
* @param {Bits} bits - sequence of bits that represents the floating point number.
* @return {number} - floating point number decoded from its binary representation.
*/
export function bitsToFloat32(bits) {
return bitsToFloat(bits, precisionConfigs.single);
}
/**
* Converts the 64-bit binary representation of the floating point number to decimal float number.
*
* @param {Bits} bits - sequence of bits that represents the floating point number.
* @return {number} - floating point number decoded from its binary representation.
*/
export function bitsToFloat64(bits) {
return bitsToFloat(bits, precisionConfigs.double);
}

@ -0,0 +1,61 @@
// @see: https://en.wikipedia.org/wiki/Single-precision_floating-point_format
const singlePrecisionBytesLength = 4; // 32 bits
// @see: https://en.wikipedia.org/wiki/Double-precision_floating-point_format
const doublePrecisionBytesLength = 8; // 64 bits
const bitsInByte = 8;
/**
* Converts the float number into its IEEE 754 binary representation.
* @see: https://en.wikipedia.org/wiki/IEEE_754
*
* @param {number} floatNumber - float number in decimal format.
* @param {number} byteLength - number of bytes to use to store the float number.
* @return {string} - binary string representation of the float number.
*/
function floatAsBinaryString(floatNumber, byteLength) {
let numberAsBinaryString = '';
const arrayBuffer = new ArrayBuffer(byteLength);
const dataView = new DataView(arrayBuffer);
const byteOffset = 0;
const littleEndian = false;
if (byteLength === singlePrecisionBytesLength) {
dataView.setFloat32(byteOffset, floatNumber, littleEndian);
} else {
dataView.setFloat64(byteOffset, floatNumber, littleEndian);
}
for (let byteIndex = 0; byteIndex < byteLength; byteIndex += 1) {
// Left-pad each byte with zeros so that it always takes 8 characters.
const bits = dataView.getUint8(byteIndex).toString(2).padStart(bitsInByte, '0');
numberAsBinaryString += bits;
}
return numberAsBinaryString;
}
/**
* Converts the float number into its IEEE 754 64-bits binary representation.
*
* @param {number} floatNumber - float number in decimal format.
* @return {string} - 64 bits binary string representation of the float number.
*/
export function floatAs64BinaryString(floatNumber) {
return floatAsBinaryString(floatNumber, doublePrecisionBytesLength);
}
/**
* Converts the float number into its IEEE 754 32-bits binary representation.
*
* @param {number} floatNumber - float number in decimal format.
* @return {string} - 32 bits binary string representation of the float number.
*/
export function floatAs32BinaryString(floatNumber) {
return floatAsBinaryString(floatNumber, singlePrecisionBytesLength);
}

@ -0,0 +1,71 @@
/**
* @typedef {[number, string]} TestCase
* @property {number} decimal
* @property {string} binary
*/
/**
* @type {TestCase[]}
*/
export const testCases16Bits = [
[-65504, '1111101111111111'],
[-10344, '1111000100001101'],
[-27.15625, '1100111011001010'],
[-1, '1011110000000000'],
[-0.09997558, '1010111001100110'],
[0, '0000000000000000'],
[5.9604644775390625e-8, '0000000000000001'],
[0.000004529, '0000000001001100'],
[0.0999755859375, '0010111001100110'],
[0.199951171875, '0011001001100110'],
[0.300048828125, '0011010011001101'],
[1, '0011110000000000'],
[1.5, '0011111000000000'],
[1.75, '0011111100000000'],
[1.875, '0011111110000000'],
[65504, '0111101111111111'],
];
/**
* @type {TestCase[]}
*/
export const testCases32Bits = [
[-3.40282346638528859812e+38, '11111111011111111111111111111111'],
[-10345.5595703125, '11000110001000011010011000111101'],
[-27.15625, '11000001110110010100000000000000'],
[-1, '10111111100000000000000000000000'],
[-0.1, '10111101110011001100110011001101'],
[0, '00000000000000000000000000000000'],
[1.40129846432481707092e-45, '00000000000000000000000000000001'],
[0.000004560, '00110110100110010000001000011010'],
[0.1, '00111101110011001100110011001101'],
[0.2, '00111110010011001100110011001101'],
[0.3, '00111110100110011001100110011010'],
[1, '00111111100000000000000000000000'],
[1.5, '00111111110000000000000000000000'],
[1.75, '00111111111000000000000000000000'],
[1.875, '00111111111100000000000000000000'],
[3.40282346638528859812e+38, '01111111011111111111111111111111'],
];
/**
* @type {TestCase[]}
*/
export const testCases64Bits = [
[-1.79769313486231570815e+308, '1111111111101111111111111111111111111111111111111111111111111111'],
[-10345.5595703125, '1100000011000100001101001100011110100000000000000000000000000000'],
[-27.15625, '1100000000111011001010000000000000000000000000000000000000000000'],
[-1, '1011111111110000000000000000000000000000000000000000000000000000'],
[-0.1, '1011111110111001100110011001100110011001100110011001100110011010'],
[0, '0000000000000000000000000000000000000000000000000000000000000000'],
[4.94065645841246544177e-324, '0000000000000000000000000000000000000000000000000000000000000001'],
[0.000004560, '0011111011010011001000000100001101000001011100110011110011100100'],
[0.1, '0011111110111001100110011001100110011001100110011001100110011010'],
[0.2, '0011111111001001100110011001100110011001100110011001100110011010'],
[0.3, '0011111111010011001100110011001100110011001100110011001100110011'],
[1, '0011111111110000000000000000000000000000000000000000000000000000'],
[1.5, '0011111111111000000000000000000000000000000000000000000000000000'],
[1.75, '0011111111111100000000000000000000000000000000000000000000000000'],
[1.875, '0011111111111110000000000000000000000000000000000000000000000000'],
[1.79769313486231570815e+308, '0111111111101111111111111111111111111111111111111111111111111111'],
];