Floating-point arithmetic: Difference between revisions

Jump to navigation Jump to search
imported>Vincent Lefèvre
Incidents: Though the issue did not come from floating point, it is still due to the careless use of approximations: if they had use an exact arithmetic (thus not needing to approximate 0.1) or wrote their code based on an error analysis, this would not have happened.
 
imported>Taylor Riastradh Campbell
Remove abuse of unattached superscript and subscript to simulate a diagonal fraction. I'm not aware of a way to make Wikipedia set diagonal fractions, but it's not necessary anyway.
 
Line 37: Line 37:
The speed of floating-point operations, commonly measured in terms of [[FLOPS]], is an important characteristic of a [[computer system]], especially for applications that involve intensive mathematical calculations.
The speed of floating-point operations, commonly measured in terms of [[FLOPS]], is an important characteristic of a [[computer system]], especially for applications that involve intensive mathematical calculations.


A [[floating-point unit]] (FPU, colloquially a math [[coprocessor]]) is a part of a computer system specially designed to carry out operations on floating-point numbers.
Floating-point numbers can be computed using software implementations (softfloat) or hardware implementations (hardfloat). [[floating-point unit|Floating-point units]] (FPUs, colloquially math [[coprocessor|coprocessors]]) are specially designed to carry out operations on floating-point numbers and are part of most computer systems. When FPUs are not available, software implementations can be used instead.


== Overview ==
== Overview ==
Line 53: Line 53:
To derive the value of the floating-point number, the ''significand'' is multiplied by the ''base'' raised to the power of the ''exponent'', equivalent to shifting the radix point from its implied position by a number of places equal to the value of the exponent—to the right if the exponent is positive or to the left if the exponent is negative.
To derive the value of the floating-point number, the ''significand'' is multiplied by the ''base'' raised to the power of the ''exponent'', equivalent to shifting the radix point from its implied position by a number of places equal to the value of the exponent—to the right if the exponent is positive or to the left if the exponent is negative.


Using base-10 (the familiar [[Decimal representation|decimal]] notation) as an example, the number {{val|152853.5047|fmt=commas}}, which has ten decimal digits of precision, is represented as the significand {{val|1528535047|fmt=commas}} together with 5 as the exponent. To determine the actual value, a decimal point is placed after the first digit of the significand and the result is multiplied by {{10^|5}} to give {{val|1.528535047|e=5|fmt=commas}}, or {{val|152853.5047|fmt=commas}}. In storing such a number, the base (10) need not be stored, since it will be the same for the entire range of supported numbers, and can thus be inferred.
Using base-10 (the familiar [[Decimal representation|decimal]] notation) as an example, the number {{val|152853.5047|fmt=commas}}, which has ten decimal digits of precision, is represented as the significand {{val|1528535047|fmt=none}} together with 5 as the exponent. To determine the actual value, a decimal point is placed after the first digit of the significand and the result is multiplied by {{10^|5}} to give {{val|1.528535047|e=5|fmt=commas}}, or {{val|152853.5047|fmt=commas}}. In storing such a number, the base (10) need not be stored, since it will be the same for the entire range of supported numbers, and can thus be inferred.


Symbolically, this final value is:
Symbolically, this final value is:
Line 87: Line 87:
* [[Fixed-point arithmetic|Fixed-point]] representation uses integer hardware operations controlled by a software implementation of a specific convention about the location of the binary or decimal point, for example, 6 bits or digits from the right. The hardware to manipulate these representations is less costly than floating point, and it can be used to perform normal integer operations, too. Binary fixed point is usually used in special-purpose applications on embedded processors that can only do integer arithmetic, but decimal fixed point is common in commercial applications.
* [[Fixed-point arithmetic|Fixed-point]] representation uses integer hardware operations controlled by a software implementation of a specific convention about the location of the binary or decimal point, for example, 6 bits or digits from the right. The hardware to manipulate these representations is less costly than floating point, and it can be used to perform normal integer operations, too. Binary fixed point is usually used in special-purpose applications on embedded processors that can only do integer arithmetic, but decimal fixed point is common in commercial applications.
* [[Logarithmic number system]]s (LNSs) represent a real number by the logarithm of its absolute value and a sign bit. The value distribution is similar to floating point, but the value-to-representation curve (''i.e.'', the graph of the logarithm function) is smooth (except at 0). Conversely to floating-point arithmetic, in a logarithmic number system multiplication, division and exponentiation are simple to implement, but addition and subtraction are complex. The ([[symmetric level-index arithmetic|symmetric]]) [[level-index arithmetic]] (LI and SLI) of Charles Clenshaw, [[Frank William John Olver|Frank Olver]] and Peter Turner is a scheme based on a [[generalized logarithm]] representation.
* [[Logarithmic number system]]s (LNSs) represent a real number by the logarithm of its absolute value and a sign bit. The value distribution is similar to floating point, but the value-to-representation curve (''i.e.'', the graph of the logarithm function) is smooth (except at 0). Conversely to floating-point arithmetic, in a logarithmic number system multiplication, division and exponentiation are simple to implement, but addition and subtraction are complex. The ([[symmetric level-index arithmetic|symmetric]]) [[level-index arithmetic]] (LI and SLI) of Charles Clenshaw, [[Frank William John Olver|Frank Olver]] and Peter Turner is a scheme based on a [[generalized logarithm]] representation.
* [[Tapered floating-point representation]], used in [[Unum (number format)|Unum]].
* [[Tapered floating-point representation]], used in [[Unum (number format)|Unum]] formats, including [[Unum (number format)#Posit (Type III Unum)|Posit]].
* Some simple rational numbers (''e.g.'', 1/3 and 1/10) cannot be represented exactly in binary floating point, no matter what the precision is. Using a different radix allows one to represent some of them (''e.g.'', 1/10 in decimal floating point), but the possibilities remain limited. Software packages that perform [[fraction|rational arithmetic]] represent numbers as fractions with integral numerator and denominator, and can therefore represent any rational number exactly. Such packages generally need to use "[[bignum]]" arithmetic for the individual integers.
* Some simple rational numbers (''e.g.'', 1/3 and 1/10) cannot be represented exactly in binary floating point, no matter what the precision is. Using a different radix allows one to represent some of them (''e.g.'', 1/10 in decimal floating point), but the possibilities remain limited. Software packages that perform [[fraction|rational arithmetic]] represent numbers as fractions with integral numerator and denominator, and can therefore represent any rational number exactly. Such packages generally need to use "[[bignum]]" arithmetic for the individual integers.
* [[Interval arithmetic]] allows one to represent numbers as intervals and obtain guaranteed bounds on results. It is generally based on other arithmetics, in particular floating point.
* [[Interval arithmetic]] allows one to represent numbers as intervals and obtain guaranteed bounds on results. It is generally based on other arithmetics, in particular floating point.
Line 100: Line 100:
[[File:Konrad Zuse (1992).jpg|thumb|upright=0.7|right|[[Konrad Zuse]], architect of the [[Z3 (computer)|Z3]] computer, which uses a 22-bit binary floating-point representation]]
[[File:Konrad Zuse (1992).jpg|thumb|upright=0.7|right|[[Konrad Zuse]], architect of the [[Z3 (computer)|Z3]] computer, which uses a 22-bit binary floating-point representation]]


In 1938, [[Konrad Zuse]] of Berlin completed the [[Z1 (computer)|Z1]], the first binary, programmable [[mechanical computer]];<ref name="Rojas_1997"/> it uses a 24-bit binary floating-point number representation with a 7-bit signed exponent, a 17-bit significand (including one implicit bit), and a sign bit.<ref name="Rojas_2014"/> The more reliable [[relay]]-based [[Z3 (computer)|Z3]], completed in 1941, has representations for both positive and negative infinities; in particular, it implements defined operations with infinity, such as <math>^1/_\infty = 0</math>, and it stops on undefined operations, such as <math>0 \times \infty</math>.
In 1938, [[Konrad Zuse]] of Berlin completed the [[Z1 (computer)|Z1]], the first binary, programmable [[mechanical computer]];<ref name="Rojas_1997"/> it uses a 24-bit binary floating-point number representation with a 7-bit signed exponent, a 17-bit significand (including one implicit bit), and a sign bit.<ref name="Rojas_2014"/> The more reliable [[relay]]-based [[Z3 (computer)|Z3]], completed in 1941, has representations for both positive and negative infinities; in particular, it implements defined operations with infinity, such as [[division by infinity]], where <math>1/\infty = 0</math>, and it stops on undefined operations, such as <math>0 \times \infty</math>.


Zuse also proposed, but did not complete, carefully rounded floating-point arithmetic that includes <math>\pm\infty</math> and NaN representations, anticipating features of the IEEE Standard by four decades.<ref name="Kahan_1997_JVNL"/> In contrast, [[John von Neumann|von Neumann]] recommended against floating-point numbers for the 1951 [[IAS machine]], arguing that fixed-point arithmetic is preferable.<ref name="Kahan_1997_JVNL"/>
Zuse also proposed, but did not complete, carefully rounded floating-point arithmetic that includes <math>\pm\infty</math> and NaN representations, anticipating features of the IEEE Standard by four decades.<ref name="Kahan_1997_JVNL"/> In contrast, [[John von Neumann|von Neumann]] recommended against floating-point numbers for the 1951 [[IAS machine]], arguing that fixed-point arithmetic is preferable.<ref name="Kahan_1997_JVNL"/>
Line 139: Line 139:
* ''L'' is the smallest exponent of the system,
* ''L'' is the smallest exponent of the system,
* ''U'' is the largest exponent of the system,
* ''U'' is the largest exponent of the system,
is <math>2 \left(B - 1\right) \left(B^{P-1}\right) \left(U - L + 1\right)</math>.
is <math>2 \left(B - 1\right) \left(B^{P-1}\right) \left(U - L + 1\right)</math>, or <math>2 \left(B - 1\right) \left(B^{P-1}\right) \left(U - L + 1\right) + 1</math> considering the value 0.


There is a smallest positive normal floating-point number,
There is a smallest positive normal floating-point number,
Line 268: Line 268:
* The [[bfloat16 floating-point format|bfloat16 format]] requires the same amount of memory (16 bits) as the [[Half-precision floating-point format|IEEE 754 half-precision format]], but allocates 8 bits to the exponent instead of 5, thus providing the same range as a [[Single-precision floating-point format|IEEE 754 single-precision]] number. The tradeoff is a reduced precision, as the trailing significand field is reduced from 10 to 7 bits. This format is mainly used in the training of [[machine learning]] models, where range is more valuable than precision. Many machine learning accelerators provide hardware support for this format.
* The [[bfloat16 floating-point format|bfloat16 format]] requires the same amount of memory (16 bits) as the [[Half-precision floating-point format|IEEE 754 half-precision format]], but allocates 8 bits to the exponent instead of 5, thus providing the same range as a [[Single-precision floating-point format|IEEE 754 single-precision]] number. The tradeoff is a reduced precision, as the trailing significand field is reduced from 10 to 7 bits. This format is mainly used in the training of [[machine learning]] models, where range is more valuable than precision. Many machine learning accelerators provide hardware support for this format.
* The TensorFloat-32<ref name="Kharya_2020"/> format combines the 8 bits of exponent of the bfloat16 with the 10 bits of trailing significand field of half-precision formats, resulting in a size of 19 bits. This format was introduced by [[Nvidia]], which provides hardware support for it in the Tensor Cores of its [[Graphics processing unit|GPUs]] based on the Nvidia Ampere architecture. The drawback of this format is its size, which is not a power of 2. However, according to Nvidia, this format should only be used internally by hardware to speed up computations, while inputs and outputs should be stored in the 32-bit single-precision IEEE 754 format.<ref name="Kharya_2020"/>
* The TensorFloat-32<ref name="Kharya_2020"/> format combines the 8 bits of exponent of the bfloat16 with the 10 bits of trailing significand field of half-precision formats, resulting in a size of 19 bits. This format was introduced by [[Nvidia]], which provides hardware support for it in the Tensor Cores of its [[Graphics processing unit|GPUs]] based on the Nvidia Ampere architecture. The drawback of this format is its size, which is not a power of 2. However, according to Nvidia, this format should only be used internally by hardware to speed up computations, while inputs and outputs should be stored in the 32-bit single-precision IEEE 754 format.<ref name="Kharya_2020"/>
* The [[Hopper (microarchitecture)|Hopper]] architecture GPUs provide two FP8 formats: one with the same numerical range as half-precision (E5M2) and one with higher precision, but less range (E4M3).<ref name="NVIDIA_Hopper"/><ref name="Micikevicius_2022"/>
* The [[Hopper (microarchitecture)|Hopper]] and [[CDNA 3]] architecture GPUs provide two FP8 formats: one with the same numerical range as half-precision (E5M2) and one with higher precision, but less range (E4M3).<ref name="NVIDIA_Hopper"/><ref name="Micikevicius_2022"/>
* The [[Blackwell (microarchitecture)|Blackwell]] GPU architecture includes support for FP6 (E3M2 and E2M3) and FP4 (E2M1) formats. FP4 is the smallest floating-point format which allows for all IEEE 754 principles (see [[minifloat]]).
* The [[Blackwell (microarchitecture)|Blackwell]] and [[CDNA (microarchitecture)|CDNA 4]] GPU architecture includes support for FP6 (E3M2 and E2M3) and FP4 (E2M1) formats. FP4 is the smallest floating-point format which allows for all IEEE 754 principles (see [[minifloat]]).


{| class="wikitable"
{| class="wikitable"
Line 475: Line 475:


=== Literal syntax ===
=== Literal syntax ===
Literals for floating-point numbers depend on languages. They typically use <code>e</code> or <code>E</code> to denote [[scientific notation]]. The [[C (programming language)|C programming language]] and the [[IEEE 754]] standard also define a [[IEEE 754#Hexadecimal literals|hexadecimal literal syntax]] with a base-2 exponent instead of 10.<!-- Also in [[Hexadecimal#Hexadecimal exponential notation]] --> In languages like [[C (programming language)|C]], when the decimal exponent is omitted, a decimal point is needed to differentiate them from integers. Other languages do not have an integer type (such as [[JavaScript]]), or allow overloading of numeric types (such as [[Haskell (programming language)|Haskell]]). In these cases, digit strings such as <code>123</code> may also be floating-point literals.
Literals for floating-point numbers depend on languages. They typically use <code>e</code> or <code>E</code> to denote [[scientific notation]]. The [[C (programming language)|C programming language]] and the [[IEEE 754]] standard also define a [[IEEE 754#Hexadecimal literals|hexadecimal literal syntax]] with a base-2 exponent instead of 10.<!-- Also in [[Hexadecimal#Hexadecimal exponential notation]] --> In languages like [[C (programming language)|C]], when the decimal exponent is omitted, a decimal point is needed to differentiate them from integers. Other languages do not have an integer type (such as [[JavaScript]]), or allow overloading of numeric types (such as [[Haskell]]). In these cases, digit strings such as <code>123</code> may also be floating-point literals.


Examples of floating-point literals are:
Examples of floating-point literals are:
Line 484: Line 484:
* <code>0x1.fffffep+127</code> in C and IEEE 754
* <code>0x1.fffffep+127</code> in C and IEEE 754


==Dealing with exceptional cases {{anchor|Floating point exception|Exception handling}}== <!-- linked from a couple of places within the article -->
== Dealing with exceptional cases {{anchor|Floating point exception|Exception handling}} == <!-- linked from a couple of places within the article -->
{{further|IEEE 754#Exception handling}}
{{Further|IEEE 754#Exception handling}}


Floating-point computation in a computer can run into three kinds of problems:
Floating-point computation in a computer can run into three kinds of problems:
Line 492: Line 492:
* An operation can be legal in principle, but the result can be impossible to represent in the specified format, because the exponent is too large or too small to encode in the exponent field. Such an event is called an overflow (exponent too large), [[arithmetic underflow|underflow]] (exponent too small) or [[Subnormal number|denormalization]] (precision loss).
* An operation can be legal in principle, but the result can be impossible to represent in the specified format, because the exponent is too large or too small to encode in the exponent field. Such an event is called an overflow (exponent too large), [[arithmetic underflow|underflow]] (exponent too small) or [[Subnormal number|denormalization]] (precision loss).


Prior to the IEEE standard, such conditions usually caused the program to terminate, or triggered some kind of [[trap (computing)|trap]] that the programmer might be able to catch. How this worked was system-dependent, meaning that floating-point programs were not [[porting|portable]]. (The term "exception" as used in IEEE 754 is a general term meaning an exceptional condition, which is not necessarily an error, and is a different usage to that typically defined in programming languages such as a C++ or Java, in which an "[[Exception handling|exception]]" is an alternative flow of control, closer to what is termed a "trap" in IEEE 754 terminology.)
Prior to the IEEE standard, such conditions usually caused the program to terminate, or triggered some kind of [[trap (computing)|trap]] that the programmer might be able to catch. How this worked was system-dependent, meaning that floating-point programs were not [[porting|portable]].
 
The term "exception" as used in IEEE 754 is a general term meaning an exceptional condition, which is not necessarily an error, and is a different usage to that typically defined in programming languages such as a C++ or Java, in which an "[[Exception handling|exception]]" is an alternative flow of control, closer to what is termed a "trap" in IEEE 754 terminology. However, in such languages, a control-flow exception such as <code>ArithmeticException</code> may still be thrown.


Here, the required default method of handling exceptions according to IEEE 754 is discussed (the IEEE 754 optional trapping and other "alternate exception handling" modes are not discussed). Arithmetic exceptions are (by default) required to be recorded in "sticky" status flag bits. That they are "sticky" means that they are not reset by the next (arithmetic) operation, but stay set until explicitly reset. The use of "sticky" flags thus allows for testing of exceptional conditions to be delayed until after a full floating-point expression or subroutine: without them exceptional conditions that could not be otherwise ignored would require explicit testing immediately after every floating-point operation. By default, an operation always returns a result according to specification without interrupting computation. For instance, 1/0 returns +∞, while also setting the divide-by-zero flag bit (this default of ∞ is designed to often return a finite result when used in subsequent operations and so be safely ignored).
Here, the required default method of handling exceptions according to IEEE 754 is discussed (the IEEE 754 optional trapping and other "alternate exception handling" modes are not discussed). Arithmetic exceptions are (by default) required to be recorded in "sticky" status flag bits. That they are "sticky" means that they are not reset by the next (arithmetic) operation, but stay set until explicitly reset. The use of "sticky" flags thus allows for testing of exceptional conditions to be delayed until after a full floating-point expression or subroutine: without them exceptional conditions that could not be otherwise ignored would require explicit testing immediately after every floating-point operation. By default, an operation always returns a result according to specification without interrupting computation. For instance, 1/0 returns +∞, while also setting the divide-by-zero flag bit (this default of ∞ is designed to often return a finite result when used in subsequent operations and so be safely ignored).
Line 522: Line 524:
Also, the non-representability of π (and π/2) means that an attempted computation of tan(π/2) will not yield a result of infinity, nor will it even overflow in the usual floating-point formats (assuming an accurate implementation of tan). It is simply not possible for standard floating-point hardware to attempt to compute tan(π/2), because π/2 cannot be represented exactly. This computation in C:
Also, the non-representability of π (and π/2) means that an attempted computation of tan(π/2) will not yield a result of infinity, nor will it even overflow in the usual floating-point formats (assuming an accurate implementation of tan). It is simply not possible for standard floating-point hardware to attempt to compute tan(π/2), because π/2 cannot be represented exactly. This computation in C:
<syntaxhighlight lang="c">
<syntaxhighlight lang="c">
/* Enough digits to be sure we get the correct approximation. */
// Enough digits to be sure we get the correct approximation.
double pi = 3.1415926535897932384626433832795;
const double pi = 3.1415926535897932384626433832795;
double z = tan(pi/2.0);
double z = tan(pi / 2.0);
</syntaxhighlight>
</syntaxhighlight>
will give a result of 16331239353195370.0. In single precision (using the <code>tanf</code> function), the result will be −22877332.0.
will give a result of 16331239353195370.0. In single precision (using the <code>tanf</code> function), the result will be −22877332.0.
Line 576: Line 578:
=== Incidents ===
=== Incidents ===
* On 25 February 1991, a [[loss of significance]] in a [[MIM-104 Patriot]] missile battery [[MIM-104 Patriot#Failure at Dhahran|prevented it from intercepting]] an incoming [[Al Hussein (missile)|Scud]] missile in [[Dhahran]], [[Saudi Arabia]], contributing to the death of 28 soldiers from the U.S. Army's [[14th Quartermaster Detachment]].<ref name="GAO report IMTEC 92-26"/>  The weapons control computer counted time in an integer number of tenths of a second since boot.  For conversion to a floating-point number of seconds in velocity and position calculations, the software originally multiplied this number by a 24-bit [[Fixed-point arithmetic|fixed-point]] binary approximation to 0.1, specifically <math display="block">0.00011001100110011001100_2 = 0.1 \times (1 - 2^{-20}).</math>  Some parts of the software were later adapted to use a more accurate conversion to floating-point, but some parts were not updated and still used the 24-bit approximation.<ref name="Skeel"/>  These parts of the software drifted from one another by about 3.43&nbsp;milliseconds per hour.  After 20 hours, the discrepancy of about 68.7&nbsp;ms was enough for the radar tracking system to lose track of Scuds; the control system in the Dhahran missile battery had been running for about 100 hours when it failed to track and intercept an incoming Scud.<ref name="GAO report IMTEC 92-26"/>  The failure to intercept arose not from using floating point specifically, but from subtracting two different approximations to unit conversion with different errors when representing time, so the unit conversion error in the difference did not cancel out but rather grew indefinitely with uptime.<ref name="Skeel"/>
* On 25 February 1991, a [[loss of significance]] in a [[MIM-104 Patriot]] missile battery [[MIM-104 Patriot#Failure at Dhahran|prevented it from intercepting]] an incoming [[Al Hussein (missile)|Scud]] missile in [[Dhahran]], [[Saudi Arabia]], contributing to the death of 28 soldiers from the U.S. Army's [[14th Quartermaster Detachment]].<ref name="GAO report IMTEC 92-26"/>  The weapons control computer counted time in an integer number of tenths of a second since boot.  For conversion to a floating-point number of seconds in velocity and position calculations, the software originally multiplied this number by a 24-bit [[Fixed-point arithmetic|fixed-point]] binary approximation to 0.1, specifically <math display="block">0.00011001100110011001100_2 = 0.1 \times (1 - 2^{-20}).</math>  Some parts of the software were later adapted to use a more accurate conversion to floating-point, but some parts were not updated and still used the 24-bit approximation.<ref name="Skeel"/>  These parts of the software drifted from one another by about 3.43&nbsp;milliseconds per hour.  After 20 hours, the discrepancy of about 68.7&nbsp;ms was enough for the radar tracking system to lose track of Scuds; the control system in the Dhahran missile battery had been running for about 100 hours when it failed to track and intercept an incoming Scud.<ref name="GAO report IMTEC 92-26"/>  The failure to intercept arose not from using floating point specifically, but from subtracting two different approximations to unit conversion with different errors when representing time, so the unit conversion error in the difference did not cancel out but rather grew indefinitely with uptime.<ref name="Skeel"/>
* {{Clarify|date=November 2024|reason=It is not clear how this is an incident (the section title may have to be modified to cover more than incidents) and how this is due to floating-point arithmetic (rather than number approximations in general). The term "invisible" may also be misleading without following explanations. |text=[[Salami slicing tactics#Financial schemes|Salami slicing]] is the practice of removing the 'invisible' part of a transaction into a separate account.}}
* {{Clarify|date=November 2024|reason=It is not clear how this is an incident (the section title may have to be modified to cover more than incidents) and how this is due to floating-point arithmetic (rather than number approximations in general). The term "invisible" may also be misleading without following explanations. |text=[[Salami slicing tactics#Financial schemes|Salami slicing]] is the practice of removing the "invisible" part of a transaction into a separate account.}}


=== Machine precision and backward error analysis ===
=== Machine precision and backward error analysis ===
Line 619: Line 621:
Although individual arithmetic operations of IEEE 754 are guaranteed accurate to within half a [[unit in the last place|ULP]], more complicated formulae can suffer from larger errors for a variety of reasons. The loss of accuracy can be substantial if a problem or its data are [[condition number|ill-conditioned]], meaning that the correct result is hypersensitive to tiny perturbations in its data. However, even functions that are well-conditioned can suffer from large loss of accuracy if an algorithm [[numerical stability|numerically unstable]] for that data is used: apparently equivalent formulations of expressions in a programming language can differ markedly in their numerical stability. One approach to remove the risk of such loss of accuracy is the design and analysis of numerically stable algorithms, which is an aim of the branch of mathematics known as [[numerical analysis]]. Another approach that can protect against the risk of numerical instabilities is the computation of intermediate (scratch) values in an algorithm at a higher precision than the final result requires,<ref name="OliveiraStewart_2006"/> which can remove, or reduce by orders of magnitude,<ref name="Kahan_2005_ARITH17"/> such risk: [[Quadruple-precision floating-point format|IEEE 754 quadruple precision]] and [[extended precision]] are designed for this purpose when computing at double precision.<ref name="Kahan_2011_Debug"/><ref group="nb" name="NB_4"/>
Although individual arithmetic operations of IEEE 754 are guaranteed accurate to within half a [[unit in the last place|ULP]], more complicated formulae can suffer from larger errors for a variety of reasons. The loss of accuracy can be substantial if a problem or its data are [[condition number|ill-conditioned]], meaning that the correct result is hypersensitive to tiny perturbations in its data. However, even functions that are well-conditioned can suffer from large loss of accuracy if an algorithm [[numerical stability|numerically unstable]] for that data is used: apparently equivalent formulations of expressions in a programming language can differ markedly in their numerical stability. One approach to remove the risk of such loss of accuracy is the design and analysis of numerically stable algorithms, which is an aim of the branch of mathematics known as [[numerical analysis]]. Another approach that can protect against the risk of numerical instabilities is the computation of intermediate (scratch) values in an algorithm at a higher precision than the final result requires,<ref name="OliveiraStewart_2006"/> which can remove, or reduce by orders of magnitude,<ref name="Kahan_2005_ARITH17"/> such risk: [[Quadruple-precision floating-point format|IEEE 754 quadruple precision]] and [[extended precision]] are designed for this purpose when computing at double precision.<ref name="Kahan_2011_Debug"/><ref group="nb" name="NB_4"/>


For example, the following algorithm is a direct implementation to compute the function {{math|{{var|A}}({{var|x}}) {{=}} ({{var|x}}−1) / (exp({{var|x}}−1) 1)}} which is well-conditioned at 1.0,<ref group="nb" name="NB_5"/> however it can be shown to be numerically unstable and lose up to half the significant digits carried by the arithmetic when computed near 1.0.<ref name="Kahan_2001_JavaHurt"/>
For example, the following algorithm is a direct implementation to compute the function <math>f(x)= \frac{x-1}{\exp(x-1)-1}</math>, which is well-conditioned at <math>x = 1</math>.<ref group="nb" name="NB_5"/> However, it can be shown to be numerically unstable and lose up to half the significant digits carried by the arithmetic when computed near 1.0.<ref name="Kahan_2001_JavaHurt"/>
<syntaxhighlight lang="c" line highlight="3,7">
<syntaxhighlight lang="c">
double A(double X)
#include <math.h>
 
double f(double x)
{
{
        double Y, Z;  // [1]
    double y = x - 1.0;
        Y = X - 1.0;
    double z = exp(y);
        Z = exp(Y);
    if (z != 1.0) {
        if (Z != 1.0)
        z = y / (z - 1.0);
                Z = Y / (Z - 1.0); // [2]
    }
        return Z;
    return z;
}
}
</syntaxhighlight>
</syntaxhighlight>


If, however, intermediate computations are all performed in extended precision (e.g. by setting line [1] to [[C99]] {{code|long double}}), then up to full precision in the final double result can be maintained.<ref group="nb" name="NB_6"/> Alternatively, a numerical analysis of the algorithm reveals that if the following non-obvious change to line [2] is made:
A numerical analysis of the algorithm reveals that if the following non-obvious change to the line <code>z = y / (z - 1.0);</code> is made:
<syntaxhighlight lang="c">
<syntaxhighlight lang="c">
Z = log(Z) / (Z - 1.0);
z = log(z) / (z - 1.0);
</syntaxhighlight>
</syntaxhighlight>
then the algorithm becomes numerically stable and can compute to full double precision.
then the algorithm becomes numerically stable and can compute to full double precision.<ref name="Kahan_2001_JavaHurt"/><!-- Note: this is old; nowadays, one would use expm1, which has been designed to avoid the loss of accuracy due to the z - 1.0 cancellation. -->


To maintain the properties of such carefully constructed numerically stable programs, careful handling by the [[compiler]] is required. Certain "optimizations" that compilers might make (for example, reordering operations) can work against the goals of well-behaved software. There is some controversy about the failings of compilers and language designs in this area: C99 is an example of a language where such optimizations are carefully specified to maintain numerical precision. See the external references at the bottom of this article.
To maintain the properties of such carefully constructed numerically stable programs, careful handling by the [[compiler]] is required. Certain "optimizations" that compilers might make (for example, reordering operations) can work against the goals of well-behaved software. There is some controversy about the failings of compilers and language designs in this area: C99 is an example of a language where such optimizations are carefully specified to maintain numerical precision. See the external references at the bottom of this article.
Line 644: Line 648:
As decimal fractions can often not be exactly represented in binary floating-point, such arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of a moon around Saturn or the mass of a [[proton]]), and at its worst when it is expected to model the interactions of quantities expressed as decimal strings that are expected to be exact.<ref name="Kahan_2005_ARITH17"/><ref name="Kahan_2000_Marketing"/> An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation.<ref name="Speleotrove_2012"/> The "decimal" data type of the [[C Sharp (programming language)|C#]] and [[Python (programming language)|Python]] programming languages, and the decimal formats of the [[IEEE 754-2008]] standard, are designed to avoid the problems of binary floating-point representations when applied to human-entered exact decimal values, and make the arithmetic always behave as expected when numbers are printed in decimal.
As decimal fractions can often not be exactly represented in binary floating-point, such arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of a moon around Saturn or the mass of a [[proton]]), and at its worst when it is expected to model the interactions of quantities expressed as decimal strings that are expected to be exact.<ref name="Kahan_2005_ARITH17"/><ref name="Kahan_2000_Marketing"/> An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation.<ref name="Speleotrove_2012"/> The "decimal" data type of the [[C Sharp (programming language)|C#]] and [[Python (programming language)|Python]] programming languages, and the decimal formats of the [[IEEE 754-2008]] standard, are designed to avoid the problems of binary floating-point representations when applied to human-entered exact decimal values, and make the arithmetic always behave as expected when numbers are printed in decimal.


Expectations from mathematics may not be realized in the field of floating-point computation. For example, it is known that <math>(x+y)(x-y) = x^2-y^2\,</math>, and that <math>\sin^2{\theta}+\cos^2{\theta} = 1\,</math>, however these facts cannot be relied on when the quantities involved are the result of floating-point computation.
Expectations from mathematics may not be realized in the field of floating-point computation. For example, it is known that <math>(x+y)(x-y) = x^2-y^2\,</math>, and that <math>\sin^2{\theta}+\cos^2{\theta} = 1\,</math>. However, these facts cannot be relied on when the quantities involved are the result of floating-point computation.


The use of the equality test (<code>if (x==y) ...</code>) requires care when dealing with floating-point numbers. Even simple expressions like <code>0.6/0.2-3==0</code> will, on most computers, fail to be true<ref name="Christiansen_Perl"/> (in IEEE 754 double precision, for example, <code>0.6/0.2 - 3</code> is approximately equal to {{val|-4.44089209850063e-16}}). Consequently, such tests are sometimes replaced with "fuzzy" comparisons (<code>if (abs(x-y) < epsilon) ...</code>, where epsilon is sufficiently small and tailored to the application, such as 1.0E−13). The wisdom of doing this varies greatly, and can require numerical analysis to bound epsilon.<ref name="Higham_2002"/> Values derived from the primary data representation and their comparisons should be performed in a wider, extended, precision to minimize the risk of such inconsistencies due to round-off errors.<ref name="Kahan_2000_Marketing"/> It is often better to organize the code in such a way that such tests are unnecessary. For example, in [[computational geometry]], exact tests of whether a point lies off or on a line or plane defined by other points can be performed using adaptive precision or exact arithmetic methods.<ref name="Shewchuk"/>
The use of the equality test (<code>if (x==y) ...</code>) requires care when dealing with floating-point numbers. Even simple expressions like <code>0.6 / 0.2 - 3 == 0</code> will, on most computers, fail to be true<ref name="Christiansen_Perl"/> (in IEEE 754 double precision, for example, <code>0.6 / 0.2 - 3</code> is approximately equal to {{val|-4.44089209850063e-16}}). Consequently, such tests are sometimes replaced with "fuzzy" comparisons (<code>if (abs(x-y) < epsilon) ...</code>, where epsilon is sufficiently small and tailored to the application, such as 1.0E−13). The wisdom of doing this varies greatly, and can require numerical analysis to bound epsilon.<ref name="Higham_2002"/> Values derived from the primary data representation and their comparisons should be performed in a wider, extended, precision to minimize the risk of such inconsistencies due to round-off errors.<ref name="Kahan_2000_Marketing"/> It is often better to organize the code in such a way that such tests are unnecessary. For example, in [[computational geometry]], exact tests of whether a point lies off or on a line or plane defined by other points can be performed using adaptive precision or exact arithmetic methods.<ref name="Shewchuk"/>


Small errors in floating-point arithmetic can grow when mathematical algorithms perform operations an enormous number of times. A few examples are [[matrix inversion]], [[eigenvector]] computation, and differential equation solving. These algorithms must be very carefully designed, using numerical approaches such as [[iterative refinement]], if they are to work well.<ref name="Kahan_1997_Cantilever"/>
Small errors in floating-point arithmetic can grow when mathematical algorithms perform operations an enormous number of times. A few examples are [[matrix inversion]], [[eigenvector]] computation, and differential equation solving. These algorithms must be very carefully designed, using numerical approaches such as [[iterative refinement]], if they are to work well.<ref name="Kahan_1997_Cantilever"/>
Line 657: Line 661:
The low 3 digits of the addends are effectively lost. Suppose, for example, that one needs to add many numbers, all approximately equal to 3. After 1000 of them have been added, the running sum is about 3000; the lost digits are not regained. The [[Kahan summation algorithm]] may be used to reduce the errors.<ref name="Higham_2002"/>
The low 3 digits of the addends are effectively lost. Suppose, for example, that one needs to add many numbers, all approximately equal to 3. After 1000 of them have been added, the running sum is about 3000; the lost digits are not regained. The [[Kahan summation algorithm]] may be used to reduce the errors.<ref name="Higham_2002"/>


Round-off error can affect the convergence and accuracy of iterative numerical procedures. As an example, [[Archimedes]] approximated π by calculating the perimeters of polygons inscribing and circumscribing a circle, starting with hexagons, and successively doubling the number of sides. As noted above, computations may be rearranged in a way that is mathematically equivalent but less prone to error ([[numerical analysis]]). Two forms of the recurrence formula for the circumscribed polygon are:{{citation needed|reason=Not obvious formulas|date=June 2016}}
Round-off error can affect the convergence and accuracy of iterative numerical procedures. As an example, [[Archimedes]] approximated π by calculating the perimeters of polygons [[Inscribed figure|inscribing]] and [[circumscribing]] a circle, starting with hexagons, and successively doubling the number of sides. As noted above, computations may be rearranged in a way that is mathematically equivalent but less prone to error ([[numerical analysis]]). Two forms of the recurrence formula for the circumscribed polygon are:{{citation needed|reason=Not obvious formulas|date=June 2016}}
* <math display="inline">t_0 = \frac{1}{\sqrt{3}}</math>
* <math display="inline">t_0 = \frac{1}{\sqrt{3}}</math>
* First form: <math display=inline>t_{i+1} = \frac{\sqrt{t_i^2+1}-1}{t_i}</math>
* First form: <math display=inline>t_{i+1} = \frac{\sqrt{t_i^2+1}-1}{t_i}</math>
Line 701: Line 705:


=== "Fast math" optimization ===
=== "Fast math" optimization ===
The aforementioned lack of [[Associative property|associativity]] of floating-point operations in general means that [[compilers]] cannot as effectively reorder arithmetic expressions as they could with integer and fixed-point arithmetic, presenting a roadblock in optimizations such as [[common subexpression elimination]] and auto-[[Single instruction, multiple data|vectorization]].<ref name="Vectorizers"/> The "fast math" option on many compilers (ICC, GCC, Clang, MSVC...) turns on reassociation along with unsafe assumptions such as a lack of NaN and infinite numbers in IEEE 754. Some compilers also offer more granular options to only turn on reassociation. In either case, the programmer is exposed to many of the precision pitfalls mentioned above for the portion of the program using "fast" math.<ref name="FPM"/>
The aforementioned lack of [[Associative property|associativity]] of floating-point operations in general means that [[compilers]] cannot as effectively reorder arithmetic expressions as they could with integer and fixed-point arithmetic, presenting a roadblock in optimizations such as [[common subexpression elimination]] and auto-[[Single instruction, multiple data|vectorization]].<ref name="Vectorizers"/> The "fast math" option on many compilers (ICC, GCC, Clang, MSVC...) turns on reassociation along with unsafe assumptions such as a lack of NaN and infinite numbers in IEEE 754. Some compilers also offer more granular options to only turn on reassociation or to mark specific regions of code for more aggressive optimization.<ref>{{cite web |last1=Kaylor |first1=Andy |title=Towards Useful Fast-Math |url=https://llvm.org/devmtg/2024-10/slides/techtalk/Kaylor-Towards-Useful-Fast-Math.pdf |date=October 2024}}</ref> In either case, the programmer is exposed to many of the precision pitfalls mentioned above for the portion of the program using "fast" math.<ref name="FPM"/>


In some compilers (GCC and Clang), turning on "fast" math may cause the program to [[Subnormal_number#Disabling_subnormal_floats_at_the_code_level|disable subnormal floats]] at startup, affecting the floating-point behavior of not only the generated code, but also any program using such code as a [[Library (computing)|library]].<ref name="harmful"/>
In some compilers (GCC and Clang [when a GCC installation is present]), turning on "fast" math may cause the program to [[Subnormal_number#Disabling_subnormal_floats_at_the_code_level|disable subnormal floats]] at startup, affecting the floating-point behavior of not only the generated code, but also any program using such code as a [[Library (computing)|library]]. This was fixed in GCC 13.<ref name="harmful"/>


In most [[Fortran]] compilers, as allowed by the ISO/IEC 1539-1:2004 Fortran standard, reassociation is the default, with breakage largely prevented by the "protect parens" setting (also on by default). This setting stops the compiler from reassociating beyond the boundaries of parentheses.<ref name="Gen"/> [[Intel Fortran Compiler]] is a notable outlier.<ref name="zheevd"/>
In most [[Fortran]] compilers, as allowed by the ISO/IEC 1539-1:2004 Fortran standard, reassociation is the default, with breakage largely prevented by the "protect parens" setting (also on by default). This setting stops the compiler from reassociating beyond the boundaries of parentheses.<ref name="Gen"/> [[Intel Fortran Compiler]] is a notable outlier.<ref name="zheevd"/>
Line 720: Line 724:
* [[Fixed-point arithmetic]]
* [[Fixed-point arithmetic]]
* [[Floating-point error mitigation]]
* [[Floating-point error mitigation]]
* [[Floating origin]] – a technique in 3D rendering to mitigate precision loss of floating-point formats
* [[FLOPS]]
* [[FLOPS]]
* [[Gal's accurate tables]]
* [[Gal's accurate tables]]
Line 740: Line 745:
<ref group="nb" name="NB_1">Computer hardware does not necessarily compute the exact value; it simply has to produce the equivalent rounded result as though it had computed the infinitely precise result.</ref>
<ref group="nb" name="NB_1">Computer hardware does not necessarily compute the exact value; it simply has to produce the equivalent rounded result as though it had computed the infinitely precise result.</ref>
<ref group="nb" name="NB_2">The enormous complexity of modern [[division algorithm]]s once led to a famous error. An early version of the [[Intel Pentium]] chip was shipped with a [[FDIV|division instruction]] that, on rare occasions, gave slightly incorrect results. Many computers had been shipped before the error was discovered. Until the defective computers were replaced, patched versions of compilers were developed that could avoid the failing cases. See ''[[Pentium FDIV bug]]''.</ref>
<ref group="nb" name="NB_2">The enormous complexity of modern [[division algorithm]]s once led to a famous error. An early version of the [[Intel Pentium]] chip was shipped with a [[FDIV|division instruction]] that, on rare occasions, gave slightly incorrect results. Many computers had been shipped before the error was discovered. Until the defective computers were replaced, patched versions of compilers were developed that could avoid the failing cases. See ''[[Pentium FDIV bug]]''.</ref>
<ref group="nb" name="NB_3">But an attempted computation of cos(π) yields −1 exactly. Since the derivative is nearly zero near π, the effect of the inaccuracy in the argument is far smaller than the spacing of the floating-point numbers around −1, and the rounded result is exact.</ref>
<ref group="nb" name="NB_3">But an attempted computation of <math>\cos(\pi)</math> yields <math>-1</math> exactly. Since the derivative is nearly zero near <math>\pi</math>, the effect of the inaccuracy in the argument is far smaller than the spacing of the floating-point numbers around <math>-1</math>, and the rounded result is exact.</ref>
<ref group="nb" name="NB_4">[[William Morton Kahan|William Kahan]] notes: "Except in extremely uncommon situations, extra-precise arithmetic generally attenuates risks due to roundoff at far less cost than the price of a competent error-analyst."</ref>
<ref group="nb" name="NB_4">[[William Morton Kahan|William Kahan]] notes: "Except in extremely uncommon situations, extra-precise arithmetic generally attenuates risks due to roundoff at far less cost than the price of a competent error-analyst."</ref>
<ref group="nb" name="NB_5">The [[Taylor expansion]] of this function demonstrates that it is well-conditioned near 1: A(x) = 1 − (x−1)/2 + (x−1)^2/12 (x−1)^4/720 + (x−1)^6/30240 (x−1)^8/1209600 + ... for &#124;x−1&#124; < π.</ref>
<ref group="nb" name="NB_5">The [[Taylor expansion]] of this function demonstrates that it is well-conditioned near <math>x = 1</math>: <math>f(x) = 1 - \frac{x-1}{2} + \frac{(x-1)^2}{12} - \frac{(x-1)^4}{720} + \frac{(x-1)^6}{30240} - \frac{(x-1)^8}{1209600} + \cdots</math> for <math>|x-1| < \pi</math>.</ref>
<ref group="nb" name="NB_6">If [[long double]] is [[IEEE quad precision]] then full double precision is retained; if long double is [[IEEE double extended precision]] then additional, but not full precision is retained.</ref>
<ref group="nb" name="NB_7">The equivalence of the two forms can be verified algebraically by noting that the [[denominator]] of the fraction in the second form is the [[conjugate (algebra)|conjugate]] of the [[numerator]] of the first. By multiplying the top and bottom of the first expression by this conjugate, one obtains the second expression.</ref>
<ref group="nb" name="NB_7">The equivalence of the two forms can be verified algebraically by noting that the [[denominator]] of the fraction in the second form is the [[conjugate (algebra)|conjugate]] of the [[numerator]] of the first. By multiplying the top and bottom of the first expression by this conjugate, one obtains the second expression.</ref>
<ref group="nb" name="NB_8">Octal (base-8) floating-point arithmetic is used in the [[Ferranti Atlas]] (1962)<!-- Beebe -->, [[Burroughs B5500]] (1964), [[Burroughs B5700]] (1971)<!-- Beebe -->, [[Burroughs B6700]] (1971)<!-- Beebe --> and [[Burroughs B7700]] (1972)<!-- Beebe --> computers.</ref>
<ref group="nb" name="NB_8">Octal (base-8) floating-point arithmetic is used in the [[Ferranti Atlas]] (1962)<!-- Beebe -->, [[Burroughs B5500]] (1964), [[Burroughs B5700]] (1971)<!-- Beebe -->, [[Burroughs B6700]] (1971)<!-- Beebe --> and [[Burroughs B7700]] (1972)<!-- Beebe --> computers.</ref>
Line 858: Line 862:
{{Data types}}
{{Data types}}


[[Category:Articles with example C code]]
[[Category:Computer arithmetic]]
[[Category:Floating point| ]]
[[Category:Floating point| ]]
[[Category:Computer arithmetic]]
[[Category:Articles with example C code]]