Floating point addition and subtraction #
To add, we shift the smaller number to near the bigger number, perform a 128-bit addition, and normalize the result.
Normalization starting with 128 bits #
Adjust an exponent for addition. The result is (t, d+1), where d - 65 is the amount
to shift left by, and t is the final exponent.
Equations
Instances For
The difference result expressed as a difference
add_adjust produces small d + 1
add_adjust doesn't overflow r
add_adjust normalizes r
add_adjust doesn't overflow r
add_adjust normalizes r
add_adjust doesn't make r too big
add_adjust doesn't make r too big
add_adjust never produces n = .min
add_adjust results in normalized n
Turn an almost normalized (n ≤ 2^63) value into a Floating, shifting right by at most 1
Equations
- One or more equations did not get rendered due to their size.
Instances For
add_shift_r rounds in the correct direction
High 64 bit - 128 bit subtraction #
Addition (definition) #
Exactly rounded floating point addition, with 0 < x and special cases removed
Equations
- One or more equations did not get rendered due to their size.
Instances For
Exactly rounded floating point addition, with 0 < x and special cases removed
Equations
- x.pos_add y up = Floating.add_shift_r (x.add_to_128 y up) x.s up
Instances For
Exactly rounded floating point addition, with most special cases removed
Equations
Instances For
Exactly rounded floating point addition
Equations
Instances For
Addition is correct #
add_bigger doesn't care about sign
add_bigger doesn't care about abs
add_bigger is antisymmetric
The shifting we do of y in add_to_128 produces a small value
Subtraction #
We use x - y = x + -y.