JoshJers' Ramblings

Protecting Coders From Ourselves: Better Mutex Protection

Josh Jersild — Thu, 23 Jan 2025 13:55:44 PST

At some point, if you’ve done multithreaded programming, you’ve probably used a mutex (or some other locking mechanism). Locks are relatively straightforward to understand and use (“lock this thing before accessing your data and unlock it when you’re done”), but they do have their issues. The most commonly-discussed issue with using locks is the dreaded deadlock, scourge of many a poor soul who needed to hold multiple locks at once for something.

But we’re not here to talk about deadlocks…instead, I’d like to focus on a problem that I have encountered much more frequently: accidentally reading or modifying protected data without having acquired the lock.

That’s right, we’re once again going to try to protect ourselves from our worst enemy: ourselves.

Note

As with the previous entry in this semi-series, the examples and implementation are in C++, but you can probably build something similar in your language of choice (unless it doesn’t need it).

An Easy Mistake

It turns out it’s surprisingly easy to not grab a lock before reading or writing things that need to be synchronized. Here’s a toy example:

void Foo::IncrementThing(int c)
{
  // Lock the mutex correctly!
  std::lock_guard lock { m_mutex };
  m_thing += c;
}

void Foo::DecrementThing(int c)
{
  m_thing -= c; // Whoops, no lock!
}

In the real world, “oops I accessed a thing outside of the lock” bugs tend to be more subtle - it can be especially hard to notice when a thing that should be present isn’t. Since all of the accesses are to normal members of your class/struct/global scope/whatever, it’s easy to slip and put an access to a value where it wasn’t intended, and wasn’t propertly protected.

But what if instead you could do something more like the following:

// Note: this is pseudocode, not actual C++
class Foo
{
  // ... class stuff ...

  // Mysterious m_state object that protects everything inside of it with a mutex
  MutexProtected m_state
  {
    int m_thing = 0;
  };
};

void Foo::IncrementThing(int c)
{
  lock (m_state) // Lock here
  {
    m_thing += c; // Only accessible in the lock
  }

  // m_thing += c; // Won't compile: Can't access outside of lock
}

Basically, what if you could have some sort of protected wrapper around the state that walls it off and makes it inaccessible except when the lock is actually held? With something like that it would become much more difficult to get at values when it shouldn’t be allowed.

Additionally, it would help both with organization (forcing the mutex-protected values together in the code) as well as making the intent clear: values that are protected sit inside the protected block, clarifying which values are (and are not) intended to be accessed solely through the mutex, even to folks who were not the original author (the list of which stealthily includes the original author, one month in the future).

With C++ it’s possible to get quite close to the above:

class Foo
{
  // ... class stuff ...

  // State structure for the protected member(s)
  struct State { int m_thing = 0; };

  // The mutex and the state it's protecting
  MutexProtected<State> m_state;
};

void Foo::IncrementThing(int c)
{
  // lock the mutex and get the inner state
  auto state = m_state.Lock();

  // Use the lock to access the values within.
  state->m_thing += c;
}

The nice thing is, a basic implementation of this is fairly compact. There are just two classes involved: MutexProtected and its lock object, MutexLocked.

EDIT (2025-01-25): People have pointed out two existing versions of this: boost has boost::synchronized_value, and there’s cs_libguarded which looks like it has a few variants on this concept, too. So if you don’t feel like rolling your own you can always check one of those out instead!

MutexProtected

The outer class, MutexProtected, has very few moving parts:

template <typename T>
class MutexProtected
{
public:
  MutexLocked<T> Lock()
    { return { &m_t, m_mutex }; }

private:
  std::mutex m_mutex;
  T m_t;  
};

There’s a lock function that returns a MutexLocked (which we’ll get to in a moment), and it contains both a mutex as well as a T (the templated type), which contains the data to be protected by the mutex. In the earlier example, this was the State struct that contained the protected m_thing value, but it could contain any number of values/objects that should only be accessed from within the same lock.

Note

You may be wondering “there’s a Lock, so why is there no corresponding Unlock function?” - this is because we’re taking advantage of the classic C++ idiom of RAII, so the returned MutexLocked object holds the lifetime of the lock, and unlocks the mutex when it goes out of scope. As such, there’s no need to manually unlock the mutex; it will happen automatically.

MutexLocked

But what is the thing that the Lock function returns? Well, that’s the MutexLocked class and it looks like this:

template <typename T>
class MutexLocked
{
public:
  // Add the standard pointer-like accessors.
  T *operator->() const { return m_p; }
  T &operator *() const { return *m_p; }

private:
  // Construct this with a pointer to the state and the mutex to lock.
  MutexLocked(T *p, std::mutex &mutex)
  : m_lock{std::lock_guard(mutex)}, m_p{p}
    { }

  std::lock_guard<std::mutex> m_lock;
  T *m_p = nullptr;

  template <typename T>
  friend class MutexProtected;
};

As you can see, it also doesn’t have much to it! It constructs with a pointer to the state object (the m_t in the MutexProtected) and the mutex (which it immediately locks in the constructor), and so all it does is hold onto the lock until destruction time, while giving the user a way to access the state object (via the two public operators).

So, using this setup in full:

Call Lock on the MutexProtected to get a MutexLocked
Access the protected state through that acquired object (using the -> operator) to do whatever needs to be done within the lock
Let the MutexLocked leave scope, at which point the lock is released

A nice little feature of this is if you do have a single thing that you’re doing in the lock (Say, inserting a value into a protected queue), the above steps can even fit nicely on a single line (while still being perfectly readable), thanks to the rules of C++ temporary object lifetimes:

void Foo::EnqueueItem(Item *i)
{
  state.Lock()->queue.Enqueue(i); // Lock/Update/Unlock

  // Do more stuff here outside of the lock, it's guaranteed to be released
}

Sub-Functions That Require A Lock

This type of object also helps with a secondary part of this problem, which is when your class has functions (that are likely private or protected) that require the lock to already be acquired before you call it:

// This function is intended to only be called while the lock is held
void Foo::AdjustStateWithLockHeld(int delta)
  { m_thing += delta; }

Void Foo::IncrementThing(int delta)
{
  std::lock_guard lock { m_mutex };

  // Call this function while the lock is held only
  AdjustStateWithLockHeld(delta);
}

Void Foo::DecrementThing(int delta)
{
  // Oh no I have once again forgotten to lock the mutex before doing the thing
  AdjustStateWithLockHeld(-delta);
}

In the above example, AdjustStateWithLockHeld is assuming that the lock is being held by its caller and that it’s free to manipulate the state within it. It doesn’t make much sense in this example, but if you have mutex-protected state that has a complex update process (such as multiple things needing to be kept in sync), it can be nice to move such logic to a subroutine.

Thankfully, using the MutexProtected object, the state is only accessible through a MutexLocked object. In order for the sub-function to be able to do anything with the state, it would need to additionally take a reference to the MutexLocked for the state in question, thus effectively ensuring that the mutex is locked by the caller:

// Now the function takes the locked State object as a parameter:
void Foo::AdjustStateWithLockHeld(
    MutexLocked<State> &state, 
    int delta)
  { state->m_thing += delta; }

Void Foo::DecrementThing(int delta)
{
  // Now state has to be locked to get the object to pass to the sub-function!
  auto state = m_state.Lock();
  AdjustStateWithLockHeld(state, -delta);
}

A Variation

For completeness, I also wanted to mention an alternate form of this that I thought of while designing it: where, rather than Lock returning an object that represents the scoped lock, you instead pass a lambda (or function) that gets all of the state values in it:

struct State
{
  int a, b;
};

MutexProtected<State> state;

state.Lock(
  [&](int &a, int &b)
  {
    // "a" and "b" correspond to the two values of the same name in the state structure.
    //  All the things that need to modify state values must, then, happen in this lambda,
    //  as there is no way to access the struct members directly.
    a += someVariable;
    b -= 2;
  });

The main advantage this has over the other form is that it makes it considerably more difficult to accidentally (or intentionally?) grab a reference to the internal state structure that could then persist outside of the lock. However, in practice I felt like that kind of mistake is not going to be super common relative to the problem being solved, but also should be easy to catch during code review. Plus, there are additional downsides to this version that I didn’t like:

It’s easy to get the order or names of the lambda paramters wrong and end up doing the wrong things with the wrong values since they’d effectively have to match the order in the state.
It gets harder to call sub-functions that need the lock (you’d have to pass all of the state objects that need updating which can be a pain in practice).
It’s worse to debug, since you have to step into the Lock call instead of just over it, and then into the lambda from there.
- This is also the reason I didn’t consider another variant where instead of a parameter per state object it’s a single reference to State and you access it that way.

All said, I felt like having a lock object that provides access to the inner state as-is (via the -> operator) was cleaner in practice, and easier to step through in the debugger.

Limitations

This, of course, isn’t a perfect solution:

There are absolutely cases where some things need to be accessible outside of the mutex lock (i.e. an atomic which can be safely read at any time but only gets updated from within the mutex due to sequencing issues), and as such those values couldn’t live inside of the inner state object.
Also, this is C++ so there’s nothing preventing someone from grabbing a reference or pointer to the state object from the lock and holding onto it until after the lock ends, then partying on the data. However, that kind of code is more likely to get caught at review time.
There are likely other cases (multiple locks, perhaps, which I haven’t fully thought through with this because I haven’t needed to) where this would present problems. This is definitely primarily intended for the “I have a set of data that should only ever be accessed during a lock” case.

Potential Improvements

Also, there are some improvements that could be made:

Perhaps instead of using a lock_guard you could use a unique_lock which would let you wait on a critical section using the lock (if you need to wait for the state to be in some specific configuration), which could even be built into the MutexProtected class (or a similar one) if the critical section is a core part of the usage.
It’s a good idea to declare a custom move constructor and move assignment operator for MutexLocked that nulls the m_t pointer of the object being moved from so that you can’t do a move of it to some other location (which subsequently releases the lock) and then still party on the internal pointer. (I also have asserts in my production version in the pointer-access operators that assert the pointer is non-null)
Similarly, it may be nice to have a constructor for MutexProtected that takes constructor arguments for the contained state structure (similar to, say std::vector::emplace_back), especially if you are going to have state structures that cannot default construct.
There are also other flavors of locking that might occur: for instance, a TryLock function that returns a std::optional>, and only locks the lock (and returns the locked state) if there is no contention on the mutex.

Closing Time

All in all, having an abstraction like this makes it way more difficult to party all over internal state without properly locking the mutex first. Switching some old code to use this actually found a couple places where I’d done things incorrectly (reading values that should have been mutex protected on read, in those cases).

So, yeah, by protecting our data from being accessed when it shouldn’t be, we’re also protecting ourselves from ourselves.

Protecting Coders From Ourselves: Min, Max, Lerp, and Clamp

Josh Jersild — Fri, 10 Jan 2025 23:13:29 PST

Imagine this: you’ve got some value, x, that you want to ensure is at least 1. That is to say, you want to ensure its minimum value is 1. So, being the smart, experienced programmer that you are, you write the following:

x = Min(x, 1);

You give yourself the small, satisfied nod of a job well done and run the program and then it all goes immediately sideways because that should have been Max and not Min.

If you’ve been writing code for basically any length of time, the above was probably less an “imagine this” and more a “remember this” because if you’re anything like me, you’ve done this over. and over. and over.

Inspired by once again mistakenly using Min instead of Max to limit the minimum allowed value of something, I’ve decided to start a little series (will it have more than one entry? who knows!) called Protecting Coders From Ourselves, in which we rework some bit of API surface to make it less error-prone. We’re going to deal with the “I chose the wrong Min/Max again” problem, but first I want to talk about Clamp and Lerp.

Note

The examples here are in C++, but the concepts should be relevant to basically any language.

Clamp and Lerp

Clamp

Clamp is a simple enough function: Take some value v and make sure it is no less than min and no greater than max. Almost every clamp function in every library I’ve seen has three parameters, one of which represents the value to be clamped, and two of which represent the range that it should be clamped within:

Clamp(a, b, c)

The question, of course, is “which parameter is which?” Many languages (ex: C++, C#, HLSL, and Rust) have the following arrangement in their standard libraries:

Clamp(valueToClamp, min, max)

where the first parameter is the value being clamped and the last two are the min and max ends of the range.

But I’ve also seen this one (looking at you, CSS):

Clamp(min, valueToClamp, max)

This one puts the value to clamp in the middle of the range (which, honestly, is conceptually where it belongs).

Lerp

Another common function that takes a value and a range is Lerp, which uses a value in the range [0, 1] to linearly interpolate between two endpoint values. Most lerp functions that I’ve seen (ex: C++, C#, and HLSL) have their parameters ordered as follows:

Lerp(a, b, t)

where a and b are the range endpoints and t is the interpolating value.

Depending on the projects you work on, you maybe don’t write code that uses Lerp very often (or ever), but I do, and for me the combination of Lerp and Clamp are a source of constant, mild confusion.

Parameter Confusion

Both of these functions take three parameters, two of which represent a range and one which is a value that is either limited by or used to interpolate within the range. In isolation it’s easy to rationalize the order of the parameters for each:

Clamp(v, min, max): Clamp v to be within the range [min, max].
Clamp(min, v, max): Clamp such that min <= v <= max.
Lerp(a, b, t): Get a linearly-interpolated value between a and b using t.

…but in combination I am constantly second-guessing which order the parameters need to go in. Sometimes, for instance, I’ll write the equivalent of Lerp(t, a, b) and wonder why nothing is working the way I expect.

That brings us (finally) to the question of the article: how can we make it clear which parameters are which in these functions?

An obvious way to do this, given language support, is to make use of named parameters when calling the function:

Clamp(v=value, min=0, max=5)

but not all languages (looking at you, C++) support named parameters, so what then?

Grouping the Range Values

What if instead of the above, calls looked more like this:

Clamp(a, {b, c});
Lerp({a, b}, c);

With this added structure, it’s clear even without reasonable variable names which part is the range and which is the value, and it’s much more difficult to accidentally call them with the parameters in the wrong order, since, for instance, Lerp(t, {a, b}) wouldn’t even compile.

To do this, we need a simple range structure. in C++ it could look something like this:

template <typename T>
struct ValueRange
{
  // Use a constructor to ensure both endpoints are required.
  ValueRange(T a_, T b_)
    : a{a_}, b{b_} {}

  T a;
  T b;
};

Using this range, then, you could define Clamp and Lerp as follows:

template <typename T>
T Clamp(T v, ValueRange<T> range)
{
  return Max(range.a, Min(range.b, v)); 
}

template <typename T>
T Lerp(ValueRange<T> range, T t)
{
  return range.b * t + range.a * (1 - t);
}

Now instead of three parameters, they take two, which matches how they work conceptually: with a range and a value in some order.

Once you start grouping your input parameters , you may start seeing other places to do it, like IsInRange<Inclusive>(v, {0, 20}).

Min and Max

Back to the Min/Max problem. As stated perfectly by Tom Forsyth:

“Almost every time I use [min or max], I think very carefully and then pick the wrong one.”

This tends to happen because you think “I need to make sure the max value of x is 10” and it just feels right to turn that into Max(x, 10)…that’s how it gets you. Or, well, me. That’s how it gets me. Basically every time I have to do this, I will choose the wrong one on the first try, have my code explode on me, and then go back and headdesk at it until it turns into the correct one.

Basically. Every. Time.

Reframing The Problem

But what if you looked at it a different way? What if you instead thought of it as “I want to clamp x so that it’s no larger than 10”? You want something that just clamps one end of it, in a clear way. What if you could declare a one-sided clamp:

// Using "Open" to declare a side of the range is open. Same as:
//  x = Min(x, 10);
x = Clamp(x, {Open, 10});

// Another alternative, ensure that y is no smaller than 2, same as: 
//  y = Max(y, 2);
y = Clamp(y, {2, Open});

To do this efficiently, we’ll have multiple overloads of Clamp, and define two additional “range” structures, each of which takes an OpenEnded_t:

// Declare this as a nice, type-safe enum class
enum class OpenEnded_t
{
  Open,
};

// But make "Open" easy to reach using C++20's "using enum" feature.
using enum OpenEnded_t;

// This is a "range" where only the "a" value is specified, the "b" end is open
template <typename T>
struct ValueRangeOpenB
{
  ValueRangeOpenB(T a_, OpenEnded_t)
    : a(a_) { }
  T a;
};

// Like the above, but it's the "b" end that's specified while "a" is open
template <typename T>
struct ValueRangeOpenA
{
  ValueRangeOpenA(OpenEnded_t, T b_)
    : b(b_) { }
  T b;
};

Note

You can, of course, call Open whatever you’d prefer: I considered many options (including Unbounded, Infinite, OpenEnded, and None), but Open was short and, to my mind, clear.

If you have a global Open function (or you’re in a class that has a function named Open), this likely won’t work. You could add a second enum value called OpenEnded that could be used interchangeably with Open for that case, or just specify it fully qualified, or just pick a less-inconvenient name.

Once you have these structures, you can define two additional overloads of Clamp:

template <typename T>
T Clamp(T v, ValueRangeOpenB<T> range)
  { return Max(v, range.a); }


template <typename T>
T Clamp(T v, ValueRangeOpenA<T> range)
  { return Min(v, range.b); }

These just turn into the correct call to Min or Max, but now you can think of it in terms of limiting one side of its range or the other, rather than trying to A Beautiful Mind your way into picking the correct function right off the bat.

Now, finally, you can limit a value in multiple ways using the same concept, which can make it easier to reason about when you’re writing the code, and also easier to understand when you’re reading it a month later.

// Keep within a range:
x = Clamp(x, {1, 5});

// Limit the lower bound:
y = Clamp(y, {1, Open});

// Limit the upper bound:
z = Clamp(z, {Open, 5});

Final Thoughts

These are ideas I first proposed at my job, and got immediate buy-in from the dev team, because we all kept making the same kinds of mistakes with these functions. There are, of course, times when Min and Max are still the appropriate function to use (like when you’re thinking “I need the minimum of these values”) - but when you’re trying to limit the range of something, Clamp is a clearer declaration of intent.

There are ways to improve these functions:

In C++ I highly recommend making all of this constexpr (including the constructors) so that you can use these functions at compile time as well.
- Depending on your codebase it may be desirable to additionally mark them [[nodiscard]] and noexcept.
- Also, restricting the template types using C++20 concepts can help give you better error messages if you try to compile it with something that it can’t work with.
The full-range Clamp function may want to have some validation that b >= a (perhaps an assert).

Parser Stuff: Strings

Josh Jersild — Fri, 03 Jan 2025 13:38:29 PST

A while back I was working on a programming language idea and while I haven’t made any progress on it in ages, I really liked the string design that I came up with. I don’t know that any of the ideas are original, but I haven’t seen anything exactly like it so I figure I’d throw the idea out into the ether in case anyone else happens to do something similar in their own personal language that they definitely shouldn’t be making 😆

(I’ll note upfront: this design uses dollar signs ($) and backticks(`). There are many languages that do so (like Javascript!) but these keys are not universally on all keyboards internationally so for those locales it may be more difficult to type these out…my language design was pretty much just for me with my standard US keyboard so I didn’t take this into account)

Basic Strings

There’s nothing fancy about these, they’re just like most other languages’ basic strings:

"This is a string"
"This is a string with a newline at the end: \n"
"Quotes? \"Escape them\""
"Oops forgot to end this one, it's a compiler error

Just a pair of double quotes with everything between being non-quotes (or escaped quotes), contained within a single line.

Raw Strings

Raw strings in my language design are kind of a blend between C++11-style raw strings and C#11 raw strings (the elevens!), using a different delimiting character variation than I’ve seen elsewhere. In this case, the simplest one starts and ends with a pair of backticks:

``This is a single raw string
it is multiple lines long``

It can contain any character sequence except the delimiting sequence (again, at simplest a pair of backticks: ``)

But what if your string needs to contain a consecutive pair of backticks? This is where the C++11 raw string inspiration comes in: you can put any string of characters between the ticks (excluding ticks, obviously, or newlines), and then the start and end have to match.

`uniqueString`This is a single string
it contains backticks without terminating: `` ... see?
This is the last line and ends here:`uniqueString`

This one starts with `uniqueString`, and so the only thing that will terminate it is that same sequence: `uniqueString` (with the tick marks around it).

To add to this, cribbing from C#11 it will:

Trim the very first newline if there is one
Also trim the last newline if there is one
Unindent every line of it based on the indentation of the final quote sequence:

myString = ``
  This is actually the first line of the string, the newline was ignored
  {
    indented further
  }
  ``; // Note that this is indended 2 spaces

which turns into the string (note the lack of being completely indented:

This is actually the first line, the newline was ignored
{
  indented further
}

This makes it easier to generate code (or text files or whatever) that are properly indented, without having to make the indenting of the string in your code all weird.

Interpolated Strings

I’m additionally adding string interpolated string support (which is a weird term), using a mix of C# and Javascript’s setup:

$"This string has a ${value} in it"

If a string starts with $, it’s treated as an interpolated string. A string-convertible expression can be inserted in-place in the string within ${}.

But what if you need to have the character sequence ${ in your string? Add more dollar signs to the start, and you need that many dollar signs before a { to enter the Interpolation Zone:

$$"This string has a $${value} in it, ${but this isn't one}"

Raw strings can also be used as interpolated strings (making for some nice codegen), same rules apply:

    $$``
      Interpolated string with
        multiple lines and a $${value} in it.
        ${this is not a value because only one $}
      ``

If value == 5 this would turn into the following string (upon formatting):

Interpolated string with
  multiple lines and a 5 in it.
  ${this is not a value because only one $}

I Just Think They’re Neat

Anyway, I think this is a really nice combination of properties that make it easy to format strings nicely without being overly-complicated to actually use (unlike C++'s raw strings, which I have to look up literally every time I need to use one). Need a hardcoded regex or path with backslashes? In most cases, just use a raw string with `` on either end:

``C:\Path\With\Single\Backslashes``

``^[\r\n \t]*Hi[\r\n \t]*$``

Hope this was at least mildly interesting to someone!

Emulating the FMAdd Instruction, Part 2: 64-bit Floats

Josh Jersild — Thu, 02 Jan 2025 02:22:15 PST

(This post follows Part 1: 32-bit floats and will make very little sense without having read that one first. Honestly, it might make little sense having read that one first, I dunno!)

Last time we went over how to calculate the results of the FMAdd instruction (a fused-multiply-add calculated as if it had infinite internal precision) for 32-bit single-precision float values:

Calculate the double-precision product of a and b
Add this product to c to get a double-precision sum
Calculate the error of the sum
Use the error to odd-round the sum
Round the double-precision sum back down to single precision

This requires casting up to a 64-bit double-precision float to get extra bits of precision. But what if you can’t do that? What if you’re using doubles? You can’t just (in most cases) cast up to a quad-precision float. So what do you do?

Double-Precision FMAdd

(Like the 32-bit version, this is based on Emulation of FMA and correctly-rounded sums: proved algorithms using rounding to odd by Sylvie Boldo and Guillaume Melquiond, there are additional details in there)

To do this natively as doubles, we need to invent a new operation: MulWithError. This is the multiplication equivalent of the AddWithError function from the 32-bit solution:

(double prod, double err) MulWithError(double x, double y)
{
  double prod = x * y;
  double err = // ??? how do we do this
  return (prod, err);
}

We’ll get to how to implement that in a moment, but first we’ll walk through how to use that function to calculate a proper FMAdd.

We need to do the following:

Calculate the product of a and b and the error of that product
Calculate the sum of that product and c (giving us a * b + c) and the error of this sum
- We’re not using the error of the product … yet
Add the two error terms (product error and sum error) together, rounding the result to odd
Add this summed error term to our actual result, which will round normally.

In code, that looks like this:

// Start with an "OddRoundedToAdd" helper since we do
//  this operation frequently
double OddRoundedAdd(double x, double y)
{
  (double sum, double err) = AddwithError(x, y);
  return RoundToOdd(sum, err);
}

double FMAdd(double a, double b, double c)
{
  (double ab, double abErr) = MulWithError(a, b);
  (double abc, double abcErr) = AddWithError(ab, c);

  // Odd-round the sum of the two errors before 
  //  adding it in to the final result.
  double err = OddRoundedAdd(abErr, abcErr);
  return abc + err;
}

By keeping the error terms from both the product and the sum, we have kept all of the exact result. That is, we can assemble the mathematically-exact result given enough precision by doing abc + abErr + abcErr.

But we can’t do infinite-precision addition of three values. However, we can odd-round an intermediate result, the same way we did with the single-precision case.

In this case, we know that abErr and abcErr both (necessarily) have much lower magnitudes than the final result, as each error value’s highest bit is lower than the lowest bit of the mantissa of their respective operations. So, if we odd-round the sum of these two values, it actually effectively fulfills the condition of having more bits of precision than the final result. Thus, if we add the error terms together with odd rounding, the odd-rounded fake-sticky final digit will be taken into account by the actual sticky bit used when doing the final sum of the result and error terms.

I hope that makes sense?

Breaking Down Multiplication With Error

(Like AddWithError, this is based on A Floating-Point Technique For Extending the Available Precision by T.J. Dekker)

So how do we calculate the error term of a 64-bit multiply? We can’t use 128-bit values, but what we can do is break each 64-bit value up into two values, each with less bits of precision.

We’ll break x and y (our two multiplicands) up into high and low values, where:

x = xh + xl;
y = yh + yl;

We do this by breaking the double’s mantissa up:

xh contains the top 25 bits of x’s mantissa (plus its implied 1, giving it 26 bits of precision).
xl contains the bottom 27 bits of x’s mantissa. The highest-set ‘1’ in this mantissa will become the implied 1 bit that’s part of the standard floating-point format, so this value will have 27 bits of precision, max.
yh and yl are the same, but for y.

(double h, double l) Split(double v)
{
  // In C++ this Zero function can be implemented by  masking off
  //  the bottom 27 bits by casting to a 64-bit int:
  // constexpr uint64_t Mask = ~0x07ff'ffff;
  // double h = std::bit_cast(
  //              std::bit_cast(v) & Mask);
  double h = ZeroBottom27BitsOfMantissa(v);

  // We can get the lower bits of the mantissa (correctly normalized and
  //  with correct signs) by subtracting the extracted upper bits from 
  //  the original value.
  double l = v - h;
  return (h, l);
}

What does this split give us? Well, we can now break the multiplication up into a sum of multiplies that now each have enough bits of precision to be exactly representable (27BitValueA * 27BitValueB == 54BitValue, which fits perfectly in a double (with the implied 1 bit), using our old friend from Algebra, FOIL:

    x * y 
= (xh + xl) * (yh + yl) 
= xh*yh + xh*yl + xl*yh + xl*yl;

We can’t actually do those adds directly, but what we can do is similar to how we did AddWithError: use a sequence of precision-preserving operations to calculate the difference between that idealized result and our rounded product:

(double prod, double err) MulWithError(double x, double y)
{
  double prod = x * y;

  (xh, xl) = Split(x);
  (yh, yl) = Split(y);

  // Parentheses to demonstrate the precise order these 
  //  operations must occur in
  double err = (((xh*yh - prod) + xh*yl) + xl*yh) + xl*yl;
  return (prod, err);
}

It works like this:

Calculate the (rounded) product of x and y
Subtract that rounded product from the product of xh and yh
- These should have roughly the same magnitude (and definitely the same sign) so this is a precision-preserving subtraction.
- Since |xh * yh| <= rounded(|x * y|) (because xh and yh are truncated versions of x and y and thus have lower magnitudes) this is a smaller - larger operation and we’ll get a result with a sign opposite that of the final product.
Keep adding in next-lower-magnitudes of values, which will continue to preserve precision
- (because we have a value that is opposite-sign these are effectively subtractions, in the same way that a + -b is)
- It’s also worth noting here that xh*yl and xl*yh will have equivalent magnitudes so the order that you add them in doesn’t matter, as long as they’re both after xh*yh and before xl*yl

Once you’ve done that, you have the computed product as well as the error term, and we can then follow our FMAdd algorithm above to calculate the FMAdd.

So, that’s it, we’re done, right?

Edge Cases

Nope! Well, yes if you just wanted the gist, but now it’s time to get into all those annoying implementation details that the papers this is based on completely glossed over. Here’s where it gets ugly (unless you thought it was already ugly, in which case, sorry, it’s about to get worse somehow).

In our single-precision case, we didn’t have to worry about exponent overflow or underflow because we were using double-precision intermediates, which not only have additional mantissa range, but also additional exponent range.

It’s possible that the product of a * b (an intermediate value in our calculation) goes out of range of what a double can represent, but that the addition of c might bring the final result back into range (which can happen when the sign of c is opposite the sign of a * b). This causes a different set of errors on either end:

If a*b is too large to be represented, it turns into infinity which means that adding c in will just leave it as infinity even though the final result should have been a representable value (albeit one with a very large magnitude)
If a*b is too small to be represented, it will go subnormal which means bits of the intermediate result will slide off the bottom of the mantissa and we lose bits of information, which can causes us to round incorrectly at our final result.

To solve this, we’ll introduced a bias into the calculation, for when the value goes very small or very large:

double CalculateFMAddBias(double a, double b, double c)
{
  // Calculate what our final result would be if we just did it normally
  double testResult = Abs(a * b + c);

  if (testResult < Pow2(-500) && Max(a, b, c) < Pow2(800))
  {
    // Our result is very small and our maximum value is not so large 
    //  that we'll blow up with a bias, so bias our values up to
    //  ensure we don't go subnormal in our intermediate result
    return Pow2(110);
  }
  else if (IsInfinite(testResult))
  {
    // We hit infinity, but that might be due to exponent overflow,
    //  so bias everything down (this may cause c to go subnormal, 
    //  but if that's the case then a*b on its own is infinity and
    //  so it won't affect the final result in any way)
    return Pow2(-55);
  }
  else
  {
    // No bias needed
    return 1.0;
  }
}

For any results that aren’t extreme, the bias will remain 1.0 but, for values at the extremes, we’ll scale our intermediates down (using powers of 2 which only affect the exponent and not the mantissa) into a range such that we can’t temporarily poke outside of range. Also note that my choices of powers of 2 are not perfectly chosen, I didn’t bother trying to figure out the exact right biases/thresholds so I just picked ones that I knew were good enough.

So then we do our FMAdd calculation as before, but with the bias introduced (and then backed out at the end):

// Do our multiplication with the bias applied to 'a'
//  (the choice of applying it to 'a' vs 'b' is completely
//  arbitrary)
(double ab, double abErr) = MulWithError(a * bias, b);

// Then the sum with the bias applied to 'c'
(double abc, double abcErr) = AddWithError(ab, c * bias);

err = OddRoundedAdd(abErr, abcErr);

// Calculate our final result then un-bias the result.
return (abc + err) / bias;

Alright, we’ve avoided both overflow and underflow and everything is great, right?

Two (Point Five) Last Annoying Implementation Details

Nope, sorry again! It turns out there are still two cases we need to deal with.

Case 1: Infinity or NaN even with the bias

If our result (without error applied) hits infinity even with the avoid-infinity bound, then we should just go ahead and return now to avoid Causing Problems Later (that is, turning what should be infinity into a NaN). And if it’s already NaN we can just return now because it’s going to be NaN forever.

Except, there’s one additional necessary check here, for a case caught by dzaima over on Bluesky): in the event a and b are finite numbers but a * b blows up to infinity, and then c is the opposite infinity, the correct return value is whichever infinity (positive or negative) c is, so in our early-out check has to catch that case as well:

  if (IsInfiniteOrNaN(abc))
  { 
    // If we got NaN (or Inf, which won't affect the output) and 
    //  a and b are both finite but c is infinite, return c (without
    //  this check, we will incorrectly return NaN instead of -Inf
    //  for FMAdd(1e200, 1e200, -Infinity))
    if (IsInfinite(c) && !IsInfiniteOrNaN(a) && !IsInfiniteOrNaN(b))
      { return c; }

    // Otherwise, return whichever Inf or NaN we got directly;
    return abc;
  }

Case 2: Subnormal Results

If our result is subnormal (after the bias is backed out), then it’s going to lose bits of precision as it shifts down (because the exponent can’t go any lower so instead the value itself shifts down the mantissa), which means whoops here’s another rounding step, and the dreaded double-rounding returns.

In this case we need to actually odd-round the addition of the error term as well, so that when the bias is backed out and it rounds, it does the correct thing:

// Multiply the smallest-representable normalized value by our avoid-
//  subnormal bias. Any (biased) value below this will go subnormal.
//  (In production code it'd be nicer to use something like
//  std::numeric_limits instead of hard-coding -1022)
const double SubnormThreshold = Pow2(-1022) * AvoidDenormalBias;

if (bias == AvoidSubnormalBias && Abs(abc) < SubnormThreshold)
{
  // Odd-round the addition of the error so that the rounding that 
  //  happens on the divide by the bias is correct.
  (double finalSum, finalSumErr) = AddWithError(abc, err);
  finalSum = RoundToOdd(finalSum, finalSumErr);
  return finalSum / bias;
}

And this almost works, except there’s one more annoying case, and that’s where our result is going subnormal, but only by exactly one bit. Remember that the odd-rounding trick only works if we have two or more bits so that the final rounding works properly, but in this case we’re truncating the mantissa by exactly one bit, so we have to do even more work:

Split the value that will be shifting down into a high and low part (same as we did for the multiply)
Add our error term to the low part of it
- This preserves additional bits of the error term since we gave ourselves more headroom by removing the upper half of its mantissa
Remove the bias from both the high and low parts separately
- Removing the bias from the high part doesn’t round since we know the lowest bit is 0
- Removing the bias from the low part applies the actual final rounding (correctly) since we gave ourselves more bits to work with
Sum the halves back together and return that as our final result
- This sum is (thankfully) perfectly representable by the final precision and doesn’t introduce any additional error.

const double OneBitSubnormalThreshold = 
  OneBitSubnormalThreshold * 0.5;
if (Abs(finalResult.result) >= k_oneBitDenormThreshold)
{
  // Split into halves
  (rh, rl) = Split(finalSum);

  // Add the error term into the low part of the split
  rl = OddRoundedAddition(rl, finalSumErr);

  // Scale them both down by the bias. Note that 
  //  the rh division cannot round since the lowest bit
  //  is 0
  rh /= bias;

  // This division is what actually introduces the final
  //  rounding (correctly, since we gave ourselves more
  //  bits to work with)
  rl /= bias;

  // This sum is perfectly representable by the final
  //  precision and will not introduce additional error.
  return rh + rl;
}

OMG Are We Done Now?

As far as I’m aware, those are all the implementation details to doing a 64-bit double-precision FMAdd implementation. It’s conceptually not that much more complicated than the 32-bit one, but mechanically it’s worse, plus there are those fun extra edge cases to consider.

Here’s the final code:

(double h, double l) Split(double v)
{
  double h = ZeroBottom27BitsOfMantissa(v);
  double l = v - h;
  return (h, l);
}

(double prod, double err) MulWithError(
  double x, 
  double y)
{
  double prod = x * y;

  (xh, xl) = Split(x);
  (yh, yl) = Split(y);
  double err = 
    (((xh*yh - prod) + xh*yl) + xl*yh) + xl*yl;
  return (prod, err);
}

double OddRoundedAdd(double x, double y)
{
  (double sum, double err) = AddwithError(x, y);
  return RoundToOdd(sum, err);
}

double FMAdd(double a, double b, double c)
{
  const double AvoidSubnormalBias = Pow2(110);
  double bias = 1.0;
  {
    // Calculate our final result as if done normally
    double testResult = Abs(a * b + c);

    // Bias if the result goes too low or too high
    if (testResult < Pow2(-500) && Max(a, b, c) < Pow2(800))
      { bias = AvoidSubnormalBias; } // too low
    else if (IsInfinite(testResult))
      { bias = Pow2(-55); } // too high
  }

  // Calculate using our bias
  (double ab, double abErr) = MulWithError(a * bias, b);
  (double abc, double abcErr) = AddWithError(ab, c * bias);

  // Check for infinity or NaN and return early
  if (IsInfiniteOrNaN(abc))
  { 
    // Handle the case of "a multiply of two finite values hit infinity
    //  even *with* the bias, but c is the opposite infinity" case and
    //  return the correct result  of "c"
    if (IsInfinite(c) && !IsInfiniteOrNaN(a) && !IsInfiniteOrNaN(b))
      { return c; }

    // Otherwise just return the inf or nan directly
    return abc; 
  }

  // Odd-round the intermediate error resultt
  double err = OddRoundedAdd(abErr, abcErr);

  // Multiply the smallest-representable normalized value by our avoid-
  //  subnormal bias. Any (biased) value below this will go subnormal
  const double SubnormThreshold = Pow2(-1022) * AvoidSubnormalBias;

  if (bias == AvoidSubnormalBias && Abs(abc) < SubnormThreshold)
  {
    (double finalSum, finalSumErr) = AddWithError(abc, err);

    // This is half of SubnormThreshold. Any value between SubnormThresold
    //  and this value will only lose a single bit of precision when
    //  the bias is removed, which requires some extra care
    const double OneBitSubnormalThreshold = 
      OneBitSubnormalThreshold * 0.5;
    if (Abs(finalSum) >= OneBitSubnormalThreshold)
    {
      // Split into halves
      (rh, rl) = Split(finalSum);

      // Add the error term into the LOW part of our split value
      rl = OddRoundedAdd(rl, finalSumErr);

      // Divide out the bias from both halves (which will cause rl to
      //  round to its final, correctly-rounded value) then sum them 
      //  together (which is perfectly representable).
      rh /= bias;
      rl /= bias;
      return rh + rl;
    }
    else
    {
      // For more-than-one-bit subnormals, we do an odd-rounded addition of
      //  the error term and then divide out the bias, doing full rounding
      //  just once.
      finalSum = RoundToOdd(finalSum, finalSumErr);
      return finalSum / bias;
    }
  }
  else
  {
    // Not subnormal, so we can calculate our final result normally and un-
    //  bias the result.
    return (abc + err) / bias;
  }
}

Compare that to the 32-bit version and you can see why this one got its own post:

float FMAdd(float a, float b, float c)
{
  double product = double(a) * double(b);
  (double sum, double err) = AddWithError(product, c);
  sum = RoundToOdd(sum, err);
  return float(sum);
}

Hopefully you never have to actually implement this yourself, but if you do? I hope this helps.

Emulating the FMAdd Instruction, Part 1: 32-bit Floats

Josh Jersild — Tue, 31 Dec 2024 21:51:24 PST

A thing that I had to do at work is write an emulation of the FMAdd (fused multiply-add) instruction for hardware where it wasn’t natively supported (specifically I was writing a SIMD implementation, but the idea is the same), and so I thought I’d share a little bit about how FMAdd works, since I’ve already been posting about how float rounding works.

So, screw it, here we go with another unnecessarily technical, mathy post!

What is the FMAdd Instruction?

A fused multiply-add is basically doing a multiply and an add as a single operation, and it gives you the result as if it were computed with infinite precision and then rounded down at the final result. FMAdd computes (a * b) + c without intermediate floating-point error being introduced:

float FMAdd(float a, float b, float c)
{
  // ??? Somehow do this with no intermediate rounding
  return (a * b) + c; 
}

Computing it normally (using the code above) for some values will get you double rounding (explained in a moment) which means you might be an extra bit off (or, more formally, one ULP) from where your actual result should be. An extra bit doesn’t sound like a lot, but it can add up over many operations.

Fused multiply-add avoids this extra rounding, making it more accurate than a multiply followed by a separate add, which is great! (It can also be faster if it’s supported by hardware but, as you’ll see, computing it without a dedicated instruction on the CPU is actually surprisingly spendy, especially once you get into doing it for 64-bit floats, but sometimes you need precision instead of performance).

Double Rounding

Double rounding happens when the intermediate value rounds down (or up), then the final result also rounds in the same direction - but because of the first rounding, actually overshoots the correctly-rounded final value by a bit.

Here’s an example using two successive sums of some 4-bit float values. We’ll do the following sum (in top-down order):

  1.000 * 2^4
+ 1.001 * 2^0
+ 1.100 * 2^0

The first sum, done with “infinite” internal precision, looks like this:

  1.000 * 2^4
+ 1.001 * 2^0
  -----------
  1.000 0000 * 2^4
+ 0.000 1001 * 2^4
  ----------------
  1.000 1001 * 2^4

If we were to then use that result directly (with no intermediate rounding) and do the second sum, only rounding the final result:

   1.000 1001 * 2^4
 + 0.000 1100 * 2^4
   ----------
 = 1.001 0101 * 2^4
-> 1.001      * 2^4 // Rounds down

The final result rounds (to nearest) to 1.001.

However, if we were to round that intermediate value to 4 bits first, we’d get this:

   1.000 1001 * 2^4
-> 1.001      * 2^4 // Rounded up
 + 0.000 1100 * 2^4
   ----------
   1.001 1100 * 2^4
-> 1.010      * 2^4 // Up again

In this one, we end up with 1.010 instead of 1.001 because of the intermediate rounding, which pushed us past the correctly-rounded final result.

How to Pretend That You Have Infinite Precision

Okay, for FMAdd we want to calculate a multiply, and then somehow throw an add in there and have it act as if we didn’t lose any precision on the multiply.

First we’re going to handle the case of 32-bit floats (singles) because it’s a wildly simpler case on CPUs that have 64-bit floats (doubles).

(also, sorry in advance, the term “double” for a “double-precision float” and the “double” in “double rounding” are two different instances of “double” but I’ve written so much of this post and like hell am I changing it now so hopefully it’s not too confusing)

The immediately obvious thing to try to get an accurate single-precision FMA is “hey, what if we do the multiply and add as doubles and then round the result back down to a single”:

float FMAdd(float a, float b, float c)
{
  // Do the math as 64-bit floats and truncate at the end. 
  //  Surely that's good enough, right?
  return float((double(a) * double(b)) + double(c)); 
}

While that gives a much better result than doing it as pure 32-bits, it actually can still have double rounding. But where does the extra rounding come from, in this case?

The multiply itself isn’t the source of the first rounding: Surprisingly (to me, at least): casting two singles to doubles and multiplying those together always results in an exact answer - this is because each of the single-precision values has 24 bits of precision, but a double can store 53 bits of precision, which is more than enough to store the result of multipling two singles (2 * 24 bits of precision max). Since floats are stored as:

sign * 1.mantissa * 2^(exponent)

…it means we’re multiplying two numbers of the form 1.xxxxxxxxxx and 1.yyyyyyyyy together then adding the exponents together to get the new number, so unlike addition and subtraction (where, say, 1 + 1*10^60 requires a ton of extra precision), if two float numbers have wildly different exponents it doesn’t actually matter because the exponents and significand values are handled separately.

To illustrate this, let’s pretend we have two 4-digit (base 10) numbers and we multiply them and store the result using 8 digits (double precision):

  (1.234 * 2^1) * (1.457 * 2^100)
-> (1.2340000 * 1.4570000) * (2^1 * 2^100)
 =  1.7979380 * 2^101  // no rounding here!

Great, so the double-precision multiply is fine and introduces no rounding at all. So then how do we get double rounding?

As mentioned above, an add (or subtract) can introduce rounding:

  (1.234 * 10^0) + (1.457 * 10^9)
-> (1.2340000 * 10^0) + (1.4570000 * 10^9)
 = (1.45700001234 * 10^9) // Too many digits!
-> 1.45700000 // Rounded to nearest here

This rounding happens at double precision (so well below the threshold of our target 32-bit result), but there’s still rounding, and then the value is rounded again when converted back down to single precision. That’s the double rounding and the source of a potential error.

Okay, so, double rounding is bad? Kinda! But it turns out there is a way to introduce a new rounding mode to use for the first rounding that, in the right situations, does not introduce any additional error and ensures that your final result is correct.

A New Rounding Mode?

(This technique is based off of the paper Emulation of FMA and correctly-rounded sums: proved algorithms using rounding to odd by Sylvie Boldo and Guillaume Melquiond; if you want the full technical details of this whole process, that’s where you’ll find them. Believe it or not, I’m actually trying to go into less detail!)

The key to eliminating the extra precision loss is by using a non-standard rounding mode: rounding to odd. Standard floating point rounding calculates results with some additional bits of precision (three bits, to be precise), and then rounds based on the result (usually using “round to nearest with round to even on ties”, although that detail doesn’t end up mattering here - this technique works with any standard rounding mode).

So, assume that we have some way of calculating a double precision addition and also having access to the error between the calculated result and the mathematically exact result. Given those two values we can perform a Round To Odd step:

double RoundToOdd(double value, double errorTerm)
{
  if (errorTerm != 0.0 // if the result is not exact
    && LowestBitOfMantissa(value) == 0) //and mantissa is even
  {
    // We need to round, so round either up or down to odd
    if (errorTerm > 0)
    {
      // Round up to an odd value
      value = AddOneBitToMantissa(value);
    }
    else // (errorTerm < 0)
    {
      // Round down to an odd value
      value = SubtractOneBitFromMantissa(value);
    }
  }
}

Basically: if we have any error at all, and the mantissa is currently even, either add or subtract a single bit’s worth of mantissa, based on the sign of the error.

(In practice, I found that I also had to ensure the result was not Infinity before doing this operation, since I implemented this using some bitwise shenanigans that would end up “rounding” Infinity to NaN, so, you know, watch out for that).

Why Does Odd-Rounding the Intermediate Value Work?

Round to odd works as long as we have more bits of value than the final result - specifically we need at least two extra bits. Standard float rounding makes use of something called a “sticky” bit - basically the lowest bit of the extra precision is a 1 if any of the bits below it would have been 1.

And, hey, that is basically what “round to odd” does!

If the mantissa is odd, regardless of whether there’s error or not the lowest bit is already odd.
If the error was positive and the mantissa was even, we set the lower bit to 1 anyway, effectively stickying (yeah that’s a word now) all the error bits below it.
If the error was negative and the mantissa was even, we subtract 1 from the mantissa, making the lower bit odd, and effectively sticky since some of the digits below it are also 1s.

Effectively, round to odd is just “emulate having a sticky bit at the bottom of your intermediate result” - that way, you have a guaranteed tiebreaker for the final rounding step.

But note that I said it requires you to have at least two extra bits. In the case of our using-doubles-instead-of-singles intermediate addition, good news: we have way more than two extra bits - our intermediate value is a whole-ass double-precision float, so we have 29 extra bits vs. our single-precision final value and (mathematically speaking) 29 is greater than 2.

So, for the true single-precision FMAdd instruction we need to do the following:

float FMAdd(float a, float b, float c)
{
  double product = double(a) * double(b); // No rounding here
  
  // Calculate our sum, but somehow get the error along with it
  (double sum, double err) = AddWithError(product, c);

  // Round our intermediate value to odd
  sum = RoundToOdd(sum, err);

  // Final rounding here, which now does the correct thing and gives us
  //  a properly-rounded final result (as if we'd used infinite bits)
  return float(sum);
}

That’s it! …wait, what’s that AddWithError function, we haven’t even–

Calculating An Exact Addition Result

Right, we need to calculate that intermediate addition along with some accurate error term. It turns out it’s possible to calculate a set of numbers, sum and error where mathematicallyExactSum = sum + error.

For this, we have to dive back to July of 1971 and check out the actually typewritten paper A Floating-Point Technique For Extending the Available Precision by T.J. Dekker. Give that a read if you want way more details on this whole thing.

Calculating the error term of adding two numbers (I’ll use x and y) is relatively straightforward if |x| > |y|:

sum = x + y;
err = y - (sum - x);

(this is equation 4.14 in the linked paper)

This is just a different ordering of (x + y) - sum that preserves accuracy: due to the nature of the values involved in these subtractions (sum’s value is directly related to those of x and y, and y is smaller than x), it turns out that each of those subtractions is an exact result (the paper has a proof of this, and it’s a lot so I’m not going to expand on that here), so we get the precise difference between the calculated sum and the real sum.

But this only works if you know that x’s magnitude is larger than (or equal to) y’s. If you don’t know which of the two values has a larger magnitude, you can do a bit more work and end up with:

(double sum, double err) AddWithError(
  double x, 
  double y)
{
  double sum = a + b;
  double intermediate = sum - x;
  double err1 = y - intermediate;
  double err2 = x - (sum - intermediate);
  return (sum, err1 + err2);      
}

(This is effectively the expanded version of listing 4.16 from the linked paper)

err1 here is the same as the value in the first version we calculated (a precision-preserving rewrite of (x + y) - sum)
err2 is, mathematically, x - (sum - (sum - x)) or 0; its goal is to calculate the error involved in calculating err1, since without the |x| > |y| guarantee those subtractions might NOT be exact … but these ones will be.
Thus, summing these two error terms together gives us a final, precise error term.

(More details in the paper, hopefully this isn’t too glossed over that it loses any meaning)

Finally, the End (For Single-Precision Floats)

So, yeah, that’s how you implement the FMAdd instruction for single-precision floats on a machine that has double-precision support:

Calculate the double-precision product of a and b
Add this product to c to get a double-precision sum
Calculate the error of the sum
Use the error to odd-round the sum
Round the double-precision sum back down to single precision

But what if you have to calculate FMAdd for double-precision floats? You can’t easily just cast up to, like, quad-precision floats and do the work there, so what now? Can you still do this?

The answer is yes, but it’s a lot more work, and that’s what the next post is about.

C++17-Style Hex Floats (And How To Parse Them)

Josh Jersild — Mon, 30 Dec 2024 12:00:00 PST

C++17 added support for hex float literals, so you can put more bit-accurate floating point values into your code. They’re handy to have, and I wanted to be able to parse them from a text file in a C# app I was writing.

Some examples:

0x1.0p0         // == 1.0
0x1.8p1         // == 3
0x8.0p-3        // == 1.0
0x0.8p1         // == 1.0
0xAB.CDEFp-10   // == 0.16777776181697846
0x0.0000000ABp0 // == 2.4883775040507317E-09

What Am I Even Looking At Here?

(Before reading this, if you don’t have a good feel for how floats work, maybe check my previous post about floats (and float rounding))

I had a bit of a mental block on this number format for a bit - like, what does it even mean to have fractional hex digits? But it turns out it’s a concept that we already use all the time and my brain just needed some prodding to make the connection.

With our standard base 10 numbers, moving the decimal point left one digit means dividing the number by 10:

12.3 == 1.23 * 10^1 == 0.123 * 10^2

Hex floats? Same deal, just in 16s instead:

0x1B.C8 == 0x1.BC8 * 16^1 == 0x0.1BC8 * 16^2

Okay, so now what’s the “p” part in the number? Well, that’s the start of the exponent. A standard float has an exponent starting with ‘e’:

1.3e2 == 1.3 * 10^2

But ‘e’ is a hex digit, so you can’t use ‘e’ anymore as the exponent starter, so they chose ‘p’ instead (why not ‘x’, the second letter? Probably because a hex number starts with ‘0x’, so ‘x’ also already has a use - but ‘p’ is free so it wins)

The exponent for a hex float is in powers of 2 (so it corresponds perfectly to the exponent as it is stored in the value), so:

0x1.ABp3 == 0x1.AB * 2^3

So that’s how a hex float literal works! Here’s a quick breakdown:

[ '.' [fractional-hex-digits]] 'p'

(Parsing code below the fold)

Okay Now What’s This About Parsing Them?

C++ conveniently has functions to parse these (std::strtod/strtof handle this nicely). However, if you’re (hypothetically) making a parser, and you happen to be writing it in C# which does not have an inbuilt way to parse these, then you’ll have to parse your own.

It ended up being a little more complicated than I thought for a couple reasons.

Arbitrarily long hex strings seem to be supported, which means you need to track both where the top-most non-zero hex value starts (i.e. skip any leading zeros) but also properly handle bits that are extra tiny which may affect rounding.
In order to properly handle said rounding, running through the hex digits in reverse and pushing them in from the top ends up being a nice strategy, because float rounding works via a sticky bit that stays set as things right-shift down through it.

Ultimately I ended up with the following algorithm (in C#) to parse a hex float, which I’m definitely sure is ~perfect~ and has absolutely no bugs whatsoever. It is also absolutely the most efficient version of this possible, with no room for improvement. Yep.

I’m throwing it in here in case anyone ever finds it useful.

static bool IsDigit(char c) => (c >= '0' && c <= '9');
static bool IsHexDigit(char c) 
  => IsDigit(c) 
    || (c >= 'A' && c <= 'F') 
    || (c >= 'a' && c <= 'f');

double ParseHexFloat(string s)
{ 
  // This doesn't handle a negative sign, if only because the parser I have only
  // needed to support positive values, but it'd be easy to add
  if (s.Length < 2 || s[0] != '0' || char.ToLowerInvariant(s[1]) != 'x')
    { throw new FormatException("Missing 0x prefix"); }

  if (s.Length < 3 || !IsHexDigit(s[2]))
    { throw new FormatException(
        "Hex float literal must contain at least one whole part digit"); }
    
  int i = 2;
  int decimalPointIndex = -1;
  int firstNonZeroHexDigitIndex = -1;

  // Scan through our digits, looking for the index of the first set (non-zero)
  //  hex value and the decimal point (if we have one).
  while (i < s.Length && (IsHexDigit(s[i]) || s[i] == '.'))
  {
    if (s[i] == '.')
    {
      // Found the decimal point! Hopefully there wasn't already one!
      if (decimalPointIndex >= 0)
        { throw new FormatException("Too many decimal points"); }

      decimalPointIndex = i;
    }
    else if (s[i] != '0' && firstNonZeroHexDigitIndex < 0)
      { firstNonZeroHexDigitIndex = i; } // Here's our top-most set hex value.

    i++;
  }

  // Also make a note of where our last hex digit was (usually the digit before
  //  the 'p' that we should be at right now)
  int lastHexDigitIndex = i - 1;

  // ... but if the previous character was the decimal point, the last hex
  //  digit is before that.
  if (lastHexDigitIndex == decimalPointIndex)
    { lastHexDigitIndex--; }

  // If we didn't find a decimal point, it's EFFECTIVELY here, at the 'p'
  if (decimalPointIndex < 0)
    { decimalPointIndex = i; }

  // Validate and skip the 'p' character
  if (i >= s.Length || char.ToLowerInvariant(s[i]) != 'p')
    { throw new FormatException("Missing exponent 'p'"); }
  i++;

  // Grab the sign if we have one
  bool negativeExponent = false;
  if (i < s.Length && (s[i] == '+' || s[i] == '-'))
  {
    negativeExponent = (s[i] == '-');
    i++;
  }

  if (i >= s.Length)
    { throw new FormatException("Missing exponent digits"); }

  // Parse the exponent!
  int exponent = 0;
  while (i < s.Length && IsDigit(s[i]))
  {
    if (int.MaxValue / 10 < exponent)
      { throw new FormatException("Exponent overflow"); }
    
    exponent *= 10;
    exponent += (int)(s[i] - '0');
    i++;
  }

  if (negativeExponent)
    { exponent = -exponent; }

  // If we had no non-zero hex digits, there's no point in continuing, it's
  //  zero. 
  if (firstNonZeroHexDigitIndex < 0)
    { return 0.0; }
    
  if (i != s.Length)
    { throw new FormatException("Unexpected characters at end of string"); }
    
  // We have the supplied exponent, but we want to massage it a bit. In a
  //  (IEEE) floating-point value, the mantissa is entirely fractional - that
  //  is, the value is 1.mantissa * 2^(exponent) - there's an implied 1
  //  (excluding subnormal floats, which we'll handle properly, but we can
  //  ignore for the moment). Two things need to happen here:
  //   1. We need to adjust the exponent based on the position of the first
  //      non-zero hex digit, to match the fact that we're parsing hex digits
  //      such that the top hex digit is sitting in the top 4 bits of our 64-
  //      bit int.
  //   2. But we EXPECT a single bit to be above the mantissa (the implied
  //      1) so subtract 1 from our adjustment to take into account that
  //      there will be 4 bits in that hex, so if we had parsed a single "1"
  //      (from, say, 0x1p0, which just equals 1.0) our effective exponent
  //      should be 3 (which we will later shift back down to 0 to position
  //      the 1s bit at the very top)
  if (decimalPointIndex >= firstNonZeroHexDigitIndex)
    { exponent += ((decimalPointIndex - firstNonZeroHexDigitIndex) * 4) - 1; }
  else
    { exponent += ((decimalPointIndex - firstNonZeroHexDigitIndex + 1) * 4) - 1; }

  // Now that we have the exponent and know the bounds of our hex digits,
  //  we can parse backwards through the hex digits, shifting them in from
  //  the top. We do this so that we can easily handle the rounding to the
  //  final 53 bits of significand (by ensuring that we don't ever shift
  //  any 1s off the bottom)
  ulong mantissa = 0;
  for (i = lastHexDigitIndex; i >= firstNonZeroHexDigitIndex; i--)
  {
    // Skip the '.' if there was one.
    if (s[i] == '.')
      { continue; }

    char c = char.ToLowerInvariant(s[i]);
    ulong v = (c >= 'a' && c <= 'f') 
      ? (ulong)(c - 'a' + 10) 
      : (ulong)(c - '0');

    // Shift the mantissa down, but keep any 1s that happen to be in the bottom
    //  4 bits (this is a reasonably-efficient emulation of the "sticky bit"
    //  that is used to round a floating point number properly.
    mantissa = (mantissa >> 4) | (mantissa & 0xf);

    // Add in our parsed hex value, putting its 4 bits at the very top of the
    //  mantissa ulong.
    mantissa |= (v << 60);
  }

  // We know the mantissa is non-zero (checked earlier), and we want to position
  //  the highest set bit at the top of our ulong so shift up until the top bit
  //  is set (and adjust our exponent down 1 to compensate).
  while ((mantissa & 0x8000_0000_0000_0000ul) == 0)
  {
    mantissa <<= 1;
    exponent--;
  }

  const int DoubleExponentBias = 1023;
  const int MaxBiasedDoubleExponent = 1023 + DoubleExponentBias;
  const ulong MantissaMask = 0x000f_ffff_ffff_fffful;
  const ulong ImpliedOneBit = 0x0010_0000_0000_0000ul;
  const int ExponentShift = 52;
  const int MantissaShiftRight = sizeof(double)*8 - ExponentShift - 1;

  // Exponents are stored in a biased form (they can't go negative) so add our
  //  bias now.
  exponent += DoubleExponentBias;

  if (exponent <= 0)
  {
    // We have a subnormal value, which means there is no implied 1, so first
    //  we need to shift our mantissa down by one to get rid of the implied 1.
    //  (note that we're not letting any 1s shift off the bottom, keeping them
    //  sticky)
    mantissa = (mantissa >> 1) | (mantissa & 1);

    // Continue to denormalize the mantissa until our exponent reaches zero
    while (exponent < 0)
    {
      mantissa = (mantissa >> 1) | (mantissa & 1);
      exponent++;
    }
  }

  // Now to do the actual rounding of the mantissa. Sometimes floating point
  //  rounding needs 3 bits (guard, round, sticky) to do rounding, but in our
  //  case, two will suffice: one bit that represents the uppermost bit that
  //  shifts right off of the edge of the mantissa (i.e. the "0.5" bit) and
  //  then "literally any 1 bit underneath that" (which is why we've been
  //  holding on to extra 1s when shifting right) that is the tiebreaker
  bool roundBit      = (mantissa & 0b10000000000) != 0;
  bool tiebreakerBit = (mantissa & 0b01111111111) != 0;

  // Now that we have those bits, we can shift our mantissa down into its
  //  proper place (as the lower 53 bits of our 64-bit ulong).
  mantissa >>= MantissaShiftRight;
  if (roundBit)
  {
    // If there's a tiebreaker, we'll increment the mantissa. Otherwise,
    //  if there's a tie (could round either way), we round so that the
    //  mantissa value is even (lowest bit in the double is 0)
    if (tiebreakerBit || (mantissa & 1) != 0)
      { mantissa++; }

    // If we have a subnormal float we may have overflowed into the implied 1
    //  bit, otherwise we might have overflowed into the ... I guess the
    //  implied *2* bit?
    ulong overflowedMask = ImpliedOneBit << ((exponent == 0) ? 0 : 1);
    if ((mantissa & overflowedMask) != 0)
    {
      // Shift back down one. This is not going to drop a 1 off the bottom
      //  because if we overflowed it means we were odd, and added one to
      //  become even.
      exponent++;
      mantissa >>= 1;
    }
  }

  // It's possible that the truncation we ended up with a 0 mantissa after all,
  //  so our final value has rounded allll the way down to 0.
  if (mantissa == 0)
    { return 0.0; }

  // If our exponent is too large to be represented, this value is infinity.
  if (exponent > MaxBiasedDoubleExponent)
    { return double.PositiveInfinity; }

  // Mask off the implied one bit (if we have one)
  mantissa &= ~ImpliedOneBit;

  // Alright assemble the final double's bits, which means shifting and
  //  adding the exponent into its proper place.
  //  (if we had a sign to apply we'd apply it to the top bit). 
  ulong assembled = mantissa | (((ulong)exponent) << ExponentShift);
  double result = BitConverter.UInt64BitsToDouble(assembled);
  return result;
}

Wordpress No More

Josh Jersild — Sun, 29 Dec 2024 12:00:00 PST

I’ve ported the blog off of Wordpress onto a static site generator (Specifically, 11ty/Eleventy). I have a bit more control over the format here, and it’s easier for me to write pages (Wordpress was fighting me on all sorts of formatting which I can just do now).

What this means is I can finally start copying over the rest of my posts that were on (the now-defunct, sadly) cohost.org (rest easy, little eggbug).

Likely there are things on the new site that aren’t set up correctly yet, so if you happen to notice anything, find me on Bluesky or Mastodon and let me know!

And now, for fun, here’s a pic I took in November at Arches National Park:

Floating Point Numbers and Rounding

Josh Jersild — Tue, 10 Jan 2023 12:00:00 PST

I was writing about how to parse C++17-style hex floating point literals, and in doing so I ended up writing a bunch about how floats work in general (and specifically how floating point rounding happens), so I opted to split it off from that post into its own, since it’s already probably way too many words as it is?

Here we go!

How Floats Work

If you don’t know, a floating point number (At least, an IEEE 754 float, which effectively all modern hardware supports), consists of three parts:

Sign bit – the upper bit of the float is the sign bit: 1 if the float is negative, 0 if it’s positive.
Exponent – the next few bits (8 bits for a 32-bit float, 11 bits for a 64-bit float) contain the exponent data, which is the power of two to multiply the given hex value with. (Note that the exponent is stored in a biased way – more on that in a moment)
Mantissa – the remaining bits (23 for 32-bit float, 52 for a 64-bit float) represent the fractional part of the float’s value.

In general (with the exception of subnormal floats and 0.0, explained in a bit) there is an implied 1 in the float: that is, if the mantissa has a value of “FC3CA0000” the actual float is 1.FC3CA0000 (the mantissa bits are all to the right of the decimal point) before the exponent is applied. Having this implied 1 gives an extra bit of precision to the value since you don’t even have to store that extra 1 bit anywhere – it’s implied! Clever.

The exponent represents the power of two involved (Pow2(exponent)), which has the nice property that multiplies or divides of a float by powers of two do not (usually, except at extremes) affect the precision of the number, dividing by 2 simply decrements the exponent by 1, and multiplying by 2 increments the exponent by 1.

For a double-precision (64-bit) float, the maximum representable exponent is 1023 and the minimum is -1022. These are stored in 11 bits, and they’re biased (which is to say that the stored 11 bits is actualExponent + bias where the bias is 1023. That means that this range of [-1022, 1023] is actually stored as [1, 2046] (00000000001 and 11111111110 in binary). This range uses all but two of the possible 11-bit values, which are used to represent two sets of special cases:

Exponent value 00000000000b represents a subnormal float – that is, it still has the effective exponent of -1022 (the minimum representable exponent) but it does NOT have the implied 1 – values smaller than this start to lose bits of precision for every division by 2 as it can’t decrement the exponent any farther and so ends up sliding the mantissa to the right instead.
- For this 00000000000b exponent, if the mantissa is 0, then you have a value of 0.0 (or, in a quirk of floating point math, -0.0 if the sign bit is set).
Exponent value 11111111111b represents one of two things:
- If the mantissa is zero, this is infinity (either positive or negative infinity depending on the sign bit).
- If the mantissa is non-zero, it’s NaN (not a number).
  - (There are two types of NaN, quiet and signaling. Those are a subject for another time, but the difference bit-wise is whether the upper bit of the mantissa is set: if 1 it’s quiet, if 0 then it’s signalling).

If you wanted to write a bit of math to calculate the value of a 64-bit float (ignoring the two special exponent cases) it would look something like this (where bias in this case is 1023):

(signBit ? -1 : 1) 
  * (1 + (mantissaBits / Pow2(52 + 1)) 
  * Pow2(exponentBits - bias)

Standard Float Rounding

Okay, knowing how floats are stored, clearly math in the computer isn’t done with infinite precision, so when you do an operation that drops some precision, how does the result get rounded?

When operations are done with values with mismatched exponents, the value with the lowest exponent is effectively shifted to the right by the difference to match the exponents.

For example, here’s the subtraction of two four-bit-significand (3 bits of mantissa plus the implied 1) floats:

  1.000 * 2^5
- 1.001 * 2^1

The number being subtracted has the smaller exponent, so we end up shifting it to the right to compensate (for now, doing it as if we had no limit on extra digits):

  1.000 0000 * 2^5
- 0.000 1001 * 2^5 // Shifted right to match exponents
------------------
  0.111 0111 * 2^5
  1.110 111  * 2^4 // shifted left to normalize (fix the implied 1)
  1.111      * 2^4 // round up since we had more than half off the edge

Note that in this example, the value being subtracted shifted completely off the side of the internal mantissa bit count. Since we can’t store infinite off-the-end digits, what do we do?

Float math uses three extra bits (to the “right” of the mantissa), called the guard bit, the round bit, and the sticky bit.

As the mantissa shifts off the end, it shifts into these bits. This works basically like a normal shift right, with the exception that the moment that ANY 1 bit get shifted into the sticky bit, it stays 1 from that point on (that’s what makes it sticky).

For instance:

      G R S
1.001 0 0 0 * 2^1
0.100 1 0 0 * 2^2 // 1 shifts into the guard bit
0.010 0 1 0 * 2^3 // now into the round bit
0.001 0 0 1 * 2^4 // now into the sticky bit
0.000 1 0 1 * 2^5 // sticky bit stays 1 now

Note that the sticky bit stayed 1 on that last shift, even though in a standard right shift it would have gone off the end. Basically if you take the mantissa plus the 3 GRS bits (not to be confused with certain cough other meanings of GRS) and shift it to the right, the operation is the equivalent of:

mantissaAndGRS = (mantissaAndGRS >> 1) | (mantissaAndGRS & 1)

Now when determining whether to round, you can take the 3 GRS bits and treat them as GRS/8 (i.e. GRS bits of 100b are the equivalent of 0.5 (4/8), and 101b is 0.625 (5/8)), and use that as the fraction that determines whether/how you round.

The standard float rounding mode is round-to-nearest, even-on-ties (that is, if it could round either way (think 1.5, which is equally close to either 1.0 or 2.0), you round to whichever of the neighboring values is even (so 1.5 and 2.5 would both round to 2).

Using our bits, the logic, then, is this:

If the guard bit is not set, then it rounds down (fraction is < 0.5), mantissa doesn’t change.
If the guard bit IS set:
- If round bit or sticky bit is set, always round up (fraction is > 0.5), mantissa increases by 1.
- Otherwise, it’s a tie (exactly 0.5, could round either way), so round such that the mantissa is even (the lower bit of the mantissa is 0), mantissa increments if the lower bit was 1 (to make it even).

Okay, so if all we care about is guard bit and then , why even have three bits? Isn’t two bits enough?

Nope! Well, sometimes, but not always. Turns out, some operations (like subtraction) can require a left shift by one to normalize the result (like in the above subtraction example), which means if you only had two bits of extra-mantissa information (just, say, a round and sticky bit) you’d be left with one bit of information after the left shift and have no idea if there’s a rounding tiebreaker.

For instance, here’s an operation with the proper full guard, round, and sticky bits:

  1.000 * 2^5
- 1.101 * 2^1

  // Shift into the GRS bits:
  1.000 000 * 2^5
- 0.000 111 * 2^5 // sticky kept that lowest 1
------------------
  0.111 001 * 2^5
  1.110 01  * 2^4 // shift left 1, still 2 digits left
  1.110     * 2^5 // Round down properly

If this were done with only two bits (round and sticky) we would end up with the following:

  1.000 * 2^5
- 1.101 * 2^1

  // Shift into just RS bits:
  1.000 00 * 2^5
- 0.000 11 * 2^5
------------------
  0.111 01 * 2^5
  1.110 1  * 2^4 // shift left 1, still 2 digits left

Once we shift left there, we only have one bit of data, and it’s set. We don’t know whether or not we had a fraction > 0.5 (meaning we have to round up) or < 0.5 (meaning we round to even, which is down in this case).

So the short answer is: three bits because sometimes we have to shift left and lose one bit of the information, and we need at least a “half bit” and a “tiebreaker bit” at all times. Given all the float operations, 3 suffices for this, always.

Resurrected

Josh Jersild — Sun, 01 Jan 2023 12:00:00 PST

A while back (over a year ago) this blog got hacked and so it’s been down for a hot minute. I’ve brought it back online and now I can finally post things on it again – I have a few things that I’ll post here and there but honestly it’s unlikely to ever be really FREQUENT around here.

But for now, it’s back!

Procyon is Greenlit!

Josh Jersild — Sat, 11 Jan 2014 12:00:00 PST

Valve recently announced that Procyon is in the most-recent batch of titles to be given the green light for release on Steam! This is super-exciting news!

There are a few things that I want to add to Procyon before it’s ready for general Steam release: achievements, proper leaderboards, steam overlay support, etc. Basically, all of the steam features that make sense.

It’ll be a bit before it gets done, as we’re working on putting the finishing touches on Infamous: Second Son at work, so I’m a little dead by the time I get home at the moment. But once we’ve shipped, I’ll likely have the time and energy to get Procyon rolling out onto Steam.

Yeaaaaaaaah!