<?xml version="1.0" encoding="utf-8"?>
<rss
  version="2.0" 
  xmlns:content="http://purl.org/rss/1.0/modules/content/"  
  xmlns:dc="http://purl.org/dc/elements/1.1/" 
  xml:base="https://www.drilian.com/" 
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
  xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
>
  <channel>
    <title>JoshJers&#39; Ramblings</title>
    <atom:link href="https://www.drilian.com/feed/" rel="self" type="application/rss+xml" />
    <link>https://www.drilian.com/</link>
    <description>Infrequently-updated blog about software development, game development, and music</description>
    <lastBuildDate>Wed, 11 Jun 2025 01:41:20 GMT</lastBuildDate>
    <language>en</language>
    <item>
      <title><![CDATA[Protecting Coders From Ourselves: Better Mutex Protection]]></title>
      <link>https://www.drilian.com/posts/2025.01.23-protecting-coders-from-ourselves-better-mutex-protection/</link>
      <pubDate>Thu, 23 Jan 2025 13:55:44 PST</pubDate>
      <dc:creator><![CDATA[Josh Jersild]]></dc:creator>
      <guid>https://www.drilian.com/posts/2025.01.23-protecting-coders-from-ourselves-better-mutex-protection/</guid>
      <content:encoded>
        <![CDATA[
          <p>At some point, if you’ve done multithreaded programming, you’ve probably used a mutex <a href="https://en.wikipedia.org/wiki/Lock_(computer_science)" target="_blank" rel="noopener">(or some other locking mechanism)</a>. Locks are relatively straightforward to understand and use (“lock this thing before accessing your data and unlock it when you’re done”), but they do have their issues. The most commonly-discussed issue with using locks is the dreaded <a href="https://en.wikipedia.org/wiki/Deadlock_(computer_science)" target="_blank" rel="noopener">deadlock</a>, scourge of many a poor soul who needed to hold multiple locks at once for something.</p>
<p>But we’re not here to talk about deadlocks…instead, I’d like to focus on a problem that I have encountered <em>much</em> more frequently: accidentally reading or modifying protected data <strong>without having acquired the lock</strong>.</p>
<p>That’s right, we’re once again going to try to protect ourselves from our worst enemy: ourselves.</p>
<blockquote class="note">
          <div class="header"><span>Note</span></div>
<p>As with the <a href="https://www.drilian.com/posts/2025.01.10-protecting-coders-from-ourselves-min-max-lerp-and-clamp/">previous entry in this semi-series</a>, the examples and implementation are in C++, but you can probably build something similar in your language of choice (unless it doesn’t need it).</p>
</blockquote>
<h3>An Easy Mistake</h3>
<p>It turns out it’s <em>surprisingly easy</em> to not grab a lock before reading or writing things that need to be synchronized. Here’s a toy example:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token keyword">void</span> <span class="token class-name">Foo</span><span class="token double-colon punctuation">::</span><span class="token function">IncrementThing</span><span class="token punctuation">(</span><span class="token keyword">int</span> c<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token comment">// Lock the mutex correctly!</span>
  std<span class="token double-colon punctuation">::</span>lock_guard lock <span class="token punctuation">{</span> m_mutex <span class="token punctuation">}</span><span class="token punctuation">;</span>
  m_thing <span class="token operator">+=</span> c<span class="token punctuation">;</span>
<span class="token punctuation">}</span>

<span class="token keyword">void</span> <span class="token class-name">Foo</span><span class="token double-colon punctuation">::</span><span class="token function">DecrementThing</span><span class="token punctuation">(</span><span class="token keyword">int</span> c<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  m_thing <span class="token operator">-=</span> c<span class="token punctuation">;</span> <span class="token comment">// Whoops, no lock!</span>
<span class="token punctuation">}</span></code></pre>
<p>In the real world, <em>“oops I accessed a thing outside of the lock”</em> bugs tend to be more subtle: it can be especially hard to notice when a thing that <em>should</em> be present <em>isn’t</em>. Since all of the accesses are to <em>normal members</em> of your class/struct/global scope/whatever, it’s easy to slip and access a value somewhere it wasn’t intended and isn’t properly protected.</p>
<p>But what if instead you could do something more like the following:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token comment">// Note: this is pseudocode, not actual C++</span>
<span class="token keyword">class</span> <span class="token class-name">Foo</span>
<span class="token punctuation">{</span>
  <span class="token comment">// ... class stuff ...</span>

  <span class="token comment">// Mysterious m_state object that protects everything inside of it with a mutex</span>
  MutexProtected m_state
  <span class="token punctuation">{</span>
    <span class="token keyword">int</span> m_thing <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span>
  <span class="token punctuation">}</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span><span class="token punctuation">;</span>

<span class="token keyword">void</span> <span class="token class-name">Foo</span><span class="token double-colon punctuation">::</span><span class="token function">IncrementThing</span><span class="token punctuation">(</span><span class="token keyword">int</span> c<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token function">lock</span> <span class="token punctuation">(</span>m_state<span class="token punctuation">)</span> <span class="token comment">// Lock here</span>
  <span class="token punctuation">{</span>
    m_thing <span class="token operator">+=</span> c<span class="token punctuation">;</span> <span class="token comment">// Only accessible in the lock</span>
  <span class="token punctuation">}</span>

  <span class="token comment">// m_thing += c; // Won't compile: Can't access outside of lock</span>
<span class="token punctuation">}</span>
</code></pre>
<p>Basically, what if you could have some sort of <strong>protected wrapper</strong> around the state that walls it off and makes it <strong>inaccessible except when the lock is actually held</strong>? With something like that it would become <em>much</em> more difficult to get at values when it shouldn’t be allowed.</p>
<p>Additionally, it would help both with <strong>organization</strong> (forcing the mutex-protected values together in the code) as well as making the <strong>intent</strong> clear: values that are protected sit inside the protected block, clarifying which values are (and are not) intended to be accessed solely through the mutex, even to folks who were not the original author (the list of which stealthily includes the original author, one month in the future).</p>
<p><span class="read-more"></span></p>
<p>With C++ it’s possible to get quite close to the above:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token keyword">class</span> <span class="token class-name">Foo</span>
<span class="token punctuation">{</span>
  <span class="token comment">// ... class stuff ...</span>

  <span class="token comment">// State structure for the protected member(s)</span>
  <span class="token keyword">struct</span> <span class="token class-name">State</span> <span class="token punctuation">{</span> <span class="token keyword">int</span> m_thing <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> <span class="token punctuation">}</span><span class="token punctuation">;</span>

  <span class="token comment">// The mutex and the state it's protecting</span>
  MutexProtected<span class="token operator">&lt;</span>State<span class="token operator">></span> m_state<span class="token punctuation">;</span>
<span class="token punctuation">}</span><span class="token punctuation">;</span>

<span class="token keyword">void</span> <span class="token class-name">Foo</span><span class="token double-colon punctuation">::</span><span class="token function">IncrementThing</span><span class="token punctuation">(</span><span class="token keyword">int</span> c<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token comment">// lock the mutex and get the inner state</span>
  <span class="token keyword">auto</span> state <span class="token operator">=</span> m_state<span class="token punctuation">.</span><span class="token function">Lock</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>

  <span class="token comment">// Use the lock to access the values within.</span>
  state<span class="token operator">-></span>m_thing <span class="token operator">+=</span> c<span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre>
<p>The nice thing is, a basic implementation of this is fairly compact. There are just two classes involved: <code>MutexProtected&lt;T&gt;</code><span class="attrs"></span> and its lock object, <code>MutexLocked&lt;T&gt;</code><span class="attrs"></span>.</p>
<blockquote>
<p><strong>EDIT (2025-01-25):</strong> People have pointed out two existing versions of this: boost has <code><a href="https://www.boost.org/doc/libs/1_87_0/doc/html/thread/sds.html" target="_blank" rel="noopener">boost::synchronized_value</a></code>, and there’s <a href="https://github.com/copperspice/cs_libguarded" target="_blank" rel="noopener">cs_libguarded</a> which looks like it has a few variants on this concept, too. So if you don’t feel like rolling your own you can always check one of those out instead!</p>
</blockquote>
<h3>MutexProtected&lt;T&gt;</h3>
<p>The outer class, <code>MutexProtected&lt;T&gt;</code><span class="attrs"></span>, has very few moving parts:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token keyword">template</span> <span class="token operator">&lt;</span><span class="token keyword">typename</span> <span class="token class-name">T</span><span class="token operator">></span>
<span class="token keyword">class</span> <span class="token class-name">MutexProtected</span>
<span class="token punctuation">{</span>
<span class="token keyword">public</span><span class="token operator">:</span>
  MutexLocked<span class="token operator">&lt;</span>T<span class="token operator">></span> <span class="token function">Lock</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span> <span class="token keyword">return</span> <span class="token punctuation">{</span> <span class="token operator">&amp;</span>m_t<span class="token punctuation">,</span> m_mutex <span class="token punctuation">}</span><span class="token punctuation">;</span> <span class="token punctuation">}</span>

<span class="token keyword">private</span><span class="token operator">:</span>
  std<span class="token double-colon punctuation">::</span>mutex m_mutex<span class="token punctuation">;</span>
  T m_t<span class="token punctuation">;</span>  
<span class="token punctuation">}</span><span class="token punctuation">;</span></code></pre>
<p>There’s a <strong>lock function</strong> that returns a <code>MutexLocked&lt;T&gt;</code><span class="attrs"></span> (which we’ll get to in a moment), and it contains both a <strong>mutex</strong> as well as a <code>T</code><span class="attrs"></span> (the templated type), which contains <strong>the data to be protected by the mutex</strong>. In the earlier example, this was the <code>State</code><span class="attrs"></span> struct that contained the protected <code>m_thing</code><span class="attrs"></span> value, but it could contain any number of values/objects that should only be accessed from within the same lock.</p>
<blockquote class="note">
          <div class="header"><span>Note</span></div>
<p>You may be wondering “there’s a <code>Lock</code><span class="attrs"></span>, so why is there no corresponding <code>Unlock</code><span class="attrs"></span> function?” - this is because we’re taking advantage of the classic C++ idiom of <a href="https://en.wikipedia.org/wiki/Resource_acquisition_is_initialization" target="_blank" rel="noopener">RAII</a>, so the returned <code>MutexLocked&lt;T&gt;</code><span class="attrs"></span> object holds the lifetime of the lock, and unlocks the mutex when it goes out of scope. As such, there’s no need to manually unlock the mutex; it will happen automatically.</p>
</blockquote>
<h3>MutexLocked&lt;T&gt;</h3>
<p>But what is the thing that the <code>Lock</code><span class="attrs"></span> function returns? Well, that’s the <code>MutexLocked&lt;T&gt;</code><span class="attrs"></span> class and it looks like this:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token keyword">template</span> <span class="token operator">&lt;</span><span class="token keyword">typename</span> <span class="token class-name">T</span><span class="token operator">></span>
<span class="token keyword">class</span> <span class="token class-name">MutexLocked</span>
<span class="token punctuation">{</span>
<span class="token keyword">public</span><span class="token operator">:</span>
  <span class="token comment">// Add the standard pointer-like accessors.</span>
  T <span class="token operator">*</span><span class="token keyword">operator</span><span class="token operator">-></span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">const</span> <span class="token punctuation">{</span> <span class="token keyword">return</span> m_p<span class="token punctuation">;</span> <span class="token punctuation">}</span>
  T <span class="token operator">&amp;</span><span class="token keyword">operator</span> <span class="token operator">*</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">const</span> <span class="token punctuation">{</span> <span class="token keyword">return</span> <span class="token operator">*</span>m_p<span class="token punctuation">;</span> <span class="token punctuation">}</span>

<span class="token keyword">private</span><span class="token operator">:</span>
  <span class="token comment">// Construct this with a pointer to the state and the mutex to lock.</span>
  <span class="token function">MutexLocked</span><span class="token punctuation">(</span>T <span class="token operator">*</span>p<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>mutex <span class="token operator">&amp;</span>mutex<span class="token punctuation">)</span>
  <span class="token operator">:</span> m_lock<span class="token punctuation">{</span>mutex<span class="token punctuation">}</span><span class="token punctuation">,</span> m_p<span class="token punctuation">{</span>p<span class="token punctuation">}</span>
    <span class="token punctuation">{</span> <span class="token punctuation">}</span>

  std<span class="token double-colon punctuation">::</span>lock_guard<span class="token operator">&lt;</span>std<span class="token double-colon punctuation">::</span>mutex<span class="token operator">></span> m_lock<span class="token punctuation">;</span>
  T <span class="token operator">*</span>m_p <span class="token operator">=</span> <span class="token keyword">nullptr</span><span class="token punctuation">;</span>

  <span class="token keyword">template</span> <span class="token operator">&lt;</span><span class="token keyword">typename</span> <span class="token class-name">U</span><span class="token operator">></span>
  <span class="token keyword">friend</span> <span class="token keyword">class</span> <span class="token class-name">MutexProtected</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span><span class="token punctuation">;</span></code></pre>
<p>As you can see, it also doesn’t have much to it! It constructs with a pointer to the <strong>state object</strong> (the <code>m_t</code><span class="attrs"></span> in the <code>MutexProtected&lt;T&gt;</code><span class="attrs"></span>) and the <strong>mutex</strong> (which it <strong>immediately locks</strong> in the constructor), and so all it does is hold onto the lock until destruction time, while giving the user a way to access the state object (via the two public operators).</p>
<p>So, using this setup in full:</p>
<ol>
<li>Call <code>Lock</code><span class="attrs"></span> on the <code>MutexProtected&lt;T&gt;</code><span class="attrs"></span> to get a <code>MutexLocked&lt;T&gt;</code><span class="attrs"></span></li>
<li>Access the protected state through that acquired object (using the <code>-&gt;</code><span class="attrs"></span> operator) to do whatever needs to be done within the lock</li>
<li>Let the <code>MutexLocked&lt;T&gt;</code><span class="attrs"></span> leave scope, at which point the lock is released</li>
</ol>
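<p>To see the three steps working end to end, here is a minimal, self-contained sketch of both classes plus a small threaded test. It assumes C++17 (the lock object is neither copyable nor movable, so returning it by value relies on guaranteed copy elision). It initializes the <code>lock_guard</code> directly from the mutex and befriends only the matching <code>MutexProtected&lt;T&gt;</code>; those details, the <code>State</code> struct, and the <code>CountWithThreads</code> helper are assumptions made for this illustration rather than anything definitive.</p>

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

template <typename T>
class MutexProtected;

// Lock object: holds the mutex for its whole lifetime and exposes the state.
template <typename T>
class MutexLocked
{
public:
  T *operator->() const { return m_p; }
  T &operator*() const { return *m_p; }

private:
  // Only MutexProtected<T> may create one; the mutex is locked right here.
  MutexLocked(T *p, std::mutex &mutex)
  : m_lock{mutex}, m_p{p}
    { }

  std::lock_guard<std::mutex> m_lock;
  T *m_p = nullptr;

  friend class MutexProtected<T>;
};

// Owner: bundles the mutex with the data it protects.
template <typename T>
class MutexProtected
{
public:
  MutexLocked<T> Lock()
    { return { &m_t, m_mutex }; }

private:
  std::mutex m_mutex;
  T m_t;
};

// Hypothetical state struct, mirroring the earlier example.
struct State { int m_thing = 0; };

// Hypothetical helper: several threads bump the same counter via the wrapper.
int CountWithThreads(int threadCount, int incrementsPerThread)
{
  MutexProtected<State> state;
  std::vector<std::thread> threads;
  for (int t = 0; t < threadCount; t++)
  {
    threads.emplace_back([&state, incrementsPerThread]
    {
      for (int i = 0; i < incrementsPerThread; i++)
        { state.Lock()->m_thing++; } // step 1, 2, and 3 in one expression
    });
  }
  for (auto &thread : threads)
    { thread.join(); }
  return state.Lock()->m_thing;
}
```

<p>Because every increment goes through <code>Lock()</code>, the final count should equal <code>threadCount * incrementsPerThread</code>; with a bare <code>int</code> and no lock, that would not be guaranteed.</p>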
<p>A nice little feature of this is if you do have a single thing that you’re doing in the lock (Say, inserting a value into a protected queue), the above steps can even fit nicely on a single line (while still being perfectly readable), thanks to the rules of C++ <a href="https://en.cppreference.com/w/cpp/language/lifetime" target="_blank" rel="noopener">temporary object lifetimes</a>:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token keyword">void</span> <span class="token class-name">Foo</span><span class="token double-colon punctuation">::</span><span class="token function">EnqueueItem</span><span class="token punctuation">(</span>Item <span class="token operator">*</span>i<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  m_state<span class="token punctuation">.</span><span class="token function">Lock</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token operator">-></span>queue<span class="token punctuation">.</span><span class="token function">Enqueue</span><span class="token punctuation">(</span>i<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// Lock/Update/Unlock</span>

  <span class="token comment">// Do more stuff here outside of the lock, it's guaranteed to be released</span>
<span class="token punctuation">}</span></code></pre>
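<p>If the temporary-lifetime rule feels a bit magical, the ordering can be observed directly. The sketch below does not use a real mutex; it swaps in a hypothetical <code>ScopedAccess</code> type that just records when it is constructed and destroyed, to show that the temporary returned by a <code>Lock()</code>-style call dies at the end of the full expression, before the next statement runs. All names here are invented for the illustration, and it assumes C++17.</p>

```cpp
#include <cassert>
#include <string>
#include <vector>

// Event log used to observe the order in which things happen.
std::vector<std::string> g_events;

// Hypothetical stand-in for the protected queue.
struct Queue
{
  void Enqueue(int) { g_events.push_back("enqueue"); }
};

// Hypothetical stand-in for MutexLocked<T>: logs instead of locking.
class ScopedAccess
{
public:
  explicit ScopedAccess(Queue *q) : m_q{q} { g_events.push_back("lock"); }
  ~ScopedAccess() { g_events.push_back("unlock"); }

  ScopedAccess(const ScopedAccess &) = delete;
  ScopedAccess &operator=(const ScopedAccess &) = delete;

  Queue *operator->() const { return m_q; }

private:
  Queue *m_q;
};

Queue g_queue;

// Returns the access object by value, like MutexProtected<T>::Lock.
ScopedAccess Lock() { return ScopedAccess{&g_queue}; }

std::vector<std::string> RunOneLiner()
{
  g_events.clear();
  Lock()->Enqueue(42);         // temporary lives until the end of this statement
  g_events.push_back("after"); // runs only once the temporary is destroyed
  return g_events;
}
```

<p>The recorded order is lock, enqueue, unlock, after, which is exactly the Lock/Update/Unlock sequence the one-liner relies on.</p>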
<h3>Sub-Functions That Require A Lock</h3>
<p>This type of object also helps with a secondary part of this problem, which is when your class has functions (that are likely <code>private</code><span class="attrs"></span> or <code>protected</code><span class="attrs"></span>) that require the lock to <em>already be acquired</em> before you call them:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token comment">// This function is intended to only be called while the lock is held</span>
<span class="token keyword">void</span> <span class="token class-name">Foo</span><span class="token double-colon punctuation">::</span><span class="token function">AdjustStateWithLockHeld</span><span class="token punctuation">(</span><span class="token keyword">int</span> delta<span class="token punctuation">)</span>
  <span class="token punctuation">{</span> m_thing <span class="token operator">+=</span> delta<span class="token punctuation">;</span> <span class="token punctuation">}</span>

<span class="token keyword">void</span> <span class="token class-name">Foo</span><span class="token double-colon punctuation">::</span><span class="token function">IncrementThing</span><span class="token punctuation">(</span><span class="token keyword">int</span> delta<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  std<span class="token double-colon punctuation">::</span>lock_guard lock <span class="token punctuation">{</span> m_mutex <span class="token punctuation">}</span><span class="token punctuation">;</span>

  <span class="token comment">// Call this function while the lock is held only</span>
  <span class="token function">AdjustStateWithLockHeld</span><span class="token punctuation">(</span>delta<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>

<span class="token keyword">void</span> <span class="token class-name">Foo</span><span class="token double-colon punctuation">::</span><span class="token function">DecrementThing</span><span class="token punctuation">(</span><span class="token keyword">int</span> delta<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token comment">// Oh no I have once again forgotten to lock the mutex before doing the thing</span>
  <span class="token function">AdjustStateWithLockHeld</span><span class="token punctuation">(</span><span class="token operator">-</span>delta<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre>
<p>In the above example, <code>AdjustStateWithLockHeld</code><span class="attrs"></span> is assuming that the lock is being held by its caller and that it’s free to manipulate the state within it. It doesn’t make much sense in this example, but if you have mutex-protected state that has a complex update process (such as multiple things needing to be kept in sync), it can be nice to move such logic to a subroutine.</p>
<p>Thankfully, using the <code>MutexProtected&lt;T&gt;</code><span class="attrs"></span> object, the state is only accessible through a <code>MutexLocked&lt;T&gt;</code><span class="attrs"></span> object. In order for the sub-function to be able to do anything with the state, it would need to additionally take a reference to the <code>MutexLocked&lt;T&gt;</code><span class="attrs"></span> for the state in question, thus effectively ensuring that the mutex is locked by the caller:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token comment">// Now the function takes the locked State object as a parameter:</span>
<span class="token keyword">void</span> <span class="token class-name">Foo</span><span class="token double-colon punctuation">::</span><span class="token function">AdjustStateWithLockHeld</span><span class="token punctuation">(</span>
    MutexLocked<span class="token operator">&lt;</span>State<span class="token operator">></span> <span class="token operator">&amp;</span>state<span class="token punctuation">,</span> 
    <span class="token keyword">int</span> delta<span class="token punctuation">)</span>
  <span class="token punctuation">{</span> state<span class="token operator">-></span>m_thing <span class="token operator">+=</span> delta<span class="token punctuation">;</span> <span class="token punctuation">}</span>

<span class="token keyword">void</span> <span class="token class-name">Foo</span><span class="token double-colon punctuation">::</span><span class="token function">DecrementThing</span><span class="token punctuation">(</span><span class="token keyword">int</span> delta<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token comment">// Now state has to be locked to get the object to pass to the sub-function!</span>
  <span class="token keyword">auto</span> state <span class="token operator">=</span> m_state<span class="token punctuation">.</span><span class="token function">Lock</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token function">AdjustStateWithLockHeld</span><span class="token punctuation">(</span>state<span class="token punctuation">,</span> <span class="token operator">-</span>delta<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre>
<h3>A Variation</h3>
<p>For completeness, I also wanted to mention an alternate form of this that I thought of while designing it: where, rather than <code>Lock</code><span class="attrs"></span> returning an object that represents the scoped lock, you instead pass a lambda (or function) that gets all of the state values in it:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token keyword">struct</span> <span class="token class-name">State</span>
<span class="token punctuation">{</span>
  <span class="token keyword">int</span> a<span class="token punctuation">,</span> b<span class="token punctuation">;</span>
<span class="token punctuation">}</span><span class="token punctuation">;</span>

MutexProtected<span class="token operator">&lt;</span>State<span class="token operator">></span> state<span class="token punctuation">;</span>

state<span class="token punctuation">.</span><span class="token function">Lock</span><span class="token punctuation">(</span>
  <span class="token punctuation">[</span><span class="token operator">&amp;</span><span class="token punctuation">]</span><span class="token punctuation">(</span><span class="token keyword">int</span> <span class="token operator">&amp;</span>a<span class="token punctuation">,</span> <span class="token keyword">int</span> <span class="token operator">&amp;</span>b<span class="token punctuation">)</span>
  <span class="token punctuation">{</span>
    <span class="token comment">// "a" and "b" correspond to the two values of the same name in the state structure.</span>
    <span class="token comment">//  All the things that need to modify state values must, then, happen in this lambda,</span>
    <span class="token comment">//  as there is no way to access the struct members directly.</span>
    a <span class="token operator">+=</span> someVariable<span class="token punctuation">;</span>
    b <span class="token operator">-=</span> <span class="token number">2</span><span class="token punctuation">;</span>
  <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span></code></pre>
<p>The main advantage this has over the other form is that it makes it considerably more difficult to accidentally (or intentionally?) grab a reference to the internal state structure that could then persist outside of the lock. However, in practice I felt that this kind of mistake would be rare relative to the problem being solved, and should also be easy to catch during code review. Plus, there are additional downsides to this version that I didn’t like:</p>
<ul>
<li>It’s easy to get the order or names of the lambda parameters wrong and end up doing the wrong things with the wrong values since they’d effectively have to match the order in the state.</li>
<li>It gets harder to call sub-functions that need the lock (you’d have to pass <em>all</em> of the state objects that need updating which can be a pain in practice).</li>
<li>It’s worse to debug, since you have to step <em>into</em> the Lock call instead of just over it, and then into the lambda from there.
<ul>
<li>This is also the reason I didn’t consider another variant where instead of a parameter per state object it’s a single reference to <code>State</code><span class="attrs"></span> and you access it that way.</li>
</ul>
</li>
</ul>
<p>All said, I felt like having a lock object that provides access to the inner state as-is (via the <code>-&gt;</code><span class="attrs"></span> operator) was cleaner in practice, and easier to step through in the debugger.</p>
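<p>For comparison, here is a rough sketch of the simpler callback variant mentioned in the list above, where the lambda receives a single reference to <code>State</code> rather than one parameter per member. The class name <code>CallbackProtected</code>, the method name <code>WithLock</code>, and the demo function are all hypothetical; this is one way it might look, not a definitive design.</p>

```cpp
#include <cassert>
#include <mutex>
#include <utility>

// Hypothetical callback-style wrapper: the state is only reachable inside fn.
template <typename T>
class CallbackProtected
{
public:
  // Runs fn(state) with the mutex held. The result is returned by value
  // ("auto" drops references), so a reference to the state cannot be
  // returned directly.
  template <typename Fn>
  auto WithLock(Fn &&fn)
  {
    std::lock_guard<std::mutex> lock{m_mutex};
    return std::forward<Fn>(fn)(m_t);
  }

private:
  std::mutex m_mutex;
  T m_t{};
};

// Hypothetical state for the demo.
struct State
{
  int a = 0;
  int b = 10;
};

int RunCallbackDemo(int someVariable)
{
  CallbackProtected<State> state;

  // All modifications happen inside the lambda, under the lock.
  state.WithLock([&](State &s)
  {
    s.a += someVariable;
    s.b -= 2;
  });

  // Reads go through the same path and come back as a copy.
  return state.WithLock([](State &s) { return s.a + s.b; });
}
```

<p>Note that stepping through either <code>WithLock</code> call in a debugger means stepping into the wrapper and then into the lambda, which is the debugging cost described above.</p>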
<h3>Limitations</h3>
<p>This, of course, isn’t a perfect solution:</p>
<ul>
<li>There are absolutely cases where some things need to be accessible outside of the mutex lock (e.g. an atomic which can be safely read at any time but only gets updated from within the mutex due to sequencing issues), and as such those values couldn’t live inside of the inner state object.</li>
<li>Also, this is C++ so there’s nothing preventing someone from grabbing a reference or pointer to the state object from the lock and holding onto it until after the lock ends, then partying on the data. However, that kind of code is more likely to get caught at review time.</li>
<li>There are likely other cases (multiple locks, perhaps, which I haven’t fully thought through with this because I haven’t needed to) where this would present problems. This is definitely primarily intended for the “I have a set of data that should only ever be accessed during a lock” case.</li>
</ul>
<h3>Potential Improvements</h3>
<p>Also, there are some improvements that could be made:</p>
<ul>
<li>Perhaps instead of using a <code>lock_guard</code><span class="attrs"></span> you could use a <code>unique_lock</code><span class="attrs"></span>, which would let you wait on a condition variable using the lock (if you need to wait for the state to be in some specific configuration). That could even be built into the <code>MutexProtected&lt;T&gt;</code><span class="attrs"></span> class (or a similar one) if waiting is a core part of the usage.</li>
<li>It’s a good idea to declare a custom move constructor and move assignment operator for <code>MutexLocked&lt;T&gt;</code><span class="attrs"></span> that nulls the <code>m_p</code><span class="attrs"></span> pointer of the object being moved from, so that you can’t move it to some other location (which subsequently releases the lock) and then still party on the internal pointer. Since <code>lock_guard</code><span class="attrs"></span> itself can’t be moved, this also means holding a movable lock type such as <code>unique_lock</code><span class="attrs"></span>. (I also have asserts in my production version in the pointer-access operators that assert the pointer is non-null)</li>
<li>Similarly, it may be nice to have a constructor for <code>MutexProtected&lt;T&gt;</code><span class="attrs"></span> that takes constructor arguments for the contained state structure (similar to, say <code><a href="https://en.cppreference.com/w/cpp/container/vector/emplace_back" target="_blank" rel="noopener">std::vector::emplace_back</a></code>), especially if you are going to have state structures that cannot default construct.</li>
<li>There are also other flavors of locking that might occur: for instance, a <code>TryLock</code><span class="attrs"></span> function that returns a <code>std::optional&lt;MutexLocked&lt;T&gt;&gt;</code><span class="attrs"></span>, and only locks the lock (and returns the locked state) if there is no contention on the mutex.</li>
</ul>
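<p>As a rough sketch of how several of these improvements might fit together (this is an illustration under assumptions, not my production version): the lock object below holds a <code>std::unique_lock</code> so it can be moved, nulls the moved-from pointer, and asserts on access; the owner forwards constructor arguments to the state and offers a <code>TryLock</code> that returns a <code>std::optional</code>. The exact signatures, the brace-initialization of the state, and the <code>RunImprovedDemo</code> function are all assumptions. It requires C++17.</p>

```cpp
#include <cassert>
#include <mutex>
#include <optional>
#include <utility>

template <typename T>
class MutexProtected;

template <typename T>
class MutexLocked
{
public:
  // Moving transfers lock ownership and nulls the source's pointer.
  MutexLocked(MutexLocked &&other) noexcept
  : m_lock{std::move(other.m_lock)}, m_p{std::exchange(other.m_p, nullptr)}
    { }

  MutexLocked &operator=(MutexLocked &&other) noexcept
  {
    if (this != &other)
    {
      m_lock = std::move(other.m_lock); // releases any lock held by *this
      m_p = std::exchange(other.m_p, nullptr);
    }
    return *this;
  }

  T *operator->() const { assert(m_p != nullptr); return m_p; }
  T &operator*() const { assert(m_p != nullptr); return *m_p; }

private:
  MutexLocked(T *p, std::unique_lock<std::mutex> lock)
  : m_lock{std::move(lock)}, m_p{p}
    { }

  std::unique_lock<std::mutex> m_lock;
  T *m_p = nullptr;

  friend class MutexProtected<T>;
};

template <typename T>
class MutexProtected
{
public:
  // Forward constructor arguments to the state (brace-init, so aggregates work).
  template <typename... Args>
  explicit MutexProtected(Args &&...args)
  : m_t{std::forward<Args>(args)...}
    { }

  MutexLocked<T> Lock()
    { return MutexLocked<T>{ &m_t, std::unique_lock<std::mutex>{m_mutex} }; }

  // Returns the locked state only if the mutex could be taken without blocking.
  std::optional<MutexLocked<T>> TryLock()
  {
    std::unique_lock<std::mutex> lock{m_mutex, std::try_to_lock};
    if (!lock.owns_lock())
      { return std::nullopt; }
    return MutexLocked<T>{ &m_t, std::move(lock) };
  }

private:
  std::mutex m_mutex;
  T m_t;
};

// Hypothetical state with no default member initializer.
struct State { int m_thing; };

int RunImprovedDemo()
{
  MutexProtected<State> state{40}; // forwarded to State{40}
  {
    auto first = state.Lock();
    first->m_thing += 1;
    MutexLocked<State> second = std::move(first); // the lock moves with it
    second->m_thing += 1; // using "first" here would trip the assert
  } // "second" leaves scope: mutex released

  auto maybe = state.TryLock(); // uncontended, so this should succeed
  return maybe ? (*maybe)->m_thing : -1;
}
```

<p>One caveat worth keeping in mind: calling <code>TryLock</code> from a thread that already holds the same <code>std::mutex</code> is undefined behavior, which is why the demo releases the first lock before trying again.</p>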
<h3>Closing Time</h3>
<p>All in all, having an abstraction like this makes it way more difficult to party all over internal state without properly locking the mutex first. Switching some old code to use this actually found a couple places where I’d done things incorrectly (reading values that should have been mutex protected on read, in those cases).</p>
<p>So, yeah, by protecting our data from being accessed when it shouldn’t be, we’re also protecting ourselves from ourselves.</p>

        ]]>
      </content:encoded>
    </item>
    <item>
      <title><![CDATA[Protecting Coders From Ourselves: Min, Max, Lerp, and Clamp]]></title>
      <link>https://www.drilian.com/posts/2025.01.10-protecting-coders-from-ourselves-min-max-lerp-and-clamp/</link>
      <pubDate>Fri, 10 Jan 2025 23:13:29 PST</pubDate>
      <dc:creator><![CDATA[Josh Jersild]]></dc:creator>
      <guid>https://www.drilian.com/posts/2025.01.10-protecting-coders-from-ourselves-min-max-lerp-and-clamp/</guid>
      <content:encoded>
        <![CDATA[
          <p><em>Imagine this:</em> you’ve got some value, <code>x</code><span class="attrs"></span>, that you want to ensure is at least <code>1</code><span class="attrs"></span>. That is to say, you want to ensure its <strong>minimum value</strong> is <code>1</code><span class="attrs"></span>. So, being the smart, experienced programmer that you are, you write the following:</p>
<pre class="language-cpp"><code class="language-cpp">x <span class="token operator">=</span> <span class="token function">Min</span><span class="token punctuation">(</span>x<span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">;</span></code></pre>
<p>You give yourself the small, satisfied nod of a job well done and run the program and then it all goes <em>immediately sideways</em> because that should have been <code>Max</code><span class="attrs"></span> and not <code>Min</code><span class="attrs"></span>.</p>
<p>If you’ve been writing code for basically any length of time, the above was probably less an <em>“imagine this”</em> and more a <em>“remember this”</em> because if you’re anything like me, you’ve done this over. and over. and <em>over</em>.</p>
<p>Inspired by <em>once again</em> mistakenly using <code>Min</code><span class="attrs"></span> instead of <code>Max</code><span class="attrs"></span> to limit the minimum allowed value of something, I’ve decided to start a little series (will it have more than one entry? who knows!) called <strong>Protecting Coders From Ourselves</strong>, in which we rework some bit of API surface to make it less error-prone. We’re going to deal with the “I chose the wrong <code>Min</code><span class="attrs"></span>/<code>Max</code><span class="attrs"></span> again” problem, but first I want to talk about <code>Clamp</code><span class="attrs"></span> and <code>Lerp</code><span class="attrs"></span>.</p>
<p><span class="read-more"></span></p>
<blockquote class="note">
          <div class="header"><span>Note</span></div>
<p>The examples here are in C++, but the concepts should be relevant to basically any language.</p>
<p></p></blockquote><p></p>
<h3>Clamp and Lerp</h3>
<h4>Clamp</h4>
<p>Clamp is a simple enough function: Take some value <code>v</code><span class="attrs"></span> and make sure it is no less than <code>min</code><span class="attrs"></span> and no greater than <code>max</code><span class="attrs"></span>. Almost every <code>clamp</code><span class="attrs"></span> function in every library I’ve seen has three parameters, one of which represents the value to be clamped, and two of which represent the range that it should be clamped within:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token function">Clamp</span><span class="token punctuation">(</span>a<span class="token punctuation">,</span> b<span class="token punctuation">,</span> c<span class="token punctuation">)</span></code></pre>
<p>The question, of course, is <strong>“which parameter is which?”</strong> Many languages (ex: <a href="https://en.cppreference.com/w/cpp/algorithm/clamp" target="_blank" rel="noopener">C++</a>, <a href="https://learn.microsoft.com/en-us/dotnet/api/system.math.clamp?view=net-9.0" target="_blank" rel="noopener">C#</a>, <a href="https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl-clamp" target="_blank" rel="noopener">HLSL</a>, and <a href="https://docs.rs/num/latest/num/fn.clamp.html" target="_blank" rel="noopener">Rust</a>) have the following arrangement in their standard libraries:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token function">Clamp</span><span class="token punctuation">(</span>valueToClamp<span class="token punctuation">,</span> min<span class="token punctuation">,</span> max<span class="token punctuation">)</span></code></pre>
<p>where the first parameter is the value being clamped and the last two are the min and max ends of the range.</p>
<p>But I’ve also seen this one (looking at <em>you</em>, <a href="https://developer.mozilla.org/en-US/docs/Web/CSS/clamp" target="_blank" rel="noopener">CSS</a>):</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token function">Clamp</span><span class="token punctuation">(</span>min<span class="token punctuation">,</span> valueToClamp<span class="token punctuation">,</span> max<span class="token punctuation">)</span></code></pre>
<p>This one puts the value to clamp in the middle of the range (which, honestly, is conceptually where it belongs).</p>
<h4>Lerp</h4>
<p>Another common function that takes a value and a range is <code>Lerp</code><span class="attrs"></span>, which uses a value in the range <code>[0, 1]</code><span class="attrs"></span> to <a href="https://en.wikipedia.org/wiki/Linear_interpolation" target="_blank" rel="noopener">linearly interpolate</a> between two endpoint values. Most lerp functions that I’ve seen (ex: <a href="https://en.cppreference.com/w/cpp/numeric/lerp" target="_blank" rel="noopener">C++</a>, <a href="https://learn.microsoft.com/en-us/dotnet/api/system.double.lerp?view=net-9.0" target="_blank" rel="noopener">C#</a>, and <a href="https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl-lerp" target="_blank" rel="noopener">HLSL</a>) have their parameters ordered as follows:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token function">Lerp</span><span class="token punctuation">(</span>a<span class="token punctuation">,</span> b<span class="token punctuation">,</span> t<span class="token punctuation">)</span></code></pre>
<p>where <code>a</code><span class="attrs"></span> and <code>b</code><span class="attrs"></span> are the <strong>range endpoints</strong> and <code>t</code><span class="attrs"></span> is the <strong>interpolating value</strong>.</p>
<p>Depending on the projects you work on, you maybe don’t write code that uses <code>Lerp</code><span class="attrs"></span> very often (or ever), but I do, and for me the combination of <code>Lerp</code><span class="attrs"></span> and <code>Clamp</code><span class="attrs"></span> is a source of constant, mild confusion.</p>
<h3>Parameter Confusion</h3>
<p>Both of these functions take three parameters, two of which represent a <strong>range</strong> and one which is a <strong>value</strong> that is either limited by or used to interpolate within the range. In isolation it’s easy to rationalize the order of the parameters for each:</p>
<ul>
<li><code class="inline-code language-cpp"><span class="token function">Clamp</span><span class="token punctuation">(</span>v<span class="token punctuation">,</span> min<span class="token punctuation">,</span> max<span class="token punctuation">)</span></code>: Clamp <code>v</code><span class="attrs"></span> to be within the range <code>[min, max]</code><span class="attrs"></span>.</li>
<li><code class="inline-code language-cpp"><span class="token function">Clamp</span><span class="token punctuation">(</span>min<span class="token punctuation">,</span> v<span class="token punctuation">,</span> max<span class="token punctuation">)</span></code>: Clamp such that <code>min &lt;= v &lt;= max</code><span class="attrs"></span>.</li>
<li><code class="inline-code language-cpp"><span class="token function">Lerp</span><span class="token punctuation">(</span>a<span class="token punctuation">,</span> b<span class="token punctuation">,</span> t<span class="token punctuation">)</span></code>: Get a linearly-interpolated value between <code>a</code><span class="attrs"></span> and <code>b</code><span class="attrs"></span> using <code>t</code><span class="attrs"></span>.</li>
</ul>
<p>…but in combination I am <em>constantly</em> second-guessing which order the parameters need to go in. Sometimes, for instance, I’ll write the equivalent of <code class="inline-code language-cpp"><span class="token function">Lerp</span><span class="token punctuation">(</span>t<span class="token punctuation">,</span> a<span class="token punctuation">,</span> b<span class="token punctuation">)</span></code> and wonder why nothing is working the way I expect.</p>
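<p>To make that failure mode concrete, here is a small sketch using the standard library’s <code>std::clamp</code> and <code>std::lerp</code> (the wrapper function names are made up for this example). Because all three parameters share a type, the wrong order compiles without complaint:</p>
<pre class="language-cpp"><code class="language-cpp">#include &lt;algorithm&gt; // std::clamp (C++17)
#include &lt;cmath&gt;     // std::lerp (C++20)

// Intended call: a quarter of the way from 10 to 20.
inline double LerpRightOrder()
  { return std::lerp(10.0, 20.0, 0.25); }

// Same three values with t first. This compiles cleanly, but now
// a = 0.25, b = 10, and t = 20, so the result lands far outside [10, 20].
inline double LerpWrongOrder()
  { return std::lerp(0.25, 10.0, 20.0); }

// std::clamp is value-first: (value, low, high).
inline int ClampRightOrder()
  { return std::clamp(7, 0, 5); }</code></pre>
<p>Nothing in the type system distinguishes the “range” arguments from the “value” argument here, which is the gap the rest of this post tries to close.</p>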
<p>That brings us (finally) to the question of the article: <strong>how can we make it clear</strong> which parameters are which in these functions?</p>
<p>An obvious way to do this, given language support, is to make use of <strong>named parameters</strong> when calling the function:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token function">Clamp</span><span class="token punctuation">(</span>v<span class="token operator">=</span>value<span class="token punctuation">,</span> min<span class="token operator">=</span><span class="token number">0</span><span class="token punctuation">,</span> max<span class="token operator">=</span><span class="token number">5</span><span class="token punctuation">)</span></code></pre>
<p>but not all languages (looking at <em>you</em>, C++) support named parameters, so what then?</p>
<h3>Grouping the Range Values</h3>
<p>What if instead of the above, calls looked more like this:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token function">Clamp</span><span class="token punctuation">(</span>a<span class="token punctuation">,</span> <span class="token punctuation">{</span>b<span class="token punctuation">,</span> c<span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token function">Lerp</span><span class="token punctuation">(</span><span class="token punctuation">{</span>a<span class="token punctuation">,</span> b<span class="token punctuation">}</span><span class="token punctuation">,</span> c<span class="token punctuation">)</span><span class="token punctuation">;</span></code></pre>
<p>With this added structure, it’s clear <em>even without reasonable variable names</em> which part is the <strong>range</strong> and which is the <strong>value</strong>, and it’s <em>much</em> more difficult to accidentally call them with the parameters in the wrong order, since, for instance, <code class="inline-code language-cpp"><span class="token function">Lerp</span><span class="token punctuation">(</span>t<span class="token punctuation">,</span> <span class="token punctuation">{</span>a<span class="token punctuation">,</span> b<span class="token punctuation">}</span><span class="token punctuation">)</span></code> wouldn’t even compile.</p>
<p>To do this, we need a simple range structure. In C++ it could look something like this:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token keyword">template</span> <span class="token operator">&lt;</span><span class="token keyword">typename</span> <span class="token class-name">T</span><span class="token operator">></span>
<span class="token keyword">struct</span> <span class="token class-name">ValueRange</span>
<span class="token punctuation">{</span>
  <span class="token comment">// Use a constructor to ensure both endpoints are required.</span>
  <span class="token function">ValueRange</span><span class="token punctuation">(</span>T a_<span class="token punctuation">,</span> T b_<span class="token punctuation">)</span>
    <span class="token operator">:</span> a<span class="token punctuation">{</span>a_<span class="token punctuation">}</span><span class="token punctuation">,</span> b<span class="token punctuation">{</span>b_<span class="token punctuation">}</span> <span class="token punctuation">{</span><span class="token punctuation">}</span>

  T a<span class="token punctuation">;</span>
  T b<span class="token punctuation">;</span>
<span class="token punctuation">}</span><span class="token punctuation">;</span></code></pre>
<p>Using this range, then, you could define <code>Clamp</code><span class="attrs"></span> and <code>Lerp</code><span class="attrs"></span> as follows:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token keyword">template</span> <span class="token operator">&lt;</span><span class="token keyword">typename</span> <span class="token class-name">T</span><span class="token operator">></span>
T <span class="token function">Clamp</span><span class="token punctuation">(</span>T v<span class="token punctuation">,</span> ValueRange<span class="token operator">&lt;</span>T<span class="token operator">></span> range<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token keyword">return</span> <span class="token function">Max</span><span class="token punctuation">(</span>range<span class="token punctuation">.</span>a<span class="token punctuation">,</span> <span class="token function">Min</span><span class="token punctuation">(</span>range<span class="token punctuation">.</span>b<span class="token punctuation">,</span> v<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> 
<span class="token punctuation">}</span>

<span class="token keyword">template</span> <span class="token operator">&lt;</span><span class="token keyword">typename</span> <span class="token class-name">T</span><span class="token operator">></span>
T <span class="token function">Lerp</span><span class="token punctuation">(</span>ValueRange<span class="token operator">&lt;</span>T<span class="token operator">></span> range<span class="token punctuation">,</span> T t<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token keyword">return</span> range<span class="token punctuation">.</span>b <span class="token operator">*</span> t <span class="token operator">+</span> range<span class="token punctuation">.</span>a <span class="token operator">*</span> <span class="token punctuation">(</span><span class="token number">1</span> <span class="token operator">-</span> t<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre>
<p>Now instead of <em>three</em> parameters, they take <em>two</em>, which matches how they work conceptually: with a <strong>range</strong> and a <strong>value</strong> in some order.</p>
<p>Once you start grouping your input parameters, you may start seeing other places to do it, like <code class="inline-code language-cpp"><span class="token generic-function"><span class="token function">IsInRange</span><span class="token generic class-name"><span class="token operator">&lt;</span>Inclusive<span class="token operator">></span></span></span><span class="token punctuation">(</span>v<span class="token punctuation">,</span> <span class="token punctuation">{</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">20</span><span class="token punctuation">}</span><span class="token punctuation">)</span></code>.</p>
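<p>As a rough sketch of what such an <code>IsInRange</code> could look like: the <code>RangeBounds_t</code> enum and its <code>Exclusive</code> value are hypothetical names invented for this example, and <code>ValueRange</code> is repeated (with <code>constexpr</code> added) so the snippet stands alone:</p>
<pre class="language-cpp"><code class="language-cpp">template &lt;typename T&gt;
struct ValueRange
{
  constexpr ValueRange(T a_, T b_)
    : a{a_}, b{b_} {}

  T a;
  T b;
};

// Hypothetical tag: do the endpoints themselves count as "in range"?
enum class RangeBounds_t
{
  Inclusive,
  Exclusive,
};

// C++20 "using enum" makes the values reachable without qualification.
using enum RangeBounds_t;

// Bounds is given explicitly; T is deduced from v, so the braced
// range converts to ValueRange&lt;T&gt; just like it does for Clamp.
template &lt;RangeBounds_t Bounds, typename T&gt;
constexpr bool IsInRange(T v, ValueRange&lt;T&gt; range)
{
  if constexpr (Bounds == Inclusive)
    { return range.a &lt;= v &amp;&amp; v &lt;= range.b; }
  else
    { return range.a &lt; v &amp;&amp; v &lt; range.b; }
}</code></pre>
<p>With this, <code>IsInRange&lt;Inclusive&gt;(20, {0, 20})</code> is true while <code>IsInRange&lt;Exclusive&gt;(20, {0, 20})</code> is false.</p>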
<h3>Min and Max</h3>
<p>Back to the Min/Max problem. <a href="https://peoplemaking.games/@TomF@mastodon.gamedev.place/113802749875252216" target="_blank" rel="noopener">As stated perfectly by Tom Forsyth</a>:</p>
<blockquote>
<p>“Almost every time I use [min or max], I think very carefully and then pick the wrong one.”</p>
</blockquote>
<p>This tends to happen because you think <em>“I need to make sure the max value of <code>x</code><span class="attrs"></span> is 10”</em> and it just feels <em>right</em> to turn that into <code class="inline-code language-cpp"><span class="token function">Max</span><span class="token punctuation">(</span>x<span class="token punctuation">,</span> <span class="token number">10</span><span class="token punctuation">)</span></code>…<strong>that’s how it gets you</strong>. Or, well, me. That’s how it gets <em>me</em>. Basically every time I have to do this, I will choose the wrong one on the first try, have my code explode on me, and then go back and headdesk at it until it turns into the correct one.</p>
<p>Basically. Every. Time.</p>
<h3>Reframing The Problem</h3>
<p>But what if you looked at it a different way? What if you instead thought of it as <em>“I want to clamp <code>x</code><span class="attrs"></span> so that it’s no larger than 10”</em>? You want something that just clamps one end of it, in a clear way. What if you could declare a <strong>one-sided clamp</strong>:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token comment">// Using "Open" to declare a side of the range is open. Same as:</span>
<span class="token comment">//  x = Min(x, 10);</span>
x <span class="token operator">=</span> <span class="token function">Clamp</span><span class="token punctuation">(</span>x<span class="token punctuation">,</span> <span class="token punctuation">{</span>Open<span class="token punctuation">,</span> <span class="token number">10</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>

<span class="token comment">// Another alternative, ensure that y is no smaller than 2, same as: </span>
<span class="token comment">//  y = Max(y, 2);</span>
y <span class="token operator">=</span> <span class="token function">Clamp</span><span class="token punctuation">(</span>y<span class="token punctuation">,</span> <span class="token punctuation">{</span><span class="token number">2</span><span class="token punctuation">,</span> Open<span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span></code></pre>
<p>To do this efficiently, we’ll have multiple overloads of <code>Clamp</code><span class="attrs"></span>, and define two additional “range” structures, each of which takes an <code>OpenEnded_t</code><span class="attrs"></span>:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token comment">// Declare this as a nice, type-safe enum class</span>
<span class="token keyword">enum</span> <span class="token keyword">class</span> <span class="token class-name">OpenEnded_t</span>
<span class="token punctuation">{</span>
  Open<span class="token punctuation">,</span>
<span class="token punctuation">}</span><span class="token punctuation">;</span>

<span class="token comment">// But make "Open" easy to reach using C++20's "using enum" feature.</span>
<span class="token keyword">using</span> <span class="token keyword">enum</span> <span class="token class-name">OpenEnded_t</span><span class="token punctuation">;</span>

<span class="token comment">// This is a "range" where only the "a" value is specified, the "b" end is open</span>
<span class="token keyword">template</span> <span class="token operator">&lt;</span><span class="token keyword">typename</span> <span class="token class-name">T</span><span class="token operator">></span>
<span class="token keyword">struct</span> <span class="token class-name">ValueRangeOpenB</span>
<span class="token punctuation">{</span>
  <span class="token function">ValueRangeOpenB</span><span class="token punctuation">(</span>T a_<span class="token punctuation">,</span> OpenEnded_t<span class="token punctuation">)</span>
    <span class="token operator">:</span> <span class="token function">a</span><span class="token punctuation">(</span>a_<span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token punctuation">}</span>
  T a<span class="token punctuation">;</span>
<span class="token punctuation">}</span><span class="token punctuation">;</span>

<span class="token comment">// Like the above, but it's the "b" end that's specified while "a" is open</span>
<span class="token keyword">template</span> <span class="token operator">&lt;</span><span class="token keyword">typename</span> <span class="token class-name">T</span><span class="token operator">></span>
<span class="token keyword">struct</span> <span class="token class-name">ValueRangeOpenA</span>
<span class="token punctuation">{</span>
  <span class="token function">ValueRangeOpenA</span><span class="token punctuation">(</span>OpenEnded_t<span class="token punctuation">,</span> T b_<span class="token punctuation">)</span>
    <span class="token operator">:</span> <span class="token function">b</span><span class="token punctuation">(</span>b_<span class="token punctuation">)</span> <span class="token punctuation">{</span> <span class="token punctuation">}</span>
  T b<span class="token punctuation">;</span>
<span class="token punctuation">}</span><span class="token punctuation">;</span></code></pre>
<blockquote class="note">
          <div class="header"><span>Note</span></div>
<p>You can, of course, call <code>Open</code><span class="attrs"></span> whatever you’d prefer: I considered many options (including <code>Unbounded</code><span class="attrs"></span>, <code>Infinite</code><span class="attrs"></span>, <code>OpenEnded</code><span class="attrs"></span>, and <code>None</code><span class="attrs"></span>), but <code>Open</code><span class="attrs"></span> was short and, to my mind, clear.</p>
<p>If you have a global <code>Open</code><span class="attrs"></span> function (or you’re in a class that has a function named <code>Open</code><span class="attrs"></span>), this likely won’t work. You could add a second enum value called <code>OpenEnded</code><span class="attrs"></span> that could be used interchangeably with <code>Open</code><span class="attrs"></span> for that case, or just specify it fully qualified, or just pick a less-inconvenient name.</p>
<p></p></blockquote><p></p>
<p>Once you have these structures, you can define two additional overloads of <code>Clamp</code><span class="attrs"></span>:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token keyword">template</span> <span class="token operator">&lt;</span><span class="token keyword">typename</span> <span class="token class-name">T</span><span class="token operator">></span>
T <span class="token function">Clamp</span><span class="token punctuation">(</span>T v<span class="token punctuation">,</span> ValueRangeOpenB<span class="token operator">&lt;</span>T<span class="token operator">></span> range<span class="token punctuation">)</span>
  <span class="token punctuation">{</span> <span class="token keyword">return</span> <span class="token function">Max</span><span class="token punctuation">(</span>v<span class="token punctuation">,</span> range<span class="token punctuation">.</span>a<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span>


<span class="token keyword">template</span> <span class="token operator">&lt;</span><span class="token keyword">typename</span> <span class="token class-name">T</span><span class="token operator">></span>
T <span class="token function">Clamp</span><span class="token punctuation">(</span>T v<span class="token punctuation">,</span> ValueRangeOpenA<span class="token operator">&lt;</span>T<span class="token operator">></span> range<span class="token punctuation">)</span>
  <span class="token punctuation">{</span> <span class="token keyword">return</span> <span class="token function">Min</span><span class="token punctuation">(</span>v<span class="token punctuation">,</span> range<span class="token punctuation">.</span>b<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span></code></pre>
<p>These just turn into the correct call to <code>Min</code><span class="attrs"></span> or <code>Max</code><span class="attrs"></span>, but now you can think of it in terms of <strong>limiting one side of its range or the other</strong>, rather than trying to A Beautiful Mind your way into picking the correct function right off the bat.</p>
<p>Now, finally, you can limit a value in multiple ways using the same concept, which can make it easier to reason about when you’re writing the code, and also easier to understand when you’re reading it a month later.</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token comment">// Keep within a range:</span>
x <span class="token operator">=</span> <span class="token function">Clamp</span><span class="token punctuation">(</span>x<span class="token punctuation">,</span> <span class="token punctuation">{</span><span class="token number">1</span><span class="token punctuation">,</span> <span class="token number">5</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>

<span class="token comment">// Limit the lower bound:</span>
y <span class="token operator">=</span> <span class="token function">Clamp</span><span class="token punctuation">(</span>y<span class="token punctuation">,</span> <span class="token punctuation">{</span><span class="token number">1</span><span class="token punctuation">,</span> Open<span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>

<span class="token comment">// Limit the upper bound:</span>
z <span class="token operator">=</span> <span class="token function">Clamp</span><span class="token punctuation">(</span>z<span class="token punctuation">,</span> <span class="token punctuation">{</span>Open<span class="token punctuation">,</span> <span class="token number">5</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span></code></pre>
<h3>Final Thoughts</h3>
<p>These are ideas I first proposed at my job, and got <em>immediate</em> buy-in from the dev team, because we <em>all</em> kept making the same kinds of mistakes with these functions. There are, of course, times when <code>Min</code><span class="attrs"></span> and <code>Max</code><span class="attrs"></span> are still the appropriate functions to use (like when you’re thinking “I need the minimum of these values”) - but when you’re trying to limit the range of something, <code>Clamp</code><span class="attrs"></span> is a clearer declaration of intent.</p>
<p>There are ways to improve these functions:</p>
<ul>
<li>In C++ I <em>highly recommend</em> making all of this <code>constexpr</code><span class="attrs"></span> (including the constructors) so that you can use these functions at compile time as well.
<ul>
<li>Depending on your codebase it may be desirable to additionally mark them <code>[[nodiscard]]</code><span class="attrs"></span> and <code>noexcept</code><span class="attrs"></span>.</li>
<li>Also, restricting the template types using <a href="https://en.cppreference.com/w/cpp/language/constraints" target="_blank" rel="noopener">C++20 concepts</a> can help give you better error messages if you try to compile it with something that it can’t work with.</li>
</ul>
</li>
<li>The full-range <code>Clamp</code><span class="attrs"></span> function may want to have some validation that <code>b &gt;= a</code><span class="attrs"></span> (perhaps an <code>assert</code><span class="attrs"></span>).</li>
</ul>
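<p>Putting those suggestions together for the full-range <code>Clamp</code> might look something like the sketch below. It uses <code>std::totally_ordered</code> as the constraint and <code>std::min</code>/<code>std::max</code> in place of <code>Min</code>/<code>Max</code>, both of which are choices made for this example; the unconditional <code>noexcept</code> also assumes <code>T</code> is something cheap that can’t throw on copy, like an arithmetic type:</p>
<pre class="language-cpp"><code class="language-cpp">#include &lt;algorithm&gt;
#include &lt;cassert&gt;
#include &lt;concepts&gt;

template &lt;std::totally_ordered T&gt;
struct ValueRange
{
  constexpr ValueRange(T a_, T b_) noexcept
    : a{a_}, b{b_} {}

  T a;
  T b;
};

template &lt;std::totally_ordered T&gt;
[[nodiscard]] constexpr T Clamp(T v, ValueRange&lt;T&gt; range) noexcept
{
  // An inverted range is a bug at the call site, so flag it in debug builds.
  assert(range.b &gt;= range.a);
  return std::max(range.a, std::min(range.b, v));
}

// Because everything is constexpr, this is checked at compile time.
static_assert(Clamp(7, {1, 5}) == 5);</code></pre>
<p>Trying to call this with a type that has no ordering now fails at the constraint, rather than somewhere deep inside the function body.</p>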

        ]]>
      </content:encoded>
    </item>
    <item>
      <title><![CDATA[Parser Stuff: Strings]]></title>
      <link>https://www.drilian.com/posts/2025.01.03-parser-stuff-strings/</link>
      <pubDate>Fri, 03 Jan 2025 13:38:29 PST</pubDate>
      <dc:creator><![CDATA[Josh Jersild]]></dc:creator>
      <guid>https://www.drilian.com/posts/2025.01.03-parser-stuff-strings/</guid>
      <content:encoded>
        <![CDATA[
          <p>A while back I was working on a programming language idea and while I haven’t made any progress on it in ages, I really liked the string design that I came up with. I don’t know that any of the ideas are original, but I haven’t seen anything exactly like it so I figure I’d throw the idea out into the ether in case anyone else happens to do something similar in their own personal language that they definitely shouldn’t be making 😆</p>
<p>(<strong>I’ll note upfront</strong>: this design uses dollar signs (<code>$</code><span class="attrs"></span>) and backticks (<code>`</code><span class="attrs"></span>). There are many languages that do so (like JavaScript!), but these keys are not available on all keyboards internationally, so for those locales it may be more difficult to type these out…my language design was pretty much just for me with my standard US keyboard, so I didn’t take this into account.)</p>
<p><span class="read-more"></span></p>
<h3>Basic Strings</h3>
<p>There’s nothing fancy about these, they’re just like most other languages’ basic strings:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token string">"This is a string"</span>
<span class="token string">"This is a string with a newline at the end: \n"</span>
<span class="token string">"Quotes? \"Escape them\""</span>
"Oops forgot to end <span class="token keyword">this</span> one<span class="token punctuation">,</span> it<span class="token number">'</span>s a compiler error</code></pre>
<p>Just a pair of double quotes with everything between being non-quotes (or escaped quotes), contained within a single line.</p>
<h3>Raw Strings</h3>
<p>Raw strings in my language design are kind of a blend between <a href="https://en.cppreference.com/w/cpp/language/string_literal" target="_blank" rel="noopener">C++11-style raw strings</a> and <a href="https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/tokens/raw-string" target="_blank" rel="noopener">C#11 raw strings</a> (the elevens!), using a different delimiting character variation than I’ve seen elsewhere. In this case, the simplest one starts and ends with a pair of backticks:</p>
<pre><code><span class="token string">``This is a single raw string
it is multiple lines long``</span></code></pre>
<p>It can contain any character sequence except the delimiting sequence (again, at simplest a pair of backticks: ``)</p>
<p>But what if your string needs to contain a consecutive pair of backticks? This is where the C++11 raw string inspiration comes in: you can put any string of characters between the ticks (excluding ticks, obviously, or newlines), and then the start and end have to match.</p>
<pre><code><span class="token string">`uniqueString`This is a single string
it contains backticks without terminating: `` ... see?
This is the last line and ends here:`uniqueString`</span></code></pre>
<p>This one starts with <code>`uniqueString`</code><span class="attrs"></span>, and so the only thing that will terminate it is that same sequence: <code>`uniqueString`</code><span class="attrs"></span> (with the tick marks around it).</p>
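<p>The matching rule is simple enough to sketch in a few lines. This assumes a lexer written in C++ that is already positioned at the opening backtick; <code>ScanRawStringBody</code> is a made-up name, not something from an actual implementation:</p>
<pre class="language-cpp"><code class="language-cpp">#include &lt;cstddef&gt;
#include &lt;optional&gt;
#include &lt;string_view&gt;

// Returns the text between the delimiters, or nullopt if `src` does not
// begin with a complete raw string.
std::optional&lt;std::string_view&gt; ScanRawStringBody(std::string_view src)
{
  constexpr auto npos = std::string_view::npos;

  if (src.empty() || src.front() != '`')
    { return std::nullopt; }

  // The opening delimiter runs up to and including the second backtick.
  const std::size_t secondTick = src.find('`', 1);
  if (secondTick == npos)
    { return std::nullopt; }
  const std::string_view delimiter = src.substr(0, secondTick + 1);
  if (delimiter.find('\n') != npos)
    { return std::nullopt; } // the text between the ticks can't span lines

  // The body ends at the first later occurrence of that exact sequence.
  const std::string_view rest = src.substr(delimiter.size());
  const std::size_t end = rest.find(delimiter);
  if (end == npos)
    { return std::nullopt; } // unterminated
  return rest.substr(0, end);
}</code></pre>
<p>Since the closing sequence has to match the opening one exactly, a bare pair of backticks inside a <code>`uniqueString`</code>-delimited string is just more body text.</p>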
<p>To add to this, cribbing from C#11 it will:</p>
<ol>
<li>Trim the very first newline if there is one</li>
<li>Also trim the last newline if there is one</li>
<li>Unindent every line of it based on the indentation of the final quote sequence:</li>
</ol>
<pre><code>myString <span class="token operator">=</span> <span class="token string">``
  This is actually the first line of the string, the newline was ignored
  {
    indented further
  }
  ``</span><span class="token operator">;</span> <span class="token comment">// Note that this is indented 2 spaces</span></code></pre>
<p>which turns into the string (note the lack of being completely indented):</p>
<pre><code>This is actually the first line of the string, the newline was ignored
{
  indented further
}
</code></pre>
<p>This makes it easier to generate code (or text files or whatever) that are properly indented, without having to make the indenting of the string in your code all weird.</p>
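<p>Here is a minimal sketch of those three trimming rules, again assuming a C++ implementation. <code>UnindentRawString</code> is a hypothetical helper that receives everything between the delimiters; it assumes <code>\n</code> line endings and reads “trim the last newline” as also dropping the closing delimiter’s own indentation:</p>
<pre class="language-cpp"><code class="language-cpp">#include &lt;cstddef&gt;
#include &lt;string&gt;
#include &lt;string_view&gt;

std::string UnindentRawString(std::string_view body)
{
  constexpr auto npos = std::string_view::npos;

  // The closing delimiter's indentation is whatever follows the last newline.
  const std::size_t lastNewline = body.rfind('\n');
  if (lastNewline == npos)
    { return std::string(body); } // single-line string: nothing to trim
  const std::string_view indent = body.substr(lastNewline + 1);
  if (indent.find_first_not_of(" \t") != npos)
    { return std::string(body); } // text shares the closing line: leave as-is

  // Rule 2: drop the last newline (and the indentation after it).
  body = body.substr(0, lastNewline);
  // Rule 1: drop the very first newline, if there is one.
  if (!body.empty() &amp;&amp; body.front() == '\n')
    { body.remove_prefix(1); }

  // Rule 3: strip the closing delimiter's indentation from each line.
  std::string result;
  for (;;)
  {
    const std::size_t eol = body.find('\n');
    std::string_view line = body.substr(0, eol);
    if (line.substr(0, indent.size()) == indent)
      { line.remove_prefix(indent.size()); }
    result.append(line);
    if (eol == npos)
      { break; }
    result += '\n';
    body.remove_prefix(eol + 1);
  }
  return result;
}</code></pre>
<p>Lines that are indented less than the closing delimiter are passed through untouched here; a real compiler might prefer to report those as errors.</p>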
<h3>Interpolated Strings</h3>
<p>I’m additionally adding interpolated string support (which is a weird term), using a mix of C# and JavaScript’s setup:</p>
<pre><code><span class="token string">$"This string has a ${</span>value<span class="token string">} in it"</span></code></pre>
<p>If a string starts with <code>$</code><span class="attrs"></span>, it’s treated as an interpolated string. A string-convertible expression can be inserted in-place in the string within <code>${}</code><span class="attrs"></span>.</p>
<p>But what if you need to have the character sequence <code>${</code><span class="attrs"></span> in your string? Add more dollar signs to the start, and you need that many dollar signs before a <code>{</code><span class="attrs"></span> to enter the Interpolation Zone:</p>
<pre><code><span class="token string">$$"This string has a $${</span>value<span class="token string">} in it, ${but this isn't one}"</span></code></pre>
<p>Raw strings can also be used as interpolated strings (making for some nice codegen), same rules apply:</p>
<pre><code><span class="token string">    $$``
      Interpolated string with
        multiple lines and a $${</span>value<span class="token string">} in it.
        ${this is not a value because only one $}
      ``</span></code></pre>
<p>If <code>value == 5</code><span class="attrs"></span> this would turn into the following string (upon formatting):</p>
<pre><code>Interpolated string with
  multiple lines and a 5 in it.
  ${this is not a value because only one $}
</code></pre>
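<p>As a rough illustration of the “that many dollar signs before a <code>{</code>” rule, here’s a small sketch that splits a string body into literal and expression pieces. <code>Segment</code> and <code>SplitInterpolated</code> are hypothetical names, and the sketch makes two assumptions that the rules above don’t spell out: expressions contain no nested braces, and any extra <code>$</code> in front of a full opener is kept as literal text.</p>

```cpp
#include <string>
#include <vector>

// Hypothetical types/names for illustration only.
struct Segment
{
  bool isExpression; // true for the text inside an interpolation hole
  std::string text;
};

// Splits 'body' (the text between the quotes) into literal and expression
// segments, given how many '$' characters prefixed the string.
// Assumption: expressions never contain '}'.
std::vector<Segment> SplitInterpolated(const std::string &body, std::size_t dollarCount)
{
  const std::string opener = std::string(dollarCount, '$') + "{";
  std::vector<Segment> segments;
  std::size_t pos = 0;
  while (pos < body.size())
  {
    const std::size_t open = body.find(opener, pos);
    const std::size_t close = (open == std::string::npos)
                                ? std::string::npos
                                : body.find('}', open + opener.size());
    if (close == std::string::npos)
    {
      // No further complete hole: the remainder is literal text.
      segments.push_back({false, body.substr(pos)});
      break;
    }
    if (open > pos)
      segments.push_back({false, body.substr(pos, open - pos)});
    segments.push_back({true, body.substr(open + opener.size(),
                                          close - open - opener.size())});
    pos = close + 1;
  }
  return segments;
}
```

<p>With a dollar count of 2, the <code>$$"..."</code> example above splits into the literal <code>This string has a </code>, the expression <code>value</code>, and a final literal that still contains the untouched <code>${but this isn't one}</code>.</p>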
<h3>I Just Think They’re Neat</h3>
<p>Anyway, I think this is a really nice combination of properties that make it easy to format strings nicely without being overly-complicated to actually use (unlike C++'s raw strings, which I have to look up literally every time I need to use one). Need a hardcoded regex or path with backslashes? In most cases, just use a raw string with <code>``</code><span class="attrs"></span> on either end:</p>
<pre><code><span class="token string">``C:\Path\With\Single\Backslashes``</span></code></pre>
<p>or</p>
<pre><code><span class="token string">``^[\r\n \t]*Hi[\r\n \t]*$``</span></code></pre>
<p>Hope this was at least mildly interesting to someone!</p>

        ]]>
      </content:encoded>
    </item>
    <item>
      <title><![CDATA[Emulating the FMAdd Instruction, Part 2: 64-bit Floats]]></title>
      <link>https://www.drilian.com/posts/2025.01.02-emulating-the-fmadd-instruction-part-2-64-bit-floats/</link>
      <pubDate>Thu, 02 Jan 2025 02:22:15 PST</pubDate>
      <dc:creator><![CDATA[Josh Jersild]]></dc:creator>
      <guid>https://www.drilian.com/posts/2025.01.02-emulating-the-fmadd-instruction-part-2-64-bit-floats/</guid>
      <content:encoded>
        <![CDATA[
          <p>(This post follows <a href="https://www.drilian.com/posts/2024.12.31-emulating-the-fmadd-instruction-part-1-32-bit-floats/">Part 1: 32-bit floats</a> and will make very little sense without having read that one first. Honestly, it might make little sense <em>having</em> read that one first, I dunno!)</p>
<p>Last time we went over how to calculate the results of the FMAdd instruction (a fused-multiply-add calculated as if it had infinite internal precision) for 32-bit <strong>single</strong>-precision float values:</p>
<ul>
<li>Calculate the double-precision product of <code>a</code><span class="attrs"></span> and <code>b</code><span class="attrs"></span></li>
<li>Add this product to <code>c</code><span class="attrs"></span> to get a double-precision sum</li>
<li>Calculate the error of the sum</li>
<li>Use the error to odd-round the sum</li>
<li>Round the double-precision sum back down to single precision</li>
</ul>
<p>This requires casting up to a 64-bit <strong>double</strong>-precision float to get extra bits of precision. But what if you can’t do that? What if you’re using doubles? You can’t just (in most cases) cast up to a <em>quad</em>-precision float. So what do you do?</p>
<h3>Double-Precision FMAdd</h3>
<p>(Like the 32-bit version, this is based on <a href="https://www.lri.fr/~melquion/doc/08-tc.pdf" target="_blank" rel="noopener">Emulation of FMA and correctly-rounded sums: proved algorithms using rounding to odd</a> by Sylvie Boldo and Guillaume Melquiond; there are additional details in there.)</p>
<p>To do this natively as doubles, we need to invent a new operation: <code>MulWithError</code><span class="attrs"></span>. This is the multiplication equivalent of the <code>AddWithError</code><span class="attrs"></span> function from the 32-bit solution:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token punctuation">(</span><span class="token keyword">double</span> prod<span class="token punctuation">,</span> <span class="token keyword">double</span> err<span class="token punctuation">)</span> <span class="token function">MulWithError</span><span class="token punctuation">(</span><span class="token keyword">double</span> x<span class="token punctuation">,</span> <span class="token keyword">double</span> y<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token keyword">double</span> prod <span class="token operator">=</span> x <span class="token operator">*</span> y<span class="token punctuation">;</span>
  <span class="token keyword">double</span> err <span class="token operator">=</span> <span class="token comment">// ??? how do we do this</span>
  <span class="token keyword">return</span> <span class="token punctuation">(</span>prod<span class="token punctuation">,</span> err<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre>
<p>We’ll get to how to implement that in a moment, but first we’ll walk through how to use that function to calculate a proper FMAdd.</p>
<p><span class="read-more"></span></p>
<p>We need to do the following:</p>
<ul>
<li>Calculate the product of <code>a</code><span class="attrs"></span> and <code>b</code><span class="attrs"></span> and the error of that product</li>
<li>Calculate the sum of that product and <code>c</code><span class="attrs"></span> (giving us <code>a * b + c</code><span class="attrs"></span>) and the error of this sum
<ul>
<li>We’re not using the error of the product … yet</li>
</ul>
</li>
<li>Add the two error terms (product error and sum error) together, rounding the result to odd</li>
<li>Add this summed error term to our actual result, which will round normally.</li>
</ul>
<p>In code, that looks like this:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token comment">// Start with an "OddRoundedAdd" helper since we do</span>
<span class="token comment">//  this operation frequently</span>
<span class="token keyword">double</span> <span class="token function">OddRoundedAdd</span><span class="token punctuation">(</span><span class="token keyword">double</span> x<span class="token punctuation">,</span> <span class="token keyword">double</span> y<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token punctuation">(</span><span class="token keyword">double</span> sum<span class="token punctuation">,</span> <span class="token keyword">double</span> err<span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token function">AddWithError</span><span class="token punctuation">(</span>x<span class="token punctuation">,</span> y<span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token keyword">return</span> <span class="token function">RoundToOdd</span><span class="token punctuation">(</span>sum<span class="token punctuation">,</span> err<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>

<span class="token keyword">double</span> <span class="token function">FMAdd</span><span class="token punctuation">(</span><span class="token keyword">double</span> a<span class="token punctuation">,</span> <span class="token keyword">double</span> b<span class="token punctuation">,</span> <span class="token keyword">double</span> c<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token punctuation">(</span><span class="token keyword">double</span> ab<span class="token punctuation">,</span> <span class="token keyword">double</span> abErr<span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token function">MulWithError</span><span class="token punctuation">(</span>a<span class="token punctuation">,</span> b<span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token punctuation">(</span><span class="token keyword">double</span> abc<span class="token punctuation">,</span> <span class="token keyword">double</span> abcErr<span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token function">AddWithError</span><span class="token punctuation">(</span>ab<span class="token punctuation">,</span> c<span class="token punctuation">)</span><span class="token punctuation">;</span>

  <span class="token comment">// Odd-round the sum of the two errors before </span>
  <span class="token comment">//  adding it in to the final result.</span>
  <span class="token keyword">double</span> err <span class="token operator">=</span> <span class="token function">OddRoundedAdd</span><span class="token punctuation">(</span>abErr<span class="token punctuation">,</span> abcErr<span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token keyword">return</span> abc <span class="token operator">+</span> err<span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre>
<p>By keeping the error terms from both the product and the sum, we have <strong>kept all of the exact result</strong>. That is, we can assemble the mathematically-exact result given enough precision by doing <code>abc + abErr + abcErr</code><span class="attrs"></span>.</p>
<p>But we can’t do infinite-precision addition of three values. However, we <em>can</em> odd-round an intermediate result, the same way we did with the single-precision case.</p>
<p>In this case, we know that <code>abErr</code><span class="attrs"></span> and <code>abcErr</code><span class="attrs"></span> both (necessarily) have much lower magnitudes than the final result, as each error value’s highest bit is lower than the lowest bit of the mantissa of their respective operations. So, if we odd-round the sum of these two values, it actually effectively fulfills the condition of <strong>having more bits of precision than the final result</strong>. Thus, if we add the error terms together with odd rounding, the odd-rounded fake-sticky final digit will be taken into account by the <em>actual</em> sticky bit used when doing the final sum of the result and error terms.</p>
<p>I hope that makes sense?</p>
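<p>Since <code>AddWithError</code> and <code>RoundToOdd</code> come from Part 1 and aren’t reproduced here, this is a minimal compilable sketch of what they might look like. It’s an assumption on my part rather than the Part 1 code: <code>AddWithError</code> uses Knuth’s branch-free TwoSum, and <code>RoundToOdd</code> nudges an inexact, even-mantissa sum one step toward the exact value. It relies on strict IEEE double arithmetic (so no <code>-ffast-math</code>).</p>

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <utility>

// Knuth's TwoSum: returns the rounded sum plus the exact rounding error,
// so that sum + err equals x + y exactly (barring overflow).
std::pair<double, double> AddWithError(double x, double y)
{
  const double sum = x + y;
  const double yPart = sum - x;     // what y contributed to the rounded sum
  const double xPart = sum - yPart; // what x contributed to the rounded sum
  const double err = (x - xPart) + (y - yPart);
  return {sum, err};
}

// If the sum was inexact and its lowest mantissa bit is even, step one
// representable value toward the exact result (which makes that bit odd).
double RoundToOdd(double sum, double err)
{
  std::uint64_t bits;
  std::memcpy(&bits, &sum, sizeof bits);
  if (std::isfinite(sum) && err != 0.0 && (bits & 1) == 0)
  {
    // The exact value lies on the side of 'sum' that 'err' points to.
    const bool awayFromZero = (err > 0.0) == (sum > 0.0);
    if (awayFromZero)
      ++bits; // next double with larger magnitude
    else
      --bits; // next double with smaller magnitude
    std::memcpy(&sum, &bits, sizeof sum);
  }
  return sum;
}

double OddRoundedAdd(double x, double y)
{
  const std::pair<double, double> r = AddWithError(x, y);
  return RoundToOdd(r.first, r.second);
}
```

<p>For example, <code>OddRoundedAdd(1.0, Pow2(-60))</code> doesn’t return <code>1.0</code> (which is what a normal add would give); it returns the next double above <code>1.0</code>, so the “something was lost down here” information survives in the lowest bit.</p>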
<h3>Breaking Down Multiplication With Error</h3>
<p>(Like <code>AddWithError</code><span class="attrs"></span>, this is based on <a href="https://ir.cwi.nl/pub/9159/9159D.pdf" target="_blank" rel="noopener">A Floating-Point Technique For Extending the Available Precision</a> by T.J. Dekker)</p>
<p>So how do we calculate the error term of a 64-bit multiply? We can’t use 128-bit values, but what we <em>can</em> do is break each 64-bit value up into two values, each with fewer bits of precision.</p>
<p>We’ll break <code>x</code><span class="attrs"></span> and <code>y</code><span class="attrs"></span> (our two multiplicands) up into high and low values, where:</p>
<pre class="language-cpp"><code class="language-cpp">x <span class="token operator">=</span> xh <span class="token operator">+</span> xl<span class="token punctuation">;</span>
y <span class="token operator">=</span> yh <span class="token operator">+</span> yl<span class="token punctuation">;</span></code></pre>
<p>We do this by breaking the double’s mantissa up:</p>
<ul>
<li><code>xh</code><span class="attrs"></span> contains the top 25 bits of <code>x</code><span class="attrs"></span>’s mantissa (plus its implied 1, giving it 26 bits of precision).</li>
<li><code>xl</code><span class="attrs"></span> contains the bottom 27 bits of <code>x</code><span class="attrs"></span>’s mantissa. The highest-set ‘1’ in this mantissa will become the implied <code>1</code><span class="attrs"></span> bit that’s part of the standard floating-point format, so this value will have 27 bits of precision, max.</li>
<li><code>yh</code><span class="attrs"></span> and <code>yl</code><span class="attrs"></span> are the same, but for <code>y</code><span class="attrs"></span>.</li>
</ul>
<pre class="language-cpp"><code class="language-cpp"><span class="token punctuation">(</span><span class="token keyword">double</span> h<span class="token punctuation">,</span> <span class="token keyword">double</span> l<span class="token punctuation">)</span> <span class="token function">Split</span><span class="token punctuation">(</span><span class="token keyword">double</span> v<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token comment">// In C++ this Zero function can be implemented by  masking off</span>
  <span class="token comment">//  the bottom 27 bits by casting to a 64-bit int:</span>
  <span class="token comment">// constexpr uint64_t Mask = ~0x07ff'ffff;</span>
  <span class="token comment">// double h = std::bit_cast&lt;double>(</span>
  <span class="token comment">//              std::bit_cast&lt;uint64_t>(v) &amp; Mask);</span>
  <span class="token keyword">double</span> h <span class="token operator">=</span> <span class="token function">ZeroBottom27BitsOfMantissa</span><span class="token punctuation">(</span>v<span class="token punctuation">)</span><span class="token punctuation">;</span>

  <span class="token comment">// We can get the lower bits of the mantissa (correctly normalized and</span>
  <span class="token comment">//  with correct signs) by subtracting the extracted upper bits from </span>
  <span class="token comment">//  the original value.</span>
  <span class="token keyword">double</span> l <span class="token operator">=</span> v <span class="token operator">-</span> h<span class="token punctuation">;</span>
  <span class="token keyword">return</span> <span class="token punctuation">(</span>h<span class="token punctuation">,</span> l<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre>
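<p>For reference, here’s a compilable version of the same split, as a sketch. It uses <code>std::memcpy</code> for the bit reinterpretation so it builds before C++20 (<code>std::bit_cast</code>, as in the comment above, does the same job), and returns a <code>std::pair</code> in place of the tuple syntax:</p>

```cpp
#include <cstdint>
#include <cstring>
#include <utility>

// Returns {h, l} where h keeps the top 25 stored mantissa bits of v
// (plus sign and exponent) and l is the exact remainder, so h + l == v.
std::pair<double, double> Split(double v)
{
  std::uint64_t bits;
  std::memcpy(&bits, &v, sizeof bits);

  // Clear the bottom 27 bits of the 52-bit stored mantissa.
  bits &= ~std::uint64_t{0x07ffffff};

  double h;
  std::memcpy(&h, &bits, sizeof h);

  // h shares v's sign, exponent, and upper bits, so this subtraction
  // is exact for finite v.
  const double l = v - h;
  return {h, l};
}
```

<p>For example, splitting <code>1.0 + Pow2(-30)</code> gives <code>h == 1.0</code> and <code>l == Pow2(-30)</code>, because that low bit falls inside the 27 bits that were masked off.</p>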
<p>What does this split give us? Well, we can now break the multiplication up into a sum of multiplies that each have enough bits of precision to be exactly representable (<code>27BitValueA * 27BitValueB == 54BitValue</code><span class="attrs"></span>, which fits perfectly in a double (with the implied <code>1</code><span class="attrs"></span> bit)), using our old friend from Algebra, <a href="https://en.wikipedia.org/wiki/FOIL_method" target="_blank" rel="noopener">FOIL</a>:</p>
<pre class="language-cpp"><code class="language-cpp">    x <span class="token operator">*</span> y 
<span class="token operator">=</span> <span class="token punctuation">(</span>xh <span class="token operator">+</span> xl<span class="token punctuation">)</span> <span class="token operator">*</span> <span class="token punctuation">(</span>yh <span class="token operator">+</span> yl<span class="token punctuation">)</span> 
<span class="token operator">=</span> xh<span class="token operator">*</span>yh <span class="token operator">+</span> xh<span class="token operator">*</span>yl <span class="token operator">+</span> xl<span class="token operator">*</span>yh <span class="token operator">+</span> xl<span class="token operator">*</span>yl<span class="token punctuation">;</span></code></pre>
<p>We can’t actually do those adds directly, but what we can do is similar to how we did <code>AddWithError</code><span class="attrs"></span>: use a sequence of precision-preserving operations to calculate the difference between that idealized result and our rounded product:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token punctuation">(</span><span class="token keyword">double</span> prod<span class="token punctuation">,</span> <span class="token keyword">double</span> err<span class="token punctuation">)</span> <span class="token function">MulWithError</span><span class="token punctuation">(</span><span class="token keyword">double</span> x<span class="token punctuation">,</span> <span class="token keyword">double</span> y<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token keyword">double</span> prod <span class="token operator">=</span> x <span class="token operator">*</span> y<span class="token punctuation">;</span>

  <span class="token punctuation">(</span>xh<span class="token punctuation">,</span> xl<span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token function">Split</span><span class="token punctuation">(</span>x<span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token punctuation">(</span>yh<span class="token punctuation">,</span> yl<span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token function">Split</span><span class="token punctuation">(</span>y<span class="token punctuation">)</span><span class="token punctuation">;</span>

  <span class="token comment">// Parentheses to demonstrate the precise order these </span>
  <span class="token comment">//  operations must occur in</span>
  <span class="token keyword">double</span> err <span class="token operator">=</span> <span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token punctuation">(</span>xh<span class="token operator">*</span>yh <span class="token operator">-</span> prod<span class="token punctuation">)</span> <span class="token operator">+</span> xh<span class="token operator">*</span>yl<span class="token punctuation">)</span> <span class="token operator">+</span> xl<span class="token operator">*</span>yh<span class="token punctuation">)</span> <span class="token operator">+</span> xl<span class="token operator">*</span>yl<span class="token punctuation">;</span>
  <span class="token keyword">return</span> <span class="token punctuation">(</span>prod<span class="token punctuation">,</span> err<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre>
<p>It works like this:</p>
<ul>
<li>Calculate the (rounded) product of <code>x</code><span class="attrs"></span> and <code>y</code><span class="attrs"></span></li>
<li>Subtract that rounded product from the product of <code>xh</code><span class="attrs"></span> and <code>yh</code><span class="attrs"></span>
<ul>
<li>These should have roughly the same magnitude (and definitely the same sign) so this is a precision-preserving subtraction.</li>
<li>Since <code>|xh * yh| &lt;= rounded(|x * y|)</code><span class="attrs"></span> (because <code>xh</code><span class="attrs"></span> and <code>yh</code><span class="attrs"></span> are truncated versions of <code>x</code><span class="attrs"></span> and <code>y</code><span class="attrs"></span> and thus have lower magnitudes) this is a <code>smaller - larger</code><span class="attrs"></span> operation and we’ll get a result with a sign opposite that of the final product.</li>
</ul>
</li>
<li>Keep adding in next-lower-magnitudes of values, which will continue to preserve precision
<ul>
<li>(because we have a value that is opposite-sign these are effectively subtractions, in the same way that <code>a + -b</code><span class="attrs"></span> is)</li>
<li>It’s also worth noting here that <code>xh*yl</code><span class="attrs"></span> and <code>xl*yh</code><span class="attrs"></span> will have equivalent magnitudes so the order that you add them in doesn’t matter, as long as they’re both after <code>xh*yh</code><span class="attrs"></span> and before <code>xl*yl</code><span class="attrs"></span></li>
</ul>
</li>
</ul>
<p>Once you’ve done that, you have the computed product as well as the error term, and we can then follow our FMAdd algorithm above to calculate the FMAdd.</p>
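<p>Putting <code>Split</code> and <code>MulWithError</code> together as compilable C++ might look like the sketch below (again with <code>std::pair</code> standing in for the tuple syntax, and assuming strict IEEE arithmetic with no overflow or underflow in the intermediates). <code>ReferenceMulError</code> is a hypothetical test-only helper: on a platform where <code>std::fma</code> is available, <code>fma(x, y, -prod)</code> gives the same error term directly, which makes it handy for checking the emulation.</p>

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <utility>

// Same split as before: h keeps the upper mantissa bits, l is the remainder.
std::pair<double, double> Split(double v)
{
  std::uint64_t bits;
  std::memcpy(&bits, &v, sizeof bits);
  bits &= ~std::uint64_t{0x07ffffff}; // clear the bottom 27 mantissa bits
  double h;
  std::memcpy(&h, &bits, sizeof h);
  return {h, v - h};
}

// Returns {prod, err} where prod is the rounded product and err is the
// part of x * y that rounding threw away.
std::pair<double, double> MulWithError(double x, double y)
{
  const double prod = x * y;

  const std::pair<double, double> xs = Split(x);
  const std::pair<double, double> ys = Split(y);
  const double xh = xs.first, xl = xs.second;
  const double yh = ys.first, yl = ys.second;

  // Largest-magnitude term first, then progressively smaller ones.
  const double err = (((xh * yh - prod) + xh * yl) + xl * yh) + xl * yl;
  return {prod, err};
}

// Test-only cross-check (hypothetical helper): a real fused multiply-add
// computes x * y - prod with a single rounding.
double ReferenceMulError(double x, double y, double prod)
{
  return std::fma(x, y, -prod);
}
```

<p>As a quick check, squaring <code>1 + Pow2(-30)</code> gives an exact value of <code>1 + Pow2(-29) + Pow2(-60)</code>; the rounded product keeps the first two terms and <code>err</code> comes back as <code>Pow2(-60)</code>.</p>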
<p>So, that’s it, we’re done, right?</p>
<h3>Edge Cases</h3>
<p>Nope! Well, yes if you just wanted the gist, but now it’s time to get into all those annoying implementation details that the papers this is based on completely glossed over. Here’s where it gets ugly (unless you thought it was already ugly, in which case, sorry, it’s about to get <em>worse</em> somehow).</p>
<p>In our single-precision case, <strong>we didn’t have to worry about exponent overflow or underflow</strong> because we were using double-precision intermediates, which not only have additional mantissa range, but also additional <em>exponent</em> range.</p>
<p>It’s possible that the product of <code>a * b</code><span class="attrs"></span> (an intermediate value in our calculation) goes <strong>out of range</strong> of what a double can represent, but that the addition of <code>c</code><span class="attrs"></span> might bring the final result back into range (which can happen when the sign of <code>c</code><span class="attrs"></span> is opposite the sign of <code>a * b</code><span class="attrs"></span>). This causes a different set of errors on either end:</p>
<ul>
<li>If <code>a*b</code><span class="attrs"></span> is too large to be represented, it turns into <em>infinity</em> which means that adding <code>c</code><span class="attrs"></span> in will just leave it as infinity even though the final result should have been a representable value (albeit one with a very large magnitude)</li>
<li>If <code>a*b</code><span class="attrs"></span> is too small to be represented, it will go <a href="https://en.wikipedia.org/wiki/Subnormal_number" target="_blank" rel="noopener">subnormal</a> which means bits of the intermediate result will slide off the bottom of the mantissa and we lose bits of information, which can cause us to round incorrectly at our final result.</li>
</ul>
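<p>To make the overflow case concrete, here’s a tiny sketch of an unfused multiply-add (<code>UnfusedMulAdd</code> is a hypothetical name). The <code>volatile</code> is only there to keep the compiler from quietly fusing the two operations into a hardware FMA:</p>

```cpp
#include <cmath>

// Unfused multiply-add: the product is rounded to a double (and may
// overflow to infinity) before c is added.
double UnfusedMulAdd(double a, double b, double c)
{
  volatile double product = a * b; // force the rounded intermediate
  return product + c;
}
```

<p>With <code>a = Pow2(1000)</code>, <code>b = Pow2(24)</code>, and <code>c = -Pow2(1023)</code>, the exact answer is <code>Pow2(1024) - Pow2(1023) = Pow2(1023)</code>, which is representable and is what a real <code>std::fma</code> returns. The unfused version returns infinity, because <code>Pow2(1024)</code> already overflowed before <code>c</code> had a chance to pull it back.</p>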
<p>To solve this, we’ll introduce a bias into the calculation, for when the value goes very small or very large:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token keyword">double</span> <span class="token function">CalculateFMAddBias</span><span class="token punctuation">(</span><span class="token keyword">double</span> a<span class="token punctuation">,</span> <span class="token keyword">double</span> b<span class="token punctuation">,</span> <span class="token keyword">double</span> c<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token comment">// Calculate what our final result would be if we just did it normally</span>
  <span class="token keyword">double</span> testResult <span class="token operator">=</span> <span class="token function">Abs</span><span class="token punctuation">(</span>a <span class="token operator">*</span> b <span class="token operator">+</span> c<span class="token punctuation">)</span><span class="token punctuation">;</span>

  <span class="token keyword">if</span> <span class="token punctuation">(</span>testResult <span class="token operator">&lt;</span> <span class="token function">Pow2</span><span class="token punctuation">(</span><span class="token operator">-</span><span class="token number">500</span><span class="token punctuation">)</span> <span class="token operator">&amp;&amp;</span> <span class="token function">Max</span><span class="token punctuation">(</span>a<span class="token punctuation">,</span> b<span class="token punctuation">,</span> c<span class="token punctuation">)</span> <span class="token operator">&lt;</span> <span class="token function">Pow2</span><span class="token punctuation">(</span><span class="token number">800</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
  <span class="token punctuation">{</span>
    <span class="token comment">// Our result is very small and our maximum value is not so large </span>
    <span class="token comment">//  that we'll blow up with a bias, so bias our values up to</span>
    <span class="token comment">//  ensure we don't go subnormal in our intermediate result</span>
    <span class="token keyword">return</span> <span class="token function">Pow2</span><span class="token punctuation">(</span><span class="token number">110</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token punctuation">}</span>
  <span class="token keyword">else</span> <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token function">IsInfinite</span><span class="token punctuation">(</span>testResult<span class="token punctuation">)</span><span class="token punctuation">)</span>
  <span class="token punctuation">{</span>
    <span class="token comment">// We hit infinity, but that might be due to exponent overflow,</span>
    <span class="token comment">//  so bias everything down (this may cause c to go subnormal, </span>
    <span class="token comment">//  but if that's the case then a*b on its own is infinity and</span>
    <span class="token comment">//  so it won't affect the final result in any way)</span>
    <span class="token keyword">return</span> <span class="token function">Pow2</span><span class="token punctuation">(</span><span class="token operator">-</span><span class="token number">55</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token punctuation">}</span>
  <span class="token keyword">else</span>
  <span class="token punctuation">{</span>
    <span class="token comment">// No bias needed</span>
    <span class="token keyword">return</span> <span class="token number">1.0</span><span class="token punctuation">;</span>
  <span class="token punctuation">}</span>
<span class="token punctuation">}</span></code></pre>
<p>For any results that aren’t extreme, the bias will remain <code>1.0</code><span class="attrs"></span> but, for values at the extremes, we’ll scale our intermediates up or down (using powers of 2, which only affect the exponent and not the mantissa) into a range such that we can’t temporarily poke outside of range. Also note that my powers of 2 are not perfectly chosen: I didn’t bother trying to figure out the exact right biases/thresholds, so I just picked ones that I knew were good enough.</p>
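<p>The “powers of 2 only affect the exponent” claim is easy to check with <code>std::frexp</code>, which separates a double into its mantissa and exponent. This is just a sketch: <code>Pow2</code> here is one possible implementation via <code>std::ldexp</code>, and <code>ScalingKeepsMantissa</code> is a hypothetical helper:</p>

```cpp
#include <cmath>

// One possible Pow2: 1.0 * 2^exponent, computed exactly.
double Pow2(int exponent)
{
  return std::ldexp(1.0, exponent);
}

// True if multiplying by 2^exponent changed only the exponent of 'value'
// (that is, no mantissa bits were lost to overflow or subnormal rounding).
bool ScalingKeepsMantissa(double value, int exponent)
{
  int before = 0;
  int after = 0;
  const double mantissaBefore = std::frexp(value, &before);
  const double mantissaAfter = std::frexp(value * Pow2(exponent), &after);
  return mantissaBefore == mantissaAfter && after == before + exponent;
}
```

<p>This holds for ordinary values with either bias (<code>Pow2(110)</code> or <code>Pow2(-55)</code>), but it stops holding once the scaled value lands in the subnormal range, which is exactly why the biases have to be picked so the intermediates stay normal.</p>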
<p>So then we do our FMAdd calculation as before, but with the bias introduced (and then backed out at the end):</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token comment">// Do our multiplication with the bias applied to 'a'</span>
<span class="token comment">//  (the choice of applying it to 'a' vs 'b' is completely</span>
<span class="token comment">//  arbitrary)</span>
<span class="token punctuation">(</span><span class="token keyword">double</span> ab<span class="token punctuation">,</span> <span class="token keyword">double</span> abErr<span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token function">MulWithError</span><span class="token punctuation">(</span>a <span class="token operator">*</span> bias<span class="token punctuation">,</span> b<span class="token punctuation">)</span><span class="token punctuation">;</span>

<span class="token comment">// Then the sum with the bias applied to 'c'</span>
<span class="token punctuation">(</span><span class="token keyword">double</span> abc<span class="token punctuation">,</span> <span class="token keyword">double</span> abcErr<span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token function">AddWithError</span><span class="token punctuation">(</span>ab<span class="token punctuation">,</span> c <span class="token operator">*</span> bias<span class="token punctuation">)</span><span class="token punctuation">;</span>

err <span class="token operator">=</span> <span class="token function">OddRoundedAdd</span><span class="token punctuation">(</span>abErr<span class="token punctuation">,</span> abcErr<span class="token punctuation">)</span><span class="token punctuation">;</span>

<span class="token comment">// Calculate our final result then un-bias the result.</span>
<span class="token keyword">return</span> <span class="token punctuation">(</span>abc <span class="token operator">+</span> err<span class="token punctuation">)</span> <span class="token operator">/</span> bias<span class="token punctuation">;</span></code></pre>
<p>Alright, we’ve avoided both overflow and underflow and everything is great, right?</p>
<h3>Two (Point Five) Last Annoying Implementation Details</h3>
<p>Nope, sorry again! It turns out there are still two cases we need to deal with.</p>
<h4>Case 1: Infinity or NaN even with the bias</h4>
<p>If our result (without error applied) hits infinity even with the avoid-infinity bound, then we should just go ahead and return now to avoid Causing Problems Later (that is, turning what should be infinity <em>into</em> a NaN). And if it’s already NaN we can just return now because it’s going to be NaN forever.</p>
<p>Except, there’s one additional necessary check here (for a case caught by <a href="https://bsky.app/profile/dzaima.bsky.social" target="_blank" rel="noopener">dzaima over on Bluesky</a>): in the event <code>a</code><span class="attrs"></span> and <code>b</code><span class="attrs"></span> are finite numbers but <code>a * b</code><span class="attrs"></span> blows up to infinity, and then <code>c</code><span class="attrs"></span> is the <em>opposite</em> infinity, the correct return value is whichever infinity (positive or negative) <code>c</code><span class="attrs"></span> is, so our early-out check has to catch that case as well:</p>
<pre class="language-cpp"><code class="language-cpp">  <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token function">IsInfiniteOrNaN</span><span class="token punctuation">(</span>abc<span class="token punctuation">)</span><span class="token punctuation">)</span>
  <span class="token punctuation">{</span> 
    <span class="token comment">// If we got NaN (or Inf, which won't affect the output) and </span>
    <span class="token comment">//  a and b are both finite but c is infinite, return c (without</span>
    <span class="token comment">//  this check, we will incorrectly return NaN instead of -Inf</span>
    <span class="token comment">//  for FMAdd(1e200, 1e200, -Infinity))</span>
    <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token function">IsInfinite</span><span class="token punctuation">(</span>c<span class="token punctuation">)</span> <span class="token operator">&amp;&amp;</span> <span class="token operator">!</span><span class="token function">IsInfiniteOrNaN</span><span class="token punctuation">(</span>a<span class="token punctuation">)</span> <span class="token operator">&amp;&amp;</span> <span class="token operator">!</span><span class="token function">IsInfiniteOrNaN</span><span class="token punctuation">(</span>b<span class="token punctuation">)</span><span class="token punctuation">)</span>
      <span class="token punctuation">{</span> <span class="token keyword">return</span> c<span class="token punctuation">;</span> <span class="token punctuation">}</span>

    <span class="token comment">// Otherwise, return whichever Inf or NaN we got directly;</span>
    <span class="token keyword">return</span> abc<span class="token punctuation">;</span>
  <span class="token punctuation">}</span></code></pre>
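<p>Pulled out into a standalone function, that early-out might look like the sketch below. <code>EarlyOutResult</code> is a hypothetical name, <code>IsInfiniteOrNaN</code> is implemented here with <code>std::isfinite</code>, and the caller is assumed to have already seen that <code>abc</code> is infinite or NaN:</p>

```cpp
#include <cmath>

bool IsInfiniteOrNaN(double v)
{
  return !std::isfinite(v);
}

// 'abc' is the already-computed (non-finite) a * b + c intermediate.
// Returns the value the FMAdd should produce in that situation.
double EarlyOutResult(double a, double b, double c, double abc)
{
  // Finite a and b with an infinite c: the exact product is finite, so
  // the true result is just c, even if the rounded product overflowed
  // to the opposite infinity and turned the sum into NaN.
  if (std::isinf(c) && !IsInfiniteOrNaN(a) && !IsInfiniteOrNaN(b))
    return c;

  // Otherwise the Inf or NaN we already have is the answer.
  return abc;
}
```

<p>For <code>FMAdd(1e200, 1e200, -Infinity)</code> the rounded product is <code>+Infinity</code>, and <code>+Infinity + -Infinity</code> is NaN, so <code>abc</code> arrives as NaN; the check turns that back into <code>-Infinity</code>, which matches what <code>std::fma</code> returns for the same inputs.</p>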
<h4>Case 2: Subnormal Results</h4>
<p>If our result is subnormal (after the bias is backed out), then it’s going to lose bits of precision as it shifts down (because the exponent can’t go any lower so instead the value itself shifts down the mantissa), which means whoops here’s another rounding step, and the dreaded <strong>double-rounding</strong> returns.</p>
<p>In this case we need to actually odd-round the addition of the error term as well, so that when the bias is backed out and it rounds, it does the correct thing:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token comment">// Multiply the smallest-representable normalized value by our avoid-</span>
<span class="token comment">//  subnormal bias. Any (biased) value below this will go subnormal.</span>
<span class="token comment">//  (In production code it'd be nicer to use something like</span>
<span class="token comment">//  std::numeric_limits instead of hard-coding -1022)</span>
<span class="token keyword">const</span> <span class="token keyword">double</span> SubnormThreshold <span class="token operator">=</span> <span class="token function">Pow2</span><span class="token punctuation">(</span><span class="token operator">-</span><span class="token number">1022</span><span class="token punctuation">)</span> <span class="token operator">*</span> AvoidSubnormalBias<span class="token punctuation">;</span>

<span class="token keyword">if</span> <span class="token punctuation">(</span>bias <span class="token operator">==</span> AvoidSubnormalBias <span class="token operator">&amp;&amp;</span> <span class="token function">Abs</span><span class="token punctuation">(</span>abc<span class="token punctuation">)</span> <span class="token operator">&lt;</span> SubnormThreshold<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token comment">// Odd-round the addition of the error so that the rounding that </span>
  <span class="token comment">//  happens on the divide by the bias is correct.</span>
  <span class="token punctuation">(</span><span class="token keyword">double</span> finalSum<span class="token punctuation">,</span> <span class="token keyword">double</span> finalSumErr<span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token function">AddWithError</span><span class="token punctuation">(</span>abc<span class="token punctuation">,</span> err<span class="token punctuation">)</span><span class="token punctuation">;</span>
  finalSum <span class="token operator">=</span> <span class="token function">RoundToOdd</span><span class="token punctuation">(</span>finalSum<span class="token punctuation">,</span> finalSumErr<span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token keyword">return</span> finalSum <span class="token operator">/</span> bias<span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre>
<p>And this almost works, except there’s one <em>more</em> annoying case, and that’s where <strong>our result is going subnormal, but only by exactly one bit</strong>. Remember that the odd-rounding trick only works if we have <strong>two or more</strong> bits so that the final rounding works properly, but in this case we’re truncating the mantissa by exactly one bit, so we have to do even more work:</p>
<ul>
<li>Split the value that will be shifting down into a high and low part (same as we did for the multiply)</li>
<li>Add our error term to the low part of it
<ul>
<li>This preserves additional bits of the error term since we gave ourselves more headroom by removing the upper half of its mantissa</li>
</ul>
</li>
<li>Remove the bias from both the high and low parts separately
<ul>
<li>Removing the bias from the high part doesn’t round since we know the lowest bit is 0</li>
<li>Removing the bias from the low part applies the actual final rounding (correctly) since we gave ourselves more bits to work with</li>
</ul>
</li>
<li>Sum the halves back together and return that as our final result
<ul>
<li>This sum is (thankfully) perfectly representable by the final precision and doesn’t introduce any additional error.</li>
</ul>
</li>
</ul>
<pre class="language-cpp"><code class="language-cpp"><span class="token keyword">const</span> <span class="token keyword">double</span> OneBitSubnormalThreshold <span class="token operator">=</span> 
  SubnormThreshold <span class="token operator">*</span> <span class="token number">0.5</span><span class="token punctuation">;</span>
<span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token function">Abs</span><span class="token punctuation">(</span>finalSum<span class="token punctuation">)</span> <span class="token operator">>=</span> OneBitSubnormalThreshold<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token comment">// Split into halves</span>
  <span class="token punctuation">(</span>rh<span class="token punctuation">,</span> rl<span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token function">Split</span><span class="token punctuation">(</span>finalSum<span class="token punctuation">)</span><span class="token punctuation">;</span>

  <span class="token comment">// Add the error term into the low part of the split</span>
  rl <span class="token operator">=</span> <span class="token function">OddRoundedAdd</span><span class="token punctuation">(</span>rl<span class="token punctuation">,</span> finalSumErr<span class="token punctuation">)</span><span class="token punctuation">;</span>

  <span class="token comment">// Scale them both down by the bias. Note that </span>
  <span class="token comment">//  the rh division cannot round since the lowest bit</span>
  <span class="token comment">//  is 0</span>
  rh <span class="token operator">/=</span> bias<span class="token punctuation">;</span>

  <span class="token comment">// This division is what actually introduces the final</span>
  <span class="token comment">//  rounding (correctly, since we gave ourselves more</span>
  <span class="token comment">//  bits to work with)</span>
  rl <span class="token operator">/=</span> bias<span class="token punctuation">;</span>

  <span class="token comment">// This sum is perfectly representable by the final</span>
  <span class="token comment">//  precision and will not introduce additional error.</span>
  <span class="token keyword">return</span> rh <span class="token operator">+</span> rl<span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre>
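<p>The split step is the piece everything else here leans on, so here’s a rough compilable sketch of it. The <code>memcpy</code>-based bit masking inside <code>ZeroBottom27BitsOfMantissa</code> and the <code>std::pair</code> return type are assumptions made for this illustration (the pseudocode in this post just returns a tuple), so treat it as one possible way to write it rather than the definitive one:</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <utility>

// Clear the low 27 bits of a double's 52-bit stored mantissa, leaving
//  the sign, exponent, and upper mantissa bits untouched.
//  (Assumed implementation: mask the raw bit pattern.)
double ZeroBottom27BitsOfMantissa(double v)
{
  std::uint64_t bits;
  std::memcpy(&bits, &v, sizeof(bits)); // well-defined way to read the bits
  bits &= ~((std::uint64_t(1) << 27) - 1);
  std::memcpy(&v, &bits, sizeof(bits));
  return v;
}

// Returns {high, low} such that high + low == v exactly.
std::pair<double, double> Split(double v)
{
  assert(std::isfinite(v));
  const double h = ZeroBottom27BitsOfMantissa(v);
  const double l = v - h; // exact: only the low 27 bits are left over
  return {h, l};
}
```

<p>Because the high half always ends in 27 zero bits, there’s plenty of headroom below it, which is what lets the error term get added into the low half without immediately being rounded away.</p>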
<h3>OMG Are We Done Now?</h3>
<p>As far as I’m aware, those are all the details needed to implement a 64-bit double-precision FMAdd. It’s <em>conceptually</em> not that much more complicated than the 32-bit one, but mechanically it’s worse, plus there are those fun extra edge cases to consider.</p>
<p>Here’s the final code:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token punctuation">(</span><span class="token keyword">double</span> h<span class="token punctuation">,</span> <span class="token keyword">double</span> l<span class="token punctuation">)</span> <span class="token function">Split</span><span class="token punctuation">(</span><span class="token keyword">double</span> v<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token keyword">double</span> h <span class="token operator">=</span> <span class="token function">ZeroBottom27BitsOfMantissa</span><span class="token punctuation">(</span>v<span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token keyword">double</span> l <span class="token operator">=</span> v <span class="token operator">-</span> h<span class="token punctuation">;</span>
  <span class="token keyword">return</span> <span class="token punctuation">(</span>h<span class="token punctuation">,</span> l<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>

<span class="token punctuation">(</span><span class="token keyword">double</span> prod<span class="token punctuation">,</span> <span class="token keyword">double</span> err<span class="token punctuation">)</span> <span class="token function">MulWithError</span><span class="token punctuation">(</span>
  <span class="token keyword">double</span> x<span class="token punctuation">,</span> 
  <span class="token keyword">double</span> y<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token keyword">double</span> prod <span class="token operator">=</span> x <span class="token operator">*</span> y<span class="token punctuation">;</span>

  <span class="token punctuation">(</span>xh<span class="token punctuation">,</span> xl<span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token function">Split</span><span class="token punctuation">(</span>x<span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token punctuation">(</span>yh<span class="token punctuation">,</span> yl<span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token function">Split</span><span class="token punctuation">(</span>y<span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token keyword">double</span> err <span class="token operator">=</span> 
    <span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token punctuation">(</span>xh<span class="token operator">*</span>yh <span class="token operator">-</span> prod<span class="token punctuation">)</span> <span class="token operator">+</span> xh<span class="token operator">*</span>yl<span class="token punctuation">)</span> <span class="token operator">+</span> xl<span class="token operator">*</span>yh<span class="token punctuation">)</span> <span class="token operator">+</span> xl<span class="token operator">*</span>yl<span class="token punctuation">;</span>
  <span class="token keyword">return</span> <span class="token punctuation">(</span>prod<span class="token punctuation">,</span> err<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>

<span class="token keyword">double</span> <span class="token function">OddRoundedAdd</span><span class="token punctuation">(</span><span class="token keyword">double</span> x<span class="token punctuation">,</span> <span class="token keyword">double</span> y<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token punctuation">(</span><span class="token keyword">double</span> sum<span class="token punctuation">,</span> <span class="token keyword">double</span> err<span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token function">AddWithError</span><span class="token punctuation">(</span>x<span class="token punctuation">,</span> y<span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token keyword">return</span> <span class="token function">RoundToOdd</span><span class="token punctuation">(</span>sum<span class="token punctuation">,</span> err<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>

<span class="token keyword">double</span> <span class="token function">FMAdd</span><span class="token punctuation">(</span><span class="token keyword">double</span> a<span class="token punctuation">,</span> <span class="token keyword">double</span> b<span class="token punctuation">,</span> <span class="token keyword">double</span> c<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token keyword">const</span> <span class="token keyword">double</span> AvoidSubnormalBias <span class="token operator">=</span> <span class="token function">Pow2</span><span class="token punctuation">(</span><span class="token number">110</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token keyword">double</span> bias <span class="token operator">=</span> <span class="token number">1.0</span><span class="token punctuation">;</span>
  <span class="token punctuation">{</span>
    <span class="token comment">// Calculate our final result as if done normally</span>
    <span class="token keyword">double</span> testResult <span class="token operator">=</span> <span class="token function">Abs</span><span class="token punctuation">(</span>a <span class="token operator">*</span> b <span class="token operator">+</span> c<span class="token punctuation">)</span><span class="token punctuation">;</span>

    <span class="token comment">// Bias if the result goes too low or too high</span>
    <span class="token keyword">if</span> <span class="token punctuation">(</span>testResult <span class="token operator">&lt;</span> <span class="token function">Pow2</span><span class="token punctuation">(</span><span class="token operator">-</span><span class="token number">500</span><span class="token punctuation">)</span> <span class="token operator">&amp;&amp;</span> <span class="token function">Max</span><span class="token punctuation">(</span>a<span class="token punctuation">,</span> b<span class="token punctuation">,</span> c<span class="token punctuation">)</span> <span class="token operator">&lt;</span> <span class="token function">Pow2</span><span class="token punctuation">(</span><span class="token number">800</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
      <span class="token punctuation">{</span> bias <span class="token operator">=</span> AvoidSubnormalBias<span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">// too low</span>
    <span class="token keyword">else</span> <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token function">IsInfinite</span><span class="token punctuation">(</span>testResult<span class="token punctuation">)</span><span class="token punctuation">)</span>
      <span class="token punctuation">{</span> bias <span class="token operator">=</span> <span class="token function">Pow2</span><span class="token punctuation">(</span><span class="token operator">-</span><span class="token number">55</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">// too high</span>
  <span class="token punctuation">}</span>

  <span class="token comment">// Calculate using our bias</span>
  <span class="token punctuation">(</span><span class="token keyword">double</span> ab<span class="token punctuation">,</span> <span class="token keyword">double</span> abErr<span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token function">MulWithError</span><span class="token punctuation">(</span>a <span class="token operator">*</span> bias<span class="token punctuation">,</span> b<span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token punctuation">(</span><span class="token keyword">double</span> abc<span class="token punctuation">,</span> <span class="token keyword">double</span> abcErr<span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token function">AddWithError</span><span class="token punctuation">(</span>ab<span class="token punctuation">,</span> c <span class="token operator">*</span> bias<span class="token punctuation">)</span><span class="token punctuation">;</span>

  <span class="token comment">// Check for infinity or NaN and return early</span>
  <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token function">IsInfiniteOrNaN</span><span class="token punctuation">(</span>abc<span class="token punctuation">)</span><span class="token punctuation">)</span>
  <span class="token punctuation">{</span> 
    <span class="token comment">// Handle the "a multiply of two finite values hit infinity even</span>
    <span class="token comment">//  *with* the bias, but c is the opposite infinity" case and</span>
    <span class="token comment">//  return the correct result of "c"</span>
    <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token function">IsInfinite</span><span class="token punctuation">(</span>c<span class="token punctuation">)</span> <span class="token operator">&amp;&amp;</span> <span class="token operator">!</span><span class="token function">IsInfiniteOrNaN</span><span class="token punctuation">(</span>a<span class="token punctuation">)</span> <span class="token operator">&amp;&amp;</span> <span class="token operator">!</span><span class="token function">IsInfiniteOrNaN</span><span class="token punctuation">(</span>b<span class="token punctuation">)</span><span class="token punctuation">)</span>
      <span class="token punctuation">{</span> <span class="token keyword">return</span> c<span class="token punctuation">;</span> <span class="token punctuation">}</span>

    <span class="token comment">// Otherwise just return the inf or nan directly</span>
    <span class="token keyword">return</span> abc<span class="token punctuation">;</span> 
  <span class="token punctuation">}</span>

  <span class="token comment">// Odd-round the intermediate error result</span>
  <span class="token keyword">double</span> err <span class="token operator">=</span> <span class="token function">OddRoundedAdd</span><span class="token punctuation">(</span>abErr<span class="token punctuation">,</span> abcErr<span class="token punctuation">)</span><span class="token punctuation">;</span>

  <span class="token comment">// Multiply the smallest-representable normalized value by our avoid-</span>
  <span class="token comment">//  subnormal bias. Any (biased) value below this will go subnormal</span>
  <span class="token keyword">const</span> <span class="token keyword">double</span> SubnormThreshold <span class="token operator">=</span> <span class="token function">Pow2</span><span class="token punctuation">(</span><span class="token operator">-</span><span class="token number">1022</span><span class="token punctuation">)</span> <span class="token operator">*</span> AvoidSubnormalBias<span class="token punctuation">;</span>

  <span class="token keyword">if</span> <span class="token punctuation">(</span>bias <span class="token operator">==</span> AvoidSubnormalBias <span class="token operator">&amp;&amp;</span> <span class="token function">Abs</span><span class="token punctuation">(</span>abc<span class="token punctuation">)</span> <span class="token operator">&lt;</span> SubnormThreshold<span class="token punctuation">)</span>
  <span class="token punctuation">{</span>
    <span class="token punctuation">(</span><span class="token keyword">double</span> finalSum<span class="token punctuation">,</span> <span class="token keyword">double</span> finalSumErr<span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token function">AddWithError</span><span class="token punctuation">(</span>abc<span class="token punctuation">,</span> err<span class="token punctuation">)</span><span class="token punctuation">;</span>

    <span class="token comment">// This is half of SubnormThreshold. Any value between SubnormThreshold</span>
    <span class="token comment">//  and this value will only lose a single bit of precision when</span>
    <span class="token comment">//  the bias is removed, which requires some extra care</span>
    <span class="token keyword">const</span> <span class="token keyword">double</span> OneBitSubnormalThreshold <span class="token operator">=</span> 
      SubnormThreshold <span class="token operator">*</span> <span class="token number">0.5</span><span class="token punctuation">;</span>
    <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token function">Abs</span><span class="token punctuation">(</span>finalSum<span class="token punctuation">)</span> <span class="token operator">>=</span> OneBitSubnormalThreshold<span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
      <span class="token comment">// Split into halves</span>
      <span class="token punctuation">(</span>rh<span class="token punctuation">,</span> rl<span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token function">Split</span><span class="token punctuation">(</span>finalSum<span class="token punctuation">)</span><span class="token punctuation">;</span>

      <span class="token comment">// Add the error term into the LOW part of our split value</span>
      rl <span class="token operator">=</span> <span class="token function">OddRoundedAdd</span><span class="token punctuation">(</span>rl<span class="token punctuation">,</span> finalSumErr<span class="token punctuation">)</span><span class="token punctuation">;</span>

      <span class="token comment">// Divide out the bias from both halves (which will cause rl to</span>
      <span class="token comment">//  round to its final, correctly-rounded value) then sum them </span>
      <span class="token comment">//  together (which is perfectly representable).</span>
      rh <span class="token operator">/=</span> bias<span class="token punctuation">;</span>
      rl <span class="token operator">/=</span> bias<span class="token punctuation">;</span>
      <span class="token keyword">return</span> rh <span class="token operator">+</span> rl<span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    <span class="token keyword">else</span>
    <span class="token punctuation">{</span>
      <span class="token comment">// For more-than-one-bit subnormals, we do an odd-rounded addition of</span>
      <span class="token comment">//  the error term and then divide out the bias, doing full rounding</span>
      <span class="token comment">//  just once.</span>
      finalSum <span class="token operator">=</span> <span class="token function">RoundToOdd</span><span class="token punctuation">(</span>finalSum<span class="token punctuation">,</span> finalSumErr<span class="token punctuation">)</span><span class="token punctuation">;</span>
      <span class="token keyword">return</span> finalSum <span class="token operator">/</span> bias<span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
  <span class="token punctuation">}</span>
  <span class="token keyword">else</span>
  <span class="token punctuation">{</span>
    <span class="token comment">// Not subnormal, so we can calculate our final result normally and un-</span>
    <span class="token comment">//  bias the result.</span>
    <span class="token keyword">return</span> <span class="token punctuation">(</span>abc <span class="token operator">+</span> err<span class="token punctuation">)</span> <span class="token operator">/</span> bias<span class="token punctuation">;</span>
  <span class="token punctuation">}</span>
<span class="token punctuation">}</span></code></pre>
<p>Compare that to the 32-bit version and you can see why this one got its own post:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token keyword">float</span> <span class="token function">FMAdd</span><span class="token punctuation">(</span><span class="token keyword">float</span> a<span class="token punctuation">,</span> <span class="token keyword">float</span> b<span class="token punctuation">,</span> <span class="token keyword">float</span> c<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token keyword">double</span> product <span class="token operator">=</span> <span class="token keyword">double</span><span class="token punctuation">(</span>a<span class="token punctuation">)</span> <span class="token operator">*</span> <span class="token keyword">double</span><span class="token punctuation">(</span>b<span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token punctuation">(</span><span class="token keyword">double</span> sum<span class="token punctuation">,</span> <span class="token keyword">double</span> err<span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token function">AddWithError</span><span class="token punctuation">(</span>product<span class="token punctuation">,</span> c<span class="token punctuation">)</span><span class="token punctuation">;</span>
  sum <span class="token operator">=</span> <span class="token function">RoundToOdd</span><span class="token punctuation">(</span>sum<span class="token punctuation">,</span> err<span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token keyword">return</span> <span class="token keyword">float</span><span class="token punctuation">(</span>sum<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre>
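<p>The 32-bit version is short enough that it’s worth seeing in compilable form. This is only a sketch under some assumptions: <code>AddWithError</code> is filled in as Knuth’s TwoSum, <code>RoundToOdd</code> nudges an even mantissa one step toward the true result whenever the error is non-zero, and the <code>SumAndError</code> struct is a hypothetical stand-in for the tuple returns used in the pseudocode:</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the (sum, err) tuple in the pseudocode
struct SumAndError { double sum; double err; };

// Knuth's TwoSum: sum is the rounded result and err is the exact
//  amount that rounding threw away (x + y == sum + err).
SumAndError AddWithError(double x, double y)
{
  const double sum = x + y;
  const double yVirtual = sum - x;
  const double xVirtual = sum - yVirtual;
  const double err = (x - xVirtual) + (y - yVirtual);
  return {sum, err};
}

double RoundToOdd(double sum, double err)
{
  // Infinities/NaNs have nothing to nudge; an exact sum needs no fix-up
  if (!std::isfinite(sum) || !std::isfinite(err) || err == 0.0)
    { return sum; }

  std::uint64_t bits;
  std::memcpy(&bits, &sum, sizeof(bits));
  if ((bits & 1) == 0)
  {
    // Even mantissa: step to the neighboring (odd) value on the side
    //  where the true result lies.
    assert(sum != 0.0); // an inexact sum can never be zero
    const bool awayFromZero = (err > 0.0) == (sum > 0.0);
    bits = awayFromZero ? bits + 1 : bits - 1;
    std::memcpy(&sum, &bits, sizeof(bits));
  }
  return sum;
}

float FMAdd(float a, float b, float c)
{
  // The product of two floats always fits exactly in a double
  const double product = double(a) * double(b);
  const SumAndError r = AddWithError(product, double(c));
  return float(RoundToOdd(r.sum, r.err)); // the one "real" rounding
}
```

<p>A quick sanity check is to compare it against <code>std::fma</code> for inputs like <code>641.0f</code>, <code>6700417.0f * 2^-56</code>, and <code>1.0f</code>, where the exact result lands just above a float rounding tie and a separate multiply and add comes out one ULP low.</p>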
<p>Hopefully you never have to <em>actually</em> implement this yourself, but if you do? I hope this helps.</p>

        ]]>
      </content:encoded>
    </item>
    <item>
      <title><![CDATA[Emulating the FMAdd Instruction, Part 1: 32-bit Floats]]></title>
      <link>https://www.drilian.com/posts/2024.12.31-emulating-the-fmadd-instruction-part-1-32-bit-floats/</link>
      <pubDate>Tue, 31 Dec 2024 21:51:24 PST</pubDate>
      <dc:creator><![CDATA[Josh Jersild]]></dc:creator>
      <guid>https://www.drilian.com/posts/2024.12.31-emulating-the-fmadd-instruction-part-1-32-bit-floats/</guid>
      <content:encoded>
        <![CDATA[
          <p>A thing that I had to do at work is write an emulation of the FMAdd (fused multiply-add) instruction for hardware where it wasn’t natively supported (specifically I was writing a SIMD implementation, but the idea is the same), and so I thought I’d share a little bit about how FMAdd works, since I’ve already been posting about <a href="https://www.drilian.com/posts/2023.01.10-floating-point-numbers-and-rounding/">how float rounding works</a>.</p>
<p>So, screw it, here we go with another unnecessarily technical, mathy post!</p>
<h3>What is the FMAdd Instruction?</h3>
<p>A <strong>fused multiply-add</strong> is basically a multiply and an add done as a single operation, and it gives you the result as if it were computed with infinite precision and then rounded just once, at the very end. FMAdd computes <code>(a * b) + c</code><span class="attrs"></span> without introducing any intermediate floating-point error:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token keyword">float</span> <span class="token function">FMAdd</span><span class="token punctuation">(</span><span class="token keyword">float</span> a<span class="token punctuation">,</span> <span class="token keyword">float</span> b<span class="token punctuation">,</span> <span class="token keyword">float</span> c<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token comment">// ??? Somehow do this with no intermediate rounding</span>
  <span class="token keyword">return</span> <span class="token punctuation">(</span>a <span class="token operator">*</span> b<span class="token punctuation">)</span> <span class="token operator">+</span> c<span class="token punctuation">;</span> 
<span class="token punctuation">}</span></code></pre>
<p>Computing it normally (using the code above) for some values will get you <strong>double rounding</strong> (explained in a moment) which means you might be an extra bit off (or, more formally, <a href="https://en.wikipedia.org/wiki/Unit_in_the_last_place" target="_blank" rel="noopener">one ULP</a>) from where your actual result should be. An extra bit doesn’t <em>sound</em> like a lot, but it can add up over many operations.</p>
<p>Fused multiply-add avoids this extra rounding, making it more accurate than a multiply followed by a separate add, which is great! (It can also be faster if it’s supported by hardware but, as you’ll see, computing it without a dedicated instruction on the CPU is actually surprisingly spendy, especially once you get into doing it for 64-bit floats, but sometimes you need precision instead of performance).</p>
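<p>If you’d like to see that one-ULP difference for yourself, here’s a small sketch that pits a separate multiply and add against the standard library’s <code>std::fma</code>. The function names and the specific inputs are picked just for this illustration (641 × 6700417 happens to equal 2<sup>32</sup> + 1, which puts the exact result a hair above a rounding tie), and it assumes the default round-to-nearest mode:</p>

```cpp
#include <cassert>
#include <cmath>

// Multiply, round to float, then add: two separate roundings.
float NaiveMulAdd(float a, float b, float c)
{
  // volatile keeps the compiler from quietly fusing the multiply and
  //  add into a single FMA instruction
  volatile float product = a * b;
  return product + c;
}

// The standard library's fused version: one rounding at the very end.
float FusedMulAdd(float a, float b, float c)
{
  return std::fma(a, b, c);
}

// Worked example: the exact answer is 1 + 2^-24 + 2^-56, which is just
//  above the halfway point between 1.0 and the next float up.
void CheckOneUlpExample()
{
  const float a = 641.0f;
  const float b = std::ldexp(6700417.0f, -56); // 6700417 * 2^-56
  const float c = 1.0f;

  // The product rounds to exactly 2^-24, so the add sees a perfect
  //  tie and rounds to even (1.0)
  assert(NaiveMulAdd(a, b, c) == 1.0f);

  // The fused version sees the extra 2^-56 and correctly rounds up
  assert(FusedMulAdd(a, b, c) == std::nextafter(1.0f, 2.0f));
}
```

<p>The naive version isn’t wildly wrong, it’s just off by that single last bit, which is exactly the kind of error that quietly piles up over many operations.</p>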
<p><span class="read-more"></span></p>
<h3>Double Rounding</h3>
<p>Double rounding happens when the intermediate value rounds down (or up), then the final result <em>also</em> rounds in the same direction - but because of the first rounding, actually overshoots the correctly-rounded final value by a bit.</p>
<p>Here’s an example using two successive sums of some 4-bit float values. We’ll do the following sum (in top-down order):</p>
<pre class="language-cpp"><code class="language-cpp">  <span class="token number">1.000</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">4</span>
<span class="token operator">+</span> <span class="token number">1.001</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">0</span>
<span class="token operator">+</span> <span class="token number">1.100</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">0</span></code></pre>
<p>The first sum, done with “infinite” internal precision, looks like this:</p>
<pre class="language-cpp"><code class="language-cpp">  <span class="token number">1.000</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">4</span>
<span class="token operator">+</span> <span class="token number">1.001</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">0</span>
  <span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span>
  <span class="token number">1.000</span> <span class="token number">0000</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">4</span>
<span class="token operator">+</span> <span class="token number">0.000</span> <span class="token number">1001</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">4</span>
  <span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span>
  <span class="token number">1.000</span> <span class="token number">1001</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">4</span></code></pre>
<p>If we were to then use that result directly (with no intermediate rounding) and do the second sum, only rounding the final result:</p>
<pre class="language-cpp"><code class="language-cpp">   <span class="token number">1.000</span> <span class="token number">1001</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">4</span>
 <span class="token operator">+</span> <span class="token number">0.000</span> <span class="token number">1100</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">4</span>
   <span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span>
 <span class="token operator">=</span> <span class="token number">1.001</span> <span class="token number">0101</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">4</span>
<span class="token operator">-></span> <span class="token number">1.001</span>      <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">4</span> <span class="token comment">// Rounds down</span></code></pre>
<p>The final result rounds (to nearest) to <code>1.001</code><span class="attrs"></span>.</p>
<p>However, if we were to round that intermediate value to 4 bits <em>first</em>, we’d get this:</p>
<pre class="language-cpp"><code class="language-cpp">   <span class="token number">1.000</span> <span class="token number">1001</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">4</span>
<span class="token operator">-></span> <span class="token number">1.001</span>      <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">4</span> <span class="token comment">// Rounded up</span>
 <span class="token operator">+</span> <span class="token number">0.000</span> <span class="token number">1100</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">4</span>
   <span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span>
   <span class="token number">1.001</span> <span class="token number">1100</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">4</span>
<span class="token operator">-></span> <span class="token number">1.010</span>      <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">4</span> <span class="token comment">// Up again</span></code></pre>
<p>In this one, we end up with <code>1.010</code><span class="attrs"></span> instead of <code>1.001</code><span class="attrs"></span> because of the intermediate rounding, which pushed us past the correctly-rounded final result.</p>
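<p>The two hand-worked sums above can be replayed with plain integers. This is only a toy sketch: it assumes the significands are stored as integers with 7 fractional bits (the shared <code>2^4</code> exponent is dropped), and <code>RoundNearestEven</code>, <code>SumRoundedOnce</code> and <code>SumRoundedTwice</code> are hypothetical helper names, not anything from a real library.</p>

```cpp
#include <cassert>
#include <cstdint>

// Rounds to nearest (ties to even) while discarding the low `dropBits` bits.
std::uint32_t RoundNearestEven(std::uint32_t value, int dropBits)
{
  assert(dropBits >= 1 && dropBits < 32);
  const std::uint32_t kept = value >> dropBits;
  const std::uint32_t remainder = value & ((1u << dropBits) - 1u);
  const std::uint32_t half = 1u << (dropBits - 1);
  const bool roundUp =
    remainder > half || (remainder == half && (kept & 1u) != 0);
  return roundUp ? kept + 1u : kept;
}

// 1.000 0000 + 0.000 1001 + 0.000 1100, rounded to 4 bits once at the end.
std::uint32_t SumRoundedOnce()
{
  return RoundNearestEven(0b10000000u + 0b0001001u + 0b0001100u, 4);
}

// Same sum, but the first partial sum is rounded to 4 bits before the
// second add (and then the result is rounded again).
std::uint32_t SumRoundedTwice()
{
  const std::uint32_t first = RoundNearestEven(0b10000000u + 0b0001001u, 4);
  return RoundNearestEven((first << 4) + 0b0001100u, 4);
}
```

<p>Under those assumptions <code>SumRoundedOnce()</code> gives <code>0b1001</code> (that is, <code>1.001</code>) and <code>SumRoundedTwice()</code> gives <code>0b1010</code> (<code>1.010</code>), matching the two results worked out by hand.</p>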
<h3>How to Pretend That You Have Infinite Precision</h3>
<p>Okay, for FMAdd we want to calculate a multiply, and then <em>somehow</em> throw an add in there and have it act as if we didn’t lose any precision on the multiply.</p>
<p>First we’re going to handle the case of 32-bit floats (<strong>singles</strong>) because it’s a <em>wildly</em> simpler case on CPUs that have 64-bit floats (<strong>doubles</strong>).</p>
<p>(also, sorry in advance, the term “double” for a “double-precision float” and the “double” in “double rounding” are two different instances of “double” but I’ve written so much of this post and like hell am I changing it now so hopefully it’s not too confusing)</p>
<p>The immediately obvious thing to try to get an accurate single-precision FMA is “hey, what if we do the multiply and add as doubles and then round the result back down to a single”:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token keyword">float</span> <span class="token function">FMAdd</span><span class="token punctuation">(</span><span class="token keyword">float</span> a<span class="token punctuation">,</span> <span class="token keyword">float</span> b<span class="token punctuation">,</span> <span class="token keyword">float</span> c<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token comment">// Do the math as 64-bit floats and round to 32 bits at the end.</span>
  <span class="token comment">//  Surely that's good enough, right?</span>
  <span class="token keyword">return</span> <span class="token keyword">float</span><span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token keyword">double</span><span class="token punctuation">(</span>a<span class="token punctuation">)</span> <span class="token operator">*</span> <span class="token keyword">double</span><span class="token punctuation">(</span>b<span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">+</span> <span class="token keyword">double</span><span class="token punctuation">(</span>c<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span> 
<span class="token punctuation">}</span></code></pre>
<p>While that gives a much better result than doing it as pure 32-bits, it actually <em>can</em> still have double rounding. But where does the extra rounding come from, in this case?</p>
<p>The multiply itself isn’t the source of the first rounding. Surprisingly (to me, at least), <strong>casting two singles to doubles and multiplying those together <em>always</em> results in an exact answer</strong> - this is because each of the single-precision values has 24 bits of precision, but a double can store 53 bits of precision, which is more than enough to store the result of multiplying two singles (<code>2 * 24</code><span class="attrs"></span> bits of precision max). Since floats are stored as:</p>
<pre class="language-cpp"><code class="language-cpp">sign <span class="token operator">*</span> <span class="token number">1.</span>mantissa <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token punctuation">(</span>exponent<span class="token punctuation">)</span></code></pre>
<p>…it means we’re multiplying two numbers of the form <code>1.xxxxxxxxxx</code><span class="attrs"></span> and <code>1.yyyyyyyyy</code><span class="attrs"></span> together then adding the exponents together to get the new number, so unlike addition and subtraction (where, say, <code>1 + 1*10^60</code><span class="attrs"></span> requires a <em>ton</em> of extra precision), if two float numbers have wildly different exponents it doesn’t actually matter because the exponents and significand values are handled separately.</p>
<p>To illustrate this, let’s pretend we have two 4-digit (base 10) numbers and we multiply them and store the result using 8 digits (double precision):</p>
<pre class="language-cpp"><code class="language-cpp">  <span class="token punctuation">(</span><span class="token number">1.234</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">1</span><span class="token punctuation">)</span> <span class="token operator">*</span> <span class="token punctuation">(</span><span class="token number">1.457</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">100</span><span class="token punctuation">)</span>
<span class="token operator">-></span> <span class="token punctuation">(</span><span class="token number">1.2340000</span> <span class="token operator">*</span> <span class="token number">1.4570000</span><span class="token punctuation">)</span> <span class="token operator">*</span> <span class="token punctuation">(</span><span class="token number">2</span><span class="token operator">^</span><span class="token number">1</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">100</span><span class="token punctuation">)</span>
 <span class="token operator">=</span>  <span class="token number">1.7979380</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">101</span>  <span class="token comment">// no rounding here!</span></code></pre>
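<p>The “single times single is exact in a double” claim can be spot-checked on a real machine. This sketch leans on <code>std::fma</code> (the very thing being emulated) purely as a measuring stick, and <code>ProductIsExactInDouble</code> is a hypothetical helper name:</p>

```cpp
#include <cmath>

// Returns true when double(a) * double(b) lost nothing to rounding.
// std::fma evaluates a*b - product with a single rounding at the end, so it
// can only return exactly zero if `product` already is the true product.
bool ProductIsExactInDouble(float a, float b)
{
  const double product = double(a) * double(b);
  return std::fma(double(a), double(b), -product) == 0.0;
}
```

<p>This should hold for any pair of finite singles, including the largest and the tiniest ones, because the product’s exponent also stays comfortably inside a double’s range.</p>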
<p>Great, so the double-precision multiply is fine and introduces no rounding at all. So then how do we get double rounding?</p>
<p>As mentioned above, an add (or subtract) can introduce rounding:</p>
<pre class="language-cpp"><code class="language-cpp">  <span class="token punctuation">(</span><span class="token number">1.234</span> <span class="token operator">*</span> <span class="token number">10</span><span class="token operator">^</span><span class="token number">0</span><span class="token punctuation">)</span> <span class="token operator">+</span> <span class="token punctuation">(</span><span class="token number">1.457</span> <span class="token operator">*</span> <span class="token number">10</span><span class="token operator">^</span><span class="token number">9</span><span class="token punctuation">)</span>
<span class="token operator">-></span> <span class="token punctuation">(</span><span class="token number">1.2340000</span> <span class="token operator">*</span> <span class="token number">10</span><span class="token operator">^</span><span class="token number">0</span><span class="token punctuation">)</span> <span class="token operator">+</span> <span class="token punctuation">(</span><span class="token number">1.4570000</span> <span class="token operator">*</span> <span class="token number">10</span><span class="token operator">^</span><span class="token number">9</span><span class="token punctuation">)</span>
 <span class="token operator">=</span> <span class="token punctuation">(</span><span class="token number">1.457000001234</span> <span class="token operator">*</span> <span class="token number">10</span><span class="token operator">^</span><span class="token number">9</span><span class="token punctuation">)</span> <span class="token comment">// Too many digits!</span>
<span class="token operator">-></span> <span class="token punctuation">(</span><span class="token number">1.4570000</span> <span class="token operator">*</span> <span class="token number">10</span><span class="token operator">^</span><span class="token number">9</span><span class="token punctuation">)</span> <span class="token comment">// Rounded to nearest here</span></code></pre>
<p>This rounding happens at double precision (so well below the threshold of our target 32-bit result), but there’s still rounding, and then the value is rounded <em>again</em> when converted back down to single precision. <em>That’s</em> the double rounding and the source of a potential error.</p>
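<p>Here is one concrete set of inputs where that second rounding bites. The values are my own construction (not taken from the paper), and <code>NaiveFMAdd</code>, <code>DoubleRoundingCase</code> and <code>MakeDoubleRoundingCase</code> are hypothetical names. The sketch assumes IEEE-754 floats and doubles with the default round-to-nearest mode:</p>

```cpp
#include <cmath>

// The "just do it in doubles" attempt: two roundings (double add, float cast).
float NaiveFMAdd(float a, float b, float c)
{
  return float((double(a) * double(b)) + double(c));
}

struct DoubleRoundingCase { float a, b, c; };

// a * b is exactly 1 - 2^-46, so a * b + c is 16777219 - 2^-46: a hair
// *below* the midpoint of the neighboring floats 16777218 and 16777220.
// The double add rounds that hair away and lands exactly on the midpoint;
// the float cast then breaks the tie toward the even mantissa (16777220).
DoubleRoundingCase MakeDoubleRoundingCase()
{
  return { 0x1.000002p0f,   // 1 + 2^-23
           0x1.fffffcp-1f,  // 1 - 2^-23
           16777218.0f };   // 2^24 + 2
}
```

<p>With these inputs the naive version returns <code>16777220</code>, while a correctly-rounded FMA (for example <code>std::fma</code> on floats) returns <code>16777218</code>.</p>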
<p>Okay, so, double rounding is bad? Kinda! But it turns out there is a way to introduce a <strong>new rounding mode</strong> to use for the first rounding that, in the right situations, does not introduce any additional error and ensures that your final result is correct.</p>
<h3>A New Rounding Mode?</h3>
<p>(This technique is based off of the paper <a href="https://www.lri.fr/~melquion/doc/08-tc.pdf" target="_blank" rel="noopener">Emulation of FMA and correctly-rounded sums: proved algorithms using rounding to odd</a> by Sylvie Boldo and Guillaume Melquiond; if you want the full technical details of this whole process, that’s where you’ll find them. Believe it or not, I’m actually trying to go into less detail!)</p>
<p>The key to eliminating the extra precision loss is a non-standard rounding mode: <strong>rounding to odd</strong>. <a href="https://www.drilian.com/posts/2023.01.10-floating-point-numbers-and-rounding/">Standard floating point rounding</a> calculates results <strong>with some additional bits of precision</strong> (three bits, to be precise), and then rounds based on the result (usually using “round to nearest with round to even on ties”, although that detail doesn’t end up mattering here - this technique works with any standard rounding mode).</p>
<p>So, assume that we have some way of calculating a double precision addition and also having access to the error between the calculated result and the mathematically exact result. Given those two values we can perform a Round To Odd step:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token keyword">double</span> <span class="token function">RoundToOdd</span><span class="token punctuation">(</span><span class="token keyword">double</span> value<span class="token punctuation">,</span> <span class="token keyword">double</span> errorTerm<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token keyword">if</span> <span class="token punctuation">(</span>errorTerm <span class="token operator">!=</span> <span class="token number">0.0</span> <span class="token comment">// if the result is not exact</span>
    <span class="token operator">&amp;&amp;</span> <span class="token function">LowestBitOfMantissa</span><span class="token punctuation">(</span>value<span class="token punctuation">)</span> <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">)</span> <span class="token comment">//and mantissa is even</span>
  <span class="token punctuation">{</span>
    <span class="token comment">// We need to round, so round either up or down to odd</span>
    <span class="token keyword">if</span> <span class="token punctuation">(</span>errorTerm <span class="token operator">></span> <span class="token number">0</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
      <span class="token comment">// Round up to an odd value</span>
      value <span class="token operator">=</span> <span class="token function">AddOneBitToMantissa</span><span class="token punctuation">(</span>value<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    <span class="token keyword">else</span> <span class="token comment">// (errorTerm &lt; 0)</span>
    <span class="token punctuation">{</span>
      <span class="token comment">// Round down to an odd value</span>
      value <span class="token operator">=</span> <span class="token function">SubtractOneBitFromMantissa</span><span class="token punctuation">(</span>value<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
  <span class="token punctuation">}</span>
  <span class="token keyword">return</span> value<span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre>
<p>Basically: if we have any error at all, and the mantissa is currently even, either add or subtract a single bit’s worth of mantissa, based on the sign of the error.</p>
<p>(In practice, I found that I also had to ensure the result was not <em>Infinity</em> before doing this operation, since I implemented this using some bitwise shenanigans that would end up “rounding” Infinity to NaN, so, you know, <strong>watch out for that</strong>).</p>
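<p>For the curious, here’s roughly what those bitwise shenanigans could look like in C++. This is a minimal sketch, not the actual implementation: it assumes IEEE-754 binary64 doubles, the name <code>RoundToOddSketch</code> is hypothetical, and the sign handling (stepping toward or away from zero depending on the signs of the value and the error) is my own reading of “round up” and “round down” for negative values.</p>

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t), "expects 64-bit doubles");

// `value` is a rounded sum and `errorTerm` is (exact sum - value).
double RoundToOddSketch(double value, double errorTerm)
{
  // Exact results need no fixing. Infinity and NaN must be left alone
  // (bumping their bits would turn Infinity into NaN). A zero sum from a
  // round-to-nearest add is always exact, so it is skipped defensively.
  if (errorTerm == 0.0 || !std::isfinite(value) || value == 0.0)
    return value;

  std::uint64_t bits;
  std::memcpy(&bits, &value, sizeof bits);
  if ((bits & 1u) != 0)
    return value; // mantissa is already odd

  // The exact result lies between `value` and its neighbor in the direction
  // of the error. Incrementing the bit pattern grows the magnitude, so step
  // up in magnitude only when the error points away from zero.
  const bool awayFromZero = (errorTerm > 0.0) == (value > 0.0);
  bits = awayFromZero ? bits + 1 : bits - 1;
  std::memcpy(&value, &bits, sizeof bits);
  return value;
}
```

<p>Because the adjustment only ever happens on an even mantissa, the increment can’t carry into the exponent, and the decrement from a power of two correctly lands on the (odd, all-ones) top of the binade below.</p>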
<h3>Why Does Odd-Rounding the Intermediate Value Work?</h3>
<p>Round to odd works <strong>as long as we have more bits of value than the final result</strong> - specifically we need at least two extra bits. Standard float rounding makes use of something called a “sticky” bit - basically the lowest bit of the extra precision is a 1 <strong>if any of the bits below it would have been 1</strong>.</p>
<p>And, hey, that is basically what “round to odd” does!</p>
<ul>
<li>If the mantissa is odd, regardless of whether there’s error or not the lowest bit is already odd.</li>
<li>If the error was positive and the mantissa was even, we set the lower bit to 1 anyway, effectively <em>stickying</em> (yeah that’s a word now) all the error bits below it.</li>
<li>If the error was negative and the mantissa was even, we subtract 1 from the mantissa, making the lower bit odd, and effectively sticky since some of the digits below it are also 1s.</li>
</ul>
<p>Effectively, round to odd is just “emulate having a sticky bit at the bottom of your intermediate result” - that way, you have a guaranteed tiebreaker for the final rounding step.</p>
<p>But note that I said it requires you to have at least two extra bits. In the case of our using-doubles-instead-of-singles intermediate addition, good news: we have <em>way more</em> than two extra bits - our intermediate value is a whole-ass double-precision float, so we have <em>29 extra bits</em> vs. our single-precision final value and (mathematically speaking) 29 is greater than 2.</p>
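<p>The “at least two extra bits” requirement can be brute-forced on small integers. This is a toy model under my own assumptions (integers standing in for significands, 8 bits dropped in total, hypothetical helper names), not a proof:</p>

```cpp
#include <cassert>
#include <cstdint>

// Round to nearest, ties to even, discarding the low `dropBits` bits.
std::uint32_t RoundNearestEvenBits(std::uint32_t value, int dropBits)
{
  assert(dropBits >= 1 && dropBits < 32);
  const std::uint32_t kept = value >> dropBits;
  const std::uint32_t remainder = value & ((1u << dropBits) - 1u);
  const std::uint32_t half = 1u << (dropBits - 1);
  const bool roundUp =
    remainder > half || (remainder == half && (kept & 1u) != 0);
  return roundUp ? kept + 1u : kept;
}

// Round to odd: truncate, then force the lowest kept bit to 1 if any
// nonzero bit was discarded (the "sticky" behavior described above).
std::uint32_t RoundToOddBits(std::uint32_t value, int dropBits)
{
  assert(dropBits >= 1 && dropBits < 32);
  const std::uint32_t kept = value >> dropBits;
  const bool inexact = (value & ((1u << dropBits) - 1u)) != 0;
  return inexact ? (kept | 1u) : kept;
}

// Counts the 14-bit inputs for which "round to odd keeping `extraBits`
// extra bits, then round to nearest" disagrees with rounding once.
int CountMismatches(int extraBits)
{
  const int totalDrop = 8;
  assert(extraBits >= 1 && extraBits < totalDrop);
  int mismatches = 0;
  for (std::uint32_t v = 0; v < (1u << 14); ++v)
  {
    const std::uint32_t direct = RoundNearestEvenBits(v, totalDrop);
    const std::uint32_t odd = RoundToOddBits(v, totalDrop - extraBits);
    if (RoundNearestEvenBits(odd, extraBits) != direct)
      ++mismatches;
  }
  return mismatches;
}
```

<p>In this model <code>CountMismatches(2)</code> and anything larger come out as zero, while <code>CountMismatches(1)</code> does not: with a single extra bit the forced-odd bit can masquerade as an exact tie.</p>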
<p>So, for the true single-precision FMAdd instruction we need to do the following:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token keyword">float</span> <span class="token function">FMAdd</span><span class="token punctuation">(</span><span class="token keyword">float</span> a<span class="token punctuation">,</span> <span class="token keyword">float</span> b<span class="token punctuation">,</span> <span class="token keyword">float</span> c<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token keyword">double</span> product <span class="token operator">=</span> <span class="token keyword">double</span><span class="token punctuation">(</span>a<span class="token punctuation">)</span> <span class="token operator">*</span> <span class="token keyword">double</span><span class="token punctuation">(</span>b<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// No rounding here</span>
  
  <span class="token comment">// Calculate our sum, but somehow get the error along with it</span>
  <span class="token punctuation">(</span><span class="token keyword">double</span> sum<span class="token punctuation">,</span> <span class="token keyword">double</span> err<span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token function">AddWithError</span><span class="token punctuation">(</span>product<span class="token punctuation">,</span> c<span class="token punctuation">)</span><span class="token punctuation">;</span>

  <span class="token comment">// Round our intermediate value to odd</span>
  sum <span class="token operator">=</span> <span class="token function">RoundToOdd</span><span class="token punctuation">(</span>sum<span class="token punctuation">,</span> err<span class="token punctuation">)</span><span class="token punctuation">;</span>

  <span class="token comment">// Final rounding here, which now does the correct thing and gives us</span>
  <span class="token comment">//  a properly-rounded final result (as if we'd used infinite bits)</span>
  <span class="token keyword">return</span> <span class="token keyword">float</span><span class="token punctuation">(</span>sum<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre>
<p>That’s it! …wait, what’s that <code>AddWithError</code><span class="attrs"></span> function, we haven’t even–</p>
<h3>Calculating An Exact Addition Result</h3>
<p>Right, we need to calculate that intermediate addition along with some accurate error term. It turns out it’s possible to calculate a set of numbers, <strong>sum</strong> and <strong>error</strong> where <code>mathematicallyExactSum = sum + error</code><span class="attrs"></span>.</p>
<p>For this, we have to dive back to July of 1971 and check out the <em>actually typewritten</em> paper <a href="https://ir.cwi.nl/pub/9159/9159D.pdf" target="_blank" rel="noopener">A Floating-Point Technique For Extending the Available Precision</a> by T.J. Dekker. Give that a read if you want way more details on this whole thing.</p>
<p>Calculating the error term of adding two numbers (I’ll use <code>x</code><span class="attrs"></span> and <code>y</code><span class="attrs"></span>) is <em>relatively</em> straightforward if <code>|x| &gt;= |y|</code><span class="attrs"></span>:</p>
<pre class="language-cpp"><code class="language-cpp">sum <span class="token operator">=</span> x <span class="token operator">+</span> y<span class="token punctuation">;</span>
err <span class="token operator">=</span> y <span class="token operator">-</span> <span class="token punctuation">(</span>sum <span class="token operator">-</span> x<span class="token punctuation">)</span><span class="token punctuation">;</span></code></pre>
<p>(this is equation 4.14 in the linked paper)</p>
<p>This is just a different ordering of <code>(x + y) - sum</code><span class="attrs"></span> that preserves accuracy: due to the nature of the values involved in these subtractions (<code>sum</code><span class="attrs"></span>’s value is directly related to those of <code>x</code><span class="attrs"></span> and <code>y</code><span class="attrs"></span>, and <code>y</code><span class="attrs"></span> is smaller than <code>x</code><span class="attrs"></span>), it turns out that each of those subtractions is an exact result (the paper has a proof of this, and it’s a <em>lot</em> so I’m not going to expand on that here), so we get the precise difference between the calculated sum and the real sum.</p>
<p>But this only works if you know that <code>x</code><span class="attrs"></span>’s magnitude is larger than (or equal to) <code>y</code><span class="attrs"></span>’s. If you <em>don’t</em> know which of the two values has a larger magnitude, you can do a bit more work and end up with:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token punctuation">(</span><span class="token keyword">double</span> sum<span class="token punctuation">,</span> <span class="token keyword">double</span> err<span class="token punctuation">)</span> <span class="token function">AddWithError</span><span class="token punctuation">(</span>
  <span class="token keyword">double</span> x<span class="token punctuation">,</span> 
  <span class="token keyword">double</span> y<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
  <span class="token keyword">double</span> sum <span class="token operator">=</span> x <span class="token operator">+</span> y<span class="token punctuation">;</span>
  <span class="token keyword">double</span> intermediate <span class="token operator">=</span> sum <span class="token operator">-</span> x<span class="token punctuation">;</span>
  <span class="token keyword">double</span> err1 <span class="token operator">=</span> y <span class="token operator">-</span> intermediate<span class="token punctuation">;</span>
  <span class="token keyword">double</span> err2 <span class="token operator">=</span> x <span class="token operator">-</span> <span class="token punctuation">(</span>sum <span class="token operator">-</span> intermediate<span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token keyword">return</span> <span class="token punctuation">(</span>sum<span class="token punctuation">,</span> err1 <span class="token operator">+</span> err2<span class="token punctuation">)</span><span class="token punctuation">;</span>      
<span class="token punctuation">}</span></code></pre>
<p>(This is effectively the expanded version of listing 4.16 from the linked paper)</p>
<ul>
<li><code>err1</code><span class="attrs"></span> here is the same as the value in the first version we calculated (a precision-preserving rewrite of <code>(x + y) - sum</code><span class="attrs"></span>)</li>
<li><code>err2</code><span class="attrs"></span> is, mathematically, <code>x - (sum - (sum - x))</code><span class="attrs"></span> or <code>0</code><span class="attrs"></span>; its goal is to calculate the error involved in calculating err1, since without the <code>|x| &gt; |y|</code><span class="attrs"></span> guarantee those subtractions might NOT be exact … but these ones will be.</li>
<li>Thus, summing these two error terms together gives us a final, precise error term.</li>
</ul>
<p>(More details in the paper; hopefully this isn’t so glossed over that it loses all meaning)</p>
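<p>Since the listing above uses a made-up tuple syntax, here is a compilable sketch of the same order-independent addition. The <code>SumAndError</code> struct and the <code>AddWithErrorSketch</code> name are hypothetical, and the whole thing assumes the compiler is not allowed to reassociate floating-point math (so no <code>-ffast-math</code> or equivalent):</p>

```cpp
// Rounded sum plus the exact difference (true sum - rounded sum).
struct SumAndError
{
  double sum;
  double err;
};

SumAndError AddWithErrorSketch(double x, double y)
{
  const double sum = x + y;
  const double intermediate = sum - x;        // the part of y that made it into sum
  const double err1 = y - intermediate;       // the part of y that was lost
  const double err2 = x - (sum - intermediate); // the part of x that was lost
  return { sum, err1 + err2 };
}
```

<p>For example, adding <code>1.0</code> and <code>1e-30</code> (in either order) should return a sum of exactly <code>1.0</code> and an error of exactly <code>1e-30</code>.</p>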
<h3>Finally, the End (For Single-Precision Floats)</h3>
<p>So, yeah, that’s how you implement the FMAdd instruction for single-precision floats on a machine that has double-precision support:</p>
<ul>
<li>Calculate the double-precision product of <code>a</code><span class="attrs"></span> and <code>b</code><span class="attrs"></span></li>
<li>Add this product to <code>c</code><span class="attrs"></span> to get a double-precision sum</li>
<li>Calculate the error of the sum</li>
<li>Use the error to odd-round the sum</li>
<li>Round the double-precision sum back down to single precision</li>
</ul>
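<p>Put together, the five steps above can be sketched as one self-contained C++ file. Everything lives in a hypothetical <code>fmadd_sketch</code> namespace and repeats compact versions of the two helpers so it compiles on its own. It assumes IEEE-754 floats and doubles, round-to-nearest mode, and no fast-math style reordering. Treat it as an illustration of the technique rather than a hardened implementation.</p>

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

namespace fmadd_sketch
{
  struct SumAndError { double sum; double err; };

  // Steps 2 and 3: rounded sum plus its exact error term.
  inline SumAndError TwoSum(double x, double y)
  {
    const double sum = x + y;
    const double intermediate = sum - x;
    return { sum, (y - intermediate) + (x - (sum - intermediate)) };
  }

  // Step 4: nudge an inexact, even-mantissa sum to the odd neighbor that
  // lies in the direction of the error.
  inline double RoundToOdd(double value, double err)
  {
    if (err == 0.0 || !std::isfinite(value) || value == 0.0)
      return value;
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    if ((bits & 1u) == 0)
      bits = ((err > 0.0) == (value > 0.0)) ? bits + 1 : bits - 1;
    std::memcpy(&value, &bits, sizeof bits);
    return value;
  }

  inline float FMAdd(float a, float b, float c)
  {
    const double product = double(a) * double(b);     // step 1: exact
    const SumAndError s = TwoSum(product, double(c)); // steps 2 and 3
    return float(RoundToOdd(s.sum, s.err));           // steps 4 and 5
  }
}
```

<p>As a sanity check, <code>fmadd_sketch::FMAdd(0x1.000002p0f, 0x1.fffffcp-1f, 16777218.0f)</code> should come out as <code>16777218</code>, the same as <code>std::fma</code> on floats, whereas doing the add in doubles without the round-to-odd step gives <code>16777220</code>.</p>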
<p>But what if you have to calculate FMAdd for double-precision floats? You can’t easily just cast up to, like, <em>quad-precision</em> floats and do the work there, so what now? Can you still do this?</p>
<p>The answer is yes, but it’s a <em><strong>lot</strong></em> more work, and that’s what <a href="https://www.drilian.com/posts/2025.01.02-emulating-the-fmadd-instruction-part-2-64-bit-floats/">the next post</a> is about.</p>

        ]]>
      </content:encoded>
    </item>
    <item>
      <title><![CDATA[C++17-Style Hex Floats (And How To Parse Them)]]></title>
      <link>https://www.drilian.com/posts/2024.12.30-c-17-style-hex-floats-and-how-to-parse-them/</link>
      <pubDate>Mon, 30 Dec 2024 12:00:00 PST</pubDate>
      <dc:creator><![CDATA[Josh Jersild]]></dc:creator>
      <guid>https://www.drilian.com/posts/2024.12.30-c-17-style-hex-floats-and-how-to-parse-them/</guid>
      <content:encoded>
        <![CDATA[
          <p>C++17 added support for hex float literals, so you can put more bit-accurate floating point values into your code. They’re handy to have, and I wanted to be able to parse them from a text file in a C# app I was writing.</p>
<p>Some examples:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token number">0x1.0p0</span>         <span class="token comment">// == 1.0</span>
<span class="token number">0x1.8p1</span>         <span class="token comment">// == 3</span>
<span class="token number">0x8.0p-3</span>        <span class="token comment">// == 1.0</span>
<span class="token number">0x0.8p1</span>         <span class="token comment">// == 1.0</span>
<span class="token number">0xAB.CDEFp-10</span>   <span class="token comment">// == 0.16777776181697846</span>
<span class="token number">0x0.0000000ABp0</span> <span class="token comment">// == 2.4883775040507317E-09</span></code></pre>
<h3>What Am I Even Looking At Here?</h3>
<p>(Before reading this, if you don’t have a good feel for how floats work, maybe check <a href="https://www.drilian.com/posts/2023.01.10-floating-point-numbers-and-rounding/">my previous post about floats (and float rounding)</a>)</p>
<p>I had a bit of a mental block on this number format for a bit - like, what does it even mean to have fractional hex digits? But it turns out it’s a concept that we already use all the time and my brain just needed some prodding to make the connection.</p>
<p>With our standard base 10 numbers, moving the decimal point left one digit means dividing the number by 10:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token number">12.3</span> <span class="token operator">==</span> <span class="token number">1.23</span> <span class="token operator">*</span> <span class="token number">10</span><span class="token operator">^</span><span class="token number">1</span> <span class="token operator">==</span> <span class="token number">0.123</span> <span class="token operator">*</span> <span class="token number">10</span><span class="token operator">^</span><span class="token number">2</span></code></pre>
<p>Hex floats? Same deal, just in 16s instead:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token number">0x1B.C8</span> <span class="token operator">==</span> <span class="token number">0x1.BC8</span> <span class="token operator">*</span> <span class="token number">16</span><span class="token operator">^</span><span class="token number">1</span> <span class="token operator">==</span> <span class="token number">0x0.1BC8</span> <span class="token operator">*</span> <span class="token number">16</span><span class="token operator">^</span><span class="token number">2</span></code></pre>
<p>Okay, so now what’s the “<code>p</code><span class="attrs"></span>” part in the number? Well, that’s the start of the exponent. A standard float has an exponent starting with ‘<code>e</code><span class="attrs"></span>’:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token number">1.3e2</span> <span class="token operator">==</span> <span class="token number">1.3</span> <span class="token operator">*</span> <span class="token number">10</span><span class="token operator">^</span><span class="token number">2</span></code></pre>
<p>But ‘<code>e</code><span class="attrs"></span>’ is a hex digit, so you can’t use ‘<code>e</code><span class="attrs"></span>’ anymore as the exponent starter, so they chose ‘<code>p</code><span class="attrs"></span>’ instead (why not ‘<code>x</code><span class="attrs"></span>’, the second letter? Probably because a hex number starts with ‘<code>0x</code><span class="attrs"></span>’, so ‘<code>x</code><span class="attrs"></span>’ also already has a use - but ‘<code>p</code><span class="attrs"></span>’ is free so it wins)</p>
<p>The exponent for a hex float is in powers of 2 (so it corresponds perfectly to the exponent as it is stored in the value), so:</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token number">0x1.ABp3</span> <span class="token operator">==</span> <span class="token number">0x1.AB</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">3</span></code></pre>
<p>So that’s how a hex float literal works! Here’s a quick breakdown:</p>
<div class="code-box">
<span class="keyword">&lt;hex-digits&gt;</span><span class="string">[ '.'</span> <span class="operator">[fractional-hex-digits]</span><span class="string">]</span> <span class="keyword">'p' &lt;exponent&gt;</span>
</div>
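<p>To make that breakdown concrete, here is a small C++ sketch that rebuilds a value from its pieces: each fractional hex digit is worth four bits, so it just shifts the binary exponent down by 4. <code>FromHexParts</code> is a hypothetical helper, and it assumes the combined digits fit in 53 bits so the conversion to <code>double</code> is exact.</p>

```cpp
#include <cmath>

// allDigits:        every hex digit (integer and fractional) read as one integer
// fractionalDigits: how many of those digits came after the '.'
// exponent:         the (decimal) number written after the 'p'
double FromHexParts(unsigned long long allDigits, int fractionalDigits, int exponent)
{
  // value = allDigits * 16^(-fractionalDigits) * 2^exponent
  //       = allDigits * 2^(exponent - 4 * fractionalDigits)
  return std::ldexp(static_cast<double>(allDigits),
                    exponent - 4 * fractionalDigits);
}
```

<p>So <code>FromHexParts(0x1AB, 2, 3)</code> should equal the literal <code>0x1.ABp3</code>, and <code>FromHexParts(0x18, 1, 1)</code> should equal <code>3.0</code>.</p>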
<span class="read-more">(Parsing code below the fold)</span>
<h3>Okay Now What’s This About Parsing Them?</h3>
<p>C++ conveniently has functions to parse these (std::strtod/strtof handle this nicely). However, if you’re (hypothetically) making a parser, and you happen to be writing it in C# which does <em>not</em> have an inbuilt way to parse these, then you’ll have to parse your own.</p>
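<p>For reference, the C++ side can be sketched like this. <code>TryParseHexFloat</code> is a hypothetical wrapper name; the only library call is <code>std::strtod</code>. Note that <code>strtod</code> is more permissive than a dedicated hex-float parser: it also accepts leading whitespace, a sign, plain decimal numbers, and things like <code>inf</code>, and it depends on the current C locale.</p>

```cpp
#include <cstdlib>
#include <string>

// Returns true if the whole string was consumed as a number, storing the
// parsed value in `result`.
bool TryParseHexFloat(const std::string& text, double& result)
{
  char* end = nullptr;
  result = std::strtod(text.c_str(), &end);
  // `end` points just past the last character strtod consumed. Nothing
  // consumed, or anything left over, counts as a failure here.
  return end != text.c_str() && *end == '\0';
}
```

<p>With that, <code>"0x1.8p1"</code> should parse to <code>3.0</code>, and a string with trailing junk after the exponent should be rejected.</p>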
<p>It ended up being a little more complicated than I thought for a couple reasons.</p>
<ul>
<li>Arbitrarily long hex strings seem to be supported, which means you need to track where the top-most non-zero hex value starts (i.e. skip any leading zeros) and also properly handle bits that are extra tiny, which may affect rounding.</li>
<li>In order to properly handle said rounding, running through the hex digits in reverse and pushing them in from the top ends up being a nice strategy, because float rounding works via a sticky bit that stays set as things right-shift down through it.</li>
</ul>
<p>Ultimately I ended up with the following algorithm (in C#) to parse a hex float, which I’m <em>definitely</em> sure is <em>~perfect~</em> and has absolutely no bugs whatsoever. It is also <em>absolutely</em> the most efficient version of this possible, with no room for improvement. Yep.</p>
<p>I’m throwing it in here in case anyone ever finds it useful.</p>
<pre class="language-cs"><code class="language-cs"><span class="token keyword">static</span> <span class="token return-type class-name"><span class="token keyword">bool</span></span> <span class="token function">IsDigit</span><span class="token punctuation">(</span><span class="token class-name"><span class="token keyword">char</span></span> c<span class="token punctuation">)</span> <span class="token operator">=></span> <span class="token punctuation">(</span>c <span class="token operator">>=</span> <span class="token char">'0'</span> <span class="token operator">&amp;&amp;</span> c <span class="token operator">&lt;=</span> <span class="token char">'9'</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token keyword">static</span> <span class="token return-type class-name"><span class="token keyword">bool</span></span> <span class="token function">IsHexDigit</span><span class="token punctuation">(</span><span class="token class-name"><span class="token keyword">char</span></span> c<span class="token punctuation">)</span> 
  <span class="token operator">=></span> <span class="token function">IsDigit</span><span class="token punctuation">(</span>c<span class="token punctuation">)</span> 
    <span class="token operator">||</span> <span class="token punctuation">(</span>c <span class="token operator">>=</span> <span class="token char">'A'</span> <span class="token operator">&amp;&amp;</span> c <span class="token operator">&lt;=</span> <span class="token char">'F'</span><span class="token punctuation">)</span> 
    <span class="token operator">||</span> <span class="token punctuation">(</span>c <span class="token operator">>=</span> <span class="token char">'a'</span> <span class="token operator">&amp;&amp;</span> c <span class="token operator">&lt;=</span> <span class="token char">'f'</span><span class="token punctuation">)</span><span class="token punctuation">;</span>

<span class="token return-type class-name"><span class="token keyword">double</span></span> <span class="token function">ParseHexFloat</span><span class="token punctuation">(</span><span class="token class-name"><span class="token keyword">string</span></span> s<span class="token punctuation">)</span>
<span class="token punctuation">{</span> 
  <span class="token comment">// This doesn't handle a negative sign, if only because the parser I have only</span>
  <span class="token comment">// needed to support positive values, but it'd be easy to add</span>
  <span class="token keyword">if</span> <span class="token punctuation">(</span>s<span class="token punctuation">.</span>Length <span class="token operator">&lt;</span> <span class="token number">2</span> <span class="token operator">||</span> s<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token operator">!=</span> <span class="token char">'0'</span> <span class="token operator">||</span> <span class="token keyword">char</span><span class="token punctuation">.</span><span class="token function">ToLowerInvariant</span><span class="token punctuation">(</span>s<span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token operator">!=</span> <span class="token char">'x'</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span> <span class="token keyword">throw</span> <span class="token keyword">new</span> <span class="token constructor-invocation class-name">FormatException</span><span class="token punctuation">(</span><span class="token string">"Missing 0x prefix"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span>

  <span class="token keyword">if</span> <span class="token punctuation">(</span>s<span class="token punctuation">.</span>Length <span class="token operator">&lt;</span> <span class="token number">3</span> <span class="token operator">||</span> <span class="token operator">!</span><span class="token function">IsHexDigit</span><span class="token punctuation">(</span>s<span class="token punctuation">[</span><span class="token number">2</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span> <span class="token keyword">throw</span> <span class="token keyword">new</span> <span class="token constructor-invocation class-name">FormatException</span><span class="token punctuation">(</span>
        <span class="token string">"Hex float literal must contain at least one whole part digit"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span>
    
  <span class="token class-name"><span class="token keyword">int</span></span> i <span class="token operator">=</span> <span class="token number">2</span><span class="token punctuation">;</span>
  <span class="token class-name"><span class="token keyword">int</span></span> decimalPointIndex <span class="token operator">=</span> <span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">;</span>
  <span class="token class-name"><span class="token keyword">int</span></span> firstNonZeroHexDigitIndex <span class="token operator">=</span> <span class="token operator">-</span><span class="token number">1</span><span class="token punctuation">;</span>

  <span class="token comment">// Scan through our digits, looking for the index of the first set (non-zero)</span>
  <span class="token comment">//  hex value and the decimal point (if we have one).</span>
  <span class="token keyword">while</span> <span class="token punctuation">(</span>i <span class="token operator">&lt;</span> s<span class="token punctuation">.</span>Length <span class="token operator">&amp;&amp;</span> <span class="token punctuation">(</span><span class="token function">IsHexDigit</span><span class="token punctuation">(</span>s<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token operator">||</span> s<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token operator">==</span> <span class="token char">'.'</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
  <span class="token punctuation">{</span>
    <span class="token keyword">if</span> <span class="token punctuation">(</span>s<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token operator">==</span> <span class="token char">'.'</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
      <span class="token comment">// Found the decimal point! Hopefully there wasn't already one!</span>
      <span class="token keyword">if</span> <span class="token punctuation">(</span>decimalPointIndex <span class="token operator">>=</span> <span class="token number">0</span><span class="token punctuation">)</span>
        <span class="token punctuation">{</span> <span class="token keyword">throw</span> <span class="token keyword">new</span> <span class="token constructor-invocation class-name">FormatException</span><span class="token punctuation">(</span><span class="token string">"Too many decimal points"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span>

      decimalPointIndex <span class="token operator">=</span> i<span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    <span class="token keyword">else</span> <span class="token keyword">if</span> <span class="token punctuation">(</span>s<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token operator">!=</span> <span class="token char">'0'</span> <span class="token operator">&amp;&amp;</span> firstNonZeroHexDigitIndex <span class="token operator">&lt;</span> <span class="token number">0</span><span class="token punctuation">)</span>
      <span class="token punctuation">{</span> firstNonZeroHexDigitIndex <span class="token operator">=</span> i<span class="token punctuation">;</span> <span class="token punctuation">}</span> <span class="token comment">// Here's our top-most set hex value.</span>

    i<span class="token operator">++</span><span class="token punctuation">;</span>
  <span class="token punctuation">}</span>

  <span class="token comment">// Also make a note of where our last hex digit was (usually the digit before</span>
  <span class="token comment">//  the 'p' that we should be at right now)</span>
  <span class="token class-name"><span class="token keyword">int</span></span> lastHexDigitIndex <span class="token operator">=</span> i <span class="token operator">-</span> <span class="token number">1</span><span class="token punctuation">;</span>

  <span class="token comment">// ... but if the previous character was the decimal point, the last hex</span>
  <span class="token comment">//  digit is before that.</span>
  <span class="token keyword">if</span> <span class="token punctuation">(</span>lastHexDigitIndex <span class="token operator">==</span> decimalPointIndex<span class="token punctuation">)</span>
    <span class="token punctuation">{</span> lastHexDigitIndex<span class="token operator">--</span><span class="token punctuation">;</span> <span class="token punctuation">}</span>

  <span class="token comment">// If we didn't find a decimal point, it's EFFECTIVELY here, at the 'p'</span>
  <span class="token keyword">if</span> <span class="token punctuation">(</span>decimalPointIndex <span class="token operator">&lt;</span> <span class="token number">0</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span> decimalPointIndex <span class="token operator">=</span> i<span class="token punctuation">;</span> <span class="token punctuation">}</span>

  <span class="token comment">// Validate and skip the 'p' character</span>
  <span class="token keyword">if</span> <span class="token punctuation">(</span>i <span class="token operator">>=</span> s<span class="token punctuation">.</span>Length <span class="token operator">||</span> <span class="token keyword">char</span><span class="token punctuation">.</span><span class="token function">ToLowerInvariant</span><span class="token punctuation">(</span>s<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token operator">!=</span> <span class="token char">'p'</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span> <span class="token keyword">throw</span> <span class="token keyword">new</span> <span class="token constructor-invocation class-name">FormatException</span><span class="token punctuation">(</span><span class="token string">"Missing exponent 'p'"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span>
  i<span class="token operator">++</span><span class="token punctuation">;</span>

  <span class="token comment">// Grab the sign if we have one</span>
  <span class="token class-name"><span class="token keyword">bool</span></span> negativeExponent <span class="token operator">=</span> <span class="token boolean">false</span><span class="token punctuation">;</span>
  <span class="token keyword">if</span> <span class="token punctuation">(</span>i <span class="token operator">&lt;</span> s<span class="token punctuation">.</span>Length <span class="token operator">&amp;&amp;</span> <span class="token punctuation">(</span>s<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token operator">==</span> <span class="token char">'+'</span> <span class="token operator">||</span> s<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token operator">==</span> <span class="token char">'-'</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
  <span class="token punctuation">{</span>
    negativeExponent <span class="token operator">=</span> <span class="token punctuation">(</span>s<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token operator">==</span> <span class="token char">'-'</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    i<span class="token operator">++</span><span class="token punctuation">;</span>
  <span class="token punctuation">}</span>

  <span class="token keyword">if</span> <span class="token punctuation">(</span>i <span class="token operator">>=</span> s<span class="token punctuation">.</span>Length<span class="token punctuation">)</span>
    <span class="token punctuation">{</span> <span class="token keyword">throw</span> <span class="token keyword">new</span> <span class="token constructor-invocation class-name">FormatException</span><span class="token punctuation">(</span><span class="token string">"Missing exponent digits"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span>

  <span class="token comment">// Parse the exponent!</span>
  <span class="token class-name"><span class="token keyword">int</span></span> exponent <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span>
  <span class="token keyword">while</span> <span class="token punctuation">(</span>i <span class="token operator">&lt;</span> s<span class="token punctuation">.</span>Length <span class="token operator">&amp;&amp;</span> <span class="token function">IsDigit</span><span class="token punctuation">(</span>s<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
  <span class="token punctuation">{</span>
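    <span class="token keyword">if</span> <span class="token punctuation">(</span>exponent <span class="token operator">></span> <span class="token punctuation">(</span><span class="token keyword">int</span><span class="token punctuation">.</span>MaxValue <span class="token operator">-</span> <span class="token punctuation">(</span>s<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token operator">-</span> <span class="token char">'0'</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token operator">/</span> <span class="token number">10</span><span class="token punctuation">)</span>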
      <span class="token punctuation">{</span> <span class="token keyword">throw</span> <span class="token keyword">new</span> <span class="token constructor-invocation class-name">FormatException</span><span class="token punctuation">(</span><span class="token string">"Exponent overflow"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span>
    
    exponent <span class="token operator">*=</span> <span class="token number">10</span><span class="token punctuation">;</span>
    exponent <span class="token operator">+=</span> <span class="token punctuation">(</span><span class="token keyword">int</span><span class="token punctuation">)</span><span class="token punctuation">(</span>s<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token operator">-</span> <span class="token char">'0'</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    i<span class="token operator">++</span><span class="token punctuation">;</span>
  <span class="token punctuation">}</span>

  <span class="token keyword">if</span> <span class="token punctuation">(</span>negativeExponent<span class="token punctuation">)</span>
    <span class="token punctuation">{</span> exponent <span class="token operator">=</span> <span class="token operator">-</span>exponent<span class="token punctuation">;</span> <span class="token punctuation">}</span>

  <span class="token comment">// Make sure we consumed the whole string before returning anything (even</span>
  <span class="token comment">//  a zero), so trailing garbage is always rejected.</span>
  <span class="token keyword">if</span> <span class="token punctuation">(</span>i <span class="token operator">!=</span> s<span class="token punctuation">.</span>Length<span class="token punctuation">)</span>
    <span class="token punctuation">{</span> <span class="token keyword">throw</span> <span class="token keyword">new</span> <span class="token constructor-invocation class-name">FormatException</span><span class="token punctuation">(</span><span class="token string">"Unexpected characters at end of string"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token punctuation">}</span>

  <span class="token comment">// If we had no non-zero hex digits, there's no point in continuing, it's</span>
  <span class="token comment">//  zero. </span>
  <span class="token keyword">if</span> <span class="token punctuation">(</span>firstNonZeroHexDigitIndex <span class="token operator">&lt;</span> <span class="token number">0</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span> <span class="token keyword">return</span> <span class="token number">0.0</span><span class="token punctuation">;</span> <span class="token punctuation">}</span>
    
  <span class="token comment">// We have the supplied exponent, but we want to massage it a bit. In a</span>
  <span class="token comment">//  (IEEE) floating-point value, the mantissa is entirely fractional - that</span>
  <span class="token comment">//  is, the value is 1.mantissa * 2^(exponent) - there's an implied 1</span>
  <span class="token comment">//  (excluding subnormal floats, which we'll handle properly, but we can</span>
  <span class="token comment">//  ignore for the moment). Two things need to happen here:</span>
  <span class="token comment">//   1. We need to adjust the exponent based on the position of the first</span>
  <span class="token comment">//      non-zero hex digit, to match the fact that we're parsing hex digits</span>
  <span class="token comment">//      such that the top hex digit is sitting in the top 4 bits of our 64-</span>
  <span class="token comment">//      bit int.</span>
  <span class="token comment">//   2. But we EXPECT a single bit to be above the mantissa (the implied</span>
  <span class="token comment">//      1) so subtract 1 from our adjustment to take into account that</span>
  <span class="token comment">//      there will be 4 bits in that hex, so if we had parsed a single "1"</span>
  <span class="token comment">//      (from, say, 0x1p0, which just equals 1.0) our effective exponent</span>
  <span class="token comment">//      should be 3 (which we will later shift back down to 0 to position</span>
  <span class="token comment">//      the 1s bit at the very top)</span>
  <span class="token keyword">if</span> <span class="token punctuation">(</span>decimalPointIndex <span class="token operator">>=</span> firstNonZeroHexDigitIndex<span class="token punctuation">)</span>
    <span class="token punctuation">{</span> exponent <span class="token operator">+=</span> <span class="token punctuation">(</span><span class="token punctuation">(</span>decimalPointIndex <span class="token operator">-</span> firstNonZeroHexDigitIndex<span class="token punctuation">)</span> <span class="token operator">*</span> <span class="token number">4</span><span class="token punctuation">)</span> <span class="token operator">-</span> <span class="token number">1</span><span class="token punctuation">;</span> <span class="token punctuation">}</span>
  <span class="token keyword">else</span>
    <span class="token punctuation">{</span> exponent <span class="token operator">+=</span> <span class="token punctuation">(</span><span class="token punctuation">(</span>decimalPointIndex <span class="token operator">-</span> firstNonZeroHexDigitIndex <span class="token operator">+</span> <span class="token number">1</span><span class="token punctuation">)</span> <span class="token operator">*</span> <span class="token number">4</span><span class="token punctuation">)</span> <span class="token operator">-</span> <span class="token number">1</span><span class="token punctuation">;</span> <span class="token punctuation">}</span>

  <span class="token comment">// Now that we have the exponent and know the bounds of our hex digits,</span>
  <span class="token comment">//  we can parse backwards through the hex digits, shifting them in from</span>
  <span class="token comment">//  the top. We do this so that we can easily handle the rounding to the</span>
  <span class="token comment">//  final 53 bits of significand (by ensuring that we don't ever shift</span>
  <span class="token comment">//  any 1s off the bottom)</span>
  <span class="token class-name"><span class="token keyword">ulong</span></span> mantissa <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span>
  <span class="token keyword">for</span> <span class="token punctuation">(</span>i <span class="token operator">=</span> lastHexDigitIndex<span class="token punctuation">;</span> i <span class="token operator">>=</span> firstNonZeroHexDigitIndex<span class="token punctuation">;</span> i<span class="token operator">--</span><span class="token punctuation">)</span>
  <span class="token punctuation">{</span>
    <span class="token comment">// Skip the '.' if there was one.</span>
    <span class="token keyword">if</span> <span class="token punctuation">(</span>s<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token operator">==</span> <span class="token char">'.'</span><span class="token punctuation">)</span>
      <span class="token punctuation">{</span> <span class="token keyword">continue</span><span class="token punctuation">;</span> <span class="token punctuation">}</span>

    <span class="token class-name"><span class="token keyword">char</span></span> c <span class="token operator">=</span> <span class="token keyword">char</span><span class="token punctuation">.</span><span class="token function">ToLowerInvariant</span><span class="token punctuation">(</span>s<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token class-name"><span class="token keyword">ulong</span></span> v <span class="token operator">=</span> <span class="token punctuation">(</span>c <span class="token operator">>=</span> <span class="token char">'a'</span> <span class="token operator">&amp;&amp;</span> c <span class="token operator">&lt;=</span> <span class="token char">'f'</span><span class="token punctuation">)</span> 
      <span class="token punctuation">?</span> <span class="token punctuation">(</span><span class="token keyword">ulong</span><span class="token punctuation">)</span><span class="token punctuation">(</span>c <span class="token operator">-</span> <span class="token char">'a'</span> <span class="token operator">+</span> <span class="token number">10</span><span class="token punctuation">)</span> 
      <span class="token punctuation">:</span> <span class="token punctuation">(</span><span class="token keyword">ulong</span><span class="token punctuation">)</span><span class="token punctuation">(</span>c <span class="token operator">-</span> <span class="token char">'0'</span><span class="token punctuation">)</span><span class="token punctuation">;</span>

    <span class="token comment">// Shift the mantissa down, but keep any 1s that happen to be in the bottom</span>
    <span class="token comment">//  4 bits (this is a reasonably-efficient emulation of the "sticky bit"</span>
    <span class="token comment">//  that is used to round a floating point number properly).</span>
    mantissa <span class="token operator">=</span> <span class="token punctuation">(</span>mantissa <span class="token operator">>></span> <span class="token number">4</span><span class="token punctuation">)</span> <span class="token operator">|</span> <span class="token punctuation">(</span>mantissa <span class="token operator">&amp;</span> <span class="token number">0xf</span><span class="token punctuation">)</span><span class="token punctuation">;</span>

    <span class="token comment">// Add in our parsed hex value, putting its 4 bits at the very top of the</span>
    <span class="token comment">//  mantissa ulong.</span>
    mantissa <span class="token operator">|=</span> <span class="token punctuation">(</span>v <span class="token operator">&lt;&lt;</span> <span class="token number">60</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token punctuation">}</span>

  <span class="token comment">// We know the mantissa is non-zero (checked earlier), and we want to position</span>
  <span class="token comment">//  the highest set bit at the top of our ulong so shift up until the top bit</span>
  <span class="token comment">//  is set (and adjust our exponent down 1 to compensate).</span>
  <span class="token keyword">while</span> <span class="token punctuation">(</span><span class="token punctuation">(</span>mantissa <span class="token operator">&amp;</span> <span class="token number">0x8000_0000_0000_0000ul</span><span class="token punctuation">)</span> <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">)</span>
  <span class="token punctuation">{</span>
    mantissa <span class="token operator">&lt;&lt;=</span> <span class="token number">1</span><span class="token punctuation">;</span>
    exponent<span class="token operator">--</span><span class="token punctuation">;</span>
  <span class="token punctuation">}</span>

  <span class="token keyword">const</span> <span class="token class-name"><span class="token keyword">int</span></span> DoubleExponentBias <span class="token operator">=</span> <span class="token number">1023</span><span class="token punctuation">;</span>
  <span class="token keyword">const</span> <span class="token class-name"><span class="token keyword">int</span></span> MaxBiasedDoubleExponent <span class="token operator">=</span> <span class="token number">1023</span> <span class="token operator">+</span> DoubleExponentBias<span class="token punctuation">;</span>
  <span class="token keyword">const</span> <span class="token class-name"><span class="token keyword">ulong</span></span> MantissaMask <span class="token operator">=</span> <span class="token number">0x000f_ffff_ffff_fffful</span><span class="token punctuation">;</span>
  <span class="token keyword">const</span> <span class="token class-name"><span class="token keyword">ulong</span></span> ImpliedOneBit <span class="token operator">=</span> <span class="token number">0x0010_0000_0000_0000ul</span><span class="token punctuation">;</span>
  <span class="token keyword">const</span> <span class="token class-name"><span class="token keyword">int</span></span> ExponentShift <span class="token operator">=</span> <span class="token number">52</span><span class="token punctuation">;</span>
  <span class="token keyword">const</span> <span class="token class-name"><span class="token keyword">int</span></span> MantissaShiftRight <span class="token operator">=</span> <span class="token keyword">sizeof</span><span class="token punctuation">(</span><span class="token type-expression class-name"><span class="token keyword">double</span></span><span class="token punctuation">)</span><span class="token operator">*</span><span class="token number">8</span> <span class="token operator">-</span> ExponentShift <span class="token operator">-</span> <span class="token number">1</span><span class="token punctuation">;</span>

  <span class="token comment">// Exponents are stored in a biased form (they can't go negative) so add our</span>
  <span class="token comment">//  bias now.</span>
  exponent <span class="token operator">+=</span> DoubleExponentBias<span class="token punctuation">;</span>

  <span class="token keyword">if</span> <span class="token punctuation">(</span>exponent <span class="token operator">&lt;=</span> <span class="token number">0</span><span class="token punctuation">)</span>
  <span class="token punctuation">{</span>
    <span class="token comment">// We have a subnormal value, which means there is no implied 1, so first</span>
    <span class="token comment">//  we need to shift our mantissa down by one to get rid of the implied 1.</span>
    <span class="token comment">//  (note that we're not letting any 1s shift off the bottom, keeping them</span>
    <span class="token comment">//  sticky)</span>
    mantissa <span class="token operator">=</span> <span class="token punctuation">(</span>mantissa <span class="token operator">>></span> <span class="token number">1</span><span class="token punctuation">)</span> <span class="token operator">|</span> <span class="token punctuation">(</span>mantissa <span class="token operator">&amp;</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">;</span>

    <span class="token comment">// Continue to denormalize the mantissa until our exponent reaches zero</span>
    <span class="token keyword">while</span> <span class="token punctuation">(</span>exponent <span class="token operator">&lt;</span> <span class="token number">0</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
      mantissa <span class="token operator">=</span> <span class="token punctuation">(</span>mantissa <span class="token operator">>></span> <span class="token number">1</span><span class="token punctuation">)</span> <span class="token operator">|</span> <span class="token punctuation">(</span>mantissa <span class="token operator">&amp;</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
      exponent<span class="token operator">++</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
  <span class="token punctuation">}</span>

  <span class="token comment">// Now to do the actual rounding of the mantissa. Sometimes floating point</span>
  <span class="token comment">//  rounding needs 3 bits (guard, round, sticky) to do rounding, but in our</span>
  <span class="token comment">//  case, two will suffice: one bit that represents the uppermost bit that</span>
  <span class="token comment">//  shifts right off of the edge of the mantissa (i.e. the "0.5" bit) and</span>
  <span class="token comment">//  then "literally any 1 bit underneath that" (which is why we've been</span>
  <span class="token comment">//  holding on to extra 1s when shifting right) that is the tiebreaker</span>
  <span class="token class-name"><span class="token keyword">bool</span></span> roundBit      <span class="token operator">=</span> <span class="token punctuation">(</span>mantissa <span class="token operator">&amp;</span> <span class="token number">0b10000000000</span><span class="token punctuation">)</span> <span class="token operator">!=</span> <span class="token number">0</span><span class="token punctuation">;</span>
  <span class="token class-name"><span class="token keyword">bool</span></span> tiebreakerBit <span class="token operator">=</span> <span class="token punctuation">(</span>mantissa <span class="token operator">&amp;</span> <span class="token number">0b01111111111</span><span class="token punctuation">)</span> <span class="token operator">!=</span> <span class="token number">0</span><span class="token punctuation">;</span>

  <span class="token comment">// Now that we have those bits, we can shift our mantissa down into its</span>
  <span class="token comment">//  proper place (as the lower 53 bits of our 64-bit ulong).</span>
  mantissa <span class="token operator">>>=</span> MantissaShiftRight<span class="token punctuation">;</span>
  <span class="token keyword">if</span> <span class="token punctuation">(</span>roundBit<span class="token punctuation">)</span>
  <span class="token punctuation">{</span>
    <span class="token comment">// If there's a tiebreaker, we'll increment the mantissa. Otherwise,</span>
    <span class="token comment">//  if there's a tie (could round either way), we round so that the</span>
    <span class="token comment">//  mantissa value is even (lowest bit in the double is 0)</span>
    <span class="token keyword">if</span> <span class="token punctuation">(</span>tiebreakerBit <span class="token operator">||</span> <span class="token punctuation">(</span>mantissa <span class="token operator">&amp;</span> <span class="token number">1</span><span class="token punctuation">)</span> <span class="token operator">!=</span> <span class="token number">0</span><span class="token punctuation">)</span>
      <span class="token punctuation">{</span> mantissa<span class="token operator">++</span><span class="token punctuation">;</span> <span class="token punctuation">}</span>

    <span class="token comment">// If we have a subnormal float we may have overflowed into the implied 1</span>
    <span class="token comment">//  bit, otherwise we might have overflowed into the ... I guess the</span>
    <span class="token comment">//  implied *2* bit?</span>
    <span class="token class-name"><span class="token keyword">ulong</span></span> overflowedMask <span class="token operator">=</span> ImpliedOneBit <span class="token operator">&lt;&lt;</span> <span class="token punctuation">(</span><span class="token punctuation">(</span>exponent <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">)</span> <span class="token punctuation">?</span> <span class="token number">0</span> <span class="token punctuation">:</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token punctuation">(</span>mantissa <span class="token operator">&amp;</span> overflowedMask<span class="token punctuation">)</span> <span class="token operator">!=</span> <span class="token number">0</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
      <span class="token comment">// Shift back down one. This is not going to drop a 1 off the bottom</span>
      <span class="token comment">//  because if we overflowed it means we were odd, and added one to</span>
      <span class="token comment">//  become even.</span>
      exponent<span class="token operator">++</span><span class="token punctuation">;</span>
      mantissa <span class="token operator">>>=</span> <span class="token number">1</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
  <span class="token punctuation">}</span>

  <span class="token comment">// It's possible that after the truncation we ended up with a 0 mantissa,</span>
  <span class="token comment">//  so our final value has rounded allll the way down to 0.</span>
  <span class="token keyword">if</span> <span class="token punctuation">(</span>mantissa <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span> <span class="token keyword">return</span> <span class="token number">0.0</span><span class="token punctuation">;</span> <span class="token punctuation">}</span>

  <span class="token comment">// If our exponent is too large to be represented, this value is infinity.</span>
  <span class="token keyword">if</span> <span class="token punctuation">(</span>exponent <span class="token operator">></span> MaxBiasedDoubleExponent<span class="token punctuation">)</span>
    <span class="token punctuation">{</span> <span class="token keyword">return</span> <span class="token keyword">double</span><span class="token punctuation">.</span>PositiveInfinity<span class="token punctuation">;</span> <span class="token punctuation">}</span>

  <span class="token comment">// Mask off the implied one bit (if we have one)</span>
  mantissa <span class="token operator">&amp;=</span> <span class="token operator">~</span>ImpliedOneBit<span class="token punctuation">;</span>

  <span class="token comment">// Alright assemble the final double's bits, which means shifting and</span>
  <span class="token comment">//  adding the exponent into its proper place.</span>
  <span class="token comment">//  (if we had a sign to apply we'd apply it to the top bit). </span>
  <span class="token class-name"><span class="token keyword">ulong</span></span> assembled <span class="token operator">=</span> mantissa <span class="token operator">|</span> <span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token keyword">ulong</span><span class="token punctuation">)</span>exponent<span class="token punctuation">)</span> <span class="token operator">&lt;&lt;</span> ExponentShift<span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token class-name"><span class="token keyword">double</span></span> result <span class="token operator">=</span> BitConverter<span class="token punctuation">.</span><span class="token function">UInt64BitsToDouble</span><span class="token punctuation">(</span>assembled<span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token keyword">return</span> result<span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre>

        ]]>
      </content:encoded>
    </item>
    <item>
      <title><![CDATA[Wordpress No More]]></title>
      <link>https://www.drilian.com/posts/2024.12.29-wordpress-no-more/</link>
      <pubDate>Sun, 29 Dec 2024 12:00:00 PST</pubDate>
      <dc:creator><![CDATA[Josh Jersild]]></dc:creator>
      <guid>https://www.drilian.com/posts/2024.12.29-wordpress-no-more/</guid>
      <content:encoded>
        <![CDATA[
          <p>I’ve ported the blog off of Wordpress onto a static site generator (Specifically, <a href="https://www.11ty.dev/" target="_blank" rel="noopener">11ty/Eleventy</a>). I have a bit more control over the format here, and it’s easier for me to write pages (Wordpress was fighting me on all sorts of formatting which I can just <strong>do</strong> now).</p>
<p>What this means is I can finally start copying over the rest of my posts that were on (the now-defunct, sadly) cohost.org (rest easy, little eggbug).</p>
<p>Likely there are things on the new site that aren’t set up correctly yet, so if you happen to notice anything, find me on <a href="https://bsky.app/profile/joshjers.drilian.com" target="_blank" rel="noopener">Bluesky</a> or <a href="https://mastodon.gamedev.place/@JoshJers" target="_blank" rel="noopener">Mastodon</a> and let me know!</p>
<p>And now, for fun, here’s a pic I took in November at <a href="https://www.nps.gov/arch/index.htm" target="_blank" rel="noopener">Arches National Park</a>:</p>
<div class="image-container">
        <a href="https://www.drilian.com/assets/2024/delicate-arch.png" target="_blank"><img src="https://www.drilian.com/assets/2024/delicate-arch-small.jpg" alt="Photo of Delicate Arch"></a>
        </div>
        ]]>
      </content:encoded>
    </item>
    <item>
      <title><![CDATA[Floating Point Numbers and Rounding]]></title>
      <link>https://www.drilian.com/posts/2023.01.10-floating-point-numbers-and-rounding/</link>
      <pubDate>Tue, 10 Jan 2023 12:00:00 PST</pubDate>
      <dc:creator><![CDATA[Josh Jersild]]></dc:creator>
      <guid>https://www.drilian.com/posts/2023.01.10-floating-point-numbers-and-rounding/</guid>
      <content:encoded>
        <![CDATA[
          <p>I was writing about how to parse C++17-style hex floating point literals, and in doing so I ended up writing a bunch about how floats work in general (and specifically how floating point rounding happens), so I opted to split it off from that post into its own, since it’s already probably way too many words as it is?</p>
<p>Here we go!</p>
<h3>How Floats Work</h3>
<div class="svg-container">
  <svg viewBox="-10 0 342 55" xmlns="http://www.w3.org/2000/svg">
    <text x="6" y="20" text-anchor="middle">sign</text>
    <line x1="6" y1="23" x2="6" y2="29"></line>
    <text x="51" y="20" text-anchor="middle">exponent</text>
    <polyline points="12 29 12 23 90 23 90 29"></polyline>
    <text x="207" y="20" text-anchor="middle">mantissa</text>
    <polyline points="92 29 92 23 320 23 320 29"></polyline>
    <rect class="blue" x="1" y="30" width="10" height="10"></rect>
    <rect class="green" x="11" y="30" width="80" height="10"></rect>
    <rect class="red" x="91" y="30" width="230" height="10"></rect>
    <circle cx="320" cy="42.5" r="1.5"></circle>
    <text x="320" y="44" text-anchor="middle" dominant-baseline="hanging">0</text>
    <circle cx="91" cy="42.5" r="1.5"></circle>
    <text x="91" y="44" text-anchor="middle" dominant-baseline="hanging">f</text>
    <circle cx="11" cy="42.5" r="1.5"></circle>
    <text x="11" y="44" text-anchor="middle" dominant-baseline="hanging">e+f</text>
  </svg>
</div>
<p>If you don’t know, a floating point number (at least, an <a href="https://en.wikipedia.org/wiki/IEEE_754" target="_blank" rel="noopener">IEEE 754</a> float, which effectively all modern hardware supports) consists of three parts:</p>
<ul>
<li><strong>Sign bit</strong> – the upper bit of the float is the sign bit: 1 if the float is negative, 0 if it’s positive.</li>
<li><strong>Exponent</strong> – the next few bits (8 bits for a 32-bit float, 11 bits for a 64-bit float) contain the exponent data, which is the power of two that the mantissa’s value gets multiplied by. (Note that the exponent is stored in a <strong>biased</strong> way – more on that in a moment)</li>
<li><strong>Mantissa</strong> – the remaining bits (23 for 32-bit float, 52 for a 64-bit float) represent the fractional part of the float’s value.</li>
</ul>
<p><span class="read-more"></span></p>
<p>In general (with the exception of subnormal floats and 0.0, explained in a bit) there is an <strong>implied 1</strong> in the float: that is, if the mantissa has a hex value of “FC3CA0000” the actual float is 1.FC3CA0000 (the mantissa bits are all to the right of the radix point) before the exponent is applied. Having this implied 1 gives an extra bit of precision to the value since you don’t even have to store that extra 1 bit anywhere – it’s implied! Clever.</p>
<p>The exponent represents the power of two involved (<code>Pow2(exponent)</code><span class="attrs"></span>), which has the nice property that multiplying or dividing a float by a power of two does not (usually, except at extremes) affect the precision of the number: dividing by 2 simply decrements the exponent by 1, and multiplying by 2 increments it by 1.</p>
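<p>One way to see this for yourself is to pull a value apart with the standard <code>std::frexp</code> and compare. This is just a minimal sketch; the helper name <code>SameSignificand</code> is made up for illustration:</p>

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: returns true if a and b share the same significand
// (that is, they differ only by a power of two) and reports how far apart
// their exponents are.
bool SameSignificand(double a, double b, int *exponentDelta)
{
  assert(exponentDelta != nullptr);

  int expA = 0;
  int expB = 0;
  // std::frexp splits a value into a significand in [0.5, 1) and a
  //  power-of-two exponent.
  double sigA = std::frexp(a, &expA);
  double sigB = std::frexp(b, &expB);

  *exponentDelta = expB - expA;
  return sigA == sigB;
}
```

<p>Comparing <code>0.1</code> against <code>0.1 * 2.0</code> reports the same significand with an exponent delta of 1, and comparing it against <code>0.1 / 2.0</code> gives a delta of -1.</p>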
<p>For a double-precision (64-bit) float, the maximum representable exponent is 1023 and the minimum is -1022. These are stored in 11 bits, and they’re <strong>biased</strong> (which is to say that the stored 11 bits are <code>actualExponent + bias</code><span class="attrs"></span>, where the bias is 1023). That means that this range of [-1022, 1023] is actually stored as [1, 2046] (00000000001 and 11111111110 in binary). This range uses all but two of the possible 11-bit values; the remaining two are used to represent two sets of special cases:</p>
<ul>
<li><strong>Exponent value <code>00000000000b</code><span class="attrs"></span></strong> represents a <strong>subnormal float</strong> – that is, it still has the effective exponent of -1022 (the minimum representable exponent) but it does <strong>NOT have the implied 1</strong> – values smaller than this start to lose bits of precision for every division by 2 as it can’t decrement the exponent any farther and so ends up sliding the mantissa to the right instead.
<ul>
<li>For this <code>00000000000b</code><span class="attrs"></span> exponent, if the mantissa is 0, then you have <strong>a value of 0.0</strong> (or, in a quirk of floating point math, -0.0 if the sign bit is set).</li>
</ul>
</li>
<li><strong>Exponent value <code>11111111111b</code><span class="attrs"></span></strong> represents one of two things:
<ul>
<li>If the mantissa is zero, this is <strong>infinity</strong> (either positive or negative infinity depending on the sign bit).</li>
<li>If the mantissa is non-zero, it’s <strong>NaN</strong> (not a number).
<ul>
<li>(There are two types of NaN, quiet and signaling. Those are a subject for another time, but the difference bit-wise is whether the upper bit of the mantissa is set: if 1 it’s quiet, if 0 it’s signaling.)</li>
</ul>
</li>
</ul>
</li>
</ul>
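<p>Those special cases can be checked directly from the raw bits. Here is a rough sketch of what that might look like for a 64-bit float (the names <code>FloatKind</code> and <code>Classify</code> are invented for this example, and it assumes <code>double</code> is an IEEE 754 64-bit float):</p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

enum class FloatKind { Zero, Subnormal, Normal, Infinity, NaN };

// Classify a double by looking only at its stored exponent and mantissa bits.
FloatKind Classify(double value)
{
  static_assert(std::numeric_limits<double>::is_iec559, "expected IEEE 754");
  static_assert(sizeof(double) == sizeof(std::uint64_t), "expected 64 bits");

  std::uint64_t bits = 0;
  std::memcpy(&bits, &value, sizeof(bits));

  // Lower 52 bits are the mantissa, the next 11 are the biased exponent.
  const std::uint64_t mantissa = bits & ((std::uint64_t(1) << 52) - 1);
  const std::uint64_t exponent = (bits >> 52) & 0x7FF;

  if (exponent == 0)
    { return (mantissa == 0) ? FloatKind::Zero : FloatKind::Subnormal; }
  if (exponent == 0x7FF)
    { return (mantissa == 0) ? FloatKind::Infinity : FloatKind::NaN; }
  return FloatKind::Normal;
}
```

<p>Note that the sign bit is ignored here, so both 0.0 and -0.0 come back as <code>FloatKind::Zero</code>.</p>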
<p>If you wanted to write a bit of math to calculate the value of a 64-bit float (ignoring the two special exponent cases) it would look something like this (where <code>bias</code><span class="attrs"></span> in this case is 1023):</p>
<pre class="language-cpp"><code class="language-cpp"><span class="token punctuation">(</span>signBit <span class="token operator">?</span> <span class="token operator">-</span><span class="token number">1</span> <span class="token operator">:</span> <span class="token number">1</span><span class="token punctuation">)</span> 
  <span class="token operator">*</span> <span class="token punctuation">(</span><span class="token number">1</span> <span class="token operator">+</span> <span class="token punctuation">(</span>mantissaBits <span class="token operator">/</span> <span class="token function">Pow2</span><span class="token punctuation">(</span><span class="token number">52</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
  <span class="token operator">*</span> <span class="token function">Pow2</span><span class="token punctuation">(</span>exponentBits <span class="token operator">-</span> bias<span class="token punctuation">)</span></code></pre>
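<p>As a sanity check, the same calculation can be written as real C++. This is only a sketch for normal values: <code>ValueFromBits</code> is a made-up name, <code>std::ldexp</code> stands in for the multiply by a power of two, and the 52-bit mantissa is treated as a fraction (divided by 2^52):</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Rebuild the value of a normal 64-bit float from its raw bits.
double ValueFromBits(std::uint64_t bits)
{
  const int bias = 1023;
  const bool signBit = (bits >> 63) != 0;
  const int exponentBits = int((bits >> 52) & 0x7FF);
  const std::uint64_t mantissaBits = bits & ((std::uint64_t(1) << 52) - 1);

  // Like the formula, this ignores the two special exponent cases.
  assert(exponentBits != 0 && exponentBits != 0x7FF);

  // The implied 1 plus the mantissa as a 52-bit fraction.
  const double significand = 1.0 + std::ldexp(double(mantissaBits), -52);

  // Scale by 2^(exponentBits - bias), then apply the sign.
  const double scaled = std::ldexp(significand, exponentBits - bias);
  return signBit ? -scaled : scaled;
}
```

<p>Feeding it <code>0x3FF0000000000000</code> gives 1.0, and <code>0xBFF8000000000000</code> gives -1.5.</p>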
<h3>Standard Float Rounding</h3>
<p>Okay, knowing how floats are stored, clearly math in the computer isn’t done with infinite precision, so when you do an operation that drops some precision, how does the result get rounded?</p>
<p>When an addition or subtraction is done on values with mismatched exponents, the value with the lower exponent is effectively shifted to the right by the difference so that the exponents match.</p>
<p>For example, here’s the subtraction of two four-bit-significand (3 bits of mantissa plus the implied 1) floats:</p>
<pre class="language-cpp"><code class="language-cpp">  <span class="token number">1.000</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">5</span>
<span class="token operator">-</span> <span class="token number">1.001</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">1</span></code></pre>
<p>The number being subtracted has the smaller exponent, so we end up shifting it to the right to compensate (for now, doing it as if we had no limit on extra digits):</p>
<pre class="language-cpp"><code class="language-cpp">  <span class="token number">1.000</span> <span class="token number">0000</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">5</span>
<span class="token operator">-</span> <span class="token number">0.000</span> <span class="token number">1001</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">5</span> <span class="token comment">// Shifted right to match exponents</span>
<span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span>
  <span class="token number">0.111</span> <span class="token number">0111</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">5</span>
  <span class="token number">1.110</span> <span class="token number">111</span>  <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">4</span> <span class="token comment">// shifted left to normalize (fix the implied 1)</span>
  <span class="token number">1.111</span>      <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">4</span> <span class="token comment">// round up since we had more than half off the edge</span></code></pre>
<p>Note that in this example, the bits of the value being subtracted shifted completely past the end of the mantissa’s available bits. Since we can’t store infinite off-the-end digits, what do we do?</p>
<p>Float math uses three extra bits (to the “right” of the mantissa), called the <strong>guard bit</strong>, the <strong>round bit</strong>, and the <strong>sticky bit</strong>.</p>
<p>As the mantissa shifts off the end, it shifts into these bits. This works basically like a normal shift right, with the exception that the moment that ANY 1 bit gets shifted into the sticky bit, it stays 1 from that point on (that’s what makes it sticky).</p>
<p>For instance:</p>
<pre class="language-cpp"><code class="language-cpp">      G R S
<span class="token number">1.001</span> <span class="token number">0</span> <span class="token number">0</span> <span class="token number">0</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">1</span>
<span class="token number">0.100</span> <span class="token number">1</span> <span class="token number">0</span> <span class="token number">0</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">2</span> <span class="token comment">// 1 shifts into the guard bit</span>
<span class="token number">0.010</span> <span class="token number">0</span> <span class="token number">1</span> <span class="token number">0</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">3</span> <span class="token comment">// now into the round bit</span>
<span class="token number">0.001</span> <span class="token number">0</span> <span class="token number">0</span> <span class="token number">1</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">4</span> <span class="token comment">// now into the sticky bit</span>
<span class="token number">0.000</span> <span class="token number">1</span> <span class="token number">0</span> <span class="token number">1</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">5</span> <span class="token comment">// sticky bit stays 1 now</span></code></pre>
<p>Note that the sticky bit stayed 1 on that last shift, even though in a standard right shift it would have gone off the end. Basically if you take the mantissa plus the 3 GRS bits (not to be confused with certain <em>cough</em> other meanings of GRS) and shift it to the right, the operation is the equivalent of:</p>
<pre class="language-cpp"><code class="language-cpp">mantissaAndGRS <span class="token operator">=</span> <span class="token punctuation">(</span>mantissaAndGRS <span class="token operator">>></span> <span class="token number">1</span><span class="token punctuation">)</span> <span class="token operator">|</span> <span class="token punctuation">(</span>mantissaAndGRS <span class="token operator">&amp;</span> <span class="token number">1</span><span class="token punctuation">)</span></code></pre>
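<p>Wrapped up as a pair of small functions (the name <code>StickyShiftRight</code> is made up for this sketch, and the value is assumed to be the mantissa with the three GRS bits packed underneath it), that looks like:</p>

```cpp
#include <cassert>
#include <cstdint>

// Shift right by one, OR-ing the bit that falls off the end back into the
//  lowest (sticky) bit so that it is never lost.
std::uint32_t StickyShiftRight(std::uint32_t mantissaAndGRS)
{
  return (mantissaAndGRS >> 1) | (mantissaAndGRS & 1);
}

// Apply the sticky shift `count` times.
std::uint32_t StickyShiftRight(std::uint32_t mantissaAndGRS, int count)
{
  assert(count >= 0);
  for (int i = 0; i < count; i++)
    { mantissaAndGRS = StickyShiftRight(mantissaAndGRS); }
  return mantissaAndGRS;
}
```

<p>Shifting the bits <code>1001 000</code> (1.001 with empty GRS bits) right four times this way gives <code>0000 101</code>, matching the last row of the table above.</p>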
<p>Now when determining whether to round, you can take the 3 GRS bits and treat them as GRS/8 (i.e. GRS bits of 100b are the equivalent of 0.5 (4/8), and 101b is 0.625 (5/8)), and use that as the fraction that determines whether/how you round.</p>
<p>The standard float rounding mode is <strong>round-to-nearest, even-on-ties</strong>: that is, if it could round either way (think 1.5, which is equally close to either 1.0 or 2.0), you round to whichever of the neighboring values is even (so 1.5 and 2.5 would both round to 2).</p>
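<p>The same tie-breaking rule shows up when rounding to whole numbers with <code>std::nearbyint</code>, which follows the current rounding mode. A tiny sketch, assuming the rounding mode has been left at its default (the wrapper name <code>RoundToNearestEven</code> is invented here):</p>

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Round to a whole number using the current rounding mode, which this
//  sketch assumes is still the default (round-to-nearest, even-on-ties).
double RoundToNearestEven(double value)
{
  assert(std::fegetround() == FE_TONEAREST);
  return std::nearbyint(value);
}
```

<p>With this, 1.5 and 2.5 both come back as 2.0, while 3.5 comes back as 4.0.</p>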
<p>Using our bits, the logic, then, is this:</p>
<ul>
<li>If the <strong>guard bit is not set</strong>, then it rounds down (fraction is &lt; 0.5), mantissa doesn’t change.</li>
<li>If the <strong>guard bit IS set</strong>:
<ul>
<li>If <strong>round bit or sticky bit is set</strong>, always round up (fraction is &gt; 0.5), mantissa increases by 1.</li>
<li>Otherwise, <strong>it’s a tie</strong> (exactly 0.5, could round either way), so round such that the mantissa is even (the lower bit of the mantissa is 0): the mantissa increments only if its lower bit was 1 (to make it even).</li>
</ul>
</li>
</ul>
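<p>The rules above can be sketched as a small function. This is a rough illustration rather than how any particular hardware does it: <code>RoundGRS</code> is a made-up name, and it assumes the input is the mantissa with the guard, round, and sticky bits packed into the lowest three bits:</p>

```cpp
#include <cassert>
#include <cstdint>

// Round a mantissa that carries 3 GRS bits below it, using round-to-nearest,
//  even-on-ties. Returns the mantissa with the GRS bits removed.
// (If the increment carries into a new top bit, renormalizing is left to
//  the caller.)
std::uint32_t RoundGRS(std::uint32_t mantissaAndGRS)
{
  const bool guardBit  = ((mantissaAndGRS >> 2) & 1) != 0;
  const bool roundBit  = ((mantissaAndGRS >> 1) & 1) != 0;
  const bool stickyBit = (mantissaAndGRS & 1) != 0;
  const std::uint32_t mantissa = mantissaAndGRS >> 3;

  // Fraction is < 0.5: round down, mantissa doesn't change.
  if (!guardBit)
    { return mantissa; }

  // Fraction is > 0.5: always round up.
  if (roundBit || stickyBit)
    { return mantissa + 1; }

  // Exactly 0.5: increment only if the mantissa is odd (making it even).
  return mantissa + (mantissa & 1);
}
```

<p>For example, <code>1110 111</code> rounds up to <code>1111</code>, <code>1110 100</code> is a tie that stays at the even <code>1110</code>, and <code>1101 100</code> is a tie that bumps up to the even <code>1110</code>.</p>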
<p>Okay, so if all we care about is the guard bit and then whether anything after it is set, why even have three bits? Isn’t two bits enough?</p>
<p>Nope! Well, sometimes, but not always. Turns out, some operations (like subtraction) can require a left shift by one to normalize the result (like in the above subtraction example), which means if you only had two bits of extra-mantissa information (just, say, a round and sticky bit) you’d be left with one bit of information after the left shift and have no idea if there’s a rounding tiebreaker.</p>
<p>For instance, here’s an operation with the proper full guard, round, and sticky bits:</p>
<pre class="language-cpp"><code class="language-cpp">  <span class="token number">1.000</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">5</span>
<span class="token operator">-</span> <span class="token number">1.101</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">1</span>

  <span class="token comment">// Shift into the GRS bits:</span>
  <span class="token number">1.000</span> <span class="token number">000</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">5</span>
<span class="token operator">-</span> <span class="token number">0.000</span> <span class="token number">111</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">5</span> <span class="token comment">// sticky kept that lowest 1</span>
<span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span>
  <span class="token number">0.111</span> <span class="token number">001</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">5</span>
  <span class="token number">1.110</span> <span class="token number">01</span>  <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">4</span> <span class="token comment">// shift left 1, still 2 digits left</span>
  <span class="token number">1.110</span>     <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">4</span> <span class="token comment">// Round down properly</span></code></pre>
<p>If this were done with only two bits (round and sticky) we would end up with the following:</p>
<pre class="language-cpp"><code class="language-cpp">  <span class="token number">1.000</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">5</span>
<span class="token operator">-</span> <span class="token number">1.101</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">1</span>

  <span class="token comment">// Shift into just RS bits:</span>
  <span class="token number">1.000</span> <span class="token number">00</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">5</span>
<span class="token operator">-</span> <span class="token number">0.000</span> <span class="token number">11</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">5</span>
<span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span>
  <span class="token number">0.111</span> <span class="token number">01</span> <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">5</span>
  <span class="token number">1.110</span> <span class="token number">1</span>  <span class="token operator">*</span> <span class="token number">2</span><span class="token operator">^</span><span class="token number">4</span> <span class="token comment">// shift left 1, only 1 digit left</span></code></pre>
<p>Once we shift left there, we only have one bit of data, and it’s set. We don’t know whether we had a fraction &gt; 0.5 (meaning we have to round up) or exactly 0.5 (a tie, meaning we round to even, which is down in this case).</p>
<p>So the short answer is: three bits because sometimes we have to shift left and lose one bit of the information, and we need at least a “half bit” and a “tiebreaker bit” at all times. Given all the float operations, 3 suffices for this, always.</p>
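<p>To make the difference concrete, here is a toy model of the four-bit-significand subtraction used in this post, with the number of extra bits as a parameter. Everything in it (the name <code>ToySubtract</code>, the integer encoding of the significand, the restriction to a larger value minus a smaller one) is an assumption made for this sketch, not a description of real hardware:</p>

```cpp
#include <cassert>
#include <cstdint>

// Toy model: subtract two 4-bit-significand floats (1.xxx * 2^exp, with the
//  significand passed as the integer 1xxx), keeping only `extraBits` bits
//  below the significand. Assumes a > b > 0 and expA >= expB.
void ToySubtract(std::uint32_t sigA, int expA, std::uint32_t sigB, int expB,
                 int extraBits, std::uint32_t *sigOut, int *expOut)
{
  assert(extraBits >= 1 && expA >= expB);

  // Make room for the extra bits below both significands.
  std::uint32_t a = sigA << extraBits;
  std::uint32_t b = sigB << extraBits;

  // Align b to a's exponent using sticky right shifts.
  for (int i = 0; i < expA - expB; i++)
    { b = (b >> 1) | (b & 1); }

  std::uint32_t diff = a - b;
  int exp = expA;
  assert(diff != 0);

  // Normalize: shift left until the implied 1 is back in place.
  const std::uint32_t impliedOne = std::uint32_t(8) << extraBits;
  while ((diff & impliedOne) == 0)
    { diff <<= 1; exp--; }

  // Round to nearest (even on ties) with whatever extra bits remain.
  const std::uint32_t half = std::uint32_t(1) << (extraBits - 1);
  const std::uint32_t fraction = diff & ((std::uint32_t(1) << extraBits) - 1);
  std::uint32_t sig = diff >> extraBits;
  if (fraction > half || (fraction == half && (sig & 1) != 0))
    { sig++; }
  if (sig == 16) // Rounding carried into a fifth bit; renormalize.
    { sig = 8; exp++; }

  *sigOut = sig;
  *expOut = exp;
}
```

<p>Running the first subtraction example from this post (1.000 * 2^5 - 1.001 * 2^1, whose exact answer is 29.75) through this model with three extra bits gives 1.111 * 2^4 (30), the correctly rounded result. With only two extra bits the leftover bit looks like a tie, so it rounds to even and gives 1.110 * 2^4 (28) instead. The 1.101 example happens to land on the right answer either way in this model, but only because the fake tie rounds in the lucky direction.</p>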

        ]]>
      </content:encoded>
    </item>
    <item>
      <title><![CDATA[Resurrected]]></title>
      <link>https://www.drilian.com/posts/2023.01.01-resurrected/</link>
      <pubDate>Sun, 01 Jan 2023 12:00:00 PST</pubDate>
      <dc:creator><![CDATA[Josh Jersild]]></dc:creator>
      <guid>https://www.drilian.com/posts/2023.01.01-resurrected/</guid>
      <content:encoded>
        <![CDATA[
          <p>A while back (over a year ago) this blog got hacked and so it’s been down for a hot minute. I’ve brought it back online and now I can finally post things on it again – I have a few things that I’ll post here and there but honestly it’s unlikely to ever be really FREQUENT around here.</p>
<p>But for now, it’s back!</p>

        ]]>
      </content:encoded>
    </item>
    <item>
      <title><![CDATA[Procyon is Greenlit!]]></title>
      <link>https://www.drilian.com/posts/2014.01.11-procyon-is-greenlit/</link>
      <pubDate>Sat, 11 Jan 2014 12:00:00 PST</pubDate>
      <dc:creator><![CDATA[Josh Jersild]]></dc:creator>
      <guid>https://www.drilian.com/posts/2014.01.11-procyon-is-greenlit/</guid>
      <content:encoded>
        <![CDATA[
          <p>Valve recently announced that Procyon is in the <a href="http://steamcommunity.com/sharedfiles/filedetails/?id=205494242" target="_blank" rel="noopener">most-recent batch of titles</a> to be given the green light for release on <a href="http://store.steampowered.com/" target="_blank" rel="noopener">Steam</a>! This is super-exciting news!</p>
<p>There are a few things that I want to add to Procyon before it’s ready for general Steam release: achievements, proper leaderboards, Steam overlay support, etc. Basically, all of the Steam features that make sense.</p>
<p>It’ll be a bit before it gets done, as we’re working on putting the finishing touches on <a href="http://www.youtube.com/watch?v=o-B40rzJHOY" target="_blank" rel="noopener">Infamous: Second Son</a> at work, so I’m a little dead by the time I get home at the moment. But once we’ve shipped, I’ll likely have the time and energy to get Procyon rolling out onto Steam.</p>
<p>Yeaaaaaaaah!</p>

        ]]>
      </content:encoded>
    </item>
  </channel>
</rss>