Quake III Arena and the fast inverse square root algorithm
Author: Paul King
Published: 2023-02-28 12:05AM
Introduction
In 1999, id Software released Quake III Arena, a multiplayer first-person shooter.
When the game’s source code was later released to the world, it contained a previously unknown algorithm called the Fast Inverse Square Root. We don’t plan to explain the algorithm in depth, but its significance at the time is that it provided a very good approximation to the function f(x) = 1/√x, which is used for vector normalization and other math behind the game, and which is needed extensively for numerous graphical aspects including 3D shading. The fast algorithm calculated the answer 3-4 times faster than the traditional math library routines, with results accurate to within 0.2%.
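To see why this function matters so much in a 3D game, consider vector normalization: dividing each component of a vector by its length is the same as multiplying it by the inverse square root of its squared length. Here is a minimal Groovy sketch of that idea (our own illustration, not code from the Quake source):

// Hypothetical helper: normalize a 3D vector by multiplying each component by
// 1/sqrt(lengthSquared), the quantity the fast algorithm approximates.
static float[] normalize(float x, float y, float z) {
    float lenSq = x * x + y * y + z * z
    float inv = (float) (1.0d / Math.sqrt(lenSq))
    [x * inv, y * inv, z * inv] as float[]
}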
Here is the code:
float Q_rsqrt( float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;                       // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );               // what the f☼⁕k?
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//  y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

    return y;
}
Why does it look strange? There are numerous parts which you wouldn’t expect to be part of a square root calculation. There are some tricks for reinterpreting the IEEE 754 32-bit float representation as a long and back, a "magic" constant, some bit shifting, and a touch of Newton’s method thrown in for good measure.
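The Newton’s method part is the easiest piece to pin down: given an estimate y of 1/sqrt(x), one Newton-Raphson step for g(y) = 1/y^2 - x refines the estimate to y * (1.5 - 0.5 * x * y * y), which is exactly what the "1st iteration" line above computes (x2 is 0.5 * x). As a standalone Groovy sketch (the method name is ours):

// One Newton-Raphson refinement step for g(y) = 1/y^2 - x, whose positive root is 1/sqrt(x).
// Each extra step roughly doubles the number of correct digits in the estimate.
static float newtonStep(float x, float y) {
    y * (1.5f - 0.5f * x * y * y)
}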
The details have been explained in numerous blogs, e.g. Fast Square Root Approximation and Understanding Quake’s Fast Inverse Square Root, and videos, e.g. Fast Inverse Square Root — A Quake III Algorithm and Quake III Arena’s Unknown Algorithm Explained. There are even sites which show the algorithm in numerous languages (including Groovy).
Not long after the game’s release, Intel added SQRTSS and RSQRTSS instructions to x86 processors. Here is one of the JDK enhancements to provide SQRTSS support for float Math.sqrt(). There are later blogs which explain that, due to these and other hardware advances, the algorithm is now mostly only relevant for historical folklore reasons, e.g. Death of Quake 3’s Inverse Square Root Hack and Is Fast Inverse Square Root still Fast?.
Let’s have a look at a few alternatives for calculating the inverse square root, including two variants of the fast inverse square root algorithm. We’ll do this for Java and Groovy, and look at both float and double implementations. We could go further and try different numbers of iterations of the Newton’s method correction, but we’ll leave that as an exercise for the reader. 😊
Java Float implementations
There are numerous places to see Java examples of the algorithm and numerous ways you could run a microbenchmark. We are using code somewhat similar to this gist. We could have been more elaborate and used jmh, but we aren’t trying to do a comprehensive performance analysis here, we just want some rough ballpark numbers. Perhaps that is the topic for a subsequent blog.
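To give a feel for the shape of such a harness, here is a rough Groovy sketch (the method name and input range are ours; the timing code we actually used is closer to the gist):

import java.util.concurrent.ThreadLocalRandom

// Sketch of a simple timing loop: run many passes over a batch of random numbers
// and report the average time per pass. Not the exact harness used for the results below.
static double averageNanosPerPass(int batchSize, int passes, Closure<Float> invsqrt) {
    float[] numbers = new float[batchSize]
    for (int i = 0; i < batchSize; i++) {
        numbers[i] = ThreadLocalRandom.current().nextFloat() * 100f + 0.01f   // arbitrary positive range
    }
    float sink = 0f                         // accumulate results so the work can't be optimized away
    long start = System.nanoTime()
    for (int p = 0; p < passes; p++) {
        for (float n in numbers) {
            sink += invsqrt.call(n)
        }
    }
    long elapsed = System.nanoTime() - start
    assert sink != 0f                       // keep the accumulator live
    elapsed / (double) passes               // average nanoseconds per pass over the batch
}

// e.g. averageNanosPerPass(100_000, 1000) { float x -> invsqrt0(x) }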
As with all microbenchmarks, you should exercise extreme caution before applying too much weight to the results. The numbers will be different on different machines, with different Java and Groovy versions and so forth. We used Groovy 4.0.8 and Java 17.0.2.
We start with the most obvious implementation, which is to use the Math.sqrt method (it takes and returns a double):
public static float invsqrt0(float x) { // Java
    return 1.0f / (float) Math.sqrt(x);
}
Compared to the C implementation above, Java doesn’t have the bit casting trick, but it does have its own methods for getting at the bit representation:
public static float invsqrt1(float x) { // Java
    float xhalf = 0.5f * x;
    int i = Float.floatToIntBits(x);
    i = 0x5f3759df - (i >> 1);
    x = Float.intBitsToFloat(i);
    x = x * (1.5f - (xhalf * x * x));
    return x;
}
As an alternative, we can use a byte buffer to mangle the bits back and forth:
private static ByteBuffer buf = ByteBuffer.allocateDirect(4); // Java

public static float invsqrt2(float x) {
    float xhalf = 0.5f * x;
    int i = buf.putFloat(0, x).getInt(0);
    i = 0x5f3759df - (i >> 1);
    x = buf.putInt(0, i).getFloat(0);
    x = x * (1.5f - (xhalf * x * x));
    return x;
}
Java Double implementations
Again, we’ll start with the obvious implementation:
public static double invsqrt0(double x) { // Java
    return 1.0d / Math.sqrt(x);
}
Again, using the built-in methods for getting at the bit representation:
public static double invsqrt1(double x) { // Java
    double xhalf = 0.5d * x;
    long i = Double.doubleToLongBits(x);
    i = 0x5fe6ec85e7de30daL - (i >> 1);
    x = Double.longBitsToDouble(i);
    x *= (1.5d - xhalf * x * x);
    return x;
}
The code resembles the float version but the "magic" constant has changed for doubles.
The byte buffer alternative:
private static ByteBuffer buf = ByteBuffer.allocateDirect(8); // Java

public static double invsqrt2(double x) {
    double xhalf = 0.5d * x;
    long i = buf.putDouble(0, x).getLong(0);
    i = 0x5fe6ec85e7de30daL - (i >> 1);
    x = buf.putLong(0, i).getDouble(0);
    x *= (1.5d - xhalf * x * x);
    return x;
}
For comparison, we can also try the Math.pow method:
public static double invsqrt3(double x) { // Java
    return Math.pow(x, -0.5d);
}
(We could have done this for float too but it doesn’t add much to our analysis since it would call through to this double method anyway.)
Groovy Float implementations
All these examples are compiled with static compilation enabled. We want speed and aren’t doing any metaprogramming, so we don’t need Groovy’s dynamic capabilities.
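Concretely, that just means the example methods live inside a class annotated with @CompileStatic, along these lines (the class name is ours):

import groovy.transform.CompileStatic

// Sketch of how the Groovy examples are packaged: annotating the enclosing class with
// @CompileStatic makes Groovy resolve calls and arithmetic at compile time rather than
// via dynamic dispatch.
@CompileStatic
class InvSqrt {
    // the invsqrt variants shown in the following sections go here
}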
Our code looks similar to Java for the obvious case:
static float invsqrt0(float x) {
    1.0f / Math.sqrt(x)
}
Again, the code is similar to Java for the fast algorithm:
static float invsqrt1(float x) {
    float xhalf = 0.5f * x
    int i = Float.floatToIntBits(x)
    i = 0x5f3759df - (i >> 1)
    x = Float.intBitsToFloat(i)
    x *= 1.5f - (xhalf * x * x)
}
And again with the byte buffer:
private static ByteBuffer buf = ByteBuffer.allocateDirect(4)

static float invsqrt2(float x) {
    float xhalf = 0.5f * x
    int i = buf.putFloat(0, x).getInt(0)
    i = 0x5f3759df - (i >> 1)
    x = buf.putInt(0, i).getFloat(0)
    x *= 1.5f - (xhalf * x * x)
}
We can also try Groovy’s ** operator (power method):
static float invsqrt4(float x) {
    (x ** -0.5).floatValue()
}
Groovy Double implementations
The standard method should look familiar by now:
static double invsqrt0(double x) {
    1.0d / Math.sqrt(x)
}
The fast algorithm:
static double invsqrt1(double x) {
    double xhalf = 0.5d * x
    long i = Double.doubleToLongBits(x)
    i = 0x5fe6ec85e7de30daL - (i >> 1)
    x = Double.longBitsToDouble(i)
    x *= (1.5d - xhalf * x * x)
}
Incorporating the byte buffer:
private static ByteBuffer buf = ByteBuffer.allocateDirect(8)

static double invsqrt2(double x) {
    double xhalf = 0.5d * x
    long i = buf.putDouble(0, x).getLong(0)
    i = 0x5fe6ec85e7de30daL - (i >> 1)
    x = buf.putLong(0, i).getDouble(0)
    x *= (1.5d - xhalf * x * x)
}
Using Math.pow:
static double invsqrt3(double x) {
    Math.pow(x, -0.5d)
}
Groovy’s ** operator (power method) again:
static double invsqrt4(double x) {
    (x ** -0.5).doubleValue()
}
Results
Here are the results of executing the above methods. We used a harness similar to the previously mentioned gist, but found the inverse square root of 100_000 random numbers instead of 10_000, and we used 1000 iterations in the timing loop and found the average execution time per 100_000 calculations.
| Algorithm vs Implementation               | Java Float | Java Double | Groovy Float | Groovy Double |
|-------------------------------------------|------------|-------------|--------------|---------------|
| Math.sqrt                                 | 0.216      | 0.254       | 0.359        | 0.245         |
| Fast inverse square root                  | 0.230      | 0.236       | 0.357        | 0.127         |
| Fast inverse square root with byte buffer | 0.337      | 0.364       | 0.486        | 0.187         |
| Math.pow(x, -0.5d)                        |            | 8.949       |              | 8.997         |
| x ** -0.5                                 |            |             | 0.737        | 1.807         |
Analysis
For all the examples, using the byte buffer was always slower than the original algorithm, so we can rule that out as an optimization. We can also rule out using the Math.pow method; it is much slower than Math.sqrt. Interestingly though, Groovy’s ** operator (power method), while still a slower option, was significantly better than the JDK library pow implementation.
The Java fast algorithm was slower for float and only marginally faster for doubles. It seems unlikely in most scenarios that taking on the extra code complexity is worth the small gain in performance.
The Groovy float implementations are a little slower. This is due to Groovy doing most of the calculations using doubles and converting back to floats in between steps. That is an area for possible optimization in the future.
The Groovy double implementations are at least as fast as Java. Interestingly, the fast algorithms seem to be even faster in Groovy. These do seem worth investigating further if you really need the speed, but given the extra precision of doubles, you might want to run extra Newton’s method iterations, and those iterations might eat into any time saving.
For interest, the error for the fast algorithm on the random numbers was very close to 6E-5 for all implementations.
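Exactly how that figure was aggregated is a detail of the harness, but if you want to sanity-check the accuracy yourself, a rough approach is to compare the fast approximation against Math.sqrt over random inputs. Here is a sketch (the method name and input range are ours; invsqrt1 is the Groovy float version from earlier):

import java.util.concurrent.ThreadLocalRandom

// Sketch of an accuracy check: average relative error of the single-iteration fast
// approximation versus the library answer, over a batch of random positive inputs.
static double averageRelativeError(int samples) {
    double total = 0.0d
    for (int i = 0; i < samples; i++) {
        float x = ThreadLocalRandom.current().nextFloat() * 100f + 0.01f   // arbitrary positive range
        double exact = 1.0d / Math.sqrt((double) x)
        total += Math.abs(invsqrt1(x) - exact) / exact
    }
    total / samples
}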
Conclusion
We’ve done a little microbenchmark using Java and Groovy for the fast inverse square root algorithm. The result? You probably don’t need to ever worry about the fast inverse square root algorithm anymore! If you really have the need, it is still relatively fast, but benchmark your application to see whether it really helps.