MurmurHash3 Optimization in C#: From Good to Great with Zero Allocations

Welcome to the exciting world of MurmurHash3 optimization in C#! If you’re reading this, chances are you’re familiar with this powerful hash function and its role in optimizing data storage and retrieval. But let’s face it: many implementations just don’t cut it. For example, they might be a bit slow* or allocate too much memory.

*To me, even the speed of sound feels slow. Time to get optimizing!

In this article, we’ll explore how I approached the optimization of MurmurHash3 in C#, even when it seemed like there wasn’t much room for improvement. You’ll see the steps I took to take MurmurHash3 to the next level and the techniques and strategies used to achieve the highest performance.

So grab your favorite beverage, get comfortable, and let’s optimize some MurmurHash3!

Why?

The answer was clear: I am writing a key-value database and want it to be as fast and efficient as possible. To achieve this goal, I needed a high-performance hashing algorithm that would quickly hash keys on the fly. That’s when I laid my eyes on MurmurHash3.

What?

In short: Data -> Hash. Here is an excellent article to learn more about the algorithm and how it works.
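In code, the end goal looks something like this tiny sketch (the Hash32 signature mirrors the implementation we end up with later in this article; the seed value is arbitrary):

```csharp
using System;
using System.Text;

// Any byte sequence in, a 32-bit hash out.
byte[] key = Encoding.UTF8.GetBytes("hello");
uint hash = MurmurHash3.Hash32(key, seed: 0);
Console.WriteLine(hash);
```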

Research

Searching for "MurmurHash3 C#" via Google yields many results.

However, as mentioned before, I will use this in a performance-critical scenario and in a hot path! So when I see:

byte[]        

Or any other reference type, I automatically move on to the next search result. After a relatively short amount of digging, I came across the following implementation on GitHub:

using System;
using System.Runtime.CompilerServices;

namespace MurmurHash.Net
{
    public class MurmurHash3
    {
        public static uint Hash32(ReadOnlySpan<byte> bytes, uint seed)
        {
            var length = bytes.Length;
            var h1 = seed;
            var remainder = length & 3;
            var position = length - remainder;
            for (var start = 0; start < position; start += 4)
                h1 = (uint) ((int) RotateLeft(h1 ^ RotateLeft(BitConverter.ToUInt32(bytes.Slice(start, 4)) * 3432918353U,15) * 461845907U, 13) * 5 - 430675100);

            if (remainder > 0)
            {
                uint num = 0;
                switch (remainder)
                {
                    case 1:
                        num ^= (uint) bytes[position];
                        break;
                    case 2:
                        num ^= (uint) bytes[position + 1] << 8;
                        goto case 1;
                    case 3:
                        num ^= (uint) bytes[position + 2] << 16;
                        goto case 2;
                }

                h1 ^= RotateLeft(num * 3432918353U, 15) * 461845907U;
            }

            h1 = FMix(h1 ^ (uint) length);

            return h1;
        }
        
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal static uint RotateLeft(uint x, byte r)
        {
            return x << (int) r | x >> 32 - (int) r;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal static uint FMix(uint h)
        {
            h = (uint) (((int) h ^ (int) (h >> 16)) * -2048144789);
            h = (uint) (((int) h ^ (int) (h >> 13)) * -1028477387);
            return h ^ h >> 16;
        }
    }
}        

Two positive things I immediately noticed:

Firstly: ReadOnlySpan<T>. The remarks about this type on the Microsoft docs: “ReadOnlySpan<T> is a ref struct that is allocated on the stack and can never escape to the managed heap.” Amazing, this means we won’t allocate any memory on the heap when using this type!
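A quick illustration of why this matters (my example, not the article’s):

```csharp
using System;

byte[] buffer = { 1, 2, 3, 4, 5, 6, 7, 8 };

// The span is a stack-only view over the existing array:
// slicing it allocates nothing on the managed heap.
ReadOnlySpan<byte> view = buffer.AsSpan(2, 4);
```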

Secondly: [MethodImpl(MethodImplOptions.AggressiveInlining)]. Let’s split this one up. We have the MethodImplAttribute, which we can stick on any method or constructor, specifying how a method is implemented. In the case of MethodImplOptions.AggressiveInlining, we instruct the just-in-time (JIT) compiler to inline this method if possible.

Method inlining is a performance optimization technique in which the code for a method is inserted directly into the code for the method that calls it, eliminating the need for a separate function call.
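As an illustration (not the article’s code), the attribute is simply stamped on the method you want inlined:

```csharp
using System.Runtime.CompilerServices;

static class HotPath
{
    // Hint to the JIT: paste this body into callers instead of emitting a call.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static uint RotateLeft(uint x, byte r) => x << r | x >> (32 - r);
}
```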

They also have benchmark results in the README! And compared to the other implementations, it’s faster and doesn’t allocate any memory on the heap.

Benchmark results in the README of MurmurHash.Net

Hold on a second... 12.28us for a single hash? That’s not fast enough for me. But after looking closer at the benchmarking code, I see it’s calculating 500 hashes of Guids. So phew, there is hope.

Let's calculate the time taken for a single hash:

12.28us / 500 = 0.02456us

0.02456us * 1000 = 24.56ns

So that’s 24.56ns per Guid hash on their PC. I have to run the benchmark on my computer to make it a fair comparison. So I cloned their repository and ran their benchmark.

(The only adjustment I made was removing the other implementations from the benchmark, as I am not interested in those results.)

Benchmark result on my rig

Same calculation:

7.433us / 500 = 0.014866us

0.014866us * 1000 = 14.866ns


14.866 Nanoseconds

The time to beat. Though, to be fair, the benchmarking code used to determine this number could be more optimal.

So let’s define a quality benchmark that gives a fair indication of the algorithm. Three cases:

  1. A short string, 3 characters.
  2. A medium string, 40 characters.
  3. A large string, 128 characters.

The benchmarking code:

Benchmark code

A few things worth mentioning:

  1. The conversion of the string input to a byte[] is done outside of the benchmark
  2. We get a Span<byte> over this byte[] inside our benchmark and it gets implicitly cast to a ReadOnlySpan<byte>.

This gives us a very accurate benchmark of the algorithm itself, since we measure nothing besides calculating the hash.
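The benchmark code shown in the image boils down to something like this sketch (class, method names, and the seed are mine; the actual inputs are the three strings listed further down):

```csharp
using System;
using System.Text;
using BenchmarkDotNet.Attributes;

public class Murmur3Benchmarks
{
    // Point 1: string -> byte[] conversion happens once, outside the benchmark.
    private readonly byte[] _short = Encoding.UTF8.GetBytes("key");
    private readonly byte[] _medium = Encoding.UTF8.GetBytes(new string('m', 40));
    private readonly byte[] _large = Encoding.UTF8.GetBytes(new string('l', 128));

    // Point 2: the Span<byte> is created inside the benchmark and is
    // implicitly converted to ReadOnlySpan<byte> at the call site.
    [Benchmark] public uint Short() => MurmurHash3.Hash32(_short.AsSpan(), 541u);
    [Benchmark] public uint Medium() => MurmurHash3.Hash32(_medium.AsSpan(), 541u);
    [Benchmark] public uint Large() => MurmurHash3.Hash32(_large.AsSpan(), 541u);
}
```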

With that out of the way, let’s run the benchmark:

Quality benchmark for MurmurHash.Net

The inputs used:

  1. "key"
  2. "a$6ajXViSAfFw5pR2kkz3Q28YGrDx$jeaLJ5HFPe"
  3. "!zhgt#HVY#tV%kPPZ$LXYEo@EqyKjqRJPzUb3*hhASWpdyZAF3!t$V96j9Eb9ivzMH2w4jvuyHaXRxd&YbHz*W8yZGJ#CXjXfqMzNGgf@YMfh*RdZpRXtPQ3mV$N9N!%"

*Cracks knuckles* Game on!

Research ✔, quality benchmark to determine our baseline ✔, motivation to get those nanoseconds down into the zeptoseconds ✔.

The process of optimization

Let’s analyze the code bit by bit. But first, let’s start by looking at the method signature:

The method signature.

I mentioned earlier that [MethodImpl(MethodImplOptions.AggressiveInlining)] will tell JIT to inline the method if possible. This may result in a small performance benefit.

Let’s add the attribute and run the benchmark with the original code as the baseline. Also, for good measure, I made the class static.

Our new code:

Iteration 1

The result:

Benchmark after adding [MethodImpl(MethodImplOptions.AggressiveInlining)]

I added a setting to the benchmark, which outputs all the inlining that is being done. In addition, I made sure to filter by namespace:

InliningDiagnoser
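The diagnoser lives in the BenchmarkDotNet.Diagnostics.Windows package; the setup looks roughly like this (the exact argument names are from memory, so treat this as a sketch):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnostics.Windows.Configs;

// Log every inlining decision the JIT makes, filtered to our namespace
// so the output is not flooded with framework methods.
[InliningDiagnoser(logFailuresOnly: false, allowedNamespaces: new[] { "MurmurHash" })]
public class Murmur3Benchmarks
{
    // ... benchmarks ...
}
```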

Let’s inspect the logs it generates for the 128-character input on the original implementation we found:

Fails to inline without  [MethodImpl(MethodImplOptions.AggressiveInlining)]

However, it does inline the RotateLeft calls and the FMix call.

Inlines RotateLeft and FMix

Let's check out our first iteration:

First iteration of our optimization gets inlined!

It got inlined! This same inlining behavior is consistent throughout the benchmark with the three different input sizes.

This means that [MethodImpl(MethodImplOptions.AggressiveInlining)] is effective on this method!

Let’s peek at the method’s body; RotateLeft is used multiple times.

MurmurHash.Net's RotateLeft and FMix implementation

Now, there is a class called BitOperations. And there is a method called RotateLeft in it! The methods in this class use hardware intrinsics when available on the underlying platform. So let’s replace our implementation and start using the BitOperations version.

By deleting our RotateLeft implementation and adding a “using static” directive, the code will resolve RotateLeft to the method on the BitOperations class.

Importing the BitOperations class
Showcasing the use of BitOperations.RotateLeft()
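In other words, the change amounts to this (my sketch):

```csharp
using static System.Numerics.BitOperations;

// RotateLeft now binds to BitOperations.RotateLeft, which the JIT lowers
// to a single hardware rotate instruction (e.g. rol on x86) when available.
uint mixed = RotateLeft(0xDEADBEEFu * 3432918353U, 15) * 461845907U;
```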

Also, we only call FMix once, and by having that code live in a different method, there is a chance it won’t be inlined. So let’s just move the body of the FMix method to the call site.

Before:

The callsite before

After:

The callsite after

Let’s try to optimize the loop by changing it to a do-while instead. This way, it will execute the loop body before checking the condition we give it. This can be useful if we know there is enough data to process.

Before:

The loop before

After:

The loop after

We do have to check up front whether we have 4 or more bytes. As a bonus, this gives us a fast path for inputs shorter than 4 bytes, which skip the loop entirely.
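Schematically, the transformation looks like this (a sketch of the shape with a stand-in body, not the article’s exact mixing code):

```csharp
using System;

// Before: the condition is tested before the body ever runs.
static uint SumForLoop(ReadOnlySpan<byte> bytes)
{
    var position = bytes.Length & ~3;
    uint acc = 0;
    for (var start = 0; start < position; start += 4)
        acc += BitConverter.ToUInt32(bytes.Slice(start, 4));
    return acc;
}

// After: one up-front guard, then the body runs before the condition
// is checked, and inputs under 4 bytes skip the loop entirely.
static uint SumDoWhile(ReadOnlySpan<byte> bytes)
{
    var position = bytes.Length & ~3;
    uint acc = 0;
    if (bytes.Length >= 4)
    {
        var start = 0;
        do
        {
            acc += BitConverter.ToUInt32(bytes.Slice(start, 4));
            start += 4;
        } while (start < position);
    }
    return acc;
}
```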

I find the “switch case” at the bottom of the hash method not very appealing. So let’s get rid of it.

Before:

The switch-case before

After:

The remainder handling after
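For reference, the goto-laden switch can be rewritten as a short chain of ifs over the trailing bytes, along these lines (using the same variable names as the method body):

```csharp
// Handle the 1-3 bytes that remain after the 4-byte loop.
uint num = 0;
if (remainder > 2) num ^= (uint)bytes[position + 2] << 16;
if (remainder > 1) num ^= (uint)bytes[position + 1] << 8;
num ^= bytes[position];

h1 ^= RotateLeft(num * 3432918353U, 15) * 461845907U;
```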

Unfortunately, the benchmark yields about the same result as the previous iteration of the implementation.

Entering the realm of unsafe

using System.Runtime.CompilerServices;
using static System.Numerics.BitOperations;

namespace MurmurHash3;

public static class MurmurHash3
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static unsafe uint Hash32(ReadOnlySpan<byte> bytes, uint seed)
    {
        var length = bytes.Length;
        var h1 = seed;
        var remainder = length & 3;

        fixed (byte* bp = bytes)
        {
            uint* endPoint = (uint*)bp + (length >> 2);
            if (length >= 4)
            {
                uint* ptr = (uint*)bp;
                do
                {
                    h1 = (uint)((int)RotateLeft(h1 ^ RotateLeft(*ptr * 3432918353U, 15) * 461845907U, 13) * 5 - 430675100);
                    ptr++;
                } while (ptr < endPoint);
            }

            if (remainder > 0)
            {
                // The tail must be read byte-by-byte, so reinterpret endPoint
                // as a byte pointer before indexing.
                byte* tail = (byte*)endPoint;
                uint num = 0;
                if (remainder > 2) num ^= (uint)tail[2] << 16;
                if (remainder > 1) num ^= (uint)tail[1] << 8;
                num ^= tail[0];

                h1 ^= RotateLeft(num * 3432918353U, 15) * 461845907U;
            }
        }

        h1 ^= (uint)length;
        h1 = (uint)(((int)h1 ^ (int)(h1 >> 16)) * -2048144789);
        h1 = (uint)(((int)h1 ^ (int)(h1 >> 13)) * -1028477387);
        return h1 ^ h1 >> 16;
    }
}

This initial implementation may not win any awards for its appearance, but its efficiency is evident in the results:

Benchmark results of the unsafe implementation

Incidentally, I was looking to modify the “Ratio” column in BenchmarkDotNet and came across a well-written, short article by Dave Callan. You can access the article at the following link: How to set the Ratio column style in BenchmarkDotNet results (davecallan.com).

This is the type of performance increase I was looking for. But why stop here? Right? Actually, every sensible person would call it a day here.

There are still some things that bother me in this code. One of them is the use of the “fixed” statement; in other words, the use of “unsafe”.

So instead of using unsafe, let’s use Unsafe. Pretty hard to see a difference there, but I am talking about using “System.Runtime.CompilerServices.Unsafe”.

The initial "Unsafe" implementation:

    // Requires: using System.Runtime.CompilerServices;
    //           using System.Runtime.InteropServices;
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static uint Hash32(ReadOnlySpan<byte> bytes, uint seed)
    {
        var length = (uint)bytes.Length;
        var h1 = seed;
        var remainder = length & 3;

        ref byte bp = ref MemoryMarshal.GetReference(bytes);
        ref uint endPoint = ref Unsafe.Add(ref Unsafe.As<byte, uint>(ref bp), (int)(length >> 2));
        if (length >= 4)
        {
            do
            {
                h1 = RotateLeft(h1 ^ RotateLeft(Unsafe.ReadUnaligned<uint>(ref bp) * 3432918353U, 15) * 461845907U, 13) * 5 - 430675100;
                bp = ref Unsafe.Add(ref bp, 4);
            } while (Unsafe.IsAddressLessThan(ref Unsafe.As<byte, uint>(ref bp), ref endPoint));
        }

        if (remainder > 0)
        {
            // As in the pointer version, the tail is read byte-by-byte,
            // so reinterpret endPoint as a byte reference before indexing.
            ref byte tail = ref Unsafe.As<uint, byte>(ref endPoint);
            uint num = 0;
            if (remainder > 2) num ^= (uint)Unsafe.Add(ref tail, 2) << 16;
            if (remainder > 1) num ^= (uint)Unsafe.Add(ref tail, 1) << 8;
            num ^= tail;

            h1 ^= RotateLeft(num * 3432918353U, 15) * 461845907U;
        }

        h1 ^= length;
        h1 = (uint)((h1 ^ (h1 >> 16)) * -2048144789);
        h1 = (uint)((h1 ^ (h1 >> 13)) * -1028477387);
        return h1 ^ h1 >> 16;
    }

“MurMurHash3_unsafe” is the previous implementation, and “MurMurHash3” is the code above.

Benchmark comparison of the two implementations

A tiny but significant improvement. This looked like the end of the optimization journey, but I had a gut feeling we could squeeze out more.

I analyzed the generated IL and the JIT ASM and figured I’d have to play with the number of variables I initialize.

With some more tweaks and adjustments, we were able to make it even faster.

You can find the final implementation and benchmarks on my GitHub: https://github.com/JeremyEspresso/MurmurHash

NuGet: NuGet Gallery | JeremyEspresso.MurmurHash 0.0.1

Conclusion

When optimizing code, it’s essential to weigh the raw performance gains against the readability and maintainability of the final code. While spending hours to gain roughly 1.5x on average might be worth it for certain developers or applications, it’s not necessarily a wise use of time for the average developer. This is especially true when the optimization makes the code difficult to read and understand. Ultimately, the key takeaway is: unless you have a really good reason, don’t spend your time optimizing at the nanosecond level.

I hope you’ve gained some insights about the process of optimizing C# code and learned some new things. This was my first time diving into optimizing a specific algorithm and writing about it, and I’m grateful that you joined me on this ride. If you’ve got any feedback or comments, I’d love to hear them! Whether it’s a thumbs-up or constructive criticism, I’m all ears. So again, thank you for being a part of this learning experience, and I look forward to your thoughts on the journey.
