Faster random integer generation with batching
We often need to generate random integers. Quite often these numbers must fall within an interval: e.g., an integer between 0 and 100. One application is a random shuffle. The standard algorithm for a fair random shuffle is the Knuth shuffle (also called Fisher-Yates):
void shuffle(mytype *storage, uint64_t size) {
  for (uint64_t i = size; i > 1; i--) {
    uint64_t nextpos = random(i); // random value in [0,i)
    std::swap(storage[i - 1], storage[nextpos]);
  }
}
This is how the C++ function std::shuffle is implemented, as well as any similar standard shuffle function. Notice how it requires you to generate random integers in ever-changing intervals.
Meanwhile, most modern (pseudo-)random integer generators produce 64-bit integers. Indeed, our processors are 64-bit processors running 64-bit operating systems. Thus we must convert these 64-bit integers into whatever we need, whether an integer in an interval or a floating-point number.
You might be tempted to use a modulo operation (random() % 100). While it generates a random number between 0 and 99, it also introduces a statistical bias. Further, if the value 100 is not known at compile time (i.e., it is a changing value), the modulo may compile down to a division instruction, which is relatively expensive. There is a better approach that requires no division most of the time and that many programming-language standard libraries use:
uint64_t random_bounded(uint64_t range) {
  __uint128_t random64bit, multiresult;
  uint64_t leftover;
  uint64_t threshold;
  random64bit = rng(); // 64-bit random integer
  multiresult = random64bit * range;
  leftover = (uint64_t)multiresult;
  if (leftover < range) {
    threshold = -range % range; // 2^64 mod range
    while (leftover < threshold) {
      random64bit = rng();
      multiresult = random64bit * range;
      leftover = (uint64_t)multiresult;
    }
  }
  return (uint64_t)(multiresult >> 64); // [0, range)
}
If you are using GCC for C++ programming under Linux (libstdc++), you are already benefiting from this trick. If you are a macOS user with LLVM and libc++, you are probably relying on a slower approach.
Is that the best we can do?
Think about the scenario where you must generate an integer in the interval [0,100) and then an integer in the interval [0,99). Could you generate both integers from a single 64-bit random integer?
Indeed you can, and you can once more avoid division instructions. The code is somewhat similar, although a bit more complicated. In theory, the algorithm needs the product of the two ranges, but we only compute it when needed.
// product_bound can be any integer >= range1*range2
// it may be updated to become range1*range2
std::pair<uint64_t, uint64_t>
random_bounded_2(uint64_t range1, uint64_t range2,
                 uint64_t &product_bound) {
  __uint128_t random64bit, multiresult;
  uint64_t leftover;
  uint64_t threshold;
  random64bit = rng(); // 64-bit random integer
  multiresult = random64bit * range1;
  leftover = (uint64_t)multiresult;
  uint64_t result1 = (uint64_t)(multiresult >> 64); // [0, range1)
  multiresult = (__uint128_t)leftover * range2;
  leftover = (uint64_t)multiresult;
  uint64_t result2 = (uint64_t)(multiresult >> 64); // [0, range2)
  if (leftover < product_bound) {
    product_bound = range2 * range1; // tighten the caller's bound
    if (leftover < product_bound) {
      threshold = -product_bound % product_bound; // 2^64 mod product_bound
      while (leftover < threshold) {
        random64bit = rng();
        multiresult = random64bit * range1;
        leftover = (uint64_t)multiresult;
        result1 = (uint64_t)(multiresult >> 64); // [0, range1)
        multiresult = (__uint128_t)leftover * range2;
        leftover = (uint64_t)multiresult;
        result2 = (uint64_t)(multiresult >> 64); // [0, range2)
      }
    }
  }
  return std::make_pair(result1, result2);
}
It has one limitation: the product of the two ranges must fit in a 64-bit word. Still, it can be used to shuffle arrays, as long as they do not contain much more than a billion elements:
void shuffle_2(mytype *storage, uint64_t size) {
  uint64_t i = size;
  for (; i > (uint64_t(1) << 30); i--) {
    uint64_t index = random_bounded(i);
    // index is in [0, i-1]
    std::swap(storage[i - 1], storage[index]);
  }
  // Batches of 2 for sizes up to 2^30 elements
  uint64_t product_bound = i * (i - 1);
  for (; i > 1; i -= 2) {
    auto [index1, index2] = random_bounded_2(i, i - 1, product_bound);
    // index1 is in [0, i-1]
    // index2 is in [0, i-2]
    std::swap(storage[i - 1], storage[index1]);
    std::swap(storage[i - 2], storage[index2]);
  }
}
Of course, you can extend this trick to 3, 4, 5, 6… elements, but it gets more complicated.
Is it fast? Let us test with a standard random number generator in C++ (std::mt19937_64). I wrote a C++ benchmark.
Let us first consider an M2 MacBook with Apple/LLVM 15: there, we are four to five times faster than the standard library. Not bad.
What about Linux with an Intel processor and GCC 12? There, our batched approach is about 30% faster than the standard library. Notice that the Linux/GCC baseline is much better to begin with: GCC already uses a well-optimized routine. Still, 30% is not a bad gain.
Of course, the batched generation of random numbers applies to other important operations. Shuffling is just the most obvious one.
This is joint work with Nevin Brackett-Rozinsky.