Faster random integer generation with batching

Faster random integer generation with batching

We often generate random integers. Quite often these numbers must be within an interval: e.g., an integer between 0 and 100. One application is a random shuffle. A standard algorithm for a fair random shuffle is the Knuth algorithm:

void shuffle(mytype *storage, uint64_t size) {
  for (uint64_t i = size; i > 1; i--) {
    uint64_t nextpos = random(i); // random value in [0,i)
    std::swap(storage[i - 1], storage[nextpos]);
  }
}        

It is how the C++ function std::shuffle is implemented, as well as any similar standard shuffle function. Notice how it requires you to generate random integers in even changing intervals.

Meanwhile, most modern (pseudo) random integer functions generate 64-bit integers. Indeed, our processors are 64-bit bits running on 64-bit operating systems. Thus we must convert these 64-bit integers to whatever else we need, whether they are integers in an interval or a floating-point number.

You might be tempted to using a modulo operation (random() % 100). While it generates a random number between 0 and 100, it also uses a statistical bias. Further, if the value 100 is not know at compile-time (i.e., it is a changing value), it may compile down to a division instruction, which is relatively expensive. There is a better approach that requires no division most of the time and is used by many programming language standard libraries:

uint64_t random_bounded(uint64_t range) {
  __uint128_t random64bit, multiresult;
  uint64_t leftover;
  uint64_t threshold;
  random64bit = rng(); // 64-bit random integer
  multiresult = random64bit * range;
  leftover = (uint64_t)multiresult;
  if (leftover < range) {
    threshold = -range % range;
    while (leftover < threshold) {
      random64bit = rng();
      multiresult = random64bit * range;
      leftover = (uint64_t)multiresult;
    }
  }
  return (uint64_t)(multiresult >> 64); // [0, range)
}        

If you are using GCC for C++ programming under Linux (glibc++), you are already using this trick. If you are a macOS user with LLVM and libc++, you are probably ?relying on a slower approach.

Is that the best we can do?

Think about the scenario where you must generate an integer in the interval [0,100) and then an integer in the interval [0,99). Could you generate both such integers form a single 64-bit random integer?

Indeed, you can, and you can avoid the division instructions once more. The code is somewhat similar although a bit more complicated. In theory, the algorithm needs to compute the product of the two ranges, but we only compute it if needed.

// product_bound can be any integer >= range1*range2
// it may be updated to become range1*range2
std::pair<uint64_t, uint64_t> 
 random_bounded_2(uint64_t range1, uint64_t range2,
                   uint64_t &product_bound) {
  __uint128_t random64bit, multiresult;
  uint64_t leftover;
  uint64_t threshold;
  random64bit = rng(); // 64-bit random integer
  multiresult = random64bit * range1;
  leftover = (uint64_t)multiresult;
  uint64_t result1 = (uint64_t)(multiresult >> 64); // [0, range1)
  multiresult = leftover * range2;
  leftover = (uint64_t)multiresult;
  uint64_t result2 = (uint64_t)(multiresult >> 64); // [0, range2)
  if (leftover < product_bound) {
    uint64_t product_bound = range2 * range1;
    if (leftover < product_bound) {
      threshold = -product_bound % product_bound;
      while (leftover < threshold) {
        random64bit = rng();
        multiresult = random64bit * range1;
        leftover = (uint64_t)multiresult;
        result1 = (uint64_t)(multiresult >> 64); // [0, range1)
        multiresult = leftover * range2;
        leftover = (uint64_t)multiresult;
        result2 = (uint64_t)(multiresult >> 64); // [0, range2)
      }
    }
  }
  return std::make_pair(result1, result2);
}        

It has one limitation: the product of the two ranges must fit in a 64-bit word. Still, it be used to shuffle arrays, as long as they don’t exceed too much 1 billion elements:

void shuffle_2(mytype *storage, uint64_t size) {
  uint64_t i = size;
  for (; i > 1 << 30; i--) {
    uint64_t index = random_bounded(i, g); 
    // index is in [0, i-1]
    std::swap(storage[i - 1], storage[index]);
  }

  // Batches of 2 for sizes up to 2^30 elements
  uint64_t product_bound = i * (i - 1);
  for (; i > 1; i -= 2) {
    auto [index1, index2] = random_bounded_2(i, i - 1, 
         product_bound, g);
    // index1 is in [0, i-1]
    // index2 is in [0, i-2]
    std::swap(storage[i - 1], storage[index1]);
    std::swap(storage[i - 2], storage[index2]);
  }
}        

Of course, you can extend this trick to 3, 4, 5, 6… elements, but it gets more complicated.

Is it fast? Let us test with a standard random number generator is C++ (std::mt19937_64). I wrote a C++ benchmark.

Let us first consider an M2 macBook with Apple/LLVM 15:

So we are four to five times faster than the standard library. Not bad.

What about Linux with an Intel processor and GCC 12? Then our batched approach is about 30% faster than the standard library. Notice how the Linux/GCC base results are much better: that’s because it is already using a well optimized routine. Still: 30% is not a bad gain.

Of course, the batched generation of random numbers applies to other important operations. Shuffling is just the most obvious one.

This is joint work with Nevin Brackett-Rozinsky.

References:

  • Nevin Brackett-Rozinsky, Daniel Lemire, Batched Ranged Random Integer Generation, Software: Practice and Experience (to appear)
  • Daniel Lemire, Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation, Volume 29 Issue 1, February 2019

Mark Dawson, Jr.

Senior Performance Engineer Specializing in Low Latency Java/C++ Environments | Contributing Author of "Performance Analysis & Tuning on Modern CPUs" (1st & 2nd Editions available on Amazon) | CEO of JabPerf

7 个月

Daniel Lemire, what is the name of the initial trick used by GCC on Linux that you illustrated, so that I can read a detailed description of that algorithm?

要查看或添加评论,请登录

Daniel Lemire的更多文章

  • Speeding up C++ code with template lambdas

    Speeding up C++ code with template lambdas

    Let us consider a simple C++ function which divides all values in a range of integers: A division between two integers…

  • An overview of parallel programming (Go edition)

    An overview of parallel programming (Go edition)

    In practice, the software we write runs on several processors. Unfortunately, much of what we take for granted on a…

  • How fast can you open 1000 files?

    How fast can you open 1000 files?

    Jarred Sumner, the main author of the Bun JavaScript engine, commented a few days ago on X that opening many files on…

    1 条评论
  • AVX-512 gotcha: avoid compressing words to memory with AMD Zen 4 processors

    AVX-512 gotcha: avoid compressing words to memory with AMD Zen 4 processors

    Convention computer instructions operate on a single piece of data at once (e.g.

    4 条评论
  • Thread-safe memory copy

    Thread-safe memory copy

    A common operation in software is the copy of a block of memory. In C/C++, we often call the function memcpy for this…

    2 条评论
  • Programmer time and the pitfalls of wasteful work

    Programmer time and the pitfalls of wasteful work

    Programmer time is precious. This realization should shape our approach to software development, focusing our efforts…

  • Regular expressions can blow up!

    Regular expressions can blow up!

    Regular expressions, often abbreviated as regex, are a powerful tool for pattern matching within text. For example, the…

    6 条评论
  • Checking whether an ARM NEON register is zero

    Checking whether an ARM NEON register is zero

    Your phone probably runs on 64-bit ARM processors. These processors are ubiquitous: they power the Nintendo Switch…

  • JavaScript hashing speed comparison: MD5 versus SHA-256

    JavaScript hashing speed comparison: MD5 versus SHA-256

    Hashing algorithms convert input data into a fixed-size string of characters, known as a hash value or digest. These…

  • Counting the digits of 64-bit integers

    Counting the digits of 64-bit integers

    Given an integer in software, you may want to know how many decimal digits it needs. For example, the integer 100…

    3 条评论

社区洞察

其他会员也浏览了