登录查看更多内容

Faster random integer generation with batching

Daniel Lemire

Computer Scientist

发布日期: 2024年8月17日

We often generate random integers. Quite often these numbers must be within an interval: e.g., an integer between 0 and 100. One application is a random shuffle. A standard algorithm for a fair random shuffle is the Knuth algorithm:

void shuffle(mytype *storage, uint64_t size) {
  for (uint64_t i = size; i > 1; i--) {
    uint64_t nextpos = random(i); // random value in [0,i)
    std::swap(storage[i - 1], storage[nextpos]);
  }
}

It is how the C++ function std::shuffle is implemented, as well as any similar standard shuffle function. Notice how it requires you to generate random integers in even changing intervals.

Meanwhile, most modern (pseudo) random integer functions generate 64-bit integers. Indeed, our processors are 64-bit bits running on 64-bit operating systems. Thus we must convert these 64-bit integers to whatever else we need, whether they are integers in an interval or a floating-point number.

You might be tempted to using a modulo operation (random() % 100). While it generates a random number between 0 and 100, it also uses a statistical bias. Further, if the value 100 is not know at compile-time (i.e., it is a changing value), it may compile down to a division instruction, which is relatively expensive. There is a better approach that requires no division most of the time and is used by many programming language standard libraries:

uint64_t random_bounded(uint64_t range) {
  __uint128_t random64bit, multiresult;
  uint64_t leftover;
  uint64_t threshold;
  random64bit = rng(); // 64-bit random integer
  multiresult = random64bit * range;
  leftover = (uint64_t)multiresult;
  if (leftover < range) {
    threshold = -range % range;
    while (leftover < threshold) {
      random64bit = rng();
      multiresult = random64bit * range;
      leftover = (uint64_t)multiresult;
    }
  }
  return (uint64_t)(multiresult >> 64); // [0, range)
}

If you are using GCC for C++ programming under Linux (glibc++), you are already using this trick. If you are a macOS user with LLVM and libc++, you are probably ?relying on a slower approach.

Is that the best we can do?

Think about the scenario where you must generate an integer in the interval [0,100) and then an integer in the interval [0,99). Could you generate both such integers form a single 64-bit random integer?

Indeed, you can, and you can avoid the division instructions once more. The code is somewhat similar although a bit more complicated. In theory, the algorithm needs to compute the product of the two ranges, but we only compute it if needed.

// product_bound can be any integer >= range1*range2
// it may be updated to become range1*range2
std::pair<uint64_t, uint64_t> 
 random_bounded_2(uint64_t range1, uint64_t range2,
                   uint64_t &product_bound) {
  __uint128_t random64bit, multiresult;
  uint64_t leftover;
  uint64_t threshold;
  random64bit = rng(); // 64-bit random integer
  multiresult = random64bit * range1;
  leftover = (uint64_t)multiresult;
  uint64_t result1 = (uint64_t)(multiresult >> 64); // [0, range1)
  multiresult = leftover * range2;
  leftover = (uint64_t)multiresult;
  uint64_t result2 = (uint64_t)(multiresult >> 64); // [0, range2)
  if (leftover < product_bound) {
    uint64_t product_bound = range2 * range1;
    if (leftover < product_bound) {
      threshold = -product_bound % product_bound;
      while (leftover < threshold) {
        random64bit = rng();
        multiresult = random64bit * range1;
        leftover = (uint64_t)multiresult;
        result1 = (uint64_t)(multiresult >> 64); // [0, range1)
        multiresult = leftover * range2;
        leftover = (uint64_t)multiresult;
        result2 = (uint64_t)(multiresult >> 64); // [0, range2)
      }
    }
  }
  return std::make_pair(result1, result2);
}

It has one limitation: the product of the two ranges must fit in a 64-bit word. Still, it be used to shuffle arrays, as long as they don’t exceed too much 1 billion elements:

领英推荐

Argument-Dependent Lookup and the Hidden Friend Idiom

Rainer Grimm 2 年前

Nth element variadic pack extraction

Alex Dathskovsky ? 2 年前

C++26 Library: string and string_view Processing

Rainer Grimm 2 个月前

void shuffle_2(mytype *storage, uint64_t size) {
  uint64_t i = size;
  for (; i > 1 << 30; i--) {
    uint64_t index = random_bounded(i, g); 
    // index is in [0, i-1]
    std::swap(storage[i - 1], storage[index]);
  }

  // Batches of 2 for sizes up to 2^30 elements
  uint64_t product_bound = i * (i - 1);
  for (; i > 1; i -= 2) {
    auto [index1, index2] = random_bounded_2(i, i - 1, 
         product_bound, g);
    // index1 is in [0, i-1]
    // index2 is in [0, i-2]
    std::swap(storage[i - 1], storage[index1]);
    std::swap(storage[i - 2], storage[index2]);
  }
}

Of course, you can extend this trick to 3, 4, 5, 6… elements, but it gets more complicated.

Is it fast? Let us test with a standard random number generator is C++ (std::mt19937_64). I wrote a C++ benchmark.

Let us first consider an M2 macBook with Apple/LLVM 15:

So we are four to five times faster than the standard library. Not bad.

What about Linux with an Intel processor and GCC 12? Then our batched approach is about 30% faster than the standard library. Notice how the Linux/GCC base results are much better: that’s because it is already using a well optimized routine. Still: 30% is not a bad gain.

Of course, the batched generation of random numbers applies to other important operations. Shuffling is just the most obvious one.

This is joint work with Nevin Brackett-Rozinsky.

References:

Nevin Brackett-Rozinsky, Daniel Lemire, Batched Ranged Random Integer Generation, Software: Practice and Experience (to appear)
Daniel Lemire, Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation, Volume 29 Issue 1, February 2019

Mark Dawson, Jr.

Senior Performance Engineer Specializing in Low Latency Java/C++ Environments | Contributing Author of "Performance Analysis & Tuning on Modern CPUs" (1st & 2nd Editions available on Amazon) | CEO of JabPerf

7 个月

Daniel Lemire, what is the name of the initial trick used by GCC on Linux that you illustrated, so that I can read a detailed description of that algorithm?

2 次回应

查看更多评论

要查看或添加评论，请登录

Daniel Lemire的更多文章

Speeding up C++ code with template lambdas

2025年3月15日

Speeding up C++ code with template lambdas

Let us consider a simple C++ function which divides all values in a range of integers: A division between two integers…
An overview of parallel programming (Go edition)

2025年3月9日

An overview of parallel programming (Go edition)

In practice, the software we write runs on several processors. Unfortunately, much of what we take for granted on a…
How fast can you open 1000 files?

2025年3月1日

How fast can you open 1000 files?

Jarred Sumner, the main author of the Bun JavaScript engine, commented a few days ago on X that opening many files on…

1 条评论
AVX-512 gotcha: avoid compressing words to memory with AMD Zen 4 processors

2025年2月15日

AVX-512 gotcha: avoid compressing words to memory with AMD Zen 4 processors

Convention computer instructions operate on a single piece of data at once (e.g.

4 条评论
Thread-safe memory copy

2025年2月7日

Thread-safe memory copy

A common operation in software is the copy of a block of memory. In C/C++, we often call the function memcpy for this…

2 条评论
Programmer time and the pitfalls of wasteful work

2025年1月30日

Programmer time and the pitfalls of wasteful work

Programmer time is precious. This realization should shape our approach to software development, focusing our efforts…
Regular expressions can blow up!

2025年1月25日

Regular expressions can blow up!

Regular expressions, often abbreviated as regex, are a powerful tool for pattern matching within text. For example, the…

6 条评论
Checking whether an ARM NEON register is zero

2025年1月20日

Checking whether an ARM NEON register is zero

Your phone probably runs on 64-bit ARM processors. These processors are ubiquitous: they power the Nintendo Switch…
JavaScript hashing speed comparison: MD5 versus SHA-256

2025年1月11日

JavaScript hashing speed comparison: MD5 versus SHA-256

Hashing algorithms convert input data into a fixed-size string of characters, known as a hash value or digest. These…
Counting the digits of 64-bit integers

2025年1月7日

Counting the digits of 64-bit integers

Given an integer in software, you may want to know how many decimal digits it needs. For example, the integer 100…

3 条评论

See all articles

Faster random integer generation with batching

Daniel Lemire

Computer Scientist

领英推荐

Daniel Lemire的更多文章

社区洞察

其他会员也浏览了

Template Metaprogramming - How it All Started

C++20 Coroutine: Under The Hood

Bit Manipulation with C++20

Which Axis ACAP SDK should you use? Computer Vision at the edge

std::format Extension

Chapter 5: Commands and Tokens

C++20: Coroutines - A First Overview

Understanding constexpr and Constant Expressions in C++

Constant Folding in C++

C++ Core Guidelines: Programming at Compile Time with Type-Traits (The Second)

领英推荐

Daniel Lemire的更多文章

Speeding up C++ code with template lambdas

An overview of parallel programming (Go edition)

How fast can you open 1000 files?

AVX-512 gotcha: avoid compressing words to memory with AMD Zen 4 processors

Thread-safe memory copy

Programmer time and the pitfalls of wasteful work

Regular expressions can blow up!

Checking whether an ARM NEON register is zero

JavaScript hashing speed comparison: MD5 versus SHA-256

Counting the digits of 64-bit integers

社区洞察

其他会员也浏览了

Template Metaprogramming - How it All Started

C++20 Coroutine: Under The Hood

Bit Manipulation with C++20

Which Axis ACAP SDK should you use? Computer Vision at the edge

std::format Extension

Chapter 5: Commands and Tokens

C++20: Coroutines - A First Overview

Understanding constexpr and Constant Expressions in C++

Constant Folding in C++

C++ Core Guidelines: Programming at Compile Time with Type-Traits (The Second)