I recently discovered that the _mm_crc32_* intel intrinsic instruction could be used to generate (pseudo-)random 32 bit numbers.

```
#include <nmmintrin.h> /* needs CRC32C instruction from SSE4.2 instruction set extension */
uint32_t rnd = 1; /* initialize with seed != 0 */
/* period length is 4,294,967,295 = 2^32-1 */
while (1) {
#if 0 // this was faster but worse than xorshift32 (fails more tests)
// rnd = _mm_crc32_u8(rnd, rnd >> 3);
#else // this is faster and better than xorshift32 (fails fewer tests)
rnd = _mm_crc32_u32(rnd, rnd << 18);
#endif
printf("%08X\n", rnd);
}
```

This method is as fast as a LCG, and faster than xorshift32. Wikipedia says that because xorshift generators "fail a few statistical tests, they have been accused of being unreliable".

Now I am wondering if the CRC32C method passes the various tests that are done on random number generators. I only verified that each bit, even the LSB, is "random" by trying to compress using PAQ8 compressors (which failed). Can someone help me do better tests?

**EDIT:** Using tests from the suggested TestU01 suite, the method that I previously used turned out to be worse than xorshift32. I have updated the source code above in case anyone is interested in using the better version.

answered 5 months ago Robert Dodier #1

This is an interesting question. Ultimately the only test that counts is "does this rng yield correct results for the problem I'm working on". What are you hoping to do with the rng?

In order to avoid answering that question for every different problem, various tests have been devised. See for example the "Diehard" tests devised by George Marsaglia. A web search for "marsaglia random number generator tests" turns up several interesting links.

I think Marsaglia's work is a couple of decades old at this point. I don't know if there has been more work on this topic since then. My guess is that for non-cryptographic purposes, an rng which passes the Diehard tests is probably sufficient.

answered 5 months ago Peter Cordes #2

There's a big difference in the requirements for a PRNG for a video game (especially single-player) vs. a monte-carlo simulation. Small biases can be a problem for scientific numerical computing, but generally not for a game, especially if numbers from the same PRNG are used in different ways.

There's a reason that different PRNGs with different speed / quality tradeoffs exist.

This one is *very* fast, especially if the seed / state stays in a register, taking only 2 or 3 uops on a modern Intel CPU. So it's fantastic if it can inline into a loop. Compared to anything else of the same speed, it's probably better quality. But compared to something only a little bit slower with larger state, it's probably pathetic if you care about statistical quality.

On x86 with BMI2, each RNG step should only require `rorx edx, eax, 3`

/ `crc32 eax, dl`

. On Haswell/Skylake, that's 2 uops with total latency = 1 + 3 cycles for the loop-carried dependency. (http://agner.org/optimize/). Or 3 uops without BMI2, for `mov edx, eax`

/ `shr edx,3`

/ `crc32 eax, dl`

, but still only 4 cycles of latency on CPUs with zero-latency `mov`

for GP registers: Ivybridge+ and Ryzen.

2 uops is negligible impact on surrounding code in the normal case where you do enough work with each PRNG result that the 4-cycle dependency chain isn't a bottleneck. (Or ~9 cycle if your compiler stores/reloads the PRNG state inside the loop instead of keeping it live in a register and sinking the store to the global to after a loop, costing you 2 extra 1-uop instructions).

On Ryzen, `crc32`

is 3 uops with 3c total latency, so more impact on surrounding code but the same one per 4 clock bottleneck if you're doing so little with the PRNG results that you bottleneck on that.

**I suspect you may have been benchmarking the loop-carried dependency chain bottleneck, not the impact on real surrounding code that does enough work to hide that latency**. (Almost all relevant x86 CPUs are out-of-order execution.) Making an RNG even cheaper than xorshift128+, or even xorshift128, is likely to be of negligible benefit for most use-cases. xorshift128+ or xorshift128* is fast, and quite good quality for the speed.

**If you want lots of PRNG results very quickly, consider using a SIMD xorshift128+ to run two or four generators in parallel** (in different elements of XMM or YMM vectors). Especially if you can usefully use a `__m256i`

vector of PRNG results. See AVX/SSE version of xorshift128+, and also this answer where I used it.

**Returning the entire state as the RNG result is usually a bad thing, because it means that one value tells you exactly what the next one will be**. i.e. 3 is always followed by 1897987234 (fake numbers), never 3 followed by something else. Most statistical quality tests should pick this up, but this might or might not be a problem for any given use-case.

Note that https://en.wikipedia.org/wiki/Xorshift is saying that even xorshift128 fails a few statistical tests. I assume xorshift32 is significantly worse. CRC32c is also based on XOR and shift (but also with bit-reflect and modulo in Galois Field(2)), so it's reasonable to think it might be similar or better in quality.

You say your choice of `crc32(rnd, rnd>>3)`

gives a period of 2^32, and that's the best you can do with a state that small. (Of course `rnd++`

achieves the same period, so it's not the only measure of quality.) It's likely at least as good as an LCG, but those are *not* considered high quality, especially if the modulus is 2^32 (so you get it for free from fixed-width integer math).

answered 5 months ago Mark Adler #3

One measure of the goodness of a PRNG is the length of the cycle. If that matters for your application, then a CRC-32 as you are using it would not be a good choice, since the cycle is only 2^{32}. One consequence is that if you are using more samples than that, which doesn't take very long, your results will repeat. Another is that there is a correlation between successive CRC-32 values, where there is only one possible value that will follow the current value.

Better PRNGs have exponentially longer cycles, and the values returned are smaller than the bits in the state, so that successive values do not have that correlation.

You don't need to use a CRC-32C instruction to be fast. Also you don't need to design your own PRNG, which is fraught with hidden perils. Best to leave that to the professionals. See this work for high-quality, small, and fast random number generators.