Last week, I started a discussion about a problem a client had with unique IDs generated by random numbers. They expected some collision, but the expected benefits of scalability outweighed the rare conflicts which would be lost in the insignificant digits of their analytics. In practice, the collisions were much more frequent and polluting their data. They needed to know why.
Having convinced the engineers and engineering managers about the underlying mathematics, they were actually pleased to see that their intuition about a very large random number (256 bits!) was a good solution. In fact, the chances of conflict was much lower than they expected. As they continued digging through the data, however, they found more and more evidence of duplication that they could not explain.
The millions of users coming in as flash crowds were real. The next thing to look at was the size of the random number.
3. Where Does 256 bits of Randomness Come From?
The engineering manager swore that his team had a 256 bit pseudo-random number generator. The engineers believed they did too. When I looked at the code, however, they used a 32 bit pseudo-random number generator to generate a sequence of eight 32 bit numbers that they concatenated. This was a fundamental mistake.
All computers are Turing Machines, which are finite and deterministic. It is really hard for them to collect enough entropy sources to generate a truly random number. Usually, they need custom hardware to listen to some form of analog noise. The free service, random.com, uses atmospheric noise to generate truly random numbers.
The “pseudo” in pseudo-random number generators is important. The deterministic, finite state machine is faking random numbers with a recursive hash function. All pseudo-random number generators have a period after which they repeat the sequence of numbers over again.
Now consider this sequence of eight 32-bit random numbers. If the first 32 bits collide, what are the chances that the next seven sets of 32 bits will also going to collide? For a true random source, it is small. For pseudo-random number generators, it can be as high as 100%, depending on the implementation.
The pseudo-random number generators that come with most programming languages (including Java, libC, and others) use a technique called linear congruential generation. Don’t worry about the name; it is easier to read pseudo-code.
return selected_bits( )
where a, c and m are constants, is the seed.
- the multiplier, is fixed
- the increment, is fixed
- the modulus, is fixed.
- the selected_bits() function is also fixed.
You can find what these numbers are for various implementations on the web. The modulus, sets an upper bound on the period of the number sequence, Though in many cases the period may be much shorter!
If two clients use the same implementation and generate the same , the probability they will also generate matching . Depends on many details. In many implementations, such as my clients where and bit-selection is a no-op, the chance is 100%!
So I had to explain to my client that they did not really have a full 256-bit random number. They had a 32-bit random number followed by 224 bits of deterministic uselessness.
Using our approximation from last week’s post, we can see what this does to keep their chances of collision to under one in a million.
Remember, this is just an “rule of thumb” estimator. Calculating this exactly, I got 93 as the upper bound with chances of collision less than 1 in 1/1.000,000. A 1 in 50 chance is only 77,164 clients. Clearly this was not acceptable from an engineering standpoint.
I thought I might be done, but now the engineers started digging and thinking. They discovered it wasn’t just doubles, but triples, quadruples, quintuples. Sometimes they had 20-50 clients all with the same id, and not for very large populations. We were on the right track, but there was something else going on.
4. The Importance of a Good Seed.
The next thing to look to is how they are seeding the PRNG. There is a great example in gambling where a video poker game had a hard-coded seed that it booted with. Gamblers learned how to build sufficient static electricity to reboot the system, and start playing the same hands every single time thereafter. An early implementation of a web-browser used the process id’s of the browser and its parent as seeds for the PRNG, but the parent process was often the windowing system which had a well-defined process ID.
Often the seed is chosen from the clock. In this case, it was a millisecond clock, so you would expect a lot of entropy. But we’re talking about flash mobs, where tens of thousands of users can hit a website in a second. The computers don’t need to be synchronized , you start running into the pigeon hole principle again. Even if you have 32 bit clock with milisecond accuracy… as long as everyone’s clock is accurate to the hour, 1000 msecs/sec * 60 secs/min * 60 mins/hour = 3,600,000 possible values, which is only 22 bits.
Want to guess what the chances are for 10 million viewers to have collisions when there are only 3,600,000 possible unique keys to choose from in an hour? The answer is implied by the photo below, though on a much grander scale.
Ultimately, the customer moved to a server-based system where blocks of unique IDs were allocated to each edge server, which could request new blocks when their private pool were running out. Although their initial idea of generating random IDs on the client would have been more scalable, there simply was not enough entropy available in flash crowds to prevent colliding IDs on a massive scale.