Generally speaking, humans beings have a poor intuitive sense for randomness. That poor intuition is exploited all the time in gambling, games, markets, and business. Here’s a personal experience I had where the difference between random numbers and pseudo-random numbers made all the difference.
A client had a home-grown tracking system that was designed for flash crowds on the Internet. Everything was designed to support huge numbers of people joining the system in a short time, so everything was highly decentralized. They didn’t even want a central authority to guarantee unique ID numbers. Instead, they had the clients generate very large random numbers and reasoned that the chances of two clients generating the exact same number were so small and the population was so large, that the resulting data corruption would be insignificant (or “in the noise”).
As they started to get real traffic on their system for big events, they found there were very large numbers of customers from different IP addresses that were granting themselves the same ID number… much more than they expected and it was causing real problems with the quality of the analytics their business depended on.
1.B. That’s when I got involved…
Debugging an enterprise-class system is like being a detective. There’s no single person who knows the whole story, so you have to interview a lot of people. Yes, there may be an architect who has the big picture, and that’s a great starting point, but it isn’t the whole story.
In this case, my first clue that there was a real problem was from the architect. There were no real numbers included in the analysis, just “pseudo-numbers.”
I’d hear, “Chances of collision are small.”
So I asked “How small? Give me a number.”
But they didn’t have a number.
Their argument for design was reasonable, but it is impossible to do real risk analysis without numbers. The design, the arguments, the expected results were all entirely based on intuition. I’ll say it again, human beings have very poor intuitive grasp of random processes, myself included. Study of probability and statistics is an exercise in believing the counter-intuitive.
It turns out there were three mistakes made in this case: one mathematical, one rooted in computer science theory, and one practical. I’ll cover each one in order.
2. The Birthday Problem
If a teacher has a room full of students, what are the chances at least two of them have the same birthday? This is the “Birthday Problem” and it is beautiful because most people’s intuition fails miserably at it. For a class of 30 kids, the chances of at least one pair sharing a birthday are over 70%. It is useful because the mathematics behind the birthday problem is identical to a bunch of web clients that self assign ID#’s from a random number generator and estimating how often they’ll collide.
2.A. The Math
For simplicity, assume every day is an equally likely birthday, so exclude practical concerns like leap-day, twins, etc. leaving us exactly 365 possible birth dates for the students. If a teacher has 366 students in their room, then the chances of two kids having the same birthday is 100%, by the pigeonhole principle.
It is easier to start with the complimentary statement, the chance that there are no collisions at all. The first student has no collisions, so that’s 100%. The second one has a 364/365 chance of no collisions with the first, the next 363/365 chance of no collisions with the previous two, and so on. Multiplying it out, the total chances for no collisions in 30 students is.
Where is the number of r-permutations of n. (It is a little exotic, but you can find this function on most scientific calculators.) The chances for at least one collision is the compliment
Thus, a classroom of 30 students has a 70.63% chance that there is at least one pair of students with a matching birthday. You only need 23 students to have a 50/50 chance.
A general formula for probability of collisions for a random number of values n is
2.B. Applying Theory to Client’s Problem
Going back to our client’s problem, what are the chances for at least two of six million web-clients to generate the same random ID? In our formula above, the number of clients is r, and n depends on how many possible values can be generated by the pseudo-random number generator (PRNG).
They said it was 256 bits, which is a very large number. possible unique IDs. A calculator isn’t going to manage numbers at this scale. Even a program would take some serious care to compute this. Luckily, there’s a reasonable estimator.
For a coarse rule of thumb of probabilities less than 0.5, it can be further approximated
Solving for r, we get.
For instance, to estimate the number of people required for a 0.5 chance of a shared birthday, we get . Which is not too far from the correct answer of 23.
This rule of thumb is especially handy when working with exponents, such as the 256 bit random number. In this case, and want the chance of a collision to be at most one in a million (let’s say ).
Fifteen Decillion is a very large number, so they should definitely not see the number of collisions that they do. Now we can set our sights on implementation problems. It turns out that if they had true random number generators, they would have been fine, but these are very hard to do… impossible without special hardware. Instead, we only have pseudo-random number generators, and failing to appreciate that distinction is the root of their problem.
I’ll conclude this article and provide the final solution.