I had cause, recently, to write a little function that randomises a name for a resource from a list of roughly 90 words and a 3-digit postfix, in the form ‘happycat-267′*. Now, I don’t want to re-use a name that’s gone before, so the function checks already-deployed resources and if the name is taken, the function recurses to pick a new name from my ~90,000 possiblities.

Of course, when it came to testing this function, I wanted to be reasonably sure that the recursion worked and would continue to work in the future, so I decided to write a simple Pester test that loops a certain number of times and doesn’t return a duplicate value.

But this raised a question. How many iterations should this test go through to be reasonably sure of hitting a duplicate and thus triggering a recursion?

Obviously to fully test, I’d need to run the function 90,000+ times. But that’s computationally expensive and would slow down my testing. I don’t want to do that. But how many is the *least* number of tests I can do to be *reasonably* sure of a duplicate appearing?

Which brings me to the Birthday Problem, aka the Birthday Paradox.

Stated simply, the Birthday Problem tells us that in a group of only 23 people, it’s more likely than not to find two people who share a birthday. And if you have 70 people in a room, the probability of at least one birthday match is up at 99.9%.

The Wikipedia article goes into a lot of detail on why this is mathematically true, but the astute among us whill have noticed that this mathematical phenomenon has an application in unit testing my little “random-name-without-collisions” function.

There are Birthday Problem calculators online, such as this one, so I plugged in my possibility space of 90,000 and started playing with iterations. It turns out that if I call my function 1000 times, there’s a >99.5% chance that I’ll produce (and therefore handle) a duplicate.

At 500 iterations, the probability drops to around 75%, and at 300 iterations, it’s around 40% – so clearly 300 iterations is too low, that is it’s more likely that I *won’t* hit a duplicate.

So, I wrote my own function to calculate the probabilities for me

I can call this in a loop, and calculate a table of probabilities from 1 iteration to 10,000 (and graph that, if I feel like it).

1 2 3 4 |
(1..10000) | % { $probability = Get-CollisionChance 90000 $_ [pscustomobject]@{Iterations = $_ ; Probability = $probability} } |

I can then use that table to zoom in on a probability I feel comfortable with. Let’s say I’m happy with a 99% probability of collision, it turns out that 911 iterations will get me there. If I’m happy with 90%, 644 iterations will do it.

Above 911 iterations, the curve plateaus out and the returns from adding more iterations become smaller. We hit a point where PowerShell rounds the probability up to 1 at 2519 iterations. It’s not mathematically *certain* at this point, but we’re up around a 99.9999999999999% chance of collision.

So we can see there’s really not much point iterating above 2500 repetitions. The increased probability just isn’t worth the extra processor cycles.

So anyway, with a little mocking, I can write a useful test that has a 99% chance of hitting at least one duplicate in the function, and test that it doesn’t actualy return any dupes, thus:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
Describe "Getting a random name" { # Mock this so we don't exhaust API calls by repeatedly calling for a list $currentresources = Get-AlreadyDeployedResources Mock Get-AlreadyDeployedResources { return $currentresources } It "Never returns duplicates (as far as we can tell)" { $reslist = @() (1..911) | % { # get a random resource name $reslist += Get-RandomResourceName | tee-object -Variable plusone # add that back to the list we mocked $currentresources += [pscustomobject]@{ ResourceName = $plusone; ResourceID = 'fakeidentifier' } } # count how many names we have $all = ($reslist | measure-object | select-object -expand Count) #check how many unique names are in that list $unique = ($reslist | select-object -unique | measure-object | select-object -expand Count) # they should match $all | Should Be $unique } } |

Anyway, this is what I spent yesterday afternoon researching and playing around with. Hopefully someone other than me will find it useful. If not, well I had fun.

** not one of the actual values.*