March 7, 2018

I had cause, recently, to write a little function that randomises a name for a resource from a list of roughly 90 words and a 3-digit postfix, in the form ‘happycat-267’*. Now, I don’t want to re-use a name that’s gone before, so the function checks already-deployed resources and if the name is taken, the function recurses to pick a new name from my ~90,000 possiblities.

Of course, when it came to testing this function, I wanted to be reasonably sure that the recursion worked and would continue to work in the future, so I decided to write a simple Pester test that loops a certain number of times and doesn’t return a duplicate value.

But this raised a question. How many iterations should this test go through to be reasonably sure of hitting a duplicate and thus triggering a recursion?

Obviously to fully test, I’d need to run the function 90,000+ times. But that’s computationally expensive and would slow down my testing. I don’t want to do that. But how many is the least number of tests I can do to be reasonably sure of a duplicate appearing?

Which brings me to the Birthday Problem, aka the Birthday Paradox.

Stated simply, the Birthday Problem tells us that in a group of only 23 people, it’s more likely than not to find two people who share a birthday. And if you have 70 people in a room, the probability of at least one birthday match is up at 99.9%.

The Wikipedia article goes into a lot of detail on why this is mathematically true, but the astute among us whill have noticed that this mathematical phenomenon has an application in unit testing my little “random-name-without-collisions” function.

There are Birthday Problem calculators online, such as this one, so I plugged in my possibility space of 90,000 and started playing with iterations. It turns out that if I call my function 1000 times, there’s a >99.5% chance that I’ll produce (and therefore handle) a duplicate.

At 500 iterations, the probability drops to around 75%, and at 300 iterations, it’s around 40% – so clearly 300 iterations is too low, that is it’s more likely that I won’t hit a duplicate.

So, I wrote my own function to calculate the probabilities for me

I can call this in a loop, and calculate a table of probabilities from 1 iteration to 10,000 (and graph that, if I feel like it).

(1..10000) | % {
    $probability = Get-CollisionChance 90000 $_ 
    [pscustomobject]@{Iterations = $_ ; Probability = $probability}
}

I can then use that table to zoom in on a probability I feel comfortable with. Let’s say I’m happy with a 99% probability of collision, it turns out that 911 iterations will get me there. If I’m happy with 90%, 644 iterations will do it.

Above 911 iterations, the curve plateaus out and the returns from adding more iterations become smaller. We hit a point where PowerShell rounds the probability up to 1 at 2519 iterations. It’s not mathematically certain at this point, but we’re up around a 99.9999999999999% chance of collision.

So we can see there’s really not much point iterating above 2500 repetitions. The increased probability just isn’t worth the extra processor cycles.

So anyway, with a little mocking, I can write a useful test that has a 99% chance of hitting at least one duplicate in the function, and test that it doesn’t actualy return any dupes, thus:

Describe "Getting a random name" {
    # Mock this so we don't exhaust API calls by repeatedly calling for a list
    $currentresources = Get-AlreadyDeployedResources
    Mock Get-AlreadyDeployedResources { return $currentresources }

    It "Never returns duplicates (as far as we can tell)" {   
        $reslist = @() 
        (1..911) | % { 
            # get a random resource name
            $reslist += Get-RandomResourceName | tee-object -Variable plusone 
            # add that back to the list we mocked
            $currentresources += [pscustomobject]@{ ResourceName = $plusone; ResourceID = 'fakeidentifier' }
        }
        # count how many names we have
        $all = ($reslist | measure-object | select-object -expand Count)
        #check how many unique names are in that list
        $unique = ($reslist | select-object -unique | measure-object | select-object -expand Count)

        # they should match
        $all | Should Be $unique
    }
}

Anyway, this is what I spent yesterday afternoon researching and playing around with. Hopefully someone other than me will find it useful. If not, well I had fun.

* not one of the actual values.

M	T	W	T	F	S	S
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31

d.evops.co

Devops engineers do it in production

Daily Archives: March 7, 2018

The Birthday Paradox and testing random functions