This is something that’s been sitting in my ‘to-blog-about’ list for a while now and I guess I was waiting for it to be ‘finished’ but it became clear that ‘finished’ was quite hard to define for this little project so here it goes…
Last year I was living in Bristol and part of my home security was a padlock. Four digits, each 0-9, 10,000 combinations. My starting thought was, did it matter how much I scrambled it? What was the security risk of, say, only changing one number vs the convenience of being able to open it nice and quickly each time? Now I wasn’t the only person who opened this lock each day, there were at any time somewhere between 10 and 20 people sharing this particular access, so I thought I’d find out what everyone else was doing…
From then on I began to make a note of what number I found in the lock each time I got to it. I feel it’s safe to say that I got a good cross section of times as my schedule was very irregular at the time, so I wasn’t likely to always be following the same people in/out on a daily basis. The only weighting I can think of that might exist in my results is toward people who come and go more regularly vs someone who perhaps is away for long periods of time, but I can’t do much about that. It was important to me not to mention this to people, I didn't want to change their habits.
Over the course of perhaps 2 months I collected 52 results which I entered into a spreadsheet which I kept stored on my phone. After 21 results the padlock was replaced with a different style padlock.
This gave me 2 data sets to play with
Set 1 - 21 Results, padlock #1
Set 2 - 31 Results, padlock #2
So the two main things I wanted to do were to take the modal score for each digit and then to find the mean. The mode was simple enough: count how often each digit occurs and find out which cropped up the most.
The mean was trickier. A standard mean calculation wasn’t going to cut it, because everything needs to work modulo 10: twist the dial one place beyond 9 and you get 0, not 10. For example, the ordinary mean of 9 and 1 is 5, but on the dial the point midway between them is 0.
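As a rough illustration of the mode step, here is a minimal Python sketch. The readings in it are invented purely for the example (they are not my real data), and `modal_digits` is just a name I've picked for the helper.

```python
from collections import Counter

# Hypothetical readings, invented for illustration only (not the real data).
readings = ["1234", "4234", "1534", "1237", "8234"]

def modal_digits(codes):
    """Return the most common digit in each of the four positions."""
    modes = []
    for position in range(4):
        # Count how often each digit shows up in this position.
        counts = Counter(code[position] for code in codes)
        digit, _count = counts.most_common(1)[0]
        modes.append(int(digit))
    return modes

print(modal_digits(readings))
```

If two digits tie for the top spot, `most_common` simply returns whichever it saw first, so a real analysis would want to look at the full counts as well.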
So, to do this I did something that apparently is how they calculate average wind velocities. I split a circle into 10 sections, with 0 being straight up, 5 straight down and the rest spread evenly in between and then using vectors started adding arrows together.
So imagine you start at the centre and the first digit is a 5, so you draw an arrow directly down, with length 1. Then say you get a 3, so carrying on from where you left off you draw another arrow of length 1, this time in the direction of 3. Then you keep doing this for your whole data set. To see how this is a mean, imagine plotting the numbers 0,1,...,9 in sequence: you would draw a 10-sided regular polygon and end up back where you started, because they all cancel out. Here’s a quick sketch to show the principle for a few data points, 0, 3, 5, 1, 3, 0.
I included the resultant vector as a dotted line and by comparing to the reference circle you can see it looks like it’s about 2. Finally, because I picked numbers that didn’t wrap around past 9 (which would confuse things at this point), we can check this with a simple mean calculation: 12/6 = 2.
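The arrow-adding idea can be sketched in a few lines of Python. This is only my reading of the method described above, not the spreadsheet itself: `circular_mean` is a made-up name, 0 is taken as straight up with digits increasing clockwise, and the "strength" is the length of the resultant divided by the number of arrows.

```python
import math

def circular_mean(digits, positions=10):
    """Mean direction of dial digits, plus a strength between 0 and 1."""
    step = 2 * math.pi / positions  # angle between neighbouring digits
    # 0 points straight up, so x uses sin and y uses cos.
    x = sum(math.sin(d * step) for d in digits)
    y = sum(math.cos(d * step) for d in digits)
    strength = math.hypot(x, y) / len(digits)
    angle = math.atan2(x, y) % (2 * math.pi)  # wrap into one full turn
    return angle / step, strength

mean, strength = circular_mean([0, 3, 5, 1, 3, 0])
print(round(mean, 2), round(strength, 2))
```

For the six sketch points this gives a direction of roughly 1.8, which matches the "about 2" read off the drawing. Feeding in 0 to 9 once each gives a strength of essentially zero, which is the polygon closing back on itself.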
So I scaled this technique up to my whole data set by building a spreadsheet to do the hard work and this is what I ended up with.
As you can see it was a bit of a beast. I’ll include a version that you can browse yourself to have a look at the nuts and bolts if you’re interested, but here I’ll just give a summary. I'd wanted to do it with polar coordinates as it would have been way more elegant, but Google Sheets seemed fussy about radians.
The orangeish band on the left is where my data set drops in from a separate page: 4 columns, one for each digit in the code. The next 8 columns generate cumulative x and y coordinates for the resulting vector at each point. The columns between each of the greeny/red bands then give the combined resultant vector, scale it back down to a length one vector, give its direction and also give the strength of the result as a %. I am able to give a strength (sort of like a level of confidence in the result) because as I keep adding vectors I can keep track of how far away from the centre the result has travelled. If results keep agreeing then they’ll get further away than if they disagree and cancel out. 100% confidence would imply every data point was the same (hence they always start green - there is only one data point). 0% confidence would imply they cancelled out perfectly and ended up back where they started.
The final greeny/red band is the strength of the whole 4 digit prediction. The thing to bear in mind is that all of these are cumulative, working their way down the page and recalculating as each new data point is added and hence the data set gets larger.
The first thing I should mention is that this is not exactly my data as I received it. This padlock still exists and is still securing homes so I'm not going to give away the code. For this reason I've introduced a systematic change into my results, adding the same number to each of the digits. This also has the advantage of allowing me to simulate the two different padlocks having the same combination despite the fact that they didn't, allowing me to compare them side by side.
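For anyone who would rather read code than spreadsheet columns, the cumulative part for a single digit wheel might look something like the Python sketch below. It is an approximation of the idea rather than a copy of the sheet: `running_estimates` is a hypothetical name, and it does not attempt the combined 4 digit strength.

```python
import math

def running_estimates(digits, positions=10):
    """After each new reading, return (direction, strength %) so far."""
    step = 2 * math.pi / positions
    x = y = 0.0
    rows = []
    for count, d in enumerate(digits, start=1):
        # Add one more length-1 arrow to the running total.
        x += math.sin(d * step)
        y += math.cos(d * step)
        length = math.hypot(x, y)
        direction = (math.atan2(x, y) % (2 * math.pi)) / step
        # Strength: distance travelled from the centre per arrow, as a %.
        rows.append((round(direction, 2), round(100 * length / count, 1)))
    return rows

for row in running_estimates([4, 4, 1, 4, 7]):
    print(row)
```

The first row is always 100% because there is only one arrow, and two opposite readings such as 0 then 5 bring the strength straight back to 0%.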
First addressing the modal results, both data sets seem confident that the 2nd, 3rd and 4th digits are 2,3 and 4 respectively and the 1st digit appears to be in the range 8-1. Looking at the frequency breakdown across both data sets we see this very clearly and some other things show up.
As you can see, now considering both data sets together, there are four clear peaks visible showing the four modes, 1, 2, 3 and 4. This is in fact the code. That there are definite peaks here and not an even distribution indicates that people do not scramble completely at random and there is a level of system to this. What we are seeing here is that people scramble and often leave at least one digit unscrambled; everyone does this in a different way and they cancel each other out to leave an indication of the original number. Looking back through the data, 29% of all codes had at least 3 of the 4 digits correct, where people have gone for the convenient option of just changing either the first or last digit, 53% had two digits or more correct and 73% had 1 or more digits correct. Just as a comparison, here are the expected outcomes as %’s. This shows this result is well beyond what you’d expect from random mixing.
The other thing to note in the graphs above is further evidence of systematic scrambling. Secondary peaks can be seen 3 places either side of the mode, which indicates a particularly convenient twisting distance. The only graph not to exhibit this characteristic is the fourth digit, however this was the digit most often left unchanged, staying the same in 50% of cases. This can also be seen in the high confidence the predictor has in the 4th digit, which goes up above 65%. This is even higher when only data set #2 is considered.
To the left is the full data set for both padlocks with conditional formatting to indicate how close to correct the value is. This ranges from perfect (Black) to completely wrong, i.e. +/-5 (Light Gray). The main thing to be gained from this is that it highlights the difference between the two padlocks and the fact that they encourage different behaviours. In the top 40% you can see it is darker slightly left of centre, indicating a preference to change the last digits more than the first. I do however also notice a run at the beginning of just changing the middle two digits significantly while leaving the ends unchanged. The bottom 60% of the data shows a different story, with the first two digits regularly changed and the end digits changed less often.
Also evident is that the top 40% is darker in colour than the bottom 60%. This is backed up by replicating the chart from earlier but splitting it back into two data sets, which indicates that padlock #2 encouraged more random mixing than padlock #1 in this situation.
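To put rough numbers on "random mixing", here is a small Python sketch of the baseline. It assumes that a truly random scramble leaves each of the 4 wheels on the correct digit with probability 1 in 10, independently of the others; `prob_at_least` is just an illustrative helper name.

```python
from math import comb

def prob_at_least(k, digits=4, p=0.1):
    """Chance that at least k of the wheels show the correct digit."""
    return sum(
        comb(digits, j) * p**j * (1 - p) ** (digits - j)
        for j in range(k, digits + 1)
    )

for k in (1, 2, 3):
    print(f"at least {k} correct: {prob_at_least(k):.2%}")
```

Under that assumption you would expect about 34% of readings to have at least one digit correct, about 5% to have at least two and well under 1% to have at least three, against the 73%, 53% and 29% observed.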
One final characteristic is that in every frequency graph for data set #2 it is clear that the secondary spike that corresponds to 2-3 less than the modal value is always larger than the mode spike. Given that the padlock was orientated as per the image above when in use, this indicates a preference to pull with the thumb to scramble. This is based on a right handed user.
Finally I have graphed the predictions over time alongside overall confidence. As you can see the overall confidence seems to have settled at around 40-45%. Going back to the initial image of arrows moving around a centre point, this means that on average, for each new data point with its associated length 1 arrow, the resultant point moved 0.4-0.45 units away from the centre in the direction of the mean result. Instinctively it feels that because that number is staying the same the result is getting no more certain, however it must be noted that the number staying constant only indicates that with each new data point the result is getting more certain at a constant rate.
Click here to view the spreadsheet in google sheets. To explore data set #1, #2 or both combined enter 1,2 or 3 respectively into cell 'L5'
You don't need a huge data set to get in the right territory but you'd have to be quite persistent to learn the code for a lock this way. However, there is an easy way of making this sort of attack impossible, always scramble to the same number...
Something like this would get everyone setting it to 0,0,0,0 consistently. But then all it takes is one person to 'just pop out' and only change one number and it will be very obvious...
And finally, bolt croppers would be quicker...