password, Benford's Law, PCI, Hangover

Disobeying Benford's Law, one password at a time

Is it wrong to say I was enjoying toying around with a site the other day... and, if so, is it more wrong to mention that the first thing that came to my mind was, "I wonder if Benford's Law applies to the dataset behind the site?" Turns out, it does not.

"password" was, naturally the most popular password. Since the top 10,000 passwords (i.e. the corpus) didn't cover all combinations of /password[0-9]/, I shifted over to the gargantuan rockyou corpus instead. I didn't expect the distribution to be perfectly random - humans don't work that way; I wondered, instead, if it would follow Benford, i.e. password1 at ~31%, password2 at 18%, and diminishing logarithmically from there.

It turns out that the distribution of /password[0-9]/, as predicted in Hangover 2, was weighted heavily towards password1, a whopping 70%; password2 was second at 10%, and password3 third at 4%. The other numbers did not appear in sequential order. princess and iloveyou, the next two most popular passwords, had similar results:

$ bunzip2 -c rockyou-withcount.txt.bz2 | awk '$2 ~ /^princess[0-9]$/ {print}'  
5187 princess1
683 princess2
410 princess7
391 princess3
266 princess5
252 princess4
224 princess8
204 princess9
145 princess6
30 princess0
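To put the princess counts next to the Benford expectation, here is a small sketch that turns the counts above into percentages. The counts are copied from the awk output; the variable names are illustrative only.

```python
import math

# Counts copied from the awk output above (digit -> occurrences).
princess_counts = {1: 5187, 2: 683, 7: 410, 3: 391, 5: 266,
                   4: 252, 8: 224, 9: 204, 6: 145, 0: 30}

total = sum(princess_counts.values())
observed = {d: n / total for d, n in princess_counts.items()}

print("digit  observed  benford")
for d in sorted(observed):
    # Benford's Law assigns no probability to a leading 0.
    expected = math.log10(1 + 1 / d) if d else 0.0
    print(f"{d:>5}  {observed[d]:8.1%}  {expected:7.1%}")
```

With these counts, princess1 takes roughly two thirds of the total, more than double what Benford would predict for a leading 1.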

Why does this matter? Because we in InfoSec justify imposing password complexity rules on our users by saying that complexity will help defeat brute-force attacks. We claim that enforcing 8-character password schemes requiring mixed-case letters and numbers will result in a user community whose passwords have 47.6 bits (8 x log 62 / log 2) of entropy and, thus, will not fall victim to brute-force password guessing attacks. In reality, most users capitalize the first letter and tack on a "1"; the most likely real entropy of an 8-character complex password is closer to 32.9 bits (7 x log 26 / log 2 + log 1 / log 2). In other words, real users will go out of their way to thwart our mathematically pure plans and intentionally select passwords that are no more brute-force-resistant than a random 6-character password (6 x log 62 / log 2, or about 35.7 bits).
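The entropy arithmetic is easy to check. This is a minimal sketch assuming each character is drawn uniformly and independently from its alphabet, which is the same idealization the formulas above make; the function and variable names are mine.

```python
import math

def entropy_bits(length: int, alphabet_size: int) -> float:
    """Bits of entropy for `length` symbols drawn uniformly from an alphabet."""
    return length * math.log2(alphabet_size)

# 8 random characters from a-z, A-Z, 0-9 (62 symbols).
ideal = entropy_bits(8, 62)

# 7 letters from a 26-symbol alphabet plus a fixed trailing "1",
# which contributes log2(1) = 0 bits.
typical = entropy_bits(7, 26) + entropy_bits(1, 1)

# A truly random 6-character password over the same 62 symbols.
six_random = entropy_bits(6, 62)

print(f"ideal 8-char:   {ideal:.1f} bits")
print(f"typical 8-char: {typical:.1f} bits")
print(f"random 6-char:  {six_random:.1f} bits")
```

Note that the "typical" figure treats the capitalized first letter as fully predictable, so it adds nothing beyond the 26-letter choice.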

What can we (InfoSec folks) do? A few things come to mind. Based on the data above, one could argue that forcing users to place a "0" in the middle of all their passwords is an extremely effective way to slow down brute-force attacks. A more practical approach I have followed is maintaining a small library of threat mitigation charts, not unlike the one below, for common situations. These charts help both me and my conversation partner quickly size up the effectiveness of a given solution, and they demystify the reasons why I will fight harder to preserve some controls (inexpensive, 3+ stars) and not others (expensive, 1-2 stars).

Second, I have become a fan of using visual password strength meters to discourage (if not altogether forbid) users from selecting weak passwords during account registration... and an equally big fan of banning absurdly weak, guessable passwords, as both Hotmail and Twitter* have done, ditto all those grizzled UNIX/Linux sysadmins implementing pam_cracklib. I like this approach for a couple of reasons. It's an inexpensive control: in the case of webapps, it's one of those rare cases where client-side security enforcement is acceptable. It's also a control which, unlike complexity or periodic forced change, most users will not struggle against and defeat in favor of their own convenience. Plus, if you believe the numbers are anywhere close to accurate -- that 40% of users choose a password in the top 100, 79% in the top 500, and 91% in the top 1000... and that 5% of the top 10000 passwords would still comply with the PCI DSS 2.0 (8.5.10) standard of 7 alphanumeric characters -- then banning the top X passwords for your users is a great way to reduce the effectiveness of salami slicing attacks, which are immune to intruder lockout, especially in environments where extreme length requirements are impractical.
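A banned-password check can be very small. The sketch below is only an illustration of the idea, not how Hotmail, Twitter, or pam_cracklib implement it; the function names and the four-entry sample list are hypothetical, and a real deployment would load the top X passwords from a file.

```python
def load_banned(lines):
    """Build a lookup set from an iterable of passwords, one per item."""
    return {line.strip().lower() for line in lines if line.strip()}

def is_banned(candidate: str, banned: set) -> bool:
    """True if the candidate matches a banned password, ignoring case."""
    return candidate.lower() in banned

# Tiny stand-in list; a real one would hold the top X passwords.
SAMPLE_BANNED = load_banned(["password", "password1", "princess", "iloveyou"])

for attempt in ["Password1", "princess", "xK7#qLm2vT"]:
    verdict = "rejected" if is_banned(attempt, SAMPLE_BANNED) else "accepted"
    print(f"{attempt}: {verdict}")
```

Comparing case-insensitively is a design choice made here so that "Password1" is caught along with "password1"; it costs nothing and closes the capitalize-the-first-letter loophole.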

Finally, and this will be the most controversial to-do: we must stop punishing our users with draconian requirements that add little security value and are intuitively difficult to get right. More on this later, I promise!

* Doing this client-side, as Twitter has done, apparently does involve rot-13 encoding the password strings, lest some unsavory phrases result in one's web pages being blocked by aggressive content filters.
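As a rough illustration of that trick (not Twitter's actual code), ROT13 is available in Python's standard library as the "rot_13" text transform. The helper names below are hypothetical. ROT13 is its own inverse, so the same call encodes and decodes, and it offers no secrecy at all; it only keeps the literal strings out of the page source.

```python
import codecs

def rot13(text: str) -> str:
    """Apply ROT13; applying it twice returns the original text."""
    return codecs.encode(text, "rot_13")

# Hypothetical banned list as it might be shipped in a page: ROT13-encoded.
ENCODED_BANNED = {rot13(p) for p in ["password", "princess", "iloveyou"]}

def is_banned_encoded(candidate: str) -> bool:
    """Encode the candidate and compare against the encoded list."""
    return rot13(candidate.lower()) in ENCODED_BANNED

print(rot13("password"))
print(is_banned_encoded("Princess"))
```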