On May 9, 2012, the headlines read ‘58000+ Twitter username and passwords are leaked’. Twitter is one of the most used social networks, with 140 million active users. The security of such a network had been breached, and a considerable number of passwords, along with their usernames, were made available on Pastebin, a paste-and-share web service. In response, Twitter immediately deactivated all the hacked accounts.
I happened to see the data at the right time and copied everything (it has since been removed from the site). Of course, nothing can be done with it, as all those accounts were deactivated. But it was useful for a long-standing idea of mine: analyzing how people choose their passwords. Though those passwords are no longer valid, real people once chose them. So I happily took them as sample data and started analyzing. I present the results here. Despite my interest in statistics, this is my first article reporting a statistical result, so I will try to do my best.
Exactly 58,918 passwords and their corresponding usernames were pasted on Pastebin (some sources claim 58,978 were leaked, which I think counts the 60 blank lines that were present in the raw data). Not all of these seemed to be valid passwords. A considerable number looked like computer-generated junk, with small letters, caps and numbers mixed in an inhuman way. I decided to leave them out, as they could distort the results. All the user IDs that were e-mail addresses seemed to have valid passwords, while all the rest, which were plain usernames, seemed to have computer-generated junk. So I removed 21,802 username and password pairs from the data. The remaining 37,116 records were selected for analysis.
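The filtering step above can be sketched in a few lines of Python. This is only an illustration, not the exact procedure used for this article: the `userid:password` line format, the `sep` parameter and the function names `parse_records` and `keep_email_records` are all assumptions, and the "is an e-mail address" test is reduced to checking for an `@` sign.

```python
def parse_records(lines, sep=":"):
    """Split raw lines into (userid, password) pairs.

    Assumption: each line looks like 'userid<sep>password'.
    Blank lines and lines without the separator are skipped.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line or sep not in line:
            continue
        userid, password = line.split(sep, 1)
        records.append((userid, password))
    return records


def keep_email_records(records):
    """Keep only the records whose user ID looks like an e-mail address."""
    return [(userid, pw) for userid, pw in records if "@" in userid]


# Tiny made-up sample, not taken from the leaked data.
sample = [
    "alice@example.com:123456",
    "",
    "bob_99:xK9qZ2pL",
    "carol@example.org:sunshine",
]
records = parse_records(sample)
selected = keep_email_records(records)
print(len(records), len(selected))
```

On the real data, the same two steps would correspond to going from 58,918 records to 37,116.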
The first basic analysis is on the length of the passwords. 97.5% of the passwords were 4 to 12 characters long. The most preferred length was 6 (29.4% of all passwords), followed closely by 8 (27.9%), leaving all other lengths far behind. The longest password found was 35 characters long! (I wonder how that user managed to log in every time.) The following chart maps the number of passwords against length. It contains data only for lengths from 4 to 13, as the counts for other lengths are too low to plot on the chart.
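A length breakdown like this one can be computed with a counter. The sketch below is illustrative only; `length_distribution` is a hypothetical helper name, and the sample passwords are made up.

```python
from collections import Counter


def length_distribution(passwords):
    """Map each password length to (count, percentage of all passwords)."""
    counts = Counter(len(p) for p in passwords)
    total = len(passwords)
    return {
        length: (n, 100.0 * n / total)
        for length, n in sorted(counts.items())
    }


# Made-up sample: two passwords of length 6 and two of length 8.
dist = length_distribution(["123456", "abcdef", "password", "qwerty12"])
for length, (n, pct) in dist.items():
    print(f"length {length}: {n} passwords ({pct:.1f}%)")
```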
The security of a password depends heavily on the character set chosen for it. Here I categorize the character sets used into eight groups:
- Numbers only
- Small letters only
- Caps only
- Numbers and Small letters
- Numbers and Caps
- Small & Caps
- Numbers, Small & Caps
- Unclassified (Special characters)
The results are quite shocking. Although numbers-only passwords are the easiest to break, 43.5% of the passwords consist of numbers alone. On the other hand, a mix of small letters, caps and numbers, which is considered the most secure of these choices, is preferred by only 0.8% of people.
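The eight categories listed above can be assigned mechanically. The following is a minimal sketch, assuming that "small" and "caps" mean ASCII letters and that any other character (or an empty password) puts the password in the unclassified group; `classify` and `CATEGORIES` are hypothetical names.

```python
import string
from collections import Counter

# Key: (has digit, has small letter, has capital letter)
CATEGORIES = {
    (True, False, False): "Numbers only",
    (False, True, False): "Small letters only",
    (False, False, True): "Caps only",
    (True, True, False): "Numbers and Small letters",
    (True, False, True): "Numbers and Caps",
    (False, True, True): "Small & Caps",
    (True, True, True): "Numbers, Small & Caps",
}
UNCLASSIFIED = "Unclassified (Special characters)"
ALLOWED = string.ascii_letters + string.digits


def classify(password):
    """Return the character-set category of a password."""
    # Any character outside ASCII letters and digits means "unclassified".
    if any(c not in ALLOWED for c in password):
        return UNCLASSIFIED
    key = (
        any(c in string.digits for c in password),
        any(c in string.ascii_lowercase for c in password),
        any(c in string.ascii_uppercase for c in password),
    )
    # An empty password has no characters at all and is also unclassified.
    return CATEGORIES.get(key, UNCLASSIFIED)


# Made-up sample passwords.
sample = ["123456", "sunshine", "Pass123", "p@ssword", "abcDEF"]
summary = Counter(classify(p) for p in sample)
for category, n in summary.items():
    print(f"{category}: {100.0 * n / len(sample):.1f}%")
```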
‘Have you selected a password which is not used by others?’ That’s uniqueness. The question now is ‘How many people have selected a password that was not selected by anyone else?’. A unique password is less likely to be broken through dictionary or rainbow table attacks, since others, including attackers, may not have thought of it.
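Counting unique passwords amounts to counting how often each password occurs and keeping those that occur exactly once. A minimal sketch, with `uniqueness` as a hypothetical helper name and made-up sample data:

```python
from collections import Counter


def uniqueness(passwords):
    """Return (passwords chosen by exactly one person, distinct passwords)."""
    counts = Counter(passwords)
    chosen_once = sum(1 for n in counts.values() if n == 1)
    return chosen_once, len(counts)


# 'a' is shared by two people; 'b' and 'c' are unique.
chosen_once, distinct = uniqueness(["a", "b", "a", "c"])
print(chosen_once, distinct)
```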
Most Frequent Passwords
Once uniqueness is in question, the next thing that comes to mind is ‘If passwords are used more than once, which is the most frequently used password?’. Unfortunately, the result is very shocking: all the most frequent candidates are numbers, and they are easily guessable too. The top one is ‘123456’, appearing 688 times, which is nearly 2% of all the passwords!
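A ranking like this can be produced with `collections.Counter.most_common`. The sketch below is illustrative; `top_passwords` is a hypothetical helper name and the sample list is made up. The last line only re-checks the share quoted above: 688 out of 37,116 is about 1.85%.

```python
from collections import Counter


def top_passwords(passwords, n=10):
    """Return the n most frequent passwords as (password, count, percent)."""
    total = len(passwords)
    return [
        (pw, count, 100.0 * count / total)
        for pw, count in Counter(passwords).most_common(n)
    ]


# Made-up sample data.
sample = ["123456", "123456", "111111", "123456", "sunshine", "111111"]
ranking = top_passwords(sample, n=2)
for pw, count, pct in ranking:
    print(f"{pw}: {count} times ({pct:.1f}%)")

# Share of '123456' in the analyzed records, from the figures in the text.
share = 100.0 * 688 / 37116
```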