Ever since I read XKCD 936, I’ve been a big fan of the Diceware
approach for picking strong passwords, where you randomly pick words from a
list to create a memorable passphrase. Sometimes, I need to generate Diceware
passwords in Dutch for friends and relatives. Unfortunately, the Dutch word list from
the Diceware page contains
many uncommon words, non-existent words, duplicate words (yikes!), numbers, and special characters (as the
passphrase ijler 100 leperd akolei kolkje on that page shows), all of which
diminish the memorability and
usability of the generated passwords. I therefore created my own, improved lists,
which you can try out directly from your browser on my password generator (or in the Dutch version). The list consists of the most common Dutch words, and has the added
benefit that it only contains words that don’t weaken security when leaving
out spaces between words. This post discusses the details of how I composed this list. The process and
links can be used as a guide to generate Diceware lists for other languages.
The Candidate Word List
The candidate word list is the list of words from which 7776 will be picked for the final list. I used the following requirements:
1. It needs to contain as many Dutch words as possible. The more words we have in the candidate list, the more choice we have to pick the best ones in the end.
2. It needs to contain only (Dutch) words. Most (all?) existing Diceware lists contain numbers and symbols, but I find this makes the passphrases potentially harder to memorize.
3. The words should be short enough to type.
4. The words should be easy enough to type (i.e. no special characters).
5. No combination of words in the list may form another word in the list. To make passphrases easier to type, some people like to leave out the spaces between the words of their passphrase. An added benefit is that acoustic monitoring becomes much harder (the distinct sound of the spacebar makes it easier to guess the lengths of the individual words). However, the Diceware FAQ advises against this, because leaving out spaces potentially weakens your passphrase: combinations of words can yield other words. Picking the words in the list in such a way that this isn’t possible removes that drawback.
To satisfy 1 and 2, I used the OpenTaal Woordenlijst, a very extensive list of over 160000 Dutch words and names, as the basis of my Diceware list. Since this list wasn’t perfect (it contained some non-existent words), I ran the list through the official Dutch spelling search engine to remove the invalid entries. The result is a list of ±121000 valid Dutch words.
To satisfy 3 and 4, I filtered out names, abbreviations, words with special characters, and words longer than 6 characters (the same maximum as the original English Diceware list).
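The filtering step above could look roughly like the sketch below. It is not the code used for the actual lists: the function names are made up for illustration, and treating every word that is not plain lowercase a-z as a name, abbreviation, or special-character word is a simplifying assumption.

```python
import re

MAX_LENGTH = 6  # same maximum as the original English Diceware list

# Only plain lowercase a-z: this drops names (capitalised), abbreviations
# (capitals or dots), and words with diacritics, hyphens or apostrophes.
PLAIN_WORD = re.compile(r"[a-z]+")


def is_candidate(word):
    """Return True if the word is short and easy enough to type."""
    return len(word) <= MAX_LENGTH and PLAIN_WORD.fullmatch(word) is not None


def filter_candidates(words):
    """Keep the typable words, deduplicated and sorted."""
    return sorted({w for w in words if is_candidate(w)})


sample = ["fiets", "Amsterdam", "café", "a.u.b.", "zo'n", "boterham", "kaas", "kaas"]
print(filter_candidates(sample))  # ['fiets', 'kaas']
```

Using a set also gets rid of duplicate entries, one of the problems with the old list.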
Finally, to satisfy 5, I applied a search algorithm to remove words that occur a lot in other words (avoiding the loss of too many words because of a few simple words), and then removed any words composed of other words in the list.
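As an illustration of the second half of that step, here is a minimal sketch that removes every word that can be written as two or more other words from the list stuck together. It does not reproduce the search algorithm mentioned above (the one that first drops words occurring in many other words), and the function names are hypothetical.

```python
def is_composite(word, vocabulary):
    """True if `word` can be split into two or more words from `vocabulary`."""
    n = len(word)
    # reachable[i] is True when word[:i] is a concatenation of vocabulary words.
    reachable = [False] * (n + 1)
    reachable[0] = True
    for end in range(1, n + 1):
        for start in range(end):
            if not reachable[start]:
                continue
            # Skip the trivial "split" that is just the whole word itself.
            if start == 0 and end == n:
                continue
            if word[start:end] in vocabulary:
                reachable[end] = True
                break
    return reachable[n]


def remove_composites(words):
    """Drop every word that is built from other words in the same list."""
    vocabulary = set(words)
    return sorted(w for w in vocabulary if not is_composite(w, vocabulary))


sample = ["in", "op", "zet", "inzet", "opzet", "boom"]
print(remove_composites(sample))  # ['boom', 'in', 'op', 'zet']
```

In this toy example, two short words ("in" and "op") cost two longer ones. Which side of such a conflict to drop is exactly the trade-off the search algorithm has to make.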
After going through the entire process, I was left with a list of ±10500 candidate words, more than the 7776 words we need for a full Diceware list.
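For reference, the number 7776 follows from the dice: five rolls of a six-sided die give 6^5 possible outcomes, one per word. The small calculation below shows this, along with the resulting randomness per word; the six-word passphrase is just an illustrative length.

```python
import math

DICE_SIDES = 6
ROLLS_PER_WORD = 5

list_size = DICE_SIDES ** ROLLS_PER_WORD
bits_per_word = math.log2(list_size)

print(list_size)                    # 7776
print(round(bits_per_word, 2))      # 12.92
print(round(6 * bits_per_word, 1))  # 77.5 (bits for a six-word passphrase)
```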
Picking the Best Words
After generating a candidate list of ±10500 words, the best 7776 words (5 dice rolls) need to be picked for the final Diceware list. The most obvious criteria for ‘best’ are memorability and typability (the choice of words has no impact on security).
Since memorability is hard to measure, I tried approximating it by choosing the most frequently used words instead. In order to determine that, I compiled a collection of Dutch word frequency lists from around the web:
- OpenSubtitles.org Frequency Word List: This is a list of 790000 words occurring in the entire database of movie subtitles from OpenSubtitles.org. There is some noise in this list from typos and from English words left over from missing translations, but as this list is only used to score the words (not to generate the candidate list), this is fine.
- University of Leipzig Corpora Collection: The University of Leipzig offers a collection of statistics from analyzing newspapers in many different languages, including Dutch. The Dutch frequency list contains over 700000 words. This list also isn’t perfect, and contains invalid words, but works well for scoring.
- Since children’s books and stories contain the simplest (and therefore most memorable) words, I also crawled some Dutch fairy tale websites to collect word frequencies on those pages as well.
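Building a frequency list from crawled pages boils down to counting words. The sketch below assumes the pages have already been downloaded and stripped down to plain text; the crawling itself is left out, and the function name and sample sentences are made up.

```python
import re
from collections import Counter

# Runs of letters, including accented ones; digits and underscores are excluded.
WORD = re.compile(r"[^\W\d_]+")


def count_words(texts):
    """Count how often each lowercased word occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(WORD.findall(text.lower()))
    return counts


stories = ["Er was eens een prinses.", "De prinses woonde in een kasteel."]
counts = count_words(stories)
print(counts["prinses"])  # 2
print(counts["kasteel"])  # 1
```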
Other sources of frequency lists for other languages can be found on Wiktionary’s page on Frequency Lists and in Google’s NGram Dataset, a huge collection of word and phrase frequencies from literary works between 1500 and 2008. Unfortunately, the latter doesn’t include Dutch, so it’s not usable for the list I created.
In order to combine the different frequency lists to give a single score to a word, the frequencies need to be normalized. As is expected from natural language texts, a few words occur disproportionately often compared to most other words. For example, in the frequency distribution of the OpenSubtitles list, most words fall in the bottom frequency bin.
In order to flatten this curve out a bit, I scaled the frequencies logarithmically, which gives a more evenly distributed curve, and makes combining different scores ‘fairer’.
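One way to do such a logarithmic scaling is sketched below. The exact formula is an assumption (log(1 + count), divided by the value for the most frequent word so scores land between 0 and 1), and the counts in the example are invented.

```python
import math


def log_scale(frequencies):
    """Map raw counts to the range (0, 1] on a logarithmic scale.

    Assumes a non-empty dict with at least one positive count.
    """
    max_log = math.log1p(max(frequencies.values()))
    return {word: math.log1p(count) / max_log for word, count in frequencies.items()}


raw = {"de": 1_000_000, "fiets": 1_000, "kolkje": 1}
scaled = log_scale(raw)
print({word: round(score, 2) for word, score in scaled.items()})
# {'de': 1.0, 'fiets': 0.5, 'kolkje': 0.05}
```

On a linear scale, "fiets" would score 0.001 relative to "de" and be indistinguishable from the rarest words. After scaling, it sits in the middle of the range.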
To get to a single score per word, I combined the scaled frequencies, applying a higher weight to the less noisy fairy tale list, and a lower weight to the very noisy OpenSubtitles list.
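A weighted combination like this could be sketched as follows. The weights and scores are placeholders rather than the values used for the real lists, and the function names are hypothetical; the input is assumed to be already scaled to the 0-1 range.

```python
def combine_scores(scaled_lists, weights):
    """Weighted average of per-source scores; a missing word counts as 0."""
    total_weight = sum(weights.values())
    combined = {}
    for name, scores in scaled_lists.items():
        for word, score in scores.items():
            combined[word] = combined.get(word, 0.0) + weights[name] * score / total_weight
    return combined


def pick_best(candidates, scores, count=7776):
    """Return the `count` highest-scoring candidates in alphabetical order."""
    ranked = sorted(candidates, key=lambda w: (-scores.get(w, 0.0), w))
    return sorted(ranked[:count])


# Placeholder scores, already scaled to 0..1.
scaled_lists = {
    "fairy_tales": {"kasteel": 0.9, "fiets": 0.4},
    "leipzig": {"fiets": 0.6, "kasteel": 0.3, "beurs": 0.8},
    "opensubtitles": {"fiets": 0.7, "okay": 0.9},
}
# Placeholder weights: cleanest source highest, noisiest lowest.
weights = {"fairy_tales": 3, "leipzig": 2, "opensubtitles": 1}

scores = combine_scores(scaled_lists, weights)
print(pick_best(["fiets", "kasteel", "beurs"], scores, count=2))  # ['fiets', 'kasteel']
```

Only words from the candidate list are ranked, so noise in the frequency lists (such as "okay" here) never ends up in the result.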
Besides frequencies, another metric that could be taken into account is the length of the words. After experimenting with this, I decided that the result wasn’t better, so I left it out in my final list.
The Final Dutch Diceware Lists
Below is the result of picking the best 7776 (5 dice rolls) words using the scoring described above. The average word length is 5.0 characters – the same as the old Dutch Diceware list.
Composite Words: easier to remember | No Composite Words: a little bit safer
There are different list variants, depending on your needs:
- Standard Diceware lists, with one word per 5 dice rolls. Roll the dice 5 times, look up the word corresponding to the 5 dice rolls, and repeat the process for more words.
- Machine-friendly lists, with 8192 (2^13) words instead of the 7776 words addressed by 5 dice rolls. This is a handier format for creating passphrases with machine-generated random numbers.
- Lists with No Composite Words, so spaces can be left out between words without weakening the passphrase. (see Requirement 5 above)
- Lists with Composite Words, with a few more memorable words, but where password strength may potentially be weaker when leaving out spaces.
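To make the difference between the dice and machine-friendly formats concrete, here is a small sketch of both lookups. It assumes the words are loaded into a Python list in file order, with the Diceware list sorted by dice key from 11111 to 66666; the function names and the stand-in word list are made up.

```python
import secrets


def dice_key_to_index(rolls):
    """Convert five dice rolls (each 1-6) to a 0-based index into a 7776-word list."""
    if len(rolls) != 5 or any(roll not in range(1, 7) for roll in rolls):
        raise ValueError("expected five rolls between 1 and 6")
    index = 0
    for roll in rolls:
        # Treat the rolls as the digits of a base-6 number.
        index = index * 6 + (roll - 1)
    return index


def machine_passphrase(words, word_count=6, separator=" "):
    """Pick words from an 8192-word list using 13 random bits per word."""
    if len(words) != 8192:
        raise ValueError("expected a machine-friendly list of 8192 words")
    return separator.join(words[secrets.randbits(13)] for _ in range(word_count))


print(dice_key_to_index([1, 1, 1, 1, 1]))  # 0
print(dice_key_to_index([6, 6, 6, 6, 6]))  # 7775

stand_in_list = [f"woord{i}" for i in range(8192)]  # placeholder for a real list
print(machine_passphrase(stand_in_list, word_count=4))
```

Because 8192 is a power of two, every 13-bit random number maps to exactly one word, so no random values need to be thrown away. With a No Composite Words list, `separator=""` can be used as well.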
If you want to generate lists yourself, you can have a look at the code I used to generate these lists. (Note that it’s still rough around the edges.)