TL;DR: You can find Dutch Diceware word lists here. TL;DR 🇳🇱 (for Dutch readers): You can find Diceware word lists here, or generate passwords with them here.
Ever since I read XKCD 936, I’ve been a big fan of the Diceware
approach for picking strong passwords, where you randomly pick words from a
list to create a memorable passphrase. Sometimes, I need to generate Diceware
passwords in Dutch for friends and relatives. Unfortunately, the Dutch word list from
the Diceware page contains
many uncommon words, non-existent words, duplicate words (yikes!), numbers, and characters (as the
example ijler 100 leperd akolei kolkje on that page proves), which
diminishes the memorability and
usability of the generated passwords. I therefore created my own, improved lists,
which you can try out directly from your browser on my password generator (or in the Dutch version). The list consists of the most common Dutch words, and has the added
benefit that it only contains words that don’t weaken security when leaving
out spaces between words. This post discusses the details of how I composed this list. The process and
links can be used as a guide to generate Diceware lists for other languages.
The Candidate Word List
The candidate word list is the list of words from which 7776 will be picked
for the final list. I used the following requirements:
1. It needs to contain as many Dutch words as possible. The more words we have
in the candidate list, the more choice we have to pick the best ones in the end.
2. It needs to contain only (Dutch) words. Most (all?) existing Diceware lists contain
numbers and symbols, but I find this makes the passphrases potentially harder
to remember and use.
3. The words should be short enough to type.
4. The words should be easy enough to type (i.e. no special characters).
5. Any combination of words in the list must not result in another word in the
list. To make passphrases easier to type, some people like to avoid spaces
between the different words of their passphrase. An added benefit is that
acoustic monitoring (the distinct sound of the spacebar makes it easier to guess
the lengths of the different words) becomes much harder. However, the Diceware FAQ
advises against it,
because leaving out spaces potentially weakens your password, since
combinations of words can yield other words. By picking the words
in the list in such a way that this isn’t possible, this drawback no longer holds.
To satisfy 1 and 2, I used the OpenTaal
Woordenlijst, a very extensive list
of over 160000 Dutch words and names, as the basis of my Diceware list. Since this
list wasn’t perfect (it contained some non-existent words), I
ran the list through the
official Dutch spelling search engine to
remove the invalid entries. The result is a list of ±121000 valid Dutch words.
To satisfy 3 and 4, I filtered out names, abbreviations, words with
special characters, and words longer than 6 characters (the same maximum as the
original English Diceware list).
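The filtering for requirements 3 and 4 could be sketched roughly as follows. This is only an illustration, not the script used for the actual lists: it assumes that names and abbreviations can be recognised by capital letters or dots, which will miss some cases, and the helper names `is_candidate` and `filter_candidates` are made up for this example.

```python
import re

MAX_LENGTH = 6  # same maximum as the original English Diceware list

# Assumption: a usable word consists of plain lowercase a-z only. This drops
# capitalised names, abbreviations with dots or capitals, and words with
# special characters such as é, ë, hyphens or apostrophes.
PLAIN_WORD = re.compile(r"[a-z]+")


def is_candidate(word: str) -> bool:
    """True if the word is short enough and easy enough to type."""
    return len(word) <= MAX_LENGTH and PLAIN_WORD.fullmatch(word) is not None


def filter_candidates(words):
    """Return the sorted, de-duplicated words that pass the filter."""
    return sorted({w for w in words if is_candidate(w)})
```

For example, `filter_candidates(["huis", "Amsterdam", "café", "a.u.b.", "fiets", "bibliotheek"])` keeps only `fiets` and `huis`.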
Finally, to satisfy 5, I applied a search
algorithm to remove words that occur a lot in other words (avoiding the loss of
too many words because of a few simple words), and then removed any words
composed of other words in the list.
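One possible way to implement these two steps is sketched below. It is a simplified guess at the idea, not the exact search algorithm used: the function names and the `max_uses` threshold are hypothetical, and the first step only looks at two-part splits when counting how often a word is used as a building block.

```python
from collections import Counter


def can_split(word, vocabulary):
    """True if `word` is a concatenation of two or more words in `vocabulary`."""
    n = len(word)
    # reachable[i] is True when word[:i] can be written as a sequence of
    # vocabulary words (using the whole word as a single piece is not allowed).
    reachable = [False] * (n + 1)
    reachable[0] = True
    for end in range(1, n + 1):
        for start in range(end):
            if start == 0 and end == n:
                continue  # the word itself does not count as a split
            if reachable[start] and word[start:end] in vocabulary:
                reachable[end] = True
                break
    return reachable[n]


def remove_composites(words, max_uses=2):
    """Drop frequent building blocks first, then drop the remaining composites."""
    vocabulary = set(words)
    uses = Counter()
    for word in vocabulary:
        for i in range(1, len(word)):
            head, tail = word[:i], word[i:]
            if head in vocabulary and tail in vocabulary:
                uses[head] += 1
                uses[tail] += 1
    # Step 1: removing one common building block (hypothetical threshold)
    # saves every composite that was built from it.
    vocabulary -= {w for w, count in uses.items() if count > max_uses}
    # Step 2: remove the words that are still composed of other words.
    return sorted(w for w in vocabulary if not can_split(w, vocabulary))
```

With the words `in`, `zet`, `inzet`, `gang`, `ingang`, `val` and `inval`, removing the single word `in` in the first step keeps `inzet`, `ingang` and `inval` in the list, where removing composites directly would have cost three words instead of one.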
After going through the entire process, I was left with a list of ±10500 candidate
words, more than the 7776 words we need for a full Diceware list.
Picking the Best Words
After generating a candidate list of ±10500 words, the best 7776 words (6^5, the number of outcomes of 5 dice rolls)
need to be picked for the final Diceware list. The most obvious criteria for
‘best’ are memorability and typability (the choice of words has no impact on
the strength of a randomly generated passphrase).
Since memorability is hard to measure, I tried approximating it by choosing
the most frequently used words instead. In order to determine that, I compiled
a collection of Dutch word frequency lists from around the web:
OpenSubtitles.org Frequency Word List: This is a list of 790000 words occurring in the entire database of
movie subtitles from OpenSubtitles.org.
There is some noise in this list from typos and English words from missing
translations, but as this list is only used to score the words (not to
generate the candidate list), this is fine.
University of Leipzig Corpora Collection: The University of Leipzig offers a collection of statistics
from analyzing newspapers in many different languages, including Dutch.
The Dutch frequency list contains over 700000 words. This list also
isn’t perfect, and contains invalid words, but works well for scoring.
Since children’s books and stories contain the simplest (and therefore most memorable)
words, I also crawled some Dutch fairy tale websites to collect word
frequencies from those pages as well.
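Building a frequency list from collected text can be as simple as the sketch below. It assumes the pages have already been downloaded and stripped of HTML (the crawling itself is not shown), and the function name `count_words` is made up for this example.

```python
import re
from collections import Counter

# Runs of letters, including accented ones; digits and underscores are excluded.
WORD = re.compile(r"[^\W\d_]+")


def count_words(texts):
    """Count how often each lowercased word occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(match.group().lower() for match in WORD.finditer(text))
    return counts
```

The resulting `Counter` maps each word to its number of occurrences, which is the same shape as the downloaded frequency lists once they are parsed.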
Other sources of frequency lists for other languages can be found on Wiktionary’s page on Frequency Lists
and in Google’s NGram Dataset, a huge collection of word and phrase frequencies from literary works between 1500 and 2008. Unfortunately, the latter doesn’t
include Dutch, so it’s not usable for the list I created.
In order to combine the different frequency lists to give a single score to
a word, the frequencies need to be normalized. As is expected from natural
language texts, a few words occur disproportionately often compared to
most other words. For example, when looking at the frequency distribution of
the OpenSubtitles list, you can see that most words occur in the bottom frequency
range. In order to flatten this curve out a bit, I scaled the frequencies
logarithmically, which gives a more evenly distributed curve, and makes combining
different scores ‘fairer’.
To get to a single score per word, I combined the scaled frequencies,
applying a higher weight to the less noisy fairy tale list, and a lower
weight to the very noisy OpenSubtitles list.
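The scaling and weighting can be sketched like this. The exact formula and weights are not stated here, so everything in this example is an assumption: it scales each list with `log(1 + count)` relative to that list's most frequent word, takes a weighted average, and uses made-up weights and helper names.

```python
import math


def log_scale(counts):
    """Map raw counts to the range 0..1 on a logarithmic scale."""
    if not counts:
        return {}
    top = math.log1p(max(counts.values())) or 1.0  # avoid dividing by zero
    return {word: math.log1p(count) / top for word, count in counts.items()}


def combine(sources):
    """Weighted average of scaled frequencies; `sources` holds (counts, weight) pairs."""
    total_weight = sum(weight for _, weight in sources)
    scores = {}
    for counts, weight in sources:
        for word, value in log_scale(counts).items():
            scores[word] = scores.get(word, 0.0) + weight * value / total_weight
    return scores


def pick_best(candidates, scores, size=7776):
    """Keep the `size` highest-scoring candidates, returned in alphabetical order."""
    ranked = sorted(candidates, key=lambda w: (-scores.get(w, 0.0), w))
    return sorted(ranked[:size])
```

A call could look like `combine([(subtitles, 1.0), (newspapers, 2.0), (fairy_tales, 3.0)])`, where the three dictionaries are the parsed frequency lists and the weights are only placeholders. Candidates that appear in none of the lists get a score of 0 and end up at the bottom of the ranking.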
Besides frequencies, another metric that could be taken into account is the
length of the words. After experimenting with this, I decided that the result
wasn’t better, so I left it out in my final list.
The Final Dutch Diceware Lists
Below is the result of picking the best 7776 (5 dice rolls) words using the
scoring described above. The average word length is 5.0 characters – the same
as the old Dutch Diceware list.
Composite Words: Easier to remember
No Composite Words: A little bit safer
There are different list variants, depending on your needs:
Standard Diceware lists, with one word per 5 dice rolls. Roll the dice 5 times, look up the word corresponding to the 5 dice rolls, and repeat the process for more words.
Machine-friendly lists, with 8192 words (2^13, so exactly 13 random bits per word) instead of the 7776 words that fit 5 dice rolls. This is
a handier format for creating passphrases with machine-generated random numbers.
Lists with No Composite Words, so spaces can be left out between words without
weakening the passphrase. (see Requirement 5 above)
Lists with Composite Words, with a few more memorable words, but where
password strength may potentially be weaker when leaving out spaces.
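To illustrate how the two list formats are used, here is a small sketch. It assumes the list is loaded as a Python list of words in file order, with the first word corresponding to dice rolls 1-1-1-1-1; the function names are made up for this example, and `secrets` is used because it draws from the operating system's secure random source.

```python
import math
import secrets


def index_from_rolls(rolls):
    """0-based list index for five dice rolls, e.g. [1, 1, 1, 1, 2] -> 1."""
    index = 0
    for roll in rolls:
        index = index * 6 + (roll - 1)
    return index


def dice_key(index):
    """Five-digit dice key ('11111' to '66666') for a 0-based list index."""
    digits = []
    for _ in range(5):
        index, face = divmod(index, 6)
        digits.append(str(face + 1))
    return "".join(reversed(digits))


def passphrase(words, count=6, separator=" "):
    """Pick `count` words uniformly at random from the list."""
    return separator.join(secrets.choice(words) for _ in range(count))


def entropy_bits(list_size, count):
    """Entropy of a passphrase of `count` uniformly chosen words."""
    return count * math.log2(list_size)
```

With a standard list, each word adds `math.log2(7776)`, roughly 12.9 bits; with a machine-friendly list of 8192 words, each word adds exactly 13 bits. Passing `separator=""` only makes sense with the lists without composite words.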