Date: Sun, 8 Dec 2019 19:59:32 +0100
From: Solar Designer <solar@...nwall.com>
To: passwdqc-users@...ts.openwall.com
Subject: Re: curse words in passwords

Hi,

I finally approached the task of cleaning up our word list used for
generated passphrases, and adding other words to make up for the removed
ones and keep the count at 4096.

This took some trial and error - e.g., some approaches didn't produce
enough words.  I describe below the approach I ended up settling on.

I started with passwdqc's current list, dropped from it everything that
started with a capital letter (225 words) and added to it the overlap of
EFF's Diceware with the combination of English/1-tiny/lower.gz and
English/2-small/lower.gz in Openwall's wordlists collection limited to
lengths 3 to 6.  This resulted in 5295 words.
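For illustration only, the automated step above can be sketched in Python
roughly as follows.  This is a minimal sketch, not the actual script used:
the function name build_draft and its parameters are hypothetical, the
inputs are assumed to be already loaded as sets of words, and it assumes
the 3 to 6 length limit applies to the added overlap.

```python
def build_draft(old_words, eff_words, tiny_words, small_words,
                min_len=3, max_len=6):
    """Return a sorted draft word list (hypothetical helper).

    old_words   -- the existing passwdqc list
    eff_words   -- words from EFF's Diceware list
    tiny_words  -- words from English/1-tiny/lower.gz
    small_words -- words from English/2-small/lower.gz
    """
    # Drop everything that starts with a capital letter.
    kept = {w for w in old_words if w and not w[0].isupper()}

    # Combine the two Openwall lists, then intersect with the EFF list.
    common = set(tiny_words) | set(small_words)
    overlap = set(eff_words) & common

    # Assumption: the length limit applies to the words being added.
    overlap = {w for w in overlap if min_len <= len(w) <= max_len}

    return sorted(kept | overlap)


if __name__ == "__main__":
    draft = build_draft(
        old_words={"Alice", "apple", "zebra"},
        eff_words={"apple", "banana", "cat", "to", "orbit"},
        tiny_words={"cat", "to"},
        small_words={"orbit", "banana", "dog"},
    )
    print(draft)
```

Reading the actual files (gzip decompression, stripping any non-word
columns) is left out on purpose, since it depends on each file's format.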

Then I proceeded with manual edits of the list, mostly removing words
but also adding some.  My current work-in-progress list has 4728 words.
This is more than 4096, which is a good problem to have - we can still
identify and drop more potentially problematic words without having to
come up with replacements.

My removals so far included these categories:

- Words that are too similar to each other in pronunciation (e.g.,
right, rite, wright, and write).

- Words for which different valid spellings exist (e.g., gray and grey).

- Plural forms of nouns (leave only the singular, except where plural is
the more common form in which case leave only the plural).

- Given names, countries, cities, languages, nationalities, and
everything else that normally starts with a capital letter (e.g., even
all month names, some of which are also people's given names).

- Pronouns and also the word "user".

- Curse or rude words, and words with such slang meanings.

- Words related to race, etc.  Potential skin colors.

- Words related to religion, as well as "pig", "piggy", "pork".

- Words related to sexuality, and innuendo.  Some body parts.

- Words related to drugs.

- Words related to death, murder, burial.

- Certain medical conditions (might need to drop benign ones as well,
for completeness?)

- Other likely trigger words, e.g. related to pregnancy and abortion.

- Some other words that are OK on their own but would fall in the above
categories if seen paired with other included words.

- Words like "bully", "harass", and "offend" (gone too far?)

- Words that are too obscure (but there are still many).

I also thought of (and briefly attempted, separately) achieving a nice
property that EFF's Diceware list has: that words can be concatenated
without separator characters yet produce no duplicate passphrases for
different word combinations.  Unfortunately, to achieve it some common
and otherwise benign words would need to be dropped.  Testing my
work-in-progress list for this property shows about 6000 duplicates
among 22 million word pairs, including e.g. these examples:

actorbit (can be "actor" + "bit" or "act" + "orbit")
allyear (can be "all" + "year" or "ally" + "ear")

just to illustrate what I'm talking about (like I say, there are about
6000 of these).
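
One way to test a list for this property is sketched below in Python.  It
is a minimal sketch under my own assumptions, not the test actually run:
the function name ambiguous_concatenations is hypothetical.  It relies on
the observation that if a + b equals c + d for two different pairs, then
the shorter first word (say a) is a proper prefix of the longer one (c),
and b must start with the rest of c.

```python
from collections import defaultdict


def ambiguous_concatenations(words):
    """Map each ambiguous two-word concatenation to the set of
    (first, second) pairs that produce it (hypothetical helper)."""
    wordset = set(words)
    result = defaultdict(set)
    for c in wordset:
        # Try every proper prefix of c that is itself a word.
        for i in range(1, len(c)):
            a, x = c[:i], c[i:]
            if a not in wordset:
                continue
            # a + b == c + d requires b == x + d with d a (non-empty) word.
            for b in wordset:
                if len(b) > len(x) and b.startswith(x):
                    d = b[len(x):]
                    if d in wordset:
                        result[a + b].add((a, b))
                        result[a + b].add((c, d))
    return dict(result)


if __name__ == "__main__":
    sample = ["act", "actor", "bit", "orbit", "all", "ally", "ear", "year"]
    for text, pairs in sorted(ambiguous_concatenations(sample).items()):
        print(text, sorted(pairs))
```

This avoids materializing all of the roughly 22 million concatenations at
once, at the cost of a scan over the list for each word/prefix match.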

At this point, I'd appreciate the community's opinion and maybe help on
what else to drop.  More problematic words?  More obscure words?
Different forms of the same words (e.g., right now we have "bake",
"baked", "baker", and "bakery")?  Words that are too similar in meaning
(avoiding these was a stated property of passwdqc's current list, but
something I have neglected in these updates so far)?  Some of the length 6 words to
reduce the average passphrase length?  Try to achieve the above property
to allow for concatenation without separators yet no security impact?

The more words we drop of one category (e.g., forms of the same simple
words, or length 6 words), the fewer we can drop from other categories
(e.g., obscure words), so even with 600+ words yet to drop it's a tough
decision.

Any words to add - e.g., was my decision to choose only singular or
plural and not both a wrong one?  By including both, we could instead
drop more obscure words.

Attached to this e-mail are 3 word lists: passwdqc's current
(wordset_4k-old.txt), the result of initial automated processing as
mentioned above (wordset_4k-new-draft0.txt), and my current manually
edited list (wordset_4k-new-draft1.txt).  Please feel free to review
these (and/or the changes between them) and make suggestions.

Overall, this is a lot of effort.

Alexander

On Sun, Sep 25, 2016 at 01:24:20PM +0200, Solar Designer wrote:
> On Sun, Sep 25, 2016 at 04:54:58PM +1000, Andrew Stuart wrote:
> > In less than 50 password generations I have had three passwords that included
> > 
> > shit
> > cock
> > gay (not that this is a curse word
> 
> And is e.g. cock a curse word?  It depends.
> 
> > but I'm wondering if some childish code underlies this password generator)
> 
> Not sure what you mean here.  That there was a deliberate attempt to use
> controversial words?  No, there was not.  It's just that 4096 common
> English words of length up to 6 do indeed include these words above.
> 
> > Is this some sort of joke?  I am generating passwords to give to my users - can this software be trusted?  Can I expect it to generate more controversial words?
> 
> Unfortunately, yes - it will generate more controversial words, and not
> only words, but also word combinations where each individual word would
> likely not be considered controversial on its own, but the combination
> is likely to be.
> 
> We have a pending task to revise passwdqc's list of words to replace the
> more likely problematic ones - in terms of not only such words on their
> own, but also their use in passphrases.  My current estimate is that
> maybe 200 words, if not more, will need to be replaced.  200 is about 5%
> of the total words we have.  Unfortunately, this may make passphrases
> somewhat harder to memorize, but we probably have to make this change.
> 
> Thank you for reminding us about this.
> 
> Alexander

View attachment "wordset_4k-old.txt" of type "text/plain" (24510 bytes)

View attachment "wordset_4k-new-draft0.txt" of type "text/plain" (32221 bytes)

View attachment "wordset_4k-new-draft1.txt" of type "text/plain" (28759 bytes)
