john-dev - RE: unique using more than 2 GB RAM



Message-ID: <00b701ccdd3a$d7b33770$8719a650$@net>
Date: Fri, 27 Jan 2012 15:30:07 -0600
From: "jfoug" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: RE: unique using more than 2 GB RAM

>We should make "unique" able to use more than 2 GB of memory.  As it
>turns out, "unique" at 2 GB is about twice slower than "sort -u -S 14G"
>(on a 16 GB RAM machine), although of course this may vary by input
>data.  Maybe "unique" should start using 40-bit offsets (good for up to
>1 TB of RAM).  
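For illustration, a 40-bit offset can be stored in 5 bytes and widened to 64 bits on use. This is only a minimal sketch: the type off40_t and the helpers off40_pack/off40_unpack are hypothetical names, not taken from unique.c, and the little-endian byte order is an arbitrary choice.

```c
#include <assert.h>
#include <stdint.h>

/* Largest offset that fits in 40 bits: 2^40 - 1, i.e. just under 1 TiB. */
#define OFF40_MAX ((UINT64_C(1) << 40) - 1)

/* Hypothetical 5-byte offset type (an assumption, not from unique.c). */
typedef struct { uint8_t b[5]; } off40_t;

/* Store the low 40 bits of off, least significant byte first. */
static off40_t off40_pack(uint64_t off)
{
    off40_t p;
    int i;

    assert(off <= OFF40_MAX);
    for (i = 0; i < 5; i++)
        p.b[i] = (uint8_t)(off >> (8 * i));
    return p;
}

/* Rebuild the 64-bit offset from its 5 stored bytes. */
static uint64_t off40_unpack(off40_t p)
{
    uint64_t off = 0;
    int i;

    for (i = 0; i < 5; i++)
        off |= (uint64_t)p.b[i] << (8 * i);
    return off;
}
```

Each offset then costs 5 bytes instead of 8 for a full 64-bit value, at the price of extra shifts on every access, which is one place the speed concern raised in this thread could show up.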

unique in JtR will almost always be slower than a sort | uniq approach.
Searching a very large working set for duplicates is much harder than
sorting it and then doing a brain-dead scan (a simple compare of adjacent
lines).

>I am concerned that it will become slightly less
>efficient (in terms of both speed and memory usage) when this new
>functionality is not being made use of, though.

However, if we did this with a --hugefile switch (or some other switch, or
set of switches), then we could certainly make it faster with large memory
than it is today.  Likewise, a --tinyfile switch would let us optimize the
other way (making small files faster).  I do not remember whether we check
the file length before computing the hash table size (I am not next to the
code right now).  If we do, then --hugemem or --hugefile would simply tell
the existing code that it is OK to use a few larger items.  I really think
that is about all there should be to changing it.  But I agree, this is a
pretty good 'wish list' item.  I know there are people with 100 GB very
dirty wordlists.  Even with the maximum memory john's unique uses now, such
a file takes many passes.  The fewer of those 'large' block re-runs, the
faster overall.  The goal is to remove as much of that ^2 from the O(n^2)
part as possible.
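To make the sizing idea concrete, here is a rough sketch of how the file length could drive both the number of blocks and the hash table size. Everything in it is an assumption for illustration: block_count and hash_bits_for are hypothetical helpers, not functions from unique.c, and the average line length is a guess the caller supplies.

```c
#include <assert.h>
#include <stdint.h>

/* Number of buffer-sized blocks needed to cover the input (ceiling division). */
static uint64_t block_count(uint64_t file_size, uint64_t buf_size)
{
    assert(buf_size > 0);
    return file_size / buf_size + (file_size % buf_size != 0);
}

/*
 * Pick log2 of the hash table size from the file length: the smallest
 * power of two that is >= the estimated line count, clamped to
 * [min_bits, max_bits].  A hypothetical --hugefile style switch would
 * simply raise max_bits.
 */
static unsigned hash_bits_for(uint64_t file_size, uint64_t avg_line_len,
                              unsigned min_bits, unsigned max_bits)
{
    uint64_t lines;
    unsigned bits = min_bits;

    assert(avg_line_len > 0 && min_bits <= max_bits && max_bits < 64);
    lines = file_size / avg_line_len;  /* crude estimate of the line count */
    while (bits < max_bits && (UINT64_C(1) << bits) < lines)
        bits++;
    return bits;
}
```

With these helpers, a 100 GiB input needs 50 blocks with a 2 GiB buffer but only 8 with a 14 GiB buffer, which is the kind of reduction in 'large' block re-runs described above.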

Jim.
