Date: Mon, 14 Mar 2011 23:02:55 -0500
From: "jfoug" <jfoug@....net>
To: <john-users@...ts.openwall.com>
Subject: RE: unique
>From: Solar Designer [mailto:solar@...nwall.com]
>On Mon, Feb 28, 2011 at 09:23:20AM -0600, jfoug wrote:
>> Unique is nice, but you HAVE to bump up the params.h values a lot to
>>make it work, on anything other than a tiny file.
>
>You mean, to make it work "fast".
Yes, exactly. It did work, but really ground on the disk IO ;)
>Because it does work on huge files
>even with default settings, just slowly.
>
>Also, I bumped up the defaults between 1.7.4 and 1.7.4.2. Starting with
>that version, unique uses about 70 MB of RAM by default (previously, it
>would use about 9 MB by default).
I bumped way over that, but here is the IO needed at each of these buffer
levels, assuming the file is 'mostly' unique.
At 9mb buffer:
  100mb file ->  700mb read
  500mb file ->   14gb read
    1gb file ->   57gb read
    2gb file ->  224gb read
    5gb file -> 1389gb read

At 70mb buffer:
  100mb file ->  170mb read
  500mb file ->  2.4gb read
    1gb file ->    9gb read
    2gb file ->   29gb read
    5gb file ->  179gb read

At 384mb buffer (where I run):
  100mb file ->  100mb read
  500mb file ->  884mb read
    1gb file ->    2gb read
    2gb file ->    8gb read
    5gb file ->   40gb read
Now, the above is the worst case (data already unique, or almost). However,
if the file is highly redundant, all buffer sizes will see a similar
reduction, but I believe the proportions above will stay close to what is
listed. So yes, at the current default 70mb buffer level, running unique on
a 1gb file is not 'too' bad.
The file 'read' IO is classic O(n^2). Strictly, the number of passes follows
a 'triangular number', so the cost is O((n^2+n)/2), but that reduces to
simple O(n^2). In this case, n is sizeof(file)/sizeof(buffer) (i.e. the
number of buffer reads from the file). Thus reducing this ratio (by
increasing the size of the buffer) greatly reduces the overall impact of an
O(n^2) algorithm. For that 9mb buffer and a 5gb file, n is about 555, while
for the 384mb buffer n is only about 13. Each 384mb operation is larger than
a 9mb one, but I will take O(13^2) over O(555^2) any day.
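
To make the arithmetic concrete, here is a rough C sketch (not taken from
the unique sources; the estimator and the decimal mb/gb units are my own,
and the real program's pass accounting may differ a little at the margins)
of the triangular-number model behind the tables above:

/* Back-of-the-envelope estimator, NOT code from unique itself: with
 * n = ceil(file/buffer) passes, the worst (mostly-unique) case re-reads
 * roughly n*(n+1)/2 buffers' worth of data. */
#include <math.h>
#include <stdio.h>

static double estimate_read_bytes(double file_bytes, double buf_bytes)
{
    double n = ceil(file_bytes / buf_bytes);  /* buffer-sized passes */
    return n * (n + 1.0) / 2.0 * buf_bytes;   /* triangular re-read total */
}

int main(void)
{
    const double MB = 1e6, GB = 1e9;

    printf("9mb buffer,   5gb file: ~%.0f gb read\n",
           estimate_read_bytes(5.0 * GB, 9.0 * MB) / GB);
    printf("384mb buffer, 5gb file: ~%.0f gb read\n",
           estimate_read_bytes(5.0 * GB, 384.0 * MB) / GB);
    return 0;
}

Compiled and run, that prints roughly the ~1390gb and ~40gb figures from the
5gb rows above, give or take rounding.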
NOTE: I have not looked at the unique program in a long time. You may have
changed how the base algorithm works (but I do not think so). I imagine you
just made the code run faster (such as faster hashing), but the way the
algorithm deals with disk IO is likely the same.
>With these settings, unique is
>reasonably usable on wordlists of up to a few hundred megabytes. So
>maybe your comment applies to a pre-1.7.4.2 version.
Yes, when I did look at unique, it was before that version.
>That said, I think a future version should have the memory buffer size
>tunable from the command-line or/and in john.conf. This has been on my
>to-do for ages.
It would be a wise addition, and should add no overhead to the runtime of
unique.
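
For what it is worth, about the only new code such an option needs is
something to turn a human-readable size into a byte count. A tiny,
hypothetical sketch (parse_size and the k/m/g suffixes are my invention, not
an existing john option or function):

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, not from the john sources: parse "384m", "1g",
 * "9437184", etc. into a byte count for the unique buffer. */
static unsigned long long parse_size(const char *arg)
{
    char *end;
    unsigned long long size = strtoull(arg, &end, 10);

    switch (*end) {
    case 'k': case 'K': size <<= 10; break;
    case 'm': case 'M': size <<= 20; break;
    case 'g': case 'G': size <<= 30; break;
    default: break; /* plain byte count */
    }
    return size;
}

int main(int argc, char **argv)
{
    if (argc > 1)
        printf("%s -> %llu bytes\n", argv[1], parse_size(argv[1]));
    return 0;
}

The same helper would work whether the value comes from a command-line
switch or a john.conf setting.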
>> -inp=file Allows file to be read, vs only reading from stdin.
>
>What for? Is it somehow more useful on Windows (IIRC, you're on
>Windows).
Yes, I am. I do not remember if the reason I added this was due to
redirection limitations in windoze cmd.exe or not. IIRC, it had to do with
2gb limits, but I am a little foggy on just why.
>> --2nd_uniq_file=file
>
>We need to come up with self-explanatory names for these options.
If you have ideas, please post. Making command line switch changes is
trivial. The biggest change to make is getting any changes I have done into
your newer code.
The most useful 'real' change is the addition of the --2nd_uniq_file (or
whatever command line switch you think is better) processing. The verbosity
is not really useful (just a pacifier).
Jim