Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date: Fri, 10 Jul 2009 12:27:03 -0500
From: "Jim Fougeron" <>
To: <>
Cc: <>
Subject: Contributing significant changes to the jumbo patch (mostly performance improvements)

I posted this email originally to Solar, off list.  He requested me
to post on-list and start a new thread.  I will simply repost the
email to Solar in whole, and then start following up with the
patches I am working on, along with links to said patches.  There
are a couple of patches which are sort of walking all over each
other in a few places, so I am working to get these diff files to
play nice with each other.
*** Original message  ***
I have made several changes, some pretty significant.  These are for 
performance gains.  I would like to know how you want to proceed on 
some of these patches, since they might be big enough to end up being
a new 1.7.3-jumbo-6.
I was using a fast (raw-md5) algorithm, and was getting about 5% of
expected throughput (compared with --test speeds) The changes are:
*** Enhance percentages (from 43% to 43.12% i.e. added 1/100'ths %).
Not a performance improvement, only impacts markov and wordlist.
*** Expanded the fmt_methods structure  (*** This is the main one I
have questions about, see below ***), to have 5 binary_hash and 5
get_hash functions.  This is a HUGE performance gain when the size
of the password/hash list is large.   I was working with 250k element
list (shrunk to 100k quickly).  The raw-md5 uses one 'salt' list, and
hashes that.  Thus, jamming 250k into a 4k hash table is horrible for
performance.  I also made changes to param, to start using a larger
hash table earlier on.  I did quite a bit of performance testing, and
it now seems pretty decent, up to about 2 million entries (but will
again start to degrade after about 750k). The new table sizes are 64k
and 1mb entries.   Each table jump does 'slightly' reduce performance
due to a larger workingset, and fewer memory cache hits.  However, a
cache miss is MUCH less costly, than a whole comparison of a chained
hash entry.   This modification can make a huge improvement (5x or
better) if the size of the list is large and the algorithm is FAST.
Slow algorithms really do not benefit from 'any' of these changes,
only the really fast ones.
*** Added 'option' to do in-memory file caching.  This can benefit a
lot of --rules based searches, by 25-45% (after all other
improvements).  It is pretty drastic.   All save --restore is working
just fine, same as it would reading directly from a file.  The method
I used 'might' not be the best in the end.  I have a single array I
load the file in, then count lines, then allocate an array of
int32's, then strtok the stored cache (strtok(NULL, "\r\n")),
assigning each new line to the next int 'pointer'.  This requires a
lot more memory (size of file + 4*(number of lines)).  However, doing
this, can easily allow for things such as computation of file offset,
and also being able to handle things like lines longer than the
LINE_BUFFER_SIZE size (which I have increased from 0x400 to 0x2400).
The code 'could' be written to simply work with a single file memory
buffer (strtok'd for \r\n), and then walk forward to the next line.
That type code is workable, but might be slower (would use less
memory though), and it would have to be a little more complex to be
able to handle situations where 1 EOL char, 2 EOL chars, and lines
longer than LINE_BUFFER_SIZE.  However, I think it 'could' be done,
it is just not how I initially have written it.
*** Added a command line switch --mem-file-size=#   Default is 5000000
(this is hard coded, and trivial to change).  Zero, 'might' be a
better default, as it would give exact same run time characteristics
as today (i.e. all files read from disk).  But I thought a default of
5m would be pretty good (possibly a #define for something like
__DJGPP__ or others which might have a smaller memory footprint).
When a wordfile is opened, if the size of the file is less than the
value from mem-file-size, then the file is loaded into the memory
cache (note, the size of the pointer array to the lines is NOT used
in computing size).  Also, if --save-memory is 1 or more, then
--mem-file-size variable is forced to 0, so file is read from disk.
*** One bugfix/workaround/side-affect of the memory-file code, is I
do an initial fseek to end of file.  If that returns a negative
number, then I abort john with a message listing the file is too big
for the OS to properly read it, as compiled.  Again, not a
performance improvement, but gives john a way to inform the user
that their input is not within valid parameters, allowing them to
cut the file.
*** The final performance improvement was seen in my initial
modification to raw-md5 (I will get this patch in later), when I
added SSE2.  I did so, and was only processing 4 (or 2 for MMX)
password checks per 'run'.  At the end of each run, cracker calls
fix_state() which for a wordlist.c call, it calls ftell().  Well,
this significantly slows the speed down, even though calling this
a million times per second for a fast algorithm does not make sense.
I initially made this change, and then later made changes to my
raw-md5 code to process 64 passwords per 'chunk'.  However, I still
have this code left in, and it does work fine.  I added a command
line option --fix-state-delay=#  (default 0) and then made changes
to wordlist.c so that it only does the 'real' work every # times it
is called.  Thus --fix-state-delay=50 will improve performance if
the _fmt is very fast, and does not do enough work in batches.  NOTE
this patch may or may not be needed, but for someone starting out
with the format's, it can help.    Even after going to blocks of 64
for raw-md5, if setting --fix-state-delay=10 there was still a 3-5%
improvement in speed.   NOTE2, if using a memory file, then we do
NOT delay state fixing and do it every time, since it is simply a
pointer subtraction from another pointer.  NOTE3, this might be a
good item to add to the fmt_main structure, to allow a format to
set it's 'own' delayed ftell() calling, if it need be.

*!*!*!*!* Ok, now for the main question.  In the changes to the
ftm_methods means that all *_fmt.c files 'should' be re-worked.
However, pulling out, and dusting off my C programming language book
(and C++ programming v2), I do see that structs being initialized
where there are NOT enough params, are supposed to be null.  Now, I
have made changes in the ldr_init_hash so that it starts from 
element[4], but will ONLY use the hash, if both the methods
binary_hash[x] and get_hash[x] are not null.  This is the code:
static void ldr_init_hash(struct db_main *db) {
  struct db_salt *current;
  int size;

  if ((current = db->salts))
  do {

    for (size = NUM_PW_HASHES-1; size >= 0; size--)
      if (current->count >= password_hash_thresholds[size])
        if (db->format->methods.binary_hash[size] &&

So what will happen, is the code will try to use the bigger hash
tables (if there are enough elements), but ONLY do so if the code
actually exists. Now, I have made 'real' changes to raw-md5, because
I am using and have tested it.  However, I have not spent the time
to update others (I tried for DES, and it failed self test, so I
backed things out).  
My main question is I have 2 NULL's added to each *_fmt.c file, and
the ones that have default_hash values for the 3 existing elements,
I added defaults for the 4th and 5th.   However, this touches a LOT
of files, but does not add code to them (more for documentation).
The question I have (finally), is do you want me to make patch
changes with all of these NULL, NULL added, or simply keep the
*_fmt.c's the same as they are, and let default compiler
initialization set them to NULL.  If the 2nd, I have no idea if
there are compilers out there which fail to properly pre-set to
null or not.  If the 2 NULL's are added explicitly, then there
would be no chance of compiler issues.   However, it makes for a
patch that touches the world for little 'gain.
To sum things up, how would you like me to proceed from here?  I have
maintained other open software before, and know that large changes
like this can be a pain in the arse.  I can provide the stuff in any
way that would be beneficial to you.  But I am not sure this is the
proper 'patch' to put out publicly, and might be better to be
incorporated into an 'official' jumbo. NOTE if you want, I 'could'
build a jumbo-6 with jumbo-5 and these changes in it.   If this is
acceptable to you, I can get this within 24 hours or so (if the wife
gives me time).  I would also add MinGW32 (and M$VC) #defines and
changes, so that MinGW32 would be a 'default' part of the jumbo.
The VC is not really stand alone, but still requires MinGW to build
some of the asm (and do things like copy the proper x86-*.h to
arch.h).  But I simply use the VC to help in my own debugging.
I will leave what 'type' changes you want in your court, and wait
for a reply from you.
However, all in all, I took my raw-md5 SSE2 which was --testing at
12M/s and was getting 450K/s with 50000 passwords, and now it is
getting closer to 4M/s.  Still slower (some delay is in the rules
and pre-rules), but much better than it was.

The SSE2 version of the raw-md5, will be provided as it's own patch.
This one I also have questions about, since there are other patches
floating around.  I am not sure what patch level to do for that one,
as the jumbo-5 seems to take the best of them, but is not the same.
It might be best to roll that into jumbo-6.

NOTICE: This email message is for the sole use of the intended
recipient(s) and may contain confidential and privileged
information. Any unauthorized use, disclosure or distribution is
prohibited. If you are not the intended recipient, please contact
the sender by reply email and destroy all copies of the original

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.