Date: Wed, 16 Sep 2015 02:43:42 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Judy array

On 2015-09-16 01:09, Solar Designer wrote:
> On Wed, Sep 16, 2015 at 12:43:44AM +0200, magnum wrote:
>> Also, I don't observe any gain from disabling mmap and only minimal gain
>> from using --mem=0 when mmap is enabled (I stopped using -mem after mmap
>> was implemented).
>
> What does it mean when with mmap I am getting "Each node loaded 1/8 of
> wordfile to memory (about 15 MB/node)"?  Doesn't mmap imply that each
> node has the full wordlist mapped into its address space?
>
> In fact, without mmap I am getting "Each node loaded the whole wordfile
> to memory".  Doesn't not using mmap enable easy and efficient loading of
> portions of the wordlist into each node's memory?
>
> This looks backwards to me.  Can you explain?

It is backwards for sure. It grew organically. I'll be stating a few 
things below that are obvious to you, to explain for a broader audience.

Before mmap and MPI/fork, we would either just fgetl() each line, or use 
a memory buffer. The latter would load the whole file into a contiguous 
buffer once, and then modify that buffer (e.g. replace each \n with a 
NUL). Also, index pointers were set up to point to each word, so we could 
immediately get word number 12345 using a pointer to it. This was mostly 
meant for -rules, but that initial load proved to be faster even without 
rules, IIRC. So far, things were pretty sane.
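
For anyone not familiar with the code, here's roughly what that looks 
like (a simplified sketch, not the actual wordlist.c code; load_words() 
and the variable names are made up for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *buffer;   /* whole file, newlines replaced with NULs */
static char **words;   /* index: words[i] points at word number i */
static size_t nwords;

static void load_words(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) { perror(path); exit(1); }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);

    buffer = malloc(size + 1);
    if (!buffer || fread(buffer, 1, size, f) != (size_t)size)
        exit(1);
    buffer[size] = '\0';
    fclose(f);

    /* One pass: record each word's start, terminate it in place */
    size_t cap = 1024;
    words = malloc(cap * sizeof(*words));
    char *p = buffer;
    while (p < buffer + size) {
        if (nwords == cap)
            words = realloc(words, (cap *= 2) * sizeof(*words));
        words[nwords++] = p;
        char *nl = memchr(p, '\n', buffer + size - p);
        if (!nl)
            break;
        *nl = '\0';
        p = nl + 1;
    }
}

With that, word number 12345 is just words[12345].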

Then, with MPI, came some messy code that could do the above but only 
for "my words" in a multi-node run. That was implemented on a leap-frog 
(or should I say round-robin) basis, so we wouldn't end up with 200,000 
short words for one node and 40,000 long words for another. But it also 
had to take into account edge cases like "just a few words and a 
humongous number of rules", or vice versa. From this point it went downhill.
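
The split itself is conceptually trivial; something like this hedged 
sketch, where my_node (0-based), node_count and process_word() are 
invented names:

/* Round-robin: each node takes every node_count-th word, so word
   lengths even out across nodes */
for (size_t i = 0; i < nwords; i++) {
    if (i % node_count != my_node)
        continue;               /* some other node's word */
    process_word(words[i]);     /* hypothetical per-word work */
}

The loop is not where the mess came from; it's everything around it.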

Then I implemented mmap and dropped that other buffer for a while. The 
beauty of mmap is that it's shared between processes (and not just forks, 
but any processes that map the same file), and I was hoping to do without 
the other buffer. But unfortunately our mapped memory is read-only... so 
we can't prepare it and just point to ready-to-use words. Instead, I 
implemented an "mgetl()" that works just like fgetl() but reads from the 
mmap instead of the file. BTW, it's SIMD capable (using our pseudo 
intrinsics), so it's pretty damn fast at scanning for the next newline. 
It's nearly as fast as the old memory buffer, much more straightforward, 
and potentially uses much less memory, BUT we can't suppress dupes. 
Loopback mode *really* needs dupe suppression. So I re-enabled the 
simpler (whole wordlist) version of the memory buffer on top of the mmap, 
but it's really mostly meant for loopback.
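
Conceptually, mgetl() is just this (a rough sketch, with plain memchr() 
standing in for the SIMD newline scan; the real Jumbo code differs):

#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

static const char *map, *map_end, *map_pos;

static int map_wordlist(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    off_t size = lseek(fd, 0, SEEK_END);
    map = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED)
        return -1;
    map_end = map + size;
    map_pos = map;
    return 0;
}

/* Like fgetl(), but reads from the mapping.  Since the pages are
   read-only, we must copy each line out before anyone can touch it. */
static char *mgetl(char *buf, size_t bufsize)
{
    if (map_pos >= map_end)
        return NULL;
    const char *nl = memchr(map_pos, '\n', map_end - map_pos);
    size_t len = nl ? (size_t)(nl - map_pos) : (size_t)(map_end - map_pos);
    if (len >= bufsize)
        len = bufsize - 1;               /* truncate over-long lines */
    memcpy(buf, map_pos, len);
    buf[len] = '\0';
    map_pos = nl ? nl + 1 : map_end;
    return buf;
}

The PROT_READ/MAP_SHARED mapping is what gives us the cross-process 
sharing: every process mapping the same file hits the same page cache 
pages.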

Oh, and there's also encodings... if we do use the memory buffer and 
need re-encoding, we obviously only do that once, when preparing the 
buffer. I can't even remember all the details. This is by far the 
messiest source file in the whole Jumbo tree. It's just that everything 
works pretty well and pretty fast, so I'm a bit afraid of touching it.

But what we should do is completely separate loopback mode from 
wordlist mode. Loopback mode should be its own code. Then we should 
simplify wordlist mode, e.g. drop support for full dupe suppression and 
some other crazy things.

BTW, another idea is to load, prepare and index a (non-mmap) buffer 
before forking (see the sketch below). If/when we rewrite wordlist.c, 
we really should set the goals beforehand... and stick to them.
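
Something along these lines, building on the load_words() sketch above 
(again, just an illustration of the idea, not a patch):

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;
    load_words(argv[1]);        /* parse + index once, in the parent */

    int node_count = 4;
    for (int node = 0; node < node_count; node++) {
        if (fork() == 0) {
            /* Children only read buffer/words, so the pages stay
               shared copy-on-write; we pay for one copy, not N */
            for (size_t i = node; i < nwords; i += node_count)
                puts(words[i]); /* stand-in for real per-word work */
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;
    return 0;
}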

magnum
