john-dev - Re: Parallella: scrypt

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130617063422.NETN0.643185.imail@eastrmwml206>
Date: Mon, 17 Jun 2013 6:34:22 -0400
From:  <jfoug@....net>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: scrypt

---- Solar Designer <solar@...nwall.com> wrote: 
> On Wed, Jun 12, 2013 at 02:27:42PM +0200, magnum wrote:
> > We actually have Colin's reference implementation in bleeding-jumbo, added by Dhiru for the "scrypt" format (with format_name django-scrypt). And Jim optimized it to 2x. I haven't looked at it but maybe it should be renamed to django-scrypt (and your revisions merged)?
> 
> Ouch!  Dhiru - you should start announcing your additions to the tree in
> here.  At least new formats.  Somehow I missed this one in git.
> 
> It's a pity that Jim spent time on this.  The reference implementation,
> by definition, was not optimized.  Colin, the original author, also
> provides two other implementations, which are much faster (more than 2x
> over reference).  There's no point in us optimizing the reference
> implementation in any way - we simply should drop and replace it.

Little time spent on this (but I do have little 'real' free time, so point taken).  It was trivial changes. 90% of the improvement was removal of ridiculous memsets/memcpys, etc in a couple of areas in the deep inner loop.  But I did also put in quite a bit of time studying the algo, and then digging in, and spending some time studying salsa, and starting to get personal insights on how to do salsa 'right', and where the SIMD is within that algo (there is a lot of opportunity) .   I have yet to start putting any of that knowledge into code, however.  But yes, you are very right.  There probably is an easy 10x gain to be had by attacking the problem from the performance direction instead of the current extreme readability direction of the current reference code.  But I did want to look at it. 

Just because something is reference code, does not mean it can not be highly optimal.  Take for instance when I re-wrote the reference code for DCC2 (mscash2) found on the wiki, when I was looking to first port DCC2 to SIMD and needed to get it figured out.  Before I started this, PBKDF2 was a total mystery to me, I had no clue at all.  I found the original ref code authored by S3nf, to be pretty opaque, with the algorithm hard to see, due to all of the crypt code being inline within the ref piece. I threw away all of the MD4/SHA1 stuff, and replaced with simply calls into oSSL.  I had already made some changes in other places, and knew to try to reduce the amount of times the ipad/opad part of the HMAC being computed from iteration times to 1 time.  And the code itself was greatly reduced (probably 20% or so line count, compared to original ref piece), AND the actual logic of the DCC2 was tightly grouped and very easy to follow.   Once I got the code into that state, and could really wrap my brain around it, it jumped right out at me, that the first 1/2 of each crypt (ipad/opad) within each HMAC was being done over a constant.  It was not only the ipad/opad 'building' that could be eliminated, but the actual crypt also.  That change itself got put INTO the reference code on the wiki:  http://openwall.info/wiki/john/MSCash2_simple_code  since I felt that the reduction of that part of the crypt did not change the easy of understanding with the ref code (it did need a comment to explain things a little).  And now, 2 or 3 years later, the logic from that ref code, with VERY little change is still fully being used as the best code within most of the JtR formats doing PBKDF2 logic, and I cringe when I see oSSL's PBKDF2 code being used.

But yes, you are certainly write about Colin's code, here.  And it does seem to be his habit in his ref pieces, that he almost seems to purposely write code that is designed to run slowly.  I wish he would not 'try' to do things that way, but I have seen this on several things over the last decade or so.  It is too bad, because often that horribly sub-optimal method gets into projection, and sometimes into HIGHLY used production code, because a product needs some functionality, but the production team has no one that is a crypto guru and there is no time for a team member to study up to become one.  They simply get tasked with 'add feature X' to the product, find some code that they CAN read, and that 'works', and put it in.  Now, it DOES work, gets the job done, tests well, and ends up being released. A prime example of this was php for at least a couple of years, and it's MD5. Solar, I think even you felt the wrath of that one, when you wrote PhPass and had logic in there (along with comments), that if run on php prior to version XYZ, use the $P9$ (I think) which significantly reduced the key stretching, because the the logic in those early versions was SOOO dismal.  Now I am not 100% sure that this was due to the php team using some gawd awful slow ref piece implementation, but I would be willing to bet a burger and a coke, that it was such a situation.

[/rant] 

WOW, looking back on what I wrote, I am REALLY sorry for that strange off the wall rant.  I guess I should try to get more sleep, but I had to send it, after spending he time to write it.  All I really meant to say was 

"Solar, not much time was wasted optimizing that code, possibly an hour or so. But I did spend some time 'learning' about the what/how of scrypt, so that later I can help when we write it properly for our needs"

But things went their own way, within my taxed brain.  Guess I felt like sharing a little of the insanity this morning, LOL.

Jim.
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.