john-dev - 1.7.9's --external + OpenMP fails on Cygwin

Message-ID: <20111126194543.GA16734@openwall.com>
Date: Sat, 26 Nov 2011 23:45:43 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: 1.7.9's --external + OpenMP fails on Cygwin

JimF -

Here's a really weird bug that I spent several hours on today, without
much luck.  Maybe you'll be able to figure it out?

I built 1.7.9 on latest Cygwin (with gcc 4.5.3), including with OpenMP
(this required a fix to the Makefile line for john.exe to pass LDFLAGS).
Then I went to test it.  While most things worked fine, I unexpectedly
got the program to lock up with --external=LanMan.  Then I reproduced
the same with -e=Double.  Then with other hash types as well.

In my testing, the problem occurs only with OpenMP builds on Cygwin
running more than one thread, and only when --external is the main
cracking mode.  Hash type does not matter (I tested DES, MD5, BF from
pw-fake-unix available on the wiki).

The problem does not occur with --incremental, not even when I add an
external filter to that.  It also does not occur with OMP_NUM_THREADS=1.

I tried to reproduce it on Linux by compiling without OS_TIMER (more
similar to the build with Cygwin) - no luck (that is, the program ran
fine on Linux no matter what).

Also, the problem does not occur with 1.7.8 built in a similar fashion
(tested on BF only, obviously).  It is new with 1.7.9.

I did not test any -jumbo, I did not try moving my build to another
machine yet, and I did not try using a third-party build of recent JtR.

I debugged the problem in OllyDbg a little bit.  (It's my first time
using this debugger, by the way.)  On a dual-core, there are three
threads - two are running, one is mostly waiting.  When the problem is
triggered - which happens just a few seconds after program start - only
one running thread remains, and it is looping in cyggomp-1's calls to
cygwin1.dll's sem_wait().  Specifically, per gcc/libgomp sources,
libgomp appears to assume that if sem_wait() returns an error, that
error must be EINTR because of a signal, so it simply repeats the call.
In my case, the error is instead EINVAL (yes, I did locate and check
errno).
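
To illustrate the looping described above, here is a rough sketch in C.
It is not copied from libgomp.  The function names, the sem_wait_fn
typedef and the two fake_wait_* stand-ins are all made up for this
example; the stand-ins exist only so that the error paths can be
exercised without a genuinely corrupted semaphore.  The first function
mimics the "any error must be EINTR" assumption, the second is one
possible way to retry only on EINTR and report anything else.

```c
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <stddef.h>

/* Hypothetical: lets us pass sem_wait() or a stand-in with the same shape. */
typedef int (*sem_wait_fn)(sem_t *);

/* The pattern described above: every failure is treated as EINTR and the
   call is repeated.  With a persistent error such as EINVAL, this loop
   never terminates. */
void wait_assume_eintr(sem_wait_fn wait, sem_t *sem)
{
    while (wait(sem) != 0)
        continue;
}

/* Hypothetical alternative: retry only when interrupted by a signal,
   and return any other errno value to the caller (0 on success). */
int wait_checked(sem_wait_fn wait, sem_t *sem)
{
    assert(wait != NULL);
    while (wait(sem) != 0) {
        if (errno != EINTR)
            return errno;   /* e.g. EINVAL: report instead of spinning */
    }
    return 0;
}

/* Stand-in that always fails with EINVAL, like the case in this report. */
int fake_wait_einval(sem_t *sem)
{
    (void)sem;
    errno = EINVAL;
    return -1;
}

/* Stand-in that fails with EINTR twice, then succeeds. */
static int eintr_left = 2;

int fake_wait_eintr_twice(sem_t *sem)
{
    (void)sem;
    if (eintr_left > 0) {
        eintr_left--;
        errno = EINTR;
        return -1;
    }
    return 0;
}
```

With the real function, the call would be wait_checked(sem_wait, &sem).
Passing fake_wait_einval to wait_assume_eintr() would spin forever,
which matches what I see in the debugger.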

Why the semaphore is invalid I don't know.  There's code that checks the
semaphore struct at offset +4 for magic values 0xdf0df04c
(PTHREAD_MUTEX_MAGIC) and 0xdf0df046 (PTHREAD_RWLOCK_MAGIC).  When the
EINVAL looping occurs, the value is the latter.  Changing it to the
former (which the underlying code checks for first) made the EINVAL go
away and the program continue working for a while longer (even cracking
some more passwords), but that's just black magic.

I don't see relevant changes between 1.7.8 and 1.7.9.  While I did
change the external mode code a little bit, none of those changes look
like a likely culprit.  I've tested that int_word[] is not being
overflowed on the copy from ext_word[] - it is not.

I suppose we can try bisecting the changes between 1.7.8 and 1.7.9
anyway, but I am not sure if this will help much.

Would you try to reproduce this?

I am really not into Windows.  Even for OllyDbg, I am running it over
VNC from a Linux desktop.

Thanks,

Alexander