Message-ID: <20120315234631.GA10059@openwall.com>
Date: Fri, 16 Mar 2012 03:46:31 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: SSH thread-safety

Dhiru, magnum, all -

It was reported to me off-list that the "SSH" format in 1.7.9-jumbo-5
crashes on self-test on a 64-way machine running RHEL 6.2 on x86-64.

I managed to reproduce similar crashes on an 8-core machine by
increasing OMP_NUM_THREADS:

$ for n in {1..10000}; do OMP_NUM_THREADS=$n GOMP_SPINCOUNT=1000000 ./john -te -fo=ssh; done &> sshout
*** glibc detected *** double free or corruption (!prev): 0x0000000013d9ac50 ***
*** glibc detected *** realloc(): invalid next size: 0x0000000000ba0600 ***
*** glibc detected *** double free or corruption (!prev): 0x0000000003e80f50 ***
*** glibc detected *** double free or corruption (!prev): 0x0000000001bcff70 ***
*** glibc detected *** realloc(): invalid next size: 0x000000000de36c20 ***
*** glibc detected *** realloc(): invalid next size: 0x000000001c12f010 ***
*** glibc detected *** realloc(): invalid next size: 0x0000000004df17c0 ***
*** glibc detected *** double free or corruption (!prev): 0x0000000001ad3ed0 ***
*** glibc detected *** realloc(): invalid next size: 0x0000000006974160 ***
*** glibc detected *** double free or corruption (!prev): 0x000000001798c2e0 ***
*** glibc detected *** realloc(): invalid next size: 0x0000000002e73d50 ***
*** glibc detected *** double free or corruption (!prev): 0x00000000135b0650 ***
*** glibc detected *** realloc(): invalid next size: 0x00000000098041f0 ***
*** glibc detected *** double free or corruption (!prev): 0x000000001144d830 ***
*** glibc detected *** double free or corruption (!prev): 0x0000000015636440 ***
*** glibc detected *** double free or corruption (!prev): 0x0000000005962d20 ***
*** glibc detected *** double free or corruption (!prev): 0x0000000001caf160 ***
*** glibc detected *** realloc(): invalid next size: 0x000000001c654eb0 ***
*** glibc detected *** double free or corruption (!prev): 0x000000001f5b6fa0 ***

These crashes correspond to these thread counts:

$ fgrep Aborted sshout
Benchmarking: ssh [32/64]... (44xOMP) Aborted
Benchmarking: ssh [32/64]... (202xOMP) Aborted
Benchmarking: ssh [32/64]... (523xOMP) Aborted
Benchmarking: ssh [32/64]... (664xOMP) Aborted
Benchmarking: ssh [32/64]... (765xOMP) Aborted
Benchmarking: ssh [32/64]... (884xOMP) Aborted
Benchmarking: ssh [32/64]... (1041xOMP) Aborted
Benchmarking: ssh [32/64]... (1073xOMP) Aborted
Benchmarking: ssh [32/64]... (1090xOMP) Aborted
Benchmarking: ssh [32/64]... (1315xOMP) Aborted
Benchmarking: ssh [32/64]... (1771xOMP) Aborted
Benchmarking: ssh [32/64]... (2027xOMP) Aborted
Benchmarking: ssh [32/64]... (2045xOMP) Aborted
Benchmarking: ssh [32/64]... (2538xOMP) Aborted
Benchmarking: ssh [32/64]... (3450xOMP) Aborted
Benchmarking: ssh [32/64]... (3725xOMP) Aborted
Benchmarking: ssh [32/64]... (4243xOMP) Aborted
Benchmarking: ssh [32/64]... (4528xOMP) Aborted
Benchmarking: ssh [32/64]... (4699xOMP) Aborted

Additionally, john went into an infinite loop twice during the above
run, and I had to kill those john processes.  That was for 103 and 4773
threads.  In both cases, the gdb backtrace looked like:

(gdb) bt
#0  0x00002b4af3dbc591 in gomp_team_barrier_wait_end () from /usr/lib64/libgomp.so.1
#1  0x00002b4af3dbb62e in gomp_team_end () from /usr/lib64/libgomp.so.1

Perhaps I could find something more informative by looking at per-thread
backtraces, but I did not bother.  BTW, for 4773 threads, the process
consumed over 46 GB of address space:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
solar     682459  799  0.2 48915460 43696 pts/3  Sl+  Mar15 1524:08 ./john -te -fo=ssh

I did not proceed to test even higher thread counts (I interrupted the
"for" loop in the shell) - I felt the above was enough info.

I also repeated the experiment with ASLR disabled:

$ for n in {1..10000}; do OMP_NUM_THREADS=$n GOMP_SPINCOUNT=1000000 ./john -te -fo=ssh; done &> sshout-nonrand
*** glibc detected *** double free or corruption (!prev): 0x00000000008dac50 ***
*** glibc detected *** realloc(): invalid next size: 0x000000000091d4d0 ***
*** glibc detected *** double free or corruption (!prev): 0x000000000092c650 ***
*** glibc detected *** realloc(): invalid next size: 0x000000000092ec20 ***
*** glibc detected *** double free or corruption (!prev): 0x0000000000935b30 ***
*** glibc detected *** double free or corruption (!prev): 0x0000000000941db0 ***
*** glibc detected *** realloc(): invalid next size: 0x0000000000920c80 ***
*** glibc detected *** realloc(): invalid next size: 0x00000000009135d0 ***

$ fgrep -i abort sshout-nonrand
Benchmarking: ssh [32/64]... (44xOMP) Aborted
Benchmarking: ssh [32/64]... (642xOMP) Aborted
Benchmarking: ssh [32/64]... (803xOMP) Aborted
Benchmarking: ssh [32/64]... (826xOMP) Aborted
Benchmarking: ssh [32/64]... (897xOMP) Aborted
Benchmarking: ssh [32/64]... (1024xOMP) Aborted
Benchmarking: ssh [32/64]... (1027xOMP) Aborted
Benchmarking: ssh [32/64]... (1272xOMP) Aborted

I got an infinite loop for 1532 threads this time, with the same kind of
backtrace:

(gdb) bt
#0  0x00002aaaabb8a591 in gomp_team_barrier_wait_end () from /usr/lib64/libgomp.so.1
#1  0x00002aaaabb8962e in gomp_team_end () from /usr/lib64/libgomp.so.1

I similarly did not proceed to try higher thread counts.

GCC 4.6.2, OpenSSL 1.0.0d.  (The RHEL 6.2 system where the problem was
initially detected had slightly different versions, though.)

My guess is that the OpenSSL functions we're calling are still not
entirely thread-safe even in these recent versions of OpenSSL.  We may
want to look into this and maybe end up submitting a patch to OpenSSL.

Additionally, the element type of the has_been_cracked[] array should be
changed from char to int (or maybe even sig_atomic_t) because at least
the original Alpha lacked instructions to update individual bytes in
memory.  To update a byte, it had to read a 32- or 64-bit word,
update it in a register, and write the entire word back.  This may undo
a change being made to a nearby byte by another thread at about the same
time.  For Alpha, this was corrected with BWX:

http://en.wikipedia.org/wiki/DEC_Alpha#Byte-Word_Extensions_.28BWX.29
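
To make the lost update concrete, here is a minimal single-threaded
sketch that replays one unlucky interleaving of two byte stores on a
CPU that can only load and store whole 32-bit words.  It is not JtR
code: the helper names are hypothetical, and the two "threads" are just
two register copies processed in a fixed order so that the outcome is
deterministic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: return 'word' with byte number 'idx' replaced by
 * 'val'.  This is the "update it in a register" step of a byte store on
 * a CPU without byte store instructions. */
static uint32_t set_byte_in_word(uint32_t word, unsigned idx, uint8_t val)
{
    unsigned shift;

    assert(idx < 4);
    shift = idx * 8;
    return (word & ~((uint32_t)0xff << shift)) | ((uint32_t)val << shift);
}

/* Hypothetical helper: extract byte number 'idx' from 'word'. */
static uint8_t get_byte_in_word(uint32_t word, unsigned idx)
{
    assert(idx < 4);
    return (uint8_t)(word >> (idx * 8));
}

/* Replays this interleaving: A loads, B loads, A stores, B stores.
 * Returns 1 if thread A's update was lost. */
int lost_update_demo(void)
{
    uint32_t mem = 0;      /* four adjacent char flags, all zero */
    uint32_t reg_a = mem;  /* thread A loads the whole word */
    uint32_t reg_b = mem;  /* thread B loads the same word */

    reg_a = set_byte_in_word(reg_a, 0, 1); /* A sets flag 0 */
    reg_b = set_byte_in_word(reg_b, 1, 1); /* B sets flag 1 */

    mem = reg_a;           /* A writes the whole word back */
    mem = reg_b;           /* B writes its copy, which has flag 0 == 0 */

    return get_byte_in_word(mem, 0) == 0 && get_byte_in_word(mem, 1) == 1;
}
```

With one flag per int (or wider) element, each thread would write only
its own word, so this particular interleaving could not clobber a
neighbor's flag.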

I don't know if there are any other archs (on which JtR may reasonably
be run) that have a similar limitation, and Alpha is history, yet I
think we want to correct this.  We'll also need to adjust the memset()
call to use sizeof() instead of MAX_KEYS_PER_CRYPT.
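
A minimal sketch of why the memset() length matters once the elements
are ints.  This is not the actual format source: the value of
MAX_KEYS_PER_CRYPT and the function name are assumptions made for
illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_KEYS_PER_CRYPT 8 /* hypothetical value, for illustration */

/* One int per candidate key instead of one char */
static int has_been_cracked[MAX_KEYS_PER_CRYPT];

/* Sets every flag, clears the first 'nbytes' bytes of the array the way
 * a memset() call would, and returns how many flags are still set. */
int stale_flags_after_clear(size_t nbytes)
{
    int i, stale = 0;

    assert(nbytes <= sizeof(has_been_cracked));

    for (i = 0; i < MAX_KEYS_PER_CRYPT; i++)
        has_been_cracked[i] = 1;

    memset(has_been_cracked, 0, nbytes);

    for (i = 0; i < MAX_KEYS_PER_CRYPT; i++)
        if (has_been_cracked[i])
            stale++;

    return stale;
}
```

Passing sizeof(has_been_cracked) clears every flag, whereas passing
MAX_KEYS_PER_CRYPT (a count of elements, not bytes) clears only part of
the array on any platform where sizeof(int) is greater than 1.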

Alexander