john-dev - john-devkit at PHDays VI

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <857eogohcy.fsf@gmail.com>
Date: Sun, 06 May 2018 19:53:49 +0300
From: Aleksey Cherepanov <lyosha@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: john-devkit at PHDays VI

PHDays 8 is coming on (May 15-16, Russia, Moscow). So there is some
info about my old talk about john-devkit done at PHDays VI, May 17 2016.

The main achievement since my talk in 2015: john-devkit can generate
more than 100 formats for john. The speeds are bad though.

Deep understanding of john-devkit is needed for almost any
development. That's a huge flaw.


Slides as text
https://raw.githubusercontent.com/AlekseyCherepanov/john-devkit/master/slides_2016-05-17_phdays_vi.src.org

Code
https://github.com/AlekseyCherepanov/john-devkit

There are some cool things that were not included into the slides.
"Complexity of optimizations for humans and machines" and "bitslicing"
were mentioned in the annotation of the talk, but the talk was shorter
than planned and did not include them. So this email covers these
points. It would be more interesting to publish the notes and the code
right after the talk. I am sorry for the delay.


Short summary of project and work done

- Intrusive Changes to intermediate representation (IR) are planned!
- Experimental phase is not over yet

- 100+ formats for john were tested
  - some pbkdf2-* and hmac-*,
  - "dynamics" built with md4, md5, sha1, sha256, sha512
  - without optimizations (bad)
  - with one C template (good)

- 1 variant of TrueCrypt (SHA-512 + AES XTS)
  - with optimizations
  - not as fast as in john
  - no vectorization

- There is automatic bitslicing
  - only for fully unrolled code
  - it may wrap parts into cycles to mitigate code-cache problems
  - bitsliced LM is ~2x slower than in john

- Hash parsers are generated too

- New types in IR and respective optimizations:
  - functions
    - inlining
    - Merkle-Damgard construction as instruction
  - byte buffers
    - computation of known lengths
    - join/concat + split undoing
    - concat + slice moving

- There is pattern matching to replace parts of IR easier
- There is interpreter of IR for debugging

- Borrowing of code was tried successfully:
  - python with changes from pyaes
  - linear parts in C from sphlib

- Overall results are not descriptive so far:
  - it is not obvious that the goals of project can be met
  - there are no obvious blocks though
- It feels like writing a compiler, it is pure fun!


Optimizations hard for humans and machines

- computers are good at repetitive tasks including bruteforcing
- humans are good at innovations and pattern matching
- also humans may beat a general purpose compiler using knowledge
  specific to the task

- example win of humans: optimizing SHA-1 and SHA-256
  - both have an extender and main loop
  - optimal SHA-1: unroll both loops and intermix them
  - optimal SHA-256: unroll only the main loop
  - SHA-1 is big enough to prevent compiler from intermixing loops
    even if loops were unrolled forcefully

- example win of machines: bitslice
  - Roman Rusakov generated current formulas for S-boxes of DES used in
    John the Ripper with custom generator
    - it took 3 months
  - Billy Bob Brumley demonstrated application of simulated annealing
    algorithm to scheduling of DES S-box instructions

- so john-devkit tries to be a tool that moves repetitive tasks from
  humans to machines saving full flexibility needed for innovations
  and precise application of optimizations


Bitslice

- splits numbers into bits and computes everything through bitwise
  operations
- optimization focuses on minimization of Boolean formula (or circuit)

- number of instructions per operator, bitsliced vs regular:
  - constant shifts/rotates and permutations of bits are done for free
  - bitwise ops are 1:1
  - arithmetic addition grows significantly
  - static S-boxes may benefit from bitslicing
- also we need to pack/unpack vectors of bits (transpose matrix of bits)

- bitslicing may be done automatically in many cases if vectorization
  may be done automatically

- bitslicing implies increase in number of variables
  - 64 variables for one 64-bit integer variable
  - developer needs to manage these variables
- john-devkit grew from experiments to make work with bitslice more
  convenient


Hash parsers

john-devkit generates hash parsers for valid(), binary() and
get_salt() from simple description in python.

Example of code to generate hmac-md5 format with parser for cram-md5
variant ('$cram_md5$'.base64(salt).'$'.base64('<...> '.hex(hash))) and
regular format (salt.'#'.hash):

gen_format(
    ('hmac-md5', grep_tests('hmacMD5_fmt_plug.c', 'tests')), 16,
    make_hmac('md5'),
    [ 'or',
      '$cram_md5$', [ 'split', '$',
                      [ 'b64', [ 'mime', 'raw', 'MIME_TRAIL_EQ' ], 'decode', 'salt' ],
                      [ 'b64', [ 'mime', 'raw', 'MIME_TRAIL_EQ' ], 'decode',
                        [ 'split_right', ' ',
                          [ 'any', 'ignore' ],
                          [ 'hex', 32, 32, 'decode', 'hash' ] ] ] ],
      True, [ 'split_right', '#',
              [ 'any', 'salt' ],
              [ 'hex', 32, 32, 'decode', 'hash' ] ] ]
)

While the generator has its own hack value, a parser engine in C would
be better due to much smaller binary code without duplications across
all the formats. It is possible to write a parser engine in C using
similar semantics of description and allowing definition of parsers in
runtime.


Own compiler

- some optimizations in john-devkit resemble regular compilers
- it is fun to develop your own compiler
- it is rather easy in case of john-devkit because
  - cryptography is relatively small field
    - there is no need to implement rich general purpose language
    - for instance, some useful implementations of some algorithms
      need only 1 type of variables
  - code generation delegates low-level stuff to regular compiler
  - a lot of work on front-end was saved using embedded DSL
    - such approach allowed to use Python as macro-language, very rich
      macro-language
  - "programs" are developed together with the "compiler"
    - some transformations are very specific
      - pattern matching should simplify their creation
      - JFYI other languages allow similar thing too: Haskell by
        rewrite rules, C by clang rewriter
  - it is possible to add new primitives and transformations
  - ideas may be incorporated at different stages, there is more space
    to choose
  - john-devkit is designed to provide manual control over optimizations
    - so john-devkit is simpler
    - but it is not obvious yet how to optimize 100+ formats saving
      such flexibility
  - it is possible to write transformations with lower quality than in
    regular compiler: less generic, having assumptions
    - hence there are undocumented assumptions in the code (bad)
  - john-devkit itself may be slow
    - filters are isolated from each other outside of IR
      - almost all things are transferred through the list with instructions
        - there are "instructions" with meta-information
      - some things are passed explicitly at the very top-level, where
        all filters are called
      - so results of the filters are observable as of IR is readable
        and everything else is explicit
      - maybe there will be single global object shared between
        filters to store small settings to move common branching
        patterns from the top level into special filters (like
        vectorize_maybe)
  - john-devkit generates code that is not beautiful
  - John the Ripper provides rich foundation for development including
    - test vectors (including ad-hoc dynamic formats)
    - mechanism for benchmarking
    - implementations to get ideas from
    - abstraction for portable vectorization
    - and the main program: we compile only a part and embed it into
      big project written in regular C
  - hash cracking provides its own useful limitations:
    - we have as many data sets on input as we want
      - so vectorization occurs after all optimizations, there is
        almost no need to keep separate optimizations for vectorized
        code
    - length of input is limited


Thanks!

-- 
Regards,
Aleksey Cherepanov
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.