kernel-hardening - Re: [PATCH v2] siphash: add cryptographically secure hashtable function

Follow @Openwall on Twitter for new release announcements and other news

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAHmME9qSW1U3dU+VjV8UBz=XOMfpbTkOCyrz74VnQTNcJW_FUw@mail.gmail.com>
Date: Mon, 12 Dec 2016 06:48:27 +0100
From: "Jason A. Donenfeld" <Jason@...c4.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: "kernel-hardening@...ts.openwall.com" <kernel-hardening@...ts.openwall.com>, 
	LKML <linux-kernel@...r.kernel.org>, 
	Linux Crypto Mailing List <linux-crypto@...r.kernel.org>, George Spelvin <linux@...izon.com>, 
	Scott Bauer <sbauer@....utah.edu>, Andi Kleen <ak@...ux.intel.com>, 
	Andy Lutomirski <luto@...capital.net>, Greg KH <gregkh@...uxfoundation.org>, 
	Jean-Philippe Aumasson <jeanphilippe.aumasson@...il.com>, "Daniel J . Bernstein" <djb@...yp.to>
Subject: Re: [PATCH v2] siphash: add cryptographically secure hashtable function

Hey Linus,

On Mon, Dec 12, 2016 at 5:01 AM, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
> The above is extremely inefficient. Considering that most kernel data
> would be expected to be smallish, that matters (ie the usual benchmark
> would not be about hashing megabytes of data, but instead millions of
> hashes of small data).
>
> I think this could be rewritten (at least for 64-bit architectures) as
>
>     #ifdef CONFIG_DCACHE_WORD_ACCESS
>
>         if (left)
>                 b |= le64_to_cpu(load_unaligned_zeropad(data) &
> bytemask_from_count(left));
>
>     #else
>
>         .. do the duff's device thing with the switch() ..
>
>     #endif
>
> which should give you basically perfect code generation (ie a single
> 64-bit load and a byte mask).

I modified the test to hash data of size 0 through 7 repeatedly
100000000 times, and benchmarked that a few times on a Skylake laptop.
The `load_unaligned_zeropad & bytemask_from_count` version was
consistently 7% slower.

I then modified it again to simply hash a 4 byte constant repeatedly
1000000000 times. The `load_unaligned_zeropad & bytemask_from_count`
version was around 6% faster. I tried again with a 7 byte constant and
got more or less a similar result.

Then I tried with a 1 byte constant, and found that the
`load_unaligned_zeropad & bytemask_from_count` version was slower.

So, it would seem that between the `if (left)` and the `switch
(left)`, there's the same number of branches. But for small values of
`left`, the duff's device just has simpler arithmetic, whereas for
large values of `left`, the `load_unaligned_zeropad` prevails. If
micro-optimization is really appealing, one could imagine a hybrid of
the two:

    switch (left) {
    case 7:
    case 6:
    case 5:
    case 4:
        b |= le64_to_cpu(load_unaligned_zeropad(data) &
bytemask_from_count(left));
        break;
    case 3: b |= ((u64)data[2]) << 16;
    case 2: b |= ((u64)data[1]) <<  8;
    case 1: b |= ((u64)data[0]); break;
    case 0: break;
    }

But I'm not sure this complication is worth it, and it might be more
likely that the left-over size is 4 bytes most of the time, so we
should just use your trick on platforms that support it.

Jason

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.