Date: Tue, 7 Jul 2015 16:16:48 +0300
From: Solar Designer
Subject: Re: extend SIMD intrinsics


Your reply is slightly out of context.  I guess we confused you by
discussing several topics at once.  Earlier, we discussed use of
load/store intrinsics vs. simple assignments (or direct use of in-memory
SIMD operands in expressions with other intrinsics).  However, in the
message you replied to, and in the piece of it you quoted, we were
discussing different kinds of the "simple assignments" approach, which
may differ in how they interact with C strict aliasing rules and with
compiler optimizations unrelated to what you mention.

However, your comment is useful anyway, and I'll comment on it further:

On Mon, Jul 06, 2015 at 11:15:41PM -0400, Alain Espinosa wrote:
> In Visual C the difference of a simple assignment and a vload is that for the assignment the compiler generate an unaligned SIMD load instruction, and for vload it generates an aligned SIMD load with the usual restriction: if this memory access isn't aligned the required byte amount an exception is raised. In general the performance difference is negligible,  if any.

I saw similar behavior with recent gcc, but it wasn't as simple as you
describe.  It turned out that recent gcc started generating unaligned
SIMD load instructions when it didn't have a reliable way to see that
the access is aligned.  This meant that we should make the alignment
apparent to gcc - avoid going via opaque pointers (which, as discussed
elsewhere in this thread, also tends to violate strict aliasing rules).
When I corrected my code (bitslice DES code in JtR) to make the
alignment apparent to gcc, it stopped generating the unaligned load
instructions, generating the aligned ones instead.  I suspect Visual C
might be similar.
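
To illustrate what "making the alignment apparent" can look like, here
is a minimal sketch.  It is not the actual JtR code: it assumes the
GCC/Clang vector extensions, and the names vec128, buf_a, buf_b,
xor_typed and xor_untyped are hypothetical.  The first function uses
simple assignments on a vector type whose alignment is part of the type.
The second shows one way to state the alignment explicitly when only an
untyped pointer is available, via __builtin_assume_aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang vector extension: four 32-bit lanes in 16 bytes.
 * The type itself carries 16-byte alignment. */
typedef uint32_t vec128 __attribute__((vector_size(16)));

#define NVEC 4

/* Hypothetical buffers declared with the vector type, so their
 * alignment is visible at every point of use. */
static vec128 buf_a[NVEC];
static vec128 buf_b[NVEC];

/* Typed access: plain assignments, no load/store intrinsics. */
static void xor_typed(vec128 *dst, const vec128 *a, const vec128 *b,
                      size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] ^ b[i];
}

/* Untyped access: the caller promises 16-byte alignment, and we pass
 * that promise on to the compiler explicitly.  The pointed-to objects
 * are assumed to really be vec128, so no aliasing rule is bent. */
static void xor_untyped(void *dst, const void *a, const void *b,
                        size_t n)
{
    vec128 *d = __builtin_assume_aligned(dst, 16);
    const vec128 *x = __builtin_assume_aligned(a, 16);
    const vec128 *y = __builtin_assume_aligned(b, 16);

    for (size_t i = 0; i < n; i++)
        d[i] = x[i] ^ y[i];
}
```

Whether a given compiler version then emits aligned loads still has to
be checked in the generated assembly.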

As to the performance difference being negligible or non-existent, this
is true on recent Intel CPUs, but not true on older ones.  In
particular, I saw performance impact for bitslice DES on the order of
20% on Xeon E5420 (Core 2'ish), caused solely by the unaligned load
instructions.  Simply replacing those instructions (via sed applied to
gcc-generated assembly) with their aligned counterparts regained the
lost performance.  Correcting the source code to make the alignments
apparent to gcc had the same effect.
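
For reference, the two kinds of load being compared correspond to two
SSE2 intrinsics.  The sketch below assumes an x86 target with SSE2 and
uses hypothetical names (data, load_aligned, load_unaligned, lane_sum).
_mm_load_si128 typically compiles to movdqa and requires a 16-byte
aligned address, while _mm_loadu_si128 typically compiles to movdqu and
accepts any address.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, x86 only */
#include <stdint.h>

/* 16-byte aligned storage; the attribute is GCC/Clang syntax. */
static uint32_t data[8] __attribute__((aligned(16))) =
    { 1, 2, 3, 4, 5, 6, 7, 8 };

/* Aligned load: p must be 16-byte aligned, or the access may fault. */
static __m128i load_aligned(const uint32_t *p)
{
    return _mm_load_si128((const __m128i *)p);
}

/* Unaligned load: valid for any address. */
static __m128i load_unaligned(const uint32_t *p)
{
    return _mm_loadu_si128((const __m128i *)p);
}

/* Helper for checking results: sum of the four 32-bit lanes. */
static uint32_t lane_sum(__m128i v)
{
    uint32_t out[4];

    _mm_storeu_si128((__m128i *)out, v);
    return out[0] + out[1] + out[2] + out[3];
}
```

Here load_aligned(data) and load_aligned(data + 4) are valid, while
data + 1 is only suitable for load_unaligned.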

So there is in fact an issue for us to keep in mind here: if we avoid
the load/store intrinsics, we have to make sure the compiler is aware of
the alignment through other means, and we should review the generated
code to make sure it uses aligned loads/stores.  Well, or we may use the

Thank you for reminding me about this issue!

