Date: Mon, 28 Apr 2014 21:59:26 +0200
From: magnum <>
Subject: Re: Re: mmap()

On 2014-04-27 22:47, magnum wrote:
> I'm experimenting with using SSE *with* mmap (not Atom's code) but
> since most words are shorter than 16 bytes it seems to be better using
> 32-bit or even 8-bit stuff.

The mmap stuff is now committed to bleeding-jumbo. The "problem" with 
SSE described above is gone: if the word is shorter than 16 bytes we 
still copy 16 bytes, but we then leave the loop knowing where to put the 
null byte. So it's now very fast for any length.
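
To illustrate the idea, here is a minimal sketch, not the committed code. 
The function name copy_word_sse2 is made up for this example. It assumes 
words are newline-terminated, that at least 16 bytes are readable from any 
position up to the terminating newline (or the caller handles the tail of 
the mapping separately), and that dst has 16 bytes of slack. It also relies 
on gcc's __builtin_ctz.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/*
 * Hypothetical helper, not the bleeding-jumbo implementation.
 * Copies one newline-terminated word from src to dst in 16-byte
 * chunks and null-terminates it. Returns the word length.
 * Assumption: reading/writing up to 15 bytes past the word is safe.
 */
static size_t copy_word_sse2(char *dst, const char *src)
{
    const __m128i nl = _mm_set1_epi8('\n');
    size_t off = 0;

    for (;;) {
        /* Unaligned 16-byte load and store, whatever the word length. */
        __m128i chunk = _mm_loadu_si128((const __m128i *)(src + off));
        _mm_storeu_si128((__m128i *)(dst + off), chunk);

        /* One mask bit per byte that equals '\n'. */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, nl));
        if (mask) {
            /* The lowest set bit marks the first newline in this chunk. */
            off += (size_t)__builtin_ctz((unsigned)mask);
            dst[off] = '\0';    /* overwrite the copied newline */
            return off;
        }
        off += 16;
    }
}
```

A short word costs one load, one store and one compare. Longer words 
simply take more trips through the loop.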

I have another problem though, and I'm asking for help or insight: The 
SSE2 version is a fine boost on Linux; I've tested it on a couple of 
Intels and an AMD. But when I run it on OS X with a mobile i7, it 
*halves* the speed. At first I thought it was poor handling of unaligned 
SSE accesses, but that seemed unlikely for this CPU. And now I have 
booted Linux on the same Macbook and confirmed that the SSE code runs 
just fine there, with a 6-7% boost over the 64-bit alternative code. In 
both cases it was compiled with gcc 4.7-ish. How the heck can SSE 
intrinsics end up that different? The OS should have absolutely nothing 
to do with it!?

For now I disable the SSE2 code path for __APPLE__ but I really think 
this is weird. I'll try peeking at the assembler output from the compiler.
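
The workaround amounts to a preprocessor guard along these lines. This is 
only a sketch: the macro USE_SSE2_COPY and the function copy_path_name are 
hypothetical names, and the actual guard in the tree may look different.

```c
/*
 * Hypothetical guard: take the SSE2 path only when the compiler
 * targets SSE2 and we are not building for OS X.
 */
#if defined(__SSE2__) && !defined(__APPLE__)
#define USE_SSE2_COPY 1
#else
#define USE_SSE2_COPY 0
#endif

/* Reports which copy path was compiled in; handy when benchmarking. */
static const char *copy_path_name(void)
{
    return USE_SSE2_COPY ? "sse2" : "64-bit";
}
```

Building the same source with gcc -S on both systems and diffing the 
output should show whether the intrinsics really compile to different 
code.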

