Date: Tue, 17 Feb 2015 18:30:37 +0100
From: Denys Vlasenko <>
To: Rich Felker <>
Cc: musl <>
Subject: Re: [PATCH] x86_64/memset: use "small block" code for blocks
 up to 30 bytes long

On Tue, Feb 17, 2015 at 5:51 PM, Denys Vlasenko
<> wrote:
> On Tue, Feb 17, 2015 at 5:12 PM, Rich Felker <> wrote:
>> On Tue, Feb 17, 2015 at 02:08:52PM +0100, Denys Vlasenko wrote:
>>> >> Please see attached file.
>>> >
>>> > I tried it and it's ~1 cycle slower for at least sizes 16-30;
>>> > presumably we're seeing the cost of the extra compare/branch at these
>>> > sizes but not at others. What does your timing test show?
>>> See below.
>>> First column - result of my2.s
>>> Second column - result of vda1.s
>>> Basically, the "rep stosq" code path got a bit faster, while
>>> small memsets stayed the same.
>> Can you post your test program for me to try out? Here's what I've
>> been using, attached.
> With your program I see similar results:

I changed your program to output floating-point results
and to run many more iterations when finding the minimum,
since otherwise (on my machine) consecutive runs differ
by +-2 cycles for most measurements.
With one million iterations, the discrepancy between
runs is often zero, and when it's not, it's one cycle or less.
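
For readers without the attachment, the minimum-over-many-runs idea can be sketched roughly as below. This is not the attached memset-cycles-vda.c; the names min_ticks_per_call, ticks, memset_ptr and the batch parameter are hypothetical, and dividing a batch of calls by the batch size is only one assumed way to arrive at a fractional result. On x86 the tick source is the TSC (GCC/Clang builtin); elsewhere it falls back to nanoseconds.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <float.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Call memset through a volatile pointer so the compiler can neither
 * inline the call nor drop it as dead code. (Hypothetical helper.) */
static void *(*volatile memset_ptr)(void *, int, size_t) = memset;

static unsigned char buf[256];

/* Tick source: TSC on x86, monotonic nanoseconds elsewhere.
 * The fallback is an assumption made for portability only. */
static uint64_t ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __builtin_ia32_rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Time `batch` memset calls of `len` bytes, repeat that `runs` times,
 * and return the smallest observed ticks-per-call as a double.
 * Timer overhead is included in the figure and not subtracted. */
double min_ticks_per_call(size_t len, long runs, int batch)
{
    double best = DBL_MAX;

    assert(runs > 0 && batch > 0);
    if (len > sizeof buf)
        len = sizeof buf;

    for (long r = 0; r < runs; r++) {
        uint64_t t0 = ticks();
        for (int i = 0; i < batch; i++)
            memset_ptr(buf, i, len);
        uint64_t t1 = ticks();

        double per_call = (double)(t1 - t0) / batch;
        if (per_call < best)
            best = per_call;  /* keep the least-disturbed run */
    }
    return best;
}
```

Calling, say, min_ticks_per_call(30, 1000000, 16) from a small main() and printing the result with "%.2f" gives one figure per size. Taking the minimum rather than the mean discards runs disturbed by interrupts or cache misses, which is presumably why more iterations reduce the run-to-run discrepancy.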

Please see attached files.
my2.OUT1 and my2.OUT2 are two runs of my2.s code
(to judge how much noise is in the measurements).

Download attachment "my2.OUT1" of type "application/octet-stream" (1533 bytes)

Download attachment "vda1.OUT1" of type "application/octet-stream" (1533 bytes)

View attachment "memset-cycles-vda.c" of type "text/x-csrc" (1222 bytes)

Download attachment "my2.OUT2" of type "application/octet-stream" (1533 bytes)
