Date: Sun, 15 Feb 2015 22:44:59 +0100
From: Denys Vlasenko <>
To: Rich Felker <>
Cc: musl <>
Subject: Re: [PATCH] x86_64/memset: use "small block" code for blocks
 up to 30 bytes long

On Sun, Feb 15, 2015 at 4:03 PM, Rich Felker <> wrote:
>> Just because we don't personally see a hit from 6-cycle imul of AMD CPUs,
>> it does not mean people who do use those CPUs don't exist. Have heart...
> Did you test the version I attached? I think there should be at least
> 4-5 cycles between when the imul is launched and when the result is
> used, so I'm failing to see how the latency is a big deal.

Okay, I won't insist.
Your version works well. The "rep stosq" setup time is still noticeable
even though we switch to it only for blocks longer than 126 bytes:

129 byte block: 10.37 bytes/ns
128 byte block: 10.65 bytes/ns
127 byte block: 10.58 bytes/ns
126 byte block: 18.44 bytes/ns
125 byte block: 18.30 bytes/ns
124 byte block: 18.15 bytes/ns

but I don't think we should do anything about this.


        lea -1(%rdx),%rcx
        cmp $126,%rcx
        jae 2f

Here you'd have a stall, since cmp needs the result of lea. Why not this?

        lea -1(%rdx),%rcx
        cmp $127,%rdx
        jae 2f

Then you can even move the lea into the "big buf" code path
(there is no point doing it in the "small buf" code, where its result is not used).
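
For reference, the two comparisons can be modeled in C. This is only a sketch: the function names are hypothetical, and uint64_t stands in for the 64-bit registers. With unsigned arithmetic the two forms agree for every length except n == 0, where n-1 wraps around to the maximum value and the lea+cmp form takes the "big buf" branch, while the direct compare does not.

```c
#include <stdbool.h>
#include <stdint.h>

/* lea -1(%rdx),%rcx ; cmp $126,%rcx ; jae 2f
 * Branch taken when (n - 1) >= 126, evaluated as unsigned. */
static bool big_path_lea_cmp(uint64_t n)
{
    return (n - 1) >= 126;   /* n == 0 wraps to UINT64_MAX: taken */
}

/* cmp $127,%rdx ; jae 2f
 * Branch taken when n >= 127; no dependency on the lea result. */
static bool big_path_direct_cmp(uint64_t n)
{
    return n >= 127;         /* n == 0: not taken */
}
```

So with the direct compare, a zero length stays on the "small buf" side, which matters for where the zero-length test has to be placed.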

Possible bug: this check seems misplaced:

2:      test %rdx,%rdx
        jz 1b

It should come before the byte stores:
        mov %sil,(%rdi)
        mov %sil,-1(%rdi,%rdx)
        cmp $2,%edx
        jbe 1f
Otherwise a zero-length memset will write two bytes, at buf[0] and buf[-1].
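
The ordering issue can be modeled in C. This is only a sketch under assumptions: small_fill and its interior loop are hypothetical and are not the musl code. The point is just that the length test has to precede the head and tail stores.

```c
#include <stddef.h>

/* Hypothetical model of a small-block fill: store the first and last
 * byte, then the interior. Not the musl implementation. */
static void small_fill(unsigned char *buf, unsigned char c, size_t n)
{
    if (n == 0)              /* must be checked before any store */
        return;
    buf[0] = c;              /* mov %sil,(%rdi) */
    buf[n - 1] = c;          /* mov %sil,-1(%rdi,%rdx) */
    for (size_t i = 1; i + 1 < n; i++)
        buf[i] = c;          /* interior bytes, only when n > 2 */
}
```

Without the early return, n == 0 would make the second store hit buf[n - 1], that is buf[-1], and the first store would hit buf[0].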

"sub $8,%rcx" can be folded into lea.

Please see attached file.

Download attachment "vda1.s" of type "application/octet-stream" (1056 bytes)
