Date: Tue, 10 Feb 2015 21:27:17 +0100
From: Denys Vlasenko <>
To: Rich Felker <>,
Subject: Re: [PATCH] x86_64/memset: simple optimizations

On Sat, Feb 7, 2015 at 2:06 PM, Rich Felker <> wrote:
> On Sat, Feb 07, 2015 at 01:49:43PM +0100, Denys Vlasenko wrote:
>> On Sat, Feb 7, 2015 at 1:35 AM, Rich Felker <> wrote:
>> What speedups?
>> In particular:
>> - perform pre-alignment if dst is unaligned
> For the rep stosq path? Does it help? I don't recall the details but I
> seem to remember both docs and measurements showing no reliable
> benefit from alignment for this instruction, and we had people trying
> things on several different cpu models. I'm open to hearing evidence
> to the contrary though.

size:20k buf:0x7f38656e2100
stos:25978 ns (times 32), 25.227500 bytes/ns
stos+1:31395 ns (times 32), 20.874662 bytes/ns
stos+4:31396 ns (times 32), 20.873997 bytes/ns
stos+8:24446 ns (times 32), 26.808476 bytes/ns

size:50k buf:0x7fbca1dc9100
stos:68149 ns (times 32), 24.041439 bytes/ns
stos+1:85762 ns (times 32), 19.104032 bytes/ns
stos+4:85762 ns (times 32), 19.104032 bytes/ns
stos+8:68204 ns (times 32), 24.022051 bytes/ns

size:1024k buf:0x7fa3036a5100
stos:1632285 ns (times 32), 20.556724 bytes/ns
stos+1:1891092 ns (times 32), 17.743416 bytes/ns
stos+4:1891089 ns (times 32), 17.743444 bytes/ns
stos+8:1632181 ns (times 32), 20.558034 bytes/ns

size:5000k buf:0x7fdf5cd6b100
stos:15592138 ns (times 32), 10.558298 bytes/ns
stos+1:15501841 ns (times 32), 10.619799 bytes/ns
stos+4:15507773 ns (times 32), 10.615737 bytes/ns
stos+8:15589617 ns (times 32), 10.560005 bytes/ns

The source is attached.
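
The attachment is not reproduced here. As a rough sketch of how numbers
of this shape could be measured (this is not the attached t.c; the names
stosq_fill, bench_fill and bytes_per_ns are made up for illustration, and
it assumes that "stos+N" means the same fill started N bytes past the
buffer base):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Store n 8-byte words at p: rep stosq on x86_64 GCC/Clang,
 * a plain loop elsewhere so the sketch still builds. */
static void stosq_fill(void *p, uint64_t pattern, size_t n)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ volatile ("rep stosq"
                      : "+D"(p), "+c"(n)
                      : "a"(pattern)
                      : "memory");
#else
    unsigned char *d = p;
    while (n--) {
        memcpy(d, &pattern, 8);
        d += 8;
    }
#endif
}

/* Wall-clock nanoseconds via C11 timespec_get (not a monotonic clock). */
static uint64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Time `reps` fills of `size` bytes starting at buf + off.
 * buf must hold at least size + off bytes. */
static uint64_t bench_fill(unsigned char *buf, size_t size, size_t off,
                           unsigned reps)
{
    uint64_t t0 = now_ns();
    for (unsigned i = 0; i < reps; i++)
        stosq_fill(buf + off, 0, size / 8);
    return now_ns() - t0;
}

/* Throughput in the units used in the table: bytes stored per ns. */
static double bytes_per_ns(size_t size, unsigned reps, uint64_t ns)
{
    return (double)size * reps / (double)ns;
}
```

For the first row of the table, bytes_per_ns(20 * 1024, 32, 25978) gives
about 25.2275, which matches the reported figure if "20k" means 20480
bytes and the time is the total for all 32 runs.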

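This data shows that on my CPU (Sandy Bridge with 4MB L2),
8-byte alignment helps when the stores fit into L1 or L2:
at 20k, 50k and 1024k the stos+1 and stos+4 runs take roughly
16% to 26% longer than the aligned stos and stos+8 runs.
If the memset is larger than L2, memory throughput is the
bottleneck and there is no measurable difference.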
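
One way to get the aligned case regardless of the caller's pointer is to
pre-align the destination before the rep stosq, as proposed earlier in
the thread. A minimal sketch, assuming GCC/Clang inline asm on x86_64
with a portable fallback; memset_prealigned and fill_qwords are
hypothetical names, and this is neither musl's actual memset nor the
patch under discussion:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store n 8-byte words at p: rep stosq on x86_64 GCC/Clang,
 * a plain loop elsewhere so the sketch still builds. */
static void fill_qwords(void *p, uint64_t pattern, size_t n)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ volatile ("rep stosq"
                      : "+D"(p), "+c"(n)
                      : "a"(pattern)
                      : "memory");
#else
    unsigned char *d = p;
    while (n--) {
        memcpy(d, &pattern, 8);
        d += 8;
    }
#endif
}

/* Hypothetical memset variant: byte stores up to the next 8-byte
 * boundary, aligned 8-byte stores for the body, byte stores for
 * whatever is left. */
static void *memset_prealigned(void *dst, int c, size_t n)
{
    unsigned char *d = dst;
    unsigned char b = (unsigned char)c;
    uint64_t pattern = 0x0101010101010101ULL * b;

    /* Head: 0..7 bytes needed to reach 8-byte alignment. */
    size_t head = (size_t)(-(uintptr_t)d & 7);
    if (head > n)
        head = n;
    for (size_t i = 0; i < head; i++)
        d[i] = b;
    d += head;
    n -= head;

    /* Body: d is now 8-byte aligned (or n is 0). */
    fill_qwords(d, pattern, n / 8);
    d += n & ~(size_t)7;

    /* Tail: the remaining 0..7 bytes. */
    for (size_t i = 0; i < (n & 7); i++)
        d[i] = b;

    return dst;
}
```

The byte loops for head and tail are only there to keep the sketch
short; a tuned version would more likely use a few overlapping wider
stores instead.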

View attachment "t.c" of type "text/x-csrc" (3216 bytes)

