musl - Re: [PATCH] x86_64/memset: use "small block" code for blocks up to 30 bytes long

Message-ID: <20150216173634.GA23507@brightrain.aerifal.cx>
Date: Mon, 16 Feb 2015 12:36:35 -0500
From: Rich Felker <dalias@...c.org>
To: Denys Vlasenko <vda.linux@...glemail.com>
Cc: musl <musl@...ts.openwall.com>
Subject: Re: [PATCH] x86_64/memset: use "small block" code for blocks
 up to 30 bytes long

On Sun, Feb 15, 2015 at 10:44:59PM +0100, Denys Vlasenko wrote:
> On Sun, Feb 15, 2015 at 4:03 PM, Rich Felker <dalias@...c.org> wrote:
> >> Just because we don't personally see a hit from 6-cycle imul of AMD CPUs,
> >> it does not mean people who do use those CPUs don't exist. Have heart...
> >
> > Did you test the version I attached? I think there should be at least
> > 4-5 cycles between when the imul is launched and when the result is
> > used, so I'm failing to see how the latency is a big deal.
> 
> Okay, I won't insist.
> Your version works well. The "rep stosq" setup time is still noticeable
> even when we switch to it above 126 bytes:
> 
> 129 byte block: 10.37 bytes/ns
> 128 byte block: 10.65 bytes/ns
> 127 byte block: 10.58 bytes/ns
> 126 byte block: 18.44 bytes/ns
> 125 byte block: 18.30 bytes/ns
> 124 byte block: 18.15 bytes/ns
> 
> but I don't think we should do anything about this.

Agreed. The code size is really going to blow up at the next step,
and hopefully future CPUs will get less bad about pessimizing rep
stosq startup.
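
Throughput figures like the bytes/ns numbers quoted above can be
gathered with a small harness. The following is a minimal sketch in C,
not the test program used in this thread: the names memset_under_test
and memset_bytes_per_ns, the buffer size, and the iteration count are
all assumptions, and it relies on POSIX clock_gettime.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime in strict modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Calling through a volatile pointer keeps the compiler from inlining
   or eliding the memset calls being timed. */
static void *(*volatile memset_under_test)(void *, int, size_t) = memset;

/* Hypothetical helper: returns throughput in bytes per nanosecond for
   `iters` back-to-back calls that each fill n bytes. */
static double memset_bytes_per_ns(size_t n, long iters)
{
    static unsigned char buf[4096];
    struct timespec t0, t1;

    assert(n <= sizeof buf && iters > 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        memset_under_test(buf, (int)(i & 0xff), n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9
              + (t1.tv_nsec - t0.tv_nsec);
    return ns > 0 ? (double)n * iters / ns : 0.0;
}
```

Calling memset_bytes_per_ns(n, iters) for n from 124 to 129 would show
whether a step like the one above appears on a given machine; absolute
numbers depend on the CPU and the libc being measured.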

> "sub $8,%rcx" can be folded into lea.
> 
> Please see attached file.

I tried it and it's ~1 cycle slower for at least sizes 16-30;
presumably we're seeing the cost of the extra compare/branch at these
sizes but not at others. What does your timing test show?
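
For context on why one more compare/branch can matter at sizes 16-30,
here is a rough C illustration of the "small block" idea: broadcast the
fill byte with a multiply (the role the imul plays in the asm), then
issue paired head/tail stores that may overlap, with one size check per
doubling. This is only a sketch and not the attached patch; the names
broadcast_byte and small_memset are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Replicate byte c into all 8 bytes of a 64-bit word. */
static uint64_t broadcast_byte(unsigned char c)
{
    return (uint64_t)c * 0x0101010101010101ULL;
}

/* Hypothetical helper: fill n <= 30 bytes using head/tail store pairs.
   Each pair doubles the covered range, so larger sizes pass through
   more compare/branch steps before returning. */
static void small_memset(unsigned char *d, unsigned char c, size_t n)
{
    uint64_t v = broadcast_byte(c);

    assert(n <= 30);
    if (n == 0) return;

    d[0] = c;                      /* 1-byte stores cover n = 1..2 */
    d[n - 1] = c;
    if (n <= 2) return;

    memcpy(d + 1, &v, 2);          /* 2-byte stores cover n = 3..6 */
    memcpy(d + n - 3, &v, 2);
    if (n <= 6) return;

    memcpy(d + 3, &v, 4);          /* 4-byte stores cover n = 7..14 */
    memcpy(d + n - 7, &v, 4);
    if (n <= 14) return;

    memcpy(d + 7, &v, 8);          /* 8-byte stores cover n = 15..30 */
    memcpy(d + n - 15, &v, 8);
}
```

In this shape, sizes 15-30 are the ones that reach the last pair of
stores, so any check added on that path is paid only by those sizes.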

Rich
