Date: Tue, 17 Feb 2015 12:40:45 -0500
From: Rich Felker <dalias@...c.org>
To: Denys Vlasenko <vda.linux@...glemail.com>
Cc: musl <musl@...ts.openwall.com>
Subject: Re: [PATCH] x86_64/memset: use "small block" code for blocks up to 30 bytes long

On Tue, Feb 17, 2015 at 05:51:11PM +0100, Denys Vlasenko wrote:
> On Tue, Feb 17, 2015 at 5:12 PM, Rich Felker <dalias@...c.org> wrote:
> > On Tue, Feb 17, 2015 at 02:08:52PM +0100, Denys Vlasenko wrote:
> >> >> Please see attached file.
> >> >
> >> > I tried it and it's ~1 cycle slower for at least sizes 16-30;
> >> > presumably we're seeing the cost of the extra compare/branch at these
> >> > sizes but not at others. What does your timing test show?
> >>
> >> See below.
> >> First column - result of my2.s
> >> Second column - result of vda1.s
> >>
> >> Basically, the "rep stosq" code path got a bit faster, while
> >> small memsets stayed the same.
> >
> > Can you post your test program for me to try out? Here's what I've
> > been using, attached.
>
> With your program I see similar results:
>
> ....
> size 50: min=10, avg=10   min=10, avg=10
> size 52: min=10, avg=10   min=10, avg=10

The ... was the part where mine seemed better. :)

Anyway thanks; I'll give your test program a run and see what comes
out. I don't think the difference is going to be big either way, but I
suspect mine is slightly faster for small sizes (~1-30) and slightly
slower for large sizes (>126).

BTW I appreciate your work and interest in improving this. I just
don't like string-ops optimization in general because determining that
changes are actually a net gain for a wide range of cpus and usage
cases and not just for one benchmark turns into a big time sink. :-(
But at least it's fun...

Rich