Date: Sat, 10 Oct 2020 20:25:14 -0400
From: Rich Felker <dalias@...c.org>
To: Denys Vlasenko <vda.linux@...glemail.com>
Cc: Denys Vlasenko <dvlasenk@...hat.com>, musl@...ts.openwall.com
Subject: Re: [PATCH] x86/memset: avoid performing final store twice

On Sun, Oct 04, 2020 at 12:32:09AM +0200, Denys Vlasenko wrote:
> From: Denys Vlasenko <dvlasenk@...hat.com>
>
> For the not-very-short NBYTES case:
>
> To handle tail alignment, the code performs a potentially
> misaligned word store to fill the final 8 bytes of the buffer.
> This is done even if the buffer's end is aligned.
>
> Eventually the code fills the rest of the buffer, which is now a
> multiple of 8 bytes, with NBYTES / 8 aligned word stores.
>
> However, this means that if NBYTES *was* divisible by 8,
> we store the last word a second time.
>
> This patch decrements the byte count before dividing it by 8,
> making one less store in the "NBYTES is divisible by 8" case,
> and changing nothing in all other cases.
>
> CC: Rich Felker <dalias@...c.org>
> CC: musl@...ts.openwall.com
> Signed-off-by: Denys Vlasenko <dvlasenk@...hat.com>
> ---
>  src/string/i386/memset.s   | 7 ++++---
>  src/string/x86_64/memset.s | 2 +-
>  2 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/src/string/i386/memset.s b/src/string/i386/memset.s
> index d00422c4..b1c5c2f8 100644
> --- a/src/string/i386/memset.s
> +++ b/src/string/i386/memset.s
> @@ -47,7 +47,7 @@ memset:
>  	mov %edx,(-1-2-4-8-8)(%eax,%ecx)
>  	mov %edx,(-1-2-4-8-4)(%eax,%ecx)
>
> -1:	ret
> +1:	ret
>
>  2:	movzbl 8(%esp),%eax
>  	mov %edi,12(%esp)
> @@ -57,13 +57,14 @@ memset:
>  	mov %eax,-4(%edi,%ecx)
>  	jnz 2f
>
> -1:	shr $2, %ecx
> +1:	dec %ecx
> +	shr $2, %ecx
>  	rep
>  	stosl
>  	mov 4(%esp),%eax
>  	mov 12(%esp),%edi
>  	ret
> -
> +
>  2:	xor %edx,%edx
>  	sub %edi,%edx
>  	and $15,%edx
> diff --git a/src/string/x86_64/memset.s b/src/string/x86_64/memset.s
> index 2d3f5e52..85bb686c 100644
> --- a/src/string/x86_64/memset.s
> +++ b/src/string/x86_64/memset.s
> @@ -53,7 +53,7 @@ memset:
>  2:	test $15,%edi
>  	mov %rdi,%r8
>  	mov %rax,-8(%rdi,%rdx)
> -	mov %rdx,%rcx
> +	lea -1(%rdx),%rcx
>  	jnz 2f
>
>  1:	shr $3,%rcx
> --
> 2.25.0

Does this have measurably better performance on a system you've tested
it on? Indeed it nominally stores less, but it also performs an extra
arithmetic operation. I doubt it makes anything worse, but I'd like to
avoid making random changes to such functions unless there's a good
reason to believe it actually helps.

Rich
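[Editor's note: the store pattern the patch is about can be modeled in C. The sketch below is illustrative only, not musl's actual assembly: the tail of the buffer is covered by one unconditional (possibly overlapping) word store, and the body is then filled with word stores. Counting (n - 1) / 8 body words instead of n / 8 skips the final word exactly when n is a multiple of 8, since that word was already written by the tail store.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative model of the discussed technique (names are ours, not
 * musl's). Alignment handling from the real asm is omitted; memcpy
 * stands in for the unaligned word stores. */
static void *memset_sketch(void *dest, int c, size_t n)
{
	unsigned char *s = dest;
	/* Replicate the fill byte into all 8 bytes of a word. */
	uint64_t word = 0x0101010101010101ULL * (unsigned char)c;

	if (n < 8) {	/* tiny case: plain byte loop */
		for (size_t i = 0; i < n; i++) s[i] = (unsigned char)c;
		return dest;
	}

	/* Unconditional store covering the final 8 bytes, even if the
	 * end is already word-aligned; it may overlap the body below. */
	memcpy(s + n - 8, &word, 8);

	/* Fill the body. (n - 1) / 8 words still cover every byte up
	 * to the tail store, but never re-store the last word when
	 * n % 8 == 0 -- the one-store saving the patch is after. */
	size_t words = (n - 1) / 8;
	for (size_t i = 0; i < words; i++)
		memcpy(s + 8 * i, &word, 8);

	return dest;
}
```

With n / 8 in place of (n - 1) / 8, the code is still correct but the final body word duplicates the tail store whenever n is divisible by 8, which is the redundancy the patch removes at the cost of one extra `dec`/`lea`.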