musl - Re: [PATCH] x86/memset: avoid performing final store twice

Follow @Openwall on Twitter for new release announcements and other news

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20201011002514.GF17637@brightrain.aerifal.cx>
Date: Sat, 10 Oct 2020 20:25:14 -0400
From: Rich Felker <dalias@...c.org>
To: Denys Vlasenko <vda.linux@...glemail.com>
Cc: Denys Vlasenko <dvlasenk@...hat.com>, musl@...ts.openwall.com
Subject: Re: [PATCH] x86/memset: avoid performing final store twice

On Sun, Oct 04, 2020 at 12:32:09AM +0200, Denys Vlasenko wrote:
> From: Denys Vlasenko <dvlasenk@...hat.com>
> 
> For not very short NBYTES case:
> 
> To handle the tail alignment, the code performs a potentially
> misaligned word store to fill the final 8 bytes of the buffer.
> This is done even if the buffer's end is aligned.
> 
> Eventually code fills the rest of the buffer, which is a multiple
> of 8 bytes now, with NBYTES / 8 aligned word stores.
> 
> However, this means that if NBYTES *was* divisible by 8,
> we store last word too, again.
> 
> This patch decrements byte count before dividing it by 8,
> making one less store in "NBYTES is divisible by 8" case,
> and not changing anything in all other cases.
> 
> CC: Rich Felker <dalias@...c.org>
> CC: musl@...ts.openwall.com
> Signed-off-by: Denys Vlasenko <dvlasenk@...hat.com>
> ---
>  src/string/i386/memset.s   | 7 ++++---
>  src/string/x86_64/memset.s | 2 +-
>  2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/src/string/i386/memset.s b/src/string/i386/memset.s
> index d00422c4..b1c5c2f8 100644
> --- a/src/string/i386/memset.s
> +++ b/src/string/i386/memset.s
> @@ -47,7 +47,7 @@ memset:
>  	mov %edx,(-1-2-4-8-8)(%eax,%ecx)
>  	mov %edx,(-1-2-4-8-4)(%eax,%ecx)
>  
> -1:	ret 	
> +1:	ret
>  
>  2:	movzbl 8(%esp),%eax
>  	mov %edi,12(%esp)
> @@ -57,13 +57,14 @@ memset:
>  	mov %eax,-4(%edi,%ecx)
>  	jnz 2f
>  
> -1:	shr $2, %ecx
> +1:	dec %ecx
> +	shr $2, %ecx
>  	rep
>  	stosl
>  	mov 4(%esp),%eax
>  	mov 12(%esp),%edi
>  	ret
> -	
> +
>  2:	xor %edx,%edx
>  	sub %edi,%edx
>  	and $15,%edx
> diff --git a/src/string/x86_64/memset.s b/src/string/x86_64/memset.s
> index 2d3f5e52..85bb686c 100644
> --- a/src/string/x86_64/memset.s
> +++ b/src/string/x86_64/memset.s
> @@ -53,7 +53,7 @@ memset:
>  2:	test $15,%edi
>  	mov %rdi,%r8
>  	mov %rax,-8(%rdi,%rdx)
> -	mov %rdx,%rcx
> +	lea -1(%rdx),%rcx
>  	jnz 2f
>  
>  1:	shr $3,%rcx
> -- 
> 2.25.0

Does this have measurably better performance on a system you've tested
it on? Indeed it nominally stores less, but it also performs an extra
arithmetic operation. I doubt it makes anything worse but I'd like to
avoid making random changes to such functions unless there's a good
reason to believe it actually helps.

Rich

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.