Date: Mon, 6 Jul 2020 18:12:43 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Release prep for 1.2.1, and afterwards

On Fri, Jun 26, 2020 at 10:40:49AM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@...c.org> [2020-06-25 21:20:06 -0400]:
> > On Thu, Jun 25, 2020 at 05:15:42PM -0400, Rich Felker wrote:
> > > On Thu, Jun 25, 2020 at 04:50:24PM -0400, Rich Felker wrote:
> > > > > > > but it would be nice if we could get the aarch64
> > > > > > > memcpy patch in (the c implementation is really
> > > > > > > slow, and i've seen people compare aarch64 vs x86
> > > > > > > server performance with some benchmark on alpine..)
> > > > > > 
> > > > > > OK, I'll look again.
> > > > > 
> > > > > thanks.
> > > > > 
> > > > > (there are more aarch64 string functions in the
> > > > > optimized-routines github repo but i think they
> > > > > are not as important as memcpy/memmove/memset)
> > > > 
> > > > I found the code. Can you comment on performance and whether memset is
> > > > needed? (The C memset should be rather good already, more so than
> > > > memcpy.)
> 
> the asm seems faster in all measurements but there is
> a lot of variance with different size/alignment cases.
> 
> the avg improvement on a typical workload, and the range of
> possible improvements i'd expect across various cases and cores:
> 
> memcpy typical: 1.6x-1.7x
> memcpy possible: 1.2x-3.1x
> 
> memset typical: 1.1x-1.4x
> memset possible: 1.0x-2.6x
> 
> > > Are the assumptions (v8-a, unaligned access) documented in memcpy.S
> > > valid for all presently supportable aarch64?
> 
> yes, unaligned access on normal memory in userspace
> is valid (part of the base abi on linux).
> 
> iirc a core can be configured to trap unaligned access,
> and it is not valid on device memory, so e.g. such a
> memcpy would not work in the kernel. but avoiding
> unaligned access in memcpy is not enough to fix that:
> the compiler will generate an unaligned load for
> 
> int f(char *p)
> {
>     int i;
>     __builtin_memcpy(&i,p,sizeof i);
>     return i;
> }
> 
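For illustration (my sketch, not part of nsz's mail): gcc -O2 on
aarch64 compiles the above to a single 32-bit load, performed at
whatever alignment p happens to have:

f:
	ldr	w0, [x0]	// one word load, possibly unaligned
	ret

so avoiding compiler-generated unaligned accesses takes something
like -mstrict-align on the whole translation unit, not just an
alignment-respecting memcpy.
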
> > > 
> > > A couple comments for merging if we do, that aren't hard requirements
> > > but preferences:
> > > 
> > > - I'd like to expand out the macros from ../asmdefs.h since that won't
> > >   be available and they just hide things (I guess they're attractive
> > >   for Apple/Mach-O users or something but not relevant to musl), and
> > >   since the symbol name lines need to be changed anyway to the public
> > >   name. "Local var name" macros are ok to leave; changing them would
> > >   be too error-prone, and they make the code more readable anyway.
> 
> the weird macros are there so the code stays similar to the
> glibc asm code (which adds cfi annotations and optionally
> profiling hooks at entry, etc.)
> 
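For concreteness, roughly what expanding those macros amounts to (a
hedged sketch; the exact definitions in asmdefs.h may differ):

/* ENTRY (memcpy) expands to something like: */
.global memcpy
.type memcpy,%function
memcpy:

/* and END (memcpy) to: */
.size memcpy, .-memcpy

plus whatever cfi annotations the glibc-style variants add.
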
> > > 
> > > - I'd prefer not to have memmove logic in memcpy since it makes it
> > >   larger and implies that misuse of memcpy when you mean memmove is
> > >   supported usage. I'd be happy with an approach like x86 though,
> > >   defining an __memcpy_fwd alias and having memmove tail call to that
> > >   unless len>128 and reverse is needed, or just leaving memmove.c.
> 
> in principle the code should be called memmove, not memcpy,
> since it satisfies the memmove contract, which of course
> works for memcpy too. so tail-calling memmove from memcpy
> would make more sense, but memcpy is more performance-critical
> than memmove, so we probably should not add extra branches
> there.
> 
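To make the __memcpy_fwd idea above concrete, here is a rough sketch
in C. It's hypothetical, not what was committed: __memcpy_fwd would
be a hidden alias for the forward-only asm copy, and the small-size
shortcut assumes the asm's <=128-byte path is overlap-safe, per the
"unless len>128 and reverse is needed" condition above.

#include <stdint.h>
#include <string.h>

void *__memcpy_fwd(void *, const void *, size_t);

void *memmove(void *dest, const void *src, size_t n)
{
	/* take the forward path when the copy is small enough that
	 * the asm tolerates overlap, or when dest lies outside
	 * [src, src+n) so a forward copy is always safe */
	if (n <= 128 || (uintptr_t)dest - (uintptr_t)src >= n)
		return __memcpy_fwd(dest, src, n);
	/* overlapping with dest above src: copy backward */
	char *d = dest;
	const char *s = src;
	while (n--) d[n] = s[n];
	return dest;
}
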
> > 
> > Something like the attached.
> 
> looks good to me.

I think you saw already, but just to make it clear on the list too,
it's upstream now. I'm open to further improvements like doing
memmove (either as a separate copy of the full implementation or some
minimal branch-to-__memcpy_fwd approach) but I think what's already
there is sufficient to solve the main practical performance issues
users were hitting that made aarch64 look bad in relation to x86_64.

I'd still like to revisit the topic of minimizing the per-arch code
needed for this so that all archs can benefit from the basic logic,
too.

Rich
