Date: Fri, 2 Aug 2013 16:41:47 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Cc: Andre Renaud <andre@...ewatersys.com>
Subject: Re: ARM memcpy post-0.9.12-release thread

Andre, do you have any input on this? (Cc'ing)

Rich

On Tue, Jul 30, 2013 at 10:26:31PM -0400, Rich Felker wrote:
> Hi all (especially Andre),
>
> I've been doing some experimenting with ARM memcpy, and I have not
> found any way to beat the Bionic asm file for misaligned copies. The
> best I could do with simple inline asm (reading multi-words and
> writing byte-at-a-time or vice versa) improved the performance nearly
> 40% compared to musl's current code, but it was still worse than half
> the speed of the Bionic asm.
>
> For the aligned case, however, as I've said before, the Bionic code
> runs 10% slower for me than the C-with-inline-asm I posted to the
> list. Commenting out the prefetch code in the Bionic version brings
> the performance up to the same as my version.
>
> I also found that the Bionic code was mysteriously crashing on the
> real system I test on (it worked on my toolchain with qemu). On
> further investigation, the test system's toolchain had -mthumb (with
> thumb2) as the default; adding -marm made it work. Both ways the asm
> was being interpreted as arm; the problem was that the *calling* code
> being thumb broke it. The solution was adding .type memcpy,%function
> to the asm file. Without that, the linker cannot know that the symbol
> it's resolving is a function name and thus that it has to adjust the
> low bit of the relocated address as a flag for whether the code is arm
> or thumb. I've now got the code working reliably it seems.
>
> Sizes so far:
> Current C code: 260 bytes
> My best-attempt inline asm: 352 bytes
> Bionic (with prefetch removed): 764 bytes
>
> Obviously the Bionic code is a bit larger than the others and than I'd
> like it to be, but it looks really hard to trim it down without
> ruining performance for misaligned copies; roughly half of the asm
> covers the misaligned case, which is expensive because you have three
> different code paths for different ways it can be off mod 4.
>
> One other issue we have to consider if we go with the Bionic code is
> that we'd need to add sub-arch asm dirs to use it. As-is, the code is
> hard-coded for little endian. It will shuffle the byte order badly
> when copying on a big endian machine.
>
> Some rough times (128k copy repeated 10000 times):
>
> Aligned case:
> Current C code: 1.2s
> My best-attempt C code: 0.75s
> My best-attempt inline asm: 0.57s
> Bionic asm: 0.63s
> Bionic asm without prefetch: 0.57s
>
> Misaligned case:
> Current C code: 4.7s
> My best-attempt inline asm: 2.9s
> Bionic asm: 1.1s
>
> Rich
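
For readers following the thread, here is a rough C sketch of the general shift-and-merge idea behind word-at-a-time misaligned copies. It is not taken from the Bionic asm or from the inline asm posted to the list, and the function names (merge_le, merge_be, copy_misaligned_words) are made up for illustration. It assumes the destination is word-aligned and the source starts k bytes (k = 1, 2 or 3) into an aligned word. The shift amounts depend on k, which is presumably why hand-written asm ends up with one code path per offset, and the shift direction depends on byte order, which is why a routine hard-coded for little endian would scramble bytes on a big endian machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: build the word that starts k bytes into `lo`
 * from two adjacent aligned words, little-endian layout. */
static uint32_t merge_le(uint32_t lo, uint32_t hi, unsigned k)
{
    return (lo >> (8 * k)) | (hi << (32 - 8 * k));
}

/* Same merge for a big-endian layout: the shifts go the other way. */
static uint32_t merge_be(uint32_t lo, uint32_t hi, unsigned k)
{
    return (lo << (8 * k)) | (hi >> (32 - 8 * k));
}

/* Run-time byte-order check, used so the sketch works on either
 * kind of machine; real asm would pick one variant at build time. */
static int is_little_endian(void)
{
    const uint32_t one = 1;
    unsigned char first;
    memcpy(&first, &one, 1);
    return first == 1;
}

/* Copy nwords words to dst from the byte address (char *)src + k.
 * Reads src[0] through src[nwords], so the caller must provide
 * nwords + 1 readable source words. k must be 1, 2 or 3; k == 0 is
 * the aligned case and needs no merging at all. */
static void copy_misaligned_words(uint32_t *dst, const uint32_t *src,
                                  unsigned k, size_t nwords)
{
    assert(k >= 1 && k <= 3);
    int le = is_little_endian();
    uint32_t lo = src[0];
    for (size_t i = 0; i < nwords; i++) {
        uint32_t hi = src[i + 1];
        dst[i] = le ? merge_le(lo, hi, k) : merge_be(lo, hi, k);
        lo = hi; /* each source word is loaded only once */
    }
}
```

Each aligned source word is loaded once and used twice, so the loop performs only aligned loads and stores. In this sketch k is a run-time value; with k fixed per loop the shifts become immediates, at the cost of three near-identical loops, which fits the size figures quoted above.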