Date: Thu, 11 Jul 2013 10:44:16 +1200
From: Andre Renaud <>
To: Andre Renaud <>
Subject: Re: Thinking about release

> This results in 95MB/s on my platform (up from 65MB/s for the existing
> memcpy.c, and down from 105MB/s with the asm optimised version). It is
> essentially identically readable to the existing memcpy.c. I'm not
> really familiar with any other CPU architectures, so I'm not sure if
> this would improve, or hurt, performance on other platforms.

Reviewing the assembler that GCC produces, it appears that GCC will
never generate an ldm/stm (load/store multiple) instruction that
transfers more than 4 registers, whereas the optimised assembler uses
ones that transfer 8 (i.e. 8 * 32-bit reads in a single instruction).
I've tried various tricks/optimisations with the C code, and can't
convince GCC to do more than 4. I assume this is probably where the
remaining 10MB/s difference between the two variants comes from.
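
For illustration only, here is a minimal sketch of the kind of C loop
that tries to expose 8 word loads followed by 8 word stores per
iteration. It is not the memcpy.c discussed in this thread; the name
copy_words8 is hypothetical, and it assumes both pointers are 4-byte
aligned. Whether a given GCC version turns this into an 8-register
ldm/stm pair is not guaranteed (the observation above suggests it may
stop at 4), so the output of gcc -O2 -S would need to be inspected on
the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not musl's memcpy: copies n bytes.
 * Assumes dest and src are both 4-byte aligned and do not overlap.
 * Note: accessing arbitrary memory through uint32_t can violate strict
 * aliasing; a real implementation would need a may_alias type or
 * similar. */
static void copy_words8(void *restrict dest, const void *restrict src,
                        size_t n)
{
    uint32_t *d = dest;
    const uint32_t *s = src;

    assert(((uintptr_t)dest & 3) == 0 && ((uintptr_t)src & 3) == 0);

    /* Main loop: 8 loads, then 8 stores, i.e. 32 bytes per iteration.
     * Grouping the loads before the stores is meant to give the
     * compiler the chance to use load/store-multiple instructions. */
    for (; n >= 32; n -= 32, s += 8, d += 8) {
        uint32_t w0 = s[0], w1 = s[1], w2 = s[2], w3 = s[3];
        uint32_t w4 = s[4], w5 = s[5], w6 = s[6], w7 = s[7];
        d[0] = w0; d[1] = w1; d[2] = w2; d[3] = w3;
        d[4] = w4; d[5] = w5; d[6] = w6; d[7] = w7;
    }

    /* Remaining whole words, one at a time. */
    for (; n >= 4; n -= 4)
        *d++ = *s++;

    /* Remaining 0-3 trailing bytes. */
    unsigned char *db = (unsigned char *)d;
    const unsigned char *sb = (const unsigned char *)s;
    while (n--)
        *db++ = *sb++;
}
```

The loads are kept in separate local variables rather than an array so
that the compiler is free to hold all 8 words in registers at once.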

Rich - do you have any comments on whether either the C or assembler
variants of memcpy might be suitable for inclusion in musl?

