Date: Wed, 10 Jul 2013 10:26:46 +1200
From: Andre Renaud <andre@...ewatersys.com>
To: musl@...ts.openwall.com
Subject: Re: Thinking about release

Replying to myself

> Certainly if there was a more straight forward C implementation that
> achieved similar results that would be superior. However the existing
> musl C memcpy code is already optimised to some degree (doing 32-bit
> rather than 8-bit copies), and it is difficult to convince gcc to use
> the load-multiple & store-multiple instructions via C code I've found,
> without resorting to pretty horrible C code. It may still be
> preferable to the assembler though. At this stage I haven't
> benchmarked this - I'll see if I can come up with something.

As a comparison, the existing memcpy.c implementation copies
sizeof(size_t) bytes at a time, which on ARM is 4. Each of these copies
compiles to a standard load/store pair. However, GCC is smart enough to
use ldm/stm instructions when copying structures larger than 4 bytes. So
if we change memcpy.c to use a structure whose size is larger than 4
(e.g. 16) as its basic copy unit, instead of size_t, we do see some
improvement:

#include <stddef.h>
#include <stdint.h>

typedef struct multiple_size_t {
    size_t d[4];
} multiple_size_t;

#define SS (sizeof(multiple_size_t))
#define ALIGN (sizeof(multiple_size_t)-1)

void *my_memcpy(void *restrict dest, const void *restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Pointers that can never be co-aligned: byte copy only */
    if (((uintptr_t)d & ALIGN) != ((uintptr_t)s & ALIGN))
        goto misaligned;

    /* Copy bytes until the destination is block-aligned */
    for (; ((uintptr_t)d & ALIGN) && n; n--) *d++ = *s++;
    if (n) {
        multiple_size_t *wd = (void *)d;
        const multiple_size_t *ws = (const void *)s;

        /* Bulk loop: GCC compiles each struct assignment to ldm/stm */
        for (; n >= SS; n -= SS) *wd++ = *ws++;

        d = (void *)wd;
        s = (const void *)ws;
misaligned:
        /* Remaining tail bytes */
        for (; n; n--) *d++ = *s++;
    }
    return dest;
}

This results in 95MB/s on my platform (up from 65MB/s for the existing
memcpy.c, and down from 105MB/s for the asm-optimised version). It
remains essentially as readable as the existing memcpy.c. I'm not
really familiar with other CPU architectures, so I'm not sure whether
this would improve, or hurt, performance on other platforms.

Any comments on using something like this for memcpy instead?
Obviously this imposes a higher penalty when the size of the region to
be copied is between sizeof(size_t) and sizeof(multiple_size_t), since
those lengths are now copied byte-by-byte rather than word-by-word.

Regards,
Andre
