Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Wed, 24 Jul 2013 16:40:16 +1200
From: Andre Renaud <>
Subject: Re: Thinking about release

Hi Rich,
> It looks buggy as-is; as far as I can tell, it will crash if src/dest
> are aligned with respect to each other but not aligned mod 4, i.e. the
> code starts out copying word-at-a-time rather than byte-at-a-time.

Yes, you are correct, I'd messed that up while looking at the cache
alignment stuff (along with anoter small size related bug). Fixing it
is relatively straight forward though:
#define SS sizeof(size_t)
#define ALIGN (SS - 1)
void * noinline my_asm_memcpy(void * restrict dest, const void *
restrict src, size_t n)
    unsigned char *d = dest;
    const unsigned char *s = src;

    if (((uintptr_t)d & ALIGN) != ((uintptr_t)s & ALIGN))
        goto misaligned;

    /* Get them word aligned */
    for (; ((uintptr_t)d & ALIGN) && n; n--) *d++ = *s++;

    /* ARM has 32-byte cache lines, so align to that for performance */
    for (; ((uintptr_t)d & ((8 * SS) - 1)) && n >= SS; n-=SS) {
            *(size_t *)d = *(size_t *)s;
            d += SS;
            s += SS;
    /* Do full cache line read/writes */
    for (; n>=(8 * SS); n-= (8 * SS))
            __asm__ __volatile__(
                            "ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
                            "ldrhi r12, [%1]\n"
                            "stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
                            : "+r"(d), "+r"(s) :
                            : "a4", "v1", "v2", "v3", "v4", "v5",
"v6", "v7", "r12", "memory");

        for (; n; n--) *d++ = *s++;
    return dest;


> I think the C version would be acceptable if we get the bugs fixed and
> test it well, but I'd also like to still keep the asm under
> consideration. There are lots of cases not covered by the C version,
> like misaligned copies (important for strings, not for much else). Do
> you think these cases are important?

At the moment the mis-aligned copies perform terribly (18MB/s vs glibc
@ 100MB/s). However the existing C implementation in musl is no
different, so we're not degrading the current system.

We're essentially missing the non-congruent copying stuff from the asm
code. I'll have a look at this and see if I can write a similar C


Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.