Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Fri, 12 Jul 2013 10:34:31 +1200
From: Andre Renaud <>
Subject: Re: Thinking about release

Hi Rich,

> You need both instructions in the same asm block, and proper
> constraints. As it is, whether the registers keep their values between
> the two separate asm blocks is up to the compiler's whims.
> With the proper constraints ("+r" type), the s+=SS and d+=SS are
> unnecessary, as a bonus. Also there's no reason to force alignment to
> SS for this loop; that will simply prevent it from being used as much
> for smaller copies. I would use SS==sizeof(size_t) and then write 8*SS
> in the for loop.
> Last night I was in the process of writing something very similar, but
> I put the for loop in asm too and didn't finish it. If it performs
> just as well with the loop in C, I like your version better.

I've rejiggled it a bit, and it appears to be working. I wasn't
entirely sure what you meant about the proper constraints. There is an
additional reason why 8*4 was used for the align - to force the whole
loop to work in cache-line blocks. I've now done this explicitly on
the lead-in by doing the first few copies as 32-bit, then going to the
full cache-line asm. This has the same performance as the fully native
assembler. However to get that I had to use the same trick that the
native assembler uses - doing a load of the next block prior to
storing this one. I'm a bit concerned that this would mean we'd be
doing a read that was out of bounds, and I can't entirely see why this
wouldn't be happening with the existing assembler (but I'm presuming
it doesn't). Any comments on this side of it?

#define SS sizeof(size_t)
#define ALIGN (SS - 1)
void * noinline my_asm_memcpy(void * restrict dest, const void *
restrict src, size_t n)
    unsigned char *d = dest;
    const unsigned char *s = src;

    if (((uintptr_t)d & ALIGN) != ((uintptr_t)s & ALIGN))
        goto misaligned;

    /* ARM has 32-byte cache lines, so get us aligned to that */
    for (; ((uintptr_t)d & ((8 * SS) - 1)) && n; n-=SS) {
            *(size_t *)d = *(size_t *)s;
            d += SS;
            s+= SS;
    /* Do full cache line read/writes */
    if (n) {
        for (; n>=(8 * SS); n-= (8 * SS)) {
                __asm__ (
                        "ldmia %0, {r4-r11}\n"
                        "add %0, %0, %4\n"
                        "bic r12, %0, %5\n"
                        "ldrhi r12, [%0]\n"
                        "stmia %1, {r4-r11}\n"
                        "add %1, %1, %4"
                        : "=r"(s), "=r"(d)
                        : "0"(s), "1"(d), "i"(8 * SS), "i"((8 * SS) - 1)
                        : "r4", "r5", "r6", "r7", "r8",
                          "r9", "r10", "r11", "r12");

        for (; n; n--) *d++ = *s++;
    return dest;



Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.