Date: Mon, 30 Jul 2012 16:41:00 -0400
From: Rich Felker <>
To: Kim Walisch <>
Subject: Re: musl libc, memcpy


I'm replying with the list CC'd so others can comment too. Sorry I
haven't gotten a chance to try this code or review it in detail yet.
What follows is a short initial commentary but I'll give it some more
attention soon.

On Sun, Jul 29, 2012 at 11:41:47AM +0200, Kim Walisch wrote:
> Hi,
> I have been reading through several libc implementations on the
> internet for the past days and for fun I have written a fast yet
> portable memcpy implementation. It uses more code than your
> implementation but I do not think it is bloated. Some quick benchmarks
> that I ran on my Intel Core-i5 670 3.46GHz  (Red Hat 6.2 x86_64)
> indicate that my implementation runs about 50 percent faster than yours
> for aligned data and up to 10 times faster for unaligned data using
> gcc-4.7. The Intel C compiler even vectorizes the main copying loop
> using SSE instructions (if compiled with icc -O2 -xHost) which gives a
> performance better than glibc's memcpy on my system. I would be happy
> to hear your opinion about my memcpy implementation.

I'd like to know what block sizes you were looking at, because for
memcpy that makes all the difference in the world:

For very small blocks (down to 1 byte), performance will be dominated
by conditional branches picking what to do.

For very large blocks (larger than cache), performance will be
memory-bound and even byte-at-a-time copying might be competitive.

Theoretically, there's only a fairly small range of sizes where the
algorithm used matters a lot.
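A rough way to see which regime a benchmark is measuring is to time the same total amount of copying at several block sizes. The sketch below is only an illustration: time_copy is a hypothetical helper, not something from Kim's code or from musl, and it assumes a POSIX system with clock_gettime and CLOCK_MONOTONIC.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper, not from the thread: copies about `total` bytes in
 * chunks of `block` bytes and returns the elapsed time in seconds.
 * Returns -1.0 if block is 0 or an allocation fails. */
double time_copy(size_t block, size_t total)
{
	/* Calling through a volatile function pointer keeps the compiler
	 * from inlining or dropping the copies. */
	void *(*volatile copy)(void *, const void *, size_t) = memcpy;
	struct timespec t0, t1;
	unsigned char *src, *dst;
	size_t i, reps;

	if (block == 0)
		return -1.0;
	src = malloc(block);
	dst = malloc(block);
	if (!src || !dst) {
		free(src);
		free(dst);
		return -1.0;
	}
	memset(src, 0xA5, block);
	reps = total / block;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < reps; i++)
		copy(dst, src, block);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	free(src);
	free(dst);
	return (double)(t1.tv_sec - t0.tv_sec)
	     + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling it with, say, 16-byte, 4 KiB and 64 MiB blocks for the same total should show whether the numbers come from the branch-dominated, cache-resident or memory-bound case. Swapping the function pointer to another memcpy candidate gives a like-for-like comparison.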

> /* CPU architectures that support fast unaligned memory access */
> #if defined(__i386) || defined(__x86_64)
> #endif

I don't think this is necessary or useful. If we want better
performance on these archs, a tiny asm file that does almost nothing
but "rep movsd" is known to be the fastest solution on 32-bit x86, and
is at least the second-fastest on 64-bit, with the faster solutions
not being available on all cpus. On pretty much all other archs,
unaligned access is illegal.
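For illustration only, here is roughly what the "rep movsd" approach looks like. A real version would be a standalone asm file; this sketch instead assumes a compiler with GCC-style inline asm, and the name copy_rep_movs is made up for the example. "movsl" is the AT&T spelling of movsd. On non-x86 targets it falls back to a plain byte loop.

```c
#include <stddef.h>

/* Hypothetical sketch, not musl's code: copy 4 bytes at a time with
 * "rep movsl", then the remaining 0-3 bytes with "rep movsb". */
void *copy_rep_movs(void *dest, const void *src, size_t n)
{
	void *ret = dest;
#if defined(__i386__) || defined(__x86_64__)
	size_t words = n / 4;
	size_t tail = n % 4;

	/* The x86 ABIs guarantee the direction flag is clear on entry,
	 * so both string instructions copy forward. */
	__asm__ volatile("rep movsl"
	                 : "+D"(dest), "+S"(src), "+c"(words)
	                 :
	                 : "memory");
	__asm__ volatile("rep movsb"
	                 : "+D"(dest), "+S"(src), "+c"(tail)
	                 :
	                 : "memory");
#else
	/* Portable fallback for other architectures. */
	unsigned char *d = dest;
	const unsigned char *s = src;

	while (n--)
		*d++ = *s++;
#endif
	return ret;
}
```

The string instructions tolerate unaligned pointers, which is why no alignment prologue is needed here.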

> static void *internal_memcpy_uintptr(void *dest, const void *src, size_t n)
> {
> 	char *d = (char*) dest;
> 	const char *s = (const char*) src;
> 	size_t bytes_iteration = sizeof(uintptr_t) * 8;
> 	while (n >= bytes_iteration)
> 	{
> 		((uintptr_t*)d)[0] = ((const uintptr_t*)s)[0];
> 		((uintptr_t*)d)[1] = ((const uintptr_t*)s)[1];
> 		((uintptr_t*)d)[2] = ((const uintptr_t*)s)[2];
> 		((uintptr_t*)d)[3] = ((const uintptr_t*)s)[3];
> 		((uintptr_t*)d)[4] = ((const uintptr_t*)s)[4];
> 		((uintptr_t*)d)[5] = ((const uintptr_t*)s)[5];
> 		((uintptr_t*)d)[6] = ((const uintptr_t*)s)[6];
> 		((uintptr_t*)d)[7] = ((const uintptr_t*)s)[7];
> 		d += bytes_iteration;
> 		s += bytes_iteration;
> 		n -= bytes_iteration;
> 	}

This is just manual loop unrolling, no? GCC should do the equivalent
if you ask it to aggressively unroll loops, including the
vectorization; if not, that seems like a GCC bug.
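To make the comparison concrete, the rolled-up form of the quoted loop would look something like the sketch below. The name copy_words_simple is hypothetical. Like the quoted code, it assumes both pointers are suitably aligned for uintptr_t and ignores strict-aliasing concerns. Building it with something like gcc -O3 -funroll-loops lets the compiler pick the unroll factor.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rolled-up counterpart of the quoted loop: one word per
 * iteration, with unrolling left to the compiler. Assumes dest and src
 * are aligned for uintptr_t. */
void *copy_words_simple(void *dest, const void *src, size_t n)
{
	uintptr_t *d = dest;
	const uintptr_t *s = src;
	unsigned char *db;
	const unsigned char *sb;

	/* Word-sized copies while at least one full word remains. */
	for (; n >= sizeof *d; n -= sizeof *d)
		*d++ = *s++;

	/* Remaining 0 to sizeof(uintptr_t)-1 bytes. */
	db = (unsigned char *)d;
	sb = (const unsigned char *)s;
	while (n--)
		*db++ = *sb++;

	return dest;
}
```

Comparing the generated code for this against the hand-unrolled version should show whether the manual unrolling buys anything.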

