Date: Tue, 10 Feb 2015 15:43:42 -0500
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Re: [PATCH] x86_64/memset: simple optimizations

On Tue, Feb 10, 2015 at 09:27:17PM +0100, Denys Vlasenko wrote:
> On Sat, Feb 7, 2015 at 2:06 PM, Rich Felker <dalias@...ifal.cx> wrote:
> > On Sat, Feb 07, 2015 at 01:49:43PM +0100, Denys Vlasenko wrote:
> >> On Sat, Feb 7, 2015 at 1:35 AM, Rich Felker <dalias@...ifal.cx> wrote:
> >> What speedups?
> >> In particular:
> >> - perform pre-alignment if dst is unaligned
> >
> > For the rep stosq path? Does it help? I don't recall the details but I
> > seem to remember both docs and measurements showing no reliable
> > benefit from alignment for this instruction, and we had people trying
> > things on several different cpu models. I'm open to hearing evidence
> > to the contrary though.
> 
> size:20k buf:0x7f38656e2100
> stos:25978 ns (times 32), 25.227500 bytes/ns
> stos+1:31395 ns (times 32), 20.874662 bytes/ns
> stos+4:31396 ns (times 32), 20.873997 bytes/ns
> stos+8:24446 ns (times 32), 26.808476 bytes/ns
> 
> size:50k buf:0x7fbca1dc9100
> stos:68149 ns (times 32), 24.041439 bytes/ns
> stos+1:85762 ns (times 32), 19.104032 bytes/ns
> stos+4:85762 ns (times 32), 19.104032 bytes/ns
> stos+8:68204 ns (times 32), 24.022051 bytes/ns
> 
> size:1024k buf:0x7fa3036a5100
> stos:1632285 ns (times 32), 20.556724 bytes/ns
> stos+1:1891092 ns (times 32), 17.743416 bytes/ns
> stos+4:1891089 ns (times 32), 17.743444 bytes/ns
> stos+8:1632181 ns (times 32), 20.558034 bytes/ns
> 
> size:5000k buf:0x7fdf5cd6b100
> stos:15592138 ns (times 32), 10.558298 bytes/ns
> stos+1:15501841 ns (times 32), 10.619799 bytes/ns
> stos+4:15507773 ns (times 32), 10.615737 bytes/ns
> stos+8:15589617 ns (times 32), 10.560005 bytes/ns
> 
> The source is attached.

OK. This looks sufficiently significant (despite unaligned memsets
being rare) that it would be nice to optimize it. Could we just write
one initial, possibly misaligned word, then advance the start address,
rounding it up to the next 8-byte boundary, before using rep stos?
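
A rough C sketch of that idea, for illustration only. It is not musl's
memset and not a proposed patch: fill_prealigned is a hypothetical name,
the 16-byte cutoff is arbitrary, the extra store at the tail is my own
assumption to cover the bytes left over after the aligned words, and the
plain loop stands in for what would be rep stosq in the assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill n bytes at dst with the byte value c. */
static void *fill_prealigned(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    /* Replicate the fill byte into all 8 bytes of a word. */
    uint64_t word = 0x0101010101010101ULL * (unsigned char)c;

    if (n < 16) {                       /* short fill: plain byte loop */
        for (size_t i = 0; i < n; i++)
            p[i] = (unsigned char)c;
        return dst;
    }

    /* One possibly-misaligned 8-byte store at each end. memcpy of a
     * constant 8 bytes compiles to a single unaligned store on x86_64. */
    memcpy(p, &word, 8);
    memcpy(p + n - 8, &word, 8);

    /* Round the start up to the next 8-byte boundary. The 1..8 bytes
     * skipped here were already written by the head store above. */
    size_t skip = 8 - ((uintptr_t)p & 7);
    unsigned char *a = p + skip;
    size_t words = (n - skip) / 8;
    assert(((uintptr_t)a & 7) == 0);

    /* Aligned bulk fill; the assembly version would use rep stosq.
     * Any remainder (< 8 bytes) is covered by the tail store. */
    for (size_t i = 0; i < words; i++)
        memcpy(a + 8 * i, &word, 8);

    return dst;
}
```

The head and tail stores may overlap the aligned region, which is
harmless because every store writes the same byte value.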

> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sys/time.h>
> #include <sys/syscall.h>
> #include <time.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <string.h>
> /* Old glibc (< 2.3.4) does not provide this constant. We use syscall
>  * directly so this definition is safe. */
> #ifndef CLOCK_MONOTONIC
> #define CLOCK_MONOTONIC 1
> #endif
> 
> /* libc has incredibly messy way of doing this,
>  * typically requiring -lrt. We just skip all this mess */
> static void get_mono(struct timespec *ts)
> {
>         syscall(__NR_clock_gettime, CLOCK_MONOTONIC, ts);
> }

FWIW, this is a bad idea; you get syscall overhead in your
measurements. If you just use clock_gettime (the function) you'll get
the vdso implementation, which reads the clock without a syscall.

Using the syscall directly is also sketchy in that x32 has an
incorrect kernel-side definition for struct timespec, but I think it
will only matter if aarch64-ILP32 copies this problem from x32 and
you're using a big-endian system.
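
For reference, a minimal sketch of the timing helper going through the
libc function instead of the raw syscall. get_mono mirrors the name in
the quoted source; elapsed_ns is a hypothetical helper added here for
illustration. This assumes a libc that provides clock_gettime without
extra link flags (older glibc needs -lrt).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read the monotonic clock via the libc function, which can use the
 * vdso where available and takes the libc's own struct timespec. */
static void get_mono(struct timespec *ts)
{
    int rc = clock_gettime(CLOCK_MONOTONIC, ts);
    assert(rc == 0);
    (void)rc;   /* silence unused warning when built with NDEBUG */
}

/* Hypothetical helper: nanoseconds elapsed from *t0 to *t1. */
static int64_t elapsed_ns(const struct timespec *t0,
                          const struct timespec *t1)
{
    return (int64_t)(t1->tv_sec - t0->tv_sec) * 1000000000LL
         + (int64_t)(t1->tv_nsec - t0->tv_nsec);
}
```

Usage is the same as before: call get_mono before and after the timed
loop and pass both timestamps to elapsed_ns.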

Rich
