libc-coord - Re: Add new ABIs '__strcmpeq', '__strncmpeq', '__wcscmpeq' and '_

Follow @Openwall on Twitter for new release announcements and other news

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <CAFUsyfKRicqg3zGdReTRgT-4LJv5Asbfn1+YE7tKvXLb4fuJLw@mail.gmail.com>
Date: Fri, 21 Jan 2022 15:50:26 -0600
From: Noah Goldstein <goldstein.w.n@...il.com>
To: Joerg Sonnenberger <joerg@....de>
Cc: libc-coord@...ts.openwall.com, Richard Biener via Gcc <gcc@....gnu.org>, 
	GNU C Library <libc-alpha@...rceware.org>
Subject: Re: Add new ABIs '__strcmpeq', '__strncmpeq',
 '__wcscmpeq' and '__wcsncmpeq' to libc

On Fri, Jan 21, 2022 at 12:51 PM Joerg Sonnenberger <joerg@....de> wrote:
>
> On Thu, Jan 20, 2022 at 04:56:59PM -0600, Noah Goldstein wrote:
> > The goal is that the new interfaces will be usable as an optimization
> > by compilers if a program uses the return value of the non "eq"
> > variant as a boolean.
>
> So I'm curious, but can you demonstrate that it can be implemented
> notacibly faster than regular strcmp? Unlike for memcmp, I don't see an
> obvious way to save any operations.

Strong point! I had been somewhat assuming we could make the same
optimizations with `__memcmpeq` but there still needs to be some
logic that tracks which comes first the mismatch or the null terminator.

It's not quite as much as `memcmp` vs `__memcmpeq` but we can
still save.

Using the x86_64 AVX2 optimized implementation as reference:
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/strcmp-avx2.S;h=9c73b5899d55a72b292f21b52593284cd513d2a3;hb=HEAD

We can convert the general return method of checking equals + strlen from:

```
VMOVU (%rdi), %ymm0
VPCMPEQ (%rsi), %ymm0, %ymm1
VPCMPEQ %ymm0, %ymmZERO, %ymm2
vpandn %ymm1, %ymm2, %ymm1
vpmovmskb %ymm1, %ecx
incl %ecx
jz L(keep_going)
tzcntl %ecx, %ecx
movzbl (%rdi, %rcx), %eax
movzbl (%rsi, %rcx), %ecx
subl %ecx, %eax
vzeroupper
ret
```

To

```
VMOVU (%rdi), %ymm0
VPCMPEQ (%rsi), %ymm0, %ymm1
VPCMPEQ %ymm0, %ymmZERO, %ymm2
vpandn %ymm1, %ymm2, %ymm2
vpmovmskb %ymm2, %ecx
incl %ecx
jz L(keep_going)
vpmovmskb %ymm1, %eax
blsi %ecx, %ecx
andn %eax, %ecx, %eax
vzeroupper
ret
```

Testing this with comparisons where mismatch or strlen in the first 32 bytes
(common case) it's about the same throughput but ~20% reduction in latency.

Another benefit is we can reuse this exact return logic throughout as memory
offset is no longer required. This simplifies the page cross logic a
great deal and
will net us some serious code size reduction for the common usage of strcmp.

I think though I was a bit over optimistic about the performance benefits as I
was using `memcmp` vs `__memcmpeq` as a reference. I'll put together
a patch for just `__strcmpeq` and post the results here. I think the
wide-character
versions have more expensive return value checks so if the character versions
show a benefit we can expect it to translate.

>
> Joerg

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.