Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Tue, 13 Jul 2021 11:45:47 -0400
From: Rich Felker <dalias@...c.org>
To: Stefan Kanthak <stefan.kanthak@...go.de>
Cc: musl@...ts.openwall.com
Subject: Re: Changes for strcspn(), strspn(), strtok() and strtok_r()

On Tue, Jul 13, 2021 at 02:02:06PM +0200, Stefan Kanthak wrote:
> <https://git.musl-libc.org/cgit/musl/plain/src/string/strcspn.c>
> 
>  #include <string.h>
>  
>  #define BITOP(a,b,op) \
>   ((a)[(size_t)(b)/(8*sizeof *(a))] op (size_t)1<<((size_t)(b)%(8*sizeof *(a))))
>  
> -size_t strcspn(const char *s, const char *c)
> +size_t strcspn(const char *restrict s, const char *c)
>  {
> -        const char *a = s;
> +        const char *a;
> -        size_t byteset[32/sizeof(size_t)];
> +        size_t byteset[32/sizeof(size_t)] = { 0 };
>  
> -        if (!c[0] || !c[1]) return __strchrnul(s, *c)-a;
> +        if (!c[0] || !c[1]) return __strchrnul(a=s, *c)-a;
>  
> -        memset(byteset, 0, sizeof byteset);
>          for (; *c && BITOP(byteset, *(unsigned char *)c, |=); c++);
> -        for (; *s && !BITOP(byteset, *(unsigned char *)s, &); s++);
> -        return s-a;
> +        for (a=s; *a && !BITOP(byteset, *(unsigned char *)a, &); a++);
> +        return a-s;
>  }
> 
> After this change, musl's favorite compiler (GCC) will generate code
> like the following (here for x86-64), just like the code it already
> generates for strspn(), where the initialization of byteset[] is NOT
> done via memset():

That's probably a good change, but it's a 2-line change. The above
diff has at least 4 other gratuitous changes that have nothing to do
with getting rid of the memset call. Even if they were desirable, they
can't be done as part of the same change; that destroys the ability to
review/audit the history.

Note that I'd like to restore the ability of the compiler to transform
calls to memset into a builtin, but it's nontrivial because we have to
have -fno-builtin (via -ffreestanding) as the baseline to prevent
circular definitions.

>  #include <string.h>
>  
>  char *strtok(char *restrict s, const char *restrict sep)
>  {
>          static char *p;
> +        return strtok_r(s, sep, &p);
> -        if (!s && !(s = p)) return NULL;
> -        s += strspn(s, sep);
> -        if (!*s) return p = 0;
> -        p = s + strcspn(s, sep);
> -        if (*p) *p++ = 0;
> -        else p = 0;
> -        return s;
>  }
> 
> <https://git.musl-libc.org/cgit/musl/plain/src/string/strtok_r.c>

This cannot be done due to namespace. It may be desirable to make a
namespace-safe alias __strtok_r that could be called to implement
strtok, though.

>  #include <string.h>
>  
>  char *strtok_r(char *restrict s, const char *restrict sep, char **restrict p)
>  {
>          if (!s && !(s = *p)) return NULL;
>          s += strspn(s, sep);
> -        if (!*s) return *p = 0;
> +        if (!*s) return *p = NULL;
>          *p = s + strcspn(s, sep);
>          if (**p) *(*p)++ = 0;
> -        else *p = 0;
> +        else *p = NULL;
>          return s;
>  }

This is a gratuitous style change in the opposite direction of what's
preferred in musl.

> If you want to go a step further, avoid to build the same byteset twice:
> 
> <https://git.musl-libc.org/cgit/musl/plain/src/string/strtok_r.c>
> 
>  #include <string.h>
>  
>  char *strtok_r(char *restrict s, const char *restrict sep, char **restrict p)
>  {
> +        size_t byteset[32/sizeof(size_t)] = { 0 };
> +
>          if (!s && !(s = *p)) return NULL;
> +        if (!*s) return *p = NULL;
> +        if (!*sep) return *p = NULL, *s;
> -        s += strspn(s, sep);
> +        for (; *c && BITOP(byteset, *(unsigned char *)c, |=); c++);
> +        for (; *s && BITOP(byteset, *(unsigned char *)s, &); s++);
> -        if (!*s) return *p = 0;
> +        if (!*s) return *p = NULL;
> -        *p = s + strcspn(s, sep);
> -        if (**p) *(*p)++ = 0;
> -        else *p = 0;
> -        return s;
> +        sep = s;
> +        for (; *s && !BITOP(byteset, *(unsigned char *)s, &); s++);
> +        if (*s) *s++ = 0;
> +        else *s = NULL;
> +        *p = s;
> +        return sep;
>  }

Rewriting code that works and does not have any problems (note: UB or
platform assumptions that could bite someone later would count as a
problem) is not improvement. It consumes maintainer time *and* time of
anyone consuming musl who wants to review all changes for possible
security bugs/backdoors. If there are any worthwhile changes to this
type of legacy function, they should be made as isolated and simple as
possible to ensure that they can be reviewed easily by anyone who
wants to.

Rich

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.