musl - Re: Re: First feedback on new C locale problems

Message-ID: <20150927134712.GQ17773@brightrain.aerifal.cx>
Date: Sun, 27 Sep 2015 09:47:12 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Re: First feedback on new C locale problems

On Sun, Sep 27, 2015 at 08:17:38AM +0200, Felix Janda wrote:
> Rich Felker wrote:
> > On Sat, Sep 26, 2015 at 06:58:36AM +0200, Felix Janda wrote:
> > > On 2015-09-09 05:56:48 GMT, Rich Felker wrote:
> > > > On Tue, Sep 01, 2015 at 02:32:35AM -0400, Rich Felker wrote:
> > > > > What I'd like to do to fix it is just always return "UTF-8" for
> > > > > nl_langinfo(CODESET) regardless of locale (rather than returning
> > > > > "UTF-8-CODE-UNITS" when in C locale). POSIX places no requirements on
> > > > > nl_langinfo that would preclude this, and it seems like it would
> > > > > restore the desired properties and fix all the regressions.
> > > >
> > > > Committed.
> > > >
> > > > Rich
> > > 
> > > GNU sed seems to care about the output from nl_langinfo:
> > > 
> > > https://bugs.gentoo.org/show_bug.cgi?id=560728
> > > 
> > > More specifically, so does lib/localecharset.c, which is used in
> > > the replacement of re_compile_pattern.
> > 
> > I was able to reproduce this (with slightly different output, "a© a'")
> > on Alpine. Clearly this is some sort of bug in the gnulib code or sed
> > itself, since it's producing corrupt output. I think we should explore
> > why that's happening and whether it's possible to fix there. But if
> > there remain other reasons that returning "UTF-8" in the C locale is
> > not practical then perhaps we could resort to returning "ASCII".
> 
> A possible fix is
> 
> --- ./a/sed-4.2.1/lib/regcomp.c
> +++ ./a/sed-4.2.1/lib/regcomp.c
> @@ -824,7 +824,7 @@ re_compile_internal (regex_t *preg, cons
>  
>  #ifdef RE_ENABLE_I18N
>    /* If possible, do searching in single byte encoding to speed things up.  */
> -  if (dfa->is_utf8 && dfa->mb_cur_max != 1 && !(syntax & RE_ICASE) && preg->translate == NULL)
> +  if (dfa->is_utf8 && !(syntax & RE_ICASE) && preg->translate == NULL)
>      optimize_utf8 (dfa);
>  #endif
>  
> 
> In our case is_utf8 is 1 and mb_cur_max is also 1. The function
> optimize_utf8() would change "." to match UTF-8 characters instead of
> bytes. For some reason I have not investigated further, "©" (or any
> other non-ASCII character) is then not matched, but in the C locale we
> want "." to match invalid UTF-8 sequences as well anyway.

I think this fix is misplaced; it looks like it would make GNU regex
do UTF-8 character matching rather than byte matching in the C locale.
Rather one of the other places that has an is_utf8 check also needs to
have the mb_cur_max!=1 check added, I think.
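
To illustrate the kind of check meant here, a minimal sketch follows. It is not the gnulib code: the helper names wants_utf8_chars() and current_locale_wants_utf8_chars() are hypothetical, and the sketch only assumes that the decision depends on the two values discussed above (whether the codeset is reported as "UTF-8", and MB_CUR_MAX).

```c
#include <langinfo.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not gnulib code: do UTF-8 character matching only
   when the codeset is UTF-8 AND the locale is actually multibyte.
   With is_utf8 == 1 and mb_cur_max == 1 (the case described above for
   the C locale), this returns 0, i.e. plain byte matching. */
static int wants_utf8_chars(int is_utf8, int mb_cur_max)
{
    return is_utf8 && mb_cur_max != 1;
}

/* The same decision, taken from the currently active locale. */
static int current_locale_wants_utf8_chars(void)
{
    const char *codeset = nl_langinfo(CODESET);
    int is_utf8 = strcmp(codeset, "UTF-8") == 0;
    return wants_utf8_chars(is_utf8, (int)MB_CUR_MAX);
}
```

Under this sketch, a caller that tested only is_utf8 would pick character matching in the C locale whenever nl_langinfo(CODESET) returns "UTF-8", while a caller that also tests mb_cur_max != 1 would stay with byte matching.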

Rich
