musl - Re: Re: First feedback on new C locale problems

Message-ID: <20150927061738.GA311@nyan>
Date: Sun, 27 Sep 2015 08:17:38 +0200
From: Felix Janda <felix.janda@...teo.de>
To: musl@...ts.openwall.com
Subject: Re: Re: First feedback on new C locale problems

Rich Felker wrote:
> On Sat, Sep 26, 2015 at 06:58:36AM +0200, Felix Janda wrote:
> > On 2015-09-09 05:56:48 GMT, Rich Felker wrote:
> > > On Tue, Sep 01, 2015 at 02:32:35AM -0400, Rich Felker wrote:
> > > > What I'd like to do to fix it is just always return "UTF-8" for
> > > > nl_langinfo(CODESET) regardless of locale (rather than returning
> > > > "UTF-8-CODE-UNITS" when in C locale). POSIX places no requirements on
> > > > nl_langinfo that would preclude this, and it seems like it would
> > > > restore the desired properties and fix all the regressions.
> > >
> > > Committed.
> > >
> > > Rich
> > 
> > GNU sed seems to care about the output from nl_langinfo:
> > 
> > https://bugs.gentoo.org/show_bug.cgi?id=560728
> > 
> > More specifically, so does lib/localecharset.c, which is used in
> > the replacement of re_compile_pattern.
> 
> I was able to reproduce this (with slightly different output, "a© a'")
> on Alpine. Clearly this is some sort of bug in the gnulib code or sed
> itself, since it's producing corrupt output. I think we should explore
> why that's happening and whether it's possible to fix there. But if
> there remain other reasons that returning "UTF-8" in the C locale is
> not practical then perhaps we could resort to returning "ASCII".

A possible fix is

--- ./a/sed-4.2.1/lib/regcomp.c
+++ ./a/sed-4.2.1/lib/regcomp.c
@@ -824,7 +824,7 @@ re_compile_internal (regex_t *preg, cons
 
 #ifdef RE_ENABLE_I18N
   /* If possible, do searching in single byte encoding to speed things up.  */
-  if (dfa->is_utf8 && !(syntax & RE_ICASE) && preg->translate == NULL)
+  if (dfa->is_utf8 && dfa->mb_cur_max != 1 && !(syntax & RE_ICASE) && preg->translate == NULL)
     optimize_utf8 (dfa);
 #endif
 

In our case is_utf8 is 1 and mb_cur_max is also 1. The function
optimize_utf8() would change "." to match UTF-8 characters instead of
bytes. For some reason that I have not investigated further, "©" (or any
other non-ASCII character) is then not matched. In the C locale, however,
we want "." to match bytes that are not valid UTF-8 as well, so the
optimization should be skipped when mb_cur_max is 1.
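To make the condition easier to check outside of sed, here is a minimal
sketch in C. It is not part of the patch. The helper names
should_optimize_utf8() and current_locale_would_optimize() are
hypothetical, and the sketch assumes that is_utf8 is derived by comparing
the nl_langinfo(CODESET) string against "UTF-8", which may differ in
detail from what regcomp.c actually does.

```c
#include <langinfo.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper mirroring the guarded condition: the UTF-8
   optimization is only wanted when the locale is UTF-8, really is
   multibyte (mb_cur_max != 1), the match is case sensitive and no
   translate table is in use. */
static int should_optimize_utf8(int is_utf8, int mb_cur_max,
                                int icase, int has_translate)
{
    return is_utf8 && mb_cur_max != 1 && !icase && !has_translate;
}

/* Hypothetical helper: evaluate the same condition for the current
   locale.  Assumption: is_utf8 is set when the codeset name is exactly
   "UTF-8"; the real code may accept other spellings. */
static int current_locale_would_optimize(void)
{
    int is_utf8 = strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
    int mb_cur_max = (int) MB_CUR_MAX;

    return should_optimize_utf8(is_utf8, mb_cur_max, 0, 0);
}
```

With is_utf8 == 1 and mb_cur_max == 1, as in the C locale on musl,
should_optimize_utf8() returns 0, so "." keeps matching single bytes.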

glibc seems to be the upstream for the code.

Felix
