musl - Re: [BUG] swprintf() doesn't handle Unicode characters correctly

Message-ID: <CAMOBWkMPYE2aX_Uy2a8nxM6QhvUOHM=xiTdCNpQy9a2VvcEESQ@mail.gmail.com>
Date: Mon, 24 May 2021 21:58:34 -0400
From: Konstantin Isakov <dragonroot@...il.com>
To: Rich Felker <dalias@...c.org>
Cc: musl@...ts.openwall.com
Subject: Re: [BUG] swprintf() doesn't handle Unicode characters correctly

Thanks, Rich, that was very informative!

On Mon, May 24, 2021 at 9:09 PM Rich Felker <dalias@...c.org> wrote:

> On Mon, May 24, 2021 at 08:46:01PM -0400, Konstantin Isakov wrote:
> > Is swprintf() a form of fwprintf() though?
>
> As specified, it is. They're all covered together under
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/swprintf.html
>
> and "all forms" is in contrast to just "fwprintf() and wprintf()" (the
> other 2/3) mentioned above which can fail for any of the fputwc
> reasons (which would already cover EILSEQ anyway).
>
> > fwprintf() and wprintf() output
> > to single-byte streams, so the conversion is necessary there, while
> > swprintf() outputs to a wide buffer. Performing double conversion (to
> > single chars and back) seems like unnecessary work in that case (though,
> > of course, it's less work to implement swprintf() like that).
>
> It's what gives consistent behavior, and it's what you get
> automatically if you don't want either a completely independent
> implementation of swprintf (that behaves surprisingly unlike fwprintf)
> or the wide-mode buffering glibc does.
>
> (Note: the original reason they did separate wide-mode buffering was
> that gconv is very slow for individual character conversions and was
> designed only for bulk conversion calls, which would happen at flush
> time. Making individual conversions fast was one of the original
> design goals of musl before there even was a whole libc around it.)
>
> Rich
>
>
> > On Mon, May 24, 2021 at 8:30 PM Rich Felker <dalias@...c.org> wrote:
> >
> > > On Mon, May 24, 2021 at 08:04:04PM -0400, Konstantin Isakov wrote:
> > > > Thanks for replying!
> > > >
> > > > That fixed it.
> > > >
> > > > I'm surprised, however, that this is required given that in this case
> > > > swprintf() operates on wchars exclusively -- taking wchar arguments
> > > > and producing wchar output. I'd expect that in the worst case scenario
> > > > it would have to convert from single chars to wide chars, but never the
> > > > other way around, so the representation requirement seems strange. That
> > > > setlocale() step also doesn't seem to be needed with glibc.
> > >
> > > Yes, it's not clear to me whether the glibc behavior is conforming or
> > > not. As specified,
> > >
> > >   In addition, all forms of fwprintf() shall fail if:
> > >
> > >   [EILSEQ]
> > >     A wide-character code that does not correspond
> > >     to a valid character has been detected.
> > >
> > >   ...
> > >
> > > The "has been detected" wording may allow for the possibility of
> > > ignoring the error, as glibc does, if the function is implemented such
> > > that no conversion takes place (or, for fwprintf, such that conversion
> > > is deferred until flush time) and thus no "detection" takes place. But
> > > it's wrong to assume the operation will succeed.
> > >
> > > In musl, there is no separate wide stdio buffering mode; conversion to
> > > a multibyte sequence happens at (logical) fputwc time, and in the case
> > > of swprintf, conversion (in this case, conversion back) to a wchar_t[]
> > > string occurs at flush time.
> > >
> > > Rich
> > >
> > >
> > >
> > >
> > > > On Mon, May 24, 2021 at 5:50 PM Rich Felker <dalias@...c.org> wrote:
> > > >
> > > > > On Mon, May 24, 2021 at 12:39:35AM -0400, Konstantin Isakov wrote:
> > > > > > Hi,
> > > > > >
> > > > > > The following program:
> > > > > >
> > > > > > ===================================
> > > > > > #include <stdio.h>
> > > > > > #include <wchar.h>
> > > > > >
> > > > > > int main()
> > > > > > {
> > > > > >   wchar_t buf[ 32 ];
> > > > > >
> > > > > >   swprintf( buf, sizeof( buf ) / sizeof( *buf ), L"ab\u00E1c" );
> > > > > >
> > > > > >   for ( wchar_t * p = buf; *p; ++p )
> > > > > >     printf( "%u\n", ( unsigned ) *p );
> > > > > >
> > > > > >   return 0;
> > > > > > }
> > > > > > ===================================
> > > > > >
> > > > > > With musl 1.2.2 produces the following output:
> > > > > > 97
> > > > > > 98
> > > > > >
> > > > > > The expected output is:
> > > > > > 97
> > > > > > 98
> > > > > > 225
> > > > > > 99
> > > > > >
> > > > > > With musl, only the first two characters ('a' and 'b') are
> > > > > > processed, and the string ends on a Unicode character (U+00E1,
> > > > > > which is an 'a' with acute accent), instead of outputting it and
> > > > > > the last character, 'c'.
> > > > > >
> > > > > > Please CC me when replying. Thanks!
> > > > >
> > > > > You need to call setlocale(LC_CTYPE, ""). Otherwise the character
> > > > > \u00e1 is unrepresentable, because POSIX requires the C locale be
> > > > > single-byte and you're in the C locale until you call setlocale,
> > > > > and thus produces an encoding error (EILSEQ).
> > > > >
> > > > > Rich
> > > > >
> > >
>
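
For reference, here is a minimal sketch of the fix discussed above: select
the environment's LC_CTYPE before formatting, and check the return value of
swprintf() instead of assuming it succeeded. The helper names
use_environment_ctype() and format_checked() are hypothetical and only for
illustration. Whether the non-ASCII case succeeds still depends on the
locale the environment provides and on the libc in use.

```c
#include <errno.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: switch LC_CTYPE from the default C locale to
 * whatever the environment specifies (LANG, LC_ALL, ...).
 * Returns 1 on success, 0 if setlocale() rejected the request. */
static int use_environment_ctype(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}

/* Hypothetical helper: copy src into buf via swprintf() and report
 * failure explicitly. Returns the number of wide characters written
 * (not counting the terminator), or -1 on error. On error, buf is set
 * to the empty string so callers never see a partial result, and errno
 * is left as swprintf() set it (EILSEQ for an encoding error, per the
 * thread above). */
static int format_checked(wchar_t *buf, size_t n, const wchar_t *src)
{
    int ret;

    if (n == 0)
        return -1;

    errno = 0;
    ret = swprintf(buf, n, L"%ls", src);
    if (ret < 0) {
        buf[0] = L'\0';
        return -1;
    }
    return ret;
}
```

Calling use_environment_ctype() once at startup and then
format_checked(buf, 32, L"ab\u00E1c") should write all four characters when
the environment selects a locale that can represent U+00E1 (for example a
UTF-8 one). If the program stays in the C locale on musl, the call returns
-1 rather than leaving a silently truncated "ab" in the buffer.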
