musl - Re: [PATCH] implement a private state for the uchar.h functions

Message-ID: <20141111143900.GJ22465@brightrain.aerifal.cx>
Date: Tue, 11 Nov 2014 09:39:00 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH] implement a private state for the uchar.h
 functions

On Tue, Nov 11, 2014 at 02:53:02PM +0100, Jens Gustedt wrote:
> On Monday, 10.11.2014, 22:21 -0500, Rich Felker wrote:
> > On Sun, Nov 09, 2014 at 11:18:08AM +0100, Jens Gustedt wrote:
> > > The C standard is imperative on that:
> > > 
> > >   7.28.1 ... If ps is a null pointer, each function uses its own internal
> > >   mbstate_t object instead, which is initialized at program startup to
> > >   the initial conversion state;
> > 
> > Thanks. Actually I originally had this functionality and removed it
> > because it seemed to be unnecessary, due to the requirement being
> > buried in that introductory text rather than the descriptions of the
> > individual functions. I figured the committee had just intentionally
> > decided not to copy this backwards functionality from the old
> > multibyte functions into the new uchar ones, but sadly that's not the
> > case...
> 
> Yes these are bizarre additions. That has almost a dozen different
> static states for all of the different restartable functions.
> 
> Perhaps I misunderstood something, but isn't it the case that in the
> direction mbs -> charXX_t these functions can handle surrogates, but
> the other way around is not possible?

Both directions are possible. c16rtomb returns 0 and saves the first
surrogate as state for the next call. mbrtoc16 writes out the first
surrogate, saves the second in the state, and returns 4 on the first
call, then returns (size_t)-3 and writes out the second surrogate on
the next call. Yes it's hideously ugly but it was trivial to
implement.
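
To illustrate the c16rtomb direction described above, here is a rough
sketch of the state handling, hard-wired to UTF-8 output. It is not
musl's code; sketch_state and sketch_c16_to_utf8 are hypothetical names
made up for the example, and the real function works through mbstate_t
and the current locale instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state: a pending high surrogate, or 0 if there is none. */
struct sketch_state { uint16_t pending_high; };

/* Hypothetical helper loosely modeled on c16rtomb, UTF-8 output only.
 * Returns the number of bytes written to out, 0 when a high surrogate
 * was only stored in the state, or (size_t)-1 on an invalid sequence. */
size_t sketch_c16_to_utf8(unsigned char *out, uint16_t c16,
                          struct sketch_state *st)
{
    uint32_t cp;

    assert(out && st);
    if (st->pending_high) {
        /* A stored high surrogate must be followed by a low one. */
        if (c16 < 0xDC00 || c16 > 0xDFFF) {
            st->pending_high = 0;
            return (size_t)-1;
        }
        cp = 0x10000 + (((uint32_t)st->pending_high - 0xD800) << 10)
                     + (c16 - 0xDC00);
        st->pending_high = 0;
    } else if (c16 >= 0xD800 && c16 <= 0xDBFF) {
        st->pending_high = c16;   /* save it and produce no output yet */
        return 0;
    } else if (c16 >= 0xDC00 && c16 <= 0xDFFF) {
        return (size_t)-1;        /* low surrogate with nothing pending */
    } else {
        cp = c16;
    }

    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Feeding it 0xD83D and then 0xDE00 (the pair for U+1F600) gives 0 on
the first call and 4 on the second, with all four bytes written by the
second call.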

> From that new unicode support in C11 I get some of the ideas, but some
> things remain quite mysterious
> 
>  - having a standard way to specify unicode characters inside a string
>    of any kind through \u and \U is really a great achievement

Yes and no. I don't think anyone really wants to use these. They're
unreadable except when used extremely sparingly, and embedding natural
language text in source is widely frowned upon anyway which limits the
usefulness. But it is nice to at least have a way if/when you need it.

>  - introducing types charXX_t and literal constants with the u and U
>    prefixes is already less clear. The only thing that can be done
>    with them is conversion; there are no auxiliary functions. In
>    particular the character counting and classification problems for
>    surrogates are still not solved.

The provided conversions to/from multibyte are useless because the
current multibyte character set cannot necessarily even represent
them. Initially I thought they should have provided conversions
to/from wchar_t, but that would also be useless since wchar_t is only
officially meaningful for characters in the current (multibyte)
character set. The only conversions that would actually be useful are
between UTF-8, UTF-16, and UTF-32, but those are all well-defined in
an implementation-independent manner and thus something you can
provide yourself (even though at least 70% of people doing so do it
wrong...) which I can only assume is the reason the language standard
doesn't provide them.
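
As an example of what "provide yourself" might look like, here is a
minimal sketch of a UTF-8 to UTF-32 decoder for a single code point.
The name sketch_utf8_decode and its interface are made up for
illustration. The checks near the end (overlong forms, encoded
surrogates, values above U+10FFFF) are, I would assume, the parts most
often left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from the n bytes at s.
 * Returns the number of bytes consumed (1 to 4), or 0 if the input is
 * invalid or truncated. *cp is written only on success. */
size_t sketch_utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs each length. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t c;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; c = s[0] & 0x07;
    } else {
        return 0;                 /* stray continuation or invalid lead */
    }
    if (n < len)
        return 0;                 /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;             /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }

    if (c < min_cp[len])
        return 0;                 /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF)
        return 0;                 /* surrogates are not valid in UTF-8 */
    if (c > 0x10FFFF)
        return 0;                 /* beyond the Unicode range */

    *cp = c;
    return len;
}
```

A caller would loop over a buffer, advancing by the return value and
treating 0 as an error.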

>  - introducing a u8 prefix for strings that guarantees utf8 encoding
>    for mbs sounds nice. But then there is nothing that relates these
>    to "normal" string literals. What are we supposed to do with these?

Process them with your own code, or just pass them to external
interfaces that expect UTF-8 (e.g. filesystem structures, network
protocols, etc.).
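
A small sketch of the "process them with your own code" case, assuming
a C11 compiler. The array sample and the function
sketch_count_code_points are hypothetical names for illustration only.
The array is declared as unsigned char so the initialization works
whether the compiler gives u8 literals type char (C11) or char8_t
(C23).

```c
#include <stddef.h>

/* U+00E9 then U+20AC: 2 + 3 bytes of UTF-8, plus the terminating 0,
 * whatever the execution character set is. */
static const unsigned char sample[] = u8"\u00e9\u20ac";

/* Hypothetical consumer: count code points in n bytes of UTF-8 that
 * are assumed to be well formed, by skipping continuation bytes. */
size_t sketch_count_code_points(const unsigned char *s, size_t n)
{
    size_t i, count = 0;

    for (i = 0; i < n; i++)
        if ((s[i] & 0xC0) != 0x80)
            count++;
    return count;
}
```

Calling sketch_count_code_points(sample, sizeof sample - 1) counts two
code points in five bytes.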

Rich