musl - Re: [ Guidance ] Potential New Routines; Requesting Help

Message-ID: <CANHA4OhdNZ7wEn7Ntnbd8VY=b0mM-NzYsrwZpcQML8239BJYmA@mail.gmail.com>
Date: Mon, 30 Dec 2019 13:53:45 -0500
From: JeanHeyd Meneide <phdofthehouse@...il.com>
To: Rich Felker <dalias@...c.org>
Cc: Florian Weimer <fw@...eb.enyo.de>, musl@...ts.openwall.com
Subject: Re: [ Guidance ] Potential New Routines; Requesting Help

On Mon, Dec 30, 2019 at 12:28 PM Rich Felker <dalias@...c.org> wrote:
> I think you misunderstood my remarks here. I was not talking about
> invention of new charsets (which we seem to agree should not happen),
> but making it possible to use existing legacy charsets which were
> previously not usable as a locale's encoding due to limitations of the
> C APIs. I see making that possible as counter-productive. It does not
> serve to let users keep doing something they were already doing
> (compatibility), only to do something newly backwards.

     My goal is to allow developers to go from an encoding they do not
fully control (the multibyte encoding) to an encoding they know and
can reason about in their program (c8, for example). This is why I am
providing the mb -> cNN and wc -> cNN functions in both
single-character and string forms. The hope is to make it easy to go
from a statically known encoding (modulo difficulties from
__STDC_UTF_16__ / __STDC_UTF_32__ not being defined) to the platform
encoding, and vice-versa, in the same style as the existing
mb(s)(r)towc(s) and wc(s)(r)tomb(s) functions.
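
     To illustrate that style: C11 already provides one function of
this shape, mbrtoc32, and a string form can be sketched on top of it.
This is only a sketch under assumptions. The name sketch_mbstoc32s and
its signature are hypothetical (not taken from the paper or any libc),
the source encoding is whatever the current locale uses, and the output
is UTF-32 only where __STDC_UTF_32__ is defined.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper (not a standard function): convert a NUL-terminated
   multibyte string in the current locale's encoding to char32_t.
   Returns the number of code points stored, not counting the terminator,
   or (size_t)-1 on an encoding error or if dst is too small. */
size_t sketch_mbstoc32s(char32_t *dst, size_t dstlen, const char *src)
{
    assert(dst != NULL && src != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);      /* start in the initial shift state */

    size_t remaining = strlen(src) + 1;   /* include the terminating NUL */
    size_t written = 0;

    for (;;) {
        if (written >= dstlen)
            return (size_t)-1;            /* no room left in dst */

        char32_t c;
        size_t n = mbrtoc32(&c, src, remaining, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;            /* invalid or truncated sequence */

        dst[written] = c;
        if (n == 0)
            return written;               /* the terminator was just stored */

        written++;
        if (n != (size_t)-3) {            /* -3 means no input was consumed */
            src += n;
            remaining -= n;
        }
    }
}
```

With the default "C" locale and plain ASCII input, a call such as
sketch_mbstoc32s(buf, 8, "abc") should store three code points plus a
terminator. The restartable state is kept local here for brevity; a
real string-form function would presumably take an mbstate_t from the
caller, as mbsrtowcs does.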

> >  ... I will, however, note that the paper
> > specifically wants to add the Restartable versions of "single unit" wc
> > and mb to/from functions.
>
> I don't follow. mbrtowc and wcrtomb already exist and have since at
> least C99.

     Apologies, I meant the restartable wc <-> cNN and mb <-> cNN
conversions!

> > ...
> >
> >     This means that while wcto* and *towc functions are broken, the
>
> I don't see them as broken. They support every encoding that has ever
> worked in the past as the encoding for a locale (tautologically). The
> only way they're "broken" is if you want to add new locale encodings
> that weren't previously supportable.

     Apologies; this was in reference to wide characters being given a
non-UTF-32 interpretation on certain platforms, like Windows and
certain IBM platforms. They chose a 16-bit wchar_t, which cannot
represent all of Unicode without using multiple wchar_t per character.
Unfortunately, this means that they were really out of luck before
DR488 was accepted: they had no means to return multiple wchar_t for
characters outside the 16-bit range. With DR488, restartable functions
have the potential to convert such characters properly (albeit the DR
was only applied to the char16_t functions, so while I hope we can fix
it for their platforms, it might not work out for the wcto* and *towc
functions anyway).

     char16_t functions, though, should offer those platforms a better
way out (though not a perfect one: they'll need to rely on platform
knowledge and perform some casts).
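
     For concreteness, here is how the existing C11 mbrtoc16 hands
back two char16_t units for a single multibyte character: the first
call consumes the input bytes and stores the first unit, and a second
call returns (size_t)-3 and stores the second unit without consuming
any input. This is a sketch under assumptions: both helper names are
hypothetical, and the locale names "C.UTF-8" and "en_US.UTF-8" are
guesses about what is installed on a given system.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: try to select a UTF-8 locale. The locale names are
   assumptions about what is installed; returns 1 on success, 0 otherwise. */
int sketch_try_utf8_locale(void)
{
    return setlocale(LC_ALL, "C.UTF-8") != NULL
        || setlocale(LC_ALL, "en_US.UTF-8") != NULL;
}

/* Hypothetical helper: convert ONE multibyte character to char16_t units.
   Returns the number of units stored in out (1 or 2), or 0 on error. */
size_t sketch_one_mb_to_c16(char16_t out[2], const char *mb, size_t len)
{
    assert(out != NULL && mb != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);      /* start in the initial shift state */

    /* First call: consumes the multibyte character, stores the first unit. */
    size_t n = mbrtoc16(&out[0], mb, len, &state);
    if (n == (size_t)-1 || n == (size_t)-2 || n == (size_t)-3)
        return 0;

    /* Second call: if a second unit is pending in the state, it is stored
       and (size_t)-3 is returned without consuming any further input. */
    char16_t second;
    size_t m = mbrtoc16(&second, mb + n, len - n, &state);
    if (m != (size_t)-3)
        return 1;                         /* the character fit in one unit */

    out[1] = second;
    return 2;
}
```

In a UTF-8 locale, passing the four bytes "\xF0\x9F\x98\x80" (U+1F600)
should yield two units, a high surrogate followed by a low surrogate,
while a single ASCII byte yields one.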

> ...
>
> Conversion of arbitrary encodings other than the one in use by the
> locale requires a different API that takes encodings by name or some
> other identifier. The standard (POSIX) API for this is iconv, which
> has plenty of limitations of its own, some the same as what you've
> identified.

    Absolutely agreed! I just want the ones that the platform controls
(wide character and multibyte character encodings) to have correct,
simple paths to static encodings that can be used for more rigorous
text processing.

Sincerely,
JeanHeyd Meneide