musl - Re: Re: a bug in bindtextdomain() and strip '.UTF-8'

Message-ID: <20170213170825.GH1520@brightrain.aerifal.cx>
Date: Mon, 13 Feb 2017 12:08:25 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Re: a bug in bindtextdomain() and strip '.UTF-8'

On Sun, Feb 12, 2017 at 02:56:53PM +0800, He X wrote:
> 1. cat is added to the keys, also do a validate
> 2. so we what do we deal with the gettextdir() exactly? inline it or
> construct a gettextpointer()?
> 3. i added a extra locbuf array, and goto is replaced by a loop, memcpy is
> replaced by snprintf, compiled, and working well with fcitx

I haven't verified the loop logic yet but on a high level it looks
correct.

> 4. i just found that i forgot to store the keys to new buffer, it's ok to
> just use normal expression? or we need atomic operations?
> ```
> + p->cat = category;
> + p->binding = q;
> + p->lm = lm;
> ```

This is fine since the new msgcat is not visible to other threads
until it's installed with an atomic, which makes all previous writes
visible. I do want to rework this all with a lock structure rather
than atomics but that's a separate project.
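As a rough illustration of why the plain stores are fine, here is a minimal sketch of the publish pattern. It uses C11 <stdatomic.h> as a stand-in rather than musl's internal atomic primitives, and the struct layout, the list head `cats`, and the function name `install_msgcat` are all hypothetical, not the actual code under review.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the msgcat record; field types are assumed. */
struct msgcat {
    struct msgcat *next;
    int cat;
    const void *binding;
    const void *lm;
};

/* Hypothetical global list head shared between threads. */
static _Atomic(struct msgcat *) cats;

/* Fill in a private node with ordinary stores, then publish it with a
 * compare-and-swap. The release ordering on success makes the earlier
 * plain stores visible to any thread that later acquires the pointer. */
static struct msgcat *install_msgcat(int category, const void *q, const void *lm)
{
    struct msgcat *p = calloc(1, sizeof *p);
    if (!p) return 0;

    /* No other thread can see p yet, so no atomics are needed here. */
    p->cat = category;
    p->binding = q;
    p->lm = lm;

    struct msgcat *old = atomic_load_explicit(&cats, memory_order_relaxed);
    do {
        p->next = old; /* old is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak_explicit(&cats, &old, p,
                 memory_order_release, memory_order_relaxed));
    return p;
}
```

Readers would need to load `cats` with acquire ordering before dereferencing it for this to be sound.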

> 5.  I do want to rewrite all to .UTF8, but it's a bit annoying as your
> words, then i changed the code to simply strip.

Since this part is separate and there seems to be disagreement about
what it should do, let's separate it from the issue at hand; it's
really a separate change from making gettext do proper fallbacks
anyway.

> >  (safe for the user's terminal)
> LANG is set by users who are using musl and it's modified to zh_CN at
> setlocale(), app will use UTF8 directly, there's no such situation where
> charset will cause troubles to users' terminal, except apps which get the
> LANG manually by getenv(). I have not seen such strange applications so
> far, and most apps only have the UTF8 translation files.
> 
> For moving from glibc to musl, i think doing this way is good for now, we
> could delete it later, or just keep it forever. And most people won't use
> non-UTF8 at all, if they do use GBK, their app will even fallback to UTF8,
> because no translation files for GBK. So, it's not so dagerous, i think :).

The main considerations are:

1. what happens when a glibc user ssh's into a musl-based system
2. what happens when a musl user ssh's into a glibc-based system
3. what happens when running musl binaries on a glibc-based system

For #1 and #3, it's desirable for musl to accept ".UTF-8" in the
locale name, and for #2, users may desire to have ".UTF-8" in their
LC_* env vars so that remote glibc programs behave correctly.

For #1 and #3, if a glibc user is using a legacy non-UTF-8 locale and
runs a musl program, they're either going to get messed-up output or
ASCII-only output, depending on decisions we make and/or what their
locale value is. These cases are not really important since legacy
encodings are not supported, but it might be nice to make the outcome
least-bad.

If the user has a locale name like "fr_FR" or "zh_CN", that's
going to be interpreted differently by musl vs glibc; that was already
decided a long time ago in the interest of designing around the future
rather than broken legacy stuff. But if the locale name is explicitly
non-UTF-8 like "zh_CN.GBK", we could opt to reject it without breaking
anything, and this may give users better feedback about what's going
wrong if they have such settings when ssh'ing into a musl-based
system.
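To make the "reject explicitly non-UTF-8" option concrete, here is a minimal sketch of such a check. The function name `locale_codeset_ok` is hypothetical, and accepting both the "UTF-8" and "UTF8" spellings case-insensitively is an assumption on my part, not a decision:

```c
#include <string.h>
#include <strings.h>

/* Returns 1 if the locale name has no codeset part or names UTF-8,
 * and 0 if it explicitly names some other codeset (e.g. "zh_CN.GBK").
 * An optional "@modifier" suffix after the codeset is ignored. */
static int locale_codeset_ok(const char *name)
{
    const char *dot = strchr(name, '.');
    if (!dot) return 1; /* "fr_FR", "zh_CN": no codeset given */

    const char *cs = dot + 1;
    size_t n = strcspn(cs, "@"); /* codeset length, excluding any modifier */

    return (n == 5 && !strncasecmp(cs, "UTF-8", 5))
        || (n == 4 && !strncasecmp(cs, "UTF8", 4));
}
```

With this, "zh_CN" and "zh_CN.UTF-8" would be accepted while "zh_CN.GBK" would be rejected, so the caller could fail the setlocale request instead of silently producing UTF-8 output.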

Rich