musl - Re: Planned locale work and community thoughts

Message-ID: <59dd8a0c9a7dfd72728bc14a2de57461c2c796d8.camel@postmarketos.org>
Date: Wed, 30 Jul 2025 17:53:46 +0200
From: Pablo Correa Gomez <pabloyoyoista@...tmarketos.org>
To: Rich Felker <dalias@...c.org>
Cc: musl@...ts.openwall.com
Subject: Re: Planned locale work and community thoughts

On Wed, 18-06-2025 at 15:28 -0400, Rich Felker wrote:
> On Mon, Jun 02, 2025 at 07:37:51PM +0200, Pablo Correa Gomez wrote:
> > Hi everybody,
> > 
> > I am Pablo Correa Gomez, a member of postmarketOS Core
> > Contributors,
> > working on the collation and locale overhaul project
> > (https://www.openwall.com/lists/musl/2025/05/05/5) together with
> > Rich.
> > 
> > We now have more details on the planned locale work that was
> > announced earlier. The current musl locale experience is sub-par
> > compared to other platforms, and we plan to use this project to fix
> > that.
> > 
> > The main and biggest issue that we aim to solve is the
> > representation format of the locale strings. The initial
> > implementation used English strings as keys to look up
> > translations. This had a major issue: "May" would represent both
> > the abbreviated and non-abbreviated forms of the month, making it
> > untranslatable in languages where May has more than 3 letters.
> > However, there are other issues that we are also aiming to solve
> > in this project:
> 
> Main decision to be made here is how we key items that need
> localization, whether by fixing the string-based keying (e.g. using
> the macro names like "ABMON_5" as the keys) with the gettext-type
> lookup we have now, or switching to assigned integer indices as the
> keying for a more catgets-like system (likely using the values from
> the macros in langinfo.h as the indices), or something else.
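
To make the "May" ambiguity concrete, here is a minimal sketch in C of
the string-keyed option. The table layout, the names struct entry,
by_english, by_name and lookup, and the Spanish sample values are all
assumptions made for illustration; this is not musl's lookup code or
its on-disk format. The keys use the langinfo.h spellings ABMON_5 and
MON_5.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical translation entry; not musl's actual data format. */
struct entry {
    const char *key;
    const char *value;
};

/* Keyed by English text: "May" is both the abbreviated and the full
 * month name, so the table can carry only one translation for it. */
const struct entry by_english[] = {
    { "May", "mayo" },
};

/* Keyed by langinfo.h macro name: the two items stay distinct. */
const struct entry by_name[] = {
    { "ABMON_5", "may" },
    { "MON_5", "mayo" },
};

/* Linear search; returns NULL when the key has no translation so the
 * caller can fall back to the C locale string. */
const char *lookup(const struct entry *tab, size_t n, const char *key)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(tab[i].key, key) == 0)
            return tab[i].value;
    return NULL;
}
```

With English keys, a request for either form of the month can only
reach the single "May" entry. With macro-name keys the two forms are
looked up separately. An integer-indexed variant would replace the
string comparison with an array index.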

Since not many people weighed in, Rich and I have agreed that he will
come up with a proposal, so that people can comment on something more
specific. In addition to passing it through the community, we will
also ask the translators who volunteered for this project through
postmarketOS to weigh in.

> 
> > * Implement RADIXCHAR so that "." is not the only possible
> > separator.
> > THOUSEP will in principle not be implemented due to it breaking
> > quite
> > some assumptions, and it being less critical for users.
> 
> To give some background on this: from the start I was largely opposed
> to having the radix char be localizable at all, as this has been a
> source of perpetual problems for parsing and generating text-based
> data formats intended for interchange, and I didn't really think
> there
> was any modern demand for it.
> 
> However, in past discussions of the topic, it's come up that some
> people do want it, and I don't want us to be the bad guys who are
> being stubborn dismissing someone else's cultural expectations, so
> the
> tentative plan has been to offer this with 1-bit degree of freedom
> between '.' and ',' as the only choices.
> 
> I've been made aware that, at least historically prior to use in
> computer systems, there have been other notations for radix point,
> but
> it's not clear if there's any modern expectation to be able to do
> that. What I think would be a useful next step is to grep the Unicode
> CLDR for whether there are non-'.' non-',' radix chars in any locale
> definitions. If there are none, I think that already settles it. If
> there are any, we should attempt to figure out whether there are
> real-world systems that support them and precedent for users to
> expect
> they work.
> 
> Note that supporting basically anything plausible other than '.' and
> ',' as radix characters has major technical issues that may introduce
> vulns into programs not expecting it, so in the absence of both
> strong
> evidence of necessity and research into what would break and whether
> unsafe breakage is unlikely, I want to just say no to this.
> 
> It may however make sense for the on-disk data format to allow for
> the possibility, and for musl to just treat anything but "," as if
> it were "." for the foreseeable future.
> 
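
The fallback rule described above could look like the following
minimal sketch. effective_radix is a hypothetical helper name, and
passing the stored value as a NUL-terminated string is an assumption
about a data format that has not been decided yet.

```c
#include <stddef.h>

/* Hypothetical helper: map whatever radix string a locale file stores
 * to the one character that is actually honored. Only an exact ","
 * selects the comma; any other value (missing, empty, multibyte, or
 * some other character) is treated as '.'. */
char effective_radix(const char *stored)
{
    if (stored != NULL && stored[0] == ',' && stored[1] == '\0')
        return ',';
    return '.';
}
```

This keeps the 1-bit degree of freedom in the library while leaving
room in the file format for other values later.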
> > * Implement LC_MONETARY so that we can get properly localized
> > currency
> > representation.
> 
> This is fairly straightforward, but does need a reasonable data
> format
> that translates well into "struct localeconv" form. The localeconv
> fields that are strings need to be directly usable from the
> memory-mapped locale file, so that we don't need to allocate
> variable-sized storage for them, and one complication of this is that
> "grouping" and "mon_grouping" are *arch-specific* because the
> encoding uses CHAR_MAX with a special meaning, and CHAR_MAX could be
> 127 or 255. This means both versions of the string (one for
> signed-plain-char archs and one for unsigned-plain-char archs) need
> to be stored in the on-disk format.
> 
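
As an illustration of the CHAR_MAX complication, here is a minimal
sketch in C. In the localeconv encoding each byte of "grouping" is a
group size, and a byte equal to CHAR_MAX means no further grouping is
performed. The struct grouping_pair layout, the sample_grouping value
and the select_grouping helper are hypothetical names for this
example, not a proposed file format.

```c
#include <limits.h>

/* Hypothetical record holding both encodings of one grouping rule. */
struct grouping_pair {
    const char *for_signed_char;   /* "no further grouping" byte is 127 */
    const char *for_unsigned_char; /* "no further grouping" byte is 255 */
};

/* Sample rule: one group of 3 digits, then no further grouping. */
const struct grouping_pair sample_grouping = {
    "\3\177", /* 3, 127 */
    "\3\377", /* 3, 255 */
};

/* Pick the string whose marker byte equals this arch's CHAR_MAX, so
 * it can be handed out directly without copying or rewriting. */
const char *select_grouping(const struct grouping_pair *g)
{
    return CHAR_MAX == 127 ? g->for_signed_char : g->for_unsigned_char;
}
```

Because the selected pointer refers to an existing string, it could
point straight into a memory-mapped file, which is the property asked
for above.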
> > * Make sure that every function that accepts a locale actually uses
> > it
> > for the translation.
> > 
> > To be able to prepare for the technical work, there are some things
> > for
> > which we would like community input:
> > 
> > 1. We need to figure out an alternative representation for the
> > translatable strings derived from[1] to avoid the "May issue". A
> > simple
> > solution would be to use those constants (or an abbreviation of
> > them)
> > as keys for the lookup. Hopefully that would be both unambiguous
> > and
> > self-explanatory and as a bonus, it's already documented. Does
> > somebody
> > have other/better ideas?
> 
> See above for the options I'm aware of.
> 
> > 2. Regardless of the representation we choose, we need to decide on
> > a workflow for translators. Currently, people can just copy the
> > .pot file[2] with a hard-coded representation that might include
> > other things to translate. That seems good enough if we choose the
> > representation directly from [1], but might not be possible if we
> > decide on something different.
> 
> Long-term, the workflow should probably be deriving the data from the
> Unicode CLDR with possibility for overrides, with tooling to do that.
> I'm not sure if we want to prepare such tooling now.
> 
> At least for collation, I think we need some level of tooling now in
> order to be able to test/evaluate it. I'm presently trying to find
> the
> relevant tooling other systems use. ICU has something that converts
> the base weights table to a possibly-reasonable binary form but I
> haven't located the tooling to apply locale-specific modifications
> from CLDR data to the table.
> 
> > 3. Right now, other translatable strings coming from different
> > sources
> > (another email with a detailed analysis will follow up) are also
> > part
> > of the musl locales project. Those are also just translated
> > directly as strings. However, some also appear in different
> > contexts, like "out of memory" in regex and when getting network
> > address info. Should these be
> > split, receive a different representation, and thus provide
> > additional
> > context information to translators? I personally believe that most
> > high-level applications should hide these messages coming directly
> > from
> > libc, and thus they should only be rarely available to users, like
> > in
> > CLI applications, where users are generally expected to have a
> > basic
> > knowledge of English. I would be fine with leaving these strings
> > represented just by their own English string names, even if that
> > means
> > a bit of context is lost in some languages.
> 
> I think it's expected that they're translated (don't set LC_MESSAGES
> if you don't want that) but again the mechanism is open to change
> while we're making a major overhaul here.
> 
> Do we want the messages keyed by the English strings, as now, or do
> we
> want them keyed by identity of the error, whether that's the names of
> the errno E*/REG_*/EAI_*/etc. macros or some assigned integer codes
> as
> in the option for LC_TIME stuff above.

If in the end we are going to translate those (or lay the groundwork
for it), then from my personal translation experience they are
probably best keyed in a context-dependent way. Different languages
have different ways of expressing context, which English does not
always represent. So the E* proposal sounds a lot more sane to me than
English-based strings. On top of that, is there a chance of the
English strings ever changing? The musl-locales project currently
contains some strings from musl 1.1.14 that no longer exist in master,
so I'm assuming the answer is "unlikely, but yes". I'm pretty sure
that if we take translations seriously from now on, we want neither to
block musl releases on translation updates nor to break translations
just because of a typo fix, change, or clarification in the English
wording.

> > 4. Choose a default locale placement, so that we can get
> > translations
> > without needing to parse an envvar in [3]. In Alpine/pmOS the
> > location
> > is currently in /usr/share/i18n/locales/musl/ I do not think that's
> > a
> > great place, but the FHS does not seem to provide an obvious place
> > for
> > it to live, since AFAIU locales for the libc should not be mixed
> > with
> > LC_MESSAGES from other applications. Are there other suggestions?
> 
> My main concern, especially if we want them to be usable by suid
> binaries, is that they should be in a location we can rely on to
> belong to root. While I don't think they should be *stored* in /etc,
> reaching them via a path component (intended to be a symlink) in /etc
> is probably the best way to both ensure that and allow the actual
> files to be placed wherever distro policy wants them to be placed.

I think in general the musl ecosystem cares quite a bit about
standards. And although the FHS does not specify a perfect place for
this, expecting a symlink in /etc is certainly not spec-compliant,
since /etc is for "Host-specific configuration". It would seem
reasonable to me that in the case of a minimal recovery system where
/usr does not exist, translations are not available. If users have
specific requirements related to recovery, mounting /usr, and
translations, then they can place the locales under /etc and use
MUSL_LOCPATH. Forcing that on everybody seems like it would require
the majority to take action (either add a symlink in /etc or set an
environment variable, with less security) for the sake of some niche
use case, while going against a respected standard. Is there something
I'm missing?

> > 5. So far, although the musl-locales project exists, it has been
> > kept
> > apart from the main musl project, and not really sanctioned as
> > "official". It would be great, if we could have discussions related
> > to
> > musl-locales project directly in this mailing list. And if there
> > could
> > be a synchronized copy of it in https://git.musl-libc.org/cgit next
> > to
> > the musl repository. Is there somebody against this?
> 
> I think having the discussion on the main mailing list should be
> fine.

Great to know, for this and for all the other ones below.

Best,
Pablo

> 
> > 6. Given that at postmarketOS good localization is something
> > critical,
> > I would be very happy if we could fork the current project, host
> > it in
> > our gitlab, and use it as the place to synchronize with
> > https://git.musl-libc.org/cgit. If somebody would have other ideas,
> > or
> > moving it is considered disruptive, then it would be great if
> > somebody
> > from our team could also get access, so we can increase the
> > maintenance
> > it has seen lately.
> 
> I don't have a strong opinion on this yet, but I do agree that we
> should have it sync'd to the main musl git server, regardless of
> where
> actual devel takes place, so that it presents as "official".
> 
> > 7. If a locale is missing in musl, setlocale currently "fakes" that
> > support exists by copying the C data to the said locale. This has
> > the
> > benefit that apps which are translated in a locale missing in musl
> > still show up as translated for the application-related messages.
> > The
> > problem with this is that the UX is then inconsistent, since users
> > get
> > things mixed and matched in different languages. This is also
> > generally
> > a step against musl philosophy of being strictly correct. A
> > previous
> > discussion[4] had a pretty good proposal[5] that I fully support.
> > As I
> > said in the thread, as long as we have some time to adapt, the
> > behavior
> > change should be acceptable.
> 
> I need to review this but I recall the proposal being acceptable.
> 
> > 8. Finally, if you want to be involved in testing in a language for
> > which we don't yet have a volunteer signed up in [6], feel free to
> > put yourself forward. We might have some small funding available;
> > for that, please send me a private email with the details specified
> > in there.
> > 
> > We hope that at the end of this work, we have a setup for musl
> > locales
> > that is able to fit the needs of most users. If you believe there
> > is
> > something missing, please let us know.
> > 
> > This work is possible thanks to a grant from NLnet and the NGI Zero
> > Core Fund. Thank you for supporting us!
> > 
> > [1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/langinfo.h.html
> > [2] https://git.adelielinux.org/adelie/musl-locales/-/blob/main/musl.pot?ref_type=heads
> > [3] https://git.musl-libc.org/cgit/musl/tree/src/locale/locale_map.c#n66
> > [4] https://www.openwall.com/lists/musl/2023/08/10/3
> > [5] https://gist.github.com/al45tair/15c3ade52b09d0cad67074176ad43e4a#proposed-behaviour
> > [6] https://gitlab.postmarketos.org/postmarketOS/postmarketos/-/issues/65