musl - Re: Planned locale work and community thoughts

Message-ID: <67d29f815bba91bd7bca96c4308ac2667f77ac82.camel@postmarketos.org>
Date: Fri, 01 Aug 2025 11:58:30 +0200
From: Pablo Correa Gomez <pabloyoyoista@...tmarketos.org>
To: Rich Felker <dalias@...c.org>, Thorsten Glaser <tg@...bsd.de>
Cc: musl@...ts.openwall.com
Subject: Re: Planned locale work and community thoughts

On Wed, 18-06-2025 at 19:14 -0400, Rich Felker wrote:
> On Thu, Jun 19, 2025 at 12:42:50AM +0200, Thorsten Glaser wrote:
> > On Wed, 18 Jun 2025, Rich Felker wrote:
> > 
> > > Theoretically it's possible the textual grep missed things if
> > > there is
> > > inconsistent json formatting anywhere, so if anyone familiar with
> > > jq
> > > wants to conduct a search using it instead to confirm, go ahead.
> > > I
> > 
> > My jq-foo is not very good, but I managed this:
> > 
> > tg@...p:/tmp/u/cldr-numbers-full/main $ cat */numbers.json | jq
> > 'paths(.decimal?|scalars) as $p | [">>" + ($p | join(".")),
> > getpath($p).decimal]' | sed 's/">>/>>/' | grep -e '^  "[^.,]"' -e
> > '^  ".[^"]' | uniq
> >   "٫"
> > 
> > So yes, U+066B is the only other one, and no multi-char ones.
> > 
> > tg@...p:/tmp/u/cldr-numbers-full/main $ cat */numbers.json | jq
> > 'paths(.decimal?|scalars) as $p | [">>" + ($p | join(".")),
> > getpath($p).decimal]' | sed 's/">>/>>/' | grep -B 1 -e '^  "[^.,]"'
> > -e '^  ".[^"]'
> > 
> > … shows all the occurrences, but a quick filter shows that we have
> > both symbols-numberSystem-arabext and symbols-numberSystem-arab but
> > assuming both are out of scope…
> > 
> > tg@...p:/tmp/u/cldr-numbers-full/main $ cat */numbers.json | jq
> > 'paths(.decimal?|scalars) as $p | [">>" + ($p | join(".")),
> > getpath($p).decimal]' | sed 's/">>/>>/' | grep -B 1 -e '^  "[^.,]"'
> > -e '^  ".[^"]' | fgrep '>>' | fgrep -v -e '.symbols-numberSystem-
> > arabext"' -e '.symbols-numberSystem-arab"'
> >   >>main.bgn-AE.numbers.symbols-numberSystem-latn",
> >   >>main.bgn-AF.numbers.symbols-numberSystem-latn",
> >   >>main.bgn-IR.numbers.symbols-numberSystem-latn",
> >   >>main.bgn-OM.numbers.symbols-numberSystem-latn",
> >   >>main.bgn.numbers.symbols-numberSystem-latn",
> > 
> > … leaves us with this; bgn/numbers.json as an example:
> > 
> > {
> >   "main": {
> >     "bgn": {
> >       "numbers": {
> >         "symbols-numberSystem-arabext": {
> >           "decimal": "٫",
> >           "group": "٬",
> >           "list": "؛",
> > …
> >         },
> >         "symbols-numberSystem-latn": {
> >           "decimal": "٫",
> >           "group": "،",
> >           "list": ";",
> > …
> > 
> > So, if the bgn locales are ever going to be relevant…
> > unsure what that exactly is, but my acronyms database says…
> >  [ISO 639-3] Western Balochi (cf. bal)
> > … which seems to fit.
> 
> Thanks. My grepping seems to have overlooked that just because it was
> the same character that would normally only be used in an alt-digits
> context. I wonder if the above is intentional or a mistake and if any
> systems are actually doing that.

I've done some research on this topic to see whether we could find out
a bit more. Unfortunately, online resources related to Western Balochi
are incredibly sparse:

* glibc: no support at all
https://github.com/bminor/glibc/tree/master/localedata/locales
* Windows: no support at all
https://support.microsoft.com/en-us/windows/language-packs-for-windows-a5094319-a92d-18de-5b53-1cfc697cfca8
* Android: no support at all
https://android.googlesource.com/platform/frameworks/base/+/android-16.0.0_r1/core/res/res/values/locale_config.xml
* Weblate: 3 projects seem to have translations, but at 0%
completion: https://hosted.weblate.org/languages/bgn/
* iOS: no support in system languages
https://www.apple.com/ios/feature-availability/#system-language-system-language
or keyboard support
https://www.apple.com/ios/feature-availability/#quicktype-keyboard-language-support

In addition, it seems that this data was introduced into the CLDR 10
years ago in
https://github.com/unicode-org/cldr/commit/a4fe61ea1c1a01e3dfe2545d013ca3289640c81f
and has never changed since. I've also tried to research whether the
data in the CLDR could be an error. The survey for Western Balochi
unfortunately shows no votes:
https://st.unicode.org/cldr-apps/v#/bgn/Symbols/a1ef41eaeb6982d
compared to something like Spanish that has votes from Apple,
Microsoft, and Google:
https://st.unicode.org/cldr-apps/v#/es/Symbols/4ec3d1b99830ad07

I wonder if it's worth bringing this to the attention of the Unicode
Consortium to get some clarity on it, or if we should just consider this
a bug in the data for a language with very limited digitization and move
on with the assumption of just "." and ",".
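
In case it helps anyone re-check the data without depending on JSON
formatting, here is a rough Python sketch of the same search Thorsten
did with jq. It is only a sketch under assumptions: the names
unusual_decimals, EXPECTED, SAMPLE and skip_systems are mine, the sample
reproduces only the two "decimal" fields from the bgn excerpt quoted
above, and I have not run it against the full cldr-numbers-full tree.

```python
# Separators the thread proposes to assume everywhere.
EXPECTED = {".", ","}
PREFIX = "symbols-numberSystem-"


def unusual_decimals(doc, skip_systems=("arab", "arabext")):
    """Yield (locale, number_system, decimal) for unexpected separators.

    `doc` is one parsed numbers.json document. Number systems listed in
    `skip_systems` are treated as out of scope, as in the quoted search.
    """
    for locale, data in doc.get("main", {}).items():
        for key, symbols in data.get("numbers", {}).items():
            if not key.startswith(PREFIX) or not isinstance(symbols, dict):
                continue
            system = key[len(PREFIX):]
            decimal = symbols.get("decimal")
            if decimal is None or system in skip_systems:
                continue
            if decimal not in EXPECTED:
                yield locale, system, decimal


# Hypothetical fragment: only the "decimal" fields shown in the quoted
# bgn/numbers.json excerpt; everything elided there is left out here.
SAMPLE = {
    "main": {
        "bgn": {
            "numbers": {
                "symbols-numberSystem-arabext": {"decimal": "\u066b"},
                "symbols-numberSystem-latn": {"decimal": "\u066b"},
            }
        }
    }
}

hits = list(unusual_decimals(SAMPLE))
for locale, system, decimal in hits:
    codepoints = " ".join(f"U+{ord(c):04X}" for c in decimal)
    print(locale, system, codepoints)
```

To use it on real data, one would load each */numbers.json with
json.load() and pass the result to unusual_decimals(). Passing
skip_systems=() also reports the arab and arabext entries.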

Best,
Pablo Correa Gomez