Message-ID: <20250801135815.GM1827@brightrain.aerifal.cx>
Date: Fri, 1 Aug 2025 09:58:16 -0400
From: Rich Felker <dalias@...c.org>
To: Pablo Correa Gomez <pabloyoyoista@...tmarketos.org>
Cc: Thorsten Glaser <tg@...bsd.de>, musl@...ts.openwall.com
Subject: Re: Planned locale work and community thoughts

On Fri, Aug 01, 2025 at 11:58:30AM +0200, Pablo Correa Gomez wrote:
> On Wed, 18-06-2025 at 19:14 -0400, Rich Felker wrote:
> > On Thu, Jun 19, 2025 at 12:42:50AM +0200, Thorsten Glaser wrote:
> > > On Wed, 18 Jun 2025, Rich Felker wrote:
> > >
> > > > Theoretically it's possible the textual grep missed things if
> > > > there is inconsistent json formatting anywhere, so if anyone
> > > > familiar with jq wants to conduct a search using it instead to
> > > > confirm, go ahead. I
> > >
> > > My jq-foo is not very good, but I managed this:
> > >
> > > tg@...p:/tmp/u/cldr-numbers-full/main $ cat */numbers.json |
> > >   jq 'paths(.decimal?|scalars) as $p | [">>" + ($p | join(".")), getpath($p).decimal]' |
> > >   sed 's/">>/>>/' | grep -e '^ "[^.,]"' -e '^ ".[^"]' | uniq
> > > "٫"
> > >
> > > So yes, U+066B is the only other one, and no multi-char ones.
> > >
> > > tg@...p:/tmp/u/cldr-numbers-full/main $ cat */numbers.json |
> > >   jq 'paths(.decimal?|scalars) as $p | [">>" + ($p | join(".")), getpath($p).decimal]' |
> > >   sed 's/">>/>>/' | grep -B 1 -e '^ "[^.,]"' -e '^ ".[^"]'
> > >
> > > … shows all the occurrences, but a quick filter shows that we have
> > > both symbols-numberSystem-arabext and symbols-numberSystem-arab but
> > > assuming both are out of scope…
> > >
> > > tg@...p:/tmp/u/cldr-numbers-full/main $ cat */numbers.json |
> > >   jq 'paths(.decimal?|scalars) as $p | [">>" + ($p | join(".")), getpath($p).decimal]' |
> > >   sed 's/">>/>>/' | grep -B 1 -e '^ "[^.,]"' -e '^ ".[^"]' |
> > >   fgrep '>>' |
> > >   fgrep -v -e '.symbols-numberSystem-arabext"' -e '.symbols-numberSystem-arab"'
> > > >>main.bgn-AE.numbers.symbols-numberSystem-latn",
> > > >>main.bgn-AF.numbers.symbols-numberSystem-latn",
> > > >>main.bgn-IR.numbers.symbols-numberSystem-latn",
> > > >>main.bgn-OM.numbers.symbols-numberSystem-latn",
> > > >>main.bgn.numbers.symbols-numberSystem-latn",
> > >
> > > … leaves us with this; bgn/numbers.json exemplary:
> > >
> > > {
> > >   "main": {
> > >     "bgn": {
> > >       "numbers": {
> > >         "symbols-numberSystem-arabext": {
> > >           "decimal": "٫",
> > >           "group": "٬",
> > >           "list": "؛",
> > >           …
> > >         },
> > >         "symbols-numberSystem-latn": {
> > >           "decimal": "٫",
> > >           "group": "،",
> > >           "list": ";",
> > >           …
> > >
> > > So, if the bgn locales are ever going to be relevant…
> > > unsure what that exactly is, but my acronyms database says…
> > > [ISO 639-3] Western Balochi (cf. bal)
> > > … which seems to fit.
> >
> > Thanks. My grepping seems to have overlooked that just because it was
> > the same character that would normally only be used in an alt-digits
> > context. I wonder if the above is intentional or a mistake and if any
> > systems are actually doing that.
>
> I've done some research on this topic to see if we could figure out a
> bit more information. Unfortunately, online resources related to
> Western Balochi are incredibly sparse:
>
> * glibc: no support at all
>   https://github.com/bminor/glibc/tree/master/localedata/locales
> * Windows: no support at all
>   https://support.microsoft.com/en-us/windows/language-packs-for-windows-a5094319-a92d-18de-5b53-1cfc697cfca8
> * Android: no support at all
>   https://android.googlesource.com/platform/frameworks/base/+/android-16.0.0_r1/core/res/res/values/locale_config.xml
> * Weblate: 3 projects seem to have translations, but at 0%
>   translation: https://hosted.weblate.org/languages/bgn/
> * iOS: no support in system languages
>   https://www.apple.com/ios/feature-availability/#system-language-system-language
>   or keyboard support
>   https://www.apple.com/ios/feature-availability/#quicktype-keyboard-language-support
>
> In addition, it seems like that data in the CLDR was introduced 10
> years ago in
> https://github.com/unicode-org/cldr/commit/a4fe61ea1c1a01e3dfe2545d013ca3289640c81f
> and never changed since. I've also tried to do some research on whether
> the data in the CLDR could be an error. The survey for Western Balochi
> unfortunately shows no votes:
> https://st.unicode.org/cldr-apps/v#/bgn/Symbols/a1ef41eaeb6982d
> compared to something like Spanish that has votes from Apple,
> Microsoft, and Google:
> https://st.unicode.org/cldr-apps/v#/es/Symbols/4ec3d1b99830ad07
>
> I wonder if it's worth it to bring this to the attention of the Unicode
> Consortium to get some clarity on it, or if we just consider this a bug
> from a language with very limited digitalization and move on with the
> assumption of just "." and ",".

Thanks for the quick research!

My view is that unless there's an existing strong precedent for this
convention in digital interfaces, which you seem to have established
that there's not, we should not pursue supporting it. I'm fine with
leaving open the possibility in the data format (i.e. not just encoding
the value in the locale file as 1 bit) so that the possibility isn't
locked out, but I'm pretty strongly on the side of either mapping
anything but ',' to '.', or refusing to load locale files where the
field is neither '.' nor ',' as unsupported/malformed.

I just don't see any way to rationalize doing something that likely has
unforeseen security consequences for the sake of a generality that no
existing users expect (because there's no software that has set that
expectation).

Rich
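
[Editor's sketch, not part of the original message.] The two alternatives described above (map anything but ',' to '.', or refuse the locale file) could look roughly like the following C fragment. Every name here (`radix_policy`, `radix_from_field`) is hypothetical and is not taken from musl or from any proposed patch. The sketch also assumes the locale file stores the separator as a string rather than as a single bit, which is what keeps other values representable in the data format.

```c
#include <string.h>

/* Hypothetical policy selector; not part of musl. */
enum radix_policy {
    RADIX_MAP_TO_DOT, /* anything other than "," is treated as "." */
    RADIX_REJECT      /* anything other than "." or "," is malformed */
};

/*
 * Decide which radix character to use for a locale whose data file
 * carries the string `field` as its decimal separator.
 * Returns '.' or ',' on success, or 0 if the locale should be refused.
 * A missing (NULL) field is handled like any other unsupported value.
 */
char radix_from_field(const char *field, enum radix_policy policy)
{
    if (field && !strcmp(field, ","))
        return ',';
    if (field && !strcmp(field, "."))
        return '.';
    /* Unsupported value, e.g. U+066B from the bgn data quoted above. */
    return policy == RADIX_MAP_TO_DOT ? '.' : 0;
}
```

Under either policy the rest of the library only ever sees a single-byte '.' or ',', so code that assumes a one-byte radix character needs no changes. The policies differ only in whether a file with another value loads with a substituted '.' or fails to load.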