Date: Fri, 28 Jan 2022 20:47:33 +0100
From: Markus Wichmann <>
Subject: Re: A journey of weird file sorting and desktop systems

On Fri, Jan 28, 2022 at 01:01:04PM -0500, Rich Felker wrote:
> ICU is really, *really* bad. I don't want to be encouraging people to
> use it because basic functionality is missing from libc.

But basic functionality *is* missing from libc, and by design. By the
standard. For example, toupper and towupper can only return a single
code point. That doesn't work with German's ß character, which has the
capital form SS. If you were transforming some general German word group
into block capitals for a headline or something, that is the
transformation you would use. Now, some people have invented a capital
version of ß, that is still new enough to make blocks appear in many
programs (test your mail program here: ẞ), but that letter is not widely
used.

Also, many applications expect towupper and towlower to be inverse
functions of each other, but here, not all instances of SS ought to be
transformed to ß when passing them through towlower, even if the
interface did support such a thing.
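To make the ß problem concrete, here is a minimal sketch. The helpers
upper_full() and lower_simple() are hypothetical names invented for this
illustration and are not libc interfaces. The sketch works on arrays of
Unicode code points and only touches ASCII plus ß, so it does not depend
on the locale. It shows that a correct uppercasing has to be able to grow
the string, and that lowercasing the result per code point cannot recover
the original.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wctype.h>

#define SHARP_S 0x00DFu  /* ß, LATIN SMALL LETTER SHARP S */

/*
 * Hypothetical helper, not a libc interface: uppercase n code points from
 * in[] into out[], expanding ß to "SS". Returns the output length, which
 * may exceed n. The caller must provide room for 2 * n code points.
 */
static size_t upper_full(const uint32_t *in, size_t n,
                         uint32_t *out, size_t cap)
{
    size_t len = 0;

    assert(cap >= 2 * n);
    for (size_t i = 0; i < n; i++) {
        if (in[i] == SHARP_S) {
            out[len++] = 'S';   /* one code point in, two out */
            out[len++] = 'S';
        } else if (in[i] < 0x80) {
            out[len++] = (uint32_t)towupper((wint_t)in[i]);
        } else {
            out[len++] = in[i]; /* other non-ASCII left alone in this sketch */
        }
    }
    return len;
}

/*
 * Hypothetical helper: lowercase one code point at a time, which is all
 * that the towlower() signature permits. The length never changes.
 */
static size_t lower_simple(const uint32_t *in, size_t n, uint32_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] < 0x80 ? (uint32_t)towlower((wint_t)in[i]) : in[i];
    return n;
}
```

Feeding the four code points of "maße" (measurements) through
upper_full() yields the five code points of "MASSE", and lower_simple()
turns that into "masse", which reads as "Masse" (mass), a different word.
Deciding which SS should become ß again needs a dictionary, not a
character table.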

My point is that the development of interfaces that deal with
internationalization might be better put into a library with an
interface less rigid than libc, where any adjustment moves at the
glacial pace of the Austin Group or WG14, and in any case, breaking
changes are completely out of the question. That is also why we still
have gets() and strchr().

Whether ICU is a suitable library for that purpose I lack the expertise
to say. However, all I have heard about it so far is either that one
should use it to cure all i18n ills, or that it is an abomination unto
the Lord. But even the people in the second camp fail to recommend a
superior alternative. So I'm guessing there isn't one.

As to the actual function in question: Simply having a way to make
strcoll behave like strcasecmp instead of strcmp would probably already
be the 80% solution for most European languages.
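As a rough sketch of the difference such a switch would make for sorting
file names: the wrapper sort_names() and its ignore_case flag are
hypothetical and only stand in for whatever mechanism would select the
behaviour. strcmp, strcasecmp (POSIX, from <strings.h>) and qsort are the
real interfaces.

```c
#include <stdlib.h>
#include <string.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Byte-wise comparison: every uppercase ASCII letter sorts before 'a'. */
static int cmp_bytes(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Case-insensitive comparison, as proposed for strcoll. */
static int cmp_nocase(const void *a, const void *b)
{
    return strcasecmp(*(const char *const *)a, *(const char *const *)b);
}

/* Hypothetical wrapper: sort an array of names with either comparison. */
static void sort_names(const char **names, size_t n, int ignore_case)
{
    qsort(names, n, sizeof *names, ignore_case ? cmp_nocase : cmp_bytes);
}
```

With the names "banana", "Cherry" and "apple", the strcmp ordering is
Cherry, apple, banana, while the strcasecmp ordering is apple, banana,
Cherry. Both comparisons still work on bytes, so in UTF-8 the bytes of a
non-ASCII letter compare greater than every ASCII letter.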

Yeah, it won't work with umlauts, but we Germans are used to that. "It
is <current year> and we still can't do umlauts" is a common curse
levelled at information technology, and for the most part it is apt. I
routinely counsel against using umlauts in file names or pass phrases,
because you never know what character set it gets saved in or
transmitted later, and it just causes avoidable problems. I really doubt
this issue will ever be solved within my lifetime.

