musl - Re: iconv Korean and Traditional Chinese research so far

Message-ID: <20130805143540.GH221@brightrain.aerifal.cx>
Date: Mon, 5 Aug 2013 10:35:40 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: iconv Korean and Traditional Chinese research so far

On Mon, Aug 05, 2013 at 09:53:32AM +0200, Harald Becker wrote:
> Hi Rich !
> 
> > iconv is not something that needs to be extensible. There is a
> > finite set of legacy encodings that's relevant to the world,
> > and their relevance is going to go down and down with time, not
> > up.
> 
> Oh! So you consider Japanese, Chinese, Korean, etc. languages
> relevant for programs sitting on my machines? How can you decide

I don't decide what's relevant for you. Rather, I don't have the
authority to declare it irrelevant-by-default. This is true even for
things like crypt algorithms (does anybody really want to use md5??)
but especially for anything that would preclude somebody from being
able to receive data in their native language. Simple multilingual
support via UTF-8 with conversion from legacy data has been near top
priority, if not top, since the conception of musl.

If history has shown us anything, it's that universal support for all
languages must be default and turning off some support to save space
(which is rarely if ever actually needed) needs to be a conscious
decision. I'm no Apple fan by any means, but just look at the
situation on iOS: you can turn on a new iPhone or iPad and read data
in any language (including having the relevant fonts!) and even add a
keyboard and type in almost any language, without having to buy a
special localized version or install add-ons. This is very different
from the situation on Android right now.

musl's intended applicability is broad: from industrial control to
set-top boxes, in-car entertainment, initramfs images for desktop
machines, phones, tablets, plug computers that run your private home
or office webmail server, full desktops, VE LAMP stacks, hosts for
VEs, etc. Some of these usages have a real need for human-language
text; others don't. But if we have the power to make it such that, if
someone uses musl to implement a plug computer for webmail, it
naturally supports all languages unless the maker of the device goes
and actively rips that support out, then we have a responsibility to
do so. Or, said differently, it's OUR FAULT for making
broken-by-default software if language support is missing unless you
go to the effort of learning musl-specific ways to enable it.

> this? Why being so ignorant and trying to write an standard
> conform library and then pick out a list of char sets of your
> choice which may be possible on iconv, neglecting wishes and
> need of any musl user.

If I were to just accept your demands, it would essentially mean:

(1) discarding the opinions of everybody else who discussed this issue
in the past and decided that static linking should mean real static
binaries that work the same without needing extra files in the
filesystem.

(2) discarding the informed decisions I made based on said
discussions.

> .... or in other words, if you really be this ignorant and
> insist on including those charsets fixed in musl, musl is never
> more for me :( ... I don't need to bring in any part of mine into
> musl, but I don't consider a lib usable for my needs, which
> include several char set files in statical build and neglects to
> load seldom used charset definitions from extern in any way.

Name the extra "seldom used charset definitions" you're interested in.
They're probably already supported. We are not discussing adding some
new giant subsystem to musl. We are discussing adding the last two
missing major legacy charsets to an existing framework that's existed
for a long time.

> > > > Do I want to give users who have large volumes of legacy
> > > > text in their languages stored in these encodings the same
> > > > respect and dignity as users of other legacy encodings we
> > > > already support? Yes.
> > > 
> > > Of course. I won't dictate others which conversions they want
> > > to use. I only hate to have plenty of conversion tables on my
> > > system when I really know I never use such kind of
> > > conversions.
> > 
> > And your table for just Chinese is as large as all our tables
> > combined...
> 
> How can you tell this. I don't think so.

You're welcome to implement it and see. Thanks to the way static
linking works, if you add -lyouriconv when static linking, the iconv
in musl will be completely omitted from the binary and yours will be
used instead. Of course the iconv in musl will be completely omitted
anyway except in the small number of programs that actually use iconv.
This is not glibc where stdio and locale depend on iconv. iconv is
purely iconv.
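
For reference, here is roughly what "a program that actually uses
iconv" looks like. This is only a minimal sketch using the standard
POSIX iconv_open/iconv/iconv_close calls; the helper name to_utf8 is
made up for illustration and is not part of musl or any other library.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert len bytes of `in` from charset `from`
 * to UTF-8, writing at most outsz bytes to `out`.
 * Returns the number of bytes written, or (size_t)-1 on any failure. */
static size_t to_utf8(const char *from, const char *in, size_t len,
                      char *out, size_t outsz)
{
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;          /* conversion not supported */

    char *inp = (char *)in;         /* iconv() takes char ** for its input */
    char *outp = out;
    size_t inleft = len, outleft = outsz;

    size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (r == (size_t)-1)
        return (size_t)-1;          /* bad input or output buffer too small */
    return outsz - outleft;
}
```

A call such as to_utf8("ISO-8859-1", "\xe9", 1, buf, sizeof buf) should
leave the two-byte UTF-8 sequence for U+00E9 in buf. A program that
never references these functions has no reason to pull the conversion
tables into a static link.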

> Such conversion codes
> may be very compact. Size is mainly required for translation
> tables, that is when code points of the char sets does not match
> Unicode character order, but you always need the space for those
> translations. The rest won't be much.

That's all the size. The VAST majority of the table size is for 4
major character encoding families, those based on:

- JIS 0208
- GB 18030
- KS X 1001
- Big5

As for legacy 8-bit encodings, musl's approach to them is also more
efficient than you could easily be with a state machine. The fact that
the number of codepoints that ever appear in an 8-bit encoding is less
than 1024 is used to store the mappings as 10-bit-per-entry packed
arrays of indices into the legacy_chars table. This reduces the
marginal cost of individual 8-bit encodings by 37.5% (versus 16-bit
entries). The ASCII range and any span upward into the high range that
maps directly to Unicode codepoints is also elided from the table
(which reduces ISO-8859-* by another 62.5%).
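
To make the packing idea concrete, here is a rough sketch of
10-bit-per-entry storage with an elided identity-mapped low range.
This is not the code in musl; the names packed_bytes, get10, put10 and
decode_byte, the bit order, and the argument layout are all assumptions
made up for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to store n entries at 10 bits each, rounded up. */
static size_t packed_bytes(size_t n)
{
    return (n * 10 + 7) / 8;
}

/* Fetch entry idx from a packed array of 10-bit values (low bits first).
 * idx*10 is always even, so the bit offset inside the first byte is
 * 0, 2, 4 or 6 and every entry touches exactly two bytes. */
static unsigned get10(const unsigned char *tab, size_t idx)
{
    size_t bit = idx * 10;
    unsigned v = tab[bit / 8] | ((unsigned)tab[bit / 8 + 1] << 8);
    return (v >> (bit % 8)) & 0x3ff;
}

/* Store a 10-bit value at entry idx, leaving neighbouring entries intact. */
static void put10(unsigned char *tab, size_t idx, unsigned val)
{
    size_t bit = idx * 10, byte = bit / 8;
    unsigned shift = bit % 8;
    unsigned v = tab[byte] | ((unsigned)tab[byte + 1] << 8);
    v = (v & ~(0x3ffu << shift)) | ((val & 0x3ffu) << shift);
    tab[byte] = v & 0xff;
    tab[byte + 1] = v >> 8;
}

/* Hypothetical decoder: bytes below `first` map to the same Unicode
 * codepoint and need no table entry; the rest are looked up through a
 * packed index into a shared table of codepoints. */
static uint32_t decode_byte(unsigned char c, unsigned first,
                            const unsigned char *packed,
                            const uint16_t *chars)
{
    if (c < first)
        return c;
    return chars[get10(packed, c - first)];
}
```

With these assumptions, a full 256-entry map takes packed_bytes(256) =
320 bytes instead of 512 at 16 bits per entry, and a map that only
needs entries for 0xa0 through 0xff takes packed_bytes(96) = 120 bytes.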

In short, what we have is about the smallest possible representation
you can get without applying LZMA or something (and thereby needing
all the code to decompress and dirty pages to store the decompressed
version). It's hard to beat.

By the way, if you really want to save the space they take, you could
just delete this email thread from your mail folder. It's larger than
musl's iconv already. :-)

> > I agree you can make iconv smaller than musl's in the case
> > where _no_ legacy DBCS are installed. But if you have just one,
> > you'll be just as large or larger than musl with them all.
> 
> .... musl with them all? I don't consider them smaller than an
> optimized byte code interpreter ... not when you are going to
> include DBCS char sets fixed into musl. At least if you do all
> the required translations.

I may have been exaggerating a little bit, but I doubt you can get
your bytecode GB18030 support smaller than about 110k once you count
the bytecode and the interpreter binary. I'm even more doubtful that
you can get it smaller than the current 71k in musl.

> > compare the size of musl's tables to glibc's converters. I've
> > worked hard to make them as small as reasonably possible
> > without doing hideous hacks like decompression into an
> > in-memory buffer, which would actually increase bloat.
> 
> Are you now going to build a lib for startup purpose and embedded
> systems only or are you trying to write a general purpose
> library?

General-purpose. Have you not read the website?

    Originally in the 1990s, Linux-based systems used a fork of the
    GNU C library (glibc) version 1, which existed in various versions
    (libc4, libc5). Later, distributions adopted the more mature
    version 2 of glibc, and denoted it libc6. Since then, other
    specialized C library implementations such as uClibc and dietlibc
    have emerged as well.

    musl is a new general-purpose implementation of the C library. It
    is lightweight, fast, simple, free, and aims to be correct in the
    sense of standards-conformance and safety.

If you're using it for startup purposes or embedded systems that don't
communicate with humans in human language, you won't be running
applications that call iconv() and thus it's irrelevant.

> On one hand you say "use dietlibc" if you need small statical
> programs and on the other hand you want to include many charset
> definitions into a statical build to avoid dynamic loading of
> tables, required only on embedded systems.

Where did I say "use dietlibc"? If I did (I don't really remember) it
was not a serious recommendation but a sarcastic remark to make a
point that musl is not about being "smallest-at-all-costs" (and
thereby broken) like dietlibc is.

> > have been over in 1992, when Pike and Thompson made them
> > obsolete, but it's really over now.
> 
> So why are you adding Japanese, Chinese and Korean charsets to an
> iconv conversion in musl? Why not just using UTF-8? Whenever you
> use iconv you want the flexibility to do all required charset
> conversions. Which means you need to statically link in many
> charset definitions or you need to dynamically load what is
> required.

The time of creating charsets is over. That does not magically make
the data created in those charsets in the past go away or convert
itself to UTF-8. It doesn't even magically stop people from making new
data in those charsets. All it means is that governments, vendors,
etc. have stopped the madness of making new charsets.

> > Then dynamic link it. If you want an extensible binary, you use
> > dynamic linking.
> 
> Dynamic linking of mail client, ok and where go the charset
> definition files? Are they all packed into your libc.so? That is
> a very big file? Why do I need to have Asian language definition
> on my disk, when I do not want?

Because any other solution would be larger, would defeat the purpose
of static linking, and would contribute to the problem of poor
multilingual support. Why are you upset about these tables and not
other tables like crypto sboxes, wcwidth, character classes, bits of
2/pi and pi/2, etc.? By the way, math/*.o are also fairly large, on
the same order of magnitude as iconv; would you also suggest we move
it all out to bytecode loaded at runtime even in static binaries?

> It is your decision, but please state clear what purpose you are
> building musl. Here it looks you are mixing things and stepping in
> a direction I will never like.

This has all been documented all along. I'm sorry you don't understand
the goals of the project. Perhaps your misunderstanding is what
"general purpose" means. It does not mean we omit anything that could
offend anyone by wasting a few bytes on their hard drive. It means we
don't cut corners that break important usage cases. Having a complete
iconv linked whenever you link a program using iconv() does not break
your usage case unless you have less than 100k of disk/ssd/rom storage
to spare, and in that case, you probably shouldn't be using iconv. If
anyone ever does have a practical difficulty because of this, rather
than theoretical complaints based on anglocentrism, eurocentrism,
and/or xenophobia, I am not entirely opposed to making a build option
to omit iconv tables, but it has to be well-motivated.

Rich