Date: Fri, 13 Dec 2013 01:49:23 -0500
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: validation of utf-8 strings passed as system call arguments

On Fri, Dec 13, 2013 at 07:36:51AM +0100, Szabolcs Nagy wrote:
> * Rich Felker <dalias@...ifal.cx> [2013-12-12 23:39:41 -0500]:
> > that filenames can contain arbitrary byte sequences. And Linus in
> > particular is opposed to changing this, though there's been some
> > indication (I don't have references right off) that he might be open
> > to optional restrictions at the kernel level.
>
> he didnt look very persuadable some time ago
> http://yarchive.net/comp/linux/utf8.html

Yes, that was a long time ago though. I forget where I saw an
indication that this could change (perhaps the Austin Group list? in
the thread about newlines...) but the general idea, if I recall, was
that restrictions would take place in the framework of a generic layer
for restricting malicious content in filenames that's not UTF-8
specific.

> (i actually like the kernel that way: what would you do when
> mounting a filesystem with invalid filenames? would you also
> reject surrogate pairs, pua codes or do unicode normalization?)

"Surrogate pairs" aren't even a question; surrogates aren't encodable
at all in UTF-8. So they would automatically be gone just by mandating
well-formed UTF-8.

Normalization (which Apple does) is absolutely wrong and
non-conforming to POSIX; it causes multiple distinct names to refer to
the same file (despite having a link count of 1, BTW), which is just
as dangerous as issues like "over-long sequence" decoding and
URL-escaped dots and slashes. The only "correct" way to do
normalization at the FS level is disallowing non-normalized filenames.
But normalization is actually just broken and harmful anyway, since
there are languages for which bugs in Unicode have made the normalized
form contrary to the actual semantic ordering of characters in the
language (characters were incorrectly assigned combining classes such
that letters reorder contrary to their actual semantic order, and due
to stability policy this can't be fixed, so the only solution is to
forget about using normalization).

As for PUA, it wouldn't be forbidden by enforcing UTF-8. Per the
definition, a "UTF" is a bijective mapping between the Unicode scalar
values (0 through 0xD7FF and 0xE000 through 0x10FFFF) and legal
sequences of code units. Whether a character identity is assigned to a
scalar value is irrelevant to UTFs.

Rich
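
To make the points above about surrogates, over-long sequences and PUA
concrete, here is a minimal sketch of a well-formedness check in C. It
is only an illustration, not code from musl or the kernel; the names
is_scalar_value and utf8_is_well_formed are hypothetical. It decodes
each sequence, rejects over-long encodings, and accepts only scalar
values (0 through 0xD7FF and 0xE000 through 0x10FFFF), so surrogates
fail while PUA code points pass.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 1 if c is a Unicode scalar value, i.e. in
 * 0..0xD7FF or 0xE000..0x10FFFF (surrogates are excluded). */
static int is_scalar_value(uint32_t c)
{
    return c <= 0xD7FF || (c >= 0xE000 && c <= 0x10FFFF);
}

/* Hypothetical helper: 1 if the n bytes at s are well-formed UTF-8,
 * 0 otherwise. Assigned-ness of a code point is never consulted. */
int utf8_is_well_formed(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char b = s[i];
        uint32_t c, min;
        size_t len, k;

        if (b < 0x80) {            /* single-byte (ASCII) sequence */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            len = 2; c = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            len = 3; c = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            len = 4; c = b & 0x07; min = 0x10000;
        } else {
            return 0;              /* stray continuation or invalid lead byte */
        }

        if (n - i < len)
            return 0;              /* truncated sequence */

        for (k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;          /* not a continuation byte */
            c = (c << 6) | (s[i + k] & 0x3F);
        }

        /* c < min means an over-long encoding; the scalar check
         * rejects surrogates and anything above 0x10FFFF. */
        if (c < min || !is_scalar_value(c))
            return 0;

        i += len;
    }
    return 1;
}
```

With this sketch, the over-long encoding of '/' (C0 AF) and the
surrogate U+D800 (ED A0 80) are both rejected, while the PUA code
point U+E000 (EE 80 80) is accepted.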