Date: Sat, 2 Jun 2018 22:26:35 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: iconv UTF-8 <--> CP1255 roundtrip possible bug?

On Wed, May 16, 2018 at 08:48:08PM -0500, Will Dietz wrote:
> On Wed, May 16, 2018 at 6:04 PM, Rich Felker <dalias@...c.org> wrote:
> > On Wed, May 16, 2018 at 12:22:36PM -0500, Will Dietz wrote:
> >> I admit to being a bit unsure, but the behavior shown below doesn't
> >> seem obviously right --LMK if I'm missing something :).
> >>
> >> Input file attached for inspection without relying on it getting
> >> through byte-identical to what I have--
> >> indeed I'm not sure copy+paste into this is working correctly (the
> >> characters look different in my terminal :)). Anyway:
> >>
> >> $ cat cp1255-snippet.xxd
> >> 00000000: efac b3d6 b8d7 9d0a                      ........
> >> $ xxd -r cp1255-snippet.xxd
> >> דָּם
> >>
> >> Attempt to round-trip this from UTF-8 to CP1255 and back,
> >> first with glibc's iconv (2.26):
> >>
> >> $ xxd -r cp1255-snippet.xxd|iconv -f UTF-8 -t CP1255|iconv -f CP1255
> >> -t UTF-8 | xxd
> >> 00000000: efac b3d6 b8d7 9d0a
> >>
> >> Looks good, same as what was sent in.
> >>
> >> Using musl-based iconv utility (1.1.19):
> >> $ xxd -r cp1255-snippet.xxd|$ICONV -f UTF-8 -t CP1255|$ICONV -f CP1255
> >> -t UTF-8 | xxd
> >> 00000000: 2ad6 b8d7 9d0a                           *.....
> >>
> >> Indeed, the result looks different than what was started with:
> >>
> >> *ָם
> >>
> >> (again apologies if that doesn't survive mailing and such)
> >>
> >> This input was taken from gnu libiconv's test suite, in particular the
> >> first line of tests/CP1255-snippet.UTF-8. Since it's 2 characters,
> >> and test data, I hope there's no problem re:licensing O:).
> >>
> >> I've reproduced the same behavior using iconv() directly, I can share
> >> that if that would be preferable. It's the same code from earlier
> >> iconv threads on the ML.
> >
> > No need; it's easy to reproduce, and I'm leaning towards saying the
> > test is invalid. U+FB33 is a precomposed ligature form (from the
> > Alphabetic Presentation Forms block), roughly equivalent in status to
> > stuff like "ﬁ" (U+FB01). An iconv implementation could perform an
> > approximate conversion for such characters, returning a positive value
> > indicating the number of such substitutions made, but silently
> > converting it in a lossy way is not conforming, and there's
> > apparently no lossless way to convert it since CP1255 has no dedicated
> > character slot for it (at least based on the definition of the
> > codepage I'm using).
>
> Thanks for looking into this and for the great information!
> I'll investigate more tomorrow, but wanted to respond to your inquiry
> since it's easy to produce and might help explain things :).

Any further findings?

> > Do you know how/why they expect it to round-trip? What does glibc do
> > when converting it -- can you show the intermediate (CP1255) form as a
> > hexdump?
>
> Sure!
>
> Here's the intermediates for libiconv first, then w/musl:
>
> $ cat libiconv-cp1255.xxd
> 00000000: e3cc c8ed 0a                             .....
> $ cat musl-iconv-cp1255.xxd
> 00000000: 2ac8 ed0a                                *...

This is a plausible/reasonable conversion GNU iconv is doing...

> Here's what happens when each of these are fed through both:
>
> ---- using libiconv's intermediate:
> $ xxd -r ./libiconv-cp1255.xxd|iconv -f CP1255 -t UTF-8
> דָּם
> $ xxd -r ./libiconv-cp1255.xxd|iconv -f CP1255 -t UTF-8|xxd
> 00000000: efac b3d6 b8d7 9d0a                      ........
> $ xxd -r ./libiconv-cp1255.xxd|$ICONV -f CP1255 -t UTF-8
> דָּם
> $ xxd -r ./libiconv-cp1255.xxd|$ICONV -f CP1255 -t UTF-8|xxd
> 00000000: d793 d6bc d6b8 d79d 0a                   .........

...but the GNU iconv behavior here is completely unreasonable/wrong.
The first character it outputs is a presentation form for a ligature.
There is no reason iconv should be doing this kind of renormalization
when the original representation as two separate characters is
available in the dest charset.

Rich
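
The byte sequences quoted in this thread can be checked without either iconv implementation. The following is only a minimal sketch: it assumes Python's built-in `cp1255` codec matches the codepage definition under discussion, and `expand_canonical` is a hypothetical helper written for this illustration, not something musl, glibc or libiconv provides. It shows that U+FB33 has no slot in CP1255, that expanding it to its canonical decomposition (U+05D3 U+05BC) gives the intermediate bytes reported for libiconv, and that decoding those bytes yields the two-character form musl prints.

```python
import unicodedata

def expand_canonical(text):
    """Hypothetical helper: replace each character that has a canonical
    decomposition with that decomposition, one level deep, without
    reordering combining marks (so this is not full NFD)."""
    out = []
    for ch in text:
        mapping = unicodedata.decomposition(ch)
        # Compatibility mappings start with a "<tag>"; skip those.
        if mapping and not mapping.startswith("<"):
            out.extend(chr(int(cp, 16)) for cp in mapping.split())
        else:
            out.append(ch)
    return "".join(out)

# The test input from the thread: UTF-8 bytes ef ac b3 d6 b8 d7 9d 0a.
original = "\ufb33\u05b8\u05dd\n"

# Encoding directly fails, because CP1255 has no slot for U+FB33.
try:
    direct = original.encode("cp1255")
except UnicodeEncodeError:
    direct = None

# Expanding U+FB33 to DALET + DAGESH makes the text representable.
cp1255_bytes = expand_canonical(original).encode("cp1255")
print(cp1255_bytes.hex())            # e3ccc8ed0a

# Decoding maps each byte to one code point; nothing is recomposed.
decoded = cp1255_bytes.decode("cp1255")
print(decoded.encode("utf-8").hex()) # d793d6bcd6b8d79d0a

# Input and round-trip output differ as code point sequences but are
# canonically equivalent: they agree once both are normalized to NFD.
same_after_nfd = (unicodedata.normalize("NFD", original)
                  == unicodedata.normalize("NFD", decoded))
```

The decoded result here matches the musl output for libiconv's intermediate file, since a plain byte-to-code-point table has no reason to produce the presentation form.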