Date: Wed, 16 May 2018 19:04:25 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: iconv UTF-8 <--> CP1255 roundtrip possible bug?

On Wed, May 16, 2018 at 12:22:36PM -0500, Will Dietz wrote:
> I admit to being a bit unsure, but the behavior shown below doesn't
> seem obviously right --LMK if I'm missing something :).
> 
> Input file attached for inspection without relying on it getting
> through byte-identical to what I have--
> indeed I'm not sure copy+paste into this is working correctly (the
> characters look different in my terminal :)).  Anyway:
> 
> $ cat cp1255-snippet.xxd
> 00000000: efac b3d6 b8d7 9d0a                      ........
> $ xxd -r cp1255-snippet.xxd
> דָּם
> 
> Attempt to round-trip this from UTF-8 to CP1255 and back,
> first with glibc's iconv (2.26):
> 
> $ xxd -r cp1255-snippet.xxd | iconv -f UTF-8 -t CP1255 | iconv -f CP1255 -t UTF-8 | xxd
> 00000000: efac b3d6 b8d7 9d0a
> 
> Looks good, same as what was sent in.
> 
> Using musl-based iconv utility (1.1.19):
> $ xxd -r cp1255-snippet.xxd | $ICONV -f UTF-8 -t CP1255 | $ICONV -f CP1255 -t UTF-8 | xxd
> 00000000: 2ad6 b8d7 9d0a                           *.....
> 
> Indeed, the result looks different than what was started with:
> 
> *ָם
> 
> (again apologies if that doesn't survive mailing and such)
> 
> This input was taken from gnu libiconv's test suite, in particular the
> first line of tests/CP1255-snippet.UTF-8.  Since it's 2 characters,
> and test data, I hope there's no problem re:licensing O:).
> 
> I've reproduced the same behavior using iconv() directly, I can share
> that if that would be preferable. It's the same code from earlier
> iconv threads on the ML.

No need; it's easy to reproduce, and I'm leaning towards saying the
test is invalid. U+FB33 is a precomposed ligature form (from the
Alphabetic Presentation Forms block), roughly equivalent in status to
stuff like the "fi" ligature (U+FB01). An iconv implementation could
perform an approximate conversion for such characters, returning a
positive value indicating the number of such substitutions made, but
silently converting it in a lossy way is not conforming, and there's
apparently no lossless way to convert it, since CP1255 has no
dedicated character slot for it (at least based on the definition of
the codepage I'm using).
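
As a quick cross-check of the "no dedicated slot" claim, here is a
minimal sketch using Python's bundled cp1255 codec. That codec is an
assumption standing in for the codepage definition mentioned above; it
is not musl's or glibc's table, and the helper name encodable() is
made up for illustration. It only shows that U+FB33 has a canonical
decomposition whose pieces are encodable while the precomposed form is
not.

```python
import unicodedata

LIGATURE = "\ufb33"  # HEBREW LETTER DALET WITH DAGESH


def encodable(text, encoding="cp1255"):
    """Return True if every character of text has a slot in the codec."""
    try:
        text.encode(encoding)  # strict mode: raises instead of substituting
    except UnicodeEncodeError:
        return False
    return True


# Canonical decomposition of the lone precomposed character.
decomposed = unicodedata.normalize("NFD", LIGATURE)

print([f"U+{ord(c):04X}" for c in decomposed])
print("precomposed encodable:", encodable(LIGATURE))
print("decomposed encodable: ", encodable(decomposed))
```

Whether an iconv implementation should apply such a decomposition on
its own is exactly the open question here; the sketch does not settle
it.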

Do you know how/why they expect it to round-trip? What does glibc do
when converting it -- can you show the intermediate (CP1255) form as a
hexdump?

> --------------
> 
> Hopefully this is useful!
> 
> On the subject, a question or two if it's not too much trouble:
> 
> * is the above what's meant by "round-trip" as discussed in[1]?
> * What sorts of "round-trip" conversions are expected to work? And
> over what inputs should round-trip conversions work-- for any "valid"
> UTF-8 or so?

Any UTF-8 whose content is representable in the encoding you're asking
about round-tripping through, i.e. where the first iconv returns 0.
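
The criterion above can be sketched as a testable property. This is a
rough Python analogy, not the iconv() API: a strict encode that raises
on unrepresentable characters stands in for "the first iconv returns
0", and Python's cp1255 codec is assumed as the target table. The
function name round_trips() is hypothetical.

```python
def round_trips(data: bytes, encoding: str = "cp1255") -> bool:
    """True if UTF-8 data survives UTF-8 -> encoding -> UTF-8 unchanged.

    Inputs that are not fully representable in the target encoding are
    reported as False rather than being converted lossily.
    """
    text = data.decode("utf-8")
    try:
        legacy = text.encode(encoding)  # strict: no silent substitution
    except UnicodeEncodeError:
        return False
    return legacy.decode(encoding).encode("utf-8") == data


# Bytes from the xxd dump in the report (precomposed U+FB33 first).
precomposed = bytes.fromhex("efacb3d6b8d79d0a")

# Same text with U+FB33 written as U+05D3 U+05BC (dalet + dagesh).
decomposed = "\u05d3\u05bc\u05b8\u05dd\n".encode("utf-8")

print("precomposed:", round_trips(precomposed))
print("decomposed: ", round_trips(decomposed))
```

Under this definition only inputs for which the strict encode succeeds
are in scope, which is the set a round-trip test would want to
enumerate.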

> Armed with some insights regarding these questions, I'm hoping to
> scope out something that can be tested or (no promises!) perhaps
> pushed through some formal verification goodness.  But also I'm just
> curious :).

Yay!

Rich
