Date: Tue, 27 Feb 2018 12:34:26 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: iconv failure (ISO-2022-JP) since musl update on AlpineLinux

On Tue, Feb 27, 2018 at 05:57:04PM +0100, Steffen Nurpmeso wrote:
> Hello.
>
> After updating to musl-1.1.19-r0 there i saw test failures for the
> MUA i maintain, namely regarding the mentioned charset. I will
> attach a file to reproduce. (Am not subscribed.)
> Ciao!
>
> #?0[steffen@...on steffen]$ cksum in.utf
> 1259742080 686 in.utf
> #?0[steffen@...on steffen]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
> 2184132317 536
> #?0[steffen@...on steffen]$ iconv --version
> iconv (GNU libiconv 1.11)
> ...
> #?0[steffen@...ex tmp]$ cksum in.utf
> 1259742080 686 in.utf
> #?0[steffen@...ex tmp]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
> 209789743 1736
> #?0[steffen@...ex tmp]$ apk info --who-owns /usr/bin/iconv
> /usr/bin/iconv is owned by musl-utils-1.1.19-r0

Does the data round-trip correctly? I don't think you can expect
bitwise match between outputs of different ISO-2022-JP converters,
unless perhaps they both guarantee minimality, because the ISO-2022-JP
representation of a string is highly nonunique.

In particular musl's to-ISO-2022-JP converter is stateless and always
generates shifts in/out around every non-ASCII character. Of course
this is highly suboptimal, but in the worst case (where the caller
calls iconv one character at a time) the iconv API can't do any better
because strings are required to end in the unshifted state, and the
iconv API doesn't have any method to "finalize" a conversion. This
implies that every time iconv returns with non-ASCII as the most
recent output character, it must be followed by a shift back to the
initial (ASCII) state.
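The nonuniqueness can be illustrated with a small Python sketch. This is not musl's code: it uses Python's built-in `iso2022_jp` codec as a stand-in, and the helper names `encode_per_character` and `encode_batch` are hypothetical, made up for this illustration. Encoding one character at a time models a stateless converter that wraps every non-ASCII character in its own shifts; encoding in one call lets adjacent characters share them.

```python
def encode_per_character(text: str) -> bytes:
    """Model a stateless converter (an assumption, not musl's implementation).

    Each character is encoded on its own, so every non-ASCII character
    is preceded by a shift-in and followed by a shift back to ASCII.
    """
    return b"".join(ch.encode("iso2022_jp") for ch in text)


def encode_batch(text: str) -> bytes:
    """Encode the whole string in one call, so adjacent characters of the
    same character set share a single pair of shifts."""
    return text.encode("iso2022_jp")


if __name__ == "__main__":
    sample = "日本語 text"
    wrapped = encode_per_character(sample)
    compact = encode_batch(sample)

    # The byte streams differ in content and in length...
    print(len(wrapped), len(compact), wrapped != compact)

    # ...yet both decode back to the same string, which is the
    # round-trip property that matters when comparing converters.
    print(wrapped.decode("iso2022_jp") == sample)
    print(compact.decode("iso2022_jp") == sample)
```

A checksum comparison such as `cksum` would report these two outputs as different even though both are valid encodings of the same text, so a round-trip check is the more meaningful test.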
We could improve this in the case of batch conversions by overwriting
the previous shift-back-to-initial and skipping the next shift if the
character set of the next character to output matches the previous
one, but that only works within a single batch call, since iconv can't
write outside the buffer passed to it for the current call. This is an
improvement I think I want to make, since it would improve typical
output size a lot, but the cost is that the output would no longer be
deterministic under different chunking by the caller.

Rich
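The batch improvement described above, and its chunking cost, can be sketched in Python. This is an illustrative model under stated assumptions, not musl's implementation: `convert_call` and `convert_in_chunks` are hypothetical names, Python's `iso2022_jp` codec is used only to look up the two-byte JIS X 0208 code for a character, and every non-ASCII character in the input is assumed to belong to JIS X 0208.

```python
ESC_JIS = b"\x1b$B"    # shift to JIS X 0208
ESC_ASCII = b"\x1b(B"  # shift back to the initial (ASCII) state


def convert_call(chunk: str) -> bytes:
    """Model one conversion call that must end in the ASCII state."""
    out = bytearray()
    prev_was_jis = False  # only tracks characters written in this call
    for ch in chunk:
        if ord(ch) < 0x80:
            out += ch.encode("ascii")
            prev_was_jis = False
            continue
        enc = ch.encode("iso2022_jp")
        if not (enc.startswith(ESC_JIS) and enc.endswith(ESC_ASCII)):
            raise ValueError("sketch only handles JIS X 0208 characters")
        code = enc[len(ESC_JIS):-len(ESC_ASCII)]
        if prev_was_jis:
            # Overwrite the previous shift-back and skip the new shift-in.
            del out[-len(ESC_ASCII):]
        else:
            out += ESC_JIS
        out += code + ESC_ASCII
        prev_was_jis = True
    return bytes(out)


def convert_in_chunks(text: str, chunk_chars: int) -> bytes:
    """Feed the text to convert_call in pieces, as a caller might."""
    return b"".join(
        convert_call(text[i:i + chunk_chars])
        for i in range(0, len(text), chunk_chars)
    )


if __name__ == "__main__":
    text = "日本語のテキスト"
    for size in (len(text), 2, 1):
        data = convert_in_chunks(text, size)
        # Output size depends on chunking; the decoded text does not.
        print(size, len(data), data.decode("iso2022_jp") == text)
```

Shifts can only be merged inside one call, so the same input yields different byte streams depending on how the caller splits it, while each stream still decodes to the original text.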