john-dev - Re: episerver UTF-8

Message-ID: <d53ef8e70b5b09e23824e604a2f92e7f@smtp.hushmail.com>
Date: Fri, 14 Aug 2015 11:18:40 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: episerver UTF-8

On 2015-08-14 08:04, Frank Dittrich wrote:
> On 08/14/2015 03:07 AM, jfoug@....net wrote:
>> On Thu, 13 Aug 2015 19:35:57 -0500, Lei Zhang <zhanglei.april@...il.com>
>> wrote:
>>> BTW, I think 3*PLAINTEXT_LENGTH means that we assume
>>
>> Yes, this is an 'assumption'

No, it is not. It's always correct.

>>> each UTF8 char to be no larger than 3 bytes. Is that assumption true?
>>> Or 4-byte UTF8 chars are too rare to be considered?
>>
>> In real world, they are somewhat rare.  But your point is valid.  There
>> could certainly be a string of X 4 byte utf8 (there are even 5 byte utf8
>> characters) which cause something that should handle 25 characters to
>> not be able to handle a string of 25 4 (or 5) byte utf8. But we simply
>> have drawn a line in the sand where reality vs theoretical limits come
>> into play.
>
> For applications that use UTF-16 with surrogates internally, the above
> assumption is OK. If you enter characters that require more than three
> bytes when converted to utf-8, the max. number of characters will be
> reduced accordingly.

Exactly.

1 byte of UTF-8 = 2 octets of UTF-16
2 bytes of UTF-8 = 2 octets of UTF-16
3 bytes of UTF-8 = 2 octets of UTF-16
4 bytes of UTF-8 = 4 octets of UTF-16

For all of the above, the UTF-16 is *one* character.
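
A quick way to check the table above is the minimal Python sketch below. It is only an illustration: the helper name utf8_and_utf16_sizes is made up here and is not John the Ripper code. It encodes one character from each row and confirms that the UTF-8 size never exceeds 3 bytes per 16-bit UTF-16 code unit, which is why a buffer of 3*PLAINTEXT_LENGTH bytes suffices when PLAINTEXT_LENGTH is counted in UTF-16 code units (an assumption about how the limit is counted, following Frank's remark).

```python
# Hypothetical helper for illustration only; not part of John the Ripper.
def utf8_and_utf16_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 octet count) for a single character."""
    utf8_bytes = len(ch.encode("utf-8"))
    # "utf-16-le" avoids the 2-byte BOM that plain "utf-16" would prepend.
    utf16_octets = len(ch.encode("utf-16-le"))
    return utf8_bytes, utf16_octets


# One sample character per row of the table: 1, 2, 3 and 4 bytes of UTF-8.
samples = ("A", "\u00e9", "\u20ac", "\U0001F600")

for ch in samples:
    u8, u16 = utf8_and_utf16_sizes(ch)
    code_units = u16 // 2  # 16-bit UTF-16 code units
    # Worst case is 3 bytes of UTF-8 per UTF-16 code unit; a 4-byte
    # sequence costs two code units (a surrogate pair), i.e. 2 bytes each.
    assert u8 <= 3 * code_units
    print(f"U+{ord(ch):04X}: {u8} byte(s) UTF-8, {u16} octets UTF-16")
```

The 4-byte case is the interesting one: it uses up two code units of the UTF-16 limit, so the maximum number of characters drops, but the 3x byte bound still holds.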

magnum
