Message-ID: <CADDzAfOpS69iRh06ML=OpUfZ0ZvbSJHp3mJnK8TiXMK5B1CxGQ@mail.gmail.com>
Date: Sat, 5 Apr 2025 05:12:08 +0800
From: Kang-Che Sung <explorer09@...il.com>
To: musl@...ts.openwall.com
Subject: wcrtomb in UTF-8 locale should check the multibyte state

Hello.

I'm reporting an issue that I believe is a bug. Even though UTF-8 is not an encoding that uses shift characters, the wide character encoding functions like wcrtomb should check the mbstate_t object anyway if the caller supplies one.

Here is a use case that shows why this matters:

(1) I call mbrtowc (or sometimes mbsnrtowcs) to scan a few bytes from a multibyte string.
(2) Then I call wcrtomb with the appropriate offset into the multibyte string buffer, to "overwrite" or "append to" the end of the buffer.

In this case, the mbstate_t object after the mbrtowc call might contain an incomplete UTF-8 sequence. When writing a new character through wcrtomb, I should be able to tell whether appending a character would cause the string to become ill-formed. If the mbstate_t object is ignored when calling wcrtomb, it can mask an error. (An incomplete sequence during mbrtowc does not set errno=EILSEQ, but writing a new sequence right after the incomplete one makes that incomplete sequence _invalid_, and thus the call should fail with EILSEQ.)

Here is example code:

```c
#include <limits.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void) {
    // Preloaded character: U+306B Hiragana Letter Ni
    char buf[256] = "\xE3\x81\xAB";

    setlocale(LC_ALL, "en_US.UTF-8");

    mbstate_t state;
    memset(&state, 0, sizeof(state));

    // Scan only the first 2 of the 3 bytes, leaving an
    // incomplete sequence recorded in `state`.
    wchar_t wc = 0;
    size_t len = mbrtowc(&wc, buf, 2, &state);
    printf("%zu %04x\n", len, (unsigned int)wc);

    // Write a new character right after the incomplete sequence,
    // reusing the same `state`.
    wc = (wchar_t)0;
    len = wcrtomb(&buf[2], wc, &state);
    printf("%zu\n", len);
}
```

Actual result (with musl libc 1.2.5 on Arch Linux x86-64):

```text
18446744073709551614 0000
1
```

Expected result:

```text
18446744073709551614 0000
18446744073709551615
```

Here 18446744073709551614 is (size_t)-2 (incomplete sequence) and 18446744073709551615 is (size_t)-1 (error) on a platform with a 64-bit size_t.

Note: The C standard _allows_ reusing an mbstate_t object across different multibyte conversion functions. It is _not undefined behavior_ when the mbstate_t object is used for the _same string_ in the _same locale_, and thus the example code above should be a valid use.

When I tested the code on macOS 15.4, it showed the expected behavior. But both glibc and musl libc seem to have the bug.