Message-ID: <CADDzAfOpS69iRh06ML=OpUfZ0ZvbSJHp3mJnK8TiXMK5B1CxGQ@mail.gmail.com>
Date: Sat, 5 Apr 2025 05:12:08 +0800
From: Kang-Che Sung <explorer09@...il.com>
To: musl@...ts.openwall.com
Subject: wcrtomb in UTF-8 locale should check the multibyte state

Hello.

I'm reporting behavior that I believe is a bug.

Even though UTF-8 is not a stateful encoding (it has no shift
sequences), wide character conversion functions such as wcrtomb should
still check the mbstate_t object when the caller supplies one.

Here is a use case that shows why this matters:
(1) I call mbrtowc (or sometimes mbsnrtowcs) to scan a few bytes from
a multibyte string.
(2) Then I call wcrtomb at the appropriate offset into the multibyte
string buffer, to "overwrite" or "append to" the end of the buffer.

In this case, the mbstate_t object after the mbrtowc call might
contain an incomplete UTF-8 sequence. When writing a new character
through wcrtomb, I should be able to tell whether appending that
character would make the string ill-formed. If wcrtomb ignores the
mbstate_t object, it can mask an error. (An incomplete sequence in
mbrtowc does not set errno=EILSEQ, but writing a new sequence right
after the incomplete one makes that incomplete sequence _invalid_, and
thus wcrtomb should fail with EILSEQ.)

Example code:

```c
#include <limits.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void) {
    // Preloaded character: U+306B Hiragana Letter Ni
    char buf[256] = "\xE3\x81\xAB";

    setlocale(LC_ALL, "en_US.UTF-8");

    mbstate_t state;
    memset(&state, 0, sizeof(state));
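    wchar_t wc = 0;
    // Feed only the first 2 of the 3 bytes: mbrtowc returns (size_t)-2
    // and leaves the partial sequence in `state`.
    size_t len = mbrtowc(&wc, buf, 2, &state);
    printf("%zu %04x\n", len, (unsigned int)wc);

    // Write a new character right after the incomplete sequence,
    // passing the same (non-initial) state.
    wc = (wchar_t)0;
    len = wcrtomb(&buf[2], wc, &state);
    printf("%zu\n", len);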
}
```

Actual result (with musl libc 1.2.5 on Arch Linux x86-64):

```text
18446744073709551614 0000
1
```

Expected result (wcrtomb returns (size_t)-1, with errno set to EILSEQ):

```text
18446744073709551614 0000
18446744073709551615
```

Note: The C standard _allows_ reusing an mbstate_t object across
different multibyte conversion functions. It is _not undefined
behavior_ when the mbstate_t object is used for the _same string_ in
the _same locale_, and thus the example code above should be a valid
use.

On macOS 15.4 the code shows the expected behavior, but both glibc and
musl libc seem to have the bug.
