libc-coord - c8rtowc and wcrtoc8

Message-ID: <87cy2jqazg.fsf@keithp.com>
Date: Thu, 05 Feb 2026 10:52:19 -0800
From: Keith Packard <keithp@...thp.com>
To: libc-coord@...ts.openwall.com
Subject: c8rtowc and wcrtoc8

I found this proposal from 2019 titled 'Restartable and Non-Restartable
Functions for Efficient Character Conversions':

        https://www.open-std.org/JTC1/SC22/WG14/www/docs/n2440.pdf

It seems to predate the C23 changes to uchar.h which added the char8_t
type and the mbrtoc8 and c8rtomb functions. While I'm sure the addition
of mbrtoc8 and c8rtomb is useful to someone, it seems like they miss the
whole point of n2440 -- we need APIs which don't depend upon locale
settings for handling UTF-8 strings.

The above paper suggests a simple solution for C libraries where the
wchar_t encoding is always UCS-4/UTF-32 -- provide conversion between
UTF-8 and wide characters via new functions, the lowest-level ones being
wcrtoc8 and c8rtowc.

For libraries already implementing mbrtoc8 and c8rtomb, the additions
should be trivial -- both functions already pass through an
intermediate UCS-4/UTF-32 value internally, and that value maps
directly to a wchar_t.

Because the char8_t encoding is not stateless, something like mbstate_t
is required. In my c8rtomb/mbrtoc8 code, I'm re-using the internals of
that structure for UTF-8 state, but that seems messy, so I propose a new
c8state_t structure. That leaves us with the following basic APIs:

    C8_LEN_MAX

    c8state_t

    #include <uchar.h>

    int c8rtowc(wchar_t * restrict pwc, char8_t c8, c8state_t * restrict ps);

        Returns 1 if c8 completes a character; the result is stored in
        *pwc.

        Returns 0 if c8 contributes to an incomplete (but potentially
        valid) character.

        Returns -1 if an encoding error occurs; the value of the macro
        EILSEQ is stored in errno and the conversion state is unspecified.

    size_t wcrtoc8(char8_t * restrict s, wchar_t wc);

        Returns the number of bytes stored in the array pointed to by s.
        At most C8_LEN_MAX bytes are stored.

        When wc is not a valid wide character, an encoding error occurs:
        the function stores the value of the macro EILSEQ in errno and
        returns (size_t) -1.
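
To make the wcrtoc8 contract above concrete, here is a minimal sketch of
one way it could be implemented. It is not taken from any existing libc.
The sk_ prefix marks every name as hypothetical, sk_char8_t stands in
for char8_t so the code builds without C23 <uchar.h>, SK_C8_LEN_MAX is
assumed to be 4, and wchar_t is assumed to hold UCS-4/UTF-32 values.
Treating surrogates and values above 0x10FFFF as invalid wide characters
is also my assumption.

```c
#include <errno.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-ins for the proposed char8_t and C8_LEN_MAX. */
typedef unsigned char sk_char8_t;
#define SK_C8_LEN_MAX 4

/* Sketch of wcrtoc8: encode one UCS-4 wide character as UTF-8.
 * Returns the number of bytes stored in s (at most SK_C8_LEN_MAX),
 * or (size_t)-1 with errno set to EILSEQ for an invalid value. */
size_t sk_wcrtoc8(sk_char8_t *restrict s, wchar_t wc)
{
    /* A negative wchar_t becomes a huge value and is rejected below. */
    unsigned long c = (unsigned long)wc;

    if (c < 0x80) {
        s[0] = (sk_char8_t)c;
        return 1;
    }
    if (c < 0x800) {
        s[0] = (sk_char8_t)(0xC0 | (c >> 6));
        s[1] = (sk_char8_t)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c >= 0xD800 && c <= 0xDFFF) {
        /* Surrogates are not valid scalar values (assumed invalid). */
        errno = EILSEQ;
        return (size_t)-1;
    }
    if (c < 0x10000) {
        s[0] = (sk_char8_t)(0xE0 | (c >> 12));
        s[1] = (sk_char8_t)(0x80 | ((c >> 6) & 0x3F));
        s[2] = (sk_char8_t)(0x80 | (c & 0x3F));
        return 3;
    }
    if (c <= 0x10FFFF) {
        s[0] = (sk_char8_t)(0xF0 | (c >> 18));
        s[1] = (sk_char8_t)(0x80 | ((c >> 12) & 0x3F));
        s[2] = (sk_char8_t)(0x80 | ((c >> 6) & 0x3F));
        s[3] = (sk_char8_t)(0x80 | (c & 0x3F));
        return 4;
    }
    errno = EILSEQ;
    return (size_t)-1;
}
```

Since the encoder keeps no state between calls, this sketch needs no
state argument, which matches the two-argument signature above.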

The first one could look more like mbrtowc with an array of char8_t, but
in my experience applications very often misuse that API as it is
extremely fussy.
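
As an illustration of the byte-at-a-time shape, here is a rough sketch
of a c8rtowc-style decoder together with the loop a caller might write
around it. Everything here is an assumption for illustration: the sk_
names, the layout of the state structure, and the choice to reset the
state after an error (the description above leaves the state unspecified
in that case). wchar_t is assumed to hold UCS-4/UTF-32 values.

```c
#include <errno.h>
#include <stddef.h>
#include <wchar.h>

typedef unsigned char sk_char8_t;   /* stand-in for char8_t */

/* Hypothetical c8state_t layout; all-zero is the initial state. */
typedef struct {
    unsigned long cp;       /* code point bits accumulated so far */
    int remaining;          /* continuation bytes still expected */
    sk_char8_t lo, hi;      /* allowed range for the next byte */
} sk_c8state_t;

/* Sketch of c8rtowc: 1 = character complete, 0 = need more, -1 = error. */
int sk_c8rtowc(wchar_t *restrict pwc, sk_char8_t c8, sk_c8state_t *restrict ps)
{
    if (ps->remaining == 0) {
        if (c8 < 0x80) {
            *pwc = (wchar_t)c8;
            return 1;
        }
        ps->lo = 0x80;
        ps->hi = 0xBF;
        if (c8 >= 0xC2 && c8 <= 0xDF) {
            ps->remaining = 1;
            ps->cp = c8 & 0x1F;
        } else if (c8 >= 0xE0 && c8 <= 0xEF) {
            ps->remaining = 2;
            ps->cp = c8 & 0x0F;
            if (c8 == 0xE0) ps->lo = 0xA0;   /* reject overlong forms */
            if (c8 == 0xED) ps->hi = 0x9F;   /* reject surrogates */
        } else if (c8 >= 0xF0 && c8 <= 0xF4) {
            ps->remaining = 3;
            ps->cp = c8 & 0x07;
            if (c8 == 0xF0) ps->lo = 0x90;   /* reject overlong forms */
            if (c8 == 0xF4) ps->hi = 0x8F;   /* reject > U+10FFFF */
        } else {
            goto fail;                       /* stray or invalid lead byte */
        }
        return 0;
    }
    if (c8 < ps->lo || c8 > ps->hi)
        goto fail;
    ps->lo = 0x80;
    ps->hi = 0xBF;
    ps->cp = (ps->cp << 6) | (c8 & 0x3F);
    if (--ps->remaining > 0)
        return 0;
    *pwc = (wchar_t)ps->cp;
    return 1;
fail:
    ps->remaining = 0;      /* sketch choice: return to the initial state */
    errno = EILSEQ;
    return -1;
}

/* Example caller: decode n bytes into dst, which must have room for n
 * wide characters. Returns the character count, or -1 on bad or
 * truncated input. */
long sk_decode(const sk_char8_t *src, size_t n, wchar_t *dst)
{
    sk_c8state_t st = {0};
    long count = 0;
    for (size_t i = 0; i < n; i++) {
        int r = sk_c8rtowc(&dst[count], src[i], &st);
        if (r < 0)
            return -1;
        count += r;         /* r is 0 or 1 */
    }
    return st.remaining == 0 ? count : -1;
}
```

The caller never passes a length and only has three small return values
to handle, and the return value can be added directly to the output
count.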

To resolve performance concerns, bulk conversion APIs (perhaps c8stowcs
and wcstoc8s?) would be added. I hesitate to propose APIs for those as I
feel that the current definitions of mbstowcs and wcstombs provide poor
guidance.

-- 
-keith
