|
|
Message-ID: <20230625172619.GR4163@brightrain.aerifal.cx>
Date: Sun, 25 Jun 2023 13:26:20 -0400
From: Rich Felker <dalias@...c.org>
To: Jens Gustedt <Jens.Gustedt@...ia.fr>
Cc: musl@...ts.openwall.com
Subject: Re: [C23 string conversion 1/2] C23: implement the c8rtomb
and mbrtoc8 functions
Following up on some specifics of code I hadn't reviewed in enough
detail yet...
On Wed, May 31, 2023 at 04:01:43PM +0200, Jens Gustedt wrote:
> The implementations each separate two cases: a fast path that
> corresponds to calls with state variables that are in the initial
> state and a leading input character that is ASCII; the other cases are
> handled by a function for the slow path. These functions are
> implemented such that the fast versions should not need stack
> variables of their own, and thus can tail call into the slow path with
> a jmp instruction if necessary. The fast versions could typically be
> also used as shallow inline wrapper, if we like to go that way.
>
> Only the implementation of mbrtoc8 is a bit tricky. This is because of
> the weird conventions for dripping the output bytes one call after the
> other. This is handled by using the state variable as a stack for the
> up to three characters that are still to be sent to the output.
> ---
> include/uchar.h | 6 +++++
> src/multibyte/c8rtomb.c | 31 +++++++++++++++++++++++++
> src/multibyte/mbrtoc8.c | 51 +++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 88 insertions(+)
> create mode 100644 src/multibyte/c8rtomb.c
> create mode 100644 src/multibyte/mbrtoc8.c
>
> diff --git a/include/uchar.h b/include/uchar.h
> index 7e5c4d40..9f6a3706 100644
> --- a/include/uchar.h
> +++ b/include/uchar.h
> @@ -9,6 +9,9 @@ extern "C" {
> typedef unsigned short char16_t;
> typedef unsigned char32_t;
> #endif
> +#if __cplusplus < 201811L
> +typedef unsigned char char8_t;
> +#endif
>
> #define __NEED_mbstate_t
> #define __NEED_size_t
> @@ -16,6 +19,9 @@ typedef unsigned char32_t;
> #include <features.h>
> #include <bits/alltypes.h>
>
> +size_t c8rtomb(char *__restrict, char8_t, mbstate_t *__restrict);
> +size_t mbrtoc8(char8_t *__restrict, const char *__restrict, size_t, mbstate_t *__restrict);
> +
> size_t c16rtomb(char *__restrict, char16_t, mbstate_t *__restrict);
> size_t mbrtoc16(char16_t *__restrict, const char *__restrict, size_t, mbstate_t *__restrict);
>
> diff --git a/src/multibyte/c8rtomb.c b/src/multibyte/c8rtomb.c
> new file mode 100644
> index 00000000..8645e112
> --- /dev/null
> +++ b/src/multibyte/c8rtomb.c
> @@ -0,0 +1,31 @@
> +#include <uchar.h>
> +#include <wchar.h>
> +
> +__attribute__((__noinline__))
Where we use this so far is conditioned on #ifdef __GNUC__. I'm not
sure if it's actually helping, but I think it'd make sense to be
consistent.
> +static size_t c8rtomb_slow(char *__restrict s, unsigned char c8, mbstate_t *__restrict st)
> +{
> + // We need an internal state different from mbrtowc.
> + static mbstate_t internal_state;
> + if (!st) st = &internal_state;
> + wchar_t wc;
> +
> + // mbrtowc has return values -2, -1, 0, 1.
> + size_t res = mbrtowc(&wc, (char const*)&c8, 1, st);
> + switch (res) {
> + // our implementation of wcrtomb ignores the state
> + case 1: res = wcrtomb(s, wc, 0); break;
> + case 0: res = 1; if (s) *s = 0; break;
> + }
> + return res;
> +}
At first I was confused by the conversion back and forth, but indeed
it absolutely makes sense to handle the weird buffering this interface
imposes a requirement for.
However, it seems to be malfunctioning in the C locale, where
arbitrary bytes of c8 input would be accepted (despite not being valid
UTF-8, as the interface requires).
Logically, the right thing to do is to set the calling thread's locale
to __c_dot_utf8_locale for the duration of the mbrtowc only. However,
I think there may be simpler solutions depending on what you interpret
the interface requirement as being.
If c8rtomb is required to accept valid UTF-8 up to the penultimate
byte then error only on storing it due to the locale not being able to
represent it, I don't see a good solution except what I just said.
But if it's permitted to preemptively say "sorry, nope, not gonna be
able to represent that" as soon as it sees the first byte, then when
MB_CUR_MAX==1, having c8rtomb_slow fail unconditionally with EILSEQ
seems like the best and easiest approach.
> +__attribute__((__noinline__))
> +static size_t mbrtoc8_slow(unsigned char *__restrict pc8, const char *__restrict src, size_t n, mbstate_t *__restrict st)
> +{
> + static unsigned internal_state;
> + wchar_t wc;
> + unsigned* pending = st ? (void*)st : &internal_state;
> +
> + // The high bit is set if there is still missing input. If it
> + // is not set and the value is not zero, a previous call has
> + // stacked the missing output bytes withing the state.
> + if ((-*pending) > INT_MIN) {
> + if (pc8) *pc8 = *pending;
> + (*pending) >>= 8;
> + return -3;
> + }
> +
> + // mbrtowc has return values -2, -1, 0, 1, ..., 4.
> + size_t res = mbrtowc(&wc, src, n, (void*)pending);
> + if (res <= 4) {
> + _Alignas(unsigned) unsigned char s[8] = { 0 };
> + // Write the result bytes to s over the word boundary
> + // as we need it. Our wcrtomb implementation ignores
> + // the state variable, there will be no errors and the
> + // return value can be ignored.
> + wcrtomb((void*)(s+3), wc, 0);
> + // Depending on the endianess this maybe a single load
> + // instruction. We want to be sure that the bytes are
> + // in this order, and that the high-order byte is 0.
> + *pending = (0u|s[4])<<000 | (0u|s[5])<<010 | (0u|s[6])<<020 | (0u|s[7])<<030;
> + if (pc8) *pc8 = s[3];
> + }
> + return res;
> +}
Does _Alignas work the way you want it on an array like this? I was
under the impression you would need a structure to do that in a way
that actually meshes with the standard semantics for _Alignas and not
the GCC attribute's special behaviors.
A cleaner approach might be a union where the presence of the unsigned
union member yields the alignment. This would also be good since the
code is written in C99 which doesn't have _Alignas.
Overall it seems like a lot of complexity effort to set things up for
an efficient store, but if anyone's going to use this interface it's
probably worthwhile.
> +weak_alias(__mbrtoc8, mbrtoc8);
I think we already discussed this in another place, but in case I'm
misremembering, these aliases aren't needed unless you need a
namespace-safe way to call the function internally.
Rich
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.