musl - Re: [C23 string conversion 1/2] C23: implement the c8rtomb and mbrtoc8 functions

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230626122833.7f0e73a8@inria.fr>
Date: Mon, 26 Jun 2023 12:28:33 +0200
From: Jₑₙₛ Gustedt <jens.gustedt@...ia.fr>
To: musl@...ts.openwall.com
Subject: Re: [C23 string conversion 1/2] C23: implement the c8rtomb
 and mbrtoc8 functions

Rich,

on Sun, 25 Jun 2023 13:26:20 -0400 you (Rich Felker <dalias@...c.org>)
wrote:

> Following up on some specifics of code I hadn't reviewed in enough
> detail yet...
> 
> On Wed, May 31, 2023 at 04:01:43PM +0200, Jens Gustedt wrote:
> > The implementations each separate two cases: a fast path that
> > corresponds to calls with state variables that are in the initial
> > state and a leading input character that is ASCII; the other cases
> > are handled by a function for the slow path. These functions are
> > implemented such that the fast versions should not need stack
> > variables of their own, and thus can tail call into the slow path
> > with a jmp instruction if necessary. The fast versions could
> > typically be also used as shallow inline wrapper, if we like to go
> > that way.
> > 
> > Only the implementation of mbrtoc8 is a bit tricky. This is because
> > of the weird conventions for dripping the output bytes one call
> > after the other. This is handled by using the state variable as a
> > stack for the up to three characters that are still to be sent to
> > the output. ---
> >  include/uchar.h         |  6 +++++
> >  src/multibyte/c8rtomb.c | 31 +++++++++++++++++++++++++
> >  src/multibyte/mbrtoc8.c | 51
> > +++++++++++++++++++++++++++++++++++++++++ 3 files changed, 88
> > insertions(+) create mode 100644 src/multibyte/c8rtomb.c
> >  create mode 100644 src/multibyte/mbrtoc8.c
> > 
> > diff --git a/include/uchar.h b/include/uchar.h
> > index 7e5c4d40..9f6a3706 100644
> > --- a/include/uchar.h
> > +++ b/include/uchar.h
> > @@ -9,6 +9,9 @@ extern "C" {
> >  typedef unsigned short char16_t;
> >  typedef unsigned char32_t;
> >  #endif
> > +#if  __cplusplus < 201811L
> > +typedef unsigned char char8_t;
> > +#endif
> >  
> >  #define __NEED_mbstate_t
> >  #define __NEED_size_t
> > @@ -16,6 +19,9 @@ typedef unsigned char32_t;
> >  #include <features.h>
> >  #include <bits/alltypes.h>
> >  
> > +size_t c8rtomb(char *__restrict, char8_t, mbstate_t *__restrict);
> > +size_t mbrtoc8(char8_t *__restrict, const char *__restrict,
> > size_t, mbstate_t *__restrict); +
> >  size_t c16rtomb(char *__restrict, char16_t, mbstate_t *__restrict);
> >  size_t mbrtoc16(char16_t *__restrict, const char *__restrict,
> > size_t, mbstate_t *__restrict); 
> > diff --git a/src/multibyte/c8rtomb.c b/src/multibyte/c8rtomb.c
> > new file mode 100644
> > index 00000000..8645e112
> > --- /dev/null
> > +++ b/src/multibyte/c8rtomb.c
> > @@ -0,0 +1,31 @@
> > +#include <uchar.h>
> > +#include <wchar.h>
> > +
> > +__attribute__((__noinline__))  
> 
> Where we use this so far is conditioned on #ifdef __GNUC__. I'm not
> sure if it's actually helping, but I think it'd make sense to be
> consistent.

ok, no pb

> > +static size_t c8rtomb_slow(char *__restrict s, unsigned char c8,
> > mbstate_t *__restrict st) +{
> > +	// We need an internal state different from mbrtowc.
> > +	static mbstate_t internal_state;
> > +	if (!st) st = &internal_state;
> > +	wchar_t wc;
> > +
> > +	// mbrtowc has return values -2, -1, 0, 1.
> > +	size_t res = mbrtowc(&wc, (char const*)&c8, 1, st);
> > +	switch (res) {
> > +	// our implementation of wcrtomb ignores the state
> > +	case  1: res = wcrtomb(s, wc, 0); break;
> > +	case  0: res = 1; if (s) *s = 0;  break;
> > +	}
> > +	return res;
> > +}  
> 
> At first I was confused by the conversion back and forth, but indeed
> it absolutely makes sense to handle the weird buffering this interface
> imposes a requirement for.

it is weird, indeed

> However, it seems to be malfunctioning in the C locale, where
> arbitrary bytes of c8 input would be accepted (despite not being valid
> UTF-8, as the interface requires).
> 
> Logically, the right thing to do is to set the calling thread's locale
> to __c_dot_utf8_locale for the duration of the mbrtowc only. However,
> I think there may be simpler solutions depending on what you interpret
> the interface requirement as being.
> 
> If c8rtomb is required to accept valid UTF-8 up to the penultimate
> byte then error only on storing it due to the locale not being able to
> represent it, I don't see a good solution except what I just said.
> 
> But if it's permitted to preemptively say "sorry, nope, not gonna be
> able to represent that" as soon as it sees the first byte, then when
> MB_CUR_MAX==1, having c8rtomb_slow fail unconditionally with EILSEQ
> seems like the best and easiest approach.

Good question. We should just be consistent and do the same whatever
we do when we pass a surrogate into `c16rtomb`. If I read that
correctly this does fail with `EILSEQ`.

The intent (IIRC what I did some weeks ago ;-) of this

> > +	size_t res = mbrtowc(&wc, (char const*)&c8, 1, st);

was to fail the same if `c8` is not start or continue of a valid
mb-character.

> > +__attribute__((__noinline__))
> > +static size_t mbrtoc8_slow(unsigned char *__restrict pc8, const
> > char *__restrict src, size_t n, mbstate_t *__restrict st) +{
> > +	static unsigned internal_state;
> > +	wchar_t wc;
> > +	unsigned* pending = st ? (void*)st : &internal_state;
> > +
> > +	// The high bit is set if there is still missing input. If
> > it
> > +	// is not set and the value is not zero, a previous call
> > has
> > +	// stacked the missing output bytes withing the state.
> > +	if ((-*pending) > INT_MIN) {
> > +		if (pc8) *pc8 = *pending;
> > +		(*pending) >>= 8;
> > +		return -3;
> > +	}
> > +
> > +	// mbrtowc has return values -2, -1, 0, 1, ..., 4.
> > +	size_t res = mbrtowc(&wc, src, n, (void*)pending);
> > +	if (res <= 4) {
> > +		_Alignas(unsigned) unsigned char s[8] = { 0 };
> > +		// Write the result bytes to s over the word
> > boundary
> > +		// as we need it. Our wcrtomb implementation
> > ignores
> > +		// the state variable, there will be no errors and
> > the
> > +		// return value can be ignored.
> > +		wcrtomb((void*)(s+3), wc, 0);
> > +		// Depending on the endianess this maybe a single
> > load
> > +		// instruction. We want to be sure that the bytes
> > are
> > +		// in this order, and that the high-order byte is
> > 0.
> > +		*pending = (0u|s[4])<<000 | (0u|s[5])<<010 |
> > (0u|s[6])<<020 | (0u|s[7])<<030;
> > +		if (pc8) *pc8 = s[3];
> > +	}
> > +	return res;
> > +}  
> 
> Does _Alignas work the way you want it on an array like this?

It is supposed to, yes.

> I was under the impression you would need a structure to do that in
> a way that actually meshes with the standard semantics for _Alignas
> and not the GCC attribute's special behaviors.

In the contrary, I think the support for this feature in `struct` had
first been forgotten in C11 (was it) and then added in C17.

> A cleaner approach might be a union where the presence of the unsigned
> union member yields the alignment. This would also be good since the
> code is written in C99 which doesn't have _Alignas.

This may indeed be better readable and portable to oldish C. I'll look
into this.

> Overall it seems like a lot of complexity effort to set things up for
> an efficient store, but if anyone's going to use this interface it's
> probably worthwhile.

Great.

Yes, this is supposed to be used to process streams of input.

I observed quite nice optimization on my platform with that
trick.

> > +weak_alias(__mbrtoc8, mbrtoc8);  
> 
> I think we already discussed this in another place, but in case I'm
> misremembering, these aliases aren't needed unless you need a
> namespace-safe way to call the function internally.

Yes, this is probably a left-over, sorry.

Thanks
Jₑₙₛ

-- 
:: ICube :::::::::::::::::::::::::::::: deputy director ::
:: Université de Strasbourg :::::::::::::::::::::: ICPS ::
:: INRIA Nancy Grand Est :::::::::::::::::::::::: Camus ::
:: :::::::::::::::::::::::::::::::::::: ☎ +33 368854536 ::
:: https://icube-icps.unistra.fr/index.php/Jens_Gustedt ::

Content of type "application/pgp-signature" skipped
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.