Message-ID: <aOXPE3B7PZLWb6zH@raf.org>
Date: Wed, 8 Oct 2025 13:40:19 +1100
From: raf <musl@....org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH] fnmatch: add bare minimum support for character
 equivalents (e.g. [[=e=]])

On Mon, Oct 06, 2025 at 09:33:08PM -0400, Rich Felker <dalias@...c.org> wrote:

> On Sun, Oct 05, 2025 at 10:00:33AM +1100, raf wrote:
> > From: raf <raf@....org>
> > 
> > ---
> >  src/regex/fnmatch.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/src/regex/fnmatch.c b/src/regex/fnmatch.c
> > index 978fff88..b8c71afa 100644
> > --- a/src/regex/fnmatch.c
> > +++ b/src/regex/fnmatch.c
> > @@ -146,6 +146,12 @@ static int match_bracket(const char *p, int k, int kfold)
> >  				    iswctype(kfold, wctype(buf)))
> >  					return !inv;
> >  			}
> > +			if (z == '=' && *p0) {
> > +				wchar_t wc2;
> > +				int l = mbtowc(&wc2, p0, 4);
> > +				if (l < 0) return 0;
> > +				if (wc2==k || wc2==kfold) return !inv;
> > +			}
> >  			continue;
> >  		}
> >  		if (*p < 128U) {
> > -- 
> > 2.39.5
> 
> I haven't reviewed the implementation but making this change seems
> desirable.
> 
> Rich

It's slightly better than nothing. I don't know how to
implement it properly. I don't think there's any libc
API for determining character equivalents. glibc does
it by poking around in locale data. newlib does it in
a way that seems possibly incorrect: it normalises
text into separate base and combining characters (NFD)
and then compares the base characters. I don't think
that's locale-specific, which it probably should be,
but it's probably fine and would match what most
people expect of it.
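
To make the "compare the base characters" idea concrete, here is a
minimal sketch. It is not newlib's code and not part of the patch. The
names base_map, base_of() and equiv_class_match() are hypothetical, and
the tiny hand-written table stands in for real Unicode decomposition
data, which a proper implementation would need.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical, hand-written sample table: a few precomposed Latin
 * letters and the base letter that their canonical decomposition (NFD)
 * starts with. Real code would need complete decomposition data. */
static const struct { wchar_t composed, base; } base_map[] = {
	{ 0x00E0, L'a' }, /* a with grave */
	{ 0x00E1, L'a' }, /* a with acute */
	{ 0x00E8, L'e' }, /* e with grave */
	{ 0x00E9, L'e' }, /* e with acute */
	{ 0x00EA, L'e' }, /* e with circumflex */
	{ 0x00EB, L'e' }, /* e with diaeresis */
};

/* Return the base letter for wc, or wc itself if it is not in the table. */
wchar_t base_of(wchar_t wc)
{
	for (size_t i = 0; i < sizeof base_map / sizeof base_map[0]; i++)
		if (base_map[i].composed == wc) return base_map[i].base;
	return wc;
}

/* Nonzero if pat (the character inside [[=x=]]) and k (the target
 * character) share a base letter. This is locale-independent. */
int equiv_class_match(wchar_t pat, wchar_t k)
{
	return base_of(pat) == base_of(k);
}
```

With this table, equiv_class_match(L'e', 0x00E9) is nonzero, so [[=e=]]
would accept e with acute, whereas the patch above only accepts a
character identical to the one named in the pattern.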

I also have a woefully inadequate implementation of
collating sequences [[.e.]] but it only works where it
doesn't matter and it doesn't work where it does
matter, because it only checks a single character of
the target text and uses wcscoll() for the comparison.
It looks like this:

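	if (z == '.' && p-1-p0 < 4) {
		/* Copy the bytes between "[." and ".]" (at most 3)
		 * into a NUL-terminated buffer. */
		char buf[4];
		wchar_t wbuf[4], wbuf1[2], wbuf2[2];
		memcpy(buf, p0, p-1-p0);
		buf[p-1-p0] = 0;
		int wlen = (int)mbstowcs(wbuf, buf, 3);
		if (wlen != -1) {
			wbuf[wlen] = 0;
			/* One-character strings for k and kfold. */
			wbuf1[0] = k;
			wbuf1[1] = 0;
			wbuf2[0] = kfold;
			wbuf2[1] = 0;
			/* Match if either collates equal to the element. */
			if (!wcscoll(wbuf, wbuf1) || !wcscoll(wbuf, wbuf2)) return !inv;
		}
	}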

The only example I know of collating sequences (from
https://www.regular-expressions.info/posixbrackets.html)
is [[.ch.]] in the Czech locale (cs_CZ.UTF-8) where two
characters in the target text (c and h) are to be
interpreted as a single digraph character. The above
code can't handle that without adding more parameters
to match_bracket() to represent the character following
k, and an output parameter to indicate whether or not
that character has been processed.
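
For illustration only, the "output parameter" idea might look something
like the sketch below. match_coll_elem() is a hypothetical helper, not
an existing musl function, and it assumes a plain byte-for-byte
comparison: no locale lookup, no case folding, and no check that the
element is actually defined in the current locale.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of musl. Compares the collating
 * element elem[0..elen) byte-for-byte against the start of the target
 * text s. On a match, stores the number of target bytes used in
 * *consumed and returns 1. Otherwise returns 0 and leaves *consumed
 * untouched. */
int match_coll_elem(const char *elem, size_t elen,
                    const char *s, size_t *consumed)
{
	/* strncmp() stops at the NUL in s, so a target shorter than
	 * the element cannot match. */
	if (elen == 0 || strncmp(elem, s, elen) != 0) return 0;
	*consumed = elen;
	return 1;
}
```

For [[.ch.]] against the target "chata", match_coll_elem("ch", 2,
"chata", &n) returns 1 and sets n to 2, so the caller would have to
skip two bytes of the target rather than one character. Whether that
fits how fnmatch() walks the string is exactly the open question.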

Being aware of only one example means that I really
don't understand the requirements of collating sequences,
so I doubt that the aforementioned changes would even be
close to OK (except by luck).

cheers,
raf
