musl - Re: [PATCH] fnmatch: add bare minimum support for character equivalents (e.g. [[=e=]])

Message-ID: <aOXPE3B7PZLWb6zH@raf.org>
Date: Wed, 8 Oct 2025 13:40:19 +1100
From: raf <musl@....org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH] fnmatch: add bare minimum support for character
 equivalents (e.g. [[=e=]])

On Mon, Oct 06, 2025 at 09:33:08PM -0400, Rich Felker <dalias@...c.org> wrote:

> On Sun, Oct 05, 2025 at 10:00:33AM +1100, raf wrote:
> > From: raf <raf@....org>
> > 
> > ---
> >  src/regex/fnmatch.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/src/regex/fnmatch.c b/src/regex/fnmatch.c
> > index 978fff88..b8c71afa 100644
> > --- a/src/regex/fnmatch.c
> > +++ b/src/regex/fnmatch.c
> > @@ -146,6 +146,12 @@ static int match_bracket(const char *p, int k, int kfold)
> >  				    iswctype(kfold, wctype(buf)))
> >  					return !inv;
> >  			}
> > +			if (z == '=' && *p0) {
> > +				wchar_t wc2;
> > +				int l = mbtowc(&wc2, p0, 4);
> > +				if (l < 0) return 0;
> > +				if (wc2==k || wc2==kfold) return !inv;
> > +			}
> >  			continue;
> >  		}
> >  		if (*p < 128U) {
> > -- 
> > 2.39.5
> 
> I haven't reviewed the implementation but making this change seems
> desirable.
> 
> Rich

It's slightly better than nothing. I don't know how to
implement it properly. I don't think there's any libc
API for determining character equivalents. glibc does
it by poking around in locale data. newlib does it in
what seems like a possibly incorrect way: it
normalises text into separate base and combining
characters (NFD) and then compares the base characters.
I don't think that's locale-specific, which it probably
should be, but it's probably fine and would match what
most people expect of it.
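
To make the base-character idea concrete, here is a minimal sketch
in C. It is not newlib's code and not part of the patch: the table
base_tab and the helpers base_char() and same_equiv_class() are
hypothetical names, and the table is a tiny hand-written stand-in
for real NFD decomposition data, covering only a few Latin-1
letters. Like the approach described above, it ignores the locale.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical, hand-written table for illustration only: maps a few
 * precomposed characters to the base letter that their canonical
 * decomposition (NFD) starts with. A real implementation would need
 * complete Unicode decomposition data. */
static const struct { wint_t composed, base; } base_tab[] = {
	{ 0x00E0, 'a' }, /* a with grave */
	{ 0x00E1, 'a' }, /* a with acute */
	{ 0x00E8, 'e' }, /* e with grave */
	{ 0x00E9, 'e' }, /* e with acute */
	{ 0x00EA, 'e' }, /* e with circumflex */
	{ 0x00EB, 'e' }, /* e with diaeresis */
};

/* Return the base character of c, or c itself if it is not in the table. */
static wint_t base_char(wint_t c)
{
	size_t i;
	for (i = 0; i < sizeof base_tab / sizeof *base_tab; i++)
		if (base_tab[i].composed == c) return base_tab[i].base;
	return c;
}

/* Two characters are treated as equivalent if their base characters match. */
static int same_equiv_class(wint_t a, wint_t b)
{
	return base_char(a) == base_char(b);
}

/* Usage example: e-acute and e-grave share the base letter e. */
static void demo_equiv(void)
{
	assert(same_equiv_class(0x00E9, 0x00E8));
	assert(same_equiv_class(0x00E9, 'e'));
	assert(!same_equiv_class(0x00E9, 'a'));
}
```

With something like this, [[=e=]] would match any character whose
base is e, whereas the patch as posted only matches the character
itself and its case-folded counterpart.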

I also have a woefully inadequate implementation of
collating sequences [[.e.]] but it only works where it
doesn't matter and it doesn't work where it does
matter, because it only checks a single character of
the target text and uses wcscoll() for the comparison.
It looks like this:

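	/* [.x.]: only handle names shorter than 4 bytes */
	if (z == '.' && p-1-p0 < 4) {
		char buf[4];
		wchar_t wbuf[4], wbuf1[2], wbuf2[2];
		/* copy the name between "[." and ".]" and terminate it */
		memcpy(buf, p0, p-1-p0);
		buf[p-1-p0] = 0;
		int wlen = (int)mbstowcs(wbuf, buf, 3);
		if (wlen != -1) {
			wbuf[wlen] = 0;
			/* one-character strings for k and its case-folded form */
			wbuf1[0] = k;
			wbuf1[1] = 0;
			wbuf2[0] = kfold;
			wbuf2[1] = 0;
			/* match if the name collates equal to either of them */
			if (!wcscoll(wbuf, wbuf1) || !wcscoll(wbuf, wbuf2)) return !inv;
		}
	}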

The only example I know of collating sequences (from
https://www.regular-expressions.info/posixbrackets.html)
is [[.ch.]] in the Czech locale (cs_CZ.UTF-8) where two
characters in the target text (c and h) are to be
interpreted as a single digraph character. The above
code can't handle that without adding more parameters
to match_bracket() to represent the character following
k, and an output parameter to indicate whether or not
that character has been processed.
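
As a rough illustration of what that extra information could look
like, the sketch below checks a multi-character collating element
against the target text and reports how many bytes it consumed. It
is a hypothetical standalone helper (collating_elem_consumed() is
not a musl function), the return value stands in for the output
parameter mentioned above, and it compares bytes only. It makes no
attempt to ask the locale whether "ch" really is a collating
element there, which a correct implementation would have to do.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: elem/elem_len is the text between "[." and ".]",
 * str is the remaining target text. Returns the number of bytes of str
 * consumed by the collating element, or 0 if it does not match. */
static size_t collating_elem_consumed(const char *elem, size_t elem_len,
                                      const char *str)
{
	assert(elem != NULL && str != NULL);
	if (elem_len == 0) return 0;
	/* strncmp stops at the terminating null byte of str, so a target
	 * shorter than the element simply fails to match. */
	if (strncmp(elem, str, elem_len) != 0) return 0;
	return elem_len;
}

/* Usage example: "ch" consumes two bytes of "chleba" but none of "cukr". */
static void demo_collating(void)
{
	assert(collating_elem_consumed("ch", 2, "chleba") == 2);
	assert(collating_elem_consumed("ch", 2, "cukr") == 0);
}
```

The caller would then advance the target text by the returned count
instead of by a single character.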

Being aware of only one example means that I really
don't understand the requirements of collating sequences, so
I doubt that the aforementioned changes would even be
close to OK (except by luck).

cheers,
raf