Message-ID: <aOXPE3B7PZLWb6zH@raf.org>
Date: Wed, 8 Oct 2025 13:40:19 +1100
From: raf <musl@....org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH] fnmatch: add bare minimum support for character
equivalents (e.g. [[=e=]])
On Mon, Oct 06, 2025 at 09:33:08PM -0400, Rich Felker <dalias@...c.org> wrote:
> On Sun, Oct 05, 2025 at 10:00:33AM +1100, raf wrote:
> > From: raf <raf@....org>
> >
> > ---
> > src/regex/fnmatch.c | 6 ++++++
> > 1 file changed, 6 insertions(+)
> >
> > diff --git a/src/regex/fnmatch.c b/src/regex/fnmatch.c
> > index 978fff88..b8c71afa 100644
> > --- a/src/regex/fnmatch.c
> > +++ b/src/regex/fnmatch.c
> > @@ -146,6 +146,12 @@ static int match_bracket(const char *p, int k, int kfold)
> > iswctype(kfold, wctype(buf)))
> > return !inv;
> > }
> > + if (z == '=' && *p0) {
> > + wchar_t wc2;
> > + int l = mbtowc(&wc2, p0, 4);
> > + if (l < 0) return 0;
> > + if (wc2==k || wc2==kfold) return !inv;
> > + }
> > continue;
> > }
> > if (*p < 128U) {
> > --
> > 2.39.5
>
> I haven't reviewed the implementation but making this change seems
> desirable.
>
> Rich
It's slightly better than nothing. I don't know how to
implement it properly. I don't think there's any libc
API for determining character equivalents. glibc does
it by poking around in locale data. newlib does it with
what seems like a possibly incorrect method: it
normalises text into separate base and combining
characters (NFD) and then compares the base characters.
I don't think it's locale-specific, which it probably
should be, but it's probably fine and would match what
most people expect of it.
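
As a rough illustration of the base-character comparison described
above, here is a minimal sketch. It is not newlib's code. The names
base_map, base_table, base_of and equiv_match are invented for this
example, and the tiny hand-written table stands in for real
decomposition data. It also assumes wchar_t holds Unicode code points,
as it does on musl.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical table mapping a few accented Latin-1 lowercase letters
 * to their unaccented base letter. A real implementation would need
 * locale data or full Unicode decomposition (NFD) data instead. */
struct base_map { wchar_t wc, base; };

static const struct base_map base_table[] = {
	{ 0x00E0, L'a' }, { 0x00E1, L'a' }, { 0x00E2, L'a' }, { 0x00E4, L'a' },
	{ 0x00E8, L'e' }, { 0x00E9, L'e' }, { 0x00EA, L'e' }, { 0x00EB, L'e' },
};

/* Return the base letter for wc, or wc itself if it is not in the table. */
static wchar_t base_of(wchar_t wc)
{
	size_t i;
	for (i = 0; i < sizeof base_table / sizeof base_table[0]; i++)
		if (base_table[i].wc == wc) return base_table[i].base;
	return wc;
}

/* Nonzero if k belongs to the same (approximate) equivalence class as
 * class_wc, the character written between "[=" and "=]". The caller is
 * assumed to have rejected an empty class already. */
static int equiv_match(wchar_t class_wc, wchar_t k)
{
	assert(class_wc != 0);
	return base_of(class_wc) == base_of(k);
}
```

With this table, equiv_match(L'e', 0x00E9) is nonzero, so [[=e=]]
would accept "é". The comparison ignores the locale entirely, which is
the same limitation noted above.
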
I also have a woefully inadequate implementation of
collating sequences [[.e.]] but it only works where it
doesn't matter and it doesn't work where it does
matter, because it only checks a single character of
the target text and uses wcscoll() for the comparison.
It looks like this:

	if (z == '.' && p-1-p0 < 4) {
		char buf[4];
		wchar_t wbuf[4], wbuf1[2], wbuf2[2];
		memcpy(buf, p0, p-1-p0);
		buf[p-1-p0] = 0;
		int wlen = (int)mbstowcs(wbuf, buf, 3);
		if (wlen != -1) {
			wbuf[wlen] = 0;
			wbuf1[0] = k;
			wbuf1[1] = 0;
			wbuf2[0] = kfold;
			wbuf2[1] = 0;
			if (!wcscoll(wbuf, wbuf1) || !wcscoll(wbuf, wbuf2)) return !inv;
		}
	}

The only example I know of collating sequences (from
https://www.regular-expressions.info/posixbrackets.html)
is [[.ch.]] in the Czech locale (cs_CZ.UTF-8) where two
characters in the target text (c and h) are to be
interpreted as a single digraph character. The above
code can't handle that without adding more parameters
to match_bracket() to represent the character following
k, and an output parameter to indicate whether or not
that character has been processed.
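
To make the extra output parameter concrete, here is a minimal sketch
of a helper that compares a multi-character collating element against
the target text and reports how much of the text it covered. The name
match_coll_elem and its signature are hypothetical and not part of
musl. It does a plain byte comparison, so it neither checks that the
element is actually defined in the current locale nor handles case
folding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper. elem/elem_len is the text between "[." and ".]"
 * in the pattern; str is the remaining NUL-terminated target text.
 * On a match, returns 1 and stores in *consumed the number of bytes of
 * str that the element covers, so the caller can skip past all of them.
 * On a mismatch, returns 0 and leaves *consumed untouched. */
static int match_coll_elem(const char *elem, size_t elem_len,
                           const char *str, size_t *consumed)
{
	assert(consumed != NULL);
	/* strncmp stops at the NUL in str, so a short target cannot match. */
	if (elem_len == 0 || strncmp(elem, str, elem_len) != 0)
		return 0;
	*consumed = elem_len;
	return 1;
}
```

For the pattern element [[.ch.]] and the target text "chata", this
reports a match covering 2 bytes, whereas the single-character check
can only ever look at the "c".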

Given that I'm only aware of one example, I really
don't understand the requirements of collating sequences,
so I doubt that the aforementioned changes would even be
close to OK (except by luck).

cheers,
raf