Message-ID: <aOXPE3B7PZLWb6zH@raf.org>
Date: Wed, 8 Oct 2025 13:40:19 +1100
From: raf <musl@....org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH] fnmatch: add bare minimum support for character equivalents (e.g. [[=e=]])

On Mon, Oct 06, 2025 at 09:33:08PM -0400, Rich Felker <dalias@...c.org> wrote:

> On Sun, Oct 05, 2025 at 10:00:33AM +1100, raf wrote:
> > From: raf <raf@....org>
> >
> > ---
> >  src/regex/fnmatch.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/src/regex/fnmatch.c b/src/regex/fnmatch.c
> > index 978fff88..b8c71afa 100644
> > --- a/src/regex/fnmatch.c
> > +++ b/src/regex/fnmatch.c
> > @@ -146,6 +146,12 @@ static int match_bracket(const char *p, int k, int kfold)
> >  					iswctype(kfold, wctype(buf)))
> >  					return !inv;
> >  			}
> > +			if (z == '=' && *p0) {
> > +				wchar_t wc2;
> > +				int l = mbtowc(&wc2, p0, 4);
> > +				if (l < 0) return 0;
> > +				if (wc2==k || wc2==kfold) return !inv;
> > +			}
> >  			continue;
> >  		}
> >  		if (*p < 128U) {
> > --
> > 2.39.5
>
> I haven't reviewed the implementation but making this change seems
> desirable.
>
> Rich

It's slightly better than nothing. I don't know how to implement
it properly. I don't think there's any libc API for determining
character equivalents. glibc does it by poking around in locale
data. newlib does it in what seems like a possibly incorrect way:
it normalises text into separate base and combining characters
(NFD) and then compares the base characters. I don't think that's
locale-specific, which it probably should be, but it's probably
fine and would match what most people expect of it.

I also have a woefully inadequate implementation of collating
sequences [[.e.]] but it only works where it doesn't matter and
it doesn't work where it does matter, because it only checks a
single character of the target text and uses wcscoll() for the
comparison. It looks like this:

	if (z == '.' && p-1-p0 < 4) {
		char buf[4];
		wchar_t wbuf[4], wbuf1[2], wbuf2[2];
		memcpy(buf, p0, p-1-p0);
		buf[p-1-p0] = 0;
		int wlen = (int)mbstowcs(wbuf, buf, 3);
		if (wlen != -1) {
			wbuf[wlen] = 0;
			wbuf1[0] = k;
			wbuf1[1] = 0;
			wbuf2[0] = kfold;
			wbuf2[1] = 0;
			if (!wcscoll(wbuf, wbuf1) || !wcscoll(wbuf, wbuf2))
				return !inv;
		}
	}

The only example I know of collating sequences (from
https://www.regular-expressions.info/posixbrackets.html) is
[[.ch.]] in the Czech locale (cs-CZ.UTF-8), where two characters
in the target text (c and h) are to be interpreted as a single
digraph character. The above code can't handle that without
adding more parameters to match_bracket(): one to represent the
character following k, and an output parameter to indicate
whether or not that character has been processed.

Given that I'm only aware of one example, I really don't
understand the requirements of collating sequences, and so I
doubt that the aforementioned changes would even be close to OK
(except by luck).

cheers,
raf
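To illustrate the digraph problem described in the message, here is a minimal standalone sketch of what "report how much of the target text was consumed" could look like. It is not the musl code and not a proposed patch. The name match_collating_element() is hypothetical, and the plain byte comparison is an assumption standing in for whatever locale lookup a real implementation would need.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of musl.
 *
 * elem/elem_len: the bytes between "[." and ".]" in the pattern, e.g. "ch".
 * text:          the NUL-terminated target text at the current position.
 *
 * Returns the number of bytes of text the element consumed, or 0 if it
 * does not match. Assumption: a byte-for-byte comparison is used here in
 * place of locale collation data, so this only shows the control flow.
 */
static size_t match_collating_element(const char *elem, size_t elem_len,
                                      const char *text)
{
	if (elem_len == 0)
		return 0;
	/* strncmp stops at the NUL in text, so a short text cannot match. */
	if (strncmp(elem, text, elem_len) != 0)
		return 0;
	return elem_len;
}
```

A caller would advance the target text by the returned count instead of by one character, which is the extra state the single-character match_bracket() interface has no way to report today.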