Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20260324131135.GH1827@brightrain.aerifal.cx>
Date: Tue, 24 Mar 2026 09:11:36 -0400
From: Rich Felker <dalias@...c.org>
To: Simon Resch <simon.resch@...e-intelligence.com>,
	musl@...ts.openwall.com
Subject: Re: regexec infinite loop on self-referential backreference
 pattern

On Mon, Mar 23, 2026 at 10:54:55PM +0100, Szabolcs Nagy wrote:
> * Rich Felker <dalias@...c.org> [2026-03-23 11:57:01 -0400]:
> > On Mon, Mar 23, 2026 at 03:15:34PM +0100, Simon Resch wrote:
> > > Hi,
> > > 
> > > I've found that musl's regexec enters an infinite loop on the
> > > self-referential backreference pattern \(^.*\1\)*, even with a two-byte
> > > input. Since the input is only two bytes, this is not exponential blowup
> > > but appears to be a true infinite loop in the backtracking matcher.
> > > 
> > > The same bug exists in upstream TRE, from which musl's regex implementation
> > > is derived. For comparison, glibc rejects this pattern at regcomp time with
> > > REG_ESUBREG ("Invalid back reference"), since \1 references a group that
> > > contains itself.
> > > 
> > > Reproducer
> > > 
> > >   #include <regex.h>
> > >   #include <stdio.h>
> > > 
> > >   int main(void) {
> > >       regex_t re;
> > > 
> > >       const char pattern[] = "\\(^.*\\1\\)*";
> > >       const char subject[] = "aa";
> > > 
> > >       int rc = regcomp(&re, pattern, 0);
> > >       if (rc != 0) {
> > >           char errbuf[256];
> > >           regerror(rc, &re, errbuf, sizeof(errbuf));
> > >           fprintf(stderr, "regcomp failed: %s\n", errbuf);
> > >           return 1;
> > >       }
> > > 
> > >       printf("regcomp succeeded, calling regexec ...\n");
> > >       fflush(stdout);
> > > 
> > >       regexec(&re, subject, 0, NULL, 0);
> > > 
> > >       regfree(&re);
> > >       return 0;
> > >   }
> > > 
> > > To verify on Alpine (musl 1.2.5) place the above content into a file
> > > reproducer.c and execute
> > > 
> > >   docker run --rm -v $PWD/reproducer.c:/reproducer.c alpine sh -c \
> > >           'apk add gcc musl-dev && gcc -o /reproducer /reproducer.c &&
> > > timeout 10 /reproducer; echo "Exit code: $?"'
> > > 
> > > Output:
> > > 
> > >   regcomp succeeded, calling regexec ...
> > >   Exit code: 143
> > > 
> > > The process is killed by timeout after 10 seconds. I have also let it run
> > > for over an hour with no termination.
> > > 
> > > Any program that passes user-supplied patterns to regcomp/regexec (grep,
> > > sed, awk, web servers with regex-based URL routing, input validators, etc.)
> > > can be made to hang a thread/process permanently with a short pattern and a
> > > trivial input.
> 
> note that not all those tools use libc regex
> (busybox sed,grep,awk do but e.g. awk uses ERE, not BRE)
> 
> > > 
> > > Suggested fix: Either reject self-referential backreferences during regcomp
> > > (as glibc does), or add a step limit to the backtracking matcher that
> > > returns REG_ESPACE when exceeded.
> > > 
> > > Please CC me on replies.
> > 
> > Thanks. It should be caught during regcomp. I'm not sure yet how to do
> > that, but that's the approach a fix should take.
> > 
> > Rich
> 
> can be fixed by adding a bit of internal state.
> attached a patch.
> 

> >From d73884a4510ca5d2d2e9f4ee2ad345b304f67342 Mon Sep 17 00:00:00 2001
> From: Szabolcs Nagy <nsz@...t70.net>
> Date: Mon, 23 Mar 2026 17:33:20 +0000
> Subject: [PATCH] regex: reject invalid \digit back reference in BRE
> 
> in BRE \n matches the nth subexpression, but regcomp did not check if
> the nth subexpression was complete or not, only that there were more
> subexpressions overall than the largest backref.
> 
> fix regcomp to error if the referenced subexpression is incomplete.
> the bug could cause an infinite loop in regexec:
> 
>  regcomp(&re, "\\(^a*\\1\\)*", 0);
>  regexec(&re, "aa", 0, 0, 0);
> 
> so this is a DoS vuln if the pattern is not under control (ERE is not
> affected).

Looks good. Except I would not characterize this as a "vuln" because
BRE is inherently subject to backreferences taking exponential time
which, for all practical purposes, is the same as infinite time. It's
an application vuln if it passes untrusted BRE to regcomp.

Rich

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.