Date: Wed, 25 Jan 2023 09:33:52 +0900
From: Dominique MARTINET <dominique.martinet@...ark-techno.com>
To: Rich Felker <dalias@...c.org>
Cc: musl@...ts.openwall.com
Subject: Re: infinite loop in mallocng's try_avail

Thanks for the reply,

Rich Felker wrote on Tue, Jan 24, 2023 at 03:37:48AM -0500:
> > (this is musl 1.2.4 with a couple of patches, none around malloc:

(I had meant 1.2.3)

> > https://gitlab.alpinelinux.org/alpine/aports/-/tree/3.17-stable/main/musl
> > )
> > 
> > For convenience, I've copied the offending loop here:
> >         int cnt = m->mem->active_idx + 2;
> >         int size = size_classes[m->sizeclass]*UNIT;
> >         int span = UNIT + size*cnt;
> >         // activate up to next 4k boundary
> >         while ((span^(span+size-1)) < 4096) {
> >                 cnt++;
> >                 span += size;
> >         }
> 
> This code should not be reachable for size class 0 or any size class
> allocated inside a larger-size-class slot.
> That case has active_idx = cnt-1 (set at line 272).

I figured it might be "normally" unreachable but did not see why;
thanks for confirming that intention.
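
Out of curiosity I also plugged the values from the gdb dump below
(active_idx = 29, sizeclass 0) into that loop in a standalone program.
Assuming UNIT == 16 and size_classes[0] == 1 (my reading of mallocng's
tables, so take this as a sketch), span starts and stays 16-aligned, so
the xor never reaches 4096 and once class 0 does get here it can only
spin:

#include <stdio.h>

int main(void)
{
        /* values from the gdb dump below; UNIT == 16 and
         * size_classes[0] == 1 are assumptions, not taken from the dump */
        int active_idx = 29;
        int cnt = active_idx + 2;
        int size = 1 * 16;           /* size_classes[0] * UNIT */
        int span = 16 + size * cnt;  /* UNIT + size*cnt */

        /* same condition as the loop in try_avail: span is always a
         * multiple of 16, so span and span+size-1 only differ in the
         * low 4 bits and the xor never reaches 4096 */
        for (int i = 0; i < 8; i++) {
                printf("cnt=%d span=%d xor=%d\n",
                       cnt, span, span ^ (span + size - 1));
                cnt++;
                span += size;
        }
        return 0;
}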

> If this code is being reached, either the allocator state has been
> corrupted by some UB in the application, or there's a logic bug in
> mallocng. The sequence of events that seem to have to happen to get
> there are:
> 
> 1. Previously active group has no more available slots (line 120).

Right, that one has most likely already been dequeued (or at least
traversed past), so I no longer see a way to inspect it, but that
sounds possible.

> 2. Freed mask of newly activating group (line 131 or 138) is either
>    zero (line 145) or the active_idx (read from in-band memory
>    susceptible to application buffer overflows etc) is wrong and
>    produces zero when its bits are anded with the freed mask (line
>    145).

m->freed_mask looks like it is zero from the values below; I cannot
tell whether that comes from corruption outside of musl or not.
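
To make the masks in the dump below easier to read, here is my guess at
the check Rich refers to (line 145), treating the masks as plain
per-slot bitmaps; with freed_mask == 0 the and is zero no matter what
active_idx says:

#include <stdio.h>

int main(void)
{
        /* values from the gdb dump of *active[0] below */
        unsigned avail_mask = 1073741822;  /* 0x3ffffffe: slots 1..29 */
        unsigned freed_mask = 0;
        int last_idx = 29, active_idx = 29;

        unsigned all_slots = (2u << last_idx) - 1;  /* 0x3fffffff */
        /* my guess at the line-145 test: freed bits among active slots */
        unsigned anded = freed_mask & ((2u << active_idx) - 1);

        printf("avail_mask = %#x\n", avail_mask);
        printf("all slots  = %#x\n", all_slots);
        printf("freed&act  = %#x  (zero -> activation path)\n", anded);
        return 0;
}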

> > (gdb) p __malloc_context            
> > $94 = {
> >   secret = 15756413639004407235,
> >   init_done = 1,
> >   mmap_counter = 135,
> >   free_meta_head = 0x0,
> >   avail_meta = 0x18a3f70,
> >   avail_meta_count = 6,
> >   avail_meta_area_count = 0,
> >   meta_alloc_shift = 0,
> >   meta_area_head = 0x18a3000,
> >   meta_area_tail = 0x18a3000,
> >   avail_meta_areas = 0x18a4000 <error: Cannot access memory at address 0x18a4000>,
> >   active = {0x18a3e98, 0x18a3eb0, 0x18a3208, 0x18a3280, 0x0, 0x0, 0x0, 0x18a31c0, 0x0, 0x0, 0x0, 0x18a3148, 0x0, 0x0, 0x0, 0x18a3dd8, 0x0, 0x0, 0x0, 0x18a3d90, 0x0, 
> >     0x18a31f0, 0x0, 0x18a3b68, 0x0, 0x18a3f28, 0x0, 0x0, 0x0, 0x18a3238, 0x0 <repeats 18 times>},
> >   usage_by_class = {2580, 600, 10, 7, 0 <repeats 11 times>, 96, 0, 0, 0, 20, 0, 3, 0, 8, 0, 3, 0, 0, 0, 3, 0 <repeats 18 times>},
> >   unmap_seq = '\000' <repeats 31 times>,
> >   bounces = '\000' <repeats 18 times>, "w", '\000' <repeats 12 times>,
> >   seq = 1 '\001',
> >   brk = 25837568
> > }
> > (gdb) p *__malloc_context->active[0]
> > $95 = {
> >   prev = 0x18a3f40,
> >   next = 0x18a3e80,
> >   mem = 0xb6f57b30,
> >   avail_mask = 1073741822,
> >   freed_mask = 0,
> >   last_idx = 29,
> >   freeable = 1,
> >   sizeclass = 0,
> >   maplen = 0
> > }
> > (gdb) p *__malloc_context->active[0]->mem
> > $97 = {
> >   meta = 0x18a3e98,
> >   active_idx = 29 '\035',
> >   pad = "\000\000\000\000\000\000\000\000\377\000",
> >   storage = 0xb6f57b40 ""
> > }
> 
> This is really weird, because at the point of the infinite loop, the
> new group should not yet be activated (line 163), so
> __malloc_context->active[0] should still point to the old active
> group. But its avail_mask has all bits set and active_idx is not
> corrupted, so try_avail should just have obtained an available slot
> from it without ever entering the block at line 120. So I'm confused
> how it got to the loop.

try_avail's pm points at `__malloc_context->active[0]`, and active[0]
gets overwritten by either dequeue(pm, m) or *pm = m (lines 123, 128),
so the original m's avail_mask could have been zero, with the next
element having a zero freed_mask?
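
Something like this toy version of the list handling, i.e. not musl's
code, just to illustrate why a dump of active[0] taken afterwards would
show a healthy-looking group:

#include <stdio.h>

/* toy stand-in for struct meta: pm stays &active[0], but advancing or
 * dequeuing rewrites what a later dump of active[0] shows, so the
 * dumped group need not be the one whose empty avail_mask started
 * this code path */
struct grp {
        struct grp *prev, *next;
        unsigned avail_mask, freed_mask;
};

int main(void)
{
        struct grp exhausted = { .avail_mask = 0, .freed_mask = 0 };
        struct grp fresh = { .avail_mask = 0x3ffffffe, .freed_mask = 0 };
        exhausted.prev = exhausted.next = &fresh;
        fresh.prev = fresh.next = &exhausted;

        struct grp *active[1] = { &exhausted };
        struct grp **pm = &active[0];
        struct grp *m = *pm;

        if (!m->avail_mask)
                *pm = m->next;  /* crude stand-in for dequeue / *pm = m */

        printf("active[0] now: avail_mask=%#x freed_mask=%#x\n",
               (*pm)->avail_mask, (*pm)->freed_mask);
        return 0;
}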

I'm really not familiar with the slot management logic here, so that
might not normally be possible without corruption, but the structures
look fairly sensible to me... Not that this proves there wasn't some
sort of outside corruption; I wish this were easier to reproduce so I
could just run it under valgrind or asan to detect overflows...

> One odd thing I noticed is that the backtrace pm=0xb6f692e8 does not
> match the __malloc_context->active[0] address. Were these from
> different runs?

These were from the same run; I've only observed this single occurrence
first-hand.

pm is &__malloc_context->active[0], so it's not 0x18a3e98 (the first
value of active) but its address (i.e. __malloc_context+48, as per
gdb's symbol resolution in the backtrace).
I didn't print &__malloc_context to double-check, but I don't see why
gdb would have gotten that wrong.
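
For what it's worth, mirroring the layout from the dump with guessed
field types does put active[] at offset 48 when compiled for a 32-bit
target, so __malloc_context+48 being &active[0] looks consistent:

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* NOT the real malloc_context: field types are guessed from the gdb
 * dump, only to check that active[] lands at offset 48 on a 32-bit
 * target (8-byte secret, 4-byte pointers and size_t) */
struct ctx_guess {
        uint64_t secret;
        int init_done;
        unsigned mmap_counter;
        void *free_meta_head, *avail_meta;
        size_t avail_meta_count, avail_meta_area_count, meta_alloc_shift;
        void *meta_area_head, *meta_area_tail;
        unsigned char *avail_meta_areas;
        void *active[48];
};

int main(void)
{
        printf("offsetof(active) = %zu\n",
               offsetof(struct ctx_guess, active));
        return 0;
}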


Cheers,
-- 
Dominique
