musl - Re: [Toybox] Not sure how to debug this one.

Message-Id: <202402171521.KAA18350@Stone.Rodents-Montreal.ORG>
Date: Sat, 17 Feb 2024 10:21:07 -0500 (EST)
From: Mouse <mouse@...ents-Montreal.ORG>
To: toybox@...ts.landley.net, musl@...ts.openwall.com
Subject: Re: [Toybox] Not sure how to debug this one.

>> This smells to me like depending on uninitialized stack trash.
> A write-only function that didn't change its behavior when I memset
> the structure before calling it?

The intent may be write-only.  The implementation may not be.

It also may be using _other_ stack trash, what would in a normal
function be its own local variables.  Continuing to fail when you've
zeroed the struct argues for this, actually; while it's _possible_ that
it would succeed with nonzero values somewhere in the struct, it
strikes me - and apparently you :-) - as a good deal less likely.

> Define "uninitialized".

"Contains junk left around from previous use of that memory",
approximately.

> (Unless you mean an uninitialized variable inside a function written
> entirely in assembly, that's part of a C library shipped and used by
> many people for many years?)

Hey, I found a use of uninitialized memory in NetBSD's libc, once.
(Admittedly, that one was concealed.)

>>> And I dunno how to stick a printf into superh assembly code.
>> The simple way to figure that out is to compile something that uses
>> printf and look at the assembly,

> 0045341c <printf>:
>   45341c:	86 2f       	mov.l	r8,@-r15
>   45341e:	f8 e2       	mov	#-8,r2
>   453420:	22 4f       	sts.l	pr,@-r15
[...]

> That's just the vfprintf() wrapper,

I actually meant to look at the assembly for the _call_ to printf.
That's what you'd want to mimic if you want to hand-code such a call,
after all.

> No, I would wind up CALLING the function, meaning set up a call
> stack, but how you're supposed to do that in the middle of setjmp()
> without corrupting the registers you're supposed to be saving...

That's what I was talking about when I wrote

>> But, given what sigsetjmp is, sticking a printf in there is likely
>> to be more difficult than usual.

> even manually making a _system_call_ in that context is... I mean
> it's _documented_ [...]

> But again, the point is to SAVE those registers, in a defined order,
> and there's no WAY to insert something that big into delicate
> assembly non-intrusively.  This already heisenbugs if my dprintf() is
> too elaborate.

Yes; that's in large part why I think it feels like use of stack trash.
That's probably the commonest cause of heisenbugs in my experience.

I think I would do that by saving registers in a global save area and
switching the stack pointer to separate, data-space, memory for the
duration of the printf call(s), so as to not disturb whatever is on the
stack.
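
A rough hosted-C analogue of that idea, as a sketch only: the ucontext
calls below (getcontext, makecontext, swapcontext) are the obsolescent
POSIX interfaces as glibc provides them, and musl does not ship them, so
this illustrates the technique and is not something to drop into
sigsetjmp itself. The names saved_ctx, debug_stack, debug_body and
run_on_debug_stack are hypothetical. Inside the real assembly the
register save and stack switch would have to be hand-written.

```c
#include <stdint.h>
#include <stdio.h>
#include <ucontext.h>

static ucontext_t saved_ctx;                  /* global save area for the caller's context */
static ucontext_t debug_ctx;                  /* context that runs the debug code */
static unsigned char debug_stack[64 * 1024];  /* separate stack living in data space */
static int debug_calls;                       /* how many times debug_body has run */
static int ran_on_debug_stack;                /* set if a local landed inside debug_stack */

static void debug_body(void)
{
    int marker = 0;
    uintptr_t p  = (uintptr_t)&marker;
    uintptr_t lo = (uintptr_t)debug_stack;

    /* Record whether our locals really live in the data-space stack. */
    ran_on_debug_stack = (p >= lo && p < lo + sizeof debug_stack);
    debug_calls++;
    fprintf(stderr, "debug: running on the separate stack\n");
}

/* Run debug_body on debug_stack, then resume the caller. Returns 0 on success. */
int run_on_debug_stack(void)
{
    if (getcontext(&debug_ctx) == -1)
        return -1;
    debug_ctx.uc_stack.ss_sp = debug_stack;
    debug_ctx.uc_stack.ss_size = sizeof debug_stack;
    debug_ctx.uc_link = &saved_ctx;   /* resumed when debug_body returns */
    makecontext(&debug_ctx, debug_body, 0);

    /* Saves the current context into saved_ctx and switches stacks. */
    return swapcontext(&saved_ctx, &debug_ctx);
}
```

The point of the static buffer is that nothing printf does can scribble
on the stack region under suspicion; the call to swapcontext itself
still touches the original stack, which hand-written assembly could
avoid.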

Another thing I'd try is to insert code to modify (zero, set to all 1
bits, whatever - I'd experiment with various things) the theoretically
unused stack space below/above the current top-of-stack.  (Below in
terms of memory addresses, assuming of course that the ABI in use uses
a downward-growing stack; above in terms of stack depth.)
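
A minimal sketch of that experiment from the C side, assuming a
downward-growing stack and a GCC-compatible compiler (the noinline
attribute is a GCC/Clang extension). The names poison_stack and
POISON_BYTES are hypothetical. Call it just before the suspect call and
vary the pattern; whatever the next callee treats as "uninitialized"
locals will then hold that pattern rather than leftover junk.

```c
#include <stddef.h>

#define POISON_BYTES 4096   /* how much stack below the caller to overwrite */

/* Fill this function's own frame with `pattern` and return. Once it has
 * returned, that memory is the "theoretically unused" region just below the
 * caller's stack pointer. Returns the number of bytes written. */
__attribute__((noinline))
size_t poison_stack(unsigned char pattern)
{
    /* volatile keeps the compiler from discarding the dead stores */
    volatile unsigned char scratch[POISON_BYTES];
    size_t i;

    for (i = 0; i < POISON_BYTES; i++)
        scratch[i] = pattern;
    return i;
}
```

If the failure changes character between, say, poison_stack(0x00) and
poison_stack(0xFF), that points fairly strongly at something reading
stack contents it never wrote.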

> But that's not what I was asking about HERE. "Magic blob of assembly
> for architecture I'm not hugely familiar with is throwing an
> interrupt, I wonder why?"

Fair point.  I was forgetting whom I was writing to.

>> But, given what sigsetjmp is, sticking a printf in there is likely
>> to be more difficult than usual.
> Define "usual".

Doing "the same thing" in an ordinary function, one that's not playing
unusual games with registers (and maybe stack space).

> One of the private email replies [...] suggested trying it under
> qemu-user (which reproduced the issue! MUCH easier),

Oh, good!  But....

> Unhandled trap: 0x180

> And that ALSO says it's a trap 0x180 which in qemu:

> sh7750_regs.h:#define SH7750_EVT_ILLEGAL_INSTR       0x180 /* General
> Illegal Instruction */

This leads me to wonder if it really _did_ reproduce the issue, as
opposed to breaking in a different way.

>>> (The problem with trying to configure the kernel to produce core
>>> dumps and compare against the readelf -d output is it's running as
>>> PID 1.  [...])
>> Why is that a problem?  I don't see any statement of what kernel
>> you're running under, but I can think of two plausible reasons
>> offhand: (1) the kernel refuses to coredump PID 1 under any
>> circumstances or (2) there's no writable filesystem to take a
>> coredump on at that point.
> The kernel panics immediately upon PID 1 exiting and even if the
> panic is deferred until after it's written the core dump instead of a
> check at the START of exiting, the writeable filesystem is initramfs
> which is transient.

:-(

The more I see of Linux the more I wonder why so many developers like
it so much.  But I suppose plenty of Linux people feel, or would feel,
analogously about my preferred environments....

> Once upon a time (like 2.0 or something) the kernel continued
> processing network packets and such after panic, so setting up your
> firewall rules and then intentionally panicking the kernel was
> considered the most secure way to set up a Linux router.

Elegant!  Twisted, but elegant!

> But I only pull out gdb when I'm REALLY annoyed.  (Cure worse than
> the disease.  Can't STAND the user interface...)

I can...tolerate it.  Though the gdb people really seem to be doing
their best to make it less and less tolerable; every once in a while,
when I tangle with a newer version at work, I have to go digging to
figure out how to turn off yet another obnoxious piece.  (The last
iteration of this, I actually had to dive into the source; their
documentation is only barely better than nothing.)

The closest thing I can recall using to a _good_ debugger for C was
probably ups.  Wish I'd kept a copy; it'd be an interesting task to
port it to something more modern, and if I succeeded it would be really
nice to have.

>> [...]
> Except sigsetjmp() is writing to the structure.  The function is not
> supposed to be reading from the structure.

It's not supposed to be crashing, either.

But, given what you said above, I now think it more likely it's using
stack trash from elsewhere on the stack.

Does qemu have the ability to track the initialized status of memory, a
la valgrind?  (I'd suggest valgrind itself except that (a) I think it
doesn't support Super-H, (b) I'd be surprised if you hadn't already
thought of it, and (c) it's intrusive enough it could well disturb a
heisenbug like this one.)  I added uninitialized tracking to my
userland SPARC emulator; it's what let me find that uninitialized
memory use in libc I mentioned above.  Something like that would be the
next tool I'd be looking for for this one.  (Indeed, if it were my
headache I'd likely take my userland SPARC emulator and replace the CPU
code with Super-H emulation.)
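
A minimal sketch of what such tracking can look like, assuming a toy
byte-addressed emulated memory. This is not the SPARC emulator's actual
code; struct emu_mem and the emu_* functions are hypothetical names.
Each data byte gets a shadow flag that stores set and loads check.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MEM_SIZE 256

struct emu_mem {
    uint8_t  data[MEM_SIZE];   /* emulated memory contents */
    uint8_t  init[MEM_SIZE];   /* shadow: 1 once the byte has been stored to */
    unsigned uninit_reads;     /* loads that hit a never-written byte */
};

/* Start with all memory marked uninitialized. */
void emu_reset(struct emu_mem *m)
{
    memset(m, 0, sizeof *m);
}

/* Store one byte and mark it initialized. Returns -1 if out of range. */
int emu_store8(struct emu_mem *m, size_t addr, uint8_t value)
{
    if (addr >= MEM_SIZE)
        return -1;
    m->data[addr] = value;
    m->init[addr] = 1;
    return 0;
}

/* Load one byte, counting the access if the byte was never written. */
int emu_load8(struct emu_mem *m, size_t addr, uint8_t *out)
{
    if (addr >= MEM_SIZE)
        return -1;
    if (!m->init[addr])
        m->uninit_reads++;     /* a real emulator would log PC and address */
    *out = m->data[addr];
    return 0;
}

/* Mark a range as junk again, e.g. stack space released on function return. */
void emu_invalidate(struct emu_mem *m, size_t addr, size_t len)
{
    if (addr >= MEM_SIZE)
        return;
    if (len > MEM_SIZE - addr)
        len = MEM_SIZE - addr;
    memset(&m->init[addr], 0, len);
}
```

Re-marking popped stack space as uninitialized is what would catch the
"stack trash" case: the bytes were written by some earlier call, but
not by the code now reading them.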

/~\ The ASCII				  Mouse
\ / Ribbon Campaign
 X  Against HTML		mouse@...ents-montreal.org
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B