musl - Re: Re: musl getaddr info breakage on older kernels

Message-ID: <20220216014153.GM7074@brightrain.aerifal.cx>
Date: Tue, 15 Feb 2022 20:41:54 -0500
From: Rich Felker <dalias@...ifal.cx>
To: Satadru Pramanik <satadru@...il.com>
Cc: musl@...ts.openwall.com
Subject: Re: Re: musl getaddr info breakage on older kernels

On Tue, Feb 15, 2022 at 05:56:53PM -0500, Satadru Pramanik wrote:
> > OK, then in that case it's surely Docker's seccomp filters that are
> > the problem. I think --security-opt seccomp=unconfined is the part you
> > need to work around it.
> 
>  That's the command line I was using, which leads to the application NOT
> breaking, and thus doesn't allow me to replicate the problem:
>  docker run --security-opt seccomp=unconfined  --platform linux/386
> --cap-add SYS_PTRACE --rm -v $(pwd)/pkg_cache:/usr/local/tmp/packages -v
> $(pwd):/output -h $(hostname)-i686 -it satmandu/crewbuild:alex-i686.m58
> /usr/local/bin/setarch i686 sudo -i -u chronos /usr/local/bin/bash -i
> 
> The goal with docker was to try to replicate the breakage on the actual
> hardware, which is the place we are having this problem.

OK, from the beginning you haven't been clear about where the problem
actually happens. I was under the impression all along that it
happened only in a Docker environment. Before we continue, can you
please clarify the exact environment the problem happens in,
including:

- Whether any network traffic occurs when it fails (in the real
  environment, not a replicated one elsewhere).

- Whether it fails or succeeds under strace (in the real
  environment, not a replicated one elsewhere).

- Whether the real environment involves Docker or not.

- What's in resolv.conf (in the real environment, not a replicated
  one elsewhere) and what nameserver software (if known) is running
  on the nameserver(s) listed in there.

- Anything else that might be relevant.

It's really hard to offer any productive advice when the problem is
unclear.

> I ran the process through gdb on the hardware, and stepped through it with
> the timeit function from here: https://stackoverflow.com/a/48412363
> 
> Of note perhaps is the very long time it takes for some of these calls to
> return in gdb? (The program does run in gdb when stepping through the
> function, but not when run without the break point)
> my commands were in essence the following in gdb:
> add symbol table from file "/usr/local/share/musl/lib/libc.so"
> break main
> run google.com 2>>gdb.out.txt
> ti (repeated until the program exited)
> (I ran this twice, and both runs succeed with long delays)
> Then I ran (this, which fails):
> clear main
> run google.com 2>>gdb.out.txt
> 
> Any other suggestions on how to track down this issue?

Rather than stepping through, I would set a single breakpoint, before
running the test program, at a place where you want to know whether
execution gets there, then start the program and see if the
breakpoint fires or not. Then remove the breakpoint, add a different
one, and repeat. For example, see if __res_msend is ever called, and
if so, whether particular lines of it are reached (or just put
breakpoints on some of the functions it calls, like socket, bind,
recvfrom, poll, etc. to see if they're called).

It might also be useful to put a breakpoint on clock_gettime and then
'finish' to see what it returns (in case the problem is something
time64-related).