Date: Mon, 30 Jan 2017 18:52:46 -0800 From: Eric Hassold <hassold@...il.com> To: musl@...ts.openwall.com Subject: Re: Fix pthread_create on some devices failing to initialize guard area On 1/30/17 3:13 PM, Rich Felker wrote: > On Mon, Jan 30, 2017 at 01:30:00PM -0800, Eric Hassold wrote: >>>>>>> This occurs because of call to mprotect() in pthread_create fails. >>>>>>> In current implementation, if guard size is non null, memory for >>>>>>> (guard + stack + ...) is first allocated (mmap'ed) with no >>>>>>> accessibility (PROT_NONE), then mprotect() is called to re-enable >>>>>>> read/write access to (memory + guardsize). Since call to mprotect() >>>>>>> systematically fails in this scenario (returning error code EINVAL), >>>>>>> it is impossible to create thread. >>>>>> Failure is ignored and the memory is assumed to be writable in this >>>>>> case, since EINVAL is assumed to imply no MMU. Is this assumption >>>>>> wrong in your case, and if so, can you explain why? >>>>> In my case, devices exhibiting issue are not MMU-less, they are >>>>> Cortex-A9 devices with valid mmu / page protection working as >>>>> expected otherwise. Note that current Musl code assumes ENOSYS means >>>>> no MMU and handles it by assuming the system has no page protection >>>>> at all. For the case I observe, it is EINVAL which is returned, this >>>>> is not ignored, so memory is unmap'ed and pthread_create() fails. >>>> In that case I think this is a kernel bug. Do you know why EINVAL is >>>> happening? If there's an MMU, Linux should be able to replace the >>>> anon PROT_NONE pages with anon RW pages. >>> Agree. Unfortunately, those are devices we don't built the kernel >>> for, so have been hardly able to track issue deeper. The point is >>> however that such devices with this issue in kernel might not be >>> that uncommon, and it concretely means impossibility at that >>> moment to deploy on them a functional static executable built with >>> musl. >> [...] >> Pinging... any comment, feedback or concern about latest version of >> the patch, attached above, keeping current behavior but falling back >> to (mmap(PROT_READ|PROT_WRITE) && mprotect(guard, none)) if and only >> if current approach detected to fail) ? > I still want to know what's going on on the kernel side, because it > looks like this a rogue/nonsensical patch to the kernel that breaks > mmap functionality in a general way that has nothing to do with the > specific cpu/board. > > Rich Yes, I understand this would be ideal, and wish too I would be able to find out more about what's happening, beyond the symptomatic approach. But as mentioned, we have very little leverage to investigate much further what's going on, since issue wasn't reproducible on any device we built kernel for, but only on a few devices we don't build those kernels nor have easy way to perform some lower level debugging on those platforms (no jtag access). Telemetry was reporting failure on pthread creation on some devices, one of them we could reproduce in-house was a Marvell 375 board used e.g. in Western Digital MyCloud devices. But it seems hard to even just get reference to the actual GPL source of the exact kernel used there, so I don't see a way to find out if it's a broken patch introduced at one point by a vendor, or some transient regression fixed at one point but still having devices shipped with this oddity (kernel versions seems to be 3.10 branch, so one very rough guess is that it may be related to multiple patches done to support huge TLB on ARM at that time, but that's more of a guts feeling than supported by facts). Though I understand it is not the role of a libc implementation to work around all kind of flaws in kernel implementations that vendors ever shipped, I would find it valuable, in this specific case, to prevent musl from consistently bubbling up an issue not (easily) reproducible otherwise (e.g. with other libc implementations). One of the great value of musl is to allow easy deployment across a variety of devices (not as opened as we would ideally hope) of the same statically linked executable. This suggested patch aims at supporting this use case, since fixing kernel isn't an option there. Of course, this pragmatic motivation conflicts somehow with the legitimate goal to keep musl implementation lean and clean (from a functional standpoint, this doesn't introduce any change on behavior on all non broken platforms). So of course only you can value the pros (regarding deployment) and decide whether it is worth the cons of slightly "bloating" source code with alternative recovery path. Thanks, Eric
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.