musl - Re: Fix pthread_create on some devices failing to initialize guard area

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <81f188cf-3fb8-2899-5c24-ec72d38ad300@gmail.com>
Date: Mon, 30 Jan 2017 13:30:00 -0800
From: Eric Hassold <hassold@...il.com>
To: musl@...ts.openwall.com
Subject: Re: Fix pthread_create on some devices failing to initialize
 guard area



On 1/20/17 2:42 PM, Eric Hassold wrote:
>
>
> On 1/20/17 1:29 PM, Rich Felker wrote:
>> On Fri, Jan 20, 2017 at 01:04:28PM -0800, Eric Hassold wrote:
>>> On 1/20/17 11:56 AM, Rich Felker wrote:
>>>> On Fri, Jan 20, 2017 at 11:45:09AM -0800, Eric Hassold wrote:
>>>>> Hi All,
>>>>>
>>>>> While deploying test static executable across farm of different
>>>>> embedded systems, found out that pthread_create() is failing
>>>>> systematically on some (very few) arm-linux devices whenever non
>>>>> null stack guard is enabled (that is, also when calling
>>>>> pthread_create with default - i.e. null - attributes since default
>>>>> is a one page of guard). One of those device is for example a
>>>>> Marvell Armada 375 running Linux 3.10.39. Same test code, built with
>>>>> alternative libc implementations (glibc, uClibc) works as expected
>>>>> on those devices.
>>>>>
>>>>>
>>>>> Issue
>>>>>
>>>>> This occurs because of call to mprotect() in pthread_create fails.
>>>>> In current implementation, if guard size is non null, memory for
>>>>> (guard + stack + ...) is first allocated (mmap'ed) with no
>>>>> accessibility (PROT_NONE), then mprotect() is called to re-enable
>>>>> read/write access to (memory + guardsize). Since call to mprotect()
>>>>> systematically fails in this scenario (returning error code EINVAL),
>>>>> it is impossible to create thread.
>>>> Failure is ignored and the memory is assumed to be writable in this
>>>> case, since EINVAL is assumed to imply no MMU. Is this assumption
>>>> wrong in your case, and if so, can you explain why?
>>> In my case, devices exhibiting issue are not MMU-less, they are
>>> Cortex-A9 devices with valid mmu / page protection working as
>>> expected otherwise. Note that current Musl code assumes ENOSYS means
>>> no MMU and handles it by assuming the system has no page protection
>>> at all. For the case I observe, it is EINVAL which is returned, this
>>> is not ignored, so memory is unmap'ed and pthread_create() fails.
>> In that case I think this is a kernel bug. Do you know why EINVAL is
>> happening? If there's an MMU, Linux should be able to replace the
>> anon PROT_NONE pages with anon RW pages.
>
> Agree. Unfortunately, those are devices we don't built the kernel for, 
> so have been hardly able to track issue deeper. The point is however 
> that such devices with this issue in kernel might not be that 
> uncommon, and it concretely means impossibility at that moment to 
> deploy on them a functional static executable built with musl.
>>
>>>>> In proposed patch (attached below), memory for (guard + stack + ...)
>>>>> is first mmap'ed with read/write accessibility, then guard area is
>>>>> protected by calling mprotect() with PROT_NONE on guardsize first
>>>>> bytes of returned memory. This call to mprotect() to remove all
>>>>> accessibility on guard area, with guard area being at beginning of
>>>>> previously mmap'ed memory, works correctly on those platforms having
>>>>> issue with current implementation. Incidentally, this makes the
>>>>> logic more concise to handle both cases (with or without guard) is a
>>>>> more consistent way, and handle systems with partial/invalid page
>>>>> protection implementation (e.g. mprotect() returning ENOSYS) more
>>>>> gracefully since the stack is explicitly created with read/write
>>>>> access.
>>>> This doesn't work correctly on normal systems with mmu, because the
>>>> size of the guard pages is accounted against commit charge. Linux
>>>> should, but AFAIK doesn't, subtract it from commit charge once it's
>>>> changed to PROT_NONE without having been dirtied, but even if this bug
>>>> is fixed on the kernel side, there would still be a moment where
>>>> excess commit charge is consumed and thus where pthread_create might
>>>> spuriously fail or cause allocations in other processes/threads to
>>>> fail.
>>>>
>>>> If the kernel is not allocating actually-usable address ranges for
>>>> PROT_NONE on all nommu systems, I think the only solution is to handle
>>>> EINVAL from mprotect by going back and re-doing the mmap with
>>>> PROT_READ|PROT_WRITE. Do you have any better ideas?
>>>>
>>>> Rich
>>> Had this "deja vu" feeling... reminds me conversation you had in
>>> this thread some time ago elsewhere...
>>> https://sourceware.org/ml/libc-alpha/2015-09/msg00447.html
>>>
>>> Your proposition seems reasonable on nommu system, but again, the
>>> issue observed here is on legit systems with mmu, with mprotect
>>> failing with EINVAL (and not ENOSYS), for some other reason than
>>> system not supporting page protection. Catching EINVAL error
>>> returned by mprotect and falling back to re-doing the mmap would
>>> mean actually silently running without stack guard on system
>>> supporting it, so I believe it is actually legitimate to fail and
>>> return error in that case. But that's difference use case than the
>>> issue I'm observing.
>>>
>>> I took note of Balazs's suggestion (in the thread referenced above)
>>> to switch to a pattern similar to Musl's current one
>>> (mmap(PROT_NONE) + mprotect(stack, READ|WRITE)) in order to avoid
>>> those guard pages to actually occupy resources. But I can indeed
>>> observe that this approach fails on some devices (which have valid
>>> mmu), while I'm not sure I'm seeing the issue with first mapping
>>> PROT_READ|PROT_WRITE then mprotect(PROT_NONE) guard area. Latter
>>> approach (as implemented by patch) is, at least, consistent with all
>>> the other implementations out there (I checked glibc's
>>> allocatestack.c, but also e.g. bionic), and couldn't find report of
>>> those failures you are envisaging.
>> Consider the case of guard_size=128M stack_size=128k with
>> Commit_Limit=128M. This will fail with your approach but works
>> perfectly well now.
> Right, makes sense. Though I would point out that such uncommon 
> scenario for an application would fail when linked with anything but 
> musl, since "my approach" is the widespread one across all other libc 
> implementations (glibc, bionic, ...). But I understand both approaches 
> aim at working around issue in some corner case while creating issue 
> in other rare one, and so at the end it's all about deciding which 
> edge case is more critical than the other, and whether 
> consistency/compatibility with existing 3rd parties solutions matters 
> or not.
>
> But in order to get closer to "having the cake and eat it too", please 
> find attached another patch implementing the "unmaping and re-doing" 
> strategy you initially suggested, i.e. starting with current approach, 
> then giving it a second chance using "my" approach if and only if 
> issue with mprotect() is detected. Should be consistent with current 
> behavior on any system working correctly today, and just provide a 
> plan B falling back to more common approach only when needed. Does 
> that sound more acceptable?
>
> Thanks,
>
> Eric
>
> ----
> [PATCH] set up guard protection after stack is mapped
>
> calling mprotect beyond the guard page of PROT_NONE mmap-ed stack 
> fails on
> some devices with buggy kernel, making it impossible to create
> thread. if such condition is met, fall back to allocating memory
> for stack and guard first, with read and write permission, then 
> protect guard
> area.
> ---
>  src/thread/pthread_create.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/src/thread/pthread_create.c b/src/thread/pthread_create.c
> index 49f2f72..d3c030b 100644
> --- a/src/thread/pthread_create.c
> +++ b/src/thread/pthread_create.c
> @@ -242,7 +242,17 @@ int __pthread_create(pthread_t *restrict res, 
> const pthread_attr_t *restrict att
>              if (__mprotect(map+guard, size-guard, PROT_READ|PROT_WRITE)
>                  && errno != ENOSYS) {
>                  __munmap(map, size);
> -                goto fail;
> +                if (errno == EINVAL) {
> +                    map = __mmap(0, size, PROT_READ|PROT_WRITE, 
> MAP_PRIVATE|MAP_ANON, -1, 0);
> +                    if (map == MAP_FAILED) goto fail;
> +                    if (__mprotect(map, guard, PROT_NONE)
> +                        && errno != ENOSYS) {
> +                        __munmap(map, size);
> +                        goto fail;
> +                                        }
> +                } else {
> +                    goto fail;
> +                                }
>              }
>          } else {
>              map = __mmap(0, size, PROT_READ|PROT_WRITE, 
> MAP_PRIVATE|MAP_ANON, -1, 0);

Hi Rich,

Pinging... any comment, feedback or concern about latest version of the 
patch, attached above, keeping current behavior but falling back to 
(mmap(PROT_READ|PROT_WRITE) && mprotect(guard, none)) if and only if 
current approach detected to fail) ?

Eric
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.