Date: Fri, 02 Nov 2018 23:59:13 +0800
From: "dirk@...iot.com" <dirk@...iot.com>
To: Rich Felker <dalias@...c.org>
Cc: musl <musl@...ts.openwall.com>
Subject: Re: Deadlock when calling fflush/fclose in multiple threads


In our case, fflush(NULL) is called by Lua's popen. We run free/df/loadavg through popen from multiple threads every five seconds to collect device status. To reproduce the bug, I call popen in an infinite loop from two threads; the deadlock then happens within one second in our application.
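
For reference, the reproduction can be sketched in C roughly as follows. This is only a sketch, not our actual application code: popen_once and run_repro are hypothetical names, the loop is bounded rather than infinite so that it terminates on a libc without the bug, and the command to run is left to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdio.h>

/* Hypothetical helper: does what Lua's popen is described as doing above
   (fflush(NULL) followed by popen), then drains and closes the stream. */
static int popen_once(const char *cmd)
{
    fflush(NULL);                   /* flushes every open stream */
    FILE *f = popen(cmd, "r");
    if (!f) return -1;
    char buf[256];
    while (fgets(buf, sizeof buf, f)) {
        /* discard the command's output */
    }
    return pclose(f);               /* closes the stream and reaps the child */
}

struct job { const char *cmd; int iterations; int completed; };

static void *worker(void *arg)
{
    struct job *j = arg;
    for (int i = 0; i < j->iterations; i++)
        if (popen_once(j->cmd) != -1)
            j->completed++;
    return NULL;
}

/* Runs `iterations` popen cycles in each of two threads and returns the
   total number of cycles that completed, or -1 if a thread could not start. */
int run_repro(const char *cmd, int iterations)
{
    pthread_t t[2];
    struct job jobs[2] = { { cmd, iterations, 0 }, { cmd, iterations, 0 } };
    int started = 0;
    for (; started < 2; started++)
        if (pthread_create(&t[started], NULL, worker, &jobs[started]) != 0)
            break;
    for (int i = 0; i < started; i++)
        pthread_join(t[i], NULL);
    if (started < 2) return -1;
    return jobs[0].completed + jobs[1].completed;
}
```

On an affected musl, a call such as run_repro("free", 1000000) would be expected to hang rather than return, with both threads stuck inside fflush/fclose.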

This deadlock happens on x86-64 OpenWrt (in a virtual machine); we have not seen it on the NXP i.MX6UL (ARM Cortex-A9) or the TP-Link 702N (mips24_kc). I thought it might be caused by 64-bit, so we used glibc for those amd64 VMs. Last month we started porting our application to the Raspberry Pi 3B+ and got the same deadlock (the executed commands become zombies), so I got a chance to look at this issue again.




-------- Original Message --------
From: Rich Felker <dalias@...c.org>
Date: 2 November 2018, 22:29
To: dirk@...iot.com
Cc: musl <musl@...ts.openwall.com>
Subject: Re: [musl] Deadlock when calling fflush/fclose in multiple threads

>On Fri, Nov 02, 2018 at 01:11:00PM +0800, dirk@...iot.com wrote:
>> Hi,
>> 
>> We got a deadlock in fflush/fclose with musl-1.1.19 (OpenWrt 18.06).
>> We are using Lua's popen in multiple threads; the following is the
>> gdb trace.
>> 
>> I am new to the musl libc source code. fflush(NULL) calls __ofl_lock
>> and then tries to lock and flush every stream, while fclose locks the
>> stream and then calls __ofl_lock. The question is: are the
>> fflush/fclose APIs thread-safe? What I got from the man pages is
>> that fflush/fclose on Linux are thread-safe APIs.
>
>Your analysis is exactly correct. Calling fflush(NULL) frequently (or
>at all) is a really bad idea because of how it scales and how
>serializing it is, but it is valid, and the deadlock is a bug in musl.
>
>The current placement of the ofl update seems to have been based on
>minimizing how serializing fclose is, and on avoiding taking the
>global lock for F_PERM (stdin/out/err) FILEs (which is largely a
>useless optimization since the operation can happen at most 3 times).
>Just moving it above the FLOCK (and making it not conditional on
>F_PERM, to avoid data races) would solve this, but there's a deeper
>bug here too.
>
>By removing the FILE being closed from the open file list (and
>unlocking the open file list, without which the removal can't be seen)
>before it's flushed and closed, fclose creates a race window where
>fflush(NULL) or exit() from another thread can complete without this
>file being flushed, potentially causing data loss.
>
>I think we just have to move the __ofl_lock to the top of the
>function, before FLOCK, and the __ofl_unlock to after the
>fflush/close. Unfortunately this makes fclose much more serializing
>than it was before, but I don't see any way to avoid it.
>
>Rich
>
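
To illustrate the lock ordering discussed in the quoted message, here is a small model using plain pthread mutexes. It is a sketch under stated assumptions, not musl's code: ofl_lock and stream_lock are stand-ins for __ofl_lock and FLOCK, there is only one modeled stream, and all function names are made up for the example. It shows the proposed ordering, where the close path takes the list lock first and holds it until the flush/close is done.

```c
#include <pthread.h>

/* Illustrative stand-ins, not musl's real internals. */
static pthread_mutex_t ofl_lock = PTHREAD_MUTEX_INITIALIZER;    /* models __ofl_lock */
static pthread_mutex_t stream_lock = PTHREAD_MUTEX_INITIALIZER; /* models FLOCK(f) */
static long ops;  /* counts operations done while holding both locks */

/* Models fflush(NULL): open file list lock first, then the stream lock. */
static void model_fflush_all(void)
{
    pthread_mutex_lock(&ofl_lock);
    pthread_mutex_lock(&stream_lock);
    ops++;                              /* "flush" the stream */
    pthread_mutex_unlock(&stream_lock);
    pthread_mutex_unlock(&ofl_lock);
}

/* Models fclose with the proposed ordering: list lock at the top, before
   the stream lock, released only after the flush/close. The deadlocking
   order was the reverse: stream lock first, then the list lock. */
static void model_fclose_fixed(void)
{
    pthread_mutex_lock(&ofl_lock);
    pthread_mutex_lock(&stream_lock);
    ops++;                              /* "flush and close" while still listed */
    pthread_mutex_unlock(&stream_lock);
    pthread_mutex_unlock(&ofl_lock);
}

struct loop { void (*fn)(void); int n; };

static void *run_loop(void *arg)
{
    struct loop *l = arg;
    for (int i = 0; i < l->n; i++)
        l->fn();
    return NULL;
}

/* Runs n modeled fflush(NULL) calls and n modeled fclose calls in two
   threads; returns the number of completed operations, or -1 on error. */
long run_model(int n)
{
    pthread_t a, b;
    struct loop la = { model_fflush_all, n };
    struct loop lb = { model_fclose_fixed, n };
    ops = 0;
    if (pthread_create(&a, NULL, run_loop, &la) != 0)
        return -1;
    if (pthread_create(&b, NULL, run_loop, &lb) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return ops;
}
```

Because both paths acquire the two locks in the same order, the model cannot deadlock. Swapping the two lock calls in model_fclose_fixed would recreate the inversion described above, and the two threads could then block each other.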
