Date: Tue, 16 Nov 2021 10:19:59 +0100
From: Lukas Bulwahn <lukas.bulwahn@...il.com>
To: Petr Mladek <pmladek@...e.com>
Cc: Alexander Popov <alex.popov@...ux.com>,
    Gabriele Paoloni <gpaoloni@...hat.com>,
    Robert Krutsch <krutsch@...il.com>,
    Linus Torvalds <torvalds@...ux-foundation.org>,
    Jonathan Corbet <corbet@....net>,
    Paul McKenney <paulmck@...nel.org>,
    Andrew Morton <akpm@...ux-foundation.org>,
    Thomas Gleixner <tglx@...utronix.de>,
    Peter Zijlstra <peterz@...radead.org>,
    Joerg Roedel <jroedel@...e.de>,
    Maciej Rozycki <macro@...am.me.uk>,
    Muchun Song <songmuchun@...edance.com>,
    Viresh Kumar <viresh.kumar@...aro.org>,
    Robin Murphy <robin.murphy@....com>,
    Randy Dunlap <rdunlap@...radead.org>,
    Lu Baolu <baolu.lu@...ux.intel.com>,
    Kees Cook <keescook@...omium.org>,
    Luis Chamberlain <mcgrof@...nel.org>,
    Wei Liu <wl@....org>,
    John Ogness <john.ogness@...utronix.de>,
    Andy Shevchenko <andriy.shevchenko@...ux.intel.com>,
    Alexey Kardashevskiy <aik@...abs.ru>,
    Christophe Leroy <christophe.leroy@...roup.eu>,
    Jann Horn <jannh@...gle.com>,
    Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
    Mark Rutland <mark.rutland@....com>,
    Andy Lutomirski <luto@...nel.org>,
    Dave Hansen <dave.hansen@...ux.intel.com>,
    Steven Rostedt <rostedt@...dmis.org>,
    Will Deacon <will@...nel.org>,
    Ard Biesheuvel <ardb@...nel.org>,
    Laura Abbott <labbott@...nel.org>,
    David S Miller <davem@...emloft.net>,
    Borislav Petkov <bp@...en8.de>,
    Arnd Bergmann <arnd@...db.de>,
    Andrew Scull <ascull@...gle.com>,
    Marc Zyngier <maz@...nel.org>,
    Jessica Yu <jeyu@...nel.org>,
    Iurii Zaikin <yzaikin@...gle.com>,
    Rasmus Villemoes <linux@...musvillemoes.dk>,
    Wang Qing <wangqing@...o.com>,
    Mel Gorman <mgorman@...e.de>,
    Mauro Carvalho Chehab <mchehab+huawei@...nel.org>,
    Andrew Klychkov <andrew.a.klychkov@...il.com>,
    Mathieu Chouquet-Stringer <me@...hieu.digital>,
    Daniel Borkmann <daniel@...earbox.net>,
    Stephen Kitt <steve@....org>,
    Stephen Boyd <sboyd@...nel.org>,
    Thomas Bogendoerfer <tsbogend@...ha.franken.de>,
    Mike Rapoport
    <rppt@...nel.org>,
    Bjorn Andersson <bjorn.andersson@...aro.org>,
    Kernel Hardening <kernel-hardening@...ts.openwall.com>,
    linux-hardening@...r.kernel.org,
    "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>,
    linux-arch <linux-arch@...r.kernel.org>,
    Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
    linux-fsdevel <linux-fsdevel@...r.kernel.org>,
    notify@...nel.org,
    main@...ts.elisa.tech,
    safety-architecture@...ts.elisa.tech,
    devel@...ts.elisa.tech,
    Shuah Khan <shuah@...nel.org>
Subject: Re: [ELISA Safety Architecture WG] [PATCH v2 0/2] Introduce the pkill_on_warn parameter

On Tue, Nov 16, 2021 at 9:41 AM Petr Mladek <pmladek@...e.com> wrote:
>
> On Tue 2021-11-16 10:52:39, Alexander Popov wrote:
> > On 15.11.2021 18:51, Gabriele Paoloni wrote:
> > > On 15/11/2021 14:59, Lukas Bulwahn wrote:
> > > > On Sat, Nov 13, 2021 at 7:14 PM Alexander Popov <alex.popov@...ux.com> wrote:
> > > > > On 13.11.2021 00:26, Linus Torvalds wrote:
> > > > > > On Fri, Nov 12, 2021 at 10:52 AM Alexander Popov <alex.popov@...ux.com> wrote:
> > > > > Killing the process that hit a kernel warning complies with the Fail-Fast
> > > > > principle. pkill_on_warn sysctl allows the kernel to stop the process when
> > > > > the **first signs** of wrong behavior are detected.
> > > > >
> > > > In summary, I am not supporting pkill_on_warn. I would support the
> > > > other points I mentioned above, i.e., a good enforced policy for use
> > > > of warn() and any investigation to understand the complexity of
> > > > panic() and reducing its complexity if triggered by such an
> > > > investigation.
> > > >
> > > Hi Alex
> > >
> > > I also agree with the summary that Lukas gave here. From my experience
> > > the safety systems are always guarded by an external flow monitor (e.g. a
> > > watchdog) that triggers in case the safety relevant workloads slow down
> > > or block (for any reason); given this condition of use, a system that
> > > goes into the panic state is always safe, since the watchdog would
> > > trigger and drive the system automatically into safe state.
> > > So I also don't see a clear advantage of having pkill_on_warn();
> > > actually on the flip side it seems to me that such a feature could
> > > introduce more risk, as it kills only the threads of the process that
> > > caused the kernel warning whereas the other processes are trusted to
> > > run on a weaker Kernel (does killing the threads of the process that
> > > caused the kernel warning always fix the Kernel condition that led to
> > > the warning?)
> >
> > Lukas, Gabriele, Robert,
> > Thanks for showing this from the safety point of view.
> >
> > The part about believing in panic() functionality is amazing :)
>
> Nothing is 100% reliable.
>
> With printk() maintainer hat on, the current panic() implementation
> is less reliable because it tries hard to provide some debugging
> information, for example, error message, backtrace, registers,
> flush pending messages on console, crashdump.
>
> See panic() implementation, the reboot is done by emergency_restart().
> The rest is about dumping the information.
>
> Well, the information is important. Otherwise, it is really hard to
> fix the problem.
>
> From my experience, especially the access to consoles is not fully
> safe. The reliability might improve a lot when a lockless console
> is used. I guess that using non-volatile memory for the log buffer
> might be even more reliable.
>
> I am not familiar with the code under emergency_restart(). I am not
> sure how reliable it is.
>
> > Yes, safety critical systems depend on the robust ability to restart.
>
> If I wanted to implement a super-reliable panic() I would
> use some external device that would cause power-reset when
> the watched device is not responding.
>

Petr, that is basically the common system design taken. The whole
challenge then remains to show that:

Once panic() was invoked, the watched device does not unintentionally
signal that it is alive while panic() is stuck in its shutdown routines.

That requires a panic() or other shutdown routine that can still
reliably ensure that the kernel routine which makes the watched device
signal no longer does so.

Lukas

> Best Regards,
> Petr
>
>
> PS: I do not believe much in the pkill approach either.
>
> It is similar to OOM killer. And I always had to restart the
> system when it was triggered.
>
> Also kernel is not prepared for the situation that an external
> code kills a kthread. And kthreads are used by many subsystems
> to handle work that has to be done asynchronously and/or in
> process context. And I guess that kthreads are a non-trivial
> source of WARN().
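
The watchdog condition described in the reply above can be illustrated
with a small userspace model. This is only a sketch under stated
assumptions, not kernel code and not anything posted in the thread: the
names watchdog_reset_tick(), device_signals_alive() and
kick_leaks_after_panic are hypothetical, time is modelled as discrete
ticks, and the external watchdog is assumed to force a reset after
`timeout` consecutive ticks without an alive signal.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model: does the watched device emit an alive signal
 * on this tick? Before the simulated panic it always does. After it,
 * it only does so if the signalling path "leaks", which is the failure
 * case that has to be ruled out.
 */
static bool device_signals_alive(bool panicked, bool kick_leaks_after_panic)
{
    if (!panicked)
        return true;
    return kick_leaks_after_panic;
}

/*
 * Simulate an external watchdog observing the device.
 * Returns the tick at which the watchdog forces a reset, or -1 if no
 * reset happens within max_ticks.
 */
int watchdog_reset_tick(bool kick_leaks_after_panic, int panic_tick,
                        int timeout, int max_ticks)
{
    bool panicked = false;
    int missed = 0;

    assert(timeout > 0);

    for (int tick = 0; tick < max_ticks; tick++) {
        if (tick == panic_tick)
            panicked = true;        /* simulated panic() entry */

        if (device_signals_alive(panicked, kick_leaks_after_panic))
            missed = 0;             /* watchdog was kicked */
        else if (++missed >= timeout)
            return tick;            /* watchdog fires, system is reset */
    }
    return -1;                      /* watchdog never fired */
}
```

In this model, a panic at tick 5 with a timeout of 3 leads to a reset at
tick 7 as long as the alive signal stops. If the signal keeps leaking
after the panic, the function returns -1: the watchdog never fires, and
the system stays stuck in its shutdown path.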