Date: Tue, 20 Oct 2020 22:28:13 +0200
From: David Hildenbrand <david@...hat.com>
To: "Guilherme G. Piccoli" <gpiccoli@...onical.com>,
 Michal Hocko <mhocko@...e.com>, Mike Kravetz <mike.kravetz@...cle.com>
Cc: linux-mm@...ck.org, kernel-hardening@...ts.openwall.com,
 linux-hardening@...r.kernel.org, linux-security-module@...r.kernel.org,
 kernel@...ccoli.net, cascardo@...onical.com,
 Alexander Potapenko <glider@...gle.com>,
 James Morris <jamorris@...ux.microsoft.com>, Kees Cook <keescook@...omium.org>
Subject: Re: [PATCH] mm, hugetlb: Avoid double clearing for hugetlb pages

On 20.10.20 22:07, David Hildenbrand wrote:
> On 20.10.20 21:19, Guilherme G. Piccoli wrote:
>> Hi Michal, thanks a lot for your thorough response. I'll address the
>> comments inline, below. Thanks also David and Mike - in fact, I almost
>> don't need to respond here after Mike, he was right to the point I'm
>> going to discuss heh...
>>
>>
>> On 20/10/2020 05:20, Michal Hocko wrote:
>>>
>>> Yes zeroing is quite costly and that is to be expected when the feature
>>> is enabled. Hugetlb like other allocator users perform their own
>>> initialization rather than go through __GFP_ZERO path. More on that
>>> below.
>>>
>>> Could you be more specific about why this is a problem. Hugetlb pool is
>>> usually preallocated once during early boot. 24s for 65GB of 2MB pages
>>> is non trivial amount of time but it doesn't look like a major disaster
>>> either. If the pool is allocated later it can take much more time due to
>>> memory fragmentation.
>>>
>>> I definitely do not want to downplay this but I would like to hear about
>>> the real life examples of the problem.
>>
>> Indeed, 24s of delay (!) is not so harmful for boot time, but...64G was
>> just my simple test in a guest, the real case is much worse! It aligns
>> with Mike's comment, we have complaints of minute-like delays, due to a
>> very big pool of hugepages being allocated.
>>
>> Users have their own methodology for allocating pages, some would prefer
>> to do that "later" for a variety of reasons, so early boot time allocations
>> are not always used, that shouldn't be the only focus of the discussion
>> here.
>> In the specific report I had, the user complains about more than 3
>> minutes to allocate ~542G of 2M hugetlb pages.
>>
>> Now, you'll ask why in the heck they are using init_on_alloc then -
>> right? So, the Kconfig option "CONFIG_INIT_ON_ALLOC_DEFAULT_ON" is set
>> by default in Ubuntu, for hardening reasons. So, the workaround for the
>> users complaining of delays in allocating hugetlb pages currently is to
>> set "init_on_alloc" to 0. It's a bit lame to ask users to disable such
>> hardening thing just because we have a double initialization in hugetlb...
>>
>>
>>>
>>>
>>> This has been discussed already (http://email@example.com.
>>> Previously it has been brought up in SLUB context AFAIR. Your numbers
>>> are quite clear here but do we really need a gfp flag with all the
>>> problems we tend to grow in with them?
>>>
>>> One potential way around this specifically for hugetlb would be to use
>>> __GFP_ZERO when allocating from the allocator and marking the fact in
>>> the struct page while it is sitting in the pool. Page fault handler
>>> could then skip the zeroing phase. Not an act of beauty TBH but it
>>> fits into the existing model of the full control over initialization.
>>> Btw. it would allow to implement init_on_free semantic as well. I
>>> haven't implemented the actual two main methods
>>> hugetlb_test_clear_pre_init_page and hugetlb_mark_pre_init_page because
>>> I am not entirely sure about the current state of hugetlb struct page in
>>> the pool. But there should be a lot of room in there (or in tail pages).
>>> Mike will certainly know much better. But the skeleton of the patch
>>> would look like something like this (not even compile tested).
>>> [code...]
>>
>> Thanks a lot for pointing the previous discussion for me! I should have
>> done my homework properly and read all versions of the patchset...my
>> bad! I'm glad to see this problem was discussed and considered early in
>> the patch submission, I guess it only missed more real-world numbers.
>>
>> Your approach seems interesting, but as per Mike's response (which seems
>> to have anticipated all my arguments heheh) your approach is a bit
>> reversed, solving a ""non-existent"" problem (of zeroing hugetlb pages
>> in fault time), whereas the big problem hereby tentatively fixed is the
>> massive delay on allocation time of the hugetlb pages.
>>
>> I understand that your suggestion has no burden of introducing more GFP
>> flags, and I agree that those are potentially dangerous if misused (and
>> I totally agree with David that __GFP_NOINIT_ON_ALLOC is heinous, I'd
>> rather go with the originally proposed __GFP_NO_AUTOINIT), but...
>> wouldn't it be letting the code just drive a design decision? Like "oh,
>> adding a flag is so bad..better just let this bug/perf issue to stay".
>
> The main problem I have is that page alloc code does some internal page
> allocator things ("init_on_alloc" - "Fill newly allocated pages and heap
> objects with zeroes"), and we're allowing users of page alloc code *that
> really shouldn't have to care* to override that behavior, exposing
> unnecessary complexity. Mainly: other allocators.
>
> "__GFP_NOINIT_ON_ALLOC" - what exactly does it do?
> "__GFP_NO_AUTOINIT" - what exactly does it do?
>
> __GFP_ZERO set: page always zero.
> __GFP_ZERO not set: page zero with init_on_alloc, page not necessarily
>                     zero without init_on_alloc. Users can find out by
>                     looking at init_on_alloc.
>
> IMHO, even something like __GFP_DONT_ZERO would be clearer. But I still
> somewhat don't like letting users of the buddy override configured
> behavior. Yes, it could be used by other allocators (like hugetlb) to
> optimize.
>
> But it could also be used by any driver wanting to optimize the
> "init_on_alloc" case, eventually introducing security issues because the
> code tries to be smart.
>

BTW, there might be other users for something like __GFP_DONT_ZERO.
Especially, memory ballooning drivers (and virtio-mem), whereby the
hypervisor is (WHP) going to zap the page either way after allocation.
You just cannot assume that when freeing such a page again, that it's
actually zero. But then, somebody told the system to suffer
("init_on_alloc"), so there isn't too much motivation to optimize such
corner cases.

-- 
Thanks,

David / dhildenb
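
For reference, the zeroing rules quoted above (__GFP_ZERO set vs. not set,
with and without init_on_alloc) can be written down as a tiny userspace
model. This is only a sketch, not kernel code: MODEL_GFP_ZERO and the
model_* helpers are hypothetical names made up for illustration, and the
"+ 1" assumes hugetlb always clears the page itself at fault time, as
described earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for __GFP_ZERO; not the kernel's definition. */
#define MODEL_GFP_ZERO 0x1u

/*
 * True when the page allocator hands back an already-zeroed page:
 * always with __GFP_ZERO, otherwise only when init_on_alloc is enabled.
 */
static bool model_alloc_zeroes(unsigned int gfp, bool init_on_alloc)
{
	return (gfp & MODEL_GFP_ZERO) || init_on_alloc;
}

/*
 * Number of times a hugetlb page gets cleared before userspace sees it,
 * assuming hugetlb does its own clear at fault time (the assumption
 * stated in the lead-in above).
 */
static int model_hugetlb_clears(unsigned int gfp, bool init_on_alloc)
{
	int clears = model_alloc_zeroes(gfp, init_on_alloc) ? 1 : 0;

	return clears + 1; /* fault-time clear done by hugetlb itself */
}

/* Small self-check covering the cases discussed in the thread. */
static void model_self_check(void)
{
	assert(model_hugetlb_clears(0, false) == 1); /* init_on_alloc=0 */
	assert(model_hugetlb_clears(0, true) == 2);  /* the double clear */
}
```

In this model, model_hugetlb_clears(0, true) is the double-initialization
case the patch is about, and setting init_on_alloc to 0 (the current
workaround) brings it back down to a single clear.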