Message ID: 20240902095203.1559361-1-mhocko@kernel.org (mailing list archive)
Series: remove PF_MEMALLOC_NORECLAIM
On Mon, Sep 02, 2024 at 11:51:48AM GMT, Michal Hocko wrote:
> The previous version was posted in [1]. Based on the review feedback I sent
> v2 of the patches in the same thread, and the review seems to have mostly
> settled on these patches. There is still an open discussion on whether a
> NORECLAIM allocator semantic (as compared to atomic) is worthwhile, and on
> how to deal with broken GFP_NOFAIL users, but those are not really relevant
> to this particular patchset because it 1) doesn't aim to implement either of
> the two and 2) aims to keep PF_MEMALLOC_NORECLAIM use from spreading while it
> lacks a properly defined semantic, now that it is not widely used and before
> it becomes much harder to fix.
>
> I have collected Reviewed-bys and am reposting here. These patches touch
> bcachefs, VFS and core MM, so I am not sure which tree to merge this through,
> but I guess going through Andrew makes the most sense.
>
> Changes since v1:
> - compile fixes
> - rather than dropping PF_MEMALLOC_NORECLAIM alone, reverted eab0af905bfc
>   ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN") as suggested
>   by Matthew.

To reiterate: This is a trainwreck of bad ideas. Nack.
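For readers not following the earlier thread: PF_MEMALLOC_NORECLAIM was a task
flag in the style of the existing scoped allocation-context APIs. A minimal
sketch of that established pattern, using the real memalloc_nofs_save() /
memalloc_nofs_restore() helpers (the reverted flag applied the same idea to
suppressing reclaim entirely; the callee below is hypothetical):

    /* Inside the scope, allocations behave as if __GFP_FS were cleared,
     * even when a callee passes plain GFP_KERNEL. */
    unsigned int nofs_flags = memalloc_nofs_save();

    err = helper_that_allocates(arg);   /* hypothetical callee */

    memalloc_nofs_restore(nofs_flags);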
On Mon, 2 Sep 2024 05:53:59 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote:
> On Mon, Sep 02, 2024 at 11:51:48AM GMT, Michal Hocko wrote:
> [...]
>
> To reiterate: This is a trainwreck of bad ideas. Nack.

It would be helpful to summarize your concerns.

What runtime impact do you expect this change will have upon bcachefs?
On Mon, Sep 02, 2024 at 02:52:52PM GMT, Andrew Morton wrote:
> On Mon, 2 Sep 2024 05:53:59 -0400 Kent Overstreet <kent.overstreet@linux.dev> wrote:
> [...]
>
> It would be helpful to summarize your concerns.
>
> What runtime impact do you expect this change will have upon bcachefs?

For bcachefs: I try really hard to minimize tail latency and make performance
robust in extreme scenarios - thrashing. A large part of that is that btree
locks must be held for no longer than necessary.

We definitely don't want to recurse into other parts of the kernel, taking
other locks (i.e. in memory reclaim) while holding btree locks; that's a great
way to stack up (and potentially multiply) latencies.

But gfp flags don't work with vmalloc allocations (and that's unlikely to
change), and we require vmalloc fallbacks for e.g. btree node allocation.
That's the big reason we want PF_MEMALLOC_NORECLAIM.

Besides that, it's just cleaner; memalloc flags are the direction we want to
be moving in, and they're going to be necessary if we ever want to do a
malloc() that doesn't require a gfp flags parameter. That would be a win for
safety and correctness in the kernel, and it's also likely required for proper
Rust support.

And the "GFP_NOFAIL must not fail" argument makes no sense, because failing a
GFP_NOFAIL allocation is the only sane thing to do if the allocation is buggy
(too big, i.e. resulting from an integer overflow bug, or the wrong context).
The alternatives are at best never returning (a stuck unkillable process), or
a scheduling-while-atomic bug, or - as Michal was even proposing - killing the
process (handling it like a BUG()!).

But we don't use BUG_ON() for things that we can't prove won't happen in the
wild when we can write an error path instead.

That is, PF_MEMALLOC_NORECLAIM lets us turn bugs into runtime errors.
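A hedged sketch of the latency pattern Kent describes - try a non-blocking
allocation while the lock is held, and only take the blocking path after
dropping it. The lock helpers are hypothetical, not actual bcachefs code:

    buf = kmalloc(size, GFP_NOWAIT | __GFP_NOWARN);   /* no reclaim, no sleeping */
    if (!buf) {
        btree_node_unlock(b);                  /* hypothetical */
        buf = kmalloc(size, GFP_KERNEL);       /* may reclaim; lock not held */
        btree_node_relock(b);                  /* hypothetical */
        if (!buf)
            return -ENOMEM;
    }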
On Mon, Sep 02, 2024 at 02:52:52PM -0700, Andrew Morton wrote:
> It would be helpful to summarize your concerns.
And that'd better be a really good argument for a change that was
pushed directly to Linus bypassing the maintainer after multiple
reviewers pointed out it was broken. This series simply undoes the
damage done by that, while also keeping the code dependent on it
working.
On Mon 02-09-24 18:32:33, Kent Overstreet wrote:
> On Mon, Sep 02, 2024 at 02:52:52PM GMT, Andrew Morton wrote:
> [...]
> > What runtime impact do you expect this change will have upon bcachefs?
>
> For bcachefs: I try really hard to minimize tail latency and make
> performance robust in extreme scenarios - thrashing. A large part of
> that is that btree locks must be held for no longer than necessary.
>
> We definitely don't want to recurse into other parts of the kernel,
> taking other locks (i.e. in memory reclaim) while holding btree locks;
> that's a great way to stack up (and potentially multiply) latencies.

OK, and these two patches do not change that. The only existing user is
turned into GFP_NOWAIT, so the final code works the same way. Right?

> But gfp flags don't work with vmalloc allocations (and that's unlikely
> to change), and we require vmalloc fallbacks for e.g. btree node
> allocation. That's the big reason we want PF_MEMALLOC_NORECLAIM.

Have you even tried to reach out to the vmalloc maintainers and asked for
GFP_NOWAIT support for vmalloc? Because I do not remember that. Sure,
kernel page tables have a hardcoded GFP_KERNEL context, which slightly
complicates things, but that doesn't mean the only potential solution is
to use a per-task flag to override it. Just off the top of my head, we
could consider pre-allocating virtual address space for non-sleeping
allocations. Maybe there are other options that only people deeply
familiar with the vmalloc internals can see. This requires discussion,
not pushing one very particular solution through.
On Mon, Sep 02, 2024 at 06:32:40PM GMT, Kent Overstreet wrote:
[...]
> That is, PF_MEMALLOC_NORECLAIM lets us turn bugs into runtime errors.

BTW, one of the reasons I've been giving this issue so much attention is that
filesystem folks have mentioned they want GFP_NOFAIL semantics more widely,
and I actually _don't_ think that's a crazy idea, provided we go about it the
right way.

Not having error paths is right out; many allocations, when you start to look
through more obscure code, have sizes that are controlled by userspace, so
we'd be opening ourselves up to trivially exploitable security bugs.

However, if we agreed that GFP_NOFAIL meant "only fail if it is not possible
to satisfy this allocation" (and I have been arguing that that is the only
sane meaning) - then that could lead to a lot of error paths getting simpler.

Because there are a lot of places where there's essentially no good reason to
bubble up an -ENOMEM to userspace; if we're actually out of memory, the
current allocation is just one out of many and not particularly special -
better to let the oom killer handle it...

So the error paths would be more along the lines of "there's a bug, or
userspace has requested something crazy, just shut down gracefully".

While we're at it, the definition of what allocation size is "too big" is
something we'd want to look at. Right now it's hardcoded to INT_MAX for
non-GFP_NOFAIL and (I believe) 2 pages for GFP_NOFAIL; we might want to
consider doing something based on total memory in the machine and have the
same limit apply to both...
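For illustration, the kind of bounds check implied by "sizes that are
controlled by userspace" - a sketch only; the struct, the cap, and the helper
are made up for the example:

    static void *alloc_records(size_t nr_from_user)
    {
        size_t bytes;

        /* Reject overflow and absurd requests before they ever reach the
         * allocator; a "cannot fail" allocation must never see them. */
        if (check_mul_overflow(nr_from_user, sizeof(struct record), &bytes))
            return NULL;
        if (bytes > MAX_RECORDS_BYTES)          /* hypothetical cap */
            return NULL;

        return kvmalloc(bytes, GFP_KERNEL);
    }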
On Tue 03-09-24 19:53:41, Kent Overstreet wrote:
[...]
> However, if we agreed that GFP_NOFAIL meant "only fail if it is not
> possible to satisfy this allocation" (and I have been arguing that that
> is the only sane meaning) - then that could lead to a lot of error paths
> getting simpler.
>
> Because there are a lot of places where there's essentially no good
> reason to bubble up an -ENOMEM to userspace; if we're actually out of
> memory, the current allocation is just one out of many and not
> particularly special - better to let the oom killer handle it...

This is exactly the GFP_KERNEL semantic for low order allocations, or for
kvmalloc for that matter. They simply never fail except for a couple of
corner cases - e.g. the allocating task is an oom victim and all of the
oom memory reserves have been consumed. That is where we call it "not
possible to allocate".

> So the error paths would be more along the lines of "there's a bug, or
> userspace has requested something crazy, just shut down gracefully".

How do you expect that to be done? Who is going to go over all those
GFP_NOFAIL users? And what kind of guidelines should they follow? It is
clear that they believe they cannot handle the failure gracefully,
therefore they have requested GFP_NOFAIL. Many of them do not have a
return value to return. So really, what do you expect proper GFP_NOFAIL
users to do, and what should happen to those that are requesting an
unsupported size or allocation mode?

> While we're at it, the definition of what allocation size is "too big"
> is something we'd want to look at. Right now it's hardcoded to INT_MAX
> for non-GFP_NOFAIL and (I believe) 2 pages for GFP_NOFAIL; we might want
> to consider doing something based on total memory in the machine and
> have the same limit apply to both...

Yes, we need to define some reasonable maximum supported sizes. For the
page allocator this has been order > 1, and considering we have had a
warning about those requests for years without a single report, we can
assume we do not have such abusers. For kvmalloc the story is different.
The current INT_MAX is not any practical limit. Past experience says that
anything based on the amount of memory just doesn't work (e.g. hash table
sizes that used to do that kind of scaling, and there are other examples).
So we should be practical here, look at existing users, see what they
really need, and put a cap above that.
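(For concreteness: "order > 1" means more than two contiguous pages, i.e.
anything above 8KB with 4KB pages; that is the threshold above which, per
Michal's description, __GFP_NOFAIL page allocations have drawn a warning for
years.)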
On Wed, Sep 04, 2024 at 09:14:29AM GMT, Michal Hocko wrote:
> On Tue 03-09-24 19:53:41, Kent Overstreet wrote:
> [...]
> > Because there are a lot of places where there's essentially no good
> > reason to bubble up an -ENOMEM to userspace; if we're actually out of
> > memory, the current allocation is just one out of many and not
> > particularly special - better to let the oom killer handle it...
>
> This is exactly the GFP_KERNEL semantic for low order allocations, or for
> kvmalloc for that matter. They simply never fail except for a couple of
> corner cases - e.g. the allocating task is an oom victim and all of the
> oom memory reserves have been consumed. That is where we call it "not
> possible to allocate".

*nod*

Which does beg the question of why GFP_NOFAIL exists.

> > So the error paths would be more along the lines of "there's a bug, or
> > userspace has requested something crazy, just shut down gracefully".
>
> How do you expect that to be done? Who is going to go over all those
> GFP_NOFAIL users? And what kind of guidelines should they follow? It is
> clear that they believe they cannot handle the failure gracefully,
> therefore they have requested GFP_NOFAIL. Many of them do not have a
> return value to return.

They can't handle the allocation failure and continue normal operation,
but that's entirely different from not being able to handle the
allocation failure at all - it's not hard to do an emergency shutdown;
that's a normal thing for filesystems to do.

And if you scan for GFP_NOFAIL uses in the kernel, a decent number
already do just that.

> So really, what do you expect proper GFP_NOFAIL users to do, and what
> should happen to those that are requesting an unsupported size or
> allocation mode?

Emergency shutdown.

And I'm not saying we have to rush to fix all the existing callers;
they're clearly in existing, well tested code, so there's not much point
to that. Additionally, most of them are deep in the guts of filesystem
transaction code where the call paths to that site are limited - they're
not library code that gets called by anything.

But as a matter of policy going forward, yes, we should be saying that
even GFP_NOFAIL allocations should be checking for -ENOMEM.

> Yes, we need to define some reasonable maximum supported sizes. For the
> page allocator this has been order > 1, and considering we have had a
> warning about those requests for years without a single report, we can
> assume we do not have such abusers. For kvmalloc the story is different.
> The current INT_MAX is not any practical limit. Past experience says that
> anything based on the amount of memory just doesn't work (e.g. hash table
> sizes that used to do that kind of scaling, and there are other examples).
> So we should be practical here, look at existing users, see what they
> really need, and put a cap above that.

Not following what you're saying about hash tables? Hash tables scale
roughly with the amount of system memory/workingset.

But it seems to me that the limit should be lower if you're on e.g. a 2
GB machine (not failing with a warning, just failing immediately rather
than oom killing a bunch of stuff first) - and it's going to need to be
raised above INT_MAX as large memory machines keep growing; I keep
hitting it in bcachefs fsck code.
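A hedged sketch of the "emergency shutdown" error path being argued for here;
the shutdown helper is a stand-in for whatever forced-shutdown mechanism a
given filesystem already has, not an existing API:

    ptr = kvmalloc(bytes, GFP_KERNEL | __GFP_NOFAIL);
    if (!ptr) {
        /* Only reachable if the request itself was buggy (bad size or
         * context); shut down cleanly rather than looping or crashing. */
        fs_emergency_shutdown(fs, "allocation of %zu bytes failed", bytes);  /* hypothetical */
        return -ENOMEM;
    }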
On Tue, Sep 03, 2024 at 09:06:17AM GMT, Michal Hocko wrote:
> On Mon 02-09-24 18:32:33, Kent Overstreet wrote:
> [...]
> > We definitely don't want to recurse into other parts of the kernel,
> > taking other locks (i.e. in memory reclaim) while holding btree locks;
> > that's a great way to stack up (and potentially multiply) latencies.
>
> OK, and these two patches do not change that. The only existing user is
> turned into GFP_NOWAIT, so the final code works the same way. Right?

https://lore.kernel.org/linux-mm/20240828140638.3204253-1-kent.overstreet@linux.dev/

> > But gfp flags don't work with vmalloc allocations (and that's unlikely
> > to change), and we require vmalloc fallbacks for e.g. btree node
> > allocation. That's the big reason we want PF_MEMALLOC_NORECLAIM.
>
> Have you even tried to reach out to the vmalloc maintainers and asked for
> GFP_NOWAIT support for vmalloc? Because I do not remember that. Sure,
> kernel page tables have a hardcoded GFP_KERNEL context, which slightly
> complicates things, but that doesn't mean the only potential solution is
> to use a per-task flag to override it. Just off the top of my head, we
> could consider pre-allocating virtual address space for non-sleeping
> allocations. Maybe there are other options that only people deeply
> familiar with the vmalloc internals can see.

That sounds really overly complicated.
On Tue, Sep 03, 2024 at 07:13:42AM GMT, Christoph Hellwig wrote:
> On Mon, Sep 02, 2024 at 02:52:52PM -0700, Andrew Morton wrote:
> > It would be helpful to summarize your concerns.
>
> And that'd better be a really good argument for a change that was
> pushed directly to Linus bypassing the maintainer after multiple
> reviewers pointed out it was broken. This series simply undoes the
> damage done by that, while also keeping the code dependent on it
> working.

Well, to be blunt, I thought the "we don't want the allocator to even
know if we're in a non-sleepable context" argument was too crazy to have
real support, and moving towards PF_MEMALLOC flags is something we've
been talking about quite a bit going back years.

Little did I know the minefield I was walking into...

But the discussion seems to finally be cooling off and going in a more
productive direction.
On Wed 04-09-24 12:05:56, Kent Overstreet wrote:
> On Wed, Sep 04, 2024 at 09:14:29AM GMT, Michal Hocko wrote:
> [...]
> > This is exactly the GFP_KERNEL semantic for low order allocations, or
> > for kvmalloc for that matter. They simply never fail except for a
> > couple of corner cases [...]
>
> *nod*
>
> Which does beg the question of why GFP_NOFAIL exists.

Exactly for the reason that even rare failure is not acceptable and
there is no way to handle it other than to keep retrying. Typical code was

	while (!(ptr = kmalloc()))
		;

Or the failure would be much more catastrophic than the retry loop taking
an unbounded amount of time.

> > > So the error paths would be more along the lines of "there's a bug, or
> > > userspace has requested something crazy, just shut down gracefully".
> >
> > How do you expect that to be done? Who is going to go over all those
> > GFP_NOFAIL users? And what kind of guidelines should they follow? It is
> > clear that they believe they cannot handle the failure gracefully,
> > therefore they have requested GFP_NOFAIL. Many of them do not have a
> > return value to return.
>
> They can't handle the allocation failure and continue normal operation,
> but that's entirely different from not being able to handle the
> allocation failure at all - it's not hard to do an emergency shutdown;
> that's a normal thing for filesystems to do.
>
> And if you scan for GFP_NOFAIL uses in the kernel, a decent number
> already do just that.

It's been quite some time since I last looked. And I am not saying all of
the existing ones really require something as strong as the GFP_NOFAIL
semantic. If they can be dropped then great! The fewer we have the
better. But the point is there are some which _do_ need this. We have
discussed that in another email thread where you have heard why XFS and
ext4 do that and why they are not going to change that model. For those
users we absolutely need a predictable and well defined behavior because
they know what they are doing.

[...]

> But as a matter of policy going forward, yes, we should be saying that
> even GFP_NOFAIL allocations should be checking for -ENOMEM.

I argue that such a NOFAIL semantic is not well defined, and legit users
would be forced to do

	while (!(ptr = kmalloc(GFP_NOFAIL)))
		;
or
	BUG_ON(!(ptr = kmalloc(GFP_NOFAIL)));

So it has no real reason to exist. We at the allocator level have 2
choices. Either we tell users they will not get GFP_NOFAIL and you just
do the above, or we provide NOFAIL which really guarantees that there is
no failure, even if that means the allocation takes an unbounded amount
of time. The latter has a slight advantage because a) you can identify
those callers more easily and b) the allocator can do some heuristics to
help those allocations.

We can still discuss how to handle unsupported cases (like GFP_ATOMIC |
__GFP_NOFAIL, or kmalloc($UNCHECKED_USER_INPUT_THAT_IS_TOO_LARGE,
__GFP_NOFAIL)), but the fact of the Linux kernel is that we have legit
users and we need to optimize for them.

> > Yes, we need to define some reasonable maximum supported sizes. [...]
> > Past experience says that anything based on the amount of memory just
> > doesn't work (e.g. hash table sizes that used to do that kind of
> > scaling, and there are other examples). So we should be practical here,
> > look at existing users, see what they really need, and put a cap above
> > that.
>
> Not following what you're saying about hash tables? Hash tables scale
> roughly with the amount of system memory/workingset.

I do not have the SHA handy, but I do remember the dcache hashtable
scaling with the amount of memory in the past, and that led to GBs of
memory allocated on TB systems. This is not the case anymore; I just
wanted to mention that scaling with the amount of memory can easily go
really wrong.

> But it seems to me that the limit should be lower if you're on e.g. a 2
> GB machine (not failing with a warning, just failing immediately rather
> than oom killing a bunch of stuff first) - and it's going to need to be
> raised above INT_MAX as large memory machines keep growing; I keep
> hitting it in bcachefs fsck code.

Do we have an actual use case that would require more than a couple of
MB? The amount of memory wouldn't play any actual role then.
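To make the distinction concrete, a sketch of the kind of caller Michal counts
as a legitimate __GFP_NOFAIL user - small, fixed-size, sleepable context,
mid-transaction with no way to back out. The cache and structures here are
hypothetical:

    /* __GFP_NOFAIL on a small slab allocation: the allocator retries
     * internally, so there is no NULL check, and none that the caller
     * could meaningfully act on. */
    struct tx_item *item = kmem_cache_alloc(tx_item_cache, GFP_NOFS | __GFP_NOFAIL);

    list_add_tail(&item->list, &tx->items);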
On Wed 04-09-24 12:15:15, Kent Overstreet wrote:
> On Tue, Sep 03, 2024 at 09:06:17AM GMT, Michal Hocko wrote:
> [...]
> > OK, and these two patches do not change that. The only existing user is
> > turned into GFP_NOWAIT, so the final code works the same way. Right?
>
> https://lore.kernel.org/linux-mm/20240828140638.3204253-1-kent.overstreet@linux.dev/

https://lore.kernel.org/linux-mm/Zs9xC3OJPbkMy25C@casper.infradead.org/

> > > But gfp flags don't work with vmalloc allocations (and that's unlikely
> > > to change), and we require vmalloc fallbacks for e.g. btree node
> > > allocation. That's the big reason we want PF_MEMALLOC_NORECLAIM.
> >
> > Have you even tried to reach out to the vmalloc maintainers and asked
> > for GFP_NOWAIT support for vmalloc? [...] Just off the top of my head,
> > we could consider pre-allocating virtual address space for non-sleeping
> > allocations. Maybe there are other options that only people deeply
> > familiar with the vmalloc internals can see.
>
> That sounds really overly complicated.

Let the vmalloc people discuss viable ways to deal with that. You, as a
vmalloc consumer, want to get NOWAIT support. Ask them and see what kind
of solution they can offer you as a user. This is how we develop the
kernel in a collaborative way. We do not enforce solutions; we work with
domain experts to work out a maintainable solution.
On Wed 04-09-24 12:27:04, Kent Overstreet wrote:
> On Tue, Sep 03, 2024 at 07:13:42AM GMT, Christoph Hellwig wrote:
> [...]
> > And that'd better be a really good argument for a change that was
> > pushed directly to Linus bypassing the maintainer after multiple
> > reviewers pointed out it was broken.
>
> Well, to be blunt, I thought the "we don't want the allocator to even
> know if we're in a non-sleepable context" argument was too crazy to have
> real support, and moving towards PF_MEMALLOC flags is something we've
> been talking about quite a bit going back years.
>
> Little did I know the minefield I was walking into...

There is a lot of historical baggage, and several people have tried to
explain that things are quite complex and you cannot simply apply design
choices the same way as if you were developing something from scratch.

> But the discussion seems to finally be cooling off and going in a more
> productive direction.

Reality check: https://lore.kernel.org/all/8734mitahm.fsf@trenco.lwn.net/T/#u
On Wed, Sep 04, 2024 at 06:46:00PM GMT, Michal Hocko wrote:
> On Wed 04-09-24 12:05:56, Kent Overstreet wrote:
> [...]
> > Which does beg the question of why GFP_NOFAIL exists.
>
> Exactly for the reason that even rare failure is not acceptable and
> there is no way to handle it other than to keep retrying. Typical code was
>
>	while (!(ptr = kmalloc()))
>		;

But is it _rare_ failure, or _no_ failure?

You seem to be saying (and I just reviewed the code, it looks like
you're right) that there is essentially no difference in behaviour
between GFP_KERNEL and GFP_NOFAIL.

So given that - why the wart?

I think we might be able to chalk it up to history; I'd have to go
spelunking through the history (or ask Dave or Ted, maybe they'll chime
in), but I suspect GFP_KERNEL didn't provide such strong guarantees when
the allocation loops & GFP_NOFAIL were introduced.

> Or the failure would be much more catastrophic than the retry loop
> taking an unbounded amount of time.

But if GFP_KERNEL has the retry loop...

> > And if you scan for GFP_NOFAIL uses in the kernel, a decent number
> > already do just that.
>
> It's been quite some time since I last looked. And I am not saying all of
> the existing ones really require something as strong as the GFP_NOFAIL
> semantic. If they can be dropped then great! The fewer we have the
> better.
>
> But the point is there are some which _do_ need this. We have discussed
> that in another email thread where you have heard why XFS and ext4 do
> that and why they are not going to change that model.

No, I agree that they need the strong guarantees.

But if there's an actual bug, returning an error is better than killing
the task. Killing the task is really bad; these allocations are deep in
contexts where locks and refcounts are held, and the system will just
grind to a halt.

> > But as a matter of policy going forward, yes, we should be saying that
> > even GFP_NOFAIL allocations should be checking for -ENOMEM.
>
> I argue that such a NOFAIL semantic is not well defined, and legit users
> would be forced to do
>	while (!(ptr = kmalloc(GFP_NOFAIL))) ;
> or
>	BUG_ON(!(ptr = kmalloc(GFP_NOFAIL)));
>
> So it has no real reason to exist.

I'm arguing that it does, provided when it returns NULL is defined to be:
 - invalid allocation context
 - a size that is so big that it will never be possible to satisfy.

Returning an error is better than not returning at all and letting the
system grind to a halt.

Let's also get Dave or Ted to chime in here if we can, because I strongly
suspect the situation was different when those retry loops were
introduced.

> We at the allocator level have 2 choices. Either we tell users they
> will not get GFP_NOFAIL and you just do the above, or we provide NOFAIL
> which really guarantees that there is no failure, even if that means the
> allocation takes an unbounded amount of time. The latter has a slight
> advantage because a) you can identify those callers more easily and b)
> the allocator can do some heuristics to help those allocations.

Or option 3: recognize that this is a correctness/soundness issue, and
that if there are buggy callers they need to be fixed.

To give a non-kernel corollary: in the Rust world, soundness issues are
taken very seriously, and they are the only situation in which existing
code will be broken if necessary to fix them. Probably with an equal
amount of fighting as these threads, heh.

> > Not following what you're saying about hash tables? Hash tables scale
> > roughly with the amount of system memory/workingset.
>
> I do not have the SHA handy, but I do remember the dcache hashtable
> scaling with the amount of memory in the past, and that led to GBs of
> memory allocated on TB systems. This is not the case anymore; I just
> wanted to mention that scaling with the amount of memory can easily go
> really wrong.

Was the solution then to change the dcache hash table implementation,
rather than lifting the INT_MAX allocation size limit?

> > But it seems to me that the limit should be lower if you're on e.g. a 2
> > GB machine (not failing with a warning, just failing immediately rather
> > than oom killing a bunch of stuff first) - and it's going to need to be
> > raised above INT_MAX as large memory machines keep growing; I keep
> > hitting it in bcachefs fsck code.
>
> Do we have an actual use case that would require more than a couple of
> MB? The amount of memory wouldn't play any actual role then.

Which "amount of memory"? - not parsing that.

For large allocations in bcachefs: in journal replay we read all the
keys in the journal, and then we create a big flat array with references
to all of those keys to sort and dedup them.

We haven't hit the INT_MAX size limit there yet, but filesystem sizes
being what they are, we will soon. I've heard of users with 150 TB
filesystems, and once the fsck scalability issues are sorted we'll be
aiming for petabytes. Dirty keys in the journal scale more with system
memory, but I'm leasing machines right now with a quarter terabyte of
ram.

Another more pressing one is the extents -> backpointers and
backpointers -> extents passes of fsck; we do a linear scan through one
btree checking references to another btree. For the btree whose
references we're checking, the lookups are random, so we need to cache
and pin the entire btree in ram if possible, or, if not, whatever will
fit, and we run in multiple passes.

This is the #1 scalability issue hitting a number of users right now, so
I may need to rewrite it to pull backpointers into an eytzinger array
and do our random lookups for backpointers on that - but that will be
"the biggest vmalloc array we can possibly allocate", so the INT_MAX
size limit is clearly an issue there...

Q: Why isn't this in userspace?
A: It has to be in the kernel for online fsck to work, and these are fsck
passes we can already run online...
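Since the eytzinger layout comes up here: a rough, generic sketch of the
branchless lower-bound lookup such an array enables (1-indexed BFS order; this
is not bcachefs's actual eytzinger code):

    /* Return the index of the first element >= key in a[1..n], or 0 if
     * every element is smaller.  The hot top of the implicit tree stays
     * cache-resident, which is the point of the layout. */
    static size_t eytzinger_lower_bound(const u64 *a, size_t n, u64 key)
    {
        size_t i = 1;

        while (i <= n)
            i = 2 * i + (a[i] < key);       /* descend left or right */

        /* Strip the trailing "went right" steps to recover the answer. */
        return i >> __builtin_ffsll(~i);
    }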
On Wed, Sep 04, 2024 at 02:03:13PM -0400, Kent Overstreet wrote:
> On Wed, Sep 04, 2024 at 06:46:00PM GMT, Michal Hocko wrote:
> [...]
> > Do we have an actual use case that would require more than a couple of
> > MB? The amount of memory wouldn't play any actual role then.
>
> Which "amount of memory"? - not parsing that.
>
> For large allocations in bcachefs: in journal replay we read all the
> keys in the journal, and then we create a big flat array with references
> to all of those keys to sort and dedup them.
>
> We haven't hit the INT_MAX size limit there yet, but filesystem sizes
> being what they are, we will soon. I've heard of users with 150 TB
> filesystems, and once the fsck scalability issues are sorted we'll be
> aiming for petabytes. Dirty keys in the journal scale more with system
> memory, but I'm leasing machines right now with a quarter terabyte of
> ram.

I've seen xfs_repair require a couple of TB of RAM to repair metadata
heavy filesystems of relatively small size (sub-20TB). Once you get to
about a few hundred GB of metadata in the filesystem, the fsck
cross-reference data set size can easily run into the TBs.

So 256GB might *seem* like a lot of memory, but we were seeing xfs_repair
exceed that amount of RAM for metadata heavy filesystems at least a
decade ago...

Indeed, we recently heard about a 6TB filesystem with 15 *billion*
hardlinks in it. The cross reference for resolving all those hardlinks
would require somewhere in the order of 1.5TB of RAM to hold. The only
way to reliably handle random access data sets this large is with
pageable memory....

> Another more pressing one is the extents -> backpointers and
> backpointers -> extents passes of fsck; we do a linear scan through one
> btree checking references to another btree. For the btree whose
> references we're checking, the lookups are random, so we need to cache
> and pin the entire btree in ram if possible, or, if not, whatever will
> fit, and we run in multiple passes.
>
> This is the #1 scalability issue hitting a number of users right now, so
> I may need to rewrite it to pull backpointers into an eytzinger array
> and do our random lookups for backpointers on that - but that will be
> "the biggest vmalloc array we can possibly allocate", so the INT_MAX
> size limit is clearly an issue there...

Given my above comments, I think you are approaching this problem the
wrong way. It is known that the data set can exceed physical kernel
memory size; hence it needs to be swappable. That way users can extend
the kernel memory capacity via swapfiles when bcachefs fsck needs more
memory than the system has physical RAM.

This is a problem Darrick had to address for the XFS online repair code -
we've known for a long time that repair needs to hold a data set larger
than physical memory to complete successfully. Hence for online repair we
needed a mechanism that provided us with pageable kernel memory.
vmalloc() is not an option - it has hard size limits (both API based and
physical capacity based). Hence Darrick designed and implemented pageable
shmem backed memory files (xfiles) to hold these data sets.

Hence the size limit of the online repair data set is physical RAM + swap
space, same as it is for offline repair. You can find the xfile code in
fs/xfs/scrub/xfile.[ch].

Support for large, sortable arrays of fixed size records built on xfiles
can be found in xfarray.[ch], and blob storage in xfblob.[ch].

vmalloc() is really not a good solution for holding arbitrary sized data
sets in kernel memory....

-Dave.
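Very roughly, the mechanism behind this: the data set lives in a shmem file,
so it sits in the page cache and can be written out to swap. A minimal sketch
of the idea using the generic shmem APIs - illustrative only, not the xfile
code itself (see fs/xfs/scrub/xfile.[ch] for that):

    struct file *f = shmem_file_setup("repair-xref", size, VM_NORESERVE);
    if (IS_ERR(f))
        return PTR_ERR(f);

    /* Touch one record: fault the backing page in (possibly from swap),
     * map it, update it, then release it so it stays reclaimable. */
    struct page *page = shmem_read_mapping_page(f->f_mapping, pos >> PAGE_SHIFT);
    if (!IS_ERR(page)) {
        void *p = kmap_local_page(page);
        /* ... read/modify the record at offset_in_page(pos) ... */
        kunmap_local(p);
        put_page(page);
    }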
On Thu, Sep 05, 2024 at 08:34:39AM GMT, Dave Chinner wrote:
> I've seen xfs_repair require a couple of TB of RAM to repair metadata
> heavy filesystems of relatively small size (sub-20TB). Once you get to
> about a few hundred GB of metadata in the filesystem, the fsck
> cross-reference data set size can easily run into the TBs.
>
> So 256GB might *seem* like a lot of memory, but we were seeing
> xfs_repair exceed that amount of RAM for metadata heavy filesystems at
> least a decade ago...
>
> Indeed, we recently heard about a 6TB filesystem with 15 *billion*
> hardlinks in it. The cross reference for resolving all those hardlinks
> would require somewhere in the order of 1.5TB of RAM to hold. The only
> way to reliably handle random access data sets this large is with
> pageable memory....

Christ...

This is also where space efficiency of metadata starts to really matter.
Of course, you store full backreferences for every hardlink, which is
nice in some ways and a pain in others.

> Given my above comments, I think you are approaching this problem the
> wrong way. It is known that the data set can exceed physical kernel
> memory size; hence it needs to be swappable. That way users can extend
> the kernel memory capacity via swapfiles when bcachefs fsck needs more
> memory than the system has physical RAM.

Well, it depends on the locality of the cross references - I don't think
we want to go that route here, because if there isn't any locality in the
cross references we'll just be thrashing; better to run in multiple
passes, constraining each pass to what _will_ fit in ram...

It would be nice if we had a way to guesstimate locality in extents <->
backpointers references - if there is locality, then it's better to just
run in one pass - and we wouldn't bother with building up new tables,
we'd just rely on the btree node cache.

Perhaps that's what we'll do when online fsck is finished and we're
optimizing more for "don't disturb the rest of the system too much" than
"get it done as quick as possible".

I do need to start making use of Darrick's swappable memory code in at
least one other place though - the bucket tables when we're checking
basic allocation info. That one just exceeded the INT_MAX limit for a
user with a 30 TB hard drive, so I switched it to a radix tree for now,
but it really should be swappable memory. Fortunately there's more
locality in the accesses there.

> Hence Darrick designed and implemented pageable shmem backed memory
> files (xfiles) to hold these data sets. Hence the size limit of the
> online repair data set is physical RAM + swap space, same as it is for
> offline repair. You can find the xfile code in fs/xfs/scrub/xfile.[ch].
>
> Support for large, sortable arrays of fixed size records built on
> xfiles can be found in xfarray.[ch], and blob storage in xfblob.[ch].

*nod*

I do wish we had normal virtually mapped swappable memory though - the
thing I don't like about xfarray is that it requires a radix tree walk on
every access, and we have _hardware_ that's meant to do that for us. But
if you still care about 32 bit then that does necessitate Darrick's
approach.

I'm willing to consider 32 bit legacy for bcachefs, though.
On Wed 04-09-24 14:03:13, Kent Overstreet wrote:
> On Wed, Sep 04, 2024 at 06:46:00PM GMT, Michal Hocko wrote:
> [...]
> > Exactly for the reason that even rare failure is not acceptable and
> > there is no way to handle it other than to keep retrying. Typical code was
> >	while (!(ptr = kmalloc()))
> >		;
>
> But is it _rare_ failure, or _no_ failure?
>
> You seem to be saying (and I just reviewed the code, it looks like
> you're right) that there is essentially no difference in behaviour
> between GFP_KERNEL and GFP_NOFAIL.

The fundamental difference is that (apart from an unsupported allocation
mode/size) the latter never returns NULL, and you can rely on that fact.
Our documentation says:

 * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures. The allocation could block
 * indefinitely but will never return with failure. Testing for
 * failure is pointless.

> So given that - why the wart?
>
> I think we might be able to chalk it up to history; I'd have to go
> spelunking through the history (or ask Dave or Ted, maybe they'll chime
> in), but I suspect GFP_KERNEL didn't provide such strong guarantees when
> the allocation loops & GFP_NOFAIL were introduced.

Sure, go ahead. If you manage to remove all existing users of
__GFP_NOFAIL (without replacing them with retry loops in the caller),
then I would be more than happy to remove __GFP_NOFAIL from the
allocator.

[...]

> > But the point is there are some which _do_ need this. We have discussed
> > that in another email thread where you have heard why XFS and ext4 do
> > that and why they are not going to change that model.
>
> No, I agree that they need the strong guarantees.
>
> But if there's an actual bug, returning an error is better than killing
> the task. Killing the task is really bad; these allocations are deep in
> contexts where locks and refcounts are held, and the system will just
> grind to a halt.

Not sure what you mean by "these allocations", but I am not aware that
any of the existing users would be really buggy. Also, as I've said
elsewhere, there is simply no good way to handle a buggy caller. Killing
the buggy context has some downsides; returning NULL has others. I have
argued that the former has more predictable behavior than a potentially
silent failure.

We can clearly disagree on this, but I also do not see why it is relevant
to the original discussion, because my argument against
PF_MEMALLOC_NORECLAIM was focused on correct GFP_NOFAIL users in a nested
context that would get an unexpected failure mode. No matter what kind of
failure mode that is, it would be unexpected for those users.

> > > But as a matter of policy going forward, yes, we should be saying that
> > > even GFP_NOFAIL allocations should be checking for -ENOMEM.
> >
> > I argue that such a NOFAIL semantic is not well defined, and legit users
> > would be forced to do
> >	while (!(ptr = kmalloc(GFP_NOFAIL))) ;
> > or
> >	BUG_ON(!(ptr = kmalloc(GFP_NOFAIL)));
> >
> > So it has no real reason to exist.
>
> I'm arguing that it does, provided when it returns NULL is defined to be:
> - invalid allocation context
> - a size that is so big that it will never be possible to satisfy.

Those are not really important situations, because you are arguing about
buggy code that needs fixing. As said above, we can argue how to deal
with those users to get a predictable behavior, but as a matter of fact,
correct users can expect never to see the failure, so handling failure
might be a) impossible and b) unfeasible (i.e. you are adding dead code
that is never tested).

[...]

> For large allocations in bcachefs: in journal replay we read all the
> keys in the journal, and then we create a big flat array with references
> to all of those keys to sort and dedup them.
>
> We haven't hit the INT_MAX size limit there yet, but filesystem sizes
> being what they are, we will soon. I've heard of users with 150 TB
> filesystems, and once the fsck scalability issues are sorted we'll be
> aiming for petabytes. Dirty keys in the journal scale more with system
> memory, but I'm leasing machines right now with a quarter terabyte of
> ram.

I thought you were arguing about bcachefs handling the failure mode, so
presumably you do not need to use __GFP_NOFAIL for those. I am sorry, but
I am getting lost in these arguments.
On Thu, Sep 05, 2024 at 01:26:50PM +0200, Michal Hocko wrote:
> [...]
> > But is it _rare_ failure, or _no_ failure?
> >
> > You seem to be saying (and I just reviewed the code, it looks like
> > you're right) that there is essentially no difference in behaviour
> > between GFP_KERNEL and GFP_NOFAIL.

That may be the current state of affairs; but is it ****guaranteed****
forever and ever, amen, that GFP_KERNEL will never fail if the amount of
memory allocated was lower than a particular multiple of the page size?
If so, what is that size? I've checked, and this is not documented in the
formal interface.

> The fundamental difference is that (apart from an unsupported allocation
> mode/size) the latter never returns NULL, and you can rely on that fact.
> Our documentation says:
> * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> * cannot handle allocation failures. The allocation could block
> * indefinitely but will never return with failure. Testing for
> * failure is pointless.

So if the documentation is going to give similar guarantees, as opposed
to it being an accident of the current implementation that is subject to
change at any time, then sure, we can probably get away with all or most
of ext4's uses of __GFP_NOFAIL. But I don't want to do that and then have
a "Lucy and Charlie Brown" moment from the Peanuts comic strip where the
football suddenly gets snatched away from us[1] (and many file system
users will be very, very sad and/or angry).

[1] https://www.cracked.com/article_37831_good-grief-how-lucy-pulling-the-football-away-from-charlie-brown-became-a-signature-peanuts-gag.html

It might be that other file systems have requirements which are larger
than the not-formally-defined GFP_KERNEL guarantee, but it's true we can
probably reduce the usage of GFP_NOFAIL if this guarantee is formalized.

					- Ted
On Thu, Sep 05, 2024 at 09:53:26AM GMT, Theodore Ts'o wrote:
> On Thu, Sep 05, 2024 at 01:26:50PM +0200, Michal Hocko wrote:
> [...]
> > > You seem to be saying (and I just reviewed the code, it looks like
> > > you're right) that there is essentially no difference in behaviour
> > > between GFP_KERNEL and GFP_NOFAIL.
>
> That may be the current state of affairs; but is it ****guaranteed****
> forever and ever, amen, that GFP_KERNEL will never fail if the amount of
> memory allocated was lower than a particular multiple of the page size?
> If so, what is that size? I've checked, and this is not documented in
> the formal interface.

Yeah, and I think we really need to make that happen, in order to head
off a lot more silliness in the future.

We'd also be documenting at the same time _exactly_ when it is required
to check for errors:
 - small, fixed-size allocation in a known sleepable context: safe to skip
 - anything else, i.e. a variable-sized allocation or library code that
   can be called from different contexts: you check for errors (and
   probably that's just "something crazy has happened, emergency
   shutdown" for the xfs/ext4 paths)

> > The fundamental difference is that (apart from an unsupported allocation
> > mode/size) the latter never returns NULL, and you can rely on that fact.
> > [...]
>
> So if the documentation is going to give similar guarantees, as opposed
> to it being an accident of the current implementation that is subject to
> change at any time, then sure, we can probably get away with all or most
> of ext4's uses of __GFP_NOFAIL. But I don't want to do that and then have
> a "Lucy and Charlie Brown" moment from the Peanuts comic strip where the
> football suddenly gets snatched away from us[1] (and many file system
> users will be very, very sad and/or angry).

Yeah, absolutely - and the "what is a small allocation" limit needs to be
nailed down as well.
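Illustrating the two cases Kent distinguishes - a sketch of the proposed rule,
not settled policy, and with hypothetical names; current kernel style still
checks both:

    /* Small, fixed-size, known sleepable context: the class being argued
     * to be effectively infallible. */
    struct foo_state *st = kzalloc(sizeof(*st), GFP_KERNEL);

    /* Size depends on input / callable from arbitrary contexts: always
     * check, and treat failure as "shut down gracefully". */
    u64 *table = kvmalloc_array(nr_entries, sizeof(*table), GFP_KERNEL);
    if (!st || !table)
        goto shutdown;      /* hypothetical graceful-shutdown path */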
On Thu 05-09-24 09:53:26, Theodore Ts'o wrote:
> On Thu, Sep 05, 2024 at 01:26:50PM +0200, Michal Hocko wrote:
> [...]
>
> That may be the current state of affairs; but is it ****guaranteed****
> forever and ever, amen, that GFP_KERNEL will never fail if the amount of
> memory allocated was lower than a particular multiple of the page size?

No, GFP_KERNEL is not guaranteed. The allocator tries as hard as it can
to satisfy those allocations for order <= PAGE_ALLOC_COSTLY_ORDER.
GFP_NOFAIL is guaranteed for order <= 1 for the page allocator, and there
is no practical limit for vmalloc currently.

This is what our documentation says:

 * The default allocator behavior depends on the request size. We have a concept
 * of so-called costly allocations (with order > %PAGE_ALLOC_COSTLY_ORDER).
 * !costly allocations are too essential to fail so they are implicitly
 * non-failing by default (with some exceptions like OOM victims might fail so
 * the caller still has to check for failures) while costly requests try to be
 * not disruptive and back off even without invoking the OOM killer.
 * The following three modifiers might be used to override some of these
 * implicit rules.

There is no guarantee this will stay that way forever. It is unlikely to
change, though.
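(Putting numbers on that: PAGE_ALLOC_COSTLY_ORDER is 3, so with 4KB pages the
"implicitly non-failing" GFP_KERNEL range covers requests up to 8 contiguous
pages, i.e. 32KB, while the __GFP_NOFAIL guarantee Michal cites stops at
order 1, i.e. 2 pages or 8KB.)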
On Thu, Sep 05, 2024 at 10:05:15AM -0400, Kent Overstreet wrote:
> > That may be the current state of affairs; but is it ****guaranteed****
> > forever and ever, amen, that GFP_KERNEL will never fail if the amount
> > of memory allocated was lower than a particular multiple of the page
> > size? If so, what is that size? I've checked, and this is not
> > documented in the formal interface.
>
> Yeah, and I think we really need to make that happen, in order to head
> off a lot more silliness in the future.

I don't think there's any "silliness"; I hear that you believe that it's
silly, but I think what we have is totally fine. I've done a quick check
of ext4, and we do check the error returns in most if not all of the
calls where we pass in __GFP_NOFAIL and/or are small allocations less
than the block size. We won't crash if someone sends a patch which
violates the documented guarantee of __GFP_NOFAIL. So what's the
silliness?

In any case, Michal has said ix-nay on making GFP_KERNEL == GFP_NOFAIL
for small allocations as a documented guarantee, as opposed to the way
things work today, so as far as I'm concerned, the matter is closed.

					- Ted
Thanks, I have queued this for 6.12-rc1. Kent, if this results in observable runtime issues with bcachefs, please report those and let's see what we can come up with to address them.
On Tue, Sep 10, 2024 at 12:29:12PM GMT, Andrew Morton wrote:
> Thanks, I have queued this for 6.12-rc1.
>
> Kent, if this results in observable runtime issues with bcachefs,
> please report those and let's see what we can come up with to address
> them.

Andrew, have you ever had to hunt down tail latency bugs?