Message ID | cover.1720159494.git.wqu@suse.com (mailing list archive) |
---|---|
Headers | show |
Series | btrfs: remove __GFP_NOFAIL usage for debug builds | expand |
On Fri, Jul 05, 2024 at 03:45:37PM +0930, Qu Wenruo wrote: > This patchset removes all __GFP_NOFAIL flags usage inside btrfs for > DEBUG builds. > > There are 3 call sites utilizing __GFP_NOFAIL: > > - __alloc_extent_buffer() > It's for the extent_buffer structure allocation. > All callers are already handling the errors. > > - attach_eb_folio_to_filemap() > It's for the filemap_add_folio() call, the flag is also passed to mem > cgroup, which I suspect is not handling larger folio and __GFP_NOFAIL > correctly, as I'm hitting soft lockups when testing larger folios > > New error handling is added. > > - btrfs_alloc_folio_array() > This is for page allocation for extent buffers. > All callers are already handling the errors. > > Furthermore, to enable more testing while not affecting end users, the > change is only implemented for DEBUG builds. > > Qu Wenruo (3): > btrfs: do not use __GFP_NOFAIL flag for __alloc_extent_buffer() > btrfs: do not use __GFP_NOFAIL flag for attach_eb_folio_to_filemap() > btrfs: do not use __GFP_NOFAIL flag for btrfs_alloc_folio_array() The reason I want to leave NOFAIL is because in a cgroup memory constrained environment we could get an errant ENOMEM on some sort of metadata operation, which then gets turned into an aborted transaction. I don't want a memory constrained cgroup flipping the whole file system read only because it got an ENOMEM in a place where we have no choice but to abort the transaction. If we could eliminate that possibility then hooray, but that's not actually possible, because any COW for a multi-modification case (think finish ordered io) could result in an ENOMEM and thus a transaction abort. We need to live with NOFAIL for these cases. Thanks, Josef
在 2024/7/6 00:25, Josef Bacik 写道: > On Fri, Jul 05, 2024 at 03:45:37PM +0930, Qu Wenruo wrote: >> This patchset removes all __GFP_NOFAIL flags usage inside btrfs for >> DEBUG builds. >> >> There are 3 call sites utilizing __GFP_NOFAIL: >> >> - __alloc_extent_buffer() >> It's for the extent_buffer structure allocation. >> All callers are already handling the errors. >> >> - attach_eb_folio_to_filemap() >> It's for the filemap_add_folio() call, the flag is also passed to mem >> cgroup, which I suspect is not handling larger folio and __GFP_NOFAIL >> correctly, as I'm hitting soft lockups when testing larger folios >> >> New error handling is added. >> >> - btrfs_alloc_folio_array() >> This is for page allocation for extent buffers. >> All callers are already handling the errors. >> >> Furthermore, to enable more testing while not affecting end users, the >> change is only implemented for DEBUG builds. >> >> Qu Wenruo (3): >> btrfs: do not use __GFP_NOFAIL flag for __alloc_extent_buffer() >> btrfs: do not use __GFP_NOFAIL flag for attach_eb_folio_to_filemap() >> btrfs: do not use __GFP_NOFAIL flag for btrfs_alloc_folio_array() > > The reason I want to leave NOFAIL is because in a cgroup memory constrained > environment we could get an errant ENOMEM on some sort of metadata operation, > which then gets turned into an aborted transaction. I don't want a memory > constrained cgroup flipping the whole file system read only because it got an > ENOMEM in a place where we have no choice but to abort the transaction. Well, I'm already seeing mem_cgroup_charge() soft locks up, possible due to the _NOFAIL flag and larger folios at filemap_add_folio(). Thus the _NOFAIL flag is not only affecting page allocation, but also passed to mem cgroup related code, and that can already be problematic. > > If we could eliminate that possibility then hooray, but that's not actually > possible, because any COW for a multi-modification case (think finish ordered > io) could result in an ENOMEM and thus a transaction abort. We need to live > with NOFAIL for these cases. Thanks, In that cgroup controlled case, I'm wondering how it would even handle btrfs extent buffer allocation. Wouldn't all the eb charged to certain cgroup meanwhile the eb pages are a shared resource for all btrfs operations? If so, I'd say the mem cgroup on btree inode is already problematic, and we should eliminate the behavior other than using _NOFAIL to workaround it. Thanks, Qu > > Josef >
On Fri, Jul 05, 2024 at 10:55:43AM -0400, Josef Bacik wrote: > On Fri, Jul 05, 2024 at 03:45:37PM +0930, Qu Wenruo wrote: > > This patchset removes all __GFP_NOFAIL flags usage inside btrfs for > > DEBUG builds. > > > > There are 3 call sites utilizing __GFP_NOFAIL: > > > > - __alloc_extent_buffer() > > It's for the extent_buffer structure allocation. > > All callers are already handling the errors. > > > > - attach_eb_folio_to_filemap() > > It's for the filemap_add_folio() call, the flag is also passed to mem > > cgroup, which I suspect is not handling larger folio and __GFP_NOFAIL > > correctly, as I'm hitting soft lockups when testing larger folios > > > > New error handling is added. > > > > - btrfs_alloc_folio_array() > > This is for page allocation for extent buffers. > > All callers are already handling the errors. > > > > Furthermore, to enable more testing while not affecting end users, the > > change is only implemented for DEBUG builds. > > > > Qu Wenruo (3): > > btrfs: do not use __GFP_NOFAIL flag for __alloc_extent_buffer() > > btrfs: do not use __GFP_NOFAIL flag for attach_eb_folio_to_filemap() > > btrfs: do not use __GFP_NOFAIL flag for btrfs_alloc_folio_array() > > The reason I want to leave NOFAIL is because in a cgroup memory constrained > environment we could get an errant ENOMEM on some sort of metadata operation, > which then gets turned into an aborted transaction. I don't want a memory > constrained cgroup flipping the whole file system read only because it got an > ENOMEM in a place where we have no choice but to abort the transaction. > > If we could eliminate that possibility then hooray, but that's not actually > possible, because any COW for a multi-modification case (think finish ordered > io) could result in an ENOMEM and thus a transaction abort. We need to live > with NOFAIL for these cases. Thanks, I agree with keeping NOFAIL. Please add the above as a comment to btrfs_alloc_folio_array().
在 2024/7/30 23:25, David Sterba 写道: > On Fri, Jul 05, 2024 at 10:55:43AM -0400, Josef Bacik wrote: >> On Fri, Jul 05, 2024 at 03:45:37PM +0930, Qu Wenruo wrote: >>> This patchset removes all __GFP_NOFAIL flags usage inside btrfs for >>> DEBUG builds. >>> >>> There are 3 call sites utilizing __GFP_NOFAIL: >>> >>> - __alloc_extent_buffer() >>> It's for the extent_buffer structure allocation. >>> All callers are already handling the errors. >>> >>> - attach_eb_folio_to_filemap() >>> It's for the filemap_add_folio() call, the flag is also passed to mem >>> cgroup, which I suspect is not handling larger folio and __GFP_NOFAIL >>> correctly, as I'm hitting soft lockups when testing larger folios >>> >>> New error handling is added. >>> >>> - btrfs_alloc_folio_array() >>> This is for page allocation for extent buffers. >>> All callers are already handling the errors. >>> >>> Furthermore, to enable more testing while not affecting end users, the >>> change is only implemented for DEBUG builds. >>> >>> Qu Wenruo (3): >>> btrfs: do not use __GFP_NOFAIL flag for __alloc_extent_buffer() >>> btrfs: do not use __GFP_NOFAIL flag for attach_eb_folio_to_filemap() >>> btrfs: do not use __GFP_NOFAIL flag for btrfs_alloc_folio_array() >> >> The reason I want to leave NOFAIL is because in a cgroup memory constrained >> environment we could get an errant ENOMEM on some sort of metadata operation, >> which then gets turned into an aborted transaction. I don't want a memory >> constrained cgroup flipping the whole file system read only because it got an >> ENOMEM in a place where we have no choice but to abort the transaction. >> >> If we could eliminate that possibility then hooray, but that's not actually >> possible, because any COW for a multi-modification case (think finish ordered >> io) could result in an ENOMEM and thus a transaction abort. We need to live >> with NOFAIL for these cases. Thanks, > > I agree with keeping NOFAIL. Please add the above as a comment to > btrfs_alloc_folio_array(). That will soon no longer be a problem. The cgroup guys are fine if certain inode should not be limited by mem cgroup, so I already have patches to use root mem cgroup so that it will not be charged at all. https://lore.kernel.org/linux-mm/6a9ba2c8e70c7b5c4316404612f281a031f847da.1721384771.git.wqu@suse.com/ Furthermore, using NOFAIL just to workaround mem cgroup looks like an abuse to me. Thanks, Qu
On Wed, Jul 31, 2024 at 07:47:51AM +0930, Qu Wenruo wrote: > > > 在 2024/7/30 23:25, David Sterba 写道: > > On Fri, Jul 05, 2024 at 10:55:43AM -0400, Josef Bacik wrote: > >> On Fri, Jul 05, 2024 at 03:45:37PM +0930, Qu Wenruo wrote: > >>> This patchset removes all __GFP_NOFAIL flags usage inside btrfs for > >>> DEBUG builds. > >>> > >>> There are 3 call sites utilizing __GFP_NOFAIL: > >>> > >>> - __alloc_extent_buffer() > >>> It's for the extent_buffer structure allocation. > >>> All callers are already handling the errors. > >>> > >>> - attach_eb_folio_to_filemap() > >>> It's for the filemap_add_folio() call, the flag is also passed to mem > >>> cgroup, which I suspect is not handling larger folio and __GFP_NOFAIL > >>> correctly, as I'm hitting soft lockups when testing larger folios > >>> > >>> New error handling is added. > >>> > >>> - btrfs_alloc_folio_array() > >>> This is for page allocation for extent buffers. > >>> All callers are already handling the errors. > >>> > >>> Furthermore, to enable more testing while not affecting end users, the > >>> change is only implemented for DEBUG builds. > >>> > >>> Qu Wenruo (3): > >>> btrfs: do not use __GFP_NOFAIL flag for __alloc_extent_buffer() > >>> btrfs: do not use __GFP_NOFAIL flag for attach_eb_folio_to_filemap() > >>> btrfs: do not use __GFP_NOFAIL flag for btrfs_alloc_folio_array() > >> > >> The reason I want to leave NOFAIL is because in a cgroup memory constrained > >> environment we could get an errant ENOMEM on some sort of metadata operation, > >> which then gets turned into an aborted transaction. I don't want a memory > >> constrained cgroup flipping the whole file system read only because it got an > >> ENOMEM in a place where we have no choice but to abort the transaction. > >> > >> If we could eliminate that possibility then hooray, but that's not actually > >> possible, because any COW for a multi-modification case (think finish ordered > >> io) could result in an ENOMEM and thus a transaction abort. We need to live > >> with NOFAIL for these cases. Thanks, > > > > I agree with keeping NOFAIL. Please add the above as a comment to > > btrfs_alloc_folio_array(). > > That will soon no longer be a problem. > > The cgroup guys are fine if certain inode should not be limited by mem > cgroup, so I already have patches to use root mem cgroup so that it will > not be charged at all. > > https://lore.kernel.org/linux-mm/6a9ba2c8e70c7b5c4316404612f281a031f847da.1721384771.git.wqu@suse.com/ > > Furthermore, using NOFAIL just to workaround mem cgroup looks like an > abuse to me. It's not just because of cgroup, the nofail semantics for metadata means that MM subsystem should try to get some memory instead of failing, where the filesystem operation can wait a bit. What josef described regarding transaction aborts applies in general.
在 2024/8/1 02:43, David Sterba 写道: > On Wed, Jul 31, 2024 at 07:47:51AM +0930, Qu Wenruo wrote: >> >> >> 在 2024/7/30 23:25, David Sterba 写道: >>> On Fri, Jul 05, 2024 at 10:55:43AM -0400, Josef Bacik wrote: >>>> On Fri, Jul 05, 2024 at 03:45:37PM +0930, Qu Wenruo wrote: >>>>> This patchset removes all __GFP_NOFAIL flags usage inside btrfs for >>>>> DEBUG builds. >>>>> >>>>> There are 3 call sites utilizing __GFP_NOFAIL: >>>>> >>>>> - __alloc_extent_buffer() >>>>> It's for the extent_buffer structure allocation. >>>>> All callers are already handling the errors. >>>>> >>>>> - attach_eb_folio_to_filemap() >>>>> It's for the filemap_add_folio() call, the flag is also passed to mem >>>>> cgroup, which I suspect is not handling larger folio and __GFP_NOFAIL >>>>> correctly, as I'm hitting soft lockups when testing larger folios >>>>> >>>>> New error handling is added. >>>>> >>>>> - btrfs_alloc_folio_array() >>>>> This is for page allocation for extent buffers. >>>>> All callers are already handling the errors. >>>>> >>>>> Furthermore, to enable more testing while not affecting end users, the >>>>> change is only implemented for DEBUG builds. >>>>> >>>>> Qu Wenruo (3): >>>>> btrfs: do not use __GFP_NOFAIL flag for __alloc_extent_buffer() >>>>> btrfs: do not use __GFP_NOFAIL flag for attach_eb_folio_to_filemap() >>>>> btrfs: do not use __GFP_NOFAIL flag for btrfs_alloc_folio_array() >>>> >>>> The reason I want to leave NOFAIL is because in a cgroup memory constrained >>>> environment we could get an errant ENOMEM on some sort of metadata operation, >>>> which then gets turned into an aborted transaction. I don't want a memory >>>> constrained cgroup flipping the whole file system read only because it got an >>>> ENOMEM in a place where we have no choice but to abort the transaction. >>>> >>>> If we could eliminate that possibility then hooray, but that's not actually >>>> possible, because any COW for a multi-modification case (think finish ordered >>>> io) could result in an ENOMEM and thus a transaction abort. We need to live >>>> with NOFAIL for these cases. Thanks, >>> >>> I agree with keeping NOFAIL. Please add the above as a comment to >>> btrfs_alloc_folio_array(). >> >> That will soon no longer be a problem. >> >> The cgroup guys are fine if certain inode should not be limited by mem >> cgroup, so I already have patches to use root mem cgroup so that it will >> not be charged at all. >> >> https://lore.kernel.org/linux-mm/6a9ba2c8e70c7b5c4316404612f281a031f847da.1721384771.git.wqu@suse.com/ >> >> Furthermore, using NOFAIL just to workaround mem cgroup looks like an >> abuse to me. > > It's not just because of cgroup, the nofail semantics for metadata > means that MM subsystem should try to get some memory instead of > failing, where the filesystem operation can wait a bit. What josef > described regarding transaction aborts applies in general. > If you are only talking about the original patch, yes, we should not remove NOFAIL for folio allocation, at least for now. But this patchset is already a little outdated, the latest way to remove NOFAIL is to only remove the usage for filemap_add_folio(). Filemap_add_folio() is only doing two things: - Cgroup charge This will be addressed by manually using root memcgroup, so that we will never trigger cgroup charge anymore. - Xarray split (aka new slot allocation) This is only allocating some xarray slots, if that fails I do not really believe NOFAIL can do anything much better. So all your argument on the NOFAIL and metadata allocation no longer stands. Because I'm not even going to touch folio allocation part (even I still believe we should not go NOFAIL for metadata folio allocation). Thanks, Qu