[0/3] btrfs: remove __GFP_NOFAIL usage for debug builds

Message ID	cover.1720159494.git.wqu@suse.com (mailing list archive)
Headers	show Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.223.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E0C1433C7 for <linux-btrfs@vger.kernel.org>; Fri, 5 Jul 2024 06:16:04 +0000 (UTC) From: Qu Wenruo <wqu@suse.com> To: linux-btrfs@vger.kernel.org Subject: [PATCH 0/3] btrfs: remove __GFP_NOFAIL usage for debug builds Date: Fri, 5 Jul 2024 15:45:37 +0930 Message-ID: <cover.1720159494.git.wqu@suse.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	btrfs: remove __GFP_NOFAIL usage for debug builds \| expand [0/3] btrfs: remove __GFP_NOFAIL usage for debug builds [1/3] btrfs: do not use __GFP_NOFAIL flag for __alloc_extent_buffer() [2/3] btrfs: do not use __GFP_NOFAIL flag for attach_eb_folio_to_filemap() [3/3] btrfs: do not use __GFP_NOFAIL flag for btrfs_alloc_folio_array()

Qu Wenruo July 5, 2024, 6:15 a.m. UTC

This patchset removes all __GFP_NOFAIL flags usage inside btrfs for
DEBUG builds.

There are 3 call sites utilizing __GFP_NOFAIL:

- __alloc_extent_buffer()
  It's for the extent_buffer structure allocation.
  All callers are already handling the errors.

- attach_eb_folio_to_filemap()
  It's for the filemap_add_folio() call, the flag is also passed to mem
  cgroup, which I suspect is not handling larger folio and __GFP_NOFAIL
  correctly, as I'm hitting soft lockups when testing larger folios

  New error handling is added.

- btrfs_alloc_folio_array()
  This is for page allocation for extent buffers.
  All callers are already handling the errors.

Furthermore, to enable more testing while not affecting end users, the
change is only implemented for DEBUG builds.

Qu Wenruo (3):
  btrfs: do not use __GFP_NOFAIL flag for __alloc_extent_buffer()
  btrfs: do not use __GFP_NOFAIL flag for attach_eb_folio_to_filemap()
  btrfs: do not use __GFP_NOFAIL flag for btrfs_alloc_folio_array()

 fs/btrfs/extent_io.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

Josef Bacik July 5, 2024, 2:55 p.m. UTC | #1

On Fri, Jul 05, 2024 at 03:45:37PM +0930, Qu Wenruo wrote:
> This patchset removes all __GFP_NOFAIL flags usage inside btrfs for
> DEBUG builds.
> 
> There are 3 call sites utilizing __GFP_NOFAIL:
> 
> - __alloc_extent_buffer()
>   It's for the extent_buffer structure allocation.
>   All callers are already handling the errors.
> 
> - attach_eb_folio_to_filemap()
>   It's for the filemap_add_folio() call, the flag is also passed to mem
>   cgroup, which I suspect is not handling larger folio and __GFP_NOFAIL
>   correctly, as I'm hitting soft lockups when testing larger folios
> 
>   New error handling is added.
> 
> - btrfs_alloc_folio_array()
>   This is for page allocation for extent buffers.
>   All callers are already handling the errors.
> 
> Furthermore, to enable more testing while not affecting end users, the
> change is only implemented for DEBUG builds.
> 
> Qu Wenruo (3):
>   btrfs: do not use __GFP_NOFAIL flag for __alloc_extent_buffer()
>   btrfs: do not use __GFP_NOFAIL flag for attach_eb_folio_to_filemap()
>   btrfs: do not use __GFP_NOFAIL flag for btrfs_alloc_folio_array()

The reason I want to leave NOFAIL is because in a cgroup memory constrained
environment we could get an errant ENOMEM on some sort of metadata operation,
which then gets turned into an aborted transaction.  I don't want a memory
constrained cgroup flipping the whole file system read only because it got an
ENOMEM in a place where we have no choice but to abort the transaction.

If we could eliminate that possibility then hooray, but that's not actually
possible, because any COW for a multi-modification case (think finish ordered
io) could result in an ENOMEM and thus a transaction abort.  We need to live
with NOFAIL for these cases.  Thanks,

Josef

Qu Wenruo July 5, 2024, 10:40 p.m. UTC | #2

在 2024/7/6 00:25, Josef Bacik 写道:
> On Fri, Jul 05, 2024 at 03:45:37PM +0930, Qu Wenruo wrote:
>> This patchset removes all __GFP_NOFAIL flags usage inside btrfs for
>> DEBUG builds.
>>
>> There are 3 call sites utilizing __GFP_NOFAIL:
>>
>> - __alloc_extent_buffer()
>>    It's for the extent_buffer structure allocation.
>>    All callers are already handling the errors.
>>
>> - attach_eb_folio_to_filemap()
>>    It's for the filemap_add_folio() call, the flag is also passed to mem
>>    cgroup, which I suspect is not handling larger folio and __GFP_NOFAIL
>>    correctly, as I'm hitting soft lockups when testing larger folios
>>
>>    New error handling is added.
>>
>> - btrfs_alloc_folio_array()
>>    This is for page allocation for extent buffers.
>>    All callers are already handling the errors.
>>
>> Furthermore, to enable more testing while not affecting end users, the
>> change is only implemented for DEBUG builds.
>>
>> Qu Wenruo (3):
>>    btrfs: do not use __GFP_NOFAIL flag for __alloc_extent_buffer()
>>    btrfs: do not use __GFP_NOFAIL flag for attach_eb_folio_to_filemap()
>>    btrfs: do not use __GFP_NOFAIL flag for btrfs_alloc_folio_array()
>
> The reason I want to leave NOFAIL is because in a cgroup memory constrained
> environment we could get an errant ENOMEM on some sort of metadata operation,
> which then gets turned into an aborted transaction.  I don't want a memory
> constrained cgroup flipping the whole file system read only because it got an
> ENOMEM in a place where we have no choice but to abort the transaction.

Well, I'm already seeing mem_cgroup_charge() soft locks up, possible due
to the _NOFAIL flag and larger folios at filemap_add_folio().

Thus the _NOFAIL flag is not only affecting page allocation, but also
passed to mem cgroup related code, and that can already be problematic.

>
> If we could eliminate that possibility then hooray, but that's not actually
> possible, because any COW for a multi-modification case (think finish ordered
> io) could result in an ENOMEM and thus a transaction abort.  We need to live
> with NOFAIL for these cases.  Thanks,

In that cgroup controlled case, I'm wondering how it would even handle
btrfs extent buffer allocation.

Wouldn't all the eb charged to certain cgroup meanwhile the eb pages are
a shared resource for all btrfs operations?

If so, I'd say the mem cgroup on btree inode is already problematic, and
we should eliminate the behavior other than using _NOFAIL to workaround it.

Thanks,
Qu

>
> Josef
>

David Sterba July 30, 2024, 1:55 p.m. UTC | #3

On Fri, Jul 05, 2024 at 10:55:43AM -0400, Josef Bacik wrote:
> On Fri, Jul 05, 2024 at 03:45:37PM +0930, Qu Wenruo wrote:
> > This patchset removes all __GFP_NOFAIL flags usage inside btrfs for
> > DEBUG builds.
> > 
> > There are 3 call sites utilizing __GFP_NOFAIL:
> > 
> > - __alloc_extent_buffer()
> >   It's for the extent_buffer structure allocation.
> >   All callers are already handling the errors.
> > 
> > - attach_eb_folio_to_filemap()
> >   It's for the filemap_add_folio() call, the flag is also passed to mem
> >   cgroup, which I suspect is not handling larger folio and __GFP_NOFAIL
> >   correctly, as I'm hitting soft lockups when testing larger folios
> > 
> >   New error handling is added.
> > 
> > - btrfs_alloc_folio_array()
> >   This is for page allocation for extent buffers.
> >   All callers are already handling the errors.
> > 
> > Furthermore, to enable more testing while not affecting end users, the
> > change is only implemented for DEBUG builds.
> > 
> > Qu Wenruo (3):
> >   btrfs: do not use __GFP_NOFAIL flag for __alloc_extent_buffer()
> >   btrfs: do not use __GFP_NOFAIL flag for attach_eb_folio_to_filemap()
> >   btrfs: do not use __GFP_NOFAIL flag for btrfs_alloc_folio_array()
> 
> The reason I want to leave NOFAIL is because in a cgroup memory constrained
> environment we could get an errant ENOMEM on some sort of metadata operation,
> which then gets turned into an aborted transaction.  I don't want a memory
> constrained cgroup flipping the whole file system read only because it got an
> ENOMEM in a place where we have no choice but to abort the transaction.
> 
> If we could eliminate that possibility then hooray, but that's not actually
> possible, because any COW for a multi-modification case (think finish ordered
> io) could result in an ENOMEM and thus a transaction abort.  We need to live
> with NOFAIL for these cases.  Thanks,

I agree with keeping NOFAIL.  Please add the above as a comment to
btrfs_alloc_folio_array().

Qu Wenruo July 30, 2024, 10:17 p.m. UTC | #4

在 2024/7/30 23:25, David Sterba 写道:
> On Fri, Jul 05, 2024 at 10:55:43AM -0400, Josef Bacik wrote:
>> On Fri, Jul 05, 2024 at 03:45:37PM +0930, Qu Wenruo wrote:
>>> This patchset removes all __GFP_NOFAIL flags usage inside btrfs for
>>> DEBUG builds.
>>>
>>> There are 3 call sites utilizing __GFP_NOFAIL:
>>>
>>> - __alloc_extent_buffer()
>>>    It's for the extent_buffer structure allocation.
>>>    All callers are already handling the errors.
>>>
>>> - attach_eb_folio_to_filemap()
>>>    It's for the filemap_add_folio() call, the flag is also passed to mem
>>>    cgroup, which I suspect is not handling larger folio and __GFP_NOFAIL
>>>    correctly, as I'm hitting soft lockups when testing larger folios
>>>
>>>    New error handling is added.
>>>
>>> - btrfs_alloc_folio_array()
>>>    This is for page allocation for extent buffers.
>>>    All callers are already handling the errors.
>>>
>>> Furthermore, to enable more testing while not affecting end users, the
>>> change is only implemented for DEBUG builds.
>>>
>>> Qu Wenruo (3):
>>>    btrfs: do not use __GFP_NOFAIL flag for __alloc_extent_buffer()
>>>    btrfs: do not use __GFP_NOFAIL flag for attach_eb_folio_to_filemap()
>>>    btrfs: do not use __GFP_NOFAIL flag for btrfs_alloc_folio_array()
>>
>> The reason I want to leave NOFAIL is because in a cgroup memory constrained
>> environment we could get an errant ENOMEM on some sort of metadata operation,
>> which then gets turned into an aborted transaction.  I don't want a memory
>> constrained cgroup flipping the whole file system read only because it got an
>> ENOMEM in a place where we have no choice but to abort the transaction.
>>
>> If we could eliminate that possibility then hooray, but that's not actually
>> possible, because any COW for a multi-modification case (think finish ordered
>> io) could result in an ENOMEM and thus a transaction abort.  We need to live
>> with NOFAIL for these cases.  Thanks,
> 
> I agree with keeping NOFAIL.  Please add the above as a comment to
> btrfs_alloc_folio_array().

That will soon no longer be a problem.

The cgroup guys are fine if certain inode should not be limited by mem 
cgroup, so I already have patches to use root mem cgroup so that it will 
not be charged at all.

https://lore.kernel.org/linux-mm/6a9ba2c8e70c7b5c4316404612f281a031f847da.1721384771.git.wqu@suse.com/

Furthermore, using NOFAIL just to workaround mem cgroup looks like an 
abuse to me.

Thanks,
Qu

David Sterba July 31, 2024, 5:13 p.m. UTC | #5

On Wed, Jul 31, 2024 at 07:47:51AM +0930, Qu Wenruo wrote:
> 
> 
> 在 2024/7/30 23:25, David Sterba 写道:
> > On Fri, Jul 05, 2024 at 10:55:43AM -0400, Josef Bacik wrote:
> >> On Fri, Jul 05, 2024 at 03:45:37PM +0930, Qu Wenruo wrote:
> >>> This patchset removes all __GFP_NOFAIL flags usage inside btrfs for
> >>> DEBUG builds.
> >>>
> >>> There are 3 call sites utilizing __GFP_NOFAIL:
> >>>
> >>> - __alloc_extent_buffer()
> >>>    It's for the extent_buffer structure allocation.
> >>>    All callers are already handling the errors.
> >>>
> >>> - attach_eb_folio_to_filemap()
> >>>    It's for the filemap_add_folio() call, the flag is also passed to mem
> >>>    cgroup, which I suspect is not handling larger folio and __GFP_NOFAIL
> >>>    correctly, as I'm hitting soft lockups when testing larger folios
> >>>
> >>>    New error handling is added.
> >>>
> >>> - btrfs_alloc_folio_array()
> >>>    This is for page allocation for extent buffers.
> >>>    All callers are already handling the errors.
> >>>
> >>> Furthermore, to enable more testing while not affecting end users, the
> >>> change is only implemented for DEBUG builds.
> >>>
> >>> Qu Wenruo (3):
> >>>    btrfs: do not use __GFP_NOFAIL flag for __alloc_extent_buffer()
> >>>    btrfs: do not use __GFP_NOFAIL flag for attach_eb_folio_to_filemap()
> >>>    btrfs: do not use __GFP_NOFAIL flag for btrfs_alloc_folio_array()
> >>
> >> The reason I want to leave NOFAIL is because in a cgroup memory constrained
> >> environment we could get an errant ENOMEM on some sort of metadata operation,
> >> which then gets turned into an aborted transaction.  I don't want a memory
> >> constrained cgroup flipping the whole file system read only because it got an
> >> ENOMEM in a place where we have no choice but to abort the transaction.
> >>
> >> If we could eliminate that possibility then hooray, but that's not actually
> >> possible, because any COW for a multi-modification case (think finish ordered
> >> io) could result in an ENOMEM and thus a transaction abort.  We need to live
> >> with NOFAIL for these cases.  Thanks,
> > 
> > I agree with keeping NOFAIL.  Please add the above as a comment to
> > btrfs_alloc_folio_array().
> 
> That will soon no longer be a problem.
> 
> The cgroup guys are fine if certain inode should not be limited by mem 
> cgroup, so I already have patches to use root mem cgroup so that it will 
> not be charged at all.
> 
> https://lore.kernel.org/linux-mm/6a9ba2c8e70c7b5c4316404612f281a031f847da.1721384771.git.wqu@suse.com/
> 
> Furthermore, using NOFAIL just to workaround mem cgroup looks like an 
> abuse to me.

It's not just because of cgroup, the nofail semantics for metadata
means that MM subsystem should try to get some memory instead of
failing, where the filesystem operation can wait a bit. What josef
described regarding transaction aborts applies in general.

Qu Wenruo Aug. 4, 2024, 9:22 a.m. UTC | #6

在 2024/8/1 02:43, David Sterba 写道:
> On Wed, Jul 31, 2024 at 07:47:51AM +0930, Qu Wenruo wrote:
>>
>>
>> 在 2024/7/30 23:25, David Sterba 写道:
>>> On Fri, Jul 05, 2024 at 10:55:43AM -0400, Josef Bacik wrote:
>>>> On Fri, Jul 05, 2024 at 03:45:37PM +0930, Qu Wenruo wrote:
>>>>> This patchset removes all __GFP_NOFAIL flags usage inside btrfs for
>>>>> DEBUG builds.
>>>>>
>>>>> There are 3 call sites utilizing __GFP_NOFAIL:
>>>>>
>>>>> - __alloc_extent_buffer()
>>>>>     It's for the extent_buffer structure allocation.
>>>>>     All callers are already handling the errors.
>>>>>
>>>>> - attach_eb_folio_to_filemap()
>>>>>     It's for the filemap_add_folio() call, the flag is also passed to mem
>>>>>     cgroup, which I suspect is not handling larger folio and __GFP_NOFAIL
>>>>>     correctly, as I'm hitting soft lockups when testing larger folios
>>>>>
>>>>>     New error handling is added.
>>>>>
>>>>> - btrfs_alloc_folio_array()
>>>>>     This is for page allocation for extent buffers.
>>>>>     All callers are already handling the errors.
>>>>>
>>>>> Furthermore, to enable more testing while not affecting end users, the
>>>>> change is only implemented for DEBUG builds.
>>>>>
>>>>> Qu Wenruo (3):
>>>>>     btrfs: do not use __GFP_NOFAIL flag for __alloc_extent_buffer()
>>>>>     btrfs: do not use __GFP_NOFAIL flag for attach_eb_folio_to_filemap()
>>>>>     btrfs: do not use __GFP_NOFAIL flag for btrfs_alloc_folio_array()
>>>>
>>>> The reason I want to leave NOFAIL is because in a cgroup memory constrained
>>>> environment we could get an errant ENOMEM on some sort of metadata operation,
>>>> which then gets turned into an aborted transaction.  I don't want a memory
>>>> constrained cgroup flipping the whole file system read only because it got an
>>>> ENOMEM in a place where we have no choice but to abort the transaction.
>>>>
>>>> If we could eliminate that possibility then hooray, but that's not actually
>>>> possible, because any COW for a multi-modification case (think finish ordered
>>>> io) could result in an ENOMEM and thus a transaction abort.  We need to live
>>>> with NOFAIL for these cases.  Thanks,
>>>
>>> I agree with keeping NOFAIL.  Please add the above as a comment to
>>> btrfs_alloc_folio_array().
>>
>> That will soon no longer be a problem.
>>
>> The cgroup guys are fine if certain inode should not be limited by mem
>> cgroup, so I already have patches to use root mem cgroup so that it will
>> not be charged at all.
>>
>> https://lore.kernel.org/linux-mm/6a9ba2c8e70c7b5c4316404612f281a031f847da.1721384771.git.wqu@suse.com/
>>
>> Furthermore, using NOFAIL just to workaround mem cgroup looks like an
>> abuse to me.
>
> It's not just because of cgroup, the nofail semantics for metadata
> means that MM subsystem should try to get some memory instead of
> failing, where the filesystem operation can wait a bit. What josef
> described regarding transaction aborts applies in general.
>

If you are only talking about the original patch, yes, we should not
remove NOFAIL for folio allocation, at least for now.

But this patchset is already a little outdated, the latest way to remove
NOFAIL is to only remove the usage for filemap_add_folio().

Filemap_add_folio() is only doing two things:

- Cgroup charge
   This will be addressed by manually using root memcgroup, so that
   we will never trigger cgroup charge anymore.

- Xarray split (aka new slot allocation)
   This is only allocating some xarray slots, if that fails I do not
   really believe NOFAIL can do anything much better.

So all your argument on the NOFAIL and metadata allocation no longer stands.

Because I'm not even going to touch folio allocation part (even I still
believe we should not go NOFAIL for metadata folio allocation).

Thanks,
Qu

[0/3] btrfs: remove __GFP_NOFAIL usage for debug builds

Message

Comments