mbox series

[v2,0/3] btrfs: fix unexpected -ENOMEM with percpu_counter_init when create snapshot

Message ID 20221214021125.28289-1-robbieko@synology.com (mailing list archive)
Headers show
Series btrfs: fix unexpected -ENOMEM with percpu_counter_init when create snapshot | expand

Message

robbieko Dec. 14, 2022, 2:11 a.m. UTC
From: Robbie Ko <robbieko@synology.com>

[Issue]
When creating subvolume/snapshot, the transaction may be abort due to -ENOMEM

  WARNING: CPU: 1 PID: 411 at fs/btrfs/transaction.c:1937 create_pending_snapshot+0xe30/0xe70 [btrfs]()
  CPU: 1 PID: 411 Comm: btrfs Tainted: P O 4.4.180+ #42661
  Call Trace:
    create_pending_snapshot+0xe30/0xe70 [btrfs]
    create_pending_snapshots+0x89/0xb0 [btrfs]
    btrfs_commit_transaction+0x469/0xc60 [btrfs]
    btrfs_mksubvol+0x5bd/0x690 [btrfs]
    btrfs_mksnapshot+0x102/0x170 [btrfs]
    btrfs_ioctl_snap_create_transid+0x1ad/0x1c0 [btrfs]
    btrfs_ioctl_snap_create_v2+0x102/0x160 [btrfs]
    btrfs_ioctl+0x2111/0x3130 [btrfs]
    do_vfs_ioctl+0x7ea/0xa80
    SyS_ioctl+0xa1/0xb0
    entry_SYSCALL_64_fastpath+0x1e/0x8e
  ---[ end trace 910c8f86780ca385 ]---
  BTRFS: error (device dm-2) in create_pending_snapshot:1937: errno=-12 Out of memory

[Cause]
During creating a subvolume/snapshot, it is necessary to allocate memory for Initializing fs root.
Therefore, it can only use GFP_NOFS to allocate memory to avoid deadlock issues.
However, atomic allocation is required when processing percpu_counter_init
without GFP_KERNEL due to the unique structure of percpu_counter.
In this situation, allocating memory for initializing fs root may cause
unexpected -ENOMEM when free memory is low and causes btrfs transaction to abort.

[Fix]
We allocate memory at the beginning of creating a subvolume/snapshot.
This way, we can ensure the memory is enough when initializing fs root.
Even -ENOMEM happens at the beginning of creating a subvolume/snapshot,
the transaction won’t abort since it hasn’t started yet.

Robbie Ko (3):
  btrfs: refactor anon_dev with new_fs_root_args for create
    subvolume/snapshot
  btrfs: change snapshot_lock to dynamic pointer
  btrfs: add snapshot_lock to new_fs_root_args

 fs/btrfs/ctree.h       |   2 +-
 fs/btrfs/disk-io.c     | 107 ++++++++++++++++++++++++++++++-----------
 fs/btrfs/disk-io.h     |  12 ++++-
 fs/btrfs/file.c        |   8 +--
 fs/btrfs/inode.c       |  12 ++---
 fs/btrfs/ioctl.c       |  38 +++++++--------
 fs/btrfs/transaction.c |   2 +-
 fs/btrfs/transaction.h |   5 +-
 8 files changed, 123 insertions(+), 63 deletions(-)

Comments

Josef Bacik Dec. 14, 2022, 4:59 p.m. UTC | #1
On Wed, Dec 14, 2022 at 10:11:22AM +0800, robbieko wrote:
> From: Robbie Ko <robbieko@synology.com>
> 
> [Issue]
> When creating subvolume/snapshot, the transaction may be abort due to -ENOMEM
> 
>   WARNING: CPU: 1 PID: 411 at fs/btrfs/transaction.c:1937 create_pending_snapshot+0xe30/0xe70 [btrfs]()
>   CPU: 1 PID: 411 Comm: btrfs Tainted: P O 4.4.180+ #42661
>   Call Trace:
>     create_pending_snapshot+0xe30/0xe70 [btrfs]
>     create_pending_snapshots+0x89/0xb0 [btrfs]
>     btrfs_commit_transaction+0x469/0xc60 [btrfs]
>     btrfs_mksubvol+0x5bd/0x690 [btrfs]
>     btrfs_mksnapshot+0x102/0x170 [btrfs]
>     btrfs_ioctl_snap_create_transid+0x1ad/0x1c0 [btrfs]
>     btrfs_ioctl_snap_create_v2+0x102/0x160 [btrfs]
>     btrfs_ioctl+0x2111/0x3130 [btrfs]
>     do_vfs_ioctl+0x7ea/0xa80
>     SyS_ioctl+0xa1/0xb0
>     entry_SYSCALL_64_fastpath+0x1e/0x8e
>   ---[ end trace 910c8f86780ca385 ]---
>   BTRFS: error (device dm-2) in create_pending_snapshot:1937: errno=-12 Out of memory
> 
> [Cause]
> During creating a subvolume/snapshot, it is necessary to allocate memory for Initializing fs root.
> Therefore, it can only use GFP_NOFS to allocate memory to avoid deadlock issues.
> However, atomic allocation is required when processing percpu_counter_init
> without GFP_KERNEL due to the unique structure of percpu_counter.
> In this situation, allocating memory for initializing fs root may cause
> unexpected -ENOMEM when free memory is low and causes btrfs transaction to abort.
> 
> [Fix]
> We allocate memory at the beginning of creating a subvolume/snapshot.
> This way, we can ensure the memory is enough when initializing fs root.
> Even -ENOMEM happens at the beginning of creating a subvolume/snapshot,
> the transaction won’t abort since it hasn’t started yet.
> 

Honestly I'd rather just make the btrfs_drew_lock use an atomic_t for the
writers counter as well.  This is only taken in truncate an nocow writes, and in
nocow writes there are a looooot of slower things that have to be done that
we're not winning a lot with the percpu counter.  Is there any reason not to
just do that and leave all this code alone?  Thanks,

Josef
David Sterba Dec. 14, 2022, 6:07 p.m. UTC | #2
On Wed, Dec 14, 2022 at 11:59:10AM -0500, Josef Bacik wrote:
> On Wed, Dec 14, 2022 at 10:11:22AM +0800, robbieko wrote:
> > From: Robbie Ko <robbieko@synology.com>
> > 
> > [Issue]
> > When creating subvolume/snapshot, the transaction may be abort due to -ENOMEM
> > 
> >   WARNING: CPU: 1 PID: 411 at fs/btrfs/transaction.c:1937 create_pending_snapshot+0xe30/0xe70 [btrfs]()
> >   CPU: 1 PID: 411 Comm: btrfs Tainted: P O 4.4.180+ #42661
> >   Call Trace:
> >     create_pending_snapshot+0xe30/0xe70 [btrfs]
> >     create_pending_snapshots+0x89/0xb0 [btrfs]
> >     btrfs_commit_transaction+0x469/0xc60 [btrfs]
> >     btrfs_mksubvol+0x5bd/0x690 [btrfs]
> >     btrfs_mksnapshot+0x102/0x170 [btrfs]
> >     btrfs_ioctl_snap_create_transid+0x1ad/0x1c0 [btrfs]
> >     btrfs_ioctl_snap_create_v2+0x102/0x160 [btrfs]
> >     btrfs_ioctl+0x2111/0x3130 [btrfs]
> >     do_vfs_ioctl+0x7ea/0xa80
> >     SyS_ioctl+0xa1/0xb0
> >     entry_SYSCALL_64_fastpath+0x1e/0x8e
> >   ---[ end trace 910c8f86780ca385 ]---
> >   BTRFS: error (device dm-2) in create_pending_snapshot:1937: errno=-12 Out of memory
> > 
> > [Cause]
> > During creating a subvolume/snapshot, it is necessary to allocate memory for Initializing fs root.
> > Therefore, it can only use GFP_NOFS to allocate memory to avoid deadlock issues.
> > However, atomic allocation is required when processing percpu_counter_init
> > without GFP_KERNEL due to the unique structure of percpu_counter.
> > In this situation, allocating memory for initializing fs root may cause
> > unexpected -ENOMEM when free memory is low and causes btrfs transaction to abort.
> > 
> > [Fix]
> > We allocate memory at the beginning of creating a subvolume/snapshot.
> > This way, we can ensure the memory is enough when initializing fs root.
> > Even -ENOMEM happens at the beginning of creating a subvolume/snapshot,
> > the transaction won’t abort since it hasn’t started yet.
> 
> Honestly I'd rather just make the btrfs_drew_lock use an atomic_t for the
> writers counter as well.  This is only taken in truncate an nocow writes, and in
> nocow writes there are a looooot of slower things that have to be done that
> we're not winning a lot with the percpu counter.  Is there any reason not to
> just do that and leave all this code alone?  Thanks,

The percpu counter for writers is there since the original commit
8257b2dc3c1a1057 "Btrfs: introduce btrfs_{start, end}_nocow_write() for
each subvolume". The reason could be to avoid hammering the same
cacheline from all the readers but then the writers do that anyway.
This happens in context related to IO or there's some waiting anyway, so
yeah using atomic for writers should be ok as well.
robbieko Dec. 19, 2022, 6:54 a.m. UTC | #3
David Sterba 於 2022/12/15 上午2:07 寫道:
> On Wed, Dec 14, 2022 at 11:59:10AM -0500, Josef Bacik wrote:
>> On Wed, Dec 14, 2022 at 10:11:22AM +0800, robbieko wrote:
>>> From: Robbie Ko <robbieko@synology.com>
>>>
>>> [Issue]
>>> When creating subvolume/snapshot, the transaction may be abort due to -ENOMEM
>>>
>>>    WARNING: CPU: 1 PID: 411 at fs/btrfs/transaction.c:1937 create_pending_snapshot+0xe30/0xe70 [btrfs]()
>>>    CPU: 1 PID: 411 Comm: btrfs Tainted: P O 4.4.180+ #42661
>>>    Call Trace:
>>>      create_pending_snapshot+0xe30/0xe70 [btrfs]
>>>      create_pending_snapshots+0x89/0xb0 [btrfs]
>>>      btrfs_commit_transaction+0x469/0xc60 [btrfs]
>>>      btrfs_mksubvol+0x5bd/0x690 [btrfs]
>>>      btrfs_mksnapshot+0x102/0x170 [btrfs]
>>>      btrfs_ioctl_snap_create_transid+0x1ad/0x1c0 [btrfs]
>>>      btrfs_ioctl_snap_create_v2+0x102/0x160 [btrfs]
>>>      btrfs_ioctl+0x2111/0x3130 [btrfs]
>>>      do_vfs_ioctl+0x7ea/0xa80
>>>      SyS_ioctl+0xa1/0xb0
>>>      entry_SYSCALL_64_fastpath+0x1e/0x8e
>>>    ---[ end trace 910c8f86780ca385 ]---
>>>    BTRFS: error (device dm-2) in create_pending_snapshot:1937: errno=-12 Out of memory
>>>
>>> [Cause]
>>> During creating a subvolume/snapshot, it is necessary to allocate memory for Initializing fs root.
>>> Therefore, it can only use GFP_NOFS to allocate memory to avoid deadlock issues.
>>> However, atomic allocation is required when processing percpu_counter_init
>>> without GFP_KERNEL due to the unique structure of percpu_counter.
>>> In this situation, allocating memory for initializing fs root may cause
>>> unexpected -ENOMEM when free memory is low and causes btrfs transaction to abort.
>>>
>>> [Fix]
>>> We allocate memory at the beginning of creating a subvolume/snapshot.
>>> This way, we can ensure the memory is enough when initializing fs root.
>>> Even -ENOMEM happens at the beginning of creating a subvolume/snapshot,
>>> the transaction won’t abort since it hasn’t started yet.
>> Honestly I'd rather just make the btrfs_drew_lock use an atomic_t for the
>> writers counter as well.  This is only taken in truncate an nocow writes, and in
>> nocow writes there are a looooot of slower things that have to be done that
>> we're not winning a lot with the percpu counter.  Is there any reason not to
>> just do that and leave all this code alone?  Thanks,
> The percpu counter for writers is there since the original commit
> 8257b2dc3c1a1057 "Btrfs: introduce btrfs_{start, end}_nocow_write() for
> each subvolume". The reason could be to avoid hammering the same
> cacheline from all the readers but then the writers do that anyway.
> This happens in context related to IO or there's some waiting anyway, so
> yeah using atomic for writers should be ok as well.

Sorry for the late reply, I've been busy recently.
This modification will not affect the original btrfs_drew_lock behavior,
and the framework can also provide future scenarios where memory
needs to be allocated in init_fs_root.

Thanks.

Robbie Ko
David Sterba Jan. 10, 2023, 3:08 p.m. UTC | #4
On Mon, Dec 19, 2022 at 02:54:06PM +0800, robbieko wrote:
> 
> David Sterba 於 2022/12/15 上午2:07 寫道:
> > On Wed, Dec 14, 2022 at 11:59:10AM -0500, Josef Bacik wrote:
> >> On Wed, Dec 14, 2022 at 10:11:22AM +0800, robbieko wrote:
> >>> From: Robbie Ko <robbieko@synology.com>
> >>>
> >>> [Issue]
> >>> When creating subvolume/snapshot, the transaction may be abort due to -ENOMEM
> >>>
> >>>    WARNING: CPU: 1 PID: 411 at fs/btrfs/transaction.c:1937 create_pending_snapshot+0xe30/0xe70 [btrfs]()
> >>>    CPU: 1 PID: 411 Comm: btrfs Tainted: P O 4.4.180+ #42661
> >>>    Call Trace:
> >>>      create_pending_snapshot+0xe30/0xe70 [btrfs]
> >>>      create_pending_snapshots+0x89/0xb0 [btrfs]
> >>>      btrfs_commit_transaction+0x469/0xc60 [btrfs]
> >>>      btrfs_mksubvol+0x5bd/0x690 [btrfs]
> >>>      btrfs_mksnapshot+0x102/0x170 [btrfs]
> >>>      btrfs_ioctl_snap_create_transid+0x1ad/0x1c0 [btrfs]
> >>>      btrfs_ioctl_snap_create_v2+0x102/0x160 [btrfs]
> >>>      btrfs_ioctl+0x2111/0x3130 [btrfs]
> >>>      do_vfs_ioctl+0x7ea/0xa80
> >>>      SyS_ioctl+0xa1/0xb0
> >>>      entry_SYSCALL_64_fastpath+0x1e/0x8e
> >>>    ---[ end trace 910c8f86780ca385 ]---
> >>>    BTRFS: error (device dm-2) in create_pending_snapshot:1937: errno=-12 Out of memory
> >>>
> >>> [Cause]
> >>> During creating a subvolume/snapshot, it is necessary to allocate memory for Initializing fs root.
> >>> Therefore, it can only use GFP_NOFS to allocate memory to avoid deadlock issues.
> >>> However, atomic allocation is required when processing percpu_counter_init
> >>> without GFP_KERNEL due to the unique structure of percpu_counter.
> >>> In this situation, allocating memory for initializing fs root may cause
> >>> unexpected -ENOMEM when free memory is low and causes btrfs transaction to abort.
> >>>
> >>> [Fix]
> >>> We allocate memory at the beginning of creating a subvolume/snapshot.
> >>> This way, we can ensure the memory is enough when initializing fs root.
> >>> Even -ENOMEM happens at the beginning of creating a subvolume/snapshot,
> >>> the transaction won’t abort since it hasn’t started yet.
> >> Honestly I'd rather just make the btrfs_drew_lock use an atomic_t for the
> >> writers counter as well.  This is only taken in truncate an nocow writes, and in
> >> nocow writes there are a looooot of slower things that have to be done that
> >> we're not winning a lot with the percpu counter.  Is there any reason not to
> >> just do that and leave all this code alone?  Thanks,
> > The percpu counter for writers is there since the original commit
> > 8257b2dc3c1a1057 "Btrfs: introduce btrfs_{start, end}_nocow_write() for
> > each subvolume". The reason could be to avoid hammering the same
> > cacheline from all the readers but then the writers do that anyway.
> > This happens in context related to IO or there's some waiting anyway, so
> > yeah using atomic for writers should be ok as well.
> 
> Sorry for the late reply, I've been busy recently.
> This modification will not affect the original btrfs_drew_lock behavior,
> and the framework can also provide future scenarios where memory
> needs to be allocated in init_fs_root.

With the conversion to atomic_t the preallocation can remain unchanged
as there would be only the anon bdev preallocated, then there's not much
reason to have the infrastructure.

I'm now testing the patch below, it's relatively short an can be
backported if needed but it can potentially make the performance worse
in some cases.

---
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a7d5a3967ba0..7acedc6c2a56 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1523,15 +1523,7 @@ static int btrfs_init_fs_root(struct btrfs_root *root, dev_t anon_dev)
 	int ret;
 	unsigned int nofs_flag;
 
-	/*
-	 * We might be called under a transaction (e.g. indirect backref
-	 * resolution) which could deadlock if it triggers memory reclaim
-	 */
-	nofs_flag = memalloc_nofs_save();
-	ret = btrfs_drew_lock_init(&root->snapshot_lock);
-	memalloc_nofs_restore(nofs_flag);
-	if (ret)
-		goto fail;
+	 btrfs_drew_lock_init(&root->snapshot_lock);
 
 	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID &&
 	    !btrfs_is_data_reloc_root(root)) {
@@ -2242,7 +2234,6 @@ void btrfs_put_root(struct btrfs_root *root)
 		WARN_ON(test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state));
 		if (root->anon_dev)
 			free_anon_bdev(root->anon_dev);
-		btrfs_drew_lock_destroy(&root->snapshot_lock);
 		free_root_extent_buffers(root);
 #ifdef CONFIG_BTRFS_DEBUG
 		spin_lock(&root->fs_info->fs_roots_radix_lock);
diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index 870528d87526..3a496b0d3d2b 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -325,24 +325,12 @@ struct extent_buffer *btrfs_try_read_lock_root_node(struct btrfs_root *root)
  * acquire the lock.
  */
 
-int btrfs_drew_lock_init(struct btrfs_drew_lock *lock)
+void btrfs_drew_lock_init(struct btrfs_drew_lock *lock)
 {
-	int ret;
-
-	ret = percpu_counter_init(&lock->writers, 0, GFP_KERNEL);
-	if (ret)
-		return ret;
-
 	atomic_set(&lock->readers, 0);
+	atomic_set(&lock->writers, 0);
 	init_waitqueue_head(&lock->pending_readers);
 	init_waitqueue_head(&lock->pending_writers);
-
-	return 0;
-}
-
-void btrfs_drew_lock_destroy(struct btrfs_drew_lock *lock)
-{
-	percpu_counter_destroy(&lock->writers);
 }
 
 /* Return true if acquisition is successful, false otherwise */
@@ -351,10 +339,10 @@ bool btrfs_drew_try_write_lock(struct btrfs_drew_lock *lock)
 	if (atomic_read(&lock->readers))
 		return false;
 
-	percpu_counter_inc(&lock->writers);
+	atomic_inc(&lock->writers);
 
 	/* Ensure writers count is updated before we check for pending readers */
-	smp_mb();
+	smp_mb__after_atomic();
 	if (atomic_read(&lock->readers)) {
 		btrfs_drew_write_unlock(lock);
 		return false;
@@ -374,7 +362,7 @@ void btrfs_drew_write_lock(struct btrfs_drew_lock *lock)
 
 void btrfs_drew_write_unlock(struct btrfs_drew_lock *lock)
 {
-	percpu_counter_dec(&lock->writers);
+	atomic_dec(&lock->writers);
 	cond_wake_up(&lock->pending_readers);
 }
 
@@ -390,8 +378,7 @@ void btrfs_drew_read_lock(struct btrfs_drew_lock *lock)
 	 */
 	smp_mb__after_atomic();
 
-	wait_event(lock->pending_readers,
-		   percpu_counter_sum(&lock->writers) == 0);
+	wait_event(lock->pending_readers, atomic_read(&lock->writers) == 0);
 }
 
 void btrfs_drew_read_unlock(struct btrfs_drew_lock *lock)
diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
index 11c2269b4b6f..edb9b4a0dba1 100644
--- a/fs/btrfs/locking.h
+++ b/fs/btrfs/locking.h
@@ -195,13 +195,12 @@ static inline void btrfs_tree_unlock_rw(struct extent_buffer *eb, int rw)
 
 struct btrfs_drew_lock {
 	atomic_t readers;
-	struct percpu_counter writers;
+	atomic_t writers;
 	wait_queue_head_t pending_writers;
 	wait_queue_head_t pending_readers;
 };
 
-int btrfs_drew_lock_init(struct btrfs_drew_lock *lock);
-void btrfs_drew_lock_destroy(struct btrfs_drew_lock *lock);
+void btrfs_drew_lock_init(struct btrfs_drew_lock *lock);
 void btrfs_drew_write_lock(struct btrfs_drew_lock *lock);
 bool btrfs_drew_try_write_lock(struct btrfs_drew_lock *lock);
 void btrfs_drew_write_unlock(struct btrfs_drew_lock *lock);
David Sterba March 9, 2023, 6:47 p.m. UTC | #5
On Tue, Jan 10, 2023 at 04:08:18PM +0100, David Sterba wrote:
> > Sorry for the late reply, I've been busy recently.
> > This modification will not affect the original btrfs_drew_lock behavior,
> > and the framework can also provide future scenarios where memory
> > needs to be allocated in init_fs_root.
> 
> With the conversion to atomic_t the preallocation can remain unchanged
> as there would be only the anon bdev preallocated, then there's not much
> reason to have the infrastructure.
> 
> I'm now testing the patch below, it's relatively short an can be
> backported if needed but it can potentially make the performance worse
> in some cases.

I forgot to CC you, the patch implementing the switch to atomic has been
sent and merged to misc-next.

https://lore.kernel.org/linux-btrfs/20230301204708.25710-1-dsterba@suse.com/
robbieko March 10, 2023, 1:07 a.m. UTC | #6
David Sterba 於 2023/3/10 上午2:47 寫道:
> On Tue, Jan 10, 2023 at 04:08:18PM +0100, David Sterba wrote:
>>> Sorry for the late reply, I've been busy recently.
>>> This modification will not affect the original btrfs_drew_lock behavior,
>>> and the framework can also provide future scenarios where memory
>>> needs to be allocated in init_fs_root.
>> With the conversion to atomic_t the preallocation can remain unchanged
>> as there would be only the anon bdev preallocated, then there's not much
>> reason to have the infrastructure.
>>
>> I'm now testing the patch below, it's relatively short an can be
>> backported if needed but it can potentially make the performance worse
>> in some cases.
> I forgot to CC you, the patch implementing the switch to atomic has been
> sent and merged to misc-next.
>
> https://lore.kernel.org/linux-btrfs/20230301204708.25710-1-dsterba@suse.com/
OK, Thank you so much.