Message ID | 20221214021125.28289-1-robbieko@synology.com (mailing list archive) |
---|---|
Headers | show |
Series | btrfs: fix unexpected -ENOMEM with percpu_counter_init when create snapshot | expand |
On Wed, Dec 14, 2022 at 10:11:22AM +0800, robbieko wrote: > From: Robbie Ko <robbieko@synology.com> > > [Issue] > When creating subvolume/snapshot, the transaction may be abort due to -ENOMEM > > WARNING: CPU: 1 PID: 411 at fs/btrfs/transaction.c:1937 create_pending_snapshot+0xe30/0xe70 [btrfs]() > CPU: 1 PID: 411 Comm: btrfs Tainted: P O 4.4.180+ #42661 > Call Trace: > create_pending_snapshot+0xe30/0xe70 [btrfs] > create_pending_snapshots+0x89/0xb0 [btrfs] > btrfs_commit_transaction+0x469/0xc60 [btrfs] > btrfs_mksubvol+0x5bd/0x690 [btrfs] > btrfs_mksnapshot+0x102/0x170 [btrfs] > btrfs_ioctl_snap_create_transid+0x1ad/0x1c0 [btrfs] > btrfs_ioctl_snap_create_v2+0x102/0x160 [btrfs] > btrfs_ioctl+0x2111/0x3130 [btrfs] > do_vfs_ioctl+0x7ea/0xa80 > SyS_ioctl+0xa1/0xb0 > entry_SYSCALL_64_fastpath+0x1e/0x8e > ---[ end trace 910c8f86780ca385 ]--- > BTRFS: error (device dm-2) in create_pending_snapshot:1937: errno=-12 Out of memory > > [Cause] > During creating a subvolume/snapshot, it is necessary to allocate memory for Initializing fs root. > Therefore, it can only use GFP_NOFS to allocate memory to avoid deadlock issues. > However, atomic allocation is required when processing percpu_counter_init > without GFP_KERNEL due to the unique structure of percpu_counter. > In this situation, allocating memory for initializing fs root may cause > unexpected -ENOMEM when free memory is low and causes btrfs transaction to abort. > > [Fix] > We allocate memory at the beginning of creating a subvolume/snapshot. > This way, we can ensure the memory is enough when initializing fs root. > Even -ENOMEM happens at the beginning of creating a subvolume/snapshot, > the transaction won’t abort since it hasn’t started yet. > Honestly I'd rather just make the btrfs_drew_lock use an atomic_t for the writers counter as well. This is only taken in truncate an nocow writes, and in nocow writes there are a looooot of slower things that have to be done that we're not winning a lot with the percpu counter. Is there any reason not to just do that and leave all this code alone? Thanks, Josef
On Wed, Dec 14, 2022 at 11:59:10AM -0500, Josef Bacik wrote: > On Wed, Dec 14, 2022 at 10:11:22AM +0800, robbieko wrote: > > From: Robbie Ko <robbieko@synology.com> > > > > [Issue] > > When creating subvolume/snapshot, the transaction may be abort due to -ENOMEM > > > > WARNING: CPU: 1 PID: 411 at fs/btrfs/transaction.c:1937 create_pending_snapshot+0xe30/0xe70 [btrfs]() > > CPU: 1 PID: 411 Comm: btrfs Tainted: P O 4.4.180+ #42661 > > Call Trace: > > create_pending_snapshot+0xe30/0xe70 [btrfs] > > create_pending_snapshots+0x89/0xb0 [btrfs] > > btrfs_commit_transaction+0x469/0xc60 [btrfs] > > btrfs_mksubvol+0x5bd/0x690 [btrfs] > > btrfs_mksnapshot+0x102/0x170 [btrfs] > > btrfs_ioctl_snap_create_transid+0x1ad/0x1c0 [btrfs] > > btrfs_ioctl_snap_create_v2+0x102/0x160 [btrfs] > > btrfs_ioctl+0x2111/0x3130 [btrfs] > > do_vfs_ioctl+0x7ea/0xa80 > > SyS_ioctl+0xa1/0xb0 > > entry_SYSCALL_64_fastpath+0x1e/0x8e > > ---[ end trace 910c8f86780ca385 ]--- > > BTRFS: error (device dm-2) in create_pending_snapshot:1937: errno=-12 Out of memory > > > > [Cause] > > During creating a subvolume/snapshot, it is necessary to allocate memory for Initializing fs root. > > Therefore, it can only use GFP_NOFS to allocate memory to avoid deadlock issues. > > However, atomic allocation is required when processing percpu_counter_init > > without GFP_KERNEL due to the unique structure of percpu_counter. > > In this situation, allocating memory for initializing fs root may cause > > unexpected -ENOMEM when free memory is low and causes btrfs transaction to abort. > > > > [Fix] > > We allocate memory at the beginning of creating a subvolume/snapshot. > > This way, we can ensure the memory is enough when initializing fs root. > > Even -ENOMEM happens at the beginning of creating a subvolume/snapshot, > > the transaction won’t abort since it hasn’t started yet. > > Honestly I'd rather just make the btrfs_drew_lock use an atomic_t for the > writers counter as well. This is only taken in truncate an nocow writes, and in > nocow writes there are a looooot of slower things that have to be done that > we're not winning a lot with the percpu counter. Is there any reason not to > just do that and leave all this code alone? Thanks, The percpu counter for writers is there since the original commit 8257b2dc3c1a1057 "Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume". The reason could be to avoid hammering the same cacheline from all the readers but then the writers do that anyway. This happens in context related to IO or there's some waiting anyway, so yeah using atomic for writers should be ok as well.
David Sterba 於 2022/12/15 上午2:07 寫道: > On Wed, Dec 14, 2022 at 11:59:10AM -0500, Josef Bacik wrote: >> On Wed, Dec 14, 2022 at 10:11:22AM +0800, robbieko wrote: >>> From: Robbie Ko <robbieko@synology.com> >>> >>> [Issue] >>> When creating subvolume/snapshot, the transaction may be abort due to -ENOMEM >>> >>> WARNING: CPU: 1 PID: 411 at fs/btrfs/transaction.c:1937 create_pending_snapshot+0xe30/0xe70 [btrfs]() >>> CPU: 1 PID: 411 Comm: btrfs Tainted: P O 4.4.180+ #42661 >>> Call Trace: >>> create_pending_snapshot+0xe30/0xe70 [btrfs] >>> create_pending_snapshots+0x89/0xb0 [btrfs] >>> btrfs_commit_transaction+0x469/0xc60 [btrfs] >>> btrfs_mksubvol+0x5bd/0x690 [btrfs] >>> btrfs_mksnapshot+0x102/0x170 [btrfs] >>> btrfs_ioctl_snap_create_transid+0x1ad/0x1c0 [btrfs] >>> btrfs_ioctl_snap_create_v2+0x102/0x160 [btrfs] >>> btrfs_ioctl+0x2111/0x3130 [btrfs] >>> do_vfs_ioctl+0x7ea/0xa80 >>> SyS_ioctl+0xa1/0xb0 >>> entry_SYSCALL_64_fastpath+0x1e/0x8e >>> ---[ end trace 910c8f86780ca385 ]--- >>> BTRFS: error (device dm-2) in create_pending_snapshot:1937: errno=-12 Out of memory >>> >>> [Cause] >>> During creating a subvolume/snapshot, it is necessary to allocate memory for Initializing fs root. >>> Therefore, it can only use GFP_NOFS to allocate memory to avoid deadlock issues. >>> However, atomic allocation is required when processing percpu_counter_init >>> without GFP_KERNEL due to the unique structure of percpu_counter. >>> In this situation, allocating memory for initializing fs root may cause >>> unexpected -ENOMEM when free memory is low and causes btrfs transaction to abort. >>> >>> [Fix] >>> We allocate memory at the beginning of creating a subvolume/snapshot. >>> This way, we can ensure the memory is enough when initializing fs root. >>> Even -ENOMEM happens at the beginning of creating a subvolume/snapshot, >>> the transaction won’t abort since it hasn’t started yet. >> Honestly I'd rather just make the btrfs_drew_lock use an atomic_t for the >> writers counter as well. This is only taken in truncate an nocow writes, and in >> nocow writes there are a looooot of slower things that have to be done that >> we're not winning a lot with the percpu counter. Is there any reason not to >> just do that and leave all this code alone? Thanks, > The percpu counter for writers is there since the original commit > 8257b2dc3c1a1057 "Btrfs: introduce btrfs_{start, end}_nocow_write() for > each subvolume". The reason could be to avoid hammering the same > cacheline from all the readers but then the writers do that anyway. > This happens in context related to IO or there's some waiting anyway, so > yeah using atomic for writers should be ok as well. Sorry for the late reply, I've been busy recently. This modification will not affect the original btrfs_drew_lock behavior, and the framework can also provide future scenarios where memory needs to be allocated in init_fs_root. Thanks. Robbie Ko
On Mon, Dec 19, 2022 at 02:54:06PM +0800, robbieko wrote: > > David Sterba 於 2022/12/15 上午2:07 寫道: > > On Wed, Dec 14, 2022 at 11:59:10AM -0500, Josef Bacik wrote: > >> On Wed, Dec 14, 2022 at 10:11:22AM +0800, robbieko wrote: > >>> From: Robbie Ko <robbieko@synology.com> > >>> > >>> [Issue] > >>> When creating subvolume/snapshot, the transaction may be abort due to -ENOMEM > >>> > >>> WARNING: CPU: 1 PID: 411 at fs/btrfs/transaction.c:1937 create_pending_snapshot+0xe30/0xe70 [btrfs]() > >>> CPU: 1 PID: 411 Comm: btrfs Tainted: P O 4.4.180+ #42661 > >>> Call Trace: > >>> create_pending_snapshot+0xe30/0xe70 [btrfs] > >>> create_pending_snapshots+0x89/0xb0 [btrfs] > >>> btrfs_commit_transaction+0x469/0xc60 [btrfs] > >>> btrfs_mksubvol+0x5bd/0x690 [btrfs] > >>> btrfs_mksnapshot+0x102/0x170 [btrfs] > >>> btrfs_ioctl_snap_create_transid+0x1ad/0x1c0 [btrfs] > >>> btrfs_ioctl_snap_create_v2+0x102/0x160 [btrfs] > >>> btrfs_ioctl+0x2111/0x3130 [btrfs] > >>> do_vfs_ioctl+0x7ea/0xa80 > >>> SyS_ioctl+0xa1/0xb0 > >>> entry_SYSCALL_64_fastpath+0x1e/0x8e > >>> ---[ end trace 910c8f86780ca385 ]--- > >>> BTRFS: error (device dm-2) in create_pending_snapshot:1937: errno=-12 Out of memory > >>> > >>> [Cause] > >>> During creating a subvolume/snapshot, it is necessary to allocate memory for Initializing fs root. > >>> Therefore, it can only use GFP_NOFS to allocate memory to avoid deadlock issues. > >>> However, atomic allocation is required when processing percpu_counter_init > >>> without GFP_KERNEL due to the unique structure of percpu_counter. > >>> In this situation, allocating memory for initializing fs root may cause > >>> unexpected -ENOMEM when free memory is low and causes btrfs transaction to abort. > >>> > >>> [Fix] > >>> We allocate memory at the beginning of creating a subvolume/snapshot. > >>> This way, we can ensure the memory is enough when initializing fs root. > >>> Even -ENOMEM happens at the beginning of creating a subvolume/snapshot, > >>> the transaction won’t abort since it hasn’t started yet. > >> Honestly I'd rather just make the btrfs_drew_lock use an atomic_t for the > >> writers counter as well. This is only taken in truncate an nocow writes, and in > >> nocow writes there are a looooot of slower things that have to be done that > >> we're not winning a lot with the percpu counter. Is there any reason not to > >> just do that and leave all this code alone? Thanks, > > The percpu counter for writers is there since the original commit > > 8257b2dc3c1a1057 "Btrfs: introduce btrfs_{start, end}_nocow_write() for > > each subvolume". The reason could be to avoid hammering the same > > cacheline from all the readers but then the writers do that anyway. > > This happens in context related to IO or there's some waiting anyway, so > > yeah using atomic for writers should be ok as well. > > Sorry for the late reply, I've been busy recently. > This modification will not affect the original btrfs_drew_lock behavior, > and the framework can also provide future scenarios where memory > needs to be allocated in init_fs_root. With the conversion to atomic_t the preallocation can remain unchanged as there would be only the anon bdev preallocated, then there's not much reason to have the infrastructure. I'm now testing the patch below, it's relatively short an can be backported if needed but it can potentially make the performance worse in some cases. --- diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index a7d5a3967ba0..7acedc6c2a56 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1523,15 +1523,7 @@ static int btrfs_init_fs_root(struct btrfs_root *root, dev_t anon_dev) int ret; unsigned int nofs_flag; - /* - * We might be called under a transaction (e.g. indirect backref - * resolution) which could deadlock if it triggers memory reclaim - */ - nofs_flag = memalloc_nofs_save(); - ret = btrfs_drew_lock_init(&root->snapshot_lock); - memalloc_nofs_restore(nofs_flag); - if (ret) - goto fail; + btrfs_drew_lock_init(&root->snapshot_lock); if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID && !btrfs_is_data_reloc_root(root)) { @@ -2242,7 +2234,6 @@ void btrfs_put_root(struct btrfs_root *root) WARN_ON(test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state)); if (root->anon_dev) free_anon_bdev(root->anon_dev); - btrfs_drew_lock_destroy(&root->snapshot_lock); free_root_extent_buffers(root); #ifdef CONFIG_BTRFS_DEBUG spin_lock(&root->fs_info->fs_roots_radix_lock); diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c index 870528d87526..3a496b0d3d2b 100644 --- a/fs/btrfs/locking.c +++ b/fs/btrfs/locking.c @@ -325,24 +325,12 @@ struct extent_buffer *btrfs_try_read_lock_root_node(struct btrfs_root *root) * acquire the lock. */ -int btrfs_drew_lock_init(struct btrfs_drew_lock *lock) +void btrfs_drew_lock_init(struct btrfs_drew_lock *lock) { - int ret; - - ret = percpu_counter_init(&lock->writers, 0, GFP_KERNEL); - if (ret) - return ret; - atomic_set(&lock->readers, 0); + atomic_set(&lock->writers, 0); init_waitqueue_head(&lock->pending_readers); init_waitqueue_head(&lock->pending_writers); - - return 0; -} - -void btrfs_drew_lock_destroy(struct btrfs_drew_lock *lock) -{ - percpu_counter_destroy(&lock->writers); } /* Return true if acquisition is successful, false otherwise */ @@ -351,10 +339,10 @@ bool btrfs_drew_try_write_lock(struct btrfs_drew_lock *lock) if (atomic_read(&lock->readers)) return false; - percpu_counter_inc(&lock->writers); + atomic_inc(&lock->writers); /* Ensure writers count is updated before we check for pending readers */ - smp_mb(); + smp_mb__after_atomic(); if (atomic_read(&lock->readers)) { btrfs_drew_write_unlock(lock); return false; @@ -374,7 +362,7 @@ void btrfs_drew_write_lock(struct btrfs_drew_lock *lock) void btrfs_drew_write_unlock(struct btrfs_drew_lock *lock) { - percpu_counter_dec(&lock->writers); + atomic_dec(&lock->writers); cond_wake_up(&lock->pending_readers); } @@ -390,8 +378,7 @@ void btrfs_drew_read_lock(struct btrfs_drew_lock *lock) */ smp_mb__after_atomic(); - wait_event(lock->pending_readers, - percpu_counter_sum(&lock->writers) == 0); + wait_event(lock->pending_readers, atomic_read(&lock->writers) == 0); } void btrfs_drew_read_unlock(struct btrfs_drew_lock *lock) diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h index 11c2269b4b6f..edb9b4a0dba1 100644 --- a/fs/btrfs/locking.h +++ b/fs/btrfs/locking.h @@ -195,13 +195,12 @@ static inline void btrfs_tree_unlock_rw(struct extent_buffer *eb, int rw) struct btrfs_drew_lock { atomic_t readers; - struct percpu_counter writers; + atomic_t writers; wait_queue_head_t pending_writers; wait_queue_head_t pending_readers; }; -int btrfs_drew_lock_init(struct btrfs_drew_lock *lock); -void btrfs_drew_lock_destroy(struct btrfs_drew_lock *lock); +void btrfs_drew_lock_init(struct btrfs_drew_lock *lock); void btrfs_drew_write_lock(struct btrfs_drew_lock *lock); bool btrfs_drew_try_write_lock(struct btrfs_drew_lock *lock); void btrfs_drew_write_unlock(struct btrfs_drew_lock *lock);
On Tue, Jan 10, 2023 at 04:08:18PM +0100, David Sterba wrote: > > Sorry for the late reply, I've been busy recently. > > This modification will not affect the original btrfs_drew_lock behavior, > > and the framework can also provide future scenarios where memory > > needs to be allocated in init_fs_root. > > With the conversion to atomic_t the preallocation can remain unchanged > as there would be only the anon bdev preallocated, then there's not much > reason to have the infrastructure. > > I'm now testing the patch below, it's relatively short an can be > backported if needed but it can potentially make the performance worse > in some cases. I forgot to CC you, the patch implementing the switch to atomic has been sent and merged to misc-next. https://lore.kernel.org/linux-btrfs/20230301204708.25710-1-dsterba@suse.com/
David Sterba 於 2023/3/10 上午2:47 寫道: > On Tue, Jan 10, 2023 at 04:08:18PM +0100, David Sterba wrote: >>> Sorry for the late reply, I've been busy recently. >>> This modification will not affect the original btrfs_drew_lock behavior, >>> and the framework can also provide future scenarios where memory >>> needs to be allocated in init_fs_root. >> With the conversion to atomic_t the preallocation can remain unchanged >> as there would be only the anon bdev preallocated, then there's not much >> reason to have the infrastructure. >> >> I'm now testing the patch below, it's relatively short an can be >> backported if needed but it can potentially make the performance worse >> in some cases. > I forgot to CC you, the patch implementing the switch to atomic has been > sent and merged to misc-next. > > https://lore.kernel.org/linux-btrfs/20230301204708.25710-1-dsterba@suse.com/ OK, Thank you so much.
From: Robbie Ko <robbieko@synology.com> [Issue] When creating subvolume/snapshot, the transaction may be abort due to -ENOMEM WARNING: CPU: 1 PID: 411 at fs/btrfs/transaction.c:1937 create_pending_snapshot+0xe30/0xe70 [btrfs]() CPU: 1 PID: 411 Comm: btrfs Tainted: P O 4.4.180+ #42661 Call Trace: create_pending_snapshot+0xe30/0xe70 [btrfs] create_pending_snapshots+0x89/0xb0 [btrfs] btrfs_commit_transaction+0x469/0xc60 [btrfs] btrfs_mksubvol+0x5bd/0x690 [btrfs] btrfs_mksnapshot+0x102/0x170 [btrfs] btrfs_ioctl_snap_create_transid+0x1ad/0x1c0 [btrfs] btrfs_ioctl_snap_create_v2+0x102/0x160 [btrfs] btrfs_ioctl+0x2111/0x3130 [btrfs] do_vfs_ioctl+0x7ea/0xa80 SyS_ioctl+0xa1/0xb0 entry_SYSCALL_64_fastpath+0x1e/0x8e ---[ end trace 910c8f86780ca385 ]--- BTRFS: error (device dm-2) in create_pending_snapshot:1937: errno=-12 Out of memory [Cause] During creating a subvolume/snapshot, it is necessary to allocate memory for Initializing fs root. Therefore, it can only use GFP_NOFS to allocate memory to avoid deadlock issues. However, atomic allocation is required when processing percpu_counter_init without GFP_KERNEL due to the unique structure of percpu_counter. In this situation, allocating memory for initializing fs root may cause unexpected -ENOMEM when free memory is low and causes btrfs transaction to abort. [Fix] We allocate memory at the beginning of creating a subvolume/snapshot. This way, we can ensure the memory is enough when initializing fs root. Even -ENOMEM happens at the beginning of creating a subvolume/snapshot, the transaction won’t abort since it hasn’t started yet. Robbie Ko (3): btrfs: refactor anon_dev with new_fs_root_args for create subvolume/snapshot btrfs: change snapshot_lock to dynamic pointer btrfs: add snapshot_lock to new_fs_root_args fs/btrfs/ctree.h | 2 +- fs/btrfs/disk-io.c | 107 ++++++++++++++++++++++++++++++----------- fs/btrfs/disk-io.h | 12 ++++- fs/btrfs/file.c | 8 +-- fs/btrfs/inode.c | 12 ++--- fs/btrfs/ioctl.c | 38 +++++++-------- fs/btrfs/transaction.c | 2 +- fs/btrfs/transaction.h | 5 +- 8 files changed, 123 insertions(+), 63 deletions(-)