Message ID | c20733c28d02562ff09bfff6739b01b5f710bed7.1743004734.git.fdmanana@suse.com |
---|---|
State | New |
Series | btrfs: improvements to the release_folio callback |
On Thu, Mar 27, 2025 at 04:13:51PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> When the release_folio callback (from struct address_space_operations) is
> invoked we don't allow the folio to be released if its range is currently
> locked in the inode's io_tree, as it may indicate the folio may be needed
> by the task that locked the range.
>
> However if the range is locked because an ordered extent is finishing,
> then we can safely allow the folio to be released because ordered extent
> completion doesn't need to use the folio at all.
>
> When we are under memory pressure, the kernel starts writeback of dirty
> pages (folios) with the goal of releasing the pages from the page cache
> after writeback completes, however this often is not possible on btrfs
> because:
>
> * Once the writeback completes we queue the ordered extent completion;
>
> * Once the ordered extent completion starts, we lock the range in the
>   inode's io_tree (at btrfs_finish_one_ordered());
>
> * If the release_folio callback is called while the folio's range is
>   locked in the inode's io_tree, we don't allow the folio to be
>   released, so the kernel has to try to release memory elsewhere,
>   which may result in triggering more writeback or releasing other
>   pages from the page cache which may be more useful to have around
>   for applications.
>
> In contrast, when the release_folio callback is invoked after writeback
> finishes and before ordered extent completion starts or locks the range,
> we allow the folio to be released, as well as when the release_folio
> callback is invoked after ordered extent completion unlocks the range.
>
> Improve on this by detecting if the range is locked for ordered extent
> completion and if it is, allow the folio to be released. This detection
> is achieved by adding a new extent flag in the io_tree that is set when
> the range is locked during ordered extent completion.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/extent-io-tree.c | 22 +++++++++++++++++
>  fs/btrfs/extent-io-tree.h |  6 +++++
>  fs/btrfs/extent_io.c      | 52 +++++++++++++++++++++------------------
>  fs/btrfs/inode.c          |  6 +++--
>  4 files changed, 60 insertions(+), 26 deletions(-)
>
> diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
> index 13de6af279e5..14510a71a8fd 100644
> --- a/fs/btrfs/extent-io-tree.c
> +++ b/fs/btrfs/extent-io-tree.c
> @@ -1752,6 +1752,28 @@ bool test_range_bit_exists(struct extent_io_tree *tree, u64 start, u64 end, u32
>  	return bitset;
>  }
>
> +void get_range_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 *bits)
> +{
> +	struct extent_state *state;
> +
> +	*bits = 0;
> +
> +	spin_lock(&tree->lock);
> +	state = tree_search(tree, start);
> +	while (state) {
> +		if (state->start > end)
> +			break;
> +
> +		*bits |= state->state;
> +
> +		if (state->end >= end)
> +			break;
> +
> +		state = next_state(state);
> +	}
> +	spin_unlock(&tree->lock);
> +}
> +
>  /*
>   * Check if the whole range [@start,@end) contains the single @bit set.
>   */
> diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
> index 6ffef1cd37c1..e49f24151167 100644
> --- a/fs/btrfs/extent-io-tree.h
> +++ b/fs/btrfs/extent-io-tree.h
> @@ -38,6 +38,11 @@ enum {
>  	 * that is left for the ordered extent completion.
>  	 */
>  	ENUM_BIT(EXTENT_DELALLOC_NEW),
> +	/*
> +	 * Mark that a range is being locked for finishing an ordered extent.
> +	 * Used together with EXTENT_LOCKED.
> +	 */
> +	ENUM_BIT(EXTENT_FINISHING_ORDERED),
>  	/*
>  	 * When an ordered extent successfully completes for a region marked as
>  	 * a new delalloc range, use this flag when clearing a new delalloc
> @@ -166,6 +171,7 @@ void free_extent_state(struct extent_state *state);
>  bool test_range_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bit,
>  		    struct extent_state *cached_state);
>  bool test_range_bit_exists(struct extent_io_tree *tree, u64 start, u64 end, u32 bit);
> +void get_range_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 *bits);
>  int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
>  			     u32 bits, struct extent_changeset *changeset);
>  int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index a11b22fcd154..6b9a80f9e0f5 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2627,33 +2627,37 @@ static bool try_release_extent_state(struct extent_io_tree *tree,
>  {
>  	u64 start = folio_pos(folio);
>  	u64 end = start + folio_size(folio) - 1;
> -	bool ret;
> +	u32 range_bits;
> +	u32 clear_bits;
> +	int ret;
>
> -	if (test_range_bit_exists(tree, start, end, EXTENT_LOCKED)) {
> -		ret = false;
> -	} else {
> -		u32 clear_bits = ~(EXTENT_LOCKED | EXTENT_NODATASUM |
> -				   EXTENT_DELALLOC_NEW | EXTENT_CTLBITS |
> -				   EXTENT_QGROUP_RESERVED);
> -		int ret2;
> +	get_range_bits(tree, start, end, &range_bits);

There's a difference in how much of the tree is traversed:
test_range_bit_exists() stops on the first occurrence of EXTENT_LOCKED (a
single bit), while get_range_bits() unconditionally explores the whole tree.

>
> -		/*
> -		 * At this point we can safely clear everything except the
> -		 * locked bit, the nodatasum bit and the delalloc new bit.
> -		 * The delalloc new bit will be cleared by ordered extent
> -		 * completion.
> -		 */
> -		ret2 = __clear_extent_bit(tree, start, end, clear_bits, NULL, NULL);
> +	/*
> +	 * We can release the folio if it's locked only for ordered extent
> +	 * completion, since that doesn't require using the folio.
> +	 */
> +	if ((range_bits & EXTENT_LOCKED) &&
> +	    !(range_bits & EXTENT_FINISHING_ORDERED))

Here we need to know that LOCKED exists and FINISHING_ORDERED does not
exist in the range. This can be proven when the whole tree is traversed,
but could in some cases be reduced to

if (test_range_bit_exists(..., LOCKED) &&
    !test_range_bit_exists(..., FINISHING_ORDERED))

where in some percent of cases the whole tree won't be traversed (and
the lock held for a shorter time). This depends at runtime on what
combinations of the locks exist; it's possible that in the average case
the whole tree would be traversed anyway, and get_range_bits() is OK.
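For illustration only, a minimal sketch of the early-exit variant described above, assuming the existing test_range_bit_exists() helper and the new EXTENT_FINISHING_ORDERED bit; this is not code from the thread, and whether it actually saves traversals depends on which bit combinations occur at runtime, as noted above:

```c
/*
 * Hypothetical alternative to calling get_range_bits() in
 * try_release_extent_state(): bail out on the first EXTENT_LOCKED record
 * and only then scan for EXTENT_FINISHING_ORDERED. Note that each call
 * takes and drops tree->lock, so the range may be walked twice in the
 * worst case.
 */
static bool range_locked_for_other_reasons(struct extent_io_tree *tree,
					   u64 start, u64 end)
{
	return test_range_bit_exists(tree, start, end, EXTENT_LOCKED) &&
	       !test_range_bit_exists(tree, start, end, EXTENT_FINISHING_ORDERED);
}
```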
On Mon, Mar 31, 2025 at 11:39 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Thu, Mar 27, 2025 at 04:13:51PM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When the release_folio callback (from struct address_space_operations) is
> > invoked we don't allow the folio to be released if its range is currently
> > locked in the inode's io_tree, as it may indicate the folio may be needed
> > by the task that locked the range.
> >
> > However if the range is locked because an ordered extent is finishing,
> > then we can safely allow the folio to be released because ordered extent
> > completion doesn't need to use the folio at all.
> >
> > When we are under memory pressure, the kernel starts writeback of dirty
> > pages (folios) with the goal of releasing the pages from the page cache
> > after writeback completes, however this often is not possible on btrfs
> > because:
> >
> > * Once the writeback completes we queue the ordered extent completion;
> >
> > * Once the ordered extent completion starts, we lock the range in the
> >   inode's io_tree (at btrfs_finish_one_ordered());
> >
> > * If the release_folio callback is called while the folio's range is
> >   locked in the inode's io_tree, we don't allow the folio to be
> >   released, so the kernel has to try to release memory elsewhere,
> >   which may result in triggering more writeback or releasing other
> >   pages from the page cache which may be more useful to have around
> >   for applications.
> >
> > In contrast, when the release_folio callback is invoked after writeback
> > finishes and before ordered extent completion starts or locks the range,
> > we allow the folio to be released, as well as when the release_folio
> > callback is invoked after ordered extent completion unlocks the range.
> >
> > Improve on this by detecting if the range is locked for ordered extent
> > completion and if it is, allow the folio to be released. This detection
> > is achieved by adding a new extent flag in the io_tree that is set when
> > the range is locked during ordered extent completion.
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/extent-io-tree.c | 22 +++++++++++++++++
> >  fs/btrfs/extent-io-tree.h |  6 +++++
> >  fs/btrfs/extent_io.c      | 52 +++++++++++++++++++++------------------
> >  fs/btrfs/inode.c          |  6 +++--
> >  4 files changed, 60 insertions(+), 26 deletions(-)
> >
> > diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
> > index 13de6af279e5..14510a71a8fd 100644
> > --- a/fs/btrfs/extent-io-tree.c
> > +++ b/fs/btrfs/extent-io-tree.c
> > @@ -1752,6 +1752,28 @@ bool test_range_bit_exists(struct extent_io_tree *tree, u64 start, u64 end, u32
> >  	return bitset;
> >  }
> >
> > +void get_range_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 *bits)
> > +{
> > +	struct extent_state *state;
> > +
> > +	*bits = 0;
> > +
> > +	spin_lock(&tree->lock);
> > +	state = tree_search(tree, start);
> > +	while (state) {
> > +		if (state->start > end)
> > +			break;
> > +
> > +		*bits |= state->state;
> > +
> > +		if (state->end >= end)
> > +			break;
> > +
> > +		state = next_state(state);
> > +	}
> > +	spin_unlock(&tree->lock);
> > +}
> > +
> >  /*
> >   * Check if the whole range [@start,@end) contains the single @bit set.
> >   */
> > diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
> > index 6ffef1cd37c1..e49f24151167 100644
> > --- a/fs/btrfs/extent-io-tree.h
> > +++ b/fs/btrfs/extent-io-tree.h
> > @@ -38,6 +38,11 @@ enum {
> >  	 * that is left for the ordered extent completion.
> >  	 */
> >  	ENUM_BIT(EXTENT_DELALLOC_NEW),
> > +	/*
> > +	 * Mark that a range is being locked for finishing an ordered extent.
> > +	 * Used together with EXTENT_LOCKED.
> > +	 */
> > +	ENUM_BIT(EXTENT_FINISHING_ORDERED),
> >  	/*
> >  	 * When an ordered extent successfully completes for a region marked as
> >  	 * a new delalloc range, use this flag when clearing a new delalloc
> > @@ -166,6 +171,7 @@ void free_extent_state(struct extent_state *state);
> >  bool test_range_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bit,
> >  		    struct extent_state *cached_state);
> >  bool test_range_bit_exists(struct extent_io_tree *tree, u64 start, u64 end, u32 bit);
> > +void get_range_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 *bits);
> >  int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
> >  			     u32 bits, struct extent_changeset *changeset);
> >  int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index a11b22fcd154..6b9a80f9e0f5 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -2627,33 +2627,37 @@ static bool try_release_extent_state(struct extent_io_tree *tree,
> >  {
> >  	u64 start = folio_pos(folio);
> >  	u64 end = start + folio_size(folio) - 1;
> > -	bool ret;
> > +	u32 range_bits;
> > +	u32 clear_bits;
> > +	int ret;
> >
> > -	if (test_range_bit_exists(tree, start, end, EXTENT_LOCKED)) {
> > -		ret = false;
> > -	} else {
> > -		u32 clear_bits = ~(EXTENT_LOCKED | EXTENT_NODATASUM |
> > -				   EXTENT_DELALLOC_NEW | EXTENT_CTLBITS |
> > -				   EXTENT_QGROUP_RESERVED);
> > -		int ret2;
> > +	get_range_bits(tree, start, end, &range_bits);
>
> There's a difference in how much of the tree is traversed:
> test_range_bit_exists() stops on the first occurrence of EXTENT_LOCKED (a
> single bit), while get_range_bits() unconditionally explores the whole tree.

The whole tree? The folio's range, unless there are no records in the
tree outside the folio's range, in which case it is a very small tree.

But it's not a problem for several reasons:

1) When a range is not under IO or locked, there's typically no state
   record in the tree;

2) If a range is under delalloc or writeback, we don't even get to this
   function - we exit early in btrfs_release_folio() if the folio is
   dirty or under writeback;

3) The IO is folio size based (whether single page or multi page in the
   near future), so it's unlikely to find more than one state record for
   the range - correct me if you know of cases where we'll get more than
   one;

4) For bit ranges other than EXTENT_LOCKED, extent state records are
   merged, so even for very uncommon scenarios it's unlikely to find more
   than 1 state record for the folio's range.

So traversing the whole range is far from being a bottleneck or causing
more lock contention, even if there are multiple state records - we
already had to traverse the whole range with test_range_bit_exists()
when the folio's range is not locked.

> >
> > -		/*
> > -		 * At this point we can safely clear everything except the
> > -		 * locked bit, the nodatasum bit and the delalloc new bit.
> > -		 * The delalloc new bit will be cleared by ordered extent
> > -		 * completion.
> > -		 */
> > -		ret2 = __clear_extent_bit(tree, start, end, clear_bits, NULL, NULL);
> > +	/*
> > +	 * We can release the folio if it's locked only for ordered extent
> > +	 * completion, since that doesn't require using the folio.
> > +	 */
> > +	if ((range_bits & EXTENT_LOCKED) &&
> > +	    !(range_bits & EXTENT_FINISHING_ORDERED))
>
> Here we need to know that LOCKED exists and FINISHING_ORDERED does not
> exist in the range. This can be proven when the whole tree is traversed,

I still don't get why you say the whole tree.

> but could in some cases be reduced to
>
> if (test_range_bit_exists(..., LOCKED) &&
>     !test_range_bit_exists(..., FINISHING_ORDERED))
>
> where in some percent of cases the whole tree won't be traversed (and
> the lock held for a shorter time). This depends at runtime on what
> combinations of the locks exist; it's possible that in the average case
> the whole tree would be traversed anyway, and get_range_bits() is OK.

I don't see what specific cases you are talking about, correct me if any
of the things I said above is wrong and we can have a lot of extent state
records in the folio's range. But even if we do, it's the same behaviour
as for test_range_bit_exists() when the range is not locked.

Thanks.
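To summarize the predicate the patch ends up with, here is a small sketch for illustration only, reusing the names from the patch rather than quoting it: release is refused only when the range is locked and the lock does not come from ordered extent completion.

```c
/*
 * Illustration of the release decision in the patched
 * try_release_extent_state(), where range_bits is the OR of the bits of
 * all extent states overlapping the folio's range, as returned by
 * get_range_bits():
 *
 *   - range not locked                          -> may release
 *   - locked only for ordered extent completion -> may release
 *   - locked for any other reason               -> keep the folio
 */
static bool may_release_range(u32 range_bits)
{
	return !(range_bits & EXTENT_LOCKED) ||
	       (range_bits & EXTENT_FINISHING_ORDERED);
}
```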
diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index 13de6af279e5..14510a71a8fd 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -1752,6 +1752,28 @@ bool test_range_bit_exists(struct extent_io_tree *tree, u64 start, u64 end, u32
 	return bitset;
 }
 
+void get_range_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 *bits)
+{
+	struct extent_state *state;
+
+	*bits = 0;
+
+	spin_lock(&tree->lock);
+	state = tree_search(tree, start);
+	while (state) {
+		if (state->start > end)
+			break;
+
+		*bits |= state->state;
+
+		if (state->end >= end)
+			break;
+
+		state = next_state(state);
+	}
+	spin_unlock(&tree->lock);
+}
+
 /*
  * Check if the whole range [@start,@end) contains the single @bit set.
  */
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index 6ffef1cd37c1..e49f24151167 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -38,6 +38,11 @@ enum {
 	 * that is left for the ordered extent completion.
 	 */
 	ENUM_BIT(EXTENT_DELALLOC_NEW),
+	/*
+	 * Mark that a range is being locked for finishing an ordered extent.
+	 * Used together with EXTENT_LOCKED.
+	 */
+	ENUM_BIT(EXTENT_FINISHING_ORDERED),
 	/*
 	 * When an ordered extent successfully completes for a region marked as
 	 * a new delalloc range, use this flag when clearing a new delalloc
@@ -166,6 +171,7 @@ void free_extent_state(struct extent_state *state);
 bool test_range_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bit,
 		    struct extent_state *cached_state);
 bool test_range_bit_exists(struct extent_io_tree *tree, u64 start, u64 end, u32 bit);
+void get_range_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 *bits);
 int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
 			     u32 bits, struct extent_changeset *changeset);
 int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a11b22fcd154..6b9a80f9e0f5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2627,33 +2627,37 @@ static bool try_release_extent_state(struct extent_io_tree *tree,
 {
 	u64 start = folio_pos(folio);
 	u64 end = start + folio_size(folio) - 1;
-	bool ret;
+	u32 range_bits;
+	u32 clear_bits;
+	int ret;
 
-	if (test_range_bit_exists(tree, start, end, EXTENT_LOCKED)) {
-		ret = false;
-	} else {
-		u32 clear_bits = ~(EXTENT_LOCKED | EXTENT_NODATASUM |
-				   EXTENT_DELALLOC_NEW | EXTENT_CTLBITS |
-				   EXTENT_QGROUP_RESERVED);
-		int ret2;
+	get_range_bits(tree, start, end, &range_bits);
 
-		/*
-		 * At this point we can safely clear everything except the
-		 * locked bit, the nodatasum bit and the delalloc new bit.
-		 * The delalloc new bit will be cleared by ordered extent
-		 * completion.
-		 */
-		ret2 = __clear_extent_bit(tree, start, end, clear_bits, NULL, NULL);
+	/*
+	 * We can release the folio if it's locked only for ordered extent
+	 * completion, since that doesn't require using the folio.
+	 */
+	if ((range_bits & EXTENT_LOCKED) &&
+	    !(range_bits & EXTENT_FINISHING_ORDERED))
+		return false;
 
-		/* if clear_extent_bit failed for enomem reasons,
-		 * we can't allow the release to continue.
-		 */
-		if (ret2 < 0)
-			ret = false;
-		else
-			ret = true;
-	}
-	return ret;
+	clear_bits = ~(EXTENT_LOCKED | EXTENT_NODATASUM | EXTENT_DELALLOC_NEW |
+		       EXTENT_CTLBITS | EXTENT_QGROUP_RESERVED |
+		       EXTENT_FINISHING_ORDERED);
+	/*
+	 * At this point we can safely clear everything except the locked,
+	 * nodatasum, delalloc new and finishing ordered bits. The delalloc new
+	 * bit will be cleared by ordered extent completion.
+	 */
+	ret = __clear_extent_bit(tree, start, end, clear_bits, NULL, NULL);
+	/*
+	 * If clear_extent_bit failed for enomem reasons, we can't allow the
+	 * release to continue.
+	 */
+	if (ret < 0)
+		return false;
+
+	return true;
 }
 
 /*
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e283627c087d..469b3fd64f17 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3132,8 +3132,10 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
 	 * depending on their current state).
 	 */
 	if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags)) {
-		clear_bits |= EXTENT_LOCKED;
-		lock_extent(io_tree, start, end, &cached_state);
+		clear_bits |= EXTENT_LOCKED | EXTENT_FINISHING_ORDERED;
+		__lock_extent(io_tree, start, end,
+			      EXTENT_LOCKED | EXTENT_FINISHING_ORDERED,
+			      &cached_state);
 	}
 
 	if (freespace_inode)
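As an aside, the bit accumulation that get_range_bits() performs over the extent states overlapping [start, end] can be modelled outside the kernel. The following is a self-contained sketch with hypothetical types and bit values (not btrfs code), assuming sorted, non-overlapping records like the io_tree entries:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for struct extent_state: a [start, end] range plus a bitmask. */
struct state {
	uint64_t start;
	uint64_t end;
	uint32_t bits;
};

/* Illustrative bit values only; the real ones come from the ENUM_BIT() list. */
#define EXTENT_LOCKED			(1U << 0)
#define EXTENT_FINISHING_ORDERED	(1U << 1)

/*
 * Model of get_range_bits(): OR together the bits of every record that
 * overlaps [start, end], stopping once a record reaches the end of the range.
 */
static uint32_t range_bits(const struct state *states, size_t n,
			   uint64_t start, uint64_t end)
{
	uint32_t bits = 0;

	for (size_t i = 0; i < n; i++) {
		if (states[i].end < start)
			continue;	/* record ends before the range */
		if (states[i].start > end)
			break;		/* record starts past the range */
		bits |= states[i].bits;
		if (states[i].end >= end)
			break;		/* record covers the end of the range */
	}
	return bits;
}

int main(void)
{
	/* One record: a range locked only for ordered extent completion. */
	struct state states[] = {
		{ .start = 0, .end = 4095,
		  .bits = EXTENT_LOCKED | EXTENT_FINISHING_ORDERED },
	};
	uint32_t bits = range_bits(states, 1, 0, 4095);

	/* Locked, but only for ordered completion: the folio may be released. */
	printf("release allowed: %d\n",
	       !(bits & EXTENT_LOCKED) || (bits & EXTENT_FINISHING_ORDERED));
	return 0;
}
```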