| Message ID | 20240902114303.922472-1-luca.stefani.ge1@gmail.com (mailing list archive) |
| --- | --- |
| State | New, archived |
| Series | btrfs: Split remaining space to discard in chunks |
On Mon, Sep 02, 2024 at 01:43:00PM +0200, Luca Stefani wrote:
> Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device,
> mostly empty although we will do the split according to our super block
> locations, the last super block ends at 256G, we can submit a huge
> discard for the range [256G, 8T), causing a super large delay.

I'm not sure that this will be different from what we already do, or
have the large delays been observed in practice? The range passed to
blkdev_issue_discard() might be large, but internally it's still split
into smaller sizes depending on the queue limits, IOW the device.

The bio is allocated and limited by bio_discard_limit(bdev, *sector);
https://elixir.bootlin.com/linux/v6.10.7/source/block/blk-lib.c#L38

struct bio *blk_alloc_discard_bio(struct block_device *bdev,
		sector_t *sector, sector_t *nr_sects, gfp_t gfp_mask)
{
	sector_t bio_sects = min(*nr_sects, bio_discard_limit(bdev, *sector));
	struct bio *bio;

	if (!bio_sects)
		return NULL;

	bio = bio_alloc(bdev, 0, REQ_OP_DISCARD, gfp_mask);
	...

Then used in __blkdev_issue_discard()
https://elixir.bootlin.com/linux/v6.10.7/source/block/blk-lib.c#L63

int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop)
{
	struct bio *bio;

	while ((bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
			gfp_mask)))
		*biop = bio_chain_and_submit(*biop, bio);
	return 0;
}

This is basically just a loop, chopping the input range as needed. The
btrfs code does effectively the same; there's only the superblock,
progress accounting and error handling done on top.

As the maximum size of a single discard request depends on the device,
we don't need to artificially limit it, because that would require more
IO requests and can be slower.
On 02/09/24 22:11, David Sterba wrote:
> On Mon, Sep 02, 2024 at 01:43:00PM +0200, Luca Stefani wrote:
>> Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device,
>> mostly empty although we will do the split according to our super block
>> locations, the last super block ends at 256G, we can submit a huge
>> discard for the range [256G, 8T), causing a super large delay.
>
> I'm not sure that this will be different than what we already do, or
> have the large delays been observed in practice? The range passed to
> blkdev_issue_discard() might be large but internally it's still split to
> smaller sizes depending on the queue limits, IOW the device.
>
> Bio is allocated and limited by bio_discard_limit(bdev, *sector);
> https://elixir.bootlin.com/linux/v6.10.7/source/block/blk-lib.c#L38
>
> struct bio *blk_alloc_discard_bio(struct block_device *bdev,
> 		sector_t *sector, sector_t *nr_sects, gfp_t gfp_mask)
> {
> 	sector_t bio_sects = min(*nr_sects, bio_discard_limit(bdev, *sector));
> 	struct bio *bio;
>
> 	if (!bio_sects)
> 		return NULL;
>
> 	bio = bio_alloc(bdev, 0, REQ_OP_DISCARD, gfp_mask);
> 	...
>
> Then used in __blkdev_issue_discard()
> https://elixir.bootlin.com/linux/v6.10.7/source/block/blk-lib.c#L63
>
> int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop)
> {
> 	struct bio *bio;
>
> 	while ((bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
> 			gfp_mask)))
> 		*biop = bio_chain_and_submit(*biop, bio);
> 	return 0;
> }
>
> This is basically just a loop, chopping the input range as needed. The
> btrfs code does effectively the same, there's only the superblock,
> progress accounting and error handling done.
>
> As the maximum size of a single discard request depends on a device we
> don't need to artificially limit it because this would require more IO
> requests and can be slower.
Thanks for taking a look. This change was prompted after I'd been seeing
issues due to the discard kthread blocking a userspace process, causing
the device not to suspend.

https://lore.kernel.org/lkml/20240822164908.4957-1-luca.stefani.ge1@gmail.com/
is the proposed solution, but Qu mentioned that there is another place
where it could happen that I didn't cover, and I think what I change
here (unless it's the wrong place) allows me to add similar
`btrfs_trim_interrupted` checks to stop.

Please let me know if that makes sense to you; if so, I guess it would
make sense to send the 2 patches together?

Luca.
On Mon, Sep 02, 2024 at 10:17:37PM +0200, Luca Stefani wrote:
> > 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop)
> > {
> > 	struct bio *bio;
> >
> > 	while ((bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
> > 			gfp_mask)))
> > 		*biop = bio_chain_and_submit(*biop, bio);
> > 	return 0;
> > }
> >
> > This is basically just a loop, chopping the input range as needed. The
> > btrfs code does effectively the same, there's only the superblock,
> > progress accounting and error handling done.
> >
> > As the maximum size of a single discard request depends on a device we
> > don't need to artificially limit it because this would require more IO
> > requests and can be slower.
>
> Thanks for taking a look, this change was prompted after I've been
> seeing issues due to the discard kthread blocking an userspace process
> causing device not to suspend.
> https://lore.kernel.org/lkml/20240822164908.4957-1-luca.stefani.ge1@gmail.com/
> is the proposed solution, but Qu mentioned that there is another place
> where it could happen that I didn't cover, and I think what I change
> here (unless it's the wrong place) allows me to add the similar
> `btrfs_trim_interrupted` checks to stop.
>
> Please let me know if that makes sense to you, if that's the case I
> guess it would make sense to send the 2 patches together?

Yeah, for inserting the cancellation points it would make sense to do
the chunking. I'd suggest doing the same as blk_alloc_discard_bio() and
using the block device discard request limit, not a fixed constant.
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index feec49e6f9c8..6ad92876bca0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1239,7 +1239,7 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
 			       u64 *discarded_bytes)
 {
 	int j, ret = 0;
-	u64 bytes_left, end;
+	u64 bytes_left, bytes_to_discard, end;
 	u64 aligned_start = ALIGN(start, 1 << SECTOR_SHIFT);
 
 	/* Adjust the range to be aligned to 512B sectors if necessary. */
@@ -1300,13 +1300,25 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
 		bytes_left = end - start;
 	}
 
-	if (bytes_left) {
+	while (bytes_left) {
+		if (bytes_left > BTRFS_MAX_DATA_CHUNK_SIZE)
+			bytes_to_discard = BTRFS_MAX_DATA_CHUNK_SIZE;
+		else
+			bytes_to_discard = bytes_left;
+
 		ret = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
-					   bytes_left >> SECTOR_SHIFT,
+					   bytes_to_discard >> SECTOR_SHIFT,
 					   GFP_NOFS);
+
 		if (!ret)
-			*discarded_bytes += bytes_left;
+			*discarded_bytes += bytes_to_discard;
+		else if (ret != -EOPNOTSUPP)
+			return ret;
+
+		start += bytes_to_discard;
+		bytes_left -= bytes_to_discard;
 	}
+
 	return ret;
 }
Per Qu Wenruo: in case we have a very large disk, e.g. an 8TiB device
that is mostly empty, although we do the split according to our super
block locations, the last super block ends at 256G, so we can submit a
huge discard for the range [256G, 8T), causing a very large delay.

We now split the space left to discard based on the max data chunk
size (10G) to solve the problem.

Reported-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Closes: https://lore.kernel.org/lkml/2e15214b-7e95-4e64-a899-725de12c9037@gmail.com/T/#mdfef1d8b36334a15c54cd009f6aadf49e260e105
Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
---
 fs/btrfs/extent-tree.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)