Message ID | 20240903071625.957275-3-luca.stefani.ge1@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: Don't block suspend during fstrim | expand |
You'll also need add a EXPORT_SYMBOL_GPL for blk_alloc_discard_bio.
在 2024/9/3 16:57, Christoph Hellwig 写道:
> You'll also need add a EXPORT_SYMBOL_GPL for blk_alloc_discard_bio.
Just curious, shouldn't blk_ioctl_discard() also check frozen status to
prevent block device trim from block suspension?
And if that's the case, would it make more sense to move the signal and
freezing checks into blk_alloc_discard_bio()?
Thanks,
Qu
On 03/09/24 09:16, Luca Stefani wrote: > Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device, > mostly empty although we will do the split according to our super block > locations, the last super block ends at 256G, we can submit a huge > discard for the range [256G, 8T), causing a super large delay. > > We now split the space left to discard based the block discard limit > in preparation of introduction of cancellation signals handling. > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180 > Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737 > Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com> > --- > fs/btrfs/extent-tree.c | 24 +++++++++++++++++++----- > 1 file changed, 19 insertions(+), 5 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index a5966324607d..9c1ddf13659e 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -1301,12 +1301,26 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len, > } > > if (bytes_left) { > - ret = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT, > - bytes_left >> SECTOR_SHIFT, > - GFP_NOFS); > - if (!ret) > - *discarded_bytes += bytes_left; I removed this by mistake, will be re-added. > + u64 bytes_to_discard; > + struct bio *bio = NULL; > + sector_t sector = start >> SECTOR_SHIFT; > + sector_t nr_sects = bytes_left >> SECTOR_SHIFT; > + > + while ((bio = blk_alloc_discard_bio(bdev, §or, &nr_sects, > + GFP_NOFS))) { > + ret = submit_bio_wait(bio); > + bio_put(bio); > + > + if (!ret) > + bytes_to_discard = bio->bi_iter.bi_size; > + else if (ret != -EOPNOTSUPP) > + return ret; I think I got the logic wrong, we probably want to `continue` in case ret is set, but it's not -EOPNOTSUPP, otherwise bytes_to_discard might be left uninitialized. bio->bi_iter.bi_size can be used directly for all those cases, so I'll remove bytes_to_discard as a whole. > + > + start += bytes_to_discard; > + bytes_left -= bytes_to_discard; > + } > } > + > return ret; > } > I'll fix those up for v4, but I'll wait for more comments before doing so.
On Tue, Sep 03, 2024 at 09:16:11AM +0200, Luca Stefani wrote: > Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device, > mostly empty although we will do the split according to our super block > locations, the last super block ends at 256G, we can submit a huge > discard for the range [256G, 8T), causing a super large delay. > > We now split the space left to discard based the block discard limit > in preparation of introduction of cancellation signals handling. > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180 > Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737 > Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com> > --- > fs/btrfs/extent-tree.c | 24 +++++++++++++++++++----- > 1 file changed, 19 insertions(+), 5 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index a5966324607d..9c1ddf13659e 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -1301,12 +1301,26 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len, > } > > if (bytes_left) { > - ret = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT, > - bytes_left >> SECTOR_SHIFT, > - GFP_NOFS); > - if (!ret) > - *discarded_bytes += bytes_left; > + u64 bytes_to_discard; > + struct bio *bio = NULL; > + sector_t sector = start >> SECTOR_SHIFT; > + sector_t nr_sects = bytes_left >> SECTOR_SHIFT; > + > + while ((bio = blk_alloc_discard_bio(bdev, §or, &nr_sects, > + GFP_NOFS))) { > + ret = submit_bio_wait(bio); > + bio_put(bio); > + > + if (!ret) > + bytes_to_discard = bio->bi_iter.bi_size; > + else if (ret != -EOPNOTSUPP) > + return ret; > + > + start += bytes_to_discard; > + bytes_left -= bytes_to_discard; > + } This is not what I anticipated, we only wanted to know the optimal request size but now it's reimplementing the bio submission and compared to blkdev_issue_discard() it lacks blk_start_plug/blk_finish_plug. As we won't get the bio_discard_limit() export for some reason I suggest to go back to setting the maximum chunk limit in our code and set it to something like 8G.
On Tue, Sep 03, 2024 at 07:13:29PM +0930, Qu Wenruo wrote: > > > 在 2024/9/3 16:57, Christoph Hellwig 写道: > > You'll also need add a EXPORT_SYMBOL_GPL for blk_alloc_discard_bio. > > Just curious, shouldn't blk_ioctl_discard() also check frozen status to > prevent block device trim from block suspension? Someone needs to explain what 'block suspension' is for that first. > And if that's the case, would it make more sense to move the signal and > freezing checks into blk_alloc_discard_bio()? No. Look at that the recent history of the dicard support. We tried that and it badly breaks callers not expecting the signal handling.
在 2024/9/4 13:43, Christoph Hellwig 写道: > On Tue, Sep 03, 2024 at 07:13:29PM +0930, Qu Wenruo wrote: >> >> >> 在 2024/9/3 16:57, Christoph Hellwig 写道: >>> You'll also need add a EXPORT_SYMBOL_GPL for blk_alloc_discard_bio. >> >> Just curious, shouldn't blk_ioctl_discard() also check frozen status to >> prevent block device trim from block suspension? > > Someone needs to explain what 'block suspension' is for that first. E.g. a long running full device trim preventing PM suspension/hibernation. The original report shows the fstrim of btrfs is preventing the system entering suspension/hibernation: PM: suspend entry (deep) Filesystems sync: 0.060 seconds Freezing user space processes Freezing user space processes failed after 20.007 seconds (1 tasks refusing to freeze, wq_busy=0): task:fstrim state:D stack:0 pid:15564 tgid:15564 ppid:1 flags:0x00004006 Call Trace: <TASK> __schedule+0x381/0x1540 schedule+0x24/0xb0 schedule_timeout+0x1ea/0x2a0 io_schedule_timeout+0x19/0x50 wait_for_completion_io+0x78/0x140 submit_bio_wait+0xaa/0xc0 blkdev_issue_discard+0x65/0xb0 btrfs_issue_discard+0xcf/0x160 [btrfs 7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82] btrfs_discard_extent+0x120/0x2a0 [btrfs 7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82] do_trimming+0xd4/0x220 [btrfs 7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82] trim_bitmaps+0x418/0x520 [btrfs 7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82] btrfs_trim_block_group+0xcb/0x130 [btrfs 7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82] btrfs_trim_fs+0x119/0x460 [btrfs 7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82] btrfs_ioctl_fitrim+0xfb/0x160 [btrfs 7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82] btrfs_ioctl+0x11cc/0x29f0 [btrfs 7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82] __x64_sys_ioctl+0x92/0xd0 do_syscall_64+0x5b/0x80 entry_SYSCALL_64_after_hwframe+0x7c/0xe6 RIP: 0033:0x7f5f3b529f9b RSP: 002b:00007fff279ebc20 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fff279ebd60 RCX: 00007f5f3b529f9b RDX: 00007fff279ebc90 RSI: 00000000c0185879 RDI: 0000000000000003 RBP: 000055748718b2d0 R08: 00005574871899e8 R09: 00007fff279eb010 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003 R13: 000055748718ac40 R14: 000055748718b290 R15: 000055748718b290 </TASK> OOM killer enabled. Restarting tasks ... done. random: crng reseeded on system resumption PM: suspend exit PM: suspend entry (s2idle) Filesystems sync: 0.047 seconds Considering blkdev_issue_discard() is already doing cond_sched() and btrfs is also doing cond_sched() for each block group, it looks like PM freezing needs its own freezing() checks, just like what ext4 is doing. > >> And if that's the case, would it make more sense to move the signal and >> freezing checks into blk_alloc_discard_bio()? > > No. Look at that the recent history of the dicard support. We tried > that and it badly breaks callers not expecting the signal handling. OK, got it, at least for btrfs we would go the blk_alloc_discard_bio() and do signal and freezing checks. Thanks, Qu
Hi Luca,
kernel test robot noticed the following build errors:
[auto build test ERROR on v6.11-rc6]
[also build test ERROR on linus/master]
[cannot apply to kdave/for-next next-20240906]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Luca-Stefani/btrfs-Always-update-fstrim_range-on-failure/20240903-191851
base: v6.11-rc6
patch link: https://lore.kernel.org/r/20240903071625.957275-3-luca.stefani.ge1%40gmail.com
patch subject: [PATCH v3 2/3] btrfs: Split remaining space to discard in chunks
config: x86_64-rhel-8.3-rust (https://download.01.org/0day-ci/archive/20240907/202409070311.SB9wmgSx-lkp@intel.com/config)
compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240907/202409070311.SB9wmgSx-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409070311.SB9wmgSx-lkp@intel.com/
All errors (new ones prefixed by >>, old ones prefixed by <<):
>> ERROR: modpost: "blk_alloc_discard_bio" [fs/btrfs/btrfs.ko] undefined!
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index a5966324607d..9c1ddf13659e 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -1301,12 +1301,26 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len, } if (bytes_left) { - ret = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT, - bytes_left >> SECTOR_SHIFT, - GFP_NOFS); - if (!ret) - *discarded_bytes += bytes_left; + u64 bytes_to_discard; + struct bio *bio = NULL; + sector_t sector = start >> SECTOR_SHIFT; + sector_t nr_sects = bytes_left >> SECTOR_SHIFT; + + while ((bio = blk_alloc_discard_bio(bdev, §or, &nr_sects, + GFP_NOFS))) { + ret = submit_bio_wait(bio); + bio_put(bio); + + if (!ret) + bytes_to_discard = bio->bi_iter.bi_size; + else if (ret != -EOPNOTSUPP) + return ret; + + start += bytes_to_discard; + bytes_left -= bytes_to_discard; + } } + return ret; }
Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device, mostly empty although we will do the split according to our super block locations, the last super block ends at 256G, we can submit a huge discard for the range [256G, 8T), causing a super large delay. We now split the space left to discard based the block discard limit in preparation of introduction of cancellation signals handling. Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180 Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737 Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com> --- fs/btrfs/extent-tree.c | 24 +++++++++++++++++++----- 1 file changed, 19 insertions(+), 5 deletions(-)