diff mbox series

[v3,2/3] btrfs: Split remaining space to discard in chunks

Message ID 20240903071625.957275-3-luca.stefani.ge1@gmail.com (mailing list archive)
State New, archived
Headers show
Series btrfs: Don't block suspend during fstrim | expand

Commit Message

Luca Stefani Sept. 3, 2024, 7:16 a.m. UTC
Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device,
mostly empty although we will do the split according to our super block
locations, the last super block ends at 256G, we can submit a huge
discard for the range [256G, 8T), causing a super large delay.

We now split the space left to discard based the block discard limit
in preparation of introduction of cancellation signals handling.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
---
 fs/btrfs/extent-tree.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

Comments

Christoph Hellwig Sept. 3, 2024, 7:27 a.m. UTC | #1
You'll also need add a EXPORT_SYMBOL_GPL for blk_alloc_discard_bio.
Qu Wenruo Sept. 3, 2024, 9:43 a.m. UTC | #2
在 2024/9/3 16:57, Christoph Hellwig 写道:
> You'll also need add a EXPORT_SYMBOL_GPL for blk_alloc_discard_bio.

Just curious, shouldn't blk_ioctl_discard() also check frozen status to
prevent block device trim from block suspension?

And if that's the case, would it make more sense to move the signal and
freezing checks into blk_alloc_discard_bio()?

Thanks,
Qu
Luca Stefani Sept. 3, 2024, 10:39 a.m. UTC | #3
On 03/09/24 09:16, Luca Stefani wrote:
> Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device,
> mostly empty although we will do the split according to our super block
> locations, the last super block ends at 256G, we can submit a huge
> discard for the range [256G, 8T), causing a super large delay.
> 
> We now split the space left to discard based the block discard limit
> in preparation of introduction of cancellation signals handling.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
> Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
> Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
> ---
>   fs/btrfs/extent-tree.c | 24 +++++++++++++++++++-----
>   1 file changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a5966324607d..9c1ddf13659e 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1301,12 +1301,26 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
>   	}
>   
>   	if (bytes_left) {
> -		ret = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> -					   bytes_left >> SECTOR_SHIFT,
> -					   GFP_NOFS);
> -		if (!ret)
> -			*discarded_bytes += bytes_left;
I removed this by mistake, will be re-added.
> +		u64 bytes_to_discard;
> +		struct bio *bio = NULL;
> +		sector_t sector = start >> SECTOR_SHIFT;
> +		sector_t nr_sects = bytes_left >> SECTOR_SHIFT;
> +
> +		while ((bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
> +				GFP_NOFS))) {
> +			ret = submit_bio_wait(bio);
> +			bio_put(bio);
> +
> +			if (!ret)
> +				bytes_to_discard = bio->bi_iter.bi_size;
> +			else if (ret != -EOPNOTSUPP)
> +				return ret;
I think I got the logic wrong, we probably want to `continue` in case 
ret is set, but it's not -EOPNOTSUPP, otherwise bytes_to_discard might 
be left uninitialized.
bio->bi_iter.bi_size can be used directly for all those cases, so I'll 
remove bytes_to_discard as a whole.
> +
> +			start += bytes_to_discard;
> +			bytes_left -= bytes_to_discard;
> +		}
>   	}
> +
>   	return ret;
>   }
>   

I'll fix those up for v4, but I'll wait for more comments before doing so.
David Sterba Sept. 3, 2024, 7:39 p.m. UTC | #4
On Tue, Sep 03, 2024 at 09:16:11AM +0200, Luca Stefani wrote:
> Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device,
> mostly empty although we will do the split according to our super block
> locations, the last super block ends at 256G, we can submit a huge
> discard for the range [256G, 8T), causing a super large delay.
> 
> We now split the space left to discard based the block discard limit
> in preparation of introduction of cancellation signals handling.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
> Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
> Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
> ---
>  fs/btrfs/extent-tree.c | 24 +++++++++++++++++++-----
>  1 file changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a5966324607d..9c1ddf13659e 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1301,12 +1301,26 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
>  	}
>  
>  	if (bytes_left) {
> -		ret = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
> -					   bytes_left >> SECTOR_SHIFT,
> -					   GFP_NOFS);
> -		if (!ret)
> -			*discarded_bytes += bytes_left;
> +		u64 bytes_to_discard;
> +		struct bio *bio = NULL;
> +		sector_t sector = start >> SECTOR_SHIFT;
> +		sector_t nr_sects = bytes_left >> SECTOR_SHIFT;
> +
> +		while ((bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
> +				GFP_NOFS))) {
> +			ret = submit_bio_wait(bio);
> +			bio_put(bio);
> +
> +			if (!ret)
> +				bytes_to_discard = bio->bi_iter.bi_size;
> +			else if (ret != -EOPNOTSUPP)
> +				return ret;
> +
> +			start += bytes_to_discard;
> +			bytes_left -= bytes_to_discard;
> +		}

This is not what I anticipated, we only wanted to know the optimal
request size but now it's reimplementing the bio submission and compared
to blkdev_issue_discard() it lacks blk_start_plug/blk_finish_plug.

As we won't get the bio_discard_limit() export for some reason I suggest
to go back to setting the maximum chunk limit in our code and set it to
something like 8G.
Christoph Hellwig Sept. 4, 2024, 4:13 a.m. UTC | #5
On Tue, Sep 03, 2024 at 07:13:29PM +0930, Qu Wenruo wrote:
> 
> 
> 在 2024/9/3 16:57, Christoph Hellwig 写道:
> > You'll also need add a EXPORT_SYMBOL_GPL for blk_alloc_discard_bio.
> 
> Just curious, shouldn't blk_ioctl_discard() also check frozen status to
> prevent block device trim from block suspension?

Someone needs to explain what 'block suspension' is for that first.

> And if that's the case, would it make more sense to move the signal and
> freezing checks into blk_alloc_discard_bio()?

No.  Look at that the recent history of the dicard support.  We tried
that and it badly breaks callers not expecting the signal handling.
Qu Wenruo Sept. 4, 2024, 5:34 a.m. UTC | #6
在 2024/9/4 13:43, Christoph Hellwig 写道:
> On Tue, Sep 03, 2024 at 07:13:29PM +0930, Qu Wenruo wrote:
>>
>>
>> 在 2024/9/3 16:57, Christoph Hellwig 写道:
>>> You'll also need add a EXPORT_SYMBOL_GPL for blk_alloc_discard_bio.
>>
>> Just curious, shouldn't blk_ioctl_discard() also check frozen status to
>> prevent block device trim from block suspension?
> 
> Someone needs to explain what 'block suspension' is for that first.

E.g. a long running full device trim preventing PM suspension/hibernation.

The original report shows the fstrim of btrfs is preventing the system 
entering suspension/hibernation:

  PM: suspend entry (deep)
  Filesystems sync: 0.060 seconds
  Freezing user space processes
  Freezing user space processes failed after 20.007 seconds (1 tasks 
refusing to freeze, wq_busy=0):
  task:fstrim          state:D stack:0     pid:15564 tgid:15564 ppid:1 
    flags:0x00004006
  Call Trace:
   <TASK>
   __schedule+0x381/0x1540
   schedule+0x24/0xb0
   schedule_timeout+0x1ea/0x2a0
   io_schedule_timeout+0x19/0x50
   wait_for_completion_io+0x78/0x140
   submit_bio_wait+0xaa/0xc0
   blkdev_issue_discard+0x65/0xb0
   btrfs_issue_discard+0xcf/0x160 [btrfs 
7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
   btrfs_discard_extent+0x120/0x2a0 [btrfs 
7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
   do_trimming+0xd4/0x220 [btrfs 7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
   trim_bitmaps+0x418/0x520 [btrfs 7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
   btrfs_trim_block_group+0xcb/0x130 [btrfs 
7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
   btrfs_trim_fs+0x119/0x460 [btrfs 
7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
   btrfs_ioctl_fitrim+0xfb/0x160 [btrfs 
7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
   btrfs_ioctl+0x11cc/0x29f0 [btrfs 
7ab35b9b86062a46f6ff578bb32d55ecf8e6bf82]
   __x64_sys_ioctl+0x92/0xd0
   do_syscall_64+0x5b/0x80
   entry_SYSCALL_64_after_hwframe+0x7c/0xe6
  RIP: 0033:0x7f5f3b529f9b
  RSP: 002b:00007fff279ebc20 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 00007fff279ebd60 RCX: 00007f5f3b529f9b
  RDX: 00007fff279ebc90 RSI: 00000000c0185879 RDI: 0000000000000003
  RBP: 000055748718b2d0 R08: 00005574871899e8 R09: 00007fff279eb010
  R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
  R13: 000055748718ac40 R14: 000055748718b290 R15: 000055748718b290
   </TASK>
  OOM killer enabled.
  Restarting tasks ... done.
  random: crng reseeded on system resumption
  PM: suspend exit
  PM: suspend entry (s2idle)
  Filesystems sync: 0.047 seconds

Considering blkdev_issue_discard() is already doing cond_sched() and 
btrfs is also doing cond_sched() for each block group, it looks like PM 
freezing needs its own freezing() checks, just like what ext4 is doing.

> 
>> And if that's the case, would it make more sense to move the signal and
>> freezing checks into blk_alloc_discard_bio()?
> 
> No.  Look at that the recent history of the dicard support.  We tried
> that and it badly breaks callers not expecting the signal handling.

OK, got it, at least for btrfs we would go the blk_alloc_discard_bio() 
and do signal and freezing checks.

Thanks,
Qu
kernel test robot Sept. 6, 2024, 8:10 p.m. UTC | #7
Hi Luca,

kernel test robot noticed the following build errors:

[auto build test ERROR on v6.11-rc6]
[also build test ERROR on linus/master]
[cannot apply to kdave/for-next next-20240906]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Luca-Stefani/btrfs-Always-update-fstrim_range-on-failure/20240903-191851
base:   v6.11-rc6
patch link:    https://lore.kernel.org/r/20240903071625.957275-3-luca.stefani.ge1%40gmail.com
patch subject: [PATCH v3 2/3] btrfs: Split remaining space to discard in chunks
config: x86_64-rhel-8.3-rust (https://download.01.org/0day-ci/archive/20240907/202409070311.SB9wmgSx-lkp@intel.com/config)
compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240907/202409070311.SB9wmgSx-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409070311.SB9wmgSx-lkp@intel.com/

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "blk_alloc_discard_bio" [fs/btrfs/btrfs.ko] undefined!
diff mbox series

Patch

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a5966324607d..9c1ddf13659e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1301,12 +1301,26 @@  static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
 	}
 
 	if (bytes_left) {
-		ret = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT,
-					   bytes_left >> SECTOR_SHIFT,
-					   GFP_NOFS);
-		if (!ret)
-			*discarded_bytes += bytes_left;
+		u64 bytes_to_discard;
+		struct bio *bio = NULL;
+		sector_t sector = start >> SECTOR_SHIFT;
+		sector_t nr_sects = bytes_left >> SECTOR_SHIFT;
+
+		while ((bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
+				GFP_NOFS))) {
+			ret = submit_bio_wait(bio);
+			bio_put(bio);
+
+			if (!ret)
+				bytes_to_discard = bio->bi_iter.bi_size;
+			else if (ret != -EOPNOTSUPP)
+				return ret;
+
+			start += bytes_to_discard;
+			bytes_left -= bytes_to_discard;
+		}
 	}
+
 	return ret;
 }