dm: retain stacked max_sectors when setting queue_limits

Message ID: 20240522025117.75568-1-snitzer@kernel.org
State: New, archived
Series: dm: retain stacked max_sectors when setting queue_limits

Commit Message

Mike Snitzer May 22, 2024, 2:51 a.m. UTC
Otherwise, blk_validate_limits() will throw-away the max_sectors that
was stacked from underlying device(s). In doing so it can set a
max_sectors limit that violates underlying device limits.
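
(For context, the recalculation inside blk_validate_limits() looks
roughly like this simplified sketch, not verbatim kernel code; note
that a previously stacked max_sectors is simply not an input:)

	max_hw = min_not_zero(lim->max_hw_sectors, lim->max_dev_sectors);
	if (lim->max_user_sectors)
		lim->max_sectors = min(max_hw, lim->max_user_sectors);
	else
		lim->max_sectors = min(max_hw, BLK_DEF_MAX_SECTORS_CAP);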

This caused dm-multipath IO failures like the following because the
underlying devices' max_sectors were stacked up to be 1024, yet
blk_validate_limits() defaulted max_sectors to BLK_DEF_MAX_SECTORS_CAP
(2560):

[ 1214.673233] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[ 1214.673267] device-mapper: multipath: 254:3: Failing path 8:32.
[ 1214.675196] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[ 1214.675224] device-mapper: multipath: 254:3: Failing path 8:16.
[ 1214.675309] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[ 1214.675338] device-mapper: multipath: 254:3: Failing path 8:48.
[ 1214.675413] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[ 1214.675441] device-mapper: multipath: 254:3: Failing path 8:64.

The initial bug report included:

[   13.822701] blk_insert_cloned_request: over max size limit. (248 > 128)
[   13.829351] device-mapper: multipath: 253:3: Failing path 8:32.
[   13.835307] blk_insert_cloned_request: over max size limit. (248 > 128)
[   13.841928] device-mapper: multipath: 253:3: Failing path 65:16.
[   13.844532] blk_insert_cloned_request: over max size limit. (248 > 128)
[   13.854363] blk_insert_cloned_request: over max size limit. (248 > 128)
[   13.854580] device-mapper: multipath: 253:4: Failing path 8:48.
[   13.861166] device-mapper: multipath: 253:3: Failing path 8:192.

Reported-by: Marco Patalano <mpatalan@redhat.com>
Reported-by: Ewan Milne <emilne@redhat.com>
Fixes: 1c0e720228ad ("dm: use queue_limits_set")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 drivers/md/dm-table.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Christoph Hellwig May 22, 2024, 2:24 p.m. UTC | #1
On Tue, May 21, 2024 at 10:51:17PM -0400, Mike Snitzer wrote:
> Otherwise, blk_validate_limits() will throw-away the max_sectors that
> was stacked from underlying device(s). In doing so it can set a
> max_sectors limit that violates underlying device limits.

Hmm, yes it sort of is "throwing the limit away", but it really
recalculates it from max_hw_sectors, max_dev_sectors and max_user_sectors.

> 
> This caused dm-multipath IO failures like the following because the
> underlying devices' max_sectors were stacked up to be 1024, yet
> blk_validate_limits() defaulted max_sectors to BLK_DEF_MAX_SECTORS_CAP
> (2560):

I suspect the problem is that SCSI messes directly with max_sectors
and ignores max_user_sectors (and really shouldn't touch either, but that's
a separate discussion).  Can you try the patch below and maybe also provide
the sysfs output for max_sectors_kb and max_hw_sectors_kb for all involved
devices?

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 332eb9dac22d91..f6c822c9cbd2d3 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3700,8 +3700,10 @@ static int sd_revalidate_disk(struct gendisk *disk)
 	 */
 	if (sdkp->first_scan ||
 	    q->limits.max_sectors > q->limits.max_dev_sectors ||
-	    q->limits.max_sectors > q->limits.max_hw_sectors)
+	    q->limits.max_sectors > q->limits.max_hw_sectors) {
 		q->limits.max_sectors = rw_max;
+		q->limits.max_user_sectors = rw_max;
+	}
 
 	sdkp->first_scan = 0;
Mike Snitzer May 22, 2024, 4:48 p.m. UTC | #2
On Wed, May 22, 2024 at 04:24:58PM +0200, Christoph Hellwig wrote:
> On Tue, May 21, 2024 at 10:51:17PM -0400, Mike Snitzer wrote:
> > Otherwise, blk_validate_limits() will throw-away the max_sectors that
> > was stacked from underlying device(s). In doing so it can set a
> > max_sectors limit that violates underlying device limits.
> 
> Hmm, yes it sort of is "throwing the limit away", but it really
> recalculates it from max_hw_sectors, max_dev_sectors and max_user_sectors.

Yes, but it needs to do that recalculation at each level of a stacked
device.  And then we need to combine them via blk_stack_limits() -- as
is done with the various limits stacking loops in
drivers/md/dm-table.c:dm_calculate_queue_limits
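
(For reference, the stacking loop Mike refers to has roughly this
shape; a condensed sketch of dm_calculate_queue_limits() with error
handling and zone logic omitted:)

	blk_set_stacking_limits(limits);
	for (unsigned int i = 0; i < t->num_targets; i++) {
		struct dm_target *ti = dm_table_get_target(t, i);
		struct queue_limits ti_limits;

		blk_set_stacking_limits(&ti_limits);
		/* combine the limits of every device this target maps to */
		ti->type->iterate_devices(ti, dm_set_device_limits,
					  &ti_limits);
		/* merge this target's limits into the table-wide limits */
		blk_stack_limits(limits, &ti_limits, 0);
	}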

> > This caused dm-multipath IO failures like the following because the
> > underlying devices' max_sectors were stacked up to be 1024, yet
> > blk_validate_limits() defaulted max_sectors to BLK_DEF_MAX_SECTORS_CAP
> > (2560):
> 
> I suspect the problem is that SCSI messes directly with max_sectors
> and ignores max_user_sectors (and really shouldn't touch either, but that's
> a separate discussion).  Can you try the patch below and maybe also provide
> the sysfs output for max_sectors_kb and max_hw_sectors_kb for all involved
> devices?

FYI, you can easily reproduce with:

git clone https://github.com/snitm/mptest.git
cd mptest
<edit so it uses: export MULTIPATH_BACKEND_MODULE="scsidebug">
./runtest ./tests/test_00_no_failure

Also, best to change this line:
./lib/mpath_generic:        local _feature="4 queue_if_no_path retain_attached_hw_handler queue_mode $MULTIPATH_QUEUE_MODE"
to:
./lib/mpath_generic:        local _feature="3 retain_attached_hw_handler queue_mode $MULTIPATH_QUEUE_MODE"
Otherwise the test will hang due to queue_if_no_path.

all underlying scsi-debug scsi devices:

./max_hw_sectors_kb:2147483647
./max_sectors_kb:512

multipath device:

before my fix:
./max_hw_sectors_kb:2147483647
./max_sectors_kb:1280

after my fix:
./max_hw_sectors_kb:2147483647
./max_sectors_kb:512
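
(For reference: max_sectors_kb is in kilobytes while max_sectors is in
512-byte sectors, so 512 KB corresponds to the 1024 sectors seen in the
errors above, and 1280 KB is 2560 sectors, i.e. BLK_DEF_MAX_SECTORS_CAP,
the validation default that replaced the stacked limit.)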
 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 332eb9dac22d91..f6c822c9cbd2d3 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3700,8 +3700,10 @@ static int sd_revalidate_disk(struct gendisk *disk)
>  	 */
>  	if (sdkp->first_scan ||
>  	    q->limits.max_sectors > q->limits.max_dev_sectors ||
> -	    q->limits.max_sectors > q->limits.max_hw_sectors)
> +	    q->limits.max_sectors > q->limits.max_hw_sectors) {
>  		q->limits.max_sectors = rw_max;
> +		q->limits.max_user_sectors = rw_max;
> +	}
>  
>  	sdkp->first_scan = 0;
>  
> 

A driver shouldn't be changing max_user_sectors...

But it also didn't fix it (mpath still gets ./max_sectors_kb:1280):

[   74.872485] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[   74.872505] device-mapper: multipath: 254:3: Failing path 8:16.
[   74.872620] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[   74.872641] device-mapper: multipath: 254:3: Failing path 8:32.
[   74.872712] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[   74.872732] device-mapper: multipath: 254:3: Failing path 8:48.
[   74.872788] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[   74.872808] device-mapper: multipath: 254:3: Failing path 8:64.

Simply setting max_user_sectors won't help with stacked devices
because blk_stack_limits() doesn't stack max_user_sectors.  It'll
inform the underlying device's blk_validate_limits() calculation which
will result in max_sectors having the desired value (which it already
did, as I showed above).  But when stacking limits from underlying
devices up to the higher-level dm-mpath queue_limits we still have
information loss.
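
(To make that information loss concrete with the values above:

	sd devices:   max_sectors = 1024 sectors (512 KB)
	dm stacking:  blk_stack_limits() combines these into
	              max_sectors = 1024, but max_user_sectors is not
	              stacked and stays 0
	validation:   blk_validate_limits() sees max_user_sectors == 0
	              and recomputes max_sectors as
	              min(max_hw_sectors, BLK_DEF_MAX_SECTORS_CAP) =
	              2560 sectors (1280 KB), losing the stacked limit)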

Mike
Ewan Milne May 22, 2024, 5:37 p.m. UTC | #3
We tried the sd_revalidate_disk() change, just to be sure.  It didn't fix it.

-Ewan

On Wed, May 22, 2024 at 12:49 PM Mike Snitzer <snitzer@kernel.org> wrote:
>
> On Wed, May 22, 2024 at 04:24:58PM +0200, Christoph Hellwig wrote:
> > On Tue, May 21, 2024 at 10:51:17PM -0400, Mike Snitzer wrote:
> > > Otherwise, blk_validate_limits() will throw-away the max_sectors that
> > > was stacked from underlying device(s). In doing so it can set a
> > > max_sectors limit that violates underlying device limits.
> >
> > Hmm, yes it sort of is "throwing the limit away", but it really
> > recalculates it from max_hw_sectors, max_dev_sectors and max_user_sectors.
>
> Yes, but it needs to do that recalculation at each level of a stacked
> device.  And then we need to combine them via blk_stack_limits() -- as
> is done with the various limits stacking loops in
> drivers/md/dm-table.c:dm_calculate_queue_limits
>
> > > This caused dm-multipath IO failures like the following because the
> > > underlying devices' max_sectors were stacked up to be 1024, yet
> > > blk_validate_limits() defaulted max_sectors to BLK_DEF_MAX_SECTORS_CAP
> > > (2560):
> >
> > I suspect the problem is that SCSI messes directly with max_sectors
> > and ignores max_user_sectors (and really shouldn't touch either, but that's
> > a separate discussion).  Can you try the patch below and maybe also provide
> > the sysfs output for max_sectors_kb and max_hw_sectors_kb for all involved
> > devices?
>
> FYI, you can easily reproduce with:
>
> git clone https://github.com/snitm/mptest.git
> cd mptest
> <edit so it uses: export MULTIPATH_BACKEND_MODULE="scsidebug">
> ./runtest ./tests/test_00_no_failure
>
> Also, best to change this line:
> ./lib/mpath_generic:        local _feature="4 queue_if_no_path retain_attached_hw_handler queue_mode $MULTIPATH_QUEUE_MODE"
> to:
> ./lib/mpath_generic:        local _feature="3 retain_attached_hw_handler queue_mode $MULTIPATH_QUEUE_MODE"
> Otherwise the test will hang due to queue_if_no_path.
>
> all underlying scsi-debug scsi devices:
>
> ./max_hw_sectors_kb:2147483647
> ./max_sectors_kb:512
>
> multipath device:
>
> before my fix:
> ./max_hw_sectors_kb:2147483647
> ./max_sectors_kb:1280
>
> after my fix:
> ./max_hw_sectors_kb:2147483647
> ./max_sectors_kb:512
>
> > diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> > index 332eb9dac22d91..f6c822c9cbd2d3 100644
> > --- a/drivers/scsi/sd.c
> > +++ b/drivers/scsi/sd.c
> > @@ -3700,8 +3700,10 @@ static int sd_revalidate_disk(struct gendisk *disk)
> >        */
> >       if (sdkp->first_scan ||
> >           q->limits.max_sectors > q->limits.max_dev_sectors ||
> > -         q->limits.max_sectors > q->limits.max_hw_sectors)
> > +         q->limits.max_sectors > q->limits.max_hw_sectors) {
> >               q->limits.max_sectors = rw_max;
> > +             q->limits.max_user_sectors = rw_max;
> > +     }
> >
> >       sdkp->first_scan = 0;
> >
> >
>
> A driver shouldn't be changing max_user_sectors...
>
> But it also didn't fix it (mpath still gets ./max_sectors_kb:1280):
>
> [   74.872485] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [   74.872505] device-mapper: multipath: 254:3: Failing path 8:16.
> [   74.872620] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [   74.872641] device-mapper: multipath: 254:3: Failing path 8:32.
> [   74.872712] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [   74.872732] device-mapper: multipath: 254:3: Failing path 8:48.
> [   74.872788] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [   74.872808] device-mapper: multipath: 254:3: Failing path 8:64.
>
> Simply setting max_user_sectors won't help with stacked devices
> because blk_stack_limits() doesn't stack max_user_sectors.  It'll
> inform the underlying device's blk_validate_limits() calculation which
> will result in max_sectors having the desired value (which it already
> did, as I showed above).  But when stacking limits from underlying
> devices up to the higher-level dm-mpath queue_limits we still have
> information loss.
>
> Mike
>
Ewan Milne May 22, 2024, 8:33 p.m. UTC | #4
On Tue, May 21, 2024 at 10:51 PM Mike Snitzer <snitzer@kernel.org> wrote:
>
> Otherwise, blk_validate_limits() will throw-away the max_sectors that
> was stacked from underlying device(s). In doing so it can set a
> max_sectors limit that violates underlying device limits.
>
> This caused dm-multipath IO failures like the following because the
> underlying devices' max_sectors were stacked up to be 1024, yet
> blk_validate_limits() defaulted max_sectors to BLK_DEF_MAX_SECTORS_CAP
> (2560):
>
> [ 1214.673233] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [ 1214.673267] device-mapper: multipath: 254:3: Failing path 8:32.
> [ 1214.675196] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [ 1214.675224] device-mapper: multipath: 254:3: Failing path 8:16.
> [ 1214.675309] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [ 1214.675338] device-mapper: multipath: 254:3: Failing path 8:48.
> [ 1214.675413] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [ 1214.675441] device-mapper: multipath: 254:3: Failing path 8:64.
>
> The initial bug report included:
>
> [   13.822701] blk_insert_cloned_request: over max size limit. (248 > 128)
> [   13.829351] device-mapper: multipath: 253:3: Failing path 8:32.
> [   13.835307] blk_insert_cloned_request: over max size limit. (248 > 128)
> [   13.841928] device-mapper: multipath: 253:3: Failing path 65:16.
> [   13.844532] blk_insert_cloned_request: over max size limit. (248 > 128)
> [   13.854363] blk_insert_cloned_request: over max size limit. (248 > 128)
> [   13.854580] device-mapper: multipath: 253:4: Failing path 8:48.
> [   13.861166] device-mapper: multipath: 253:3: Failing path 8:192.
>
> Reported-by: Marco Patalano <mpatalan@redhat.com>
> Reported-by: Ewan Milne <emilne@redhat.com>
> Fixes: 1c0e720228ad ("dm: use queue_limits_set")
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  drivers/md/dm-table.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 88114719fe18..6463b4afeaa4 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -1961,6 +1961,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
>                               struct queue_limits *limits)
>  {
>         bool wc = false, fua = false;
> +       unsigned int max_hw_sectors;
>         int r;
>
>         if (dm_table_supports_nowait(t))
> @@ -1981,9 +1982,16 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
>         if (!dm_table_supports_secure_erase(t))
>                 limits->max_secure_erase_sectors = 0;
>
> +       /* Don't allow queue_limits_set() to throw-away stacked max_sectors */
> +       max_hw_sectors = limits->max_hw_sectors;
> +       limits->max_hw_sectors = limits->max_sectors;
>         r = queue_limits_set(q, limits);
>         if (r)
>                 return r;
> +       /* Restore stacked max_hw_sectors */
> +       mutex_lock(&q->limits_lock);
> +       limits->max_hw_sectors = max_hw_sectors;
> +       mutex_unlock(&q->limits_lock);
>
>         if (dm_table_supports_flush(t, (1UL << QUEUE_FLAG_WC))) {
>                 wc = true;
> --
> 2.43.0
>

This fixed the FC DM-MP failures, so:

Tested-by: Marco Patalano <mpatalan@redhat.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Ming Lei May 23, 2024, 1:52 a.m. UTC | #5
On Wed, May 22, 2024 at 12:48:59PM -0400, Mike Snitzer wrote:
> On Wed, May 22, 2024 at 04:24:58PM +0200, Christoph Hellwig wrote:
> > On Tue, May 21, 2024 at 10:51:17PM -0400, Mike Snitzer wrote:
> > > Otherwise, blk_validate_limits() will throw-away the max_sectors that
> > > was stacked from underlying device(s). In doing so it can set a
> > > max_sectors limit that violates underlying device limits.
> > 
> > Hmm, yes it sort of is "throwing the limit away", but it really
> > recalculates it from max_hw_sectors, max_dev_sectors and max_user_sectors.
> 
> Yes, but it needs to do that recalculation at each level of a stacked
> device.  And then we need to combine them via blk_stack_limits() -- as
> is done with the various limits stacking loops in
> drivers/md/dm-table.c:dm_calculate_queue_limits

This looks like a stacking-specific requirement; just wondering why not
put the logic into blk_validate_limits() by passing a 'stacking'
parameter? Then raid could benefit from it too.
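
(One possible shape of that suggestion, purely hypothetical; no such
parameter exists today:)

	/*
	 * Hypothetical sketch: a 'stacking' flag telling validation to
	 * trust a max_sectors already combined by blk_stack_limits()
	 * instead of recomputing it from scratch.
	 */
	static int blk_validate_limits(struct queue_limits *lim, bool stacking)
	{
		unsigned int max_hw = min_not_zero(lim->max_hw_sectors,
						   lim->max_dev_sectors);

		if (stacking && lim->max_sectors)
			lim->max_sectors = min(lim->max_sectors, max_hw);
		else if (lim->max_user_sectors)
			lim->max_sectors = min(max_hw, lim->max_user_sectors);
		else
			lim->max_sectors = min(max_hw, BLK_DEF_MAX_SECTORS_CAP);

		/* ...remaining validation unchanged... */
		return 0;
	}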

thanks,
Ming
Christoph Hellwig May 23, 2024, 7:16 a.m. UTC | #6
On Wed, May 22, 2024 at 12:48:59PM -0400, Mike Snitzer wrote:
> On Wed, May 22, 2024 at 04:24:58PM +0200, Christoph Hellwig wrote:
> > On Tue, May 21, 2024 at 10:51:17PM -0400, Mike Snitzer wrote:
> > > Otherwise, blk_validate_limits() will throw-away the max_sectors that
> > > was stacked from underlying device(s). In doing so it can set a
> > > max_sectors limit that violates underlying device limits.
> > 
> > Hmm, yes it sort of is "throwing the limit away", but it really
> > recalculates it from max_hw_sectors, max_dev_sectors and max_user_sectors.
> 
> Yes, but it needs to do that recalculation at each level of a stacked
> device.  And then we need to combine them via blk_stack_limits() -- as
> is done with the various limits stacking loops in
> drivers/md/dm-table.c:dm_calculate_queue_limits
> 
> > > This caused dm-multipath IO failures like the following because the
> > > underlying devices' max_sectors were stacked up to be 1024, yet
> > > blk_validate_limits() defaulted max_sectors to BLK_DEF_MAX_SECTORS_CAP
> > > (2560):
> > 
> > I suspect the problem is that SCSI messes directly with max_sectors
> > and ignores max_user_sectors (and really shouldn't touch either, but that's
> > a separate discussion).  Can you try the patch below and maybe also provide
> > the sysfs output for max_sectors_kb and max_hw_sectors_kb for all involved
> > devices?
> 
> FYI, you can easily reproduce with:

Running this (with your suggested edits) on Linus' current tree
(commit c760b3725e52403dc1b28644fb09c47a83cacea6) doesn't show any
failure even after dozens of runs.  What am I doing wrong?
Christoph Hellwig May 23, 2024, 8:27 a.m. UTC | #7
On Wed, May 22, 2024 at 12:48:59PM -0400, Mike Snitzer wrote:
> [   74.872485] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [   74.872505] device-mapper: multipath: 254:3: Failing path 8:16.
> [   74.872620] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [   74.872641] device-mapper: multipath: 254:3: Failing path 8:32.
> [   74.872712] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [   74.872732] device-mapper: multipath: 254:3: Failing path 8:48.
> [   74.872788] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [   74.872808] device-mapper: multipath: 254:3: Failing path 8:64.
> 
> Simply setting max_user_sectors won't help with stacked devices
> because blk_stack_limits() doesn't stack max_user_sectors.  It'll
> inform the underlying device's blk_validate_limits() calculation which
> will result in max_sectors having the desired value (which it already
> did, as I showed above).  But when stacking limits from underlying
> devices up to the higher-level dm-mpath queue_limits we still have
> information loss.

So while I can't reproduce it, I think the main issue is that
max_sectors really just is a voluntary limit, and enforcing that at
the lower device doesn't really make any sense.  So we could just
change blk_insert_cloned_request() to check max_hw_sectors instead.
Or, my preferred variant below, just drop the check, as the
max_sectors == 0 check indicates it's pretty sketchy to start with.


diff --git a/block/blk-mq.c b/block/blk-mq.c
index fc364a226e952f..61b108aa20044d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3041,29 +3041,9 @@ void blk_mq_submit_bio(struct bio *bio)
 blk_status_t blk_insert_cloned_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	unsigned int max_sectors = blk_queue_get_max_sectors(q, req_op(rq));
 	unsigned int max_segments = blk_rq_get_max_segments(rq);
 	blk_status_t ret;
 
-	if (blk_rq_sectors(rq) > max_sectors) {
-		/*
-		 * SCSI device does not have a good way to return if
-		 * Write Same/Zero is actually supported. If a device rejects
-		 * a non-read/write command (discard, write same,etc.) the
-		 * low-level device driver will set the relevant queue limit to
-		 * 0 to prevent blk-lib from issuing more of the offending
-		 * operations. Commands queued prior to the queue limit being
-		 * reset need to be completed with BLK_STS_NOTSUPP to avoid I/O
-		 * errors being propagated to upper layers.
-		 */
-		if (max_sectors == 0)
-			return BLK_STS_NOTSUPP;
-
-		printk(KERN_ERR "%s: over max size limit. (%u > %u)\n",
-			__func__, blk_rq_sectors(rq), max_sectors);
-		return BLK_STS_IOERR;
-	}
-
 	/*
 	 * The queue settings related to segment counting may differ from the
 	 * original queue.

> 
> Mike
---end quoted text---
Mike Snitzer May 23, 2024, 2:12 p.m. UTC | #8
On Thu, May 23, 2024 at 10:27:31AM +0200, Christoph Hellwig wrote:
> On Wed, May 22, 2024 at 12:48:59PM -0400, Mike Snitzer wrote:
> > [   74.872485] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> > [   74.872505] device-mapper: multipath: 254:3: Failing path 8:16.
> > [   74.872620] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> > [   74.872641] device-mapper: multipath: 254:3: Failing path 8:32.
> > [   74.872712] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> > [   74.872732] device-mapper: multipath: 254:3: Failing path 8:48.
> > [   74.872788] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> > [   74.872808] device-mapper: multipath: 254:3: Failing path 8:64.
> > 
> > Simply setting max_user_sectors won't help with stacked devices
> > because blk_stack_limits() doesn't stack max_user_sectors.  It'll
> > inform the underlying device's blk_validate_limits() calculation which
> > will result in max_sectors having the desired value (which it already
> > did, as I showed above).  But when stacking limits from underlying
> > devices up to the higher-level dm-mpath queue_limits we still have
> > information loss.
> 
> So while I can't reproduce it, I think the main issue is that
> max_sectors really just is a voluntary limit, and enforcing that at
> the lower device doesn't really make any sense.  So we could just
> change blk_insert_cloned_request() to check max_hw_sectors instead.

I haven't tried your patch but we still want properly stacked
max_sectors configured for the device.

> Or, my preferred variant below, just drop the check, as the
> max_sectors == 0 check indicates it's pretty sketchy to start with.

At this point in the 6.10 release I don't want further whack-a-mole
fixes due to fallout from removing longstanding negative checks.

Not sure what is sketchy about the max_sectors == 0 check; the large
comment block explains that check quite well.  We want to avoid EIO
for unsupported operations (otherwise we'll get spurious path failures
in the context of dm-multipath).  Could be we can remove this check
after an audit of how LLDs handle servicing IO for unsupported
operations -- so best to work through it during a devel cycle.

Not sure why scsi_debug based testing with mptest isn't triggering it
for you. Are you seeing these limits for the underlying scsi_debug
devices?

./max_hw_sectors_kb:2147483647
./max_sectors_kb:512

What are those limits for the mptest created 'mp' dm-multipath device?

Mike
Christoph Hellwig May 23, 2024, 2:49 p.m. UTC | #9
On Thu, May 23, 2024 at 10:12:24AM -0400, Mike Snitzer wrote:
> Not sure what is sketchy about the max_sectors == 0 check; the large
> comment block explains that check quite well.  We want to avoid EIO
> for unsupported operations (otherwise we'll get spurious path failures
> in the context of dm-multipath).  Could be we can remove this check
> after an audit of how LLDs handle servicing IO for unsupported
> operations -- so best to work through it during a devel cycle.

Think of what happens if you create a dm device, and then reduce
max_sectors using sysfs on the lower device after the dm device
was created: you'll trivially trigger this check.
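
(A hypothetical way to trigger that by hand; device names are
illustrative:)

	# dm-0 already stacks sdc; now shrink the lower device's limit
	echo 128 > /sys/block/sdc/queue/max_sectors_kb
	# large direct IO through the multipath device can now exceed
	# the lower queue's max_sectors and trip the check
	dd if=/dev/dm-0 of=/dev/null bs=1M iflag=direct count=16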

> Not sure why scsi_debug based testing with mptest isn't triggering it
> for you. Are you seeing these limits for the underlying scsi_debug
> devices?
> 
> ./max_hw_sectors_kb:2147483647
> ./max_sectors_kb:512

root@testvm:~/mptest# cat /sys/block/sdc/queue/max_hw_sectors_kb 
2147483647

root@testvm:~/mptest# cat /sys/block/sdd/queue/max_sectors_kb 
512

root@testvm:~/mptest# cat /sys/block/dm-0/queue/max_hw_sectors_kb 
2147483647

root@testvm:~/mptest# cat /sys/block/dm-0/queue/max_sectors_kb 
1280

so they don't match, but for some reason bigger bios never get built.
Mike Snitzer May 23, 2024, 3:44 p.m. UTC | #10
On Thu, May 23, 2024 at 04:49:38PM +0200, Christoph Hellwig wrote:
> On Thu, May 23, 2024 at 10:12:24AM -0400, Mike Snitzer wrote:
> > Not sure what is sketchy about the max_sectors == 0 check; the large
> > comment block explains that check quite well.  We want to avoid EIO
> > for unsupported operations (otherwise we'll get spurious path failures
> > in the context of dm-multipath).  Could be we can remove this check
> > after an audit of how LLDs handle servicing IO for unsupported
> > operations -- so best to work through it during a devel cycle.
> 
> Think of what happens if you create a dm device, and then reduce
> max_sectors using sysfs on the lower device after the dm device
> was created: you'll trivially trigger this check.
> 
> > Not sure why scsi_debug based testing with mptest isn't triggering it
> > for you. Are you seeing these limits for the underlying scsi_debug
> > devices?
> > 
> > ./max_hw_sectors_kb:2147483647
> > ./max_sectors_kb:512
> 
> root@testvm:~/mptest# cat /sys/block/sdc/queue/max_hw_sectors_kb 
> 2147483647
> 
> root@testvm:~/mptest# cat /sys/block/sdd/queue/max_sectors_kb 
> 512
> 
> root@testvm:~/mptest# cat /sys/block/dm-0/queue/max_hw_sectors_kb 
> 2147483647
> 
> root@testvm:~/mptest# cat /sys/block/dm-0/queue/max_sectors_kb 
> 1280
> 
> so they don't match, but for some reason bigger bios never get built.

Weird... I'm running in a VMware guest but I don't see why that'd make
a difference on larger IOs being formed (given it is virtual
scsi_debug devices).

In any case, we know I can reproduce with this scsi_debug-based mptest
test and Marco has verified my fix resolves the issue on his FC
multipath testbed.

But I've just floated a patch to elevate the fix to block core (based
on Ming's suggestion):
https://patchwork.kernel.org/project/dm-devel/patch/Zk9i7V2GRoHxBPRu@kernel.org/

Let me know, thanks.
Christoph Hellwig May 23, 2024, 3:50 p.m. UTC | #11
On Thu, May 23, 2024 at 11:44:05AM -0400, Mike Snitzer wrote:
> a difference on larger IOs being formed (given it is virtual
> scsi_debug devices).
> 
> In any case, we know I can reproduce with this scsi_debug-based mptest
> test and Marco has verified my fix resolves the issue on his FC
> multipath testbed.
> 
> But I've just floated a patch to elevate the fix to block core (based
> on Ming's suggestion):
> https://patchwork.kernel.org/project/dm-devel/patch/Zk9i7V2GRoHxBPRu@kernel.org/

I still think that is wrong.  Unfortunately I can't actually reproduce
the issue locally, but I think we want sd to set the user_max_sectors
and stack if you want to see the limits propagated, i.e. the combined
patch below.   In the longer run I need to get SCSI out of messing
with max_sectors directly, and the blk-mq stacking to stop looking
at it vs just the hardware limits (or just drop the check).

diff --git a/block/blk-settings.c b/block/blk-settings.c
index a7fe8e90240a6e..7a672021daee6a 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -611,6 +611,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	unsigned int top, bottom, alignment, ret = 0;
 
 	t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
+	t->max_user_sectors = min_not_zero(t->max_user_sectors,
+			b->max_user_sectors);
 	t->max_hw_sectors = min_not_zero(t->max_hw_sectors, b->max_hw_sectors);
 	t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 332eb9dac22d91..f6c822c9cbd2d3 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3700,8 +3700,10 @@ static int sd_revalidate_disk(struct gendisk *disk)
 	 */
 	if (sdkp->first_scan ||
 	    q->limits.max_sectors > q->limits.max_dev_sectors ||
-	    q->limits.max_sectors > q->limits.max_hw_sectors)
+	    q->limits.max_sectors > q->limits.max_hw_sectors) {
 		q->limits.max_sectors = rw_max;
+		q->limits.max_user_sectors = rw_max;
+	}
 
 	sdkp->first_scan = 0;
Mike Snitzer May 23, 2024, 4:44 p.m. UTC | #12
On Thu, May 23, 2024 at 05:50:09PM +0200, Christoph Hellwig wrote:
> On Thu, May 23, 2024 at 11:44:05AM -0400, Mike Snitzer wrote:
> > a difference on larger IOs being formed (given it is virtual
> > scsi_debug devices).
> > 
> > In any case, we know I can reproduce with this scsi_debug-based mptest
> > test and Marco has verified my fix resolves the issue on his FC
> > multipath testbed.
> > 
> > But I've just floated a patch to elevate the fix to block core (based
> > on Ming's suggestion):
> > https://patchwork.kernel.org/project/dm-devel/patch/Zk9i7V2GRoHxBPRu@kernel.org/
> 
> I still think that is wrong.  Unfortunately I can't actually reproduce
> the issue locally, but I think we want sd to set max_user_sectors
> and stack it if you want to see the limits propagated, i.e. the combined
> patch below.   In the longer run I need to get SCSI out of messing
> with max_sectors directly, and the blk-mq stacking to stop looking
> at it vs just the hardware limits (or just drop the check).

This "works" but it doesn't safeguard blk_stack_limits() and
blk_validate_limits() from other drivers that weren't trained to
(ab)use max_user_sectors to get blk_validate_limits() to preserve the
underlying device's max_sectors.

But I suppose we can worry about any other similar issues if/when
reported.

Please send a proper patch to Jens so we can get this fixed for
6.10-rc1. Thanks.

Acked-by: Mike Snitzer <snitzer@kernel.org>

> 
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index a7fe8e90240a6e..7a672021daee6a 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -611,6 +611,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>  	unsigned int top, bottom, alignment, ret = 0;
>  
>  	t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
> +	t->max_user_sectors = min_not_zero(t->max_user_sectors,
> +			b->max_user_sectors);
>  	t->max_hw_sectors = min_not_zero(t->max_hw_sectors, b->max_hw_sectors);
>  	t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors);
>  	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 332eb9dac22d91..f6c822c9cbd2d3 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3700,8 +3700,10 @@ static int sd_revalidate_disk(struct gendisk *disk)
>  	 */
>  	if (sdkp->first_scan ||
>  	    q->limits.max_sectors > q->limits.max_dev_sectors ||
> -	    q->limits.max_sectors > q->limits.max_hw_sectors)
> +	    q->limits.max_sectors > q->limits.max_hw_sectors) {
>  		q->limits.max_sectors = rw_max;
> +		q->limits.max_user_sectors = rw_max;
> +	}
>  
>  	sdkp->first_scan = 0;
>  
> 
>
Christoph Hellwig May 23, 2024, 5:03 p.m. UTC | #13
On Thu, May 23, 2024 at 12:44:05PM -0400, Mike Snitzer wrote:
> This "works" but it doesn't safeguard blk_stack_limits() and
> blk_validate_limits() from other drivers that weren't trained to
> (ab)use max_user_sectors to get blk_validate_limits() to preserve the
> underlying device's max_sectors.

It does in that sd/sr are the only remaining places directly setting
queue limits without the commit API.  And I am working on converting
them for 6.11.

Patch

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 88114719fe18..6463b4afeaa4 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1961,6 +1961,7 @@  int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 			      struct queue_limits *limits)
 {
 	bool wc = false, fua = false;
+	unsigned int max_hw_sectors;
 	int r;
 
 	if (dm_table_supports_nowait(t))
@@ -1981,9 +1982,16 @@  int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 	if (!dm_table_supports_secure_erase(t))
 		limits->max_secure_erase_sectors = 0;
 
+	/* Don't allow queue_limits_set() to throw-away stacked max_sectors */
+	max_hw_sectors = limits->max_hw_sectors;
+	limits->max_hw_sectors = limits->max_sectors;
 	r = queue_limits_set(q, limits);
 	if (r)
 		return r;
+	/* Restore stacked max_hw_sectors */
+	mutex_lock(&q->limits_lock);
+	limits->max_hw_sectors = max_hw_sectors;
+	mutex_unlock(&q->limits_lock);
 
 	if (dm_table_supports_flush(t, (1UL << QUEUE_FLAG_WC))) {
 		wc = true;
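
(A note on why the workaround works: queue_limits_set() validates and
commits the limits under q->limits_lock, and blk_validate_limits()
never raises max_sectors above max_hw_sectors. Temporarily presenting
the stacked max_sectors as max_hw_sectors therefore prevents the
recalculation from discarding it. For context, queue_limits_set() is
roughly the following, a simplified sketch:)

	int queue_limits_set(struct request_queue *q, struct queue_limits *lim)
	{
		mutex_lock(&q->limits_lock);
		/*
		 * Validates *lim via blk_validate_limits(), commits it
		 * to q->limits and drops limits_lock, which is why the
		 * dm code above re-takes the lock before restoring
		 * max_hw_sectors.
		 */
		return queue_limits_commit_update(q, lim);
	}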