| Message ID | 20240522025117.75568-1-snitzer@kernel.org |
|---|---|
| State | New, archived |
| Series | dm: retain stacked max_sectors when setting queue_limits |
On Tue, May 21, 2024 at 10:51:17PM -0400, Mike Snitzer wrote:
> Otherwise, blk_validate_limits() will throw-away the max_sectors that
> was stacked from underlying device(s). In doing so it can set a
> max_sectors limit that violates underlying device limits.

Hmm, yes, it sort of is "throwing the limit away", but really it
recalculates it from max_hw_sectors, max_dev_sectors and max_user_sectors.

> This caused dm-multipath IO failures like the following because the
> underlying devices' max_sectors were stacked up to be 1024, yet
> blk_validate_limits() defaulted max_sectors to BLK_DEF_MAX_SECTORS_CAP
> (2560):

I suspect the problem is that SCSI messed directly with max_sectors
instead and ignores max_user_sectors (and really shouldn't touch either,
but that's a separate discussion). Can you try the patch below and maybe
also provide the sysfs output for max_sectors_kb and max_hw_sectors_kb
for all involved devices?

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 332eb9dac22d91..f6c822c9cbd2d3 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3700,8 +3700,10 @@ static int sd_revalidate_disk(struct gendisk *disk)
 	 */
 	if (sdkp->first_scan ||
 	    q->limits.max_sectors > q->limits.max_dev_sectors ||
-	    q->limits.max_sectors > q->limits.max_hw_sectors)
+	    q->limits.max_sectors > q->limits.max_hw_sectors) {
 		q->limits.max_sectors = rw_max;
+		q->limits.max_user_sectors = rw_max;
+	}
 
 	sdkp->first_scan = 0;
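For context, the recalculation described above happens in blk_validate_limits(). A simplified sketch of the max_sectors derivation, abbreviated from the v6.9-era block/blk-settings.c (the many other limit validations are omitted, so treat this as illustrative only):

/*
 * Simplified sketch of the max_sectors derivation in blk_validate_limits()
 * (abbreviated; most other limit checks omitted).
 */
static int blk_validate_limits(struct queue_limits *lim)
{
	unsigned int max_hw_sectors;

	/* fall back to a safe default when the driver set no hardware cap */
	if (!lim->max_hw_sectors)
		lim->max_hw_sectors = BLK_SAFE_MAX_SECTORS;

	/* the effective hardware ceiling also honors max_dev_sectors */
	max_hw_sectors = min_not_zero(lim->max_hw_sectors,
				      lim->max_dev_sectors);

	if (lim->max_user_sectors) {
		/* a user-requested limit (max_sectors_kb) wins ... */
		if (lim->max_user_sectors > max_hw_sectors)
			return -EINVAL;
		lim->max_sectors = lim->max_user_sectors;
	} else {
		/*
		 * ... otherwise any incoming max_sectors is discarded and
		 * recomputed from the default cap (2560 sectors) -- the
		 * "throw-away" behavior this thread is about.
		 */
		lim->max_sectors = min(max_hw_sectors,
				       BLK_DEF_MAX_SECTORS_CAP);
	}
	return 0;
}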
On Wed, May 22, 2024 at 04:24:58PM +0200, Christoph Hellwig wrote:
> On Tue, May 21, 2024 at 10:51:17PM -0400, Mike Snitzer wrote:
> > Otherwise, blk_validate_limits() will throw-away the max_sectors that
> > was stacked from underlying device(s). In doing so it can set a
> > max_sectors limit that violates underlying device limits.
>
> Hmm, yes, it sort of is "throwing the limit away", but really it
> recalculates it from max_hw_sectors, max_dev_sectors and max_user_sectors.

Yes, but it needs to do that recalculation at each level of a stacked
device. And then we need to combine them via blk_stack_limits() -- as
is done with the various limits stacking loops in
drivers/md/dm-table.c:dm_calculate_queue_limits

> > This caused dm-multipath IO failures like the following because the
> > underlying devices' max_sectors were stacked up to be 1024, yet
> > blk_validate_limits() defaulted max_sectors to BLK_DEF_MAX_SECTORS_CAP
> > (2560):
>
> I suspect the problem is that SCSI messed directly with max_sectors
> instead and ignores max_user_sectors (and really shouldn't touch
> either, but that's a separate discussion). Can you try the patch below
> and maybe also provide the sysfs output for max_sectors_kb and
> max_hw_sectors_kb for all involved devices?

FYI, you can easily reproduce with:

git clone https://github.com/snitm/mptest.git
cd mptest
<edit so it uses: export MULTIPATH_BACKEND_MODULE="scsidebug">
./runtest ./tests/test_00_no_failure

Also, best to change this line:
./lib/mpath_generic: local _feature="4 queue_if_no_path retain_attached_hw_handler queue_mode $MULTIPATH_QUEUE_MODE"
to:
./lib/mpath_generic: local _feature="3 retain_attached_hw_handler queue_mode $MULTIPATH_QUEUE_MODE"
Otherwise the test will hang due to queue_if_no_path.

all underlying scsi-debug scsi devices:

./max_hw_sectors_kb:2147483647
./max_sectors_kb:512

multipath device:

before my fix:
./max_hw_sectors_kb:2147483647
./max_sectors_kb:1280

after my fix:
./max_hw_sectors_kb:2147483647
./max_sectors_kb:512

> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 332eb9dac22d91..f6c822c9cbd2d3 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3700,8 +3700,10 @@ static int sd_revalidate_disk(struct gendisk *disk)
>  	 */
>  	if (sdkp->first_scan ||
>  	    q->limits.max_sectors > q->limits.max_dev_sectors ||
> -	    q->limits.max_sectors > q->limits.max_hw_sectors)
> +	    q->limits.max_sectors > q->limits.max_hw_sectors) {
>  		q->limits.max_sectors = rw_max;
> +		q->limits.max_user_sectors = rw_max;
> +	}
>
>  	sdkp->first_scan = 0;

A driver shouldn't be changing max_user_sectors... But it also didn't
fix it (mpath still gets ./max_sectors_kb:1280):

[   74.872485] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[   74.872505] device-mapper: multipath: 254:3: Failing path 8:16.
[   74.872620] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[   74.872641] device-mapper: multipath: 254:3: Failing path 8:32.
[   74.872712] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[   74.872732] device-mapper: multipath: 254:3: Failing path 8:48.
[   74.872788] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[   74.872808] device-mapper: multipath: 254:3: Failing path 8:64.

Simply setting max_user_sectors won't help with stacked devices
because blk_stack_limits() doesn't stack max_user_sectors. It'll
inform the underlying device's blk_validate_limits() calculation, which
will result in max_sectors having the desired value (which it already
did, as I showed above). But when stacking limits from underlying
devices up to the higher-level dm-mpath queue_limits we still have
information loss.

Mike
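A condensed sketch of the stacking loop Mike points to, paraphrased from drivers/md/dm-table.c:dm_calculate_queue_limits() (error handling and zone/alignment checks dropped, so treat it as illustrative only):

/*
 * Condensed sketch of dm_calculate_queue_limits() (paraphrased;
 * error handling and zone/alignment checks dropped).
 */
int dm_calculate_queue_limits(struct dm_table *t,
			      struct queue_limits *limits)
{
	struct queue_limits ti_limits;
	unsigned int i;

	/* start from "no limit" values that any real device will lower */
	blk_set_stacking_limits(limits);

	for (i = 0; i < t->num_targets; i++) {
		struct dm_target *ti = dm_table_get_target(t, i);

		blk_set_stacking_limits(&ti_limits);

		/*
		 * Fold every underlying device's limits into ti_limits;
		 * dm_set_device_limits() calls blk_stack_limits(), which
		 * takes min() of the sector caps (including max_sectors).
		 */
		if (ti->type->iterate_devices)
			ti->type->iterate_devices(ti, dm_set_device_limits,
						  &ti_limits);

		/* then fold each target's limits into the table-wide set */
		blk_stack_limits(limits, &ti_limits, 0);
	}

	return 0;
}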
We tried the sd_revalidate_disk() change, just to be sure.
It didn't fix it.

-Ewan

On Wed, May 22, 2024 at 12:49 PM Mike Snitzer <snitzer@kernel.org> wrote:
>
> On Wed, May 22, 2024 at 04:24:58PM +0200, Christoph Hellwig wrote:
> > On Tue, May 21, 2024 at 10:51:17PM -0400, Mike Snitzer wrote:
> > > Otherwise, blk_validate_limits() will throw-away the max_sectors that
> > > was stacked from underlying device(s). In doing so it can set a
> > > max_sectors limit that violates underlying device limits.
[...]
> Simply setting max_user_sectors won't help with stacked devices
> because blk_stack_limits() doesn't stack max_user_sectors. It'll
> inform the underlying device's blk_validate_limits() calculation, which
> will result in max_sectors having the desired value (which it already
> did, as I showed above). But when stacking limits from underlying
> devices up to the higher-level dm-mpath queue_limits we still have
> information loss.
>
> Mike
On Tue, May 21, 2024 at 10:51 PM Mike Snitzer <snitzer@kernel.org> wrote:
>
> Otherwise, blk_validate_limits() will throw-away the max_sectors that
> was stacked from underlying device(s). In doing so it can set a
> max_sectors limit that violates underlying device limits.
>
> This caused dm-multipath IO failures like the following because the
> underlying devices' max_sectors were stacked up to be 1024, yet
> blk_validate_limits() defaulted max_sectors to BLK_DEF_MAX_SECTORS_CAP
> (2560):
[...]
> Fixes: 1c0e720228ad ("dm: use queue_limits_set")
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  drivers/md/dm-table.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
[...]

This fixed the FC DM-MP failures, so:

Tested-by: Marco Patalano <mpatalan@redhat.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
On Wed, May 22, 2024 at 12:48:59PM -0400, Mike Snitzer wrote:
> On Wed, May 22, 2024 at 04:24:58PM +0200, Christoph Hellwig wrote:
> > On Tue, May 21, 2024 at 10:51:17PM -0400, Mike Snitzer wrote:
> > > Otherwise, blk_validate_limits() will throw-away the max_sectors that
> > > was stacked from underlying device(s). In doing so it can set a
> > > max_sectors limit that violates underlying device limits.
> >
> > Hmm, yes, it sort of is "throwing the limit away", but really it
> > recalculates it from max_hw_sectors, max_dev_sectors and max_user_sectors.
>
> Yes, but it needs to do that recalculation at each level of a stacked
> device. And then we need to combine them via blk_stack_limits() -- as
> is done with the various limits stacking loops in
> drivers/md/dm-table.c:dm_calculate_queue_limits

This looks like a stacking-specific requirement; just wondering why not
put the logic into blk_validate_limits() by passing a 'stacking'
parameter? Then raid can benefit from it too.

thanks,
Ming
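One hypothetical shape of Ming's suggestion -- note that no 'stacking' parameter exists in the kernel; this is only a sketch of the idea:

/*
 * Hypothetical sketch of the 'stacking' parameter Ming suggests --
 * this signature does not exist in the kernel.
 */
static int blk_validate_limits(struct queue_limits *lim, bool stacking)
{
	unsigned int max_hw_sectors = min_not_zero(lim->max_hw_sectors,
						   lim->max_dev_sectors);

	if (lim->max_user_sectors)
		lim->max_sectors = min(max_hw_sectors, lim->max_user_sectors);
	else if (stacking && lim->max_sectors)
		/* preserve what blk_stack_limits() already combined */
		lim->max_sectors = min(max_hw_sectors, lim->max_sectors);
	else
		lim->max_sectors = min(max_hw_sectors,
				       BLK_DEF_MAX_SECTORS_CAP);

	return 0;
}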
On Wed, May 22, 2024 at 12:48:59PM -0400, Mike Snitzer wrote:
> On Wed, May 22, 2024 at 04:24:58PM +0200, Christoph Hellwig wrote:
> > I suspect the problem is that SCSI messed directly with max_sectors
> > instead and ignores max_user_sectors (and really shouldn't touch
> > either, but that's a separate discussion). Can you try the patch
> > below and maybe also provide the sysfs output for max_sectors_kb and
> > max_hw_sectors_kb for all involved devices?
>
> FYI, you can easily reproduce with:
[...]

Running this (with your suggested edits) on Linus' current tree
(commit c760b3725e52403dc1b28644fb09c47a83cacea6) doesn't show any
failure even after dozens of runs. What am I doing wrong?
On Wed, May 22, 2024 at 12:48:59PM -0400, Mike Snitzer wrote:
> [   74.872485] blk_insert_cloned_request: over max size limit. (2048 > 1024)
> [   74.872505] device-mapper: multipath: 254:3: Failing path 8:16.
[...]
> Simply setting max_user_sectors won't help with stacked devices
> because blk_stack_limits() doesn't stack max_user_sectors. It'll
> inform the underlying device's blk_validate_limits() calculation, which
> will result in max_sectors having the desired value (which it already
> did, as I showed above). But when stacking limits from underlying
> devices up to the higher-level dm-mpath queue_limits we still have
> information loss.

So while I can't reproduce it, I think the main issue is that
max_sectors really just is a voluntary limit, and enforcing that at
the lower device doesn't really make any sense. So we could just
change blk_insert_cloned_request to check max_hw_sectors instead.
Or my preferred variant below: just drop the check, as the
max_sectors == 0 special case indicates it's pretty sketchy to start
with.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index fc364a226e952f..61b108aa20044d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3041,29 +3041,9 @@ void blk_mq_submit_bio(struct bio *bio)
 blk_status_t blk_insert_cloned_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	unsigned int max_sectors = blk_queue_get_max_sectors(q, req_op(rq));
 	unsigned int max_segments = blk_rq_get_max_segments(rq);
 	blk_status_t ret;
 
-	if (blk_rq_sectors(rq) > max_sectors) {
-		/*
-		 * SCSI device does not have a good way to return if
-		 * Write Same/Zero is actually supported. If a device rejects
-		 * a non-read/write command (discard, write same,etc.) the
-		 * low-level device driver will set the relevant queue limit to
-		 * 0 to prevent blk-lib from issuing more of the offending
-		 * operations. Commands queued prior to the queue limit being
-		 * reset need to be completed with BLK_STS_NOTSUPP to avoid I/O
-		 * errors being propagated to upper layers.
-		 */
-		if (max_sectors == 0)
-			return BLK_STS_NOTSUPP;
-
-		printk(KERN_ERR "%s: over max size limit. (%u > %u)\n",
-			__func__, blk_rq_sectors(rq), max_sectors);
-		return BLK_STS_IOERR;
-	}
-
 	/*
 	 * The queue settings related to segment counting may differ from the
 	 * original queue.

>
> Mike
---end quoted text---
On Thu, May 23, 2024 at 10:27:31AM +0200, Christoph Hellwig wrote:
> On Wed, May 22, 2024 at 12:48:59PM -0400, Mike Snitzer wrote:
[...]
> So while I can't reproduce it, I think the main issue is that
> max_sectors really just is a voluntary limit, and enforcing that at
> the lower device doesn't really make any sense. So we could just
> change blk_insert_cloned_request to check max_hw_sectors instead.

I haven't tried your patch but we still want properly stacked
max_sectors configured for the device.

> Or my preferred variant below: just drop the check, as the
> max_sectors == 0 special case indicates it's pretty sketchy to start
> with.

At this point in the 6.10 release I don't want further whack-a-mole
fixes due to fallout from removing longstanding negative checks.

Not sure what is sketchy about the max_sectors == 0 check; the large
comment block explains that check quite well. We want to avoid EIO
for unsupported operations (otherwise we'll get spurious path failures
in the context of dm-multipath). Could be we can remove this check
after an audit of how LLDs handle servicing IO for unsupported
operations -- so best to work through it during a devel cycle.

Not sure why scsi_debug-based testing with mptest isn't triggering it
for you. Are you seeing these limits for the underlying scsi_debug
devices?

./max_hw_sectors_kb:2147483647
./max_sectors_kb:512

What are those limits for the mptest-created 'mp' dm-multipath device?

Mike
On Thu, May 23, 2024 at 10:12:24AM -0400, Mike Snitzer wrote:
> Not sure what is sketchy about the max_sectors == 0 check; the large
> comment block explains that check quite well. We want to avoid EIO
> for unsupported operations (otherwise we'll get spurious path failures
> in the context of dm-multipath). Could be we can remove this check
> after an audit of how LLDs handle servicing IO for unsupported
> operations -- so best to work through it during a devel cycle.

Think of what happens if you create a dm device, and then reduce
max_sectors using sysfs on the lower device after the dm device
was created: you'll trivially trigger this check.

> Not sure why scsi_debug-based testing with mptest isn't triggering it
> for you. Are you seeing these limits for the underlying scsi_debug
> devices?
>
> ./max_hw_sectors_kb:2147483647
> ./max_sectors_kb:512

root@testvm:~/mptest# cat /sys/block/sdc/queue/max_hw_sectors_kb
2147483647

root@testvm:~/mptest# cat /sys/block/sdd/queue/max_sectors_kb
512

root@testvm:~/mptest# cat /sys/block/dm-0/queue/max_hw_sectors_kb
2147483647

root@testvm:~/mptest# cat /sys/block/dm-0/queue/max_sectors_kb
1280

so they don't match, but for some reason bigger bios never get built.
On Thu, May 23, 2024 at 04:49:38PM +0200, Christoph Hellwig wrote:
> On Thu, May 23, 2024 at 10:12:24AM -0400, Mike Snitzer wrote:
> > Not sure why scsi_debug-based testing with mptest isn't triggering it
> > for you. Are you seeing these limits for the underlying scsi_debug
> > devices?
[...]
> root@testvm:~/mptest# cat /sys/block/dm-0/queue/max_sectors_kb
> 1280
>
> so they don't match, but for some reason bigger bios never get built.

Weird... I'm running in a VMware guest, but I don't see why that'd make
a difference on larger IOs being formed (given it is virtual scsi_debug
devices).

In any case, we know I can reproduce with this scsi_debug-based mptest
test, and Marco has verified my fix resolves the issue on his FC
multipath testbed.

But I've just floated a patch to elevate the fix to block core (based
on Ming's suggestion):
https://patchwork.kernel.org/project/dm-devel/patch/Zk9i7V2GRoHxBPRu@kernel.org/

Let me know, thanks.
On Thu, May 23, 2024 at 11:44:05AM -0400, Mike Snitzer wrote:
> a difference on larger IOs being formed (given it is virtual
> scsi_debug devices).
>
> In any case, we know I can reproduce with this scsi_debug-based mptest
> test, and Marco has verified my fix resolves the issue on his FC
> multipath testbed.
>
> But I've just floated a patch to elevate the fix to block core (based
> on Ming's suggestion):
> https://patchwork.kernel.org/project/dm-devel/patch/Zk9i7V2GRoHxBPRu@kernel.org/

I still think that is wrong. Unfortunately I can't actually reproduce
the issue locally, but I think we want sd to set max_user_sectors, and
to stack it, if you want to see the limits propagated, i.e. the
combined patch below. In the longer run I need to get SCSI out of
messing with max_sectors directly, and the blk-mq stacking to stop
looking at it vs just the hardware limits (or just drop the check).

diff --git a/block/blk-settings.c b/block/blk-settings.c
index a7fe8e90240a6e..7a672021daee6a 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -611,6 +611,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	unsigned int top, bottom, alignment, ret = 0;
 
 	t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
+	t->max_user_sectors = min_not_zero(t->max_user_sectors,
+					   b->max_user_sectors);
 	t->max_hw_sectors = min_not_zero(t->max_hw_sectors, b->max_hw_sectors);
 	t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 332eb9dac22d91..f6c822c9cbd2d3 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3700,8 +3700,10 @@ static int sd_revalidate_disk(struct gendisk *disk)
 	 */
 	if (sdkp->first_scan ||
 	    q->limits.max_sectors > q->limits.max_dev_sectors ||
-	    q->limits.max_sectors > q->limits.max_hw_sectors)
+	    q->limits.max_sectors > q->limits.max_hw_sectors) {
 		q->limits.max_sectors = rw_max;
+		q->limits.max_user_sectors = rw_max;
+	}
 
 	sdkp->first_scan = 0;
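To see why this combined patch produces the values reported earlier in the thread, a toy userspace walk-through of the mptest numbers (a hypothetical standalone program, not kernel code; units are 512-byte sectors, so 1024 sectors = 512 KiB):

#include <stdio.h>

#define BLK_DEF_MAX_SECTORS_CAP 2560u	/* sectors, i.e. 1280 KiB */

/* same semantics as the kernel macro: zero means "not set" */
static unsigned int min_not_zero(unsigned int a, unsigned int b)
{
	if (!a)
		return b;
	if (!b)
		return a;
	return a < b ? a : b;
}

int main(void)
{
	/* sd now sets both on each scsi-debug path (rw_max = 1024 sectors) */
	unsigned int path_user = 1024, path_hw = 0xfffffffe; /* ~unlimited */

	/* blk_stack_limits() with the patch also stacks max_user_sectors */
	unsigned int mp_user = min_not_zero(0u, path_user);	/* -> 1024 */
	unsigned int mp_hw = min_not_zero(0u, path_hw);

	/* blk_validate_limits() at the dm level then honors the user cap */
	unsigned int after = min_not_zero(mp_hw, mp_user);

	/*
	 * Without the stacking hunk, mp_user stays 0 and validation falls
	 * back to the default cap: min(hw, 2560) = 2560 sectors = 1280 KiB,
	 * the bad "before" value reported earlier in the thread.
	 */
	unsigned int before = min_not_zero(mp_hw, BLK_DEF_MAX_SECTORS_CAP);

	printf("before: %u KiB, after: %u KiB\n", before / 2, after / 2);
	return 0;	/* prints: before: 1280 KiB, after: 512 KiB */
}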
On Thu, May 23, 2024 at 05:50:09PM +0200, Christoph Hellwig wrote:
> On Thu, May 23, 2024 at 11:44:05AM -0400, Mike Snitzer wrote:
> > But I've just floated a patch to elevate the fix to block core (based
> > on Ming's suggestion):
> > https://patchwork.kernel.org/project/dm-devel/patch/Zk9i7V2GRoHxBPRu@kernel.org/
>
> I still think that is wrong. Unfortunately I can't actually reproduce
> the issue locally, but I think we want sd to set max_user_sectors, and
> to stack it, if you want to see the limits propagated, i.e. the
> combined patch below. In the longer run I need to get SCSI out of
> messing with max_sectors directly, and the blk-mq stacking to stop
> looking at it vs just the hardware limits (or just drop the check).

This "works" but it doesn't safeguard blk_stack_limits() and
blk_validate_limits() from other drivers that weren't trained to
(ab)use max_user_sectors to get blk_validate_limits() to preserve the
underlying device's max_sectors.

But I suppose we can worry about any other similar issues if/when
reported. Please send a proper patch to Jens so we can get this fixed
for 6.10-rc1. Thanks.

Acked-by: Mike Snitzer <snitzer@kernel.org>
On Thu, May 23, 2024 at 12:44:05PM -0400, Mike Snitzer wrote:
> This "works" but it doesn't safeguard blk_stack_limits() and
> blk_validate_limits() from other drivers that weren't trained to
> (ab)use max_user_sectors to get blk_validate_limits() to preserve the
> underlying device's max_sectors.

It does, in that sd/sr are the only remaining places directly setting
queue limits without the commit API. And I am working on converting
them for 6.11.
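The commit API referred to here is the queue_limits_start_update()/queue_limits_commit_update() pair. A rough sketch of what an sd-style update through it could look like -- sd_set_rw_max() is a hypothetical helper, not the real conversion Christoph mentions:

/*
 * Rough sketch of an update through the queue-limits commit API;
 * sd_set_rw_max() is a hypothetical helper, not a real sd function.
 */
static int sd_set_rw_max(struct request_queue *q, unsigned int rw_max)
{
	struct queue_limits lim;

	/* snapshots q->limits and takes q->limits_lock */
	lim = queue_limits_start_update(q);
	lim.max_user_sectors = rw_max;
	/* re-runs blk_validate_limits() and publishes atomically */
	return queue_limits_commit_update(q, &lim);
}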
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 88114719fe18..6463b4afeaa4 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1961,6 +1961,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 			      struct queue_limits *limits)
 {
 	bool wc = false, fua = false;
+	unsigned int max_hw_sectors;
 	int r;
 
 	if (dm_table_supports_nowait(t))
@@ -1981,9 +1982,16 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 	if (!dm_table_supports_secure_erase(t))
 		limits->max_secure_erase_sectors = 0;
 
+	/* Don't allow queue_limits_set() to throw-away stacked max_sectors */
+	max_hw_sectors = limits->max_hw_sectors;
+	limits->max_hw_sectors = limits->max_sectors;
 	r = queue_limits_set(q, limits);
 	if (r)
 		return r;
+	/* Restore stacked max_hw_sectors */
+	mutex_lock(&q->limits_lock);
+	limits->max_hw_sectors = max_hw_sectors;
+	mutex_unlock(&q->limits_lock);
 
 	if (dm_table_supports_flush(t, (1UL << QUEUE_FLAG_WC))) {
 		wc = true;
Otherwise, blk_validate_limits() will throw-away the max_sectors that
was stacked from underlying device(s). In doing so it can set a
max_sectors limit that violates underlying device limits.

This caused dm-multipath IO failures like the following because the
underlying devices' max_sectors were stacked up to be 1024, yet
blk_validate_limits() defaulted max_sectors to BLK_DEF_MAX_SECTORS_CAP
(2560):

[ 1214.673233] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[ 1214.673267] device-mapper: multipath: 254:3: Failing path 8:32.
[ 1214.675196] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[ 1214.675224] device-mapper: multipath: 254:3: Failing path 8:16.
[ 1214.675309] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[ 1214.675338] device-mapper: multipath: 254:3: Failing path 8:48.
[ 1214.675413] blk_insert_cloned_request: over max size limit. (2048 > 1024)
[ 1214.675441] device-mapper: multipath: 254:3: Failing path 8:64.

The initial bug report included:

[   13.822701] blk_insert_cloned_request: over max size limit. (248 > 128)
[   13.829351] device-mapper: multipath: 253:3: Failing path 8:32.
[   13.835307] blk_insert_cloned_request: over max size limit. (248 > 128)
[   13.841928] device-mapper: multipath: 253:3: Failing path 65:16.
[   13.844532] blk_insert_cloned_request: over max size limit. (248 > 128)
[   13.854363] blk_insert_cloned_request: over max size limit. (248 > 128)
[   13.854580] device-mapper: multipath: 253:4: Failing path 8:48.
[   13.861166] device-mapper: multipath: 253:3: Failing path 8:192.

Reported-by: Marco Patalano <mpatalan@redhat.com>
Reported-by: Ewan Milne <emilne@redhat.com>
Fixes: 1c0e720228ad ("dm: use queue_limits_set")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 drivers/md/dm-table.c | 8 ++++++++
 1 file changed, 8 insertions(+)