
[for-6.10-rc1] block: fix blk_validate_limits() to properly handle stacked devices

Message ID Zk9i7V2GRoHxBPRu@kernel.org (mailing list archive)
State Rejected, archived
Delegated to: Mike Snitzer
Series [for-6.10-rc1] block: fix blk_validate_limits() to properly handle stacked devices

Commit Message

Mike Snitzer May 23, 2024, 3:38 p.m. UTC
On Thu, May 23, 2024 at 09:52:40AM +0800, Ming Lei wrote:
> On Wed, May 22, 2024 at 12:48:59PM -0400, Mike Snitzer wrote:
> > On Wed, May 22, 2024 at 04:24:58PM +0200, Christoph Hellwig wrote:
> > > On Tue, May 21, 2024 at 10:51:17PM -0400, Mike Snitzer wrote:
> > > > Otherwise, blk_validate_limits() will throw-away the max_sectors that
> > > > was stacked from underlying device(s). In doing so it can set a
> > > > max_sectors limit that violates underlying device limits.
> > > 
> > > Hmm, yes it sort of is "throwing the limit away", but it really
> > > recalculates it from max_hw_sectors, max_dev_sectors and user_max_sectors.
> > 
> > Yes, but it needs to do that recalculation at each level of a stacked
> > device.  And then we need to combine them via blk_stack_limits() -- as
> > is done with the various limits stacking loops in
> > drivers/md/dm-table.c:dm_calculate_queue_limits
> 
> This looks like a stacking-specific requirement; just wondering why not
> put the logic into blk_validate_limits() by passing a 'stacking' parameter?
> Then raid could benefit from it too.

Sure, we could elevate it to blk_validate_limits (and callers) but
adding a 'stacking' parameter is more intrusive on an API level.

Best to just update blk_set_stacking_limits() to set a new 'stacking'
flag in struct queue_limits, and update blk_stack_limits() to stack
that flag up.
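
For context, a stacked driver ends up building and applying its limits
roughly like below -- an illustrative sketch only, not code from this
patch, modeled loosely on drivers/md/dm-table.c:dm_calculate_queue_limits().
The my_lower_dev struct, the device list and the function name are made
up; the block helpers are the real ones:

#include <linux/blkdev.h>

struct my_lower_dev {
	struct list_head	list;
	struct block_device	*bdev;
};

static int my_stacked_set_limits(struct request_queue *q,
				 struct list_head *lower_devs)
{
	struct my_lower_dev *d;
	struct queue_limits lim;

	/* Start from "stacking" defaults that impose no limits of their own. */
	blk_set_stacking_limits(&lim);

	/* Fold each underlying device's limits into the aggregate. */
	list_for_each_entry(d, lower_devs, list)
		blk_stack_limits(&lim, &bdev_get_queue(d->bdev)->limits,
				 get_start_sect(d->bdev));

	/* Apply the aggregate; blk_validate_limits() runs as part of this. */
	return queue_limits_set(q, &lim);
}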

I've verified this commit to work and have staged it in linux-next via
linux-dm.git's 'for-next', see:

https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=cedc03d697ff255dd5b600146521434e2e921815

Jens (and obviously: Christoph, Ming and others), I'm happy to send
this to Linus tomorrow morning if you could please provide your
Reviewed-by or Acked-by.  I'd prefer to keep the intermediate DM fix
just to "show the work and testing".

Thanks,
Mike

From cedc03d697ff255dd5b600146521434e2e921815 Mon Sep 17 00:00:00 2001
From: Mike Snitzer <snitzer@kernel.org>
Date: Thu, 23 May 2024 11:19:29 -0400
Subject: [PATCH] block: fix blk_validate_limits() to properly handle stacked devices

For the benefit of other stacking block drivers, e.g. MD, elevate the
DM fix from commit 0ead1c8e8e48 ("dm: retain stacked max_sectors when
setting queue_limits") to block core.

Switches to using a bool bitfield in struct queue_limits (for old
member 'zoned' and new member 'stacking') to not grow that struct.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 block/blk-settings.c   | 30 ++++++++++++++++++++++++++++--
 drivers/md/dm-table.c  |  8 --------
 include/linux/blkdev.h |  3 ++-
 3 files changed, 30 insertions(+), 11 deletions(-)

Comments

Christoph Hellwig May 23, 2024, 3:44 p.m. UTC | #1
On Thu, May 23, 2024 at 11:38:21AM -0400, Mike Snitzer wrote:
> Sure, we could elevate it to blk_validate_limits (and callers) but
> adding a 'stacking' parameter is more intrusive on an API level.
> 
> Best to just update blk_set_stacking_limits() to set a new 'stacking'
> flag in struct queue_limits, and update blk_stack_limits() to stack
> that flag up.
> 
> I've verified this commit to work and have staged it in linux-next via
> linux-dm.git's 'for-next', see:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=cedc03d697ff255dd5b600146521434e2e921815
> 
> Jens (and obviously: Christoph, Ming and others), I'm happy to send
> this to Linus tomorrow morning if you could please provide your
> Reviewed-by or Acked-by.  I'd prefer to keep the intermediate DM fix
> just to "show the work and testing".

A stacking flag in the limits is fundamentally wrong, please don't
do this.
Mike Snitzer May 23, 2024, 3:48 p.m. UTC | #2
On Thu, May 23, 2024 at 05:44:35PM +0200, Christoph Hellwig wrote:
> On Thu, May 23, 2024 at 11:38:21AM -0400, Mike Snitzer wrote:
> > Sure, we could elevate it to blk_validate_limits (and callers) but
> > adding a 'stacking' parameter is more intrusive on an API level.
> > 
> > Best to just update blk_set_stacking_limits() to set a new 'stacking'
> > flag in struct queue_limits, and update blk_stack_limits() to stack
> > that flag up.
> > 
> > I've verified this commit to work and have staged it in linux-next via
> > linux-dm.git's 'for-next', see:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=cedc03d697ff255dd5b600146521434e2e921815
> > 
> > Jens (and obviously: Christoph, Ming and others), I'm happy to send
> > this to Linus tomorrow morning if you could please provide your
> > Reviewed-by or Acked-by.  I'd prefer to keep the intermediate DM fix
> > just to "show the work and testing".
> 
> A stacking flag in the limits is fundamentally wrong, please don't
> do this.

Um, how so?  It serves as a hint as to how the limits were constructed.

Reality is, we have stacking block devices that regularly are _not_
accounted for when people make changes to block core queue_limits
code.  That is a serious problem.

Happy to see the need for the 'stacking' flag to go away in time but I
fail to see why it is "fundamentally wrong".

Mike
Christoph Hellwig May 23, 2024, 3:52 p.m. UTC | #3
On Thu, May 23, 2024 at 11:48:50AM -0400, Mike Snitzer wrote:
> 
> Reality is, we have stacking block devices that regularly are _not_
> accounted for when people make changes to block core queue_limits
> code.  That is a serious problem.

Well, maybe we need to make sure blktests covers this instead of
separate tests that are either impossible to run on a stock distro
(device_mapper tests) or always somehow hanging (lvm2 testsuite)..

> Happy to see the need for the 'stacking' flag to go away in time but I
> fail to see why it is "fundamentally wrong".

Because stacking limits should not in any way be built differently.
Both the stacking flag and your restoring of the value just hack
around the problem instead of trying to address it.  Let's use a
little time to sort this out properly instead of piling up hacks.
Mike Snitzer May 23, 2024, 4:38 p.m. UTC | #4
On Thu, May 23, 2024 at 05:52:17PM +0200, Christoph Hellwig wrote:
> On Thu, May 23, 2024 at 11:48:50AM -0400, Mike Snitzer wrote:
> > 
> > Reality is, we have stacking block devices that regularly are _not_
> > accounted for when people make changes to block core queue_limits
> > code.  That is a serious problem.
> 
> Well, maybe we need to make sure blktests covers this instead of
> separate tests that are either impossible to run on a stock distro
> (device_mapper tests) or always somehow hanging (lvm2 testsuite)..

Happy to see mptest get folded into blktests (it's just bash code) --
but it doesn't reproduce for you, so it's not a reliable safeguard.

The other testsuites aren't applicable to this particular issue, but I
don't disagree that the unicorn automated test frameworks used for DM
aren't robust enough.

> > Happy to see the need for the 'stacking' flag to go away in time but I
> > fail to see why it is "fundamentally wrong".
> 
> Because stacking limits should not in any way be built differently.
> Both the stacking flag and your restoring of the value just hack
> around the problem instead of trying to address it.  Let's use a
> little time to sort this out properly instead of piling up hacks.

I have limited patience for queue_limits bugs given how long it took
us to stabilize limits stacking (and how regressions keep happening).

All of the changes to render max_sectors and max_discard_sectors
unsettable directly (and only allow them to be set in terms of an
ever-growing array of overrides) were quite the big hammer.  But yeah,
it is what it is.

I do appreciate your concern with me wanting limits stacking to be a
more pronounced first-class citizen in block core's queue_limits code.
I was trying to get the code back to a reliable state relative to
blk_stack_limits() et al -- for any underlying driver that might also
be blind to (ab)using max_user_sectors to steer blk_validate_limits().
I'm concerned such hacks are also fragile.
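
For reference, the derivation at issue, simplified (details such as the
PAGE_SIZE sanity check abridged): blk_validate_limits() recomputes
max_sectors from max_hw_sectors, max_dev_sectors and the optional
max_user_sectors override, discarding whatever max_sectors value was
passed in:

	/* Simplified sketch of the max_sectors derivation. */
	max_hw_sectors = min_not_zero(lim->max_hw_sectors,
				      lim->max_dev_sectors);
	if (lim->max_user_sectors)
		lim->max_sectors = min(max_hw_sectors, lim->max_user_sectors);
	else
		lim->max_sectors = min(max_hw_sectors, BLK_DEF_MAX_SECTORS_CAP);
	lim->max_sectors = round_down(lim->max_sectors,
			lim->logical_block_size >> SECTOR_SHIFT);

This is why a stacked device's previously-stacked max_sectors does not
survive revalidation unless something carries it forward -- which is
what the max_hw_sectors save/restore in the patch below does.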

But in general I know we share the same goal, so I'll slow down and
yield to you to run with piecewise fixes where needed.  Will reply to
your latest patch now.

Mike
Christoph Hellwig May 23, 2024, 5:05 p.m. UTC | #5
On Thu, May 23, 2024 at 12:38:24PM -0400, Mike Snitzer wrote:
> Happy to see mptest get folded into blktests (it's just bash code) --
> but it doesn't reproduce for you, so it's not a reliable safeguard.

It is a lot better than not running it.  And I'll look into why
it doesn't reproduce.  Right now the only thing I can think of is
different kernel configs, maybe related to schedulers.  Can you send
me your .config?

Is adding mptests something you want to do, or would you prefer others
to take care of it?
Mike Snitzer May 23, 2024, 5:14 p.m. UTC | #6
On Thu, May 23, 2024 at 07:05:29PM +0200, Christoph Hellwig wrote:
> On Thu, May 23, 2024 at 12:38:24PM -0400, Mike Snitzer wrote:
> > Happy to see mptest get folded into blktests (it's just bash code) --
> > but it doesn't reproduce for you, so it's not a reliable safeguard.
> 
> It is a lot better than not running it.  And I'll look into why
> it doesn't reproduce.  Right now the only thing I can think of is
> different kernel configs, maybe related to schedulers.  Can you send
> me your .config?

Will do off-list.

> Is adding mptests something you want to do, or would you prefer others
> to take care of it?

I won't have time in the near or mid-term.  So if someone else would
like to convert mptests over to blktests that'd be wonderful.

Fanning out to test the various supported permutations would be cool
(meaning, test with both MULTIPATH_BACKEND_MODULE=scsidebug and
MULTIPATH_BACKEND_MODULE=tcmloop ... but tcmloop is brittle due to
sysfs changes vs targetcli's sysfs expectations -- though that's a
targetcli issue that might have been fixed since, assuming that code
is still supported?  I haven't revisited it in a few months.)

Patch

diff --git a/block/blk-settings.c b/block/blk-settings.c
index cdbaef159c4b..24c799072f6c 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -35,6 +35,7 @@  EXPORT_SYMBOL_GPL(blk_queue_rq_timeout);
 void blk_set_stacking_limits(struct queue_limits *lim)
 {
 	memset(lim, 0, sizeof(*lim));
+	lim->stacking = true;
 	lim->logical_block_size = SECTOR_SIZE;
 	lim->physical_block_size = SECTOR_SIZE;
 	lim->io_min = SECTOR_SIZE;
@@ -103,7 +104,7 @@  static int blk_validate_zoned_limits(struct queue_limits *lim)
  */
 static int blk_validate_limits(struct queue_limits *lim)
 {
-	unsigned int max_hw_sectors;
+	unsigned int max_hw_sectors, stacked_max_hw_sectors = 0;
 
 	/*
 	 * Unless otherwise specified, default to 512 byte logical blocks and a
@@ -121,6 +122,23 @@  static int blk_validate_limits(struct queue_limits *lim)
 	if (lim->io_min < lim->physical_block_size)
 		lim->io_min = lim->physical_block_size;
 
+
+	/*
+	 * For stacked block devices, don't throw-away stacked max_sectors.
+	 */
+	if (lim->stacking && lim->max_hw_sectors) {
+		/*
+		 * lim->max_sectors and lim->max_hw_sectors were already
+		 * validated, relative underlying device(s) in this stacked
+		 * block device.
+		 */
+		stacked_max_hw_sectors = lim->max_hw_sectors;
+		/*
+		 * Impose stacked max_sectors as upper-bound for code below.
+		 */
+		lim->max_hw_sectors = lim->max_sectors;
+	}
+
 	/*
 	 * max_hw_sectors has a somewhat weird default for historical reason,
 	 * but driver really should set their own instead of relying on this
@@ -155,6 +173,11 @@  static int blk_validate_limits(struct queue_limits *lim)
 	lim->max_sectors = round_down(lim->max_sectors,
 			lim->logical_block_size >> SECTOR_SHIFT);
 
+	if (stacked_max_hw_sectors) {
+		/* Restore previously validated stacked max_hw_sectors */
+		lim->max_hw_sectors = stacked_max_hw_sectors;
+	}
+
 	/*
 	 * Random default for the maximum number of segments.  Driver should not
 	 * rely on this and set their own.
@@ -881,11 +904,14 @@  int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 						   b->max_secure_erase_sectors);
 	t->zone_write_granularity = max(t->zone_write_granularity,
 					b->zone_write_granularity);
-	t->zoned = max(t->zoned, b->zoned);
+	t->zoned |= b->zoned;
 	if (!t->zoned) {
 		t->zone_write_granularity = 0;
 		t->max_zone_append_sectors = 0;
 	}
+
+	t->stacking |= b->stacking;
+
 	return ret;
 }
 EXPORT_SYMBOL(blk_stack_limits);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 6463b4afeaa4..88114719fe18 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1961,7 +1961,6 @@  int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 			      struct queue_limits *limits)
 {
 	bool wc = false, fua = false;
-	unsigned int max_hw_sectors;
 	int r;
 
 	if (dm_table_supports_nowait(t))
@@ -1982,16 +1981,9 @@  int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 	if (!dm_table_supports_secure_erase(t))
 		limits->max_secure_erase_sectors = 0;
 
-	/* Don't allow queue_limits_set() to throw-away stacked max_sectors */
-	max_hw_sectors = limits->max_hw_sectors;
-	limits->max_hw_sectors = limits->max_sectors;
 	r = queue_limits_set(q, limits);
 	if (r)
 		return r;
-	/* Restore stacked max_hw_sectors */
-	mutex_lock(&q->limits_lock);
-	limits->max_hw_sectors = max_hw_sectors;
-	mutex_unlock(&q->limits_lock);
 
 	if (dm_table_supports_flush(t, (1UL << QUEUE_FLAG_WC))) {
 		wc = true;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c3e8f7cf96be..ad1b00e5cc3e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -307,7 +307,8 @@  struct queue_limits {
 	unsigned char		misaligned;
 	unsigned char		discard_misaligned;
 	unsigned char		raid_partial_stripes_expensive;
-	bool			zoned;
+	bool			zoned:1;
+	bool			stacking:1;
 	unsigned int		max_open_zones;
 	unsigned int		max_active_zones;