
[RFC,7/7] dm: allow devices to revalidate existing zones

Message ID 20250309222904.449803-8-bmarzins@redhat.com (mailing list archive)
State New
Series dm: fix issues with swapping dm tables

Commit Message

Benjamin Marzinski March 9, 2025, 10:29 p.m. UTC
dm_revalidate_zones() only allowed devices that had no zone resources
set up to call blk_revalidate_disk_zones(). If the device already had
zone resources, disk->nr_zones would always equal md->nr_zones so
dm_revalidate_zones() returned without doing any work. Instead, always
call blk_revalidate_disk_zones() when loading a new zoned table.

However, if the device emulates zone append operations and already has
zone append emulation resources, the zone size cannot change when
loading a new table. Otherwise, all those resources will be garbage.

If emulated zone append operations are needed and the zone write pointer
offsets of the new table do not match those of the old table, writes to
the device will still fail. This patch allows users to safely grow and
shrink zoned devices. But swapping arbitrary zoned tables will still not
work.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 drivers/md/dm-zone.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)
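
The effect of the new guard can be distilled into a standalone
condition. The sketch below is a paraphrase with simplified types and a
made-up function name, not the kernel code; see the second hunk of the
patch for the real check:

    #include <errno.h>
    #include <stdbool.h>

    /*
     * Toy distillation of the guard added to dm_set_zones_restrictions():
     * zone append emulation resources are sized per zone, so if the new
     * table changes the zone size (chunk_sectors) while those resources
     * already exist, the table swap is rejected.
     */
    static int zoned_table_swap_ok(bool needs_append_emulation,
                                   bool has_emulation_resources,
                                   unsigned int live_chunk_sectors,
                                   unsigned int new_chunk_sectors)
    {
            if (needs_append_emulation && has_emulation_resources &&
                live_chunk_sectors != new_chunk_sectors)
                    return -EINVAL; /* "Cannot change zone size ..." */
            return 0;
    }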

Comments

Damien Le Moal March 9, 2025, 11:59 p.m. UTC | #1
On 3/10/25 07:29, Benjamin Marzinski wrote:
> dm_revalidate_zones() only allowed devices that had no zone resources
> set up to call blk_revalidate_disk_zones(). If the device already had
> zone resources, disk->nr_zones would always equal md->nr_zones so
> dm_revalidate_zones() returned without doing any work. Instead, always
> call blk_revalidate_disk_zones() when loading a new zoned table.
> 
> However, if the device emulates zone append operations and already has
> zone append emulation resources, the zone size cannot change when
> loading a new table. Otherwise, all those resources will be garbage.
> 
> If emulated zone append operations are needed and the zone write pointer
> offsets of the new table do not match those of the old table, writes to
> the device will still fail. This patch allows users to safely grow and
> shrink zoned devices. But swapping arbitrary zoned tables will still not
> work.

I do not think that this patch correctly addresses the shrinking of a dm zoned
device: blk_revalidate_disk_zones() will look at a smaller set of zones, which
will leave already hashed zone write plugs outside of that new zone range in the
disk zwplug hash table. disk_revalidate_zone_resources() does not clean up and
reallocate the hash table if the number of zones shrinks. For a physical drive,
this can only happen if the drive is reformatted with some magic vendor-unique
command, which is why this was never implemented, as that is not a valid
production use case.

To make things simpler, I think we should not allow growing/shrinking zoned
device tables, much less swapping tables between zoned and non-zoned ones. I am
more inclined to avoid all these corner cases by simply not supporting table
switching for zoned devices. That would be much safer, I think.
No-one has complained about any issue with table switching until now, which
likely means that no-one is using this. So what about simply returning an error
for table switching for a zoned device? If someone requests this support, we
can revisit it.

> 
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> ---
>  drivers/md/dm-zone.c | 23 +++++++++++++----------
>  1 file changed, 13 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index ac86011640c3..7e9ebeee7eac 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -164,16 +164,8 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
>  	if (!get_capacity(disk))
>  		return 0;
>  
> -	/* Revalidate only if something changed. */
> -	if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
> -		DMINFO("%s using %s zone append",
> -		       disk->disk_name,
> -		       queue_emulates_zone_append(q) ? "emulated" : "native");
> -		md->nr_zones = 0;
> -	}
> -
> -	if (md->nr_zones)
> -		return 0;
> +	DMINFO("%s using %s zone append", disk->disk_name,
> +	       queue_emulates_zone_append(q) ? "emulated" : "native");
>  
>  	/*
>  	 * Our table is not live yet. So the call to dm_get_live_table()
> @@ -392,6 +384,17 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
>  		return 0;
>  	}
>  
> +	/*
> +	 * If the device needs zone append emulation, and the device already has
> +	 * zone append emulation resources, make sure that the chunk_sectors
> +	 * hasn't changed size. Otherwise those resources will be garbage.
> +	 */
> +	if (!lim->max_hw_zone_append_sectors && disk->zone_wplugs_hash &&
> +	    q->limits.chunk_sectors != lim->chunk_sectors) {
> +		DMERR("Cannot change zone size when swapping tables");
> +		return -EINVAL;
> +	}
> +
>  	/*
>  	 * Warn once (when the capacity is not yet set) if the mapped device is
>  	 * partially using zone resources of the target devices as that leads to
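
To illustrate the shrink problem Damien describes, here is a toy
userspace model. The structures and names are hypothetical stand-ins,
not the block layer's actual disk->zone_wplugs_hash implementation:

    #include <stdio.h>

    /* Hypothetical, simplified stand-in for a zone write plug. */
    struct zwplug {
            unsigned int zone_no;
            int nr_plugged_bios;
            struct zwplug *next;    /* hash chain */
    };

    #define HASH_SIZE 64
    static struct zwplug *plug_hash[HASH_SIZE];

    /*
     * Nothing prunes the chains when nr_zones shrinks, so any plug
     * whose zone number is past the new end of the device lingers
     * in the hash table after revalidation.
     */
    static void report_stale_plugs(unsigned int new_nr_zones)
    {
            for (unsigned int i = 0; i < HASH_SIZE; i++)
                    for (struct zwplug *p = plug_hash[i]; p; p = p->next)
                            if (p->zone_no >= new_nr_zones)
                                    printf("zone %u: stale plug, %d plugged bios\n",
                                           p->zone_no, p->nr_plugged_bios);
    }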
Benjamin Marzinski March 10, 2025, 5:43 p.m. UTC | #2
On Mon, Mar 10, 2025 at 08:59:26AM +0900, Damien Le Moal wrote:
> On 3/10/25 07:29, Benjamin Marzinski wrote:
> > dm_revalidate_zones() only allowed devices that had no zone resources
> > set up to call blk_revalidate_disk_zones(). If the device already had
> > zone resources, disk->nr_zones would always equal md->nr_zones so
> > dm_revalidate_zones() returned without doing any work. Instead, always
> > call blk_revalidate_disk_zones() when loading a new zoned table.
> > 
> > However, if the device emulates zone append operations and already has
> > zone append emulation resources, the zone size cannot change when
> > loading a new table. Otherwise, all those resources will be garbage.
> > 
> > If emulated zone append operations are needed and the zone write pointer
> > offsets of the new table do not match those of the old table, writes to
> > the device will still fail. This patch allows users to safely grow and
> > shrink zoned devices. But swapping arbitrary zoned tables will still not
> > work.
> 
> I do not think that this patch correctly addresses the shrinking of a dm zoned
> device: blk_revalidate_disk_zones() will look at a smaller set of zones, which
> will leave already hashed zone write plugs outside of that new zone range in the
> disk zwplug hash table. disk_revalidate_zone_resources() does not clean up and
> reallocate the hash table if the number of zones shrinks.

This is necessary for DM. There could be plugged bios on these
no-longer-in-range zones. They will obviously fail when they get
reissued, but we need to keep the plugs around so that they *do* get
reissued. A cleaner alternative would be to add code to immediately
error out all the plugged bios on shrinks, but I was trying to avoid
adding a bunch of code to deal with these cases (of course simply
disallowing them adds even less code).

-Ben

> For a physical drive,
> this can only happen if the drive is reformatted with some magic vendor-unique
> command, which is why this was never implemented, as that is not a valid
> production use case.
> 
> To make things simpler, I think we should not allow growing/shrinking zoned
> device tables, much less swapping tables between zoned and non-zoned ones. I am
> more inclined to avoid all these corner cases by simply not supporting table
> switching for zoned devices. That would be much safer, I think.
> No-one has complained about any issue with table switching until now, which
> likely means that no-one is using this. So what about simply returning an error
> for table switching for a zoned device? If someone requests this support, we
> can revisit it.
> 
> > 
> > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> > ---
> >  drivers/md/dm-zone.c | 23 +++++++++++++----------
> >  1 file changed, 13 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> > index ac86011640c3..7e9ebeee7eac 100644
> > --- a/drivers/md/dm-zone.c
> > +++ b/drivers/md/dm-zone.c
> > @@ -164,16 +164,8 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
> >  	if (!get_capacity(disk))
> >  		return 0;
> >  
> > -	/* Revalidate only if something changed. */
> > -	if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
> > -		DMINFO("%s using %s zone append",
> > -		       disk->disk_name,
> > -		       queue_emulates_zone_append(q) ? "emulated" : "native");
> > -		md->nr_zones = 0;
> > -	}
> > -
> > -	if (md->nr_zones)
> > -		return 0;
> > +	DMINFO("%s using %s zone append", disk->disk_name,
> > +	       queue_emulates_zone_append(q) ? "emulated" : "native");
> >  
> >  	/*
> >  	 * Our table is not live yet. So the call to dm_get_live_table()
> > @@ -392,6 +384,17 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> >  		return 0;
> >  	}
> >  
> > +	/*
> > +	 * If the device needs zone append emulation, and the device already has
> > +	 * zone append emulation resources, make sure that the chunk_sectors
> > +	 * hasn't changed size. Otherwise those resources will be garbage.
> > +	 */
> > +	if (!lim->max_hw_zone_append_sectors && disk->zone_wplugs_hash &&
> > +	    q->limits.chunk_sectors != lim->chunk_sectors) {
> > +		DMERR("Cannot change zone size when swapping tables");
> > +		return -EINVAL;
> > +	}
> > +
> >  	/*
> >  	 * Warn once (when the capacity is not yet set) if the mapped device is
> >  	 * partially using zone resources of the target devices as that leads to
> 
> 
> -- 
> Damien Le Moal
> Western Digital Research
Damien Le Moal March 10, 2025, 11:19 p.m. UTC | #3
On 3/11/25 02:43, Benjamin Marzinski wrote:
> On Mon, Mar 10, 2025 at 08:59:26AM +0900, Damien Le Moal wrote:
>> On 3/10/25 07:29, Benjamin Marzinski wrote:
>>> dm_revalidate_zones() only allowed devices that had no zone resources
>>> set up to call blk_revalidate_disk_zones(). If the device already had
>>> zone resources, disk->nr_zones would always equal md->nr_zones so
>>> dm_revalidate_zones() returned without doing any work. Instead, always
>>> call blk_revalidate_disk_zones() when loading a new zoned table.
>>>
>>> However, if the device emulates zone append operations and already has
>>> zone append emulation resources, the zone size cannot change when
>>> loading a new table. Otherwise, all those resources will be garbage.
>>>
>>> If emulated zone append operations are needed and the zone write pointer
>>> offsets of the new table do not match those of the old table, writes to
>>> the device will still fail. This patch allows users to safely grow and
>>> shrink zoned devices. But swapping arbitrary zoned tables will still not
>>> work.
>>
>> I do not think that this patch correctly addresses the shrinking of a dm zoned
>> device: blk_revalidate_disk_zones() will look at a smaller set of zones, which
>> will leave already hashed zone write plugs outside of that new zone range in the
>> disk zwplug hash table. disk_revalidate_zone_resources() does not clean up and
>> reallocate the hash table if the number of zones shrinks.
> 
> This is necessary for DM. There could be plugged bios on these
> no-longer-in-range zones. They will obviously fail when they get
> reissued, but we need to keep the plugs around so that they *do* get
> reissued. A cleaner alternative would be to add code to immediately
> error out all the plugged bios on shrinks, but I was trying to avoid
> adding a bunch of code to deal with these cases (of course simply
> disallowing them adds even less code).

I am confused now :)
Under the assumption that we do not allow switching to a new table that changes
the zone configuration (in particular, there is no grow/shrink of the device),
I do not think we have to do anything special for DM.
Benjamin Marzinski March 10, 2025, 11:42 p.m. UTC | #4
On Tue, Mar 11, 2025 at 08:19:29AM +0900, Damien Le Moal wrote:
> On 3/11/25 02:43, Benjamin Marzinski wrote:
> > On Mon, Mar 10, 2025 at 08:59:26AM +0900, Damien Le Moal wrote:
> >> On 3/10/25 07:29, Benjamin Marzinski wrote:
> >>> dm_revalidate_zones() only allowed devices that had no zone resources
> >>> set up to call blk_revalidate_disk_zones(). If the device already had
> >>> zone resources, disk->nr_zones would always equal md->nr_zones so
> >>> dm_revalidate_zones() returned without doing any work. Instead, always
> >>> call blk_revalidate_disk_zones() when loading a new zoned table.
> >>>
> >>> However, if the device emulates zone append operations and already has
> >>> zone append emulation resources, the zone size cannot change when
> >>> loading a new table. Otherwise, all those resources will be garbage.
> >>>
> >>> If emulated zone append operations are needed and the zone write pointer
> >>> offsets of the new table do not match those of the old table, writes to
> >>> the device will still fail. This patch allows users to safely grow and
> >>> shrink zoned devices. But swapping arbitrary zoned tables will still not
> >>> work.
> >>
> >> I do not think that this patch correctly addresses the shrinking of a dm zoned
> >> device: blk_revalidate_disk_zones() will look at a smaller set of zones, which
> >> will leave already hashed zone write plugs outside of that new zone range in the
> >> disk zwplug hash table. disk_revalidate_zone_resources() does not clean up and
> >> reallocate the hash table if the number of zones shrinks.
> > 
> > This is necessary for DM. There could be plugged bios on these
> > no-longer-in-range zones. They will obviously fail when they get
> > reissued, but we need to keep the plugs around so that they *do* get
> > reissued. A cleaner alternative would be to add code to immediately
> > error out all the plugged bios on shrinks, but I was trying to avoid
> > adding a bunch of code to deal with these cases (of course simply
> > disallowing them adds even less code).
> 
> I am confused now :)
> Under the assumption that we do not allow switching to a new table that changes
> the zone configuration (in particular, there is no grow/shrink of the device),
> I do not think we have to do anything special for DM.

If we don't allow switching between zoned tables, then we obviously
don't need to make DM call blk_revalidate_disk_zones() on a zoned table
switch. I was just saying that I know that this patch would leave
out-of-range zone plugs behind on a shrink, but that is necessary to
allow shrinking while there could still be outstanding plugged bios
attached to those plugs.

So, if we wanted to allow shrinking, then I think this patch is correct
(although erroring out all the out-of-range bios would be a cleaner
solution). But assuming we don't allow shrinking, then you are correct.
We don't need to do anything special for DM.

-Ben

> 
> 
> -- 
> Damien Le Moal
> Western Digital Research
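
The "cleaner alternative" Ben mentions could look roughly like the
following, again using the hypothetical toy model from above rather
than the real kernel structures (in the kernel, each plugged bio would
be failed with bio_io_error() before the plug is freed):

    #include <stdio.h>
    #include <stdlib.h>

    struct zwplug {
            unsigned int zone_no;
            int nr_plugged_bios;
            struct zwplug *next;    /* hash chain */
    };

    #define HASH_SIZE 64
    static struct zwplug *plug_hash[HASH_SIZE];

    /* On shrink, fail and unhash every plug past the new device end. */
    static void error_out_of_range_plugs(unsigned int new_nr_zones)
    {
            for (unsigned int i = 0; i < HASH_SIZE; i++) {
                    struct zwplug **pp = &plug_hash[i];
                    while (*pp) {
                            struct zwplug *p = *pp;
                            if (p->zone_no >= new_nr_zones) {
                                    printf("failing %d bios on zone %u\n",
                                           p->nr_plugged_bios, p->zone_no);
                                    *pp = p->next;  /* unlink from chain */
                                    free(p);
                            } else {
                                    pp = &p->next;
                            }
                    }
            }
    }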
Damien Le Moal March 11, 2025, midnight UTC | #5
On 3/11/25 08:42, Benjamin Marzinski wrote:
> On Tue, Mar 11, 2025 at 08:19:29AM +0900, Damien Le Moal wrote:
>> On 3/11/25 02:43, Benjamin Marzinski wrote:
>>> On Mon, Mar 10, 2025 at 08:59:26AM +0900, Damien Le Moal wrote:
>>>> On 3/10/25 07:29, Benjamin Marzinski wrote:
>>>>> dm_revalidate_zones() only allowed devices that had no zone resources
>>>>> set up to call blk_revalidate_disk_zones(). If the device already had
>>>>> zone resources, disk->nr_zones would always equal md->nr_zones so
>>>>> dm_revalidate_zones() returned without doing any work. Instead, always
>>>>> call blk_revalidate_disk_zones() when loading a new zoned table.
>>>>>
>>>>> However, if the device emulates zone append operations and already has
>>>>> zone append emulation resources, the zone size cannot change when
>>>>> loading a new table. Otherwise, all those resources will be garbage.
>>>>>
>>>>> If emulated zone append operations are needed and the zone write pointer
>>>>> offsets of the new table do not match those of the old table, writes to
>>>>> the device will still fail. This patch allows users to safely grow and
>>>>> shrink zoned devices. But swapping arbitrary zoned tables will still not
>>>>> work.
>>>>
>>>> I do not think that this patch correctly addresses the shrinking of a dm zoned
>>>> device: blk_revalidate_disk_zones() will look at a smaller set of zones, which
>>>> will leave already hashed zone write plugs outside of that new zone range in the
>>>> disk zwplug hash table. disk_revalidate_zone_resources() does not clean up and
>>>> reallocate the hash table if the number of zones shrinks.
>>>
>>> This is necessary for DM. There could be plugged bios on these
>>> no-longer-in-range zones. They will obviously fail when they get
>>> reissued, but we need to keep the plugs around so that they *do* get
>>> reissued. A cleaner alternative would be to add code to immediately
>>> error out all the plugged bios on shrinks, but I was trying to avoid
>>> adding a bunch of code to deal with these cases (of course simply
>>> disallowing them adds even less code).
>>
>> I am confused now :)
>> Under the assumption that we do not allow switching to a new table that changes
>> the zone configuration (in particular, there is no grow/shrink of the device),
>> I do not think we have to do anything special for DM.
> 
> If we don't allow switching between zoned tables, then we obviously
> don't need to make DM call blk_revalidate_disk_zones() on a zoned table
> switch. I was just saying that I know that this patch would leave
> out-of-range zone plugs behind on a shrink, but that is necessary to
> allow shrinking while there could still be outstanding plugged bios
> attached to those plugs.
> 
> So, if we wanted to allow shrinking, then I think this patch is correct
> (although erroring out all the out-of-range bios would be a cleaner
> solution). But assuming we don't allow shrinking, then you are correct.
> We don't need to do anything special for DM.

OK. Got it.
As mentioned, I think we really still need to allow switching to (or inserting?)
dm-error. In that case, calling blk_revalidate_disk_zones() should still be fine
as long as there is no change to the zone config, which is true for dm-error. Do
you agree?

Patch

diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index ac86011640c3..7e9ebeee7eac 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -164,16 +164,8 @@  int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
 	if (!get_capacity(disk))
 		return 0;
 
-	/* Revalidate only if something changed. */
-	if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
-		DMINFO("%s using %s zone append",
-		       disk->disk_name,
-		       queue_emulates_zone_append(q) ? "emulated" : "native");
-		md->nr_zones = 0;
-	}
-
-	if (md->nr_zones)
-		return 0;
+	DMINFO("%s using %s zone append", disk->disk_name,
+	       queue_emulates_zone_append(q) ? "emulated" : "native");
 
 	/*
 	 * Our table is not live yet. So the call to dm_get_live_table()
@@ -392,6 +384,17 @@  int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
 		return 0;
 	}
 
+	/*
+	 * If the device needs zone append emulation, and the device already has
+	 * zone append emulation resources, make sure that the chunk_sectors
+	 * hasn't changed size. Otherwise those resources will be garbage.
+	 */
+	if (!lim->max_hw_zone_append_sectors && disk->zone_wplugs_hash &&
+	    q->limits.chunk_sectors != lim->chunk_sectors) {
+		DMERR("Cannot change zone size when swapping tables");
+		return -EINVAL;
+	}
+
 	/*
 	 * Warn once (when the capacity is not yet set) if the mapped device is
 	 * partially using zone resources of the target devices as that leads to