Message ID | 20250309222904.449803-8-bmarzins@redhat.com
---|---
State | New
Series | dm: fix issues with swapping dm tables
On 3/10/25 07:29, Benjamin Marzinski wrote:
> dm_revalidate_zones() only allowed devices that had no zone resources
> set up to call blk_revalidate_disk_zones(). If the device already had
> zone resources, disk->nr_zones would always equal md->nr_zones so
> dm_revalidate_zones() returned without doing any work. Instead, always
> call blk_revalidate_disk_zones() if you are loading a new zoned table.
>
> However, if the device emulates zone append operations and already has
> zone append emulation resources, the table size cannot change when
> loading a new table. Otherwise, all those resources will be garbage.
>
> If emulated zone append operations are needed and the zone write pointer
> offsets of the new table do not match those of the old table, writes to
> the device will still fail. This patch allows users to safely grow and
> shrink zone devices. But swapping arbitrary zoned tables will still not
> work.

I do not think that this patch correctly addresses the shrinking of a dm
zoned device: blk_revalidate_disk_zones() will look at a smaller set of
zones, which will leave already hashed zone write plugs outside of that
new zone range in the disk zwplug hash table.
disk_revalidate_zone_resources() does not clean up and reallocate the
hash table if the number of zones shrinks. For a physical drive, this can
only happen if the drive is reformatted with some magic vendor unique
command, which is why this was never implemented, as that is not a valid
production use case.

To make things simpler, I think we should not allow growing/shrinking
zoned device tables, much less swapping tables between zoned and
not-zoned. I am more inclined to avoid all these corner cases by simply
not supporting table switching for zoned devices. That would be much
safer, I think. No-one complained about any issue with table switching
until now, which likely means that no-one is using this. So what about
simply returning an error for table switching for a zoned device? If
someone requests this support, we can revisit this.

>
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> ---
>  drivers/md/dm-zone.c | 23 +++++++++++++----------
>  1 file changed, 13 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index ac86011640c3..7e9ebeee7eac 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -164,16 +164,8 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
>  	if (!get_capacity(disk))
>  		return 0;
>
> -	/* Revalidate only if something changed. */
> -	if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
> -		DMINFO("%s using %s zone append",
> -		       disk->disk_name,
> -		       queue_emulates_zone_append(q) ? "emulated" : "native");
> -		md->nr_zones = 0;
> -	}
> -
> -	if (md->nr_zones)
> -		return 0;
> +	DMINFO("%s using %s zone append", disk->disk_name,
> +	       queue_emulates_zone_append(q) ? "emulated" : "native");
>
>  	/*
>  	 * Our table is not live yet. So the call to dm_get_live_table()
> @@ -392,6 +384,17 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
>  		return 0;
>  	}
>
> +	/*
> +	 * If the device needs zone append emulation, and the device already has
> +	 * zone append emulation resources, make sure that the chunk_sectors
> +	 * hasn't changed size. Otherwise those resources will be garbage.
> +	 */
> +	if (!lim->max_hw_zone_append_sectors && disk->zone_wplugs_hash &&
> +	    q->limits.chunk_sectors != lim->chunk_sectors) {
> +		DMERR("Cannot change zone size when swapping tables");
> +		return -EINVAL;
> +	}
> +
>  	/*
>  	 * Warn once (when the capacity is not yet set) if the mapped device is
>  	 * partially using zone resources of the target devices as that leads to
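To make the stale-plug scenario described above concrete, here is a small
self-contained C model of a zone write plug hash keyed by zone number. This is
deliberately simplified userspace code, not the block layer's actual
disk->zone_wplugs_hash implementation; the names (zwplug, plug_zone,
revalidate) are invented for illustration only.

	/* Toy userspace model of stale zone write plugs after a shrink. */
	#include <stdio.h>
	#include <stdlib.h>

	#define HASH_BUCKETS 16

	struct zwplug {
		unsigned int zone_no;
		struct zwplug *next;	/* hash chain */
	};

	static struct zwplug *hash[HASH_BUCKETS];

	static void plug_zone(unsigned int zone_no)
	{
		struct zwplug *p = malloc(sizeof(*p));

		p->zone_no = zone_no;
		p->next = hash[zone_no % HASH_BUCKETS];
		hash[zone_no % HASH_BUCKETS] = p;
	}

	/*
	 * Revalidation only walks zones [0, nr_zones): it never visits
	 * plugs hashed for zones beyond the new, smaller range.
	 */
	static void revalidate(unsigned int nr_zones)
	{
		for (unsigned int z = 0; z < nr_zones; z++)
			; /* report/update zone z; out-of-range plugs untouched */
	}

	int main(void)
	{
		plug_zone(3);
		plug_zone(7);	/* writes in flight on zone 7 */

		revalidate(4);	/* table swapped for a smaller device */

		/* The plug for zone 7 is still hashed, now out of range. */
		for (int b = 0; b < HASH_BUCKETS; b++)
			for (struct zwplug *p = hash[b]; p; p = p->next)
				if (p->zone_no >= 4)
					printf("stale plug for zone %u\n", p->zone_no);
		return 0;
	}

Running this prints "stale plug for zone 7": revalidating over the shrunken
range never touches the plug hashed for the now out-of-range zone, which is
exactly the leftover state the review comment points at.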
On Mon, Mar 10, 2025 at 08:59:26AM +0900, Damien Le Moal wrote:
> On 3/10/25 07:29, Benjamin Marzinski wrote:
> > dm_revalidate_zones() only allowed devices that had no zone resources
> > set up to call blk_revalidate_disk_zones(). If the device already had
> > zone resources, disk->nr_zones would always equal md->nr_zones so
> > dm_revalidate_zones() returned without doing any work. Instead, always
> > call blk_revalidate_disk_zones() if you are loading a new zoned table.
> >
> > However, if the device emulates zone append operations and already has
> > zone append emulation resources, the table size cannot change when
> > loading a new table. Otherwise, all those resources will be garbage.
> >
> > If emulated zone append operations are needed and the zone write pointer
> > offsets of the new table do not match those of the old table, writes to
> > the device will still fail. This patch allows users to safely grow and
> > shrink zone devices. But swapping arbitrary zoned tables will still not
> > work.
>
> I do not think that this patch correctly addresses the shrinking of a dm
> zoned device: blk_revalidate_disk_zones() will look at a smaller set of
> zones, which will leave already hashed zone write plugs outside of that
> new zone range in the disk zwplug hash table.
> disk_revalidate_zone_resources() does not clean up and reallocate the
> hash table if the number of zones shrinks.

This is necessary for DM. There could be plugged bios on these
no-longer-in-range zones. They will obviously fail when they get
reissued, but we need to keep the plugs around so that they *do* get
reissued. A cleaner alternative would be to add code to immediately
error out all the plugged bios on shrinks, but I was trying to avoid
adding a bunch of code to deal with these cases (of course simply
disallowing them adds even less code).

-Ben

> For a physical drive, this can only happen if the drive is reformatted
> with some magic vendor unique command, which is why this was never
> implemented, as that is not a valid production use case.
>
> To make things simpler, I think we should not allow growing/shrinking
> zoned device tables, much less swapping tables between zoned and
> not-zoned. I am more inclined to avoid all these corner cases by simply
> not supporting table switching for zoned devices. That would be much
> safer, I think. No-one complained about any issue with table switching
> until now, which likely means that no-one is using this. So what about
> simply returning an error for table switching for a zoned device? If
> someone requests this support, we can revisit this.
>
> > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> > ---
> >  drivers/md/dm-zone.c | 23 +++++++++++++----------
> >  1 file changed, 13 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> > index ac86011640c3..7e9ebeee7eac 100644
> > --- a/drivers/md/dm-zone.c
> > +++ b/drivers/md/dm-zone.c
> > @@ -164,16 +164,8 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
> >  	if (!get_capacity(disk))
> >  		return 0;
> >
> > -	/* Revalidate only if something changed. */
> > -	if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
> > -		DMINFO("%s using %s zone append",
> > -		       disk->disk_name,
> > -		       queue_emulates_zone_append(q) ? "emulated" : "native");
> > -		md->nr_zones = 0;
> > -	}
> > -
> > -	if (md->nr_zones)
> > -		return 0;
> > +	DMINFO("%s using %s zone append", disk->disk_name,
> > +	       queue_emulates_zone_append(q) ? "emulated" : "native");
> >
> >  	/*
> >  	 * Our table is not live yet. So the call to dm_get_live_table()
> > @@ -392,6 +384,17 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> >  		return 0;
> >  	}
> >
> > +	/*
> > +	 * If the device needs zone append emulation, and the device already has
> > +	 * zone append emulation resources, make sure that the chunk_sectors
> > +	 * hasn't changed size. Otherwise those resources will be garbage.
> > +	 */
> > +	if (!lim->max_hw_zone_append_sectors && disk->zone_wplugs_hash &&
> > +	    q->limits.chunk_sectors != lim->chunk_sectors) {
> > +		DMERR("Cannot change zone size when swapping tables");
> > +		return -EINVAL;
> > +	}
> > +
> >  	/*
> >  	 * Warn once (when the capacity is not yet set) if the mapped device is
> >  	 * partially using zone resources of the target devices as that leads to
>
> --
> Damien Le Moal
> Western Digital Research
On 3/11/25 02:43, Benjamin Marzinski wrote:
> On Mon, Mar 10, 2025 at 08:59:26AM +0900, Damien Le Moal wrote:
>> On 3/10/25 07:29, Benjamin Marzinski wrote:
>>> dm_revalidate_zones() only allowed devices that had no zone resources
>>> set up to call blk_revalidate_disk_zones(). If the device already had
>>> zone resources, disk->nr_zones would always equal md->nr_zones so
>>> dm_revalidate_zones() returned without doing any work. Instead, always
>>> call blk_revalidate_disk_zones() if you are loading a new zoned table.
>>>
>>> However, if the device emulates zone append operations and already has
>>> zone append emulation resources, the table size cannot change when
>>> loading a new table. Otherwise, all those resources will be garbage.
>>>
>>> If emulated zone append operations are needed and the zone write pointer
>>> offsets of the new table do not match those of the old table, writes to
>>> the device will still fail. This patch allows users to safely grow and
>>> shrink zone devices. But swapping arbitrary zoned tables will still not
>>> work.
>>
>> I do not think that this patch correctly addresses the shrinking of a dm
>> zoned device: blk_revalidate_disk_zones() will look at a smaller set of
>> zones, which will leave already hashed zone write plugs outside of that
>> new zone range in the disk zwplug hash table.
>> disk_revalidate_zone_resources() does not clean up and reallocate the
>> hash table if the number of zones shrinks.
>
> This is necessary for DM. There could be plugged bios on these
> no-longer-in-range zones. They will obviously fail when they get
> reissued, but we need to keep the plugs around so that they *do* get
> reissued. A cleaner alternative would be to add code to immediately
> error out all the plugged bios on shrinks, but I was trying to avoid
> adding a bunch of code to deal with these cases (of course simply
> disallowing them adds even less code).

I am confused now :)

Under the assumption that we do not allow switching to a new table that
changes the zone configuration (in particular, there is no grow/shrink
of the device), I do not think we have to do anything special for DM.
On Tue, Mar 11, 2025 at 08:19:29AM +0900, Damien Le Moal wrote:
> On 3/11/25 02:43, Benjamin Marzinski wrote:
> > On Mon, Mar 10, 2025 at 08:59:26AM +0900, Damien Le Moal wrote:
> >> On 3/10/25 07:29, Benjamin Marzinski wrote:
> >>> dm_revalidate_zones() only allowed devices that had no zone resources
> >>> set up to call blk_revalidate_disk_zones(). If the device already had
> >>> zone resources, disk->nr_zones would always equal md->nr_zones so
> >>> dm_revalidate_zones() returned without doing any work. Instead, always
> >>> call blk_revalidate_disk_zones() if you are loading a new zoned table.
> >>>
> >>> However, if the device emulates zone append operations and already has
> >>> zone append emulation resources, the table size cannot change when
> >>> loading a new table. Otherwise, all those resources will be garbage.
> >>>
> >>> If emulated zone append operations are needed and the zone write pointer
> >>> offsets of the new table do not match those of the old table, writes to
> >>> the device will still fail. This patch allows users to safely grow and
> >>> shrink zone devices. But swapping arbitrary zoned tables will still not
> >>> work.
> >>
> >> I do not think that this patch correctly addresses the shrinking of a dm
> >> zoned device: blk_revalidate_disk_zones() will look at a smaller set of
> >> zones, which will leave already hashed zone write plugs outside of that
> >> new zone range in the disk zwplug hash table.
> >> disk_revalidate_zone_resources() does not clean up and reallocate the
> >> hash table if the number of zones shrinks.
> >
> > This is necessary for DM. There could be plugged bios on these
> > no-longer-in-range zones. They will obviously fail when they get
> > reissued, but we need to keep the plugs around so that they *do* get
> > reissued. A cleaner alternative would be to add code to immediately
> > error out all the plugged bios on shrinks, but I was trying to avoid
> > adding a bunch of code to deal with these cases (of course simply
> > disallowing them adds even less code).
>
> I am confused now :)
> Under the assumption that we do not allow switching to a new table that
> changes the zone configuration (in particular, there is no grow/shrink
> of the device), I do not think we have to do anything special for DM.

If we don't allow switching between zoned tables, then we obviously
don't need to make DM call blk_revalidate_disk_zones() on a zoned table
switch. I was just saying that I know this patch would leave
out-of-range zone plugs behind on a shrink, but that is necessary to
allow shrinking while there could still be outstanding plugged bios
attached to those plugs.

So, if we wanted to allow shrinking, then I think this patch is correct
(although erroring out all the out-of-range bios would be a cleaner
solution). But assuming we don't allow shrinking, you are correct:
we don't need to do anything special for DM.

-Ben

>
> --
> Damien Le Moal
> Western Digital Research
On 3/11/25 08:42, Benjamin Marzinski wrote:
> On Tue, Mar 11, 2025 at 08:19:29AM +0900, Damien Le Moal wrote:
>> On 3/11/25 02:43, Benjamin Marzinski wrote:
>>> On Mon, Mar 10, 2025 at 08:59:26AM +0900, Damien Le Moal wrote:
>>>> On 3/10/25 07:29, Benjamin Marzinski wrote:
>>>>> dm_revalidate_zones() only allowed devices that had no zone resources
>>>>> set up to call blk_revalidate_disk_zones(). If the device already had
>>>>> zone resources, disk->nr_zones would always equal md->nr_zones so
>>>>> dm_revalidate_zones() returned without doing any work. Instead, always
>>>>> call blk_revalidate_disk_zones() if you are loading a new zoned table.
>>>>>
>>>>> However, if the device emulates zone append operations and already has
>>>>> zone append emulation resources, the table size cannot change when
>>>>> loading a new table. Otherwise, all those resources will be garbage.
>>>>>
>>>>> If emulated zone append operations are needed and the zone write pointer
>>>>> offsets of the new table do not match those of the old table, writes to
>>>>> the device will still fail. This patch allows users to safely grow and
>>>>> shrink zone devices. But swapping arbitrary zoned tables will still not
>>>>> work.
>>>>
>>>> I do not think that this patch correctly addresses the shrinking of a dm
>>>> zoned device: blk_revalidate_disk_zones() will look at a smaller set of
>>>> zones, which will leave already hashed zone write plugs outside of that
>>>> new zone range in the disk zwplug hash table.
>>>> disk_revalidate_zone_resources() does not clean up and reallocate the
>>>> hash table if the number of zones shrinks.
>>>
>>> This is necessary for DM. There could be plugged bios on these
>>> no-longer-in-range zones. They will obviously fail when they get
>>> reissued, but we need to keep the plugs around so that they *do* get
>>> reissued. A cleaner alternative would be to add code to immediately
>>> error out all the plugged bios on shrinks, but I was trying to avoid
>>> adding a bunch of code to deal with these cases (of course simply
>>> disallowing them adds even less code).
>>
>> I am confused now :)
>> Under the assumption that we do not allow switching to a new table that
>> changes the zone configuration (in particular, there is no grow/shrink
>> of the device), I do not think we have to do anything special for DM.
>
> If we don't allow switching between zoned tables, then we obviously
> don't need to make DM call blk_revalidate_disk_zones() on a zoned table
> switch. I was just saying that I know this patch would leave
> out-of-range zone plugs behind on a shrink, but that is necessary to
> allow shrinking while there could still be outstanding plugged bios
> attached to those plugs.
>
> So, if we wanted to allow shrinking, then I think this patch is correct
> (although erroring out all the out-of-range bios would be a cleaner
> solution). But assuming we don't allow shrinking, you are correct:
> we don't need to do anything special for DM.

OK. Got it.

As mentioned, I think we really still need to allow switching (or
inserting?) dm-error. In that case, calling blk_revalidate_disk_zones()
should still be fine as long as there is no change to the zone config,
which is true for dm-error. Do you agree?
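If the thread settles on rejecting any swap that changes the zone
configuration while still allowing an identical-config switch (e.g. to
dm-error), the shape of the check might be roughly the following. This is a
sketch against invented stub types, not a patch against dm-zone.c; only the
field names nr_zones and chunk_sectors are taken from the discussion above:

	/*
	 * Sketch of rejecting table swaps that change the zone configuration.
	 * Stub types stand in for the kernel's; compiles as userspace C.
	 */
	#include <stdio.h>
	#include <errno.h>

	struct queue_limits { unsigned int chunk_sectors; unsigned int nr_zones; };
	struct gendisk { struct queue_limits limits; };

	/* Allow a swap (e.g. to dm-error) only if the zone config is identical. */
	static int check_zone_config_swap(const struct gendisk *disk,
					  const struct queue_limits *new_lim)
	{
		if (disk->limits.nr_zones != new_lim->nr_zones ||
		    disk->limits.chunk_sectors != new_lim->chunk_sectors) {
			fprintf(stderr, "cannot change zone config when swapping tables\n");
			return -EINVAL;
		}
		return 0;
	}

	int main(void)
	{
		struct gendisk disk = { .limits = { .chunk_sectors = 524288, .nr_zones = 100 } };
		struct queue_limits same = disk.limits;
		struct queue_limits shrunk = { .chunk_sectors = 524288, .nr_zones = 50 };

		printf("same config: %d\n", check_zone_config_swap(&disk, &same));   /* 0 */
		printf("shrunk:      %d\n", check_zone_config_swap(&disk, &shrunk)); /* -EINVAL */
		return 0;
	}

A dm-error swap keeps nr_zones and chunk_sectors unchanged, so it passes such
a check, while any grow/shrink or zone-size change is refused up front.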
dm_revalidate_zones() only allowed devices that had no zone resources
set up to call blk_revalidate_disk_zones(). If the device already had
zone resources, disk->nr_zones would always equal md->nr_zones so
dm_revalidate_zones() returned without doing any work. Instead, always
call blk_revalidate_disk_zones() if you are loading a new zoned table.

However, if the device emulates zone append operations and already has
zone append emulation resources, the table size cannot change when
loading a new table. Otherwise, all those resources will be garbage.

If emulated zone append operations are needed and the zone write pointer
offsets of the new table do not match those of the old table, writes to
the device will still fail. This patch allows users to safely grow and
shrink zone devices. But swapping arbitrary zoned tables will still not
work.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 drivers/md/dm-zone.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index ac86011640c3..7e9ebeee7eac 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -164,16 +164,8 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
 	if (!get_capacity(disk))
 		return 0;
 
-	/* Revalidate only if something changed. */
-	if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
-		DMINFO("%s using %s zone append",
-		       disk->disk_name,
-		       queue_emulates_zone_append(q) ? "emulated" : "native");
-		md->nr_zones = 0;
-	}
-
-	if (md->nr_zones)
-		return 0;
+	DMINFO("%s using %s zone append", disk->disk_name,
+	       queue_emulates_zone_append(q) ? "emulated" : "native");
 
 	/*
 	 * Our table is not live yet. So the call to dm_get_live_table()
@@ -392,6 +384,17 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
 		return 0;
 	}
 
+	/*
+	 * If the device needs zone append emulation, and the device already has
+	 * zone append emulation resources, make sure that the chunk_sectors
+	 * hasn't changed size. Otherwise those resources will be garbage.
+	 */
+	if (!lim->max_hw_zone_append_sectors && disk->zone_wplugs_hash &&
+	    q->limits.chunk_sectors != lim->chunk_sectors) {
+		DMERR("Cannot change zone size when swapping tables");
+		return -EINVAL;
+	}
+
 	/*
 	 * Warn once (when the capacity is not yet set) if the mapped device is
 	 * partially using zone resources of the target devices as that leads to