mbox series

[v2,0/6] dm: fix issues with swapping dm tables

Message ID 20250317044510.2200856-1-bmarzins@redhat.com (mailing list archive)
Headers show
Series dm: fix issues with swapping dm tables | expand

Message

Benjamin Marzinski March 17, 2025, 4:45 a.m. UTC
There were multiple places in dm's __bind() function where it could fail
and not completely roll back, leaving the device using the the old
table, but with device limits and resources from the new table.
Additionally, unused mempools for request-based devices were not always
freed immediately.

Finally, there were a number of issues with switching zoned tables that
emulate zone append (in other words, dm-crypt on top of zoned devices).
dm_blk_report_zones() could be called while the device was suspended and
modifying zoned resources or could possibly fail to end a srcu read
section.  More importantly, blk_revalidate_disk_zones() would never get
called when updating a zoned table. This could cause the dm device to
see the wrong zone write offsets, not have a large enough zwplugs
reserved in its mempool, or read invalid memory when checking the
conventional zones bitmap.

This patchset fixes these issues. It deals with the problems around
blk_revalidate_disk_zones() by only calling it for devices that have no 
zone write plug resources. This will always correctly update the zoned
settings. If a device has zone write plug resources, calling
blk_revalidate_disk_zones() will not correctly update them in many
cases, so DM simply doesn't call it for devices with zone write plug
resources. Instead of allowing people to load tables that can break the
device, like currently happens, DM disallosw any table reloads that
change the zoned setting for devices that already have zone write plug
resources.

Specifically, if a device already has zone plug resources allocated, it
can only switch to another zoned table that also emulates zone append.
Also, it cannot change the device size or the zone size. There are some
tweaks to make sure that a device can always switch to an error target.

Changes in V2:
- Made queue_limits_set() optionally return the old limits (grabbed
  while holding the limits_lock), and used this in
  dm_table_set_restrictions()
- dropped changes to disk_free_zone_resources() and the
  blk_revalidate_disk_zones() code path (removed patches 0005 & 0006)
- Instead of always calling blk_revalidate_disk_zones(), disallow
  changes that would change zone settings if the device has
  zone write plug resources (final patch).

Benjamin Marzinski (6):
  dm: don't change md if dm_table_set_restrictions() fails
  dm: free table mempools if not used in __bind
  block: make queue_limits_set() optionally return old limits
  dm: handle failures in dm_table_set_restrictions
  dm: fix dm_blk_report_zones
  dm: limit swapping tables for devices with zone write plugs

 block/blk-settings.c   |  9 ++++-
 drivers/md/dm-core.h   |  1 +
 drivers/md/dm-table.c  | 66 ++++++++++++++++++++++++++-------
 drivers/md/dm-zone.c   | 84 +++++++++++++++++++++++++++++-------------
 drivers/md/dm.c        | 36 +++++++++++-------
 drivers/md/dm.h        |  6 +++
 drivers/md/md-linear.c |  2 +-
 drivers/md/raid0.c     |  2 +-
 drivers/md/raid1.c     |  2 +-
 drivers/md/raid10.c    |  2 +-
 drivers/md/raid5.c     |  2 +-
 include/linux/blkdev.h |  3 +-
 12 files changed, 154 insertions(+), 61 deletions(-)

Comments

Benjamin Marzinski March 19, 2025, 7:07 p.m. UTC | #1
I noticed another issue while playing areound with this.  If you have a
linear device stacked on top of a zoned dm-crypt device, it will
allocate zone write plug resources, although it won't use them.
Actually, this is also true if you create a linear device on top of the
a zoned scsi_debug device.

The reason is that both dm-crypt and scsi_debug set
lim->max_hw_zone_append_sectors to 0. This makes sense, both of them are
simply emulating zone append. DM's check for devices that need zone
append emulation, dm_table_supports_zone_append(), correctly identifies
these linear devices as not needing emulation, so the
DMF_EMULATE_ZONE_APPEND isn't set and dm doesn't do zone write plugging.
But since these devices have lim->max_hw_zone_append_sectors = 0,
blk_revalidate_disk_zones() thinks they do need zone append emulation,
and allocates resources for it.

My question is "How do we solve this?" The easy DM answer is to change
dm_set_zones_restrictions() to do something like:

if (!dm_table_supports_zone_append(t))
	lim->max_hw_zone_append_sectors = 0;
else if (lim->max_hw_zone_append_sectors == 0)
	lim->max_hw_zone_append_sectors = lim->max_zone_append_sectors;

But I wonder if the correct fix is to change blk_stack_limits()
to do something like:

t->max_hw_zone_append_sectors = min(t->max_hw_zone_append_sectors,
                                    b->max_zone_append_sectors);

Even if the bottom device is emulating zone appends, from the point
of view of the top device, that is the max zone append sectors that
the underlying *hardware* supports. It's just that more drivers
than just DM use this, and prehaps some really do need to know what
the actual hardware supports, and not just what the device below then
can support or emulate.

Thoughts?

-Ben

On Mon, Mar 17, 2025 at 12:45:04AM -0400, Benjamin Marzinski wrote:
> There were multiple places in dm's __bind() function where it could fail
> and not completely roll back, leaving the device using the the old
> table, but with device limits and resources from the new table.
> Additionally, unused mempools for request-based devices were not always
> freed immediately.
> 
> Finally, there were a number of issues with switching zoned tables that
> emulate zone append (in other words, dm-crypt on top of zoned devices).
> dm_blk_report_zones() could be called while the device was suspended and
> modifying zoned resources or could possibly fail to end a srcu read
> section.  More importantly, blk_revalidate_disk_zones() would never get
> called when updating a zoned table. This could cause the dm device to
> see the wrong zone write offsets, not have a large enough zwplugs
> reserved in its mempool, or read invalid memory when checking the
> conventional zones bitmap.
> 
> This patchset fixes these issues. It deals with the problems around
> blk_revalidate_disk_zones() by only calling it for devices that have no 
> zone write plug resources. This will always correctly update the zoned
> settings. If a device has zone write plug resources, calling
> blk_revalidate_disk_zones() will not correctly update them in many
> cases, so DM simply doesn't call it for devices with zone write plug
> resources. Instead of allowing people to load tables that can break the
> device, like currently happens, DM disallosw any table reloads that
> change the zoned setting for devices that already have zone write plug
> resources.
> 
> Specifically, if a device already has zone plug resources allocated, it
> can only switch to another zoned table that also emulates zone append.
> Also, it cannot change the device size or the zone size. There are some
> tweaks to make sure that a device can always switch to an error target.
> 
> Changes in V2:
> - Made queue_limits_set() optionally return the old limits (grabbed
>   while holding the limits_lock), and used this in
>   dm_table_set_restrictions()
> - dropped changes to disk_free_zone_resources() and the
>   blk_revalidate_disk_zones() code path (removed patches 0005 & 0006)
> - Instead of always calling blk_revalidate_disk_zones(), disallow
>   changes that would change zone settings if the device has
>   zone write plug resources (final patch).
> 
> Benjamin Marzinski (6):
>   dm: don't change md if dm_table_set_restrictions() fails
>   dm: free table mempools if not used in __bind
>   block: make queue_limits_set() optionally return old limits
>   dm: handle failures in dm_table_set_restrictions
>   dm: fix dm_blk_report_zones
>   dm: limit swapping tables for devices with zone write plugs
> 
>  block/blk-settings.c   |  9 ++++-
>  drivers/md/dm-core.h   |  1 +
>  drivers/md/dm-table.c  | 66 ++++++++++++++++++++++++++-------
>  drivers/md/dm-zone.c   | 84 +++++++++++++++++++++++++++++-------------
>  drivers/md/dm.c        | 36 +++++++++++-------
>  drivers/md/dm.h        |  6 +++
>  drivers/md/md-linear.c |  2 +-
>  drivers/md/raid0.c     |  2 +-
>  drivers/md/raid1.c     |  2 +-
>  drivers/md/raid10.c    |  2 +-
>  drivers/md/raid5.c     |  2 +-
>  include/linux/blkdev.h |  3 +-
>  12 files changed, 154 insertions(+), 61 deletions(-)
> 
> -- 
> 2.48.1
>