diff mbox series

[v2,5/6] dm: fix dm_blk_report_zones

Message ID 20250317044510.2200856-6-bmarzins@redhat.com (mailing list archive)
State New
Series dm: fix issues with swapping dm tables

Commit Message

Benjamin Marzinski March 17, 2025, 4:45 a.m. UTC
If dm_get_live_table() returned NULL, dm_put_live_table() was never
called.  Also, if md->zone_revalidate_map is set, check that
dm_blk_report_zones() is being called from the process that set it in
__bind(). Otherwise the zone resources could change while accessing
them. Finally, it is possible that md->zone_revalidate_map will change
while calling this function. Only read it once, so that we are always
using the same value. Otherwise we might miss a call to
dm_put_live_table().

Fixes: f211268ed1f9b ("dm: Use the block layer zone append emulation")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 drivers/md/dm-core.h |  1 +
 drivers/md/dm-zone.c | 23 +++++++++++++++--------
 2 files changed, 16 insertions(+), 8 deletions(-)

Comments

Christoph Hellwig March 18, 2025, 6:57 a.m. UTC | #1
On Mon, Mar 17, 2025 at 12:45:09AM -0400, Benjamin Marzinski wrote:
> __bind(). Otherwise the zone resources could change while accessing
> them. Finally, it is possible that md->zone_revalidate_map will change
> while calling this function. Only read it once, so that we are always
> using the same value. Otherwise we might miss a call to
> dm_put_live_table().

This checking for calling context is pretty ugly.  Can you just use
a separate report zones helper or at least a clearly documented flag
for it?
Benjamin Marzinski March 18, 2025, 10:10 p.m. UTC | #2
On Tue, Mar 18, 2025 at 07:57:21AM +0100, Christoph Hellwig wrote:
> On Mon, Mar 17, 2025 at 12:45:09AM -0400, Benjamin Marzinski wrote:
> > __bind(). Otherwise the zone resources could change while accessing
> > them. Finally, it is possible that md->zone_revalidate_map will change
> > while calling this function. Only read it once, so that we are always
> > using the same value. Otherwise we might miss a call to
> > dm_put_live_table().
> 
> This checking for calling context is pretty ugly.  Can you just use
> a separate report zones helper or at least a clearly documented flag for it?

Not sure how that would work. The goal is to keep another process,
calling something like blkdev_report_zones_ioctl(), from being able to
call its report_zones_cb, while DM is in blk_revalidate_disk_zones()
which needs to use that same disk->fops->report_zones() interface in
order to call blk_revalidate_zone_cb(). We need some way to distinguish
between the callers. We could export blk_revalidate_zone_cb() and have
dm_blk_report_zones() check if it was called with that report_zones_cb.
That seems just as ugly to me. But if you like that better, fine.
Otherwise I don't see how we can distinguish between the call from
blk_revalidate_disk_zones() and a call from something like
blkdev_report_zones_ioctl(). Am I not understanding your suggestion?

Allowing the blkdev_report_zones_ioctl() call to go ahead using
md->zone_revalidate_map runs into three problems.

1. It's running while the device is switching tables, and there's no
guarantee that DM won't encounter an error and have to fail back. So it
could report information for this unused table. We could just say that
this is what you get from trying to grab the zone information of a
device while it's switching tables. Fine.

2. Without this patch, it's possible that while
blkdev_report_zones_ioctl() is still running its report_zones_cb, DM
fails switching tables and frees the new table that's being used by
blkdev_report_zones_ioctl(), causing use-after-free errors. However,
this is solvable by calling srcu_read_lock(&md->io_barrier) to make sure
that we're in a SRCU read section while using md->zone_revalidate_map.
Making that change should make DM as safe as any other zoned device,
which brings me to...

3. On any zoned device, not just DM, what's stopping
one process from syncing the zone write plug offsets:
blkdev_report_zones_ioctl() -> blkdev_report_zones() ->
various.report_zones() -> disk_report_zones_cb() ->
disk_zone_wplug_sync_wp_offset()

While another process is running into problems while dealing with the
gendisk's zone configuration changing:

blk_revalidate_disk_zones() -> disk_free_zone_resources()

I don't see any synchronizing between these two call paths. Am I missing
something that stops this? Can this only happen for DM devices, for some
reason? If this can't happen, I'm fine with just adding a
srcu_read_lock() to dm_blk_report_zones() and calling it good.  If it
can happen, then we should fix that.

-Ben
Damien Le Moal March 19, 2025, 1:40 a.m. UTC | #3
On 3/19/25 7:10 AM, Benjamin Marzinski wrote:
> On Tue, Mar 18, 2025 at 07:57:21AM +0100, Christoph Hellwig wrote:
>> On Mon, Mar 17, 2025 at 12:45:09AM -0400, Benjamin Marzinski wrote:
>>> __bind(). Otherwise the zone resources could change while accessing
>>> them. Finally, it is possible that md->zone_revalidate_map will change
>>> while calling this function. Only read it once, so that we are always
>>> using the same value. Otherwise we might miss a call to
>>> dm_put_live_table().
>>
>> This checking for calling context is pretty ugly.  Can you just use
>> a separate report zones helper or at least a clearly documented flag for it?
> 
> Not sure how that would work. The goal is to keep another process,
> calling something like blkdev_report_zones_ioctl(), from being able to
> call its report_zones_cb, while DM is in blk_revalidate_disk_zones()
> which needs to use that same disk->fops->report_zones() interface in
> order to call blk_revalidate_zone_cb(). We need some way to distinguish
> between the callers. We could export blk_revalidate_zone_cb() and have
> dm_blk_report_zones() check if it was called with that report_zones_cb.
> That seems just as ugly to me. But if you like that better, fine.
> Otherwise I don't see how we can distinguish between the call from
> blk_revalidate_disk_zones() and a call from something like
> blkdev_report_zones_ioctl(). Am I not understanding your suggestion?
> 
> Allowing the blkdev_report_zones_ioctl() call to go ahead using
> md->zone_revalidate_map runs into three problems.
> 
> 1. It's running while the device is switching tables, and there's no
> guarantee that DM won't encounter an error and have to fail back. So it
> could report information for this unused table. We could just say that
> this is what you get from trying to grab the zone information of a
> device while it's switching tables. Fine.
> 
> 2. Without this patch, it's possible that while
> blkdev_report_zones_ioctl() is still running its report_zones_cb, DM
> fails switching tables and frees the new table that's being used by
> blkdev_report_zones_ioctl(), causing use-after-free errors. However,
> this is solvable by calling srcu_read_lock(&md->io_barrier) to make sure
> that we're in a SRCU read section while using md->zone_revalidate_map.
> Making that change should make DM as safe as any other zoned device,
> which brings me to...
> 
> 3. On any zoned device, not just DM, what's stopping
> one process from syncing the zone write plug offsets:
> blkdev_report_zones_ioctl() -> blkdev_report_zones() ->
> various.report_zones() -> disk_report_zones_cb() ->
> disk_zone_wplug_sync_wp_offset()

disk_zone_wplug_sync_wp_offset() is a nop for any zone write plug that does not
have BLK_ZONE_WPLUG_NEED_WP_UPDATE set, and that flag is set only for zones
that had a write error. Given that recovering from a write error on a zoned
device requires the user to:
1) stop writing to the zone,
2) do a report zones (which will sync the wp offset), and
3) restart writing,
there are no write I/Os in flight for the zone being reported, for a well-behaved
user. If the user is not well behaved, it will get errors/an out-of-sync wp
until it behaves :) So I would not worry too much about this.
As we discussed, the table switching should be allowed only for the wildcard
target (dm-error) and that's it. We should not allow table switching otherwise.
And given that inserting dm-error does not cause any change to the zone
configuration, I do not see why we need to revalidate zones.

> 
> While another process is running into problems while dealing with the
> gendisk's zone configuration changing:
> 
> blk_revalidate_disk_zones() -> disk_free_zone_resources()

We should not allow the zone configuration to change. That is impossible to
deal with at the user level.

> I don't see any synchronizing between these two call paths. Am I missing
> something that stops this? Can this only happen for DM devices, for some
> reason? If this can't happen, I'm fine with just adding a
> srcu_read_lock() to dm_blk_report_zones() and calling it good.  If it
> can happen, then we should fix that.

I think it can happen today but no-one ever had this issue because no-one has
tried to switch a dm-table on a zoned drive. And I really think we should
prevent that from being done, except for dm-error since that's used for fstests.

> 
> -Ben
>
Benjamin Marzinski March 19, 2025, 4:50 p.m. UTC | #4
On Wed, Mar 19, 2025 at 10:40:38AM +0900, Damien Le Moal wrote:
> On 3/19/25 7:10 AM, Benjamin Marzinski wrote:
> > On Tue, Mar 18, 2025 at 07:57:21AM +0100, Christoph Hellwig wrote:
> >> On Mon, Mar 17, 2025 at 12:45:09AM -0400, Benjamin Marzinski wrote:
> >>> __bind(). Otherwise the zone resources could change while accessing
> >>> them. Finally, it is possible that md->zone_revalidate_map will change
> >>> while calling this function. Only read it once, so that we are always
> >>> using the same value. Otherwise we might miss a call to
> >>> dm_put_live_table().
> >>
> >> This checking for calling context is pretty ugly.  Can you just use
> >> a separate report zones helper or at least a clearly documented flag for it?
> > 
> > Not sure how that would work. The goal is to keep another process,
> > calling something like blkdev_report_zones_ioctl(), from being able to
> > call its report_zones_cb, while DM is in blk_revalidate_disk_zones()
> > which needs to use that same disk->fops->report_zones() interface in
> > order to call blk_revalidate_zone_cb(). We need some way to distinguish
> > between the callers. We could export blk_revalidate_zone_cb() and have
> > dm_blk_report_zones() check if it was called with that report_zones_cb.
> > That seems just as ugly to me. But if you like that better, fine.
> > Otherwise I don't see how we can distinguish between the call from
> > blk_revalidate_disk_zones() and a call from something like
> > blkdev_report_zones_ioctl(). Am I not understanding your suggestion?
> > 
> > Allowing the blkdev_report_zones_ioctl() call to go ahead using
> > md->zone_revalidate_map runs into three problems.
> > 
> > 1. It's running while the device is switching tables, and there's no
> > guarantee that DM won't encounter an error and have to fail back. So it
> > could report information for this unused table. We could just say that
> > this is what you get from trying to grab the zone information of a
> > device while it's switching tables. Fine.
> > 
> > 2. Without this patch, it's possible that while
> > blkdev_report_zones_ioctl() is still running its report_zones_cb, DM
> > fails switching tables and frees the new table that's being used by
> > blkdev_report_zones_ioctl(), causing use-after-free errors. However,
> > this is solvable by calling srcu_read_lock(&md->io_barrier) to make sure
> > that we're in a SRCU read section while using md->zone_revalidate_map.
> > Making that change should make DM as safe as any other zoned device,
> > which brings me to...
> > 
> > 3. On any zoned device, not just DM, what's stopping
> > one process from syncing the zone write plug offsets:
> > blkdev_report_zones_ioctl() -> blkdev_report_zones() ->
> > various.report_zones() -> disk_report_zones_cb() ->
> > disk_zone_wplug_sync_wp_offset()
> 
> disk_zone_wplug_sync_wp_offset() is a nop for any zone write plug that does not
> have BLK_ZONE_WPLUG_NEED_WP_UPDATE set. And that flag is set only for zones
> that had a write error. Given that recovering from write errors on zoned device
> requires to:
> 1) stop writing to the zone,
> 2) Do a report zone (which will sync the wp offset)
> 3) Restart writing
> there is no write IOs going on for the zone that is being reported, for a well
> behaved user. If the user is not well behaved, it will get errors/out of sync
> wp, until the user behaves :) So I would not worry too much about this.

The issue isn't that there could be I/O errors; it's that we are reading
and writing these zwplugs without any way to keep them from being freed
while we're doing so.

> As we discussed, the table switching should be allowed only for the wildcard
> target (dm-error) and that's it. We should not allow table switching otherwise.
> And given that inserting dm-error does not cause any change to the zone
> configuration, I do not see why we need to revalidate zones.

With this patchset we will never call blk_revalidate_disk_zones() when
we have disk->zone_wplugs_hash, which means that
disk_zone_wplug_sync_wp_offset() can never be called once we are using a
zoned table, so that's not a problem.

The problem we do have happens when we are initially switching to that
zoned table. blk_revalidate_disk_zones() can allocate
disk->zone_wplugs_hash and then later fail and free it. In this case, if
some userspace process does a BLKREPORTZONE ioctl() while
md->zone_revalidate_map is set and disk->zone_wplugs_hash exists, it
could well still be running when disk->zone_wplugs_hash gets freed. We
obviously can't ban loading zoned dm-crypt tables altogether.

I admit, this is a very unlikely occurrence, but very unlikely memory
corruptions are still memory corruptions, and much harder to track down
when they do occur. My original patch, which only allows the task that
set md->zone_revalidate_map to use it during suspend, solves this issue.
So does my suggestion that dm_blk_report_zones() only allow calls with
blk_revalidate_zone_cb() as the callback to use md->zone_revalidate_map.
If Christoph has a better solution that I'm misunderstanding, I'm fine
with that too. But since we want people to be able to set up dm-crypt on
top of zoned devices, we have to guard against userspace processes
running at the same time as we're doing that, in case we fail and free
the resources.

> > 
> > While another process is running into problems while dealing with the
> > gendisk's zone configuration changing:
> > 
> > blk_revalidate_disk_zones() -> disk_free_zone_resources()
> 
> We should not allow the zone configuration to change. That is impossible to
> deal with at the user level.

The comments above blk_revalidate_disk_zones() sure make it sound like
it can be called after a device has already been set up:

/*
 * blk_revalidate_disk_zones - (re)allocate and initialize zone write plugs
 * @disk:       Target disk
 *
 * Helper function for low-level device drivers to check, (re) allocate and
 * initialize resources used for managing zoned disks. This function should
 * normally be called by blk-mq based drivers when a zoned gendisk is probed
 * and when the zone configuration of the gendisk changes (e.g. after a format).
 * Before calling this function, the device driver must already have set the
 * device zone size (chunk_sector limit) and the max zone append limit.
 * BIO based drivers can also use this function as long as the device queue
 * can be safely frozen.
 */

and a brief look does make it appear that nvme and scsi devices can call
this when a device is rescanned. Perhaps there's no risk of them
failing in this case?

> > I don't see any synchronizing between these two call paths. Am I missing
> > something that stops this? Can this only happen for DM devices, for some
> > reason? If this can't happen, I'm fine with just adding a
> > srcu_read_lock() to dm_blk_report_zones() and calling it good.  If it
> > can happen, then we should fix that.
> 
> I think it can happen today but no-one ever had this issue because no-one has
> tried to switch a dm-table on a zoned drive. And I really think we should
> prevent that from being done, except for dm-error since that's used for fstests.

This patchset allows zoned devices that don't allocate zone write plug
resources to be arbitrarily changed. This never caused problems, so we
have no idea how often people were doing it. I don't see anything unsafe
there, and people typically do all sorts of things with linear dm devices.

For devices that have allocated zone write plug resources, they can only
reload tables that do not change the zone resources (only dm-crypat and
dm-error targets will work here), and blk_revalidate_disk_zones() is
never called. Not allowing any dm-crypt reloads would keep people from
reloading a dm-crypt device with the exact same table or simply with an
optional parameter changed. Doing that is safe, and again we have no
idea how often this has been happening, since it wouldn't cause any
problems. 

-Ben

> > 
> > -Ben
> > 
> 
> 
> -- 
> Damien Le Moal
> Western Digital Research

Patch

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 3637761f3585..f3a3f2ef6322 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -141,6 +141,7 @@  struct mapped_device {
 #ifdef CONFIG_BLK_DEV_ZONED
 	unsigned int nr_zones;
 	void *zone_revalidate_map;
+	struct task_struct *revalidate_map_task;
 #endif
 
 #ifdef CONFIG_IMA
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 681058feb63b..11e19281bb64 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -56,24 +56,29 @@  int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
 {
 	struct mapped_device *md = disk->private_data;
 	struct dm_table *map;
-	int srcu_idx, ret;
+	struct dm_table *zone_revalidate_map = md->zone_revalidate_map;
+	int srcu_idx, ret = -EIO;
 
-	if (!md->zone_revalidate_map) {
-		/* Regular user context */
+	if (!zone_revalidate_map || md->revalidate_map_task != current) {
+		/*
+		 * Regular user context or
+		 * Zone revalidation during __bind() is in progress, but this
+		 * call is from a different process
+		 */
 		if (dm_suspended_md(md))
 			return -EAGAIN;
 
 		map = dm_get_live_table(md, &srcu_idx);
-		if (!map)
-			return -EIO;
 	} else {
 		/* Zone revalidation during __bind() */
-		map = md->zone_revalidate_map;
+		map = zone_revalidate_map;
 	}
 
-	ret = dm_blk_do_report_zones(md, map, sector, nr_zones, cb, data);
+	if (map)
+		ret = dm_blk_do_report_zones(md, map, sector, nr_zones, cb,
+					     data);
 
-	if (!md->zone_revalidate_map)
+	if (!zone_revalidate_map)
 		dm_put_live_table(md, srcu_idx);
 
 	return ret;
@@ -175,7 +180,9 @@  int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
 	 * our table for dm_blk_report_zones() to use directly.
 	 */
 	md->zone_revalidate_map = t;
+	md->revalidate_map_task = current;
 	ret = blk_revalidate_disk_zones(disk);
+	md->revalidate_map_task = NULL;
 	md->zone_revalidate_map = NULL;
 
 	if (ret) {