Message ID | 20250317044510.2200856-6-bmarzins@redhat.com (mailing list archive)
---|---
State | New
Series | dm: fix issues with swapping dm tables
On Mon, Mar 17, 2025 at 12:45:09AM -0400, Benjamin Marzinski wrote:
> __bind(). Otherwise the zone resources could change while accessing
> them. Finally, it is possible that md->zone_revalidate_map will change
> while calling this function. Only read it once, so that we are always
> using the same value. Otherwise we might miss a call to
> dm_put_live_table().

This checking for calling context is pretty ugly. Can you just use
a separate report zones helper or at least a clearly documented flag for it?
On Tue, Mar 18, 2025 at 07:57:21AM +0100, Christoph Hellwig wrote:
> On Mon, Mar 17, 2025 at 12:45:09AM -0400, Benjamin Marzinski wrote:
> > __bind(). Otherwise the zone resources could change while accessing
> > them. Finally, it is possible that md->zone_revalidate_map will change
> > while calling this function. Only read it once, so that we are always
> > using the same value. Otherwise we might miss a call to
> > dm_put_live_table().
>
> This checking for calling context is pretty ugly. Can you just use
> a separate report zones helper or at least a clearly documented flag for it?

Not sure how that would work. The goal is to keep another process,
calling something like blkdev_report_zones_ioctl(), from being able to
call its report_zones_cb, while DM is in blk_revalidate_disk_zones(),
which needs to use that same disk->fops->report_zones() interface in
order to call blk_revalidate_zone_cb(). We need some way to distinguish
between the callers. We could export blk_revalidate_zone_cb() and have
dm_blk_report_zones() check if it was called with that report_zones_cb.
That seems just as ugly to me. But if you like that better, fine.
Otherwise I don't see how we can distinguish between the call from
blk_revalidate_disk_zones() and a call from something like
blkdev_report_zones_ioctl(). Am I not understanding your suggestion?

Allowing the blkdev_report_zones_ioctl() call to go ahead using
md->zone_revalidate_map runs into three problems.

1. It's running while the device is switching tables, and there's no
guarantee that DM won't encounter an error and have to fail back. So it
could report information for this unused table. We could just say that
this is what you get from trying to grab the zone information of a
device while it's switching tables. Fine.

2. Without this patch, it's possible that while
blkdev_report_zones_ioctl() is still running its report_zones_cb, DM
fails switching tables and frees the new table that's being used by
blkdev_report_zones_ioctl(), causing use-after-free errors. However,
this is solvable by calling srcu_read_lock(&md->io_barrier) to make sure
that we're in an SRCU read section while using md->zone_revalidate_map.
Making that change should make DM as safe as any other zoned device,
which brings me to...

3. On any zoned device, not just DM, what's stopping one process from
syncing the zone write plug offsets:

blkdev_report_zones_ioctl() -> blkdev_report_zones() ->
various.report_zones() -> disk_report_zones_cb() ->
disk_zone_wplug_sync_wp_offset()

While another process is running into problems while dealing with the
gendisk's zone configuration changing:

blk_revalidate_disk_zones() -> disk_free_zone_resources()

I don't see any synchronizing between these two call paths. Am I missing
something that stops this? Can this only happen for DM devices, for some
reason? If this can't happen, I'm fine with just adding a
srcu_read_lock() to dm_blk_report_zones() and calling it good. If it
can happen, then we should fix that.

-Ben
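For point 2 above, a rough, untested sketch of the srcu_read_lock(&md->io_barrier)
idea being described (the function name dm_blk_report_zones_sketch is hypothetical;
this only illustrates the locking shape, not a final implementation):

static int dm_blk_report_zones_sketch(struct gendisk *disk, sector_t sector,
				      unsigned int nr_zones,
				      report_zones_cb cb, void *data)
{
	struct mapped_device *md = disk->private_data;
	struct dm_table *map;
	int srcu_idx, ret;

	/*
	 * Read md->zone_revalidate_map once, and only use it while inside
	 * an SRCU read section, so a failed table swap cannot free the new
	 * table out from under the report.
	 */
	srcu_idx = srcu_read_lock(&md->io_barrier);
	map = md->zone_revalidate_map;
	if (map) {
		/* Zone revalidation during __bind(): use the new table. */
		ret = dm_blk_do_report_zones(md, map, sector, nr_zones, cb, data);
		srcu_read_unlock(&md->io_barrier, srcu_idx);
		return ret;
	}
	srcu_read_unlock(&md->io_barrier, srcu_idx);

	/* Regular user context: use the live table as before. */
	if (dm_suspended_md(md))
		return -EAGAIN;

	map = dm_get_live_table(md, &srcu_idx);
	if (!map)
		return -EIO;

	ret = dm_blk_do_report_zones(md, map, sector, nr_zones, cb, data);
	dm_put_live_table(md, srcu_idx);

	return ret;
}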
On 3/19/25 7:10 AM, Benjamin Marzinski wrote:
> On Tue, Mar 18, 2025 at 07:57:21AM +0100, Christoph Hellwig wrote:
>> On Mon, Mar 17, 2025 at 12:45:09AM -0400, Benjamin Marzinski wrote:
>>> __bind(). Otherwise the zone resources could change while accessing
>>> them. Finally, it is possible that md->zone_revalidate_map will change
>>> while calling this function. Only read it once, so that we are always
>>> using the same value. Otherwise we might miss a call to
>>> dm_put_live_table().
>>
>> This checking for calling context is pretty ugly. Can you just use
>> a separate report zones helper or at least a clearly documented flag for it?
>
> Not sure how that would work. The goal is to keep another process,
> calling something like blkdev_report_zones_ioctl(), from being able to
> call its report_zones_cb, while DM is in blk_revalidate_disk_zones(),
> which needs to use that same disk->fops->report_zones() interface in
> order to call blk_revalidate_zone_cb(). We need some way to distinguish
> between the callers. We could export blk_revalidate_zone_cb() and have
> dm_blk_report_zones() check if it was called with that report_zones_cb.
> That seems just as ugly to me. But if you like that better, fine.
> Otherwise I don't see how we can distinguish between the call from
> blk_revalidate_disk_zones() and a call from something like
> blkdev_report_zones_ioctl(). Am I not understanding your suggestion?
>
> Allowing the blkdev_report_zones_ioctl() call to go ahead using
> md->zone_revalidate_map runs into three problems.
>
> 1. It's running while the device is switching tables, and there's no
> guarantee that DM won't encounter an error and have to fail back. So it
> could report information for this unused table. We could just say that
> this is what you get from trying to grab the zone information of a
> device while it's switching tables. Fine.
>
> 2. Without this patch, it's possible that while
> blkdev_report_zones_ioctl() is still running its report_zones_cb, DM
> fails switching tables and frees the new table that's being used by
> blkdev_report_zones_ioctl(), causing use-after-free errors. However,
> this is solvable by calling srcu_read_lock(&md->io_barrier) to make sure
> that we're in an SRCU read section while using md->zone_revalidate_map.
> Making that change should make DM as safe as any other zoned device,
> which brings me to...
>
> 3. On any zoned device, not just DM, what's stopping
> one process from syncing the zone write plug offsets:
> blkdev_report_zones_ioctl() -> blkdev_report_zones() ->
> various.report_zones() -> disk_report_zones_cb() ->
> disk_zone_wplug_sync_wp_offset()

disk_zone_wplug_sync_wp_offset() is a nop for any zone write plug that does
not have BLK_ZONE_WPLUG_NEED_WP_UPDATE set. And that flag is set only for
zones that had a write error. Given that recovering from write errors on a
zoned device requires one to:
1) stop writing to the zone,
2) do a report zones (which will sync the wp offset),
3) restart writing,
there are no write IOs going on for the zone that is being reported, for a
well behaved user. If the user is not well behaved, it will get errors/out
of sync wp, until the user behaves :) So I would not worry too much about
this.

As we discussed, the table switching should be allowed only for the wildcard
target (dm-error) and that's it. We should not allow table switching
otherwise. And given that inserting dm-error does not cause any change to
the zone configuration, I do not see why we need to revalidate zones.

>
> While another process is running into problems while dealing with the
> gendisk's zone configuration changing:
>
> blk_revalidate_disk_zones() -> disk_free_zone_resources()

We should not allow the zone configuration to change. That is impossible to
deal with at the user level.

> I don't see any synchronizing between these two call paths. Am I missing
> something that stops this? Can this only happen for DM devices, for some
> reason? If this can't happen, I'm fine with just adding a
> srcu_read_lock() to dm_blk_report_zones() and calling it good. If it
> can happen, then we should fix that.

I think it can happen today, but no-one ever had this issue because no-one
has tried to switch a dm-table on a zoned drive. And I really think we
should prevent that from being done, except for dm-error since that's used
for fstests.

>
> -Ben
>
On Wed, Mar 19, 2025 at 10:40:38AM +0900, Damien Le Moal wrote:
> On 3/19/25 7:10 AM, Benjamin Marzinski wrote:
> > On Tue, Mar 18, 2025 at 07:57:21AM +0100, Christoph Hellwig wrote:
> >> On Mon, Mar 17, 2025 at 12:45:09AM -0400, Benjamin Marzinski wrote:
> >>> __bind(). Otherwise the zone resources could change while accessing
> >>> them. Finally, it is possible that md->zone_revalidate_map will change
> >>> while calling this function. Only read it once, so that we are always
> >>> using the same value. Otherwise we might miss a call to
> >>> dm_put_live_table().
> >>
> >> This checking for calling context is pretty ugly. Can you just use
> >> a separate report zones helper or at least a clearly documented flag for it?
> >
> > Not sure how that would work. The goal is to keep another process,
> > calling something like blkdev_report_zones_ioctl(), from being able to
> > call its report_zones_cb, while DM is in blk_revalidate_disk_zones(),
> > which needs to use that same disk->fops->report_zones() interface in
> > order to call blk_revalidate_zone_cb(). We need some way to distinguish
> > between the callers. We could export blk_revalidate_zone_cb() and have
> > dm_blk_report_zones() check if it was called with that report_zones_cb.
> > That seems just as ugly to me. But if you like that better, fine.
> > Otherwise I don't see how we can distinguish between the call from
> > blk_revalidate_disk_zones() and a call from something like
> > blkdev_report_zones_ioctl(). Am I not understanding your suggestion?
> >
> > Allowing the blkdev_report_zones_ioctl() call to go ahead using
> > md->zone_revalidate_map runs into three problems.
> >
> > 1. It's running while the device is switching tables, and there's no
> > guarantee that DM won't encounter an error and have to fail back. So it
> > could report information for this unused table. We could just say that
> > this is what you get from trying to grab the zone information of a
> > device while it's switching tables. Fine.
> >
> > 2. Without this patch, it's possible that while
> > blkdev_report_zones_ioctl() is still running its report_zones_cb, DM
> > fails switching tables and frees the new table that's being used by
> > blkdev_report_zones_ioctl(), causing use-after-free errors. However,
> > this is solvable by calling srcu_read_lock(&md->io_barrier) to make sure
> > that we're in an SRCU read section while using md->zone_revalidate_map.
> > Making that change should make DM as safe as any other zoned device,
> > which brings me to...
> >
> > 3. On any zoned device, not just DM, what's stopping
> > one process from syncing the zone write plug offsets:
> > blkdev_report_zones_ioctl() -> blkdev_report_zones() ->
> > various.report_zones() -> disk_report_zones_cb() ->
> > disk_zone_wplug_sync_wp_offset()
>
> disk_zone_wplug_sync_wp_offset() is a nop for any zone write plug that does
> not have BLK_ZONE_WPLUG_NEED_WP_UPDATE set. And that flag is set only for
> zones that had a write error. Given that recovering from write errors on a
> zoned device requires one to:
> 1) stop writing to the zone,
> 2) do a report zones (which will sync the wp offset),
> 3) restart writing,
> there are no write IOs going on for the zone that is being reported, for a
> well behaved user. If the user is not well behaved, it will get errors/out
> of sync wp, until the user behaves :) So I would not worry too much about
> this.

The issue isn't that there could be IO errors, it's that we are reading and
writing from these zwplugs without any way to keep them from being freed
while we're doing so.

> As we discussed, the table switching should be allowed only for the wildcard
> target (dm-error) and that's it. We should not allow table switching
> otherwise. And given that inserting dm-error does not cause any change to
> the zone configuration, I do not see why we need to revalidate zones.

With this patchset we will never call blk_revalidate_disk_zones() when we
have disk->zone_wplugs_hash, which means that disk_zone_wplug_sync_wp_offset()
can never be called once we are using a zoned table, so that's not a problem.
The problem we do have happens when we are initially switching to that zoned
table. blk_revalidate_disk_zones() can allocate disk->zone_wplugs_hash, and
then later fail and free it. In this case, if some userspace process does a
BLKREPORTZONE ioctl() while md->zone_revalidate_map is set and
disk->zone_wplugs_hash exists, it could well still be running when
disk->zone_wplugs_hash gets freed. We obviously can't ban loading zoned
dm-crypt tables altogether. I admit, this is a very unlikely occurrence, but
very unlikely memory corruptions are still memory corruptions, and much
harder to track down when they do occur.

My original patch, which only allows the task that set
md->zone_revalidate_map to work during suspends, solves this issue. So does
my suggestion that dm_blk_report_zones() only allow calls with
blk_revalidate_zone_cb() as the callback to use md->zone_revalidate_map. If
Christoph has a better solution that I'm misunderstanding, I'm fine with that
too. But since we want people to be able to set up dm-crypt on top of zoned
devices, we have to guard against userspace processes running at the same
time as we're doing that, in case we fail and free the resources.

> >
> > While another process is running into problems while dealing with the
> > gendisk's zone configuration changing:
> >
> > blk_revalidate_disk_zones() -> disk_free_zone_resources()
>
> We should not allow the zone configuration to change. That is impossible to
> deal with at the user level.

The comments above blk_revalidate_disk_zones() sure make it sound like it
could get called after a device has already been set up:

/*
 * blk_revalidate_disk_zones - (re)allocate and initialize zone write plugs
 * @disk:	Target disk
 *
 * Helper function for low-level device drivers to check, (re) allocate and
 * initialize resources used for managing zoned disks. This function should
 * normally be called by blk-mq based drivers when a zoned gendisk is probed
 * and when the zone configuration of the gendisk changes (e.g. after a format).
 * Before calling this function, the device driver must already have set the
 * device zone size (chunk_sector limit) and the max zone append limit.
 * BIO based drivers can also use this function as long as the device queue
 * can be safely frozen.
 */

and a brief look does make it appear that nvme and scsi devices can call this
when the device is rescanned. Perhaps there's no risk of them failing in this
case?

> > I don't see any synchronizing between these two call paths. Am I missing
> > something that stops this? Can this only happen for DM devices, for some
> > reason? If this can't happen, I'm fine with just adding a
> > srcu_read_lock() to dm_blk_report_zones() and calling it good. If it
> > can happen, then we should fix that.
>
> I think it can happen today but no-one ever had this issue because no-one
> has tried to switch a dm-table on a zoned drive. And I really think we
> should prevent that from being done, except for dm-error since that's used
> for fstests.

This patchset allows zoned devices that don't allocate zone write plug
resources to be arbitrarily changed. This never caused problems, so we have
no idea how often people were doing it. I don't see anything unsafe there,
and people typically do all sorts of things with linear dm devices. For
devices that have allocated zone write plug resources, they can only reload
tables that do not change the zone resources (only dm-crypt and dm-error
targets will work here), and blk_revalidate_disk_zones() is never called.
Not allowing any dm-crypt reloads would keep people from reloading a
dm-crypt device with the exact same table or simply with an optional
parameter changed. Doing that is safe, and again we have no idea how often
this has been happening, since it wouldn't cause any problems.

-Ben

> >
> > -Ben
> >
>
> --
> Damien Le Moal
> Western Digital Research
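For reference, a minimal sketch of the callback-check alternative mentioned in
the discussion above. It assumes blk_revalidate_zone_cb() were exported from
block/blk-zoned.c (it is not today, as noted earlier), and only shows how the
check in dm_blk_report_zones() might look; it is untested:

	/*
	 * Hypothetical check: only the revalidation path, identified by its
	 * report_zones_cb, may use the not-yet-live table set in __bind().
	 * Every other caller goes through the live table as usual.
	 */
	if (md->zone_revalidate_map && cb == blk_revalidate_zone_cb) {
		/* Zone revalidation during __bind() */
		map = md->zone_revalidate_map;
	} else {
		/* Regular user context */
		if (dm_suspended_md(md))
			return -EAGAIN;

		map = dm_get_live_table(md, &srcu_idx);
		if (!map)
			return -EIO;
	}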
diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 3637761f3585..f3a3f2ef6322 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -141,6 +141,7 @@ struct mapped_device {
 #ifdef CONFIG_BLK_DEV_ZONED
 	unsigned int nr_zones;
 	void *zone_revalidate_map;
+	struct task_struct *revalidate_map_task;
 #endif
 
 #ifdef CONFIG_IMA
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 681058feb63b..11e19281bb64 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -56,24 +56,29 @@ int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
 {
 	struct mapped_device *md = disk->private_data;
 	struct dm_table *map;
-	int srcu_idx, ret;
+	struct dm_table *zone_revalidate_map = md->zone_revalidate_map;
+	int srcu_idx, ret = -EIO;
 
-	if (!md->zone_revalidate_map) {
-		/* Regular user context */
+	if (!zone_revalidate_map || md->revalidate_map_task != current) {
+		/*
+		 * Regular user context or
+		 * Zone revalidation during __bind() is in progress, but this
+		 * call is from a different process
+		 */
 		if (dm_suspended_md(md))
 			return -EAGAIN;
 
 		map = dm_get_live_table(md, &srcu_idx);
-		if (!map)
-			return -EIO;
 	} else {
 		/* Zone revalidation during __bind() */
-		map = md->zone_revalidate_map;
+		map = zone_revalidate_map;
 	}
 
-	ret = dm_blk_do_report_zones(md, map, sector, nr_zones, cb, data);
+	if (map)
+		ret = dm_blk_do_report_zones(md, map, sector, nr_zones, cb,
+					     data);
 
-	if (!md->zone_revalidate_map)
+	if (!zone_revalidate_map)
 		dm_put_live_table(md, srcu_idx);
 
 	return ret;
@@ -175,7 +180,9 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
 	 * our table for dm_blk_report_zones() to use directly.
 	 */
 	md->zone_revalidate_map = t;
+	md->revalidate_map_task = current;
 	ret = blk_revalidate_disk_zones(disk);
+	md->revalidate_map_task = NULL;
 	md->zone_revalidate_map = NULL;
 
 	if (ret) {