Message ID | 20240605075144.153141-2-dlemoal@kernel.org (mailing list archive)
---|---
State | New, archived
Series | Fix DM zone resource limits stacking
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
On Wed, Jun 05, 2024 at 04:51:42PM +0900, Damien Le Moal wrote:
> Make sure that the zone resource limits of a zoned block device are
> correct by checking that:
> (a) If the device has a max active zones limit, make sure that the max
>     open zones limit is lower than the max active zones limit.
> (b) If the device has zone resource limits, check that the limits
>     values are lower than the number of sequential zones of the device.
>     If it is not, assume that the zoned device has no limits by setting
>     the limits to 0.

[...]

> +	/*
> +	 * Given that active zones include open zones, the maximum number of
> +	 * open zones cannot be larger than the maximum numbber of active zones.

s/numbber/number/

> +	 */
> +	if (lim->max_active_zones &&
> +	    lim->max_open_zones > lim->max_active_zones)
> +		return -EINVAL;

[...]

> +	/*
> +	 * Some devices can advertize zone resource limits that are larger than
> +	 * the number of sequential zones of the zoned block device, e.g. a
> +	 * small ZNS namespace. For such case, assume that the zoned device has
> +	 * no zone resource limits.
> +	 */
> +	nr_seq_zones = disk->nr_zones - nr_conv_zones;
> +	if (lim.max_open_zones >= nr_seq_zones)
> +		lim.max_open_zones = 0;
> +	if (lim.max_active_zones >= nr_seq_zones)
> +		lim.max_active_zones = 0;

Is this really correct to transform to no limits?

The MAR and MOR limits are defined in the I/O Command Set Specific
Identify Namespace Data Structure for the Zoned Namespace Command Set.

However, the user has no ability to control these limits themselves
during a namespace management create ns, or for the format command
(and this still seems to be the case in the latest ZNS spec 1.1d).

Which means that the controller has no way of knowing the number of
resources to allocate to each namespace.

Some (all?) controllers will right now simply report the same MAR/MOR
for all namespaces.

So if I use the namespace management command to create two small zoned
namespaces, the number of sequential zones might be smaller than the
limits in both namespaces, but together they could exceed the limit.

How is ignoring the limit that we got from the device better than
actually exposing the limit that we got from the device? Since AFAICT,
this also means that we will expose 0 to sysfs instead of the value
that the device reported.

Perhaps we should only do this optimization if:
- the device is not ZNS, or
- the device is ZNS and does not support NS management, or
- the device is ZNS and supports NS management and implements TP4115
  (the Zoned Namespace Resource Management supported bit is set, even
  if that TP does not seem to be part of a ratified ZNS version yet...)

Kind regards,
Niklas
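A minimal sketch of the gating proposed above, for illustration only.
The three predicates are hypothetical stand-ins for fields that would
be derived from the ZNS identify data structures; none of them are
existing kernel helpers:

#include <stdbool.h>	/* for a userspace build of this sketch */

/*
 * Sketch: decide whether clearing out-of-range zone resource limits
 * is safe, i.e. whether the reported limits are truly per-namespace.
 */
static bool zone_limits_are_per_namespace(bool is_zns,
					  bool supports_ns_mgmt,
					  bool supports_zone_rsrc_mgmt)
{
	/* Non-ZNS zoned devices (e.g. SMR HDDs) have no namespaces. */
	if (!is_zns)
		return true;

	/* Without NS management, vendors can set per-NS limits sanely. */
	if (!supports_ns_mgmt)
		return true;

	/*
	 * With NS management, only TP4115 (Zoned Namespace Resource
	 * Management) guarantees per-namespace resource accounting.
	 */
	return supports_zone_rsrc_mgmt;
}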
On 6/6/24 2:25 AM, Niklas Cassel wrote:
> On Wed, Jun 05, 2024 at 04:51:42PM +0900, Damien Le Moal wrote:

[...]

> Is this really correct to transform to no limits?
>
> The MAR and MOR limits are defined in the I/O Command Set Specific
> Identify Namespace Data Structure for the Zoned Namespace Command Set.
>
> However, the user has no ability to control these limits themselves
> during a namespace management create ns, or for the format command
> (and this still seems to be the case in the latest ZNS spec 1.1d).
>
> Which means that the controller has no way of knowing the number of
> resources to allocate to each namespace.
>
> Some (all?) controllers will right now simply report the same MAR/MOR
> for all namespaces.
>
> So if I use the namespace management command to create two small zoned
> namespaces, the number of sequential zones might be smaller than the
> limits in both namespaces, but together they could exceed the limit.
>
> How is ignoring the limit that we got from the device better than
> actually exposing the limit that we got from the device?

If the limits are larger than the number of zones in the namespace,
there are no limits at all. That is what this change encodes, and there
is no question that it is correct: even if we do not change the values,
any application or FS seeing these limits being larger than the number
of zones will conclude the same.

The problem you are raising is the reliability of the limits
themselves. For NVMe ZNS, given that MOR/MAR are not defined per
namespace, we are in the same situation as with DM devices sharing the
same zoned block device through different targets: even if the user
respects the limits, write errors may happen due to the backing device
limits (or controller limits for ZNS) being exceeded. There is not much
we can do to easily deal with this right now. We would need to
constantly track zone states and implement a software-driven zone state
machine checking the limits all the time to actually provide
guarantees.

> Since AFAICT, this also means that we will expose 0 to sysfs instead
> of the value that the device reported.

Yes. But the value reported by the device is for the whole controller.
The sysfs attributes are for the block device == namespace.

> Perhaps we should only do this optimization if:
> - the device is not ZNS, or
> - the device is ZNS and does not support NS management, or
> - the device is ZNS and supports NS management and implements TP4115

Right now, this all works the same way for DM and NVMe ZNS, so I think
this is all good. If anything, we should probably add a warning in the
nvme driver about the potentially unreliable MOR/MAR limits if we see a
ZNS device with multiple zoned namespaces.
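The sysfs attributes in question are the standard per-device block
queue limits. A small userspace sketch of how an application would read
and interpret them ("nvme0n1" is only an example device name, and
queue/nr_zones counts conventional zones too, so the comparison below
is approximate):

#include <stdio.h>

static unsigned long read_ulong(const char *path)
{
	unsigned long v = 0;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%lu", &v) != 1)
			v = 0;
		fclose(f);
	}
	return v;
}

int main(void)
{
	unsigned long moz =
		read_ulong("/sys/block/nvme0n1/queue/max_open_zones");
	unsigned long nrz =
		read_ulong("/sys/block/nvme0n1/queue/nr_zones");

	/* A limit of 0, or a limit >= the zone count, means "no limit". */
	if (!moz || moz >= nrz)
		printf("no effective open zone limit\n");
	else
		printf("at most %lu zones may be open at once\n", moz);
	return 0;
}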
On Thu, Jun 06, 2024 at 09:06:58AM +0900, Damien Le Moal wrote:
> On 6/6/24 2:25 AM, Niklas Cassel wrote:

[...]

> > Since AFAICT, this also means that we will expose 0 to sysfs instead
> > of the value that the device reported.
>
> Yes. But the value reported by the device is for the whole controller.
> The sysfs attributes are for the block device == namespace.

The limits are defined in the I/O Command Set Specific Identify
Namespace Data Structure for the Zoned Namespace Command Set, so they
are per NS; otherwise they would have been defined in the I/O Command
Set Specific Identify Controller Data Structure for the Zoned Namespace
Command Set.

> Right now, this all works the same way for DM and NVMe ZNS, so I think
> this is all good. If anything, we should probably add a warning in the
> nvme driver about the potentially unreliable MOR/MAR limits if we see
> a ZNS device with multiple zoned namespaces.

Well, it is only a problem for ZNS devices with NS management.

If there are two ZNS namespaces on the device, and the device does not
support NS management, the device vendor would have been seriously
silly not to allocate and set the limits in the I/O Command Set
Specific Identify Namespace Data Structure for the Zoned Namespace
Command Set correctly.

But yes, this concern cannot be solved in disk_update_zone_resources(),
which operates per gendisk (and there is one gendisk per namespace), so
there is not much this function can do. If we were to do something, it
would have to be done in the nvme driver.

Perhaps if the device is ZNS, and does support NS management, but does
not have the Zoned Namespace Resource Management supported bit set,
divide the MAR/MOR values reported by each namespace by the number of
ZNS namespaces?

Kind regards,
Niklas
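A rough sketch of that division, for illustration only. The function
name and parameters are hypothetical; a real change would live in the
nvme driver and would also need to handle the 0's based encoding of
MOR/MAR in the ZNS identify data:

/*
 * Sketch only: split controller-wide zone resources evenly across
 * zoned namespaces when per-NS accounting (TP4115) is not advertised.
 */
static void zns_scale_zone_limits(unsigned int *mor, unsigned int *mar,
				  unsigned int nr_zoned_namespaces,
				  bool per_ns_resources)
{
	if (per_ns_resources || nr_zoned_namespaces <= 1)
		return;

	/*
	 * Each namespace reports the same controller-wide MOR/MAR, so
	 * dividing by the namespace count gives each namespace a fair,
	 * if conservative, share.
	 */
	*mor /= nr_zoned_namespaces;
	*mar /= nr_zoned_namespaces;
}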
On 6/6/24 10:21 AM, Niklas Cassel wrote:

[...]

> Perhaps if the device is ZNS, and does support NS management, but does
> not have the Zoned Namespace Resource Management supported bit set,
> divide the MAR/MOR values reported by each namespace by the number of
> ZNS namespaces?

Maybe. But that would still not provide any guarantee: a buggy
application not respecting the limits would be able to steal resources
from the other namespace.

In any case, I think this is a discussion to have on the nvme list.
On Thu, Jun 06, 2024 at 11:12:08AM +0900, Damien Le Moal wrote:
> Maybe. But that would still not provide any guarantee: a buggy
> application not respecting the limits would be able to steal resources
> from the other namespace.
>
> In any case, I think this is a discussion to have on the nvme list.

Yes, the per-controller limits in NVMe are a big mistake that was
already mentioned during ZNS standardization. We don't really have any
good way to deal with it except strongly recommending that users not
use multiple ZNS namespaces per controller.
diff --git a/block/blk-settings.c b/block/blk-settings.c
index effeb9a639bb..474c709ea85b 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -80,6 +80,14 @@ static int blk_validate_zoned_limits(struct queue_limits *lim)
 	if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_BLK_DEV_ZONED)))
 		return -EINVAL;
 
+	/*
+	 * Given that active zones include open zones, the maximum number of
+	 * open zones cannot be larger than the maximum numbber of active zones.
+	 */
+	if (lim->max_active_zones &&
+	    lim->max_open_zones > lim->max_active_zones)
+		return -EINVAL;
+
 	if (lim->zone_write_granularity < lim->logical_block_size)
 		lim->zone_write_granularity = lim->logical_block_size;
 
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 52abebf56027..8f89705f5e1c 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1647,8 +1647,22 @@ static int disk_update_zone_resources(struct gendisk *disk,
 		return -ENODEV;
 	}
 
+	lim = queue_limits_start_update(q);
+
+	/*
+	 * Some devices can advertize zone resource limits that are larger than
+	 * the number of sequential zones of the zoned block device, e.g. a
+	 * small ZNS namespace. For such case, assume that the zoned device has
+	 * no zone resource limits.
+	 */
+	nr_seq_zones = disk->nr_zones - nr_conv_zones;
+	if (lim.max_open_zones >= nr_seq_zones)
+		lim.max_open_zones = 0;
+	if (lim.max_active_zones >= nr_seq_zones)
+		lim.max_active_zones = 0;
+
 	if (!disk->zone_wplugs_pool)
-		return 0;
+		goto commit;
 
 	/*
 	 * If the device has no limit on the maximum number of open and active
@@ -1657,9 +1671,6 @@ static int disk_update_zone_resources(struct gendisk *disk,
 	 * dynamic zone write plug allocation when simultaneously writing to
 	 * more zones than the size of the mempool.
 	 */
-	lim = queue_limits_start_update(q);
-
-	nr_seq_zones = disk->nr_zones - nr_conv_zones;
 	pool_size = max(lim.max_open_zones, lim.max_active_zones);
 	if (!pool_size)
 		pool_size = min(BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE, nr_seq_zones);
@@ -1673,6 +1684,7 @@ static int disk_update_zone_resources(struct gendisk *disk,
 		lim.max_open_zones = 0;
 	}
 
+commit:
 	return queue_limits_commit_update(q, &lim);
 }
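To see the combined effect of checks (a) and (b), here is a standalone
sketch that mirrors the patch's logic with a reduced stand-in for the
kernel's struct queue_limits (the zone counts and limit values are
illustrative only):

#include <stdio.h>

/* Reduced stand-in for the kernel's struct queue_limits. */
struct limits {
	unsigned int max_open_zones;
	unsigned int max_active_zones;
};

int main(void)
{
	/* e.g. a small ZNS namespace: 12 sequential zones, MOR/MAR = 14 */
	struct limits lim = { .max_open_zones = 14, .max_active_zones = 14 };
	unsigned int nr_seq_zones = 12;

	/* Check (a): open zones are a subset of active zones. */
	if (lim.max_active_zones &&
	    lim.max_open_zones > lim.max_active_zones) {
		fprintf(stderr, "invalid zone resource limits\n");
		return 1;
	}

	/* Check (b): limits >= the zone count are meaningless, clear them. */
	if (lim.max_open_zones >= nr_seq_zones)
		lim.max_open_zones = 0;
	if (lim.max_active_zones >= nr_seq_zones)
		lim.max_active_zones = 0;

	/* Prints "max_open_zones=0 max_active_zones=0": no limits. */
	printf("max_open_zones=%u max_active_zones=%u\n",
	       lim.max_open_zones, lim.max_active_zones);
	return 0;
}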
Make sure that the zone resource limits of a zoned block device are
correct by checking that:
(a) If the device has a max active zones limit, make sure that the max
    open zones limit is lower than the max active zones limit.
(b) If the device has zone resource limits, check that the limits
    values are lower than the number of sequential zones of the device.
    If it is not, assume that the zoned device has no limits by setting
    the limits to 0.

For (a), a check is added to blk_validate_zoned_limits() and an error
returned if the max open zones limit exceeds the value of the max
active zone limit (if there is one).

For (b), given that we need the number of sequential zones of the zoned
device, this check is added to disk_update_zone_resources(). This is
safe to do as that function is executed with the disk queue frozen and
the check executed after queue_limits_start_update() which takes the
queue limits lock. Of note is that the early return in this function
for zoned devices that do not use zone write plugging (e.g. DM devices
using native zone append) is moved to after the new check and
adjustment of the zone resource limits so that the check applies to any
zoned device.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-settings.c |  8 ++++++++
 block/blk-zoned.c    | 20 ++++++++++++++++----
 2 files changed, 24 insertions(+), 4 deletions(-)