Message ID | 20240412-vv-daxctl-fixes-v1-2-6e808174e24f@intel.com (mailing list archive) |
---|---|
State | Accepted |
Commit | f533bd784859cb4fad62f4512f43b8310905cc8c |
Delegated to: | Patchwork Bot |
Headers | show |
Series | daxctl: Fix error handling and propagation in daxctl/device.c | expand |
Vishal Verma wrote: > The loop through the provided list of devices in do_xaction_device() > returns the status based on whatever the last device did. Since the > order of processing devices, especially in cases like the 'all' keyword, > can be effectively random, this can lead to the same command, and same > effects, exiting with a different error code based on device ordering. > > This was noticed with flakiness in the daxctl-create.sh unit test. Its > 'destroy-device all' command would either pass or fail based on the > order it tried to destroy devices in. (Recall that until now, destroying > a daxX.0 device would result in a failure). > > Make this slightly more consistent by saving a failed status in > do_xaction_device if any iteration of the loop produces a failure. > Return this saved status instead of returning the status of the last > device processed. I think "this is the way", at least it follows what cxl/memdev.c is doing. However we have ended up with an error scheme per tool when it comes to reporting errors for multi-device operations. cxl/memdev.c: report the first error daxctl/device.c: now fixed to report the first error ndctl/namespace.c: reports last result (same daxctl/device.c issue), unless in the greedy-create case cxl/region.c: reports the last error even if that is not the last result, immune to the above bug, but why different? The struggle here is that all of these tools continue on error, so it has always been the case that the only way to get a reliable error code vs action carried out is to not use the "all" or "multi-device" ways to specify the devices to operate upon. I don't have a good answer besides, be careful when using "all". It might make sense to bring ndctl/namespace.c in line to guarantee "unless 100% of the attempts are successful the command reports failure". However, it might be too late to make that change there if it breaks people's scripts. ndctl/namespace.c does not suffer from needing to know that namspaceX.0 can not be deleted since the deletion there is exclusively done by setting namespace size to zero. I think this daxctl change has a low risk of breaking folks because the primary failure case is fixed to swallow the error. Reviewed-by: Dan Williams <dan.j.williams@intel.com>
On Fri, 2024-04-12 at 14:30 -0700, Dan Williams wrote: > Vishal Verma wrote: > > The loop through the provided list of devices in do_xaction_device() > > returns the status based on whatever the last device did. Since the > > order of processing devices, especially in cases like the 'all' keyword, > > can be effectively random, this can lead to the same command, and same > > effects, exiting with a different error code based on device ordering. > > > > This was noticed with flakiness in the daxctl-create.sh unit test. Its > > 'destroy-device all' command would either pass or fail based on the > > order it tried to destroy devices in. (Recall that until now, destroying > > a daxX.0 device would result in a failure). > > > > Make this slightly more consistent by saving a failed status in > > do_xaction_device if any iteration of the loop produces a failure. > > Return this saved status instead of returning the status of the last > > device processed. > > I think "this is the way", at least it follows what cxl/memdev.c is > doing. However we have ended up with an error scheme per tool when it > comes to reporting errors for multi-device operations. > > cxl/memdev.c: report the first error > > daxctl/device.c: now fixed to report the first error With this patch it's actually reporting the last error, not first. > > ndctl/namespace.c: reports last result (same daxctl/device.c issue), unless in the greedy-create case > > cxl/region.c: reports the last error even if that is not the last > result, immune to the above bug, but why different? I guess I always preferred last error instead of first, but happy to change these to last to match cxl/memdev.c. I can see how first error is probably slightly better (we'd normally want to know about the first thing that fails). > > The struggle here is that all of these tools continue on error, so it > has always been the case that the only way to get a reliable error code > vs action carried out is to not use the "all" or "multi-device" ways to > specify the devices to operate upon. > > I don't have a good answer besides, be careful when using "all". > > It might make sense to bring ndctl/namespace.c in line to guarantee > "unless 100% of the attempts are successful the command reports > failure". However, it might be too late to make that change there if it > breaks people's scripts. ndctl/namespace.c does not suffer from needing > to know that namspaceX.0 can not be deleted since the deletion there is > exclusively done by setting namespace size to zero. Agreed, I'm onboard with holding off changes to ndctl/namespace.c unless there's a reported problem, or we hit a test suite failure or something down the line. > > I think this daxctl change has a low risk of breaking folks because the > primary failure case is fixed to swallow the error. > > Reviewed-by: Dan Williams <dan.j.williams@intel.com> Thanks Dan!
On 4/12/24 2:05 PM, Vishal Verma wrote: > The loop through the provided list of devices in do_xaction_device() > returns the status based on whatever the last device did. Since the > order of processing devices, especially in cases like the 'all' keyword, > can be effectively random, this can lead to the same command, and same > effects, exiting with a different error code based on device ordering. > > This was noticed with flakiness in the daxctl-create.sh unit test. Its > 'destroy-device all' command would either pass or fail based on the > order it tried to destroy devices in. (Recall that until now, destroying > a daxX.0 device would result in a failure). > > Make this slightly more consistent by saving a failed status in > do_xaction_device if any iteration of the loop produces a failure. > Return this saved status instead of returning the status of the last > device processed. > > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Alison Schofield <alison.schofield@intel.com> > Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> > --- > daxctl/device.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/daxctl/device.c b/daxctl/device.c > index 83c61389..14d62148 100644 > --- a/daxctl/device.c > +++ b/daxctl/device.c > @@ -1012,7 +1012,7 @@ static int do_xaction_device(const char *device, enum device_action action, > struct json_object *jdevs = NULL; > struct daxctl_region *region; > struct daxctl_dev *dev; > - int rc = -ENXIO; > + int rc = -ENXIO, saved_rc = 0; > > *processed = 0; > > @@ -1059,6 +1059,8 @@ static int do_xaction_device(const char *device, enum device_action action, > rc = -EINVAL; > break; > } > + if (rc) > + saved_rc = rc; > } > } > > @@ -1070,7 +1072,7 @@ static int do_xaction_device(const char *device, enum device_action action, > if (jdevs) > util_display_json_array(stdout, jdevs, flags); > > - return rc; > + return saved_rc; > } > > int cmd_create_device(int argc, const char **argv, struct daxctl_ctx *ctx) >
diff --git a/daxctl/device.c b/daxctl/device.c index 83c61389..14d62148 100644 --- a/daxctl/device.c +++ b/daxctl/device.c @@ -1012,7 +1012,7 @@ static int do_xaction_device(const char *device, enum device_action action, struct json_object *jdevs = NULL; struct daxctl_region *region; struct daxctl_dev *dev; - int rc = -ENXIO; + int rc = -ENXIO, saved_rc = 0; *processed = 0; @@ -1059,6 +1059,8 @@ static int do_xaction_device(const char *device, enum device_action action, rc = -EINVAL; break; } + if (rc) + saved_rc = rc; } } @@ -1070,7 +1072,7 @@ static int do_xaction_device(const char *device, enum device_action action, if (jdevs) util_display_json_array(stdout, jdevs, flags); - return rc; + return saved_rc; } int cmd_create_device(int argc, const char **argv, struct daxctl_ctx *ctx)
The loop through the provided list of devices in do_xaction_device() returns the status based on whatever the last device did. Since the order of processing devices, especially in cases like the 'all' keyword, can be effectively random, this can lead to the same command, and same effects, exiting with a different error code based on device ordering. This was noticed with flakiness in the daxctl-create.sh unit test. Its 'destroy-device all' command would either pass or fail based on the order it tried to destroy devices in. (Recall that until now, destroying a daxX.0 device would result in a failure). Make this slightly more consistent by saving a failed status in do_xaction_device if any iteration of the loop produces a failure. Return this saved status instead of returning the status of the last device processed. Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> --- daxctl/device.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)