diff mbox series

[ndctl,2/2] daxctl/device.c: Fix error propagation in do_xaction_device()

Message ID 20240412-vv-daxctl-fixes-v1-2-6e808174e24f@intel.com (mailing list archive)
State Accepted
Commit f533bd784859cb4fad62f4512f43b8310905cc8c
Delegated to: Patchwork Bot
Headers show
Series daxctl: Fix error handling and propagation in daxctl/device.c | expand

Commit Message

Verma, Vishal L April 12, 2024, 9:05 p.m. UTC
The loop through the provided list of devices in do_xaction_device()
returns the status based on whatever the last device did. Since the
order of processing devices, especially in cases like the 'all' keyword,
can be effectively random, this can lead to the same command, and same
effects, exiting with a different error code based on device ordering.

This was noticed with flakiness in the daxctl-create.sh unit test. Its
'destroy-device all' command would either pass or fail based on the
order it tried to destroy devices in. (Recall that until now, destroying
a daxX.0 device would result in a failure).

Make this slightly more consistent by saving a failed status in
do_xaction_device if any iteration of the loop produces a failure.
Return this saved status instead of returning the status of the last
device processed.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 daxctl/device.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Dan Williams April 12, 2024, 9:30 p.m. UTC | #1
Vishal Verma wrote:
> The loop through the provided list of devices in do_xaction_device()
> returns the status based on whatever the last device did. Since the
> order of processing devices, especially in cases like the 'all' keyword,
> can be effectively random, this can lead to the same command, and same
> effects, exiting with a different error code based on device ordering.
> 
> This was noticed with flakiness in the daxctl-create.sh unit test. Its
> 'destroy-device all' command would either pass or fail based on the
> order it tried to destroy devices in. (Recall that until now, destroying
> a daxX.0 device would result in a failure).
> 
> Make this slightly more consistent by saving a failed status in
> do_xaction_device if any iteration of the loop produces a failure.
> Return this saved status instead of returning the status of the last
> device processed.

I think "this is the way", at least it follows what cxl/memdev.c is
doing. However we have ended up with an error scheme per tool when it
comes to reporting errors for multi-device operations.

cxl/memdev.c: report the first error

daxctl/device.c: now fixed to report the first error

ndctl/namespace.c: reports last result (same daxctl/device.c issue), unless in the greedy-create case

cxl/region.c: reports the last error even if that is not the last
result, immune to the above bug, but why different?

The struggle here is that all of these tools continue on error, so it
has always been the case that the only way to get a reliable error code
vs action carried out is to not use the "all" or "multi-device" ways to
specify the devices to operate upon.

I don't have a good answer besides, be careful when using "all".

It might make sense to bring ndctl/namespace.c in line to guarantee
"unless 100% of the attempts are successful the command reports
failure". However, it might be too late to make that change there if it
breaks people's scripts. ndctl/namespace.c does not suffer from needing
to know that namspaceX.0 can not be deleted since the deletion there is
exclusively done by setting namespace size to zero.

I think this daxctl change has a low risk of breaking folks because the
primary failure case is fixed to swallow the error.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Verma, Vishal L April 12, 2024, 9:42 p.m. UTC | #2
On Fri, 2024-04-12 at 14:30 -0700, Dan Williams wrote:
> Vishal Verma wrote:
> > The loop through the provided list of devices in do_xaction_device()
> > returns the status based on whatever the last device did. Since the
> > order of processing devices, especially in cases like the 'all' keyword,
> > can be effectively random, this can lead to the same command, and same
> > effects, exiting with a different error code based on device ordering.
> > 
> > This was noticed with flakiness in the daxctl-create.sh unit test. Its
> > 'destroy-device all' command would either pass or fail based on the
> > order it tried to destroy devices in. (Recall that until now, destroying
> > a daxX.0 device would result in a failure).
> > 
> > Make this slightly more consistent by saving a failed status in
> > do_xaction_device if any iteration of the loop produces a failure.
> > Return this saved status instead of returning the status of the last
> > device processed.
> 
> I think "this is the way", at least it follows what cxl/memdev.c is
> doing. However we have ended up with an error scheme per tool when it
> comes to reporting errors for multi-device operations.
> 
> cxl/memdev.c: report the first error
> 
> daxctl/device.c: now fixed to report the first error

With this patch it's actually reporting the last error, not first.

> 
> ndctl/namespace.c: reports last result (same daxctl/device.c issue), unless in the greedy-create case
> 
> cxl/region.c: reports the last error even if that is not the last
> result, immune to the above bug, but why different?

I guess I always preferred last error instead of first, but happy to
change these to last to match cxl/memdev.c. I can see how first error
is probably slightly better (we'd normally want to know about the first
thing that fails).

> 
> The struggle here is that all of these tools continue on error, so it
> has always been the case that the only way to get a reliable error code
> vs action carried out is to not use the "all" or "multi-device" ways to
> specify the devices to operate upon.
> 
> I don't have a good answer besides, be careful when using "all".
> 
> It might make sense to bring ndctl/namespace.c in line to guarantee
> "unless 100% of the attempts are successful the command reports
> failure". However, it might be too late to make that change there if it
> breaks people's scripts. ndctl/namespace.c does not suffer from needing
> to know that namspaceX.0 can not be deleted since the deletion there is
> exclusively done by setting namespace size to zero.

Agreed, I'm onboard with holding off changes to ndctl/namespace.c
unless there's a reported problem, or we hit a test suite failure or
something down the line.

> 
> I think this daxctl change has a low risk of breaking folks because the
> primary failure case is fixed to swallow the error.
> 
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>

Thanks Dan!
Dave Jiang April 15, 2024, 4:19 p.m. UTC | #3
On 4/12/24 2:05 PM, Vishal Verma wrote:
> The loop through the provided list of devices in do_xaction_device()
> returns the status based on whatever the last device did. Since the
> order of processing devices, especially in cases like the 'all' keyword,
> can be effectively random, this can lead to the same command, and same
> effects, exiting with a different error code based on device ordering.
> 
> This was noticed with flakiness in the daxctl-create.sh unit test. Its
> 'destroy-device all' command would either pass or fail based on the
> order it tried to destroy devices in. (Recall that until now, destroying
> a daxX.0 device would result in a failure).
> 
> Make this slightly more consistent by saving a failed status in
> do_xaction_device if any iteration of the loop produces a failure.
> Return this saved status instead of returning the status of the last
> device processed.
> 
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
>  daxctl/device.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/daxctl/device.c b/daxctl/device.c
> index 83c61389..14d62148 100644
> --- a/daxctl/device.c
> +++ b/daxctl/device.c
> @@ -1012,7 +1012,7 @@ static int do_xaction_device(const char *device, enum device_action action,
>  	struct json_object *jdevs = NULL;
>  	struct daxctl_region *region;
>  	struct daxctl_dev *dev;
> -	int rc = -ENXIO;
> +	int rc = -ENXIO, saved_rc = 0;
>  
>  	*processed = 0;
>  
> @@ -1059,6 +1059,8 @@ static int do_xaction_device(const char *device, enum device_action action,
>  				rc = -EINVAL;
>  				break;
>  			}
> +			if (rc)
> +				saved_rc = rc;
>  		}
>  	}
>  
> @@ -1070,7 +1072,7 @@ static int do_xaction_device(const char *device, enum device_action action,
>  	if (jdevs)
>  		util_display_json_array(stdout, jdevs, flags);
>  
> -	return rc;
> +	return saved_rc;
>  }
>  
>  int cmd_create_device(int argc, const char **argv, struct daxctl_ctx *ctx)
>
diff mbox series

Patch

diff --git a/daxctl/device.c b/daxctl/device.c
index 83c61389..14d62148 100644
--- a/daxctl/device.c
+++ b/daxctl/device.c
@@ -1012,7 +1012,7 @@  static int do_xaction_device(const char *device, enum device_action action,
 	struct json_object *jdevs = NULL;
 	struct daxctl_region *region;
 	struct daxctl_dev *dev;
-	int rc = -ENXIO;
+	int rc = -ENXIO, saved_rc = 0;
 
 	*processed = 0;
 
@@ -1059,6 +1059,8 @@  static int do_xaction_device(const char *device, enum device_action action,
 				rc = -EINVAL;
 				break;
 			}
+			if (rc)
+				saved_rc = rc;
 		}
 	}
 
@@ -1070,7 +1072,7 @@  static int do_xaction_device(const char *device, enum device_action action,
 	if (jdevs)
 		util_display_json_array(stdout, jdevs, flags);
 
-	return rc;
+	return saved_rc;
 }
 
 int cmd_create_device(int argc, const char **argv, struct daxctl_ctx *ctx)