diff mbox series

[3/3] cxl/region: Fix state transitions after reset failure

Message ID 168696507968.3590522.14484000711718573626.stgit@dwillia2-xfh.jf.intel.com
State Accepted
Commit adfe19738b71a893da62cb2e30bd6bdb4299ea67
Series cxl/region: Cache management and region decode reset fixes

Commit Message

Dan Williams June 17, 2023, 1:24 a.m. UTC
Jonathan reports that failed attempts to reset a region (tear down its
HDM decoder configuration) mistakenly advance the state of the region
to "not committed". Revert to the previous state of the region on reset
failure so that the reset can be re-attempted.

Reported-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Closes: http://lore.kernel.org/r/20230316171441.0000205b@Huawei.com
Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/core/region.c |   26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

Comments

Dave Jiang June 21, 2023, 12:06 a.m. UTC | #1
On 6/16/23 18:24, Dan Williams wrote:
> Jonathan reports that failed attempts to reset a region (teardown its
> HDM decoder configuration) mistakenly advance the state of the region
> to "not committed". Revert to the previous state of the region on reset
> failure so that the reset can be re-attempted.
> 
> Reported-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> Closes: http://lore.kernel.org/r/20230316171441.0000205b@Huawei.com
> Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware")
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> ---
>   drivers/cxl/core/region.c |   26 +++++++++++++++-----------
>   1 file changed, 15 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 31f498f0fb3a..38db377e13f1 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -296,9 +296,11 @@ static ssize_t commit_store(struct device *dev, struct device_attribute *attr,
>   	if (rc)
>   		return rc;
>   
> -	if (commit)
> +	if (commit) {
>   		rc = cxl_region_decode_commit(cxlr);
> -	else {
> +		if (rc == 0)
> +			p->state = CXL_CONFIG_COMMIT;
> +	} else {
>   		p->state = CXL_CONFIG_RESET_PENDING;
>   		up_write(&cxl_region_rwsem);
>   		device_release_driver(&cxlr->dev);
> @@ -308,18 +310,20 @@ static ssize_t commit_store(struct device *dev, struct device_attribute *attr,
>   		 * The lock was dropped, so need to revalidate that the reset is
>   		 * still pending.
>   		 */
> -		if (p->state == CXL_CONFIG_RESET_PENDING)
> +		if (p->state == CXL_CONFIG_RESET_PENDING) {
>   			rc = cxl_region_decode_reset(cxlr, p->interleave_ways);
> +			/*
> +			 * Revert to committed since there may still be active
> +			 * decoders associated with this region, or move forward
> +			 * to active to mark the reset successful
> +			 */
> +			if (rc)
> +				p->state = CXL_CONFIG_COMMIT;
> +			else
> +				p->state = CXL_CONFIG_ACTIVE;
> +		}
>   	}
>   
> -	if (rc)
> -		goto out;
> -
> -	if (commit)
> -		p->state = CXL_CONFIG_COMMIT;
> -	else if (p->state == CXL_CONFIG_RESET_PENDING)
> -		p->state = CXL_CONFIG_ACTIVE;
> -
>   out:
>   	up_write(&cxl_region_rwsem);
>   
>
Jonathan Cameron June 22, 2023, 9:18 a.m. UTC | #2
On Fri, 16 Jun 2023 18:24:39 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> Jonathan reports that failed attempts to reset a region (teardown its
> HDM decoder configuration) mistakenly advance the state of the region
> to "not committed". Revert to the previous state of the region on reset
> failure so that the reset can be re-attempted.
> 
> Reported-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> Closes: http://lore.kernel.org/r/20230316171441.0000205b@Huawei.com
> Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware")
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
LGTM - though maybe even nicer if we can be pretty sure this will succeed
before trying it. (Same comment as on the previous patch.)

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
>  drivers/cxl/core/region.c |   26 +++++++++++++++-----------
>  1 file changed, 15 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 31f498f0fb3a..38db377e13f1 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -296,9 +296,11 @@ static ssize_t commit_store(struct device *dev, struct device_attribute *attr,
>  	if (rc)
>  		return rc;
>  
> -	if (commit)
> +	if (commit) {
>  		rc = cxl_region_decode_commit(cxlr);
> -	else {
> +		if (rc == 0)
> +			p->state = CXL_CONFIG_COMMIT;
> +	} else {
>  		p->state = CXL_CONFIG_RESET_PENDING;
>  		up_write(&cxl_region_rwsem);
>  		device_release_driver(&cxlr->dev);
> @@ -308,18 +310,20 @@ static ssize_t commit_store(struct device *dev, struct device_attribute *attr,
>  		 * The lock was dropped, so need to revalidate that the reset is
>  		 * still pending.
>  		 */
> -		if (p->state == CXL_CONFIG_RESET_PENDING)
> +		if (p->state == CXL_CONFIG_RESET_PENDING) {
>  			rc = cxl_region_decode_reset(cxlr, p->interleave_ways);
> +			/*
> +			 * Revert to committed since there may still be active
> +			 * decoders associated with this region, or move forward
> +			 * to active to mark the reset successful
> +			 */
> +			if (rc)
> +				p->state = CXL_CONFIG_COMMIT;
> +			else
> +				p->state = CXL_CONFIG_ACTIVE;
> +		}
>  	}
>  
> -	if (rc)
> -		goto out;
> -
> -	if (commit)
> -		p->state = CXL_CONFIG_COMMIT;
> -	else if (p->state == CXL_CONFIG_RESET_PENDING)
> -		p->state = CXL_CONFIG_ACTIVE;
> -
>  out:
>  	up_write(&cxl_region_rwsem);
>  
>
Dan Williams June 25, 2023, 8:42 p.m. UTC | #3
Jonathan Cameron wrote:
> On Fri, 16 Jun 2023 18:24:39 -0700
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > Jonathan reports that failed attempts to reset a region (teardown its
> > HDM decoder configuration) mistakenly advance the state of the region
> > to "not committed". Revert to the previous state of the region on reset
> > failure so that the reset can be re-attempted.
> > 
> > Reported-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> > Closes: http://lore.kernel.org/r/20230316171441.0000205b@Huawei.com
> > Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware")
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> LGTM - though maybe even nicer if we can be pretty sure this will succeed
> before trying it.. (same comment as previous patch)

I had the same reaction, but satisfied myself that this is something
that userspace can manage. I.e. tooling can effectively predict when the
kernel will complain about this ordering situation and prevent it. In
other words, the only way this happens in practice is if userspace makes
a mistake.

It is already the case that partially committed decoders need to be
tolerated by the platform since setup and teardown are not atomic. So I
think 'cxl destroy-region' is where these follow-on smarts belong.

> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Thanks for the collaboration as always.
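
For illustration only, here is a minimal userspace sketch of requesting a
region reset through the commit attribute that commit_store() backs. It is
not taken from the cxl tooling. The helper name region_request_reset() is
hypothetical, the attribute path is supplied by the caller, and the sketch
assumes the attribute accepts "0" as a false boolean. With this patch
applied, a failed request leaves the region in the committed state, so the
caller can correct the teardown ordering and call the helper again.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write "0" to a region's commit attribute.
 * Returns 0 on success or a negative errno value on failure.
 */
static int region_request_reset(const char *commit_attr_path)
{
	ssize_t n;
	int fd, rc = 0;

	assert(commit_attr_path != NULL);

	fd = open(commit_attr_path, O_WRONLY);
	if (fd < 0)
		return -errno;

	/* A failed reset is reported by the kernel as a write error. */
	n = write(fd, "0", 1);
	if (n < 0)
		rc = -errno;
	else if (n != 1)
		rc = -EIO;

	close(fd);
	return rc;
}
```

A caller might pass a path such as /sys/bus/cxl/devices/region0/commit (an
assumed example) and, on a negative return, reorder the teardown before
retrying rather than looping blindly.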

Patch

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 31f498f0fb3a..38db377e13f1 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -296,9 +296,11 @@  static ssize_t commit_store(struct device *dev, struct device_attribute *attr,
 	if (rc)
 		return rc;
 
-	if (commit)
+	if (commit) {
 		rc = cxl_region_decode_commit(cxlr);
-	else {
+		if (rc == 0)
+			p->state = CXL_CONFIG_COMMIT;
+	} else {
 		p->state = CXL_CONFIG_RESET_PENDING;
 		up_write(&cxl_region_rwsem);
 		device_release_driver(&cxlr->dev);
@@ -308,18 +310,20 @@  static ssize_t commit_store(struct device *dev, struct device_attribute *attr,
 		 * The lock was dropped, so need to revalidate that the reset is
 		 * still pending.
 		 */
-		if (p->state == CXL_CONFIG_RESET_PENDING)
+		if (p->state == CXL_CONFIG_RESET_PENDING) {
 			rc = cxl_region_decode_reset(cxlr, p->interleave_ways);
+			/*
+			 * Revert to committed since there may still be active
+			 * decoders associated with this region, or move forward
+			 * to active to mark the reset successful
+			 */
+			if (rc)
+				p->state = CXL_CONFIG_COMMIT;
+			else
+				p->state = CXL_CONFIG_ACTIVE;
+		}
 	}
 
-	if (rc)
-		goto out;
-
-	if (commit)
-		p->state = CXL_CONFIG_COMMIT;
-	else if (p->state == CXL_CONFIG_RESET_PENDING)
-		p->state = CXL_CONFIG_ACTIVE;
-
 out:
 	up_write(&cxl_region_rwsem);
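
The state transitions that result from the patch can be modeled outside the
kernel. The sketch below is a simplified stand-in, not the driver code. The
enum, struct region_model, and the model_* functions are hypothetical names,
the decode results are injected through the commit_rc and reset_rc fields,
and the locking, driver unbind, and the earlier validation in commit_store()
are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the region states named in the patch. */
enum region_state {
	REGION_ACTIVE,		/* models CXL_CONFIG_ACTIVE */
	REGION_COMMIT,		/* models CXL_CONFIG_COMMIT */
	REGION_RESET_PENDING,	/* models CXL_CONFIG_RESET_PENDING */
};

struct region_model {
	enum region_state state;
	int commit_rc;	/* injected result of the modeled decode commit */
	int reset_rc;	/* injected result of the modeled decode reset */
};

static int model_decode_commit(const struct region_model *r)
{
	return r->commit_rc;
}

static int model_decode_reset(const struct region_model *r)
{
	return r->reset_rc;
}

/* Mirrors the post-patch flow: state only advances on success. */
static int model_commit_store(struct region_model *r, bool commit)
{
	int rc;

	assert(r != NULL);

	if (commit) {
		rc = model_decode_commit(r);
		if (rc == 0)
			r->state = REGION_COMMIT;
		return rc;
	}

	r->state = REGION_RESET_PENDING;
	/* The kernel drops the lock and unbinds the driver here. */
	rc = model_decode_reset(r);
	if (rc)
		r->state = REGION_COMMIT;	/* revert so reset can be retried */
	else
		r->state = REGION_ACTIVE;	/* reset completed */
	return rc;
}
```

In this model a failed reset returns the region to REGION_COMMIT rather than
leaving it looking uncommitted, which is the behavior the commit message
describes.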