diff mbox series

[2/3] cxl/region: Flag partially torn down regions as unusable

Message ID 168696507423.3590522.16254212607926684429.stgit@dwillia2-xfh.jf.intel.com
State Accepted
Commit 2ab47045ac96a605e3037d479a7d5854570ee5bf
Headers show
Series cxl/region: Cache management and region decode reset fixes | expand

Commit Message

Dan Williams June 17, 2023, 1:24 a.m. UTC
cxl_region_decode_reset() walks all the decoders associated with a given
region and disables them. Due to decoder ordering rules it is possible
that a switch in the topology notices that a given decoder can not be
shutdown before another region with a higher HPA is shutdown first. That
can leave the region in a partially committed state.

Capture that state in a new CXL_REGION_F_NEEDS_RESET flag and require
that a successful cxl_region_decode_reset() attempt must be completed
before cxl_region_probe() accepts the region.

This is a corollary for the bug that Jonathan identified in "CXL/region
:  commit reset of out of order region appears to succeed." [1].

Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Link: http://lore.kernel.org/r/20230316171441.0000205b@Huawei.com [1]
Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/core/region.c |   12 ++++++++++++
 drivers/cxl/cxl.h         |    8 ++++++++
 2 files changed, 20 insertions(+)

Comments

Dave Jiang June 21, 2023, 12:04 a.m. UTC | #1
On 6/16/23 18:24, Dan Williams wrote:
> cxl_region_decode_reset() walks all the decoders associated with a given
> region and disables them. Due to decoder ordering rules it is possible
> that a switch in the topology notices that a given decoder can not be
> shutdown before another region with a higher HPA is shutdown first. That
> can leave the region in a partially committed state.
> 
> Capture that state in a new CXL_REGION_F_NEEDS_RESET flag and require
> that a successful cxl_region_decode_reset() attempt must be completed
> before cxl_region_probe() accepts the region.
> 
> This is a corollary for the bug that Jonathan identified in "CXL/region
> :  commit reset of out of order region appears to succeed." [1].
> 
> Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> Link: http://lore.kernel.org/r/20230316171441.0000205b@Huawei.com [1]
> Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware")
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
>   drivers/cxl/core/region.c |   12 ++++++++++++
>   drivers/cxl/cxl.h         |    8 ++++++++
>   2 files changed, 20 insertions(+)
> 
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index d48c5916f2e9..31f498f0fb3a 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -182,14 +182,19 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
>   				rc = cxld->reset(cxld);
>   			if (rc)
>   				return rc;
> +			set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
>   		}
>   
>   endpoint_reset:
>   		rc = cxled->cxld.reset(&cxled->cxld);
>   		if (rc)
>   			return rc;
> +		set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
>   	}
>   
> +	/* all decoders associated with this region have been torn down */
> +	clear_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
> +
>   	return 0;
>   }
>   
> @@ -2876,6 +2881,13 @@ static int cxl_region_probe(struct device *dev)
>   		goto out;
>   	}
>   
> +	if (test_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags)) {
> +		dev_err(&cxlr->dev,
> +			"failed to activate, re-commit region and retry\n");
> +		rc = -ENXIO;
> +		goto out;
> +	}
> +
>   	/*
>   	 * From this point on any path that changes the region's state away from
>   	 * CXL_CONFIG_COMMIT is also responsible for releasing the driver.
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 187abd8ba673..cd7bf6697a51 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -469,6 +469,14 @@ struct cxl_region_params {
>    */
>   #define CXL_REGION_F_AUTO 0
>   
> +/*
> + * Require that a committed region successfully complete a teardown once
> + * any of its associated decoders have been torn down. This maintains
> + * the commit state for the region since there are committed decoders,
> + * but blocks cxl_region_probe().
> + */
> +#define CXL_REGION_F_NEEDS_RESET 1
> +
>   /**
>    * struct cxl_region - CXL region
>    * @dev: This region's device
>
Jonathan Cameron June 22, 2023, 9:14 a.m. UTC | #2
On Fri, 16 Jun 2023 18:24:34 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> cxl_region_decode_reset() walks all the decoders associated with a given
> region and disables them. Due to decoder ordering rules it is possible
> that a switch in the topology notices that a given decoder can not be
> shutdown before another region with a higher HPA is shutdown first. That
> can leave the region in a partially committed state.
> 
> Capture that state in a new CXL_REGION_F_NEEDS_RESET flag and require
> that a successful cxl_region_decode_reset() attempt must be completed
> before cxl_region_probe() accepts the region.
> 
> This is a corollary for the bug that Jonathan identified in "CXL/region
> :  commit reset of out of order region appears to succeed." [1].
> 
> Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> Link: http://lore.kernel.org/r/20230316171441.0000205b@Huawei.com [1]
> Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware")
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

I'm sure I understood this flow better a while back....
I think this is right. The subtlety that had me for a while was a fialure
of the first cxld->reset() but if that fails we can assume that all decoders
are still in place so I think we are fine.

I wonder if a better approach (though heavy for a fix?) is a prepass
that checks it is possible to tear the region down, followed by actually doing it
(all under appropriate locking so state doesn't change on us)
That way no intermediate half torn down states to worry about in the state machine.

From a spec side of things I've never been convinced the ordering rules make sense.
Some hardware / firmware simplification at cost of nasty software limitations.

Jonathan


> ---
>  drivers/cxl/core/region.c |   12 ++++++++++++
>  drivers/cxl/cxl.h         |    8 ++++++++
>  2 files changed, 20 insertions(+)
> 
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index d48c5916f2e9..31f498f0fb3a 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -182,14 +182,19 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
>  				rc = cxld->reset(cxld);
>  			if (rc)
>  				return rc;
> +			set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
>  		}
>  
>  endpoint_reset:
>  		rc = cxled->cxld.reset(&cxled->cxld);
>  		if (rc)
>  			return rc;
> +		set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
>  	}
>  
> +	/* all decoders associated with this region have been torn down */
> +	clear_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
> +
>  	return 0;
>  }
>  
> @@ -2876,6 +2881,13 @@ static int cxl_region_probe(struct device *dev)
>  		goto out;
>  	}
>  
> +	if (test_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags)) {
> +		dev_err(&cxlr->dev,
> +			"failed to activate, re-commit region and retry\n");
> +		rc = -ENXIO;
> +		goto out;
> +	}
> +
>  	/*
>  	 * From this point on any path that changes the region's state away from
>  	 * CXL_CONFIG_COMMIT is also responsible for releasing the driver.
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 187abd8ba673..cd7bf6697a51 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -469,6 +469,14 @@ struct cxl_region_params {
>   */
>  #define CXL_REGION_F_AUTO 0
>  
> +/*
> + * Require that a committed region successfully complete a teardown once
> + * any of its associated decoders have been torn down. This maintains
> + * the commit state for the region since there are committed decoders,
> + * but blocks cxl_region_probe().
> + */
> +#define CXL_REGION_F_NEEDS_RESET 1
> +
>  /**
>   * struct cxl_region - CXL region
>   * @dev: This region's device
>
diff mbox series

Patch

diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index d48c5916f2e9..31f498f0fb3a 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -182,14 +182,19 @@  static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
 				rc = cxld->reset(cxld);
 			if (rc)
 				return rc;
+			set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
 		}
 
 endpoint_reset:
 		rc = cxled->cxld.reset(&cxled->cxld);
 		if (rc)
 			return rc;
+		set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
 	}
 
+	/* all decoders associated with this region have been torn down */
+	clear_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
+
 	return 0;
 }
 
@@ -2876,6 +2881,13 @@  static int cxl_region_probe(struct device *dev)
 		goto out;
 	}
 
+	if (test_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags)) {
+		dev_err(&cxlr->dev,
+			"failed to activate, re-commit region and retry\n");
+		rc = -ENXIO;
+		goto out;
+	}
+
 	/*
 	 * From this point on any path that changes the region's state away from
 	 * CXL_CONFIG_COMMIT is also responsible for releasing the driver.
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 187abd8ba673..cd7bf6697a51 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -469,6 +469,14 @@  struct cxl_region_params {
  */
 #define CXL_REGION_F_AUTO 0
 
+/*
+ * Require that a committed region successfully complete a teardown once
+ * any of its associated decoders have been torn down. This maintains
+ * the commit state for the region since there are committed decoders,
+ * but blocks cxl_region_probe().
+ */
+#define CXL_REGION_F_NEEDS_RESET 1
+
 /**
  * struct cxl_region - CXL region
  * @dev: This region's device