Message ID | 168696507423.3590522.16254212607926684429.stgit@dwillia2-xfh.jf.intel.com |
---|---|
State | Accepted |
Commit | 2ab47045ac96a605e3037d479a7d5854570ee5bf |
Headers | show |
Series | cxl/region: Cache management and region decode reset fixes | expand |
On 6/16/23 18:24, Dan Williams wrote: > cxl_region_decode_reset() walks all the decoders associated with a given > region and disables them. Due to decoder ordering rules it is possible > that a switch in the topology notices that a given decoder can not be > shutdown before another region with a higher HPA is shutdown first. That > can leave the region in a partially committed state. > > Capture that state in a new CXL_REGION_F_NEEDS_RESET flag and require > that a successful cxl_region_decode_reset() attempt must be completed > before cxl_region_probe() accepts the region. > > This is a corollary for the bug that Jonathan identified in "CXL/region > : commit reset of out of order region appears to succeed." [1]. > > Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com> > Link: http://lore.kernel.org/r/20230316171441.0000205b@Huawei.com [1] > Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware") > Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> > --- > drivers/cxl/core/region.c | 12 ++++++++++++ > drivers/cxl/cxl.h | 8 ++++++++ > 2 files changed, 20 insertions(+) > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c > index d48c5916f2e9..31f498f0fb3a 100644 > --- a/drivers/cxl/core/region.c > +++ b/drivers/cxl/core/region.c > @@ -182,14 +182,19 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count) > rc = cxld->reset(cxld); > if (rc) > return rc; > + set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); > } > > endpoint_reset: > rc = cxled->cxld.reset(&cxled->cxld); > if (rc) > return rc; > + set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); > } > > + /* all decoders associated with this region have been torn down */ > + clear_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); > + > return 0; > } > > @@ -2876,6 +2881,13 @@ static int cxl_region_probe(struct device *dev) > goto out; > } > > + if (test_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags)) { > + dev_err(&cxlr->dev, > + "failed to activate, re-commit region and retry\n"); > + rc = -ENXIO; > + goto out; > + } > + > /* > * From this point on any path that changes the region's state away from > * CXL_CONFIG_COMMIT is also responsible for releasing the driver. > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h > index 187abd8ba673..cd7bf6697a51 100644 > --- a/drivers/cxl/cxl.h > +++ b/drivers/cxl/cxl.h > @@ -469,6 +469,14 @@ struct cxl_region_params { > */ > #define CXL_REGION_F_AUTO 0 > > +/* > + * Require that a committed region successfully complete a teardown once > + * any of its associated decoders have been torn down. This maintains > + * the commit state for the region since there are committed decoders, > + * but blocks cxl_region_probe(). > + */ > +#define CXL_REGION_F_NEEDS_RESET 1 > + > /** > * struct cxl_region - CXL region > * @dev: This region's device >
On Fri, 16 Jun 2023 18:24:34 -0700 Dan Williams <dan.j.williams@intel.com> wrote: > cxl_region_decode_reset() walks all the decoders associated with a given > region and disables them. Due to decoder ordering rules it is possible > that a switch in the topology notices that a given decoder can not be > shutdown before another region with a higher HPA is shutdown first. That > can leave the region in a partially committed state. > > Capture that state in a new CXL_REGION_F_NEEDS_RESET flag and require > that a successful cxl_region_decode_reset() attempt must be completed > before cxl_region_probe() accepts the region. > > This is a corollary for the bug that Jonathan identified in "CXL/region > : commit reset of out of order region appears to succeed." [1]. > > Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com> > Link: http://lore.kernel.org/r/20230316171441.0000205b@Huawei.com [1] > Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware") > Signed-off-by: Dan Williams <dan.j.williams@intel.com> I'm sure I understood this flow better a while back.... I think this is right. The subtlety that had me for a while was a fialure of the first cxld->reset() but if that fails we can assume that all decoders are still in place so I think we are fine. I wonder if a better approach (though heavy for a fix?) is a prepass that checks it is possible to tear the region down, followed by actually doing it (all under appropriate locking so state doesn't change on us) That way no intermediate half torn down states to worry about in the state machine. From a spec side of things I've never been convinced the ordering rules make sense. Some hardware / firmware simplification at cost of nasty software limitations. Jonathan > --- > drivers/cxl/core/region.c | 12 ++++++++++++ > drivers/cxl/cxl.h | 8 ++++++++ > 2 files changed, 20 insertions(+) > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c > index d48c5916f2e9..31f498f0fb3a 100644 > --- a/drivers/cxl/core/region.c > +++ b/drivers/cxl/core/region.c > @@ -182,14 +182,19 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count) > rc = cxld->reset(cxld); > if (rc) > return rc; > + set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); > } > > endpoint_reset: > rc = cxled->cxld.reset(&cxled->cxld); > if (rc) > return rc; > + set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); > } > > + /* all decoders associated with this region have been torn down */ > + clear_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); > + > return 0; > } > > @@ -2876,6 +2881,13 @@ static int cxl_region_probe(struct device *dev) > goto out; > } > > + if (test_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags)) { > + dev_err(&cxlr->dev, > + "failed to activate, re-commit region and retry\n"); > + rc = -ENXIO; > + goto out; > + } > + > /* > * From this point on any path that changes the region's state away from > * CXL_CONFIG_COMMIT is also responsible for releasing the driver. > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h > index 187abd8ba673..cd7bf6697a51 100644 > --- a/drivers/cxl/cxl.h > +++ b/drivers/cxl/cxl.h > @@ -469,6 +469,14 @@ struct cxl_region_params { > */ > #define CXL_REGION_F_AUTO 0 > > +/* > + * Require that a committed region successfully complete a teardown once > + * any of its associated decoders have been torn down. This maintains > + * the commit state for the region since there are committed decoders, > + * but blocks cxl_region_probe(). > + */ > +#define CXL_REGION_F_NEEDS_RESET 1 > + > /** > * struct cxl_region - CXL region > * @dev: This region's device >
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c index d48c5916f2e9..31f498f0fb3a 100644 --- a/drivers/cxl/core/region.c +++ b/drivers/cxl/core/region.c @@ -182,14 +182,19 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count) rc = cxld->reset(cxld); if (rc) return rc; + set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); } endpoint_reset: rc = cxled->cxld.reset(&cxled->cxld); if (rc) return rc; + set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); } + /* all decoders associated with this region have been torn down */ + clear_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); + return 0; } @@ -2876,6 +2881,13 @@ static int cxl_region_probe(struct device *dev) goto out; } + if (test_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags)) { + dev_err(&cxlr->dev, + "failed to activate, re-commit region and retry\n"); + rc = -ENXIO; + goto out; + } + /* * From this point on any path that changes the region's state away from * CXL_CONFIG_COMMIT is also responsible for releasing the driver. diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index 187abd8ba673..cd7bf6697a51 100644 --- a/drivers/cxl/cxl.h +++ b/drivers/cxl/cxl.h @@ -469,6 +469,14 @@ struct cxl_region_params { */ #define CXL_REGION_F_AUTO 0 +/* + * Require that a committed region successfully complete a teardown once + * any of its associated decoders have been torn down. This maintains + * the commit state for the region since there are committed decoders, + * but blocks cxl_region_probe(). + */ +#define CXL_REGION_F_NEEDS_RESET 1 + /** * struct cxl_region - CXL region * @dev: This region's device
cxl_region_decode_reset() walks all the decoders associated with a given region and disables them. Due to decoder ordering rules it is possible that a switch in the topology notices that a given decoder can not be shutdown before another region with a higher HPA is shutdown first. That can leave the region in a partially committed state. Capture that state in a new CXL_REGION_F_NEEDS_RESET flag and require that a successful cxl_region_decode_reset() attempt must be completed before cxl_region_probe() accepts the region. This is a corollary for the bug that Jonathan identified in "CXL/region : commit reset of out of order region appears to succeed." [1]. Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com> Link: http://lore.kernel.org/r/20230316171441.0000205b@Huawei.com [1] Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware") Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- drivers/cxl/core/region.c | 12 ++++++++++++ drivers/cxl/cxl.h | 8 ++++++++ 2 files changed, 20 insertions(+)