Message ID | 172862486548.2150669.3548553804904171839.stgit@dwillia2-xfh.jf.intel.com |
---|---|
State | Superseded |
Headers | show |
Series | cxl: Initialization and shutdown fixes | expand |
On 2024/10/11 13:34, Dan Williams wrote: > In support of investigating an initialization failure report [1], > cxl_test was updated to register mock memory-devices after the mock > root-port/bus device had been registered. That led to cxl_test crashing > with a use-after-free bug with the following signature: > > cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1 > cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1 > cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0 > 1) cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1 > [..] > cxld_unregister: cxl decoder14.0: > cxl_region_decode_reset: cxl_region region3: > mock_decoder_reset: cxl_port port3: decoder3.0 reset > 2) mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1 > cxl_endpoint_decoder_release: cxl decoder14.0: > [..] > cxld_unregister: cxl decoder7.0: > 3) cxl_region_decode_reset: cxl_region region3: > Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI > [..] > RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core] > [..] > Call Trace: > <TASK> > cxl_region_decode_reset+0x69/0x190 [cxl_core] > cxl_region_detach+0xe8/0x210 [cxl_core] > cxl_decoder_kill_region+0x27/0x40 [cxl_core] > cxld_unregister+0x5d/0x60 [cxl_core] > > At 1) a region has been established with 2 endpoint decoders (7.0 and > 14.0). Those endpoints share a common switch-decoder in the topology > (3.0). At teardown, 2), decoder14.0 is the first to be removed and hits > the "out of order reset case" in the switch decoder. The effect though > is that region3 cleanup is aborted leaving it in-tact and > referencing decoder14.0. At 3) the second attempt to teardown region3 > trips over the stale decoder14.0 object which has long since been > deleted. > > The fix here is to recognize that the CXL specification places no > mandate on in-order shutdown of switch-decoders, the driver enforces > in-order allocation, and hardware enforces in-order commit. So, rather > than fail and leave objects dangling, always remove them. > > In support of making cxl_region_decode_reset() always succeed, > cxl_region_invalidate_memregion() failures are turned into warnings. > Crashing the kernel is ok there since system integrity is at risk if > caches cannot be managed around physical address mutation events like > CXL region destruction. > > A new device_for_each_child_reverse_from() is added to cleanup > port->commit_end after all dependent decoders have been disabled. In > other words if decoders are allocated 0->1->2 and disabled 1->2->0 then > port->commit_end only decrements from 2 after 2 has been disabled, and > it decrements all the way to zero since 1 was disabled previously. > > Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1] > Cc: <stable@vger.kernel.org> > Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware") > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Cc: Davidlohr Bueso <dave@stgolabs.net> > Cc: Jonathan Cameron <jonathan.cameron@huawei.com> > Cc: Dave Jiang <dave.jiang@intel.com> > Cc: Alison Schofield <alison.schofield@intel.com> > Cc: Ira Weiny <ira.weiny@intel.com> > Cc: Zijun Hu <zijun_hu@icloud.com> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > --- > drivers/base/core.c | 35 +++++++++++++++++++++++++++++ > drivers/cxl/core/hdm.c | 50 +++++++++++++++++++++++++++++++++++------- > drivers/cxl/core/region.c | 48 +++++++++++----------------------------- > drivers/cxl/cxl.h | 3 ++- > include/linux/device.h | 3 +++ > tools/testing/cxl/test/cxl.c | 14 ++++-------- > 6 files changed, 100 insertions(+), 53 deletions(-) > > diff --git a/drivers/base/core.c b/drivers/base/core.c > index a4c853411a6b..e42f1ad73078 100644 > --- a/drivers/base/core.c > +++ b/drivers/base/core.c > @@ -4037,6 +4037,41 @@ int device_for_each_child_reverse(struct device *parent, void *data, > } > EXPORT_SYMBOL_GPL(device_for_each_child_reverse); > > +/** > + * device_for_each_child_reverse_from - device child iterator in reversed order. > + * @parent: parent struct device. > + * @from: optional starting point in child list > + * @fn: function to be called for each device. > + * @data: data for the callback. > + * > + * Iterate over @parent's child devices, starting at @from, and call @fn > + * for each, passing it @data. This helper is identical to > + * device_for_each_child_reverse() when @from is NULL. > + * > + * @fn is checked each iteration. If it returns anything other than 0, > + * iteration stop and that value is returned to the caller of > + * device_for_each_child_reverse_from(); > + */ > +int device_for_each_child_reverse_from(struct device *parent, > + struct device *from, const void *data, > + int (*fn)(struct device *, const void *)) > +{ > + struct klist_iter i; > + struct device *child; > + int error = 0; > + > + if (!parent->p) > + return 0; > + > + klist_iter_init_node(&parent->p->klist_children, &i, > + (from ? &from->p->knode_parent : NULL)); > + while ((child = prev_device(&i)) && !error) > + error = fn(child, data); > + klist_iter_exit(&i); > + return error; > +} > +EXPORT_SYMBOL_GPL(device_for_each_child_reverse_from); > + it does NOT deserve, also does NOT need to introduce a new core driver API device_for_each_child_reverse_from(). existing device_for_each_child_reverse() can do what the _from() wants to do. we can use similar approach as below link shown: https://lore.kernel.org/all/20240815-const_dfc_prepare-v2-2-8316b87b8ff9@quicinc.com/ > /** > * device_find_child - device iterator for locating a particular device. > * @parent: parent struct device > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c > index 3df10517a327..223c273c0cd1 100644 > --- a/drivers/cxl/core/hdm.c > +++ b/drivers/cxl/core/hdm.c > @@ -712,7 +712,44 @@ static int cxl_decoder_commit(struct cxl_decoder *cxld) > return 0; > } > > -static int cxl_decoder_reset(struct cxl_decoder *cxld) > +static int commit_reap(struct device *dev, const void *data) > +{ > + struct cxl_port *port = to_cxl_port(dev->parent); > + struct cxl_decoder *cxld; > + > + if (!is_switch_decoder(dev) && !is_endpoint_decoder(dev)) > + return 0; > + > + cxld = to_cxl_decoder(dev); > + if (port->commit_end == cxld->id && > + ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) { > + port->commit_end--; > + dev_dbg(&port->dev, "reap: %s commit_end: %d\n", > + dev_name(&cxld->dev), port->commit_end); > + } > + > + return 0; > +} > + > +void cxl_port_commit_reap(struct cxl_decoder *cxld) > +{ > + struct cxl_port *port = to_cxl_port(cxld->dev.parent); > + > + lockdep_assert_held_write(&cxl_region_rwsem); > + > + /* > + * Once the highest committed decoder is disabled, free any other > + * decoders that were pinned allocated by out-of-order release. > + */ > + port->commit_end--; > + dev_dbg(&port->dev, "reap: %s commit_end: %d\n", dev_name(&cxld->dev), > + port->commit_end); > + device_for_each_child_reverse_from(&port->dev, &cxld->dev, NULL, > + commit_reap); > +} > +EXPORT_SYMBOL_NS_GPL(cxl_port_commit_reap, CXL); > + > +static void cxl_decoder_reset(struct cxl_decoder *cxld) > { > struct cxl_port *port = to_cxl_port(cxld->dev.parent); > struct cxl_hdm *cxlhdm = dev_get_drvdata(&port->dev); > @@ -721,14 +758,14 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld) > u32 ctrl; > > if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0) > - return 0; > + return; > > - if (port->commit_end != id) { > + if (port->commit_end == id) > + cxl_port_commit_reap(cxld); > + else > dev_dbg(&port->dev, > "%s: out of order reset, expected decoder%d.%d\n", > dev_name(&cxld->dev), port->id, port->commit_end); > - return -EBUSY; > - } > > down_read(&cxl_dpa_rwsem); > ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id)); > @@ -741,7 +778,6 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld) > writel(0, hdm + CXL_HDM_DECODER0_BASE_LOW_OFFSET(id)); > up_read(&cxl_dpa_rwsem); > > - port->commit_end--; > cxld->flags &= ~CXL_DECODER_F_ENABLE; > > /* Userspace is now responsible for reconfiguring this decoder */ > @@ -751,8 +787,6 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld) > cxled = to_cxl_endpoint_decoder(&cxld->dev); > cxled->state = CXL_DECODER_STATE_MANUAL; > } > - > - return 0; > } > > static int cxl_setup_hdm_decoder_from_dvsec( > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c > index e701e4b04032..3478d2058303 100644 > --- a/drivers/cxl/core/region.c > +++ b/drivers/cxl/core/region.c > @@ -232,8 +232,8 @@ static int cxl_region_invalidate_memregion(struct cxl_region *cxlr) > "Bypassing cpu_cache_invalidate_memregion() for testing!\n"); > return 0; > } else { > - dev_err(&cxlr->dev, > - "Failed to synchronize CPU cache state\n"); > + dev_WARN(&cxlr->dev, > + "Failed to synchronize CPU cache state\n"); > return -ENXIO; > } > } > @@ -242,19 +242,17 @@ static int cxl_region_invalidate_memregion(struct cxl_region *cxlr) > return 0; > } > > -static int cxl_region_decode_reset(struct cxl_region *cxlr, int count) > +static void cxl_region_decode_reset(struct cxl_region *cxlr, int count) > { > struct cxl_region_params *p = &cxlr->params; > - int i, rc = 0; > + int i; > > /* > - * Before region teardown attempt to flush, and if the flush > - * fails cancel the region teardown for data consistency > - * concerns > + * Before region teardown attempt to flush, evict any data cached for > + * this region, or scream loudly about missing arch / platform support > + * for CXL teardown. > */ > - rc = cxl_region_invalidate_memregion(cxlr); > - if (rc) > - return rc; > + cxl_region_invalidate_memregion(cxlr); > > for (i = count - 1; i >= 0; i--) { > struct cxl_endpoint_decoder *cxled = p->targets[i]; > @@ -277,23 +275,17 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count) > cxl_rr = cxl_rr_load(iter, cxlr); > cxld = cxl_rr->decoder; > if (cxld->reset) > - rc = cxld->reset(cxld); > - if (rc) > - return rc; > + cxld->reset(cxld); > set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); > } > > endpoint_reset: > - rc = cxled->cxld.reset(&cxled->cxld); > - if (rc) > - return rc; > + cxled->cxld.reset(&cxled->cxld); > set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); > } > > /* all decoders associated with this region have been torn down */ > clear_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); > - > - return 0; > } > > static int commit_decoder(struct cxl_decoder *cxld) > @@ -409,16 +401,8 @@ static ssize_t commit_store(struct device *dev, struct device_attribute *attr, > * still pending. > */ > if (p->state == CXL_CONFIG_RESET_PENDING) { > - rc = cxl_region_decode_reset(cxlr, p->interleave_ways); > - /* > - * Revert to committed since there may still be active > - * decoders associated with this region, or move forward > - * to active to mark the reset successful > - */ > - if (rc) > - p->state = CXL_CONFIG_COMMIT; > - else > - p->state = CXL_CONFIG_ACTIVE; > + cxl_region_decode_reset(cxlr, p->interleave_ways); > + p->state = CXL_CONFIG_ACTIVE; > } > } > > @@ -2054,13 +2038,7 @@ static int cxl_region_detach(struct cxl_endpoint_decoder *cxled) > get_device(&cxlr->dev); > > if (p->state > CXL_CONFIG_ACTIVE) { > - /* > - * TODO: tear down all impacted regions if a device is > - * removed out of order > - */ > - rc = cxl_region_decode_reset(cxlr, p->interleave_ways); > - if (rc) > - goto out; > + cxl_region_decode_reset(cxlr, p->interleave_ways); > p->state = CXL_CONFIG_ACTIVE; > } > > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h > index 0d8b810a51f0..5406e3ab3d4a 100644 > --- a/drivers/cxl/cxl.h > +++ b/drivers/cxl/cxl.h > @@ -359,7 +359,7 @@ struct cxl_decoder { > struct cxl_region *region; > unsigned long flags; > int (*commit)(struct cxl_decoder *cxld); > - int (*reset)(struct cxl_decoder *cxld); > + void (*reset)(struct cxl_decoder *cxld); > }; > > /* > @@ -730,6 +730,7 @@ static inline bool is_cxl_root(struct cxl_port *port) > int cxl_num_decoders_committed(struct cxl_port *port); > bool is_cxl_port(const struct device *dev); > struct cxl_port *to_cxl_port(const struct device *dev); > +void cxl_port_commit_reap(struct cxl_decoder *cxld); > struct pci_bus; > int devm_cxl_register_pci_bus(struct device *host, struct device *uport_dev, > struct pci_bus *bus); > diff --git a/include/linux/device.h b/include/linux/device.h > index b4bde8d22697..667cb6db9019 100644 > --- a/include/linux/device.h > +++ b/include/linux/device.h > @@ -1078,6 +1078,9 @@ int device_for_each_child(struct device *dev, void *data, > int (*fn)(struct device *dev, void *data)); > int device_for_each_child_reverse(struct device *dev, void *data, > int (*fn)(struct device *dev, void *data)); > +int device_for_each_child_reverse_from(struct device *parent, > + struct device *from, const void *data, > + int (*fn)(struct device *, const void *)); > struct device *device_find_child(struct device *dev, void *data, > int (*match)(struct device *dev, void *data)); > struct device *device_find_child_by_name(struct device *parent, > diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c > index 90d5afd52dd0..c5bbd89b3192 100644 > --- a/tools/testing/cxl/test/cxl.c > +++ b/tools/testing/cxl/test/cxl.c > @@ -693,26 +693,22 @@ static int mock_decoder_commit(struct cxl_decoder *cxld) > return 0; > } > > -static int mock_decoder_reset(struct cxl_decoder *cxld) > +static void mock_decoder_reset(struct cxl_decoder *cxld) > { > struct cxl_port *port = to_cxl_port(cxld->dev.parent); > int id = cxld->id; > > if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0) > - return 0; > + return; > > dev_dbg(&port->dev, "%s reset\n", dev_name(&cxld->dev)); > - if (port->commit_end != id) { > + if (port->commit_end == id) > + cxl_port_commit_reap(cxld); > + else > dev_dbg(&port->dev, > "%s: out of order reset, expected decoder%d.%d\n", > dev_name(&cxld->dev), port->id, port->commit_end); > - return -EBUSY; > - } > - > - port->commit_end--; > cxld->flags &= ~CXL_DECODER_F_ENABLE; > - > - return 0; > } > > static void default_mock_decoder(struct cxl_decoder *cxld) >
Zijun Hu wrote: > On 2024/10/11 13:34, Dan Williams wrote: > > In support of investigating an initialization failure report [1], > > cxl_test was updated to register mock memory-devices after the mock > > root-port/bus device had been registered. That led to cxl_test crashing > > with a use-after-free bug with the following signature: > > > > cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1 > > cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1 > > cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0 > > 1) cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1 > > [..] > > cxld_unregister: cxl decoder14.0: > > cxl_region_decode_reset: cxl_region region3: > > mock_decoder_reset: cxl_port port3: decoder3.0 reset > > 2) mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1 > > cxl_endpoint_decoder_release: cxl decoder14.0: > > [..] > > cxld_unregister: cxl decoder7.0: > > 3) cxl_region_decode_reset: cxl_region region3: > > Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI > > [..] > > RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core] > > [..] > > Call Trace: > > <TASK> > > cxl_region_decode_reset+0x69/0x190 [cxl_core] > > cxl_region_detach+0xe8/0x210 [cxl_core] > > cxl_decoder_kill_region+0x27/0x40 [cxl_core] > > cxld_unregister+0x5d/0x60 [cxl_core] > > > > At 1) a region has been established with 2 endpoint decoders (7.0 and > > 14.0). Those endpoints share a common switch-decoder in the topology > > (3.0). At teardown, 2), decoder14.0 is the first to be removed and hits > > the "out of order reset case" in the switch decoder. The effect though > > is that region3 cleanup is aborted leaving it in-tact and > > referencing decoder14.0. At 3) the second attempt to teardown region3 > > trips over the stale decoder14.0 object which has long since been > > deleted. > > > > The fix here is to recognize that the CXL specification places no > > mandate on in-order shutdown of switch-decoders, the driver enforces > > in-order allocation, and hardware enforces in-order commit. So, rather > > than fail and leave objects dangling, always remove them. > > > > In support of making cxl_region_decode_reset() always succeed, > > cxl_region_invalidate_memregion() failures are turned into warnings. > > Crashing the kernel is ok there since system integrity is at risk if > > caches cannot be managed around physical address mutation events like > > CXL region destruction. > > > > A new device_for_each_child_reverse_from() is added to cleanup > > port->commit_end after all dependent decoders have been disabled. In > > other words if decoders are allocated 0->1->2 and disabled 1->2->0 then > > port->commit_end only decrements from 2 after 2 has been disabled, and > > it decrements all the way to zero since 1 was disabled previously. > > > > Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1] > > Cc: <stable@vger.kernel.org> > > Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware") > > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > > Cc: Davidlohr Bueso <dave@stgolabs.net> > > Cc: Jonathan Cameron <jonathan.cameron@huawei.com> > > Cc: Dave Jiang <dave.jiang@intel.com> > > Cc: Alison Schofield <alison.schofield@intel.com> > > Cc: Ira Weiny <ira.weiny@intel.com> > > Cc: Zijun Hu <zijun_hu@icloud.com> > > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > > --- > > drivers/base/core.c | 35 +++++++++++++++++++++++++++++ > > drivers/cxl/core/hdm.c | 50 +++++++++++++++++++++++++++++++++++------- > > drivers/cxl/core/region.c | 48 +++++++++++----------------------------- > > drivers/cxl/cxl.h | 3 ++- > > include/linux/device.h | 3 +++ > > tools/testing/cxl/test/cxl.c | 14 ++++-------- > > 6 files changed, 100 insertions(+), 53 deletions(-) > > > > diff --git a/drivers/base/core.c b/drivers/base/core.c > > index a4c853411a6b..e42f1ad73078 100644 > > --- a/drivers/base/core.c > > +++ b/drivers/base/core.c > > @@ -4037,6 +4037,41 @@ int device_for_each_child_reverse(struct device *parent, void *data, > > } > > EXPORT_SYMBOL_GPL(device_for_each_child_reverse); > > > > +/** > > + * device_for_each_child_reverse_from - device child iterator in reversed order. > > + * @parent: parent struct device. > > + * @from: optional starting point in child list > > + * @fn: function to be called for each device. > > + * @data: data for the callback. > > + * > > + * Iterate over @parent's child devices, starting at @from, and call @fn > > + * for each, passing it @data. This helper is identical to > > + * device_for_each_child_reverse() when @from is NULL. > > + * > > + * @fn is checked each iteration. If it returns anything other than 0, > > + * iteration stop and that value is returned to the caller of > > + * device_for_each_child_reverse_from(); > > + */ > > +int device_for_each_child_reverse_from(struct device *parent, > > + struct device *from, const void *data, > > + int (*fn)(struct device *, const void *)) > > +{ > > + struct klist_iter i; > > + struct device *child; > > + int error = 0; > > + > > + if (!parent->p) > > + return 0; > > + > > + klist_iter_init_node(&parent->p->klist_children, &i, > > + (from ? &from->p->knode_parent : NULL)); > > + while ((child = prev_device(&i)) && !error) > > + error = fn(child, data); > > + klist_iter_exit(&i); > > + return error; > > +} > > +EXPORT_SYMBOL_GPL(device_for_each_child_reverse_from); > > + > > it does NOT deserve, also does NOT need to introduce a new core driver > API device_for_each_child_reverse_from(). existing > device_for_each_child_reverse() can do what the _from() wants to do. > > we can use similar approach as below link shown: > https://lore.kernel.org/all/20240815-const_dfc_prepare-v2-2-8316b87b8ff9@quicinc.com/ No, just have a simple starting point parameter. I understand that more logic can be placed around device_for_each_child_reverse() to achieve the same effect, but the core helpers should be removing logic from consumers, not forcing them to add more. If bloat is a concern, then after your const cleanups go through device_for_each_child_reverse() can be rewritten in terms of device_for_each_child_reverse_from() as (untested): diff --git a/drivers/base/core.c b/drivers/base/core.c index e42f1ad73078..2571c910da46 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -4007,36 +4007,6 @@ int device_for_each_child(struct device *parent, void *data, } EXPORT_SYMBOL_GPL(device_for_each_child); -/** - * device_for_each_child_reverse - device child iterator in reversed order. - * @parent: parent struct device. - * @fn: function to be called for each device. - * @data: data for the callback. - * - * Iterate over @parent's child devices, and call @fn for each, - * passing it @data. - * - * We check the return of @fn each time. If it returns anything - * other than 0, we break out and return that value. - */ -int device_for_each_child_reverse(struct device *parent, void *data, - int (*fn)(struct device *dev, void *data)) -{ - struct klist_iter i; - struct device *child; - int error = 0; - - if (!parent || !parent->p) - return 0; - - klist_iter_init(&parent->p->klist_children, &i); - while ((child = prev_device(&i)) && !error) - error = fn(child, data); - klist_iter_exit(&i); - return error; -} -EXPORT_SYMBOL_GPL(device_for_each_child_reverse); - /** * device_for_each_child_reverse_from - device child iterator in reversed order. * @parent: parent struct device. diff --git a/include/linux/device.h b/include/linux/device.h index 667cb6db9019..96a2c072bf5b 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -1076,11 +1076,14 @@ DEFINE_FREE(device_del, struct device *, if (_T) device_del(_T)) int device_for_each_child(struct device *dev, void *data, int (*fn)(struct device *dev, void *data)); -int device_for_each_child_reverse(struct device *dev, void *data, - int (*fn)(struct device *dev, void *data)); int device_for_each_child_reverse_from(struct device *parent, struct device *from, const void *data, int (*fn)(struct device *, const void *)); +static inline int device_for_each_child_reverse(struct device *dev, const void *data, + int (*fn)(struct device *, const void *)) +{ + return device_for_each_child_reverse_from(dev, NULL, data, fn); +} struct device *device_find_child(struct device *dev, void *data, int (*match)(struct device *dev, void *data)); struct device *device_find_child_by_name(struct device *parent,
On 2024/10/12 01:46, Dan Williams wrote: > Zijun Hu wrote: >> On 2024/10/11 13:34, Dan Williams wrote: >>> In support of investigating an initialization failure report [1], >>> cxl_test was updated to register mock memory-devices after the mock >>> root-port/bus device had been registered. That led to cxl_test crashing >>> with a use-after-free bug with the following signature: >>> >>> cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1 >>> cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1 >>> cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0 >>> 1) cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1 >>> [..] >>> cxld_unregister: cxl decoder14.0: >>> cxl_region_decode_reset: cxl_region region3: >>> mock_decoder_reset: cxl_port port3: decoder3.0 reset >>> 2) mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1 >>> cxl_endpoint_decoder_release: cxl decoder14.0: >>> [..] >>> cxld_unregister: cxl decoder7.0: >>> 3) cxl_region_decode_reset: cxl_region region3: >>> Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI >>> [..] >>> RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core] >>> [..] >>> Call Trace: >>> <TASK> >>> cxl_region_decode_reset+0x69/0x190 [cxl_core] >>> cxl_region_detach+0xe8/0x210 [cxl_core] >>> cxl_decoder_kill_region+0x27/0x40 [cxl_core] >>> cxld_unregister+0x5d/0x60 [cxl_core] >>> >>> At 1) a region has been established with 2 endpoint decoders (7.0 and >>> 14.0). Those endpoints share a common switch-decoder in the topology >>> (3.0). At teardown, 2), decoder14.0 is the first to be removed and hits >>> the "out of order reset case" in the switch decoder. The effect though >>> is that region3 cleanup is aborted leaving it in-tact and >>> referencing decoder14.0. At 3) the second attempt to teardown region3 >>> trips over the stale decoder14.0 object which has long since been >>> deleted. >>> >>> The fix here is to recognize that the CXL specification places no >>> mandate on in-order shutdown of switch-decoders, the driver enforces >>> in-order allocation, and hardware enforces in-order commit. So, rather >>> than fail and leave objects dangling, always remove them. >>> >>> In support of making cxl_region_decode_reset() always succeed, >>> cxl_region_invalidate_memregion() failures are turned into warnings. >>> Crashing the kernel is ok there since system integrity is at risk if >>> caches cannot be managed around physical address mutation events like >>> CXL region destruction. >>> >>> A new device_for_each_child_reverse_from() is added to cleanup >>> port->commit_end after all dependent decoders have been disabled. In >>> other words if decoders are allocated 0->1->2 and disabled 1->2->0 then >>> port->commit_end only decrements from 2 after 2 has been disabled, and >>> it decrements all the way to zero since 1 was disabled previously. >>> >>> Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1] >>> Cc: <stable@vger.kernel.org> >>> Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware") >>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> >>> Cc: Davidlohr Bueso <dave@stgolabs.net> >>> Cc: Jonathan Cameron <jonathan.cameron@huawei.com> >>> Cc: Dave Jiang <dave.jiang@intel.com> >>> Cc: Alison Schofield <alison.schofield@intel.com> >>> Cc: Ira Weiny <ira.weiny@intel.com> >>> Cc: Zijun Hu <zijun_hu@icloud.com> >>> Signed-off-by: Dan Williams <dan.j.williams@intel.com> >>> --- >>> drivers/base/core.c | 35 +++++++++++++++++++++++++++++ >>> drivers/cxl/core/hdm.c | 50 +++++++++++++++++++++++++++++++++++------- >>> drivers/cxl/core/region.c | 48 +++++++++++----------------------------- >>> drivers/cxl/cxl.h | 3 ++- >>> include/linux/device.h | 3 +++ >>> tools/testing/cxl/test/cxl.c | 14 ++++-------- >>> 6 files changed, 100 insertions(+), 53 deletions(-) >>> >>> diff --git a/drivers/base/core.c b/drivers/base/core.c >>> index a4c853411a6b..e42f1ad73078 100644 >>> --- a/drivers/base/core.c >>> +++ b/drivers/base/core.c >>> @@ -4037,6 +4037,41 @@ int device_for_each_child_reverse(struct device *parent, void *data, >>> } >>> EXPORT_SYMBOL_GPL(device_for_each_child_reverse); >>> >>> +/** >>> + * device_for_each_child_reverse_from - device child iterator in reversed order. >>> + * @parent: parent struct device. >>> + * @from: optional starting point in child list >>> + * @fn: function to be called for each device. >>> + * @data: data for the callback. >>> + * >>> + * Iterate over @parent's child devices, starting at @from, and call @fn >>> + * for each, passing it @data. This helper is identical to >>> + * device_for_each_child_reverse() when @from is NULL. >>> + * >>> + * @fn is checked each iteration. If it returns anything other than 0, >>> + * iteration stop and that value is returned to the caller of >>> + * device_for_each_child_reverse_from(); >>> + */ >>> +int device_for_each_child_reverse_from(struct device *parent, >>> + struct device *from, const void *data, >>> + int (*fn)(struct device *, const void *)) >>> +{ >>> + struct klist_iter i; >>> + struct device *child; >>> + int error = 0; >>> + >>> + if (!parent->p) >>> + return 0; >>> + >>> + klist_iter_init_node(&parent->p->klist_children, &i, >>> + (from ? &from->p->knode_parent : NULL)); >>> + while ((child = prev_device(&i)) && !error) >>> + error = fn(child, data); >>> + klist_iter_exit(&i); >>> + return error; >>> +} >>> +EXPORT_SYMBOL_GPL(device_for_each_child_reverse_from); >>> + >> >> it does NOT deserve, also does NOT need to introduce a new core driver >> API device_for_each_child_reverse_from(). existing >> device_for_each_child_reverse() can do what the _from() wants to do. >> >> we can use similar approach as below link shown: >> https://lore.kernel.org/all/20240815-const_dfc_prepare-v2-2-8316b87b8ff9@quicinc.com/ > > No, just have a simple starting point parameter. I understand that more > logic can be placed around device_for_each_child_reverse() to achieve > the same effect, but the core helpers should be removing logic from > consumers, not forcing them to add more. > > If bloat is a concern, then after your const cleanups go through > device_for_each_child_reverse() can be rewritten in terms of > device_for_each_child_reverse_from() as (untested): > bloat is one aspect, the other aspect is that there are redundant between both driver core APIs, namely, there are a question: why to still need device_for_each_child_reverse() if it is same as _from(..., NULL, ...) ? So i suggest use existing API now. if there are more users who have such starting point requirement, then add the parameter into device_for_each_child_reverse() which is consistent with other existing *_for_each_*() core APIs such as (class|driver|bus)_for_each_device() and bus_for_each_drv(), that may need much efforts. could you please contains your proposal "fixing this allocation order validation" of below link into this patch series with below reason? and Cc me (^^) https://lore.kernel.org/all/670835f5a2887_964f229474@dwillia2-xfh.jf.intel.com.notmuch/ A) the proposal depends on this patch series. B) one of the issues the proposal fix is match_free_decoder() error logic which is also relevant issues this patch series fix, the proposal also can fix the other device_find_child()'s match() issue i care about. C) Actually, it is a bit difficult for me to understand the proposal since i don't have any basic knowledge about CXL. (^^) > diff --git a/drivers/base/core.c b/drivers/base/core.c > index e42f1ad73078..2571c910da46 100644 > --- a/drivers/base/core.c > +++ b/drivers/base/core.c > @@ -4007,36 +4007,6 @@ int device_for_each_child(struct device *parent, void *data, > } > EXPORT_SYMBOL_GPL(device_for_each_child); > > -/** > - * device_for_each_child_reverse - device child iterator in reversed order. > - * @parent: parent struct device. > - * @fn: function to be called for each device. > - * @data: data for the callback. > - * > - * Iterate over @parent's child devices, and call @fn for each, > - * passing it @data. > - * > - * We check the return of @fn each time. If it returns anything > - * other than 0, we break out and return that value. > - */ > -int device_for_each_child_reverse(struct device *parent, void *data, > - int (*fn)(struct device *dev, void *data)) > -{ > - struct klist_iter i; > - struct device *child; > - int error = 0; > - > - if (!parent || !parent->p) > - return 0; > - > - klist_iter_init(&parent->p->klist_children, &i); > - while ((child = prev_device(&i)) && !error) > - error = fn(child, data); > - klist_iter_exit(&i); > - return error; > -} > -EXPORT_SYMBOL_GPL(device_for_each_child_reverse); > - > /** > * device_for_each_child_reverse_from - device child iterator in reversed order. > * @parent: parent struct device. > diff --git a/include/linux/device.h b/include/linux/device.h > index 667cb6db9019..96a2c072bf5b 100644 > --- a/include/linux/device.h > +++ b/include/linux/device.h > @@ -1076,11 +1076,14 @@ DEFINE_FREE(device_del, struct device *, if (_T) device_del(_T)) > > int device_for_each_child(struct device *dev, void *data, > int (*fn)(struct device *dev, void *data)); > -int device_for_each_child_reverse(struct device *dev, void *data, > - int (*fn)(struct device *dev, void *data)); > int device_for_each_child_reverse_from(struct device *parent, > struct device *from, const void *data, > int (*fn)(struct device *, const void *)); > +static inline int device_for_each_child_reverse(struct device *dev, const void *data, > + int (*fn)(struct device *, const void *)) > +{ > + return device_for_each_child_reverse_from(dev, NULL, data, fn); > +} > struct device *device_find_child(struct device *dev, void *data, > int (*match)(struct device *dev, void *data)); > struct device *device_find_child_by_name(struct device *parent,
On Sat, Oct 12, 2024 at 07:40:13AM +0800, Zijun Hu wrote: > On 2024/10/12 01:46, Dan Williams wrote: > > Zijun Hu wrote: > >> On 2024/10/11 13:34, Dan Williams wrote: > >>> In support of investigating an initialization failure report [1], > >>> cxl_test was updated to register mock memory-devices after the mock > >>> root-port/bus device had been registered. That led to cxl_test crashing > >>> with a use-after-free bug with the following signature: > >>> > >>> cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1 > >>> cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1 > >>> cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0 > >>> 1) cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1 > >>> [..] > >>> cxld_unregister: cxl decoder14.0: > >>> cxl_region_decode_reset: cxl_region region3: > >>> mock_decoder_reset: cxl_port port3: decoder3.0 reset > >>> 2) mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1 > >>> cxl_endpoint_decoder_release: cxl decoder14.0: > >>> [..] > >>> cxld_unregister: cxl decoder7.0: > >>> 3) cxl_region_decode_reset: cxl_region region3: > >>> Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI > >>> [..] > >>> RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core] > >>> [..] > >>> Call Trace: > >>> <TASK> > >>> cxl_region_decode_reset+0x69/0x190 [cxl_core] > >>> cxl_region_detach+0xe8/0x210 [cxl_core] > >>> cxl_decoder_kill_region+0x27/0x40 [cxl_core] > >>> cxld_unregister+0x5d/0x60 [cxl_core] > >>> > >>> At 1) a region has been established with 2 endpoint decoders (7.0 and > >>> 14.0). Those endpoints share a common switch-decoder in the topology > >>> (3.0). At teardown, 2), decoder14.0 is the first to be removed and hits > >>> the "out of order reset case" in the switch decoder. The effect though > >>> is that region3 cleanup is aborted leaving it in-tact and > >>> referencing decoder14.0. At 3) the second attempt to teardown region3 > >>> trips over the stale decoder14.0 object which has long since been > >>> deleted. > >>> > >>> The fix here is to recognize that the CXL specification places no > >>> mandate on in-order shutdown of switch-decoders, the driver enforces > >>> in-order allocation, and hardware enforces in-order commit. So, rather > >>> than fail and leave objects dangling, always remove them. > >>> > >>> In support of making cxl_region_decode_reset() always succeed, > >>> cxl_region_invalidate_memregion() failures are turned into warnings. > >>> Crashing the kernel is ok there since system integrity is at risk if > >>> caches cannot be managed around physical address mutation events like > >>> CXL region destruction. > >>> > >>> A new device_for_each_child_reverse_from() is added to cleanup > >>> port->commit_end after all dependent decoders have been disabled. In > >>> other words if decoders are allocated 0->1->2 and disabled 1->2->0 then > >>> port->commit_end only decrements from 2 after 2 has been disabled, and > >>> it decrements all the way to zero since 1 was disabled previously. > >>> > >>> Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1] > >>> Cc: <stable@vger.kernel.org> > >>> Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware") > >>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > >>> Cc: Davidlohr Bueso <dave@stgolabs.net> > >>> Cc: Jonathan Cameron <jonathan.cameron@huawei.com> > >>> Cc: Dave Jiang <dave.jiang@intel.com> > >>> Cc: Alison Schofield <alison.schofield@intel.com> > >>> Cc: Ira Weiny <ira.weiny@intel.com> > >>> Cc: Zijun Hu <zijun_hu@icloud.com> > >>> Signed-off-by: Dan Williams <dan.j.williams@intel.com> > >>> --- > >>> drivers/base/core.c | 35 +++++++++++++++++++++++++++++ > >>> drivers/cxl/core/hdm.c | 50 +++++++++++++++++++++++++++++++++++------- > >>> drivers/cxl/core/region.c | 48 +++++++++++----------------------------- > >>> drivers/cxl/cxl.h | 3 ++- > >>> include/linux/device.h | 3 +++ > >>> tools/testing/cxl/test/cxl.c | 14 ++++-------- > >>> 6 files changed, 100 insertions(+), 53 deletions(-) > >>> > >>> diff --git a/drivers/base/core.c b/drivers/base/core.c > >>> index a4c853411a6b..e42f1ad73078 100644 > >>> --- a/drivers/base/core.c > >>> +++ b/drivers/base/core.c > >>> @@ -4037,6 +4037,41 @@ int device_for_each_child_reverse(struct device *parent, void *data, > >>> } > >>> EXPORT_SYMBOL_GPL(device_for_each_child_reverse); > >>> > >>> +/** > >>> + * device_for_each_child_reverse_from - device child iterator in reversed order. > >>> + * @parent: parent struct device. > >>> + * @from: optional starting point in child list > >>> + * @fn: function to be called for each device. > >>> + * @data: data for the callback. > >>> + * > >>> + * Iterate over @parent's child devices, starting at @from, and call @fn > >>> + * for each, passing it @data. This helper is identical to > >>> + * device_for_each_child_reverse() when @from is NULL. > >>> + * > >>> + * @fn is checked each iteration. If it returns anything other than 0, > >>> + * iteration stop and that value is returned to the caller of > >>> + * device_for_each_child_reverse_from(); > >>> + */ > >>> +int device_for_each_child_reverse_from(struct device *parent, > >>> + struct device *from, const void *data, > >>> + int (*fn)(struct device *, const void *)) > >>> +{ > >>> + struct klist_iter i; > >>> + struct device *child; > >>> + int error = 0; > >>> + > >>> + if (!parent->p) > >>> + return 0; > >>> + > >>> + klist_iter_init_node(&parent->p->klist_children, &i, > >>> + (from ? &from->p->knode_parent : NULL)); > >>> + while ((child = prev_device(&i)) && !error) > >>> + error = fn(child, data); > >>> + klist_iter_exit(&i); > >>> + return error; > >>> +} > >>> +EXPORT_SYMBOL_GPL(device_for_each_child_reverse_from); > >>> + > >> > >> it does NOT deserve, also does NOT need to introduce a new core driver > >> API device_for_each_child_reverse_from(). existing > >> device_for_each_child_reverse() can do what the _from() wants to do. > >> > >> we can use similar approach as below link shown: > >> https://lore.kernel.org/all/20240815-const_dfc_prepare-v2-2-8316b87b8ff9@quicinc.com/ > > > > No, just have a simple starting point parameter. I understand that more > > logic can be placed around device_for_each_child_reverse() to achieve > > the same effect, but the core helpers should be removing logic from > > consumers, not forcing them to add more. > > > > If bloat is a concern, then after your const cleanups go through > > device_for_each_child_reverse() can be rewritten in terms of > > device_for_each_child_reverse_from() as (untested): > > > > bloat is one aspect, the other aspect is that there are redundant > between both driver core APIs, namely, there are a question: > > why to still need device_for_each_child_reverse() if it is same as > _from(..., NULL, ...) ? > This same pattern (_reverse and _from_reverse) is present in list.h and other iterators. Why would it be contentious here? Reducing _reverse() to be a wrapper of _from_reverse is a nice way of reducing the bloat/redundancy without having to update every current user - this is a very common refactor pattern. Refactoring without disrupting in-flight work is intrinsically valuable. > > - * > > - * Iterate over @parent's child devices, and call @fn for each, > > - * passing it @data. > > - * > > - * We check the return of @fn each time. If it returns anything > > - * other than 0, we break out and return that value. > > - */ > > -int device_for_each_child_reverse(struct device *parent, void *data, > > - int (*fn)(struct device *dev, void *data)) > > -{ > > - struct klist_iter i; > > - struct device *child; > > - int error = 0; > > - > > - if (!parent || !parent->p) > > - return 0; > > - > > - klist_iter_init(&parent->p->klist_children, &i); > > - while ((child = prev_device(&i)) && !error) > > - error = fn(child, data); > > - klist_iter_exit(&i); > > - return error; > > -} > > -EXPORT_SYMBOL_GPL(device_for_each_child_reverse); > > - > > /** > > * device_for_each_child_reverse_from - device child iterator in reversed order. > > * @parent: parent struct device. > > diff --git a/include/linux/device.h b/include/linux/device.h > > index 667cb6db9019..96a2c072bf5b 100644 > > --- a/include/linux/device.h > > +++ b/include/linux/device.h > > @@ -1076,11 +1076,14 @@ DEFINE_FREE(device_del, struct device *, if (_T) device_del(_T)) > > > > int device_for_each_child(struct device *dev, void *data, > > int (*fn)(struct device *dev, void *data)); > > -int device_for_each_child_reverse(struct device *dev, void *data, > > - int (*fn)(struct device *dev, void *data)); > > int device_for_each_child_reverse_from(struct device *parent, > > struct device *from, const void *data, > > int (*fn)(struct device *, const void *)); > > +static inline int device_for_each_child_reverse(struct device *dev, const void *data, > > + int (*fn)(struct device *, const void *)) > > +{ > > + return device_for_each_child_reverse_from(dev, NULL, data, fn); > > +} > > struct device *device_find_child(struct device *dev, void *data, > > int (*match)(struct device *dev, void *data)); > > struct device *device_find_child_by_name(struct device *parent, >
Zijun Hu wrote: [..] > >> it does NOT deserve, also does NOT need to introduce a new core driver > >> API device_for_each_child_reverse_from(). existing > >> device_for_each_child_reverse() can do what the _from() wants to do. > >> > >> we can use similar approach as below link shown: > >> https://lore.kernel.org/all/20240815-const_dfc_prepare-v2-2-8316b87b8ff9@quicinc.com/ > > > > No, just have a simple starting point parameter. I understand that more > > logic can be placed around device_for_each_child_reverse() to achieve > > the same effect, but the core helpers should be removing logic from > > consumers, not forcing them to add more. > > > > If bloat is a concern, then after your const cleanups go through > > device_for_each_child_reverse() can be rewritten in terms of > > device_for_each_child_reverse_from() as (untested): > > > > bloat is one aspect, the other aspect is that there are redundant > between both driver core APIs, namely, there are a question: > > why to still need device_for_each_child_reverse() if it is same as > _from(..., NULL, ...) ? To allow access to the new functionality without burdening existing callers. With device_for_each_child_reverse() re-written to internally use device_for_each_child_reverse_from() Linux gets more functionality with no disruption to existing users and no bloat. This is typical API evolution. > So i suggest use existing API now. No, I am failing to understand your concern. > if there are more users who have such starting point requirement, then > add the parameter into device_for_each_child_reverse() which is > consistent with other existing *_for_each_*() core APIs such as > (class|driver|bus)_for_each_device() and bus_for_each_drv(), that may > need much efforts. There are ~370 existing device_for_each* callers. Let's not thrash them. Introduce new superset calls with the additional parameter and then rewrite the old routines to just have a simple wrapper that passes a NULL @start parameter. Now, if Greg has the appetite to go touch all ~370 existing callers, so be it, but introducing a superset-iterator-helper and a compat-wrapper for legacy is the path I would take. > could you please contains your proposal "fixing this allocation > order validation" of below link into this patch series with below > reason? and Cc me (^^) > > https://lore.kernel.org/all/670835f5a2887_964f229474@dwillia2-xfh.jf.intel.com.notmuch/ > > A) > the proposal depends on this patch series. > B) > one of the issues the proposal fix is match_free_decoder() error > logic which is also relevant issues this patch series fix, the proposal > also can fix the other device_find_child()'s match() issue i care about. > > C) > Actually, it is a bit difficult for me to understand the proposal since > i don't have any basic knowledge about CXL. (^^) If I understand your question correctly you are asking how does device_for_each_child_reverse_from() get used in cxl_region_find_decoder() to enforce in-order allocation? The recommendation is the following: -- 8< -- diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c index 3478d2058303..32cde18ff31b 100644 --- a/drivers/cxl/core/region.c +++ b/drivers/cxl/core/region.c @@ -778,26 +778,50 @@ static size_t show_targetN(struct cxl_region *cxlr, char *buf, int pos) return rc; } +static int check_commit_order(struct device *dev, const void *data) +{ + struct cxl_decoder *cxld = to_cxl_decoder(dev); + + /* + * if port->commit_end is not the only free decoder, then out of + * order shutdown has occurred, block further allocations until + * that is resolved + */ + if (((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) + return -EBUSY; + return 0; +} + static int match_free_decoder(struct device *dev, void *data) { + struct cxl_port *port = to_cxl_port(dev->parent); struct cxl_decoder *cxld; - int *id = data; + int rc; if (!is_switch_decoder(dev)) return 0; cxld = to_cxl_decoder(dev); - /* enforce ordered allocation */ - if (cxld->id != *id) + if (cxld->id != port->commit_end + 1) return 0; - if (!cxld->region) - return 1; - - (*id)++; + if (cxld->region) { + dev_dbg(dev->parent, + "next decoder to commit is already reserved\n", + dev_name(dev)); + return 0; + } - return 0; + rc = device_for_each_child_reverse_from(dev->parent, dev, NULL, + check_commit_order); + if (rc) { + dev_dbg(dev->parent, + "unable to allocate %s due to out of order shutdown\n", + dev_name(dev)); + return 0; + } + return 1; } static int match_auto_decoder(struct device *dev, void *data) @@ -824,7 +848,6 @@ cxl_region_find_decoder(struct cxl_port *port, struct cxl_region *cxlr) { struct device *dev; - int id = 0; if (port == cxled_to_port(cxled)) return &cxled->cxld; @@ -833,7 +856,7 @@ cxl_region_find_decoder(struct cxl_port *port, dev = device_find_child(&port->dev, &cxlr->params, match_auto_decoder); else - dev = device_find_child(&port->dev, &id, match_free_decoder); + dev = device_find_child(&port->dev, NULL, match_free_decoder); if (!dev) return NULL; /*
On 2024/10/13 06:16, Dan Williams wrote: > Zijun Hu wrote: > [..] >>>> it does NOT deserve, also does NOT need to introduce a new core driver >>>> API device_for_each_child_reverse_from(). existing >>>> device_for_each_child_reverse() can do what the _from() wants to do. >>>> >>>> we can use similar approach as below link shown: >>>> https://lore.kernel.org/all/20240815-const_dfc_prepare-v2-2-8316b87b8ff9@quicinc.com/ >>> >>> No, just have a simple starting point parameter. I understand that more >>> logic can be placed around device_for_each_child_reverse() to achieve >>> the same effect, but the core helpers should be removing logic from >>> consumers, not forcing them to add more. >>> >>> If bloat is a concern, then after your const cleanups go through >>> device_for_each_child_reverse() can be rewritten in terms of >>> device_for_each_child_reverse_from() as (untested): >>> >> >> bloat is one aspect, the other aspect is that there are redundant >> between both driver core APIs, namely, there are a question: >> >> why to still need device_for_each_child_reverse() if it is same as >> _from(..., NULL, ...) ? > > To allow access to the new functionality without burdening existing > callers. With device_for_each_child_reverse() re-written to internally > use device_for_each_child_reverse_from() Linux gets more functionality > with no disruption to existing users and no bloat. This is typical API > evolution. > >> So i suggest use existing API now. > > No, I am failing to understand your concern. > >> if there are more users who have such starting point requirement, then >> add the parameter into device_for_each_child_reverse() which is >> consistent with other existing *_for_each_*() core APIs such as >> (class|driver|bus)_for_each_device() and bus_for_each_drv(), that may >> need much efforts. > > There are ~370 existing device_for_each* callers. Let's not thrash them. > > Introduce new superset calls with the additional parameter and then > rewrite the old routines to just have a simple wrapper that passes a > NULL @start parameter. > > Now, if Greg has the appetite to go touch all ~370 existing callers, so > be it, but introducing a superset-iterator-helper and a compat-wrapper > for legacy is the path I would take. > current kernel tree ONLY has 15 usages of device_for_each_child_reverse(), i would like to add an extra parameter @start as existing (class|driver)_for_each_device() and bus_for_each_(dev|drv)() do if it is required. >> could you please contains your proposal "fixing this allocation >> order validation" of below link into this patch series with below >> reason? and Cc me (^^) >> >> https://lore.kernel.org/all/670835f5a2887_964f229474@dwillia2-xfh.jf.intel.com.notmuch/ >> >> A) >> the proposal depends on this patch series. >> B) >> one of the issues the proposal fix is match_free_decoder() error >> logic which is also relevant issues this patch series fix, the proposal >> also can fix the other device_find_child()'s match() issue i care about. >> >> C) >> Actually, it is a bit difficult for me to understand the proposal since >> i don't have any basic knowledge about CXL. (^^) > > If I understand your question correctly you are asking how does > device_for_each_child_reverse_from() get used in > cxl_region_find_decoder() to enforce in-order allocation? > yes. your recommendation may help me understand it. > The recommendation is the following: > > -- 8< -- > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c > index 3478d2058303..32cde18ff31b 100644 > --- a/drivers/cxl/core/region.c > +++ b/drivers/cxl/core/region.c > @@ -778,26 +778,50 @@ static size_t show_targetN(struct cxl_region *cxlr, char *buf, int pos) > return rc; > } > > +static int check_commit_order(struct device *dev, const void *data) > +{ > + struct cxl_decoder *cxld = to_cxl_decoder(dev); > + > + /* > + * if port->commit_end is not the only free decoder, then out of > + * order shutdown has occurred, block further allocations until > + * that is resolved > + */ > + if (((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) > + return -EBUSY; > + return 0; > +} > + > static int match_free_decoder(struct device *dev, void *data) > { > + struct cxl_port *port = to_cxl_port(dev->parent); > struct cxl_decoder *cxld; > - int *id = data; > + int rc; > > if (!is_switch_decoder(dev)) > return 0; > > cxld = to_cxl_decoder(dev); > > - /* enforce ordered allocation */ > - if (cxld->id != *id) > + if (cxld->id != port->commit_end + 1) > return 0; > for a port, is it possible that there are multiple CXLDs with same IDs? > - if (!cxld->region) > - return 1; > - > - (*id)++; > + if (cxld->region) { > + dev_dbg(dev->parent, > + "next decoder to commit is already reserved\n", > + dev_name(dev)); > + return 0; > + } > > - return 0; > + rc = device_for_each_child_reverse_from(dev->parent, dev, NULL, > + check_commit_order); > + if (rc) { > + dev_dbg(dev->parent, > + "unable to allocate %s due to out of order shutdown\n", > + dev_name(dev)); > + return 0; > + } > + return 1; > } > > static int match_auto_decoder(struct device *dev, void *data) > @@ -824,7 +848,6 @@ cxl_region_find_decoder(struct cxl_port *port, > struct cxl_region *cxlr) > { > struct device *dev; > - int id = 0; > > if (port == cxled_to_port(cxled)) > return &cxled->cxld; > @@ -833,7 +856,7 @@ cxl_region_find_decoder(struct cxl_port *port, > dev = device_find_child(&port->dev, &cxlr->params, > match_auto_decoder); > else > - dev = device_find_child(&port->dev, &id, match_free_decoder); > + dev = device_find_child(&port->dev, NULL, match_free_decoder); > if (!dev) > return NULL; > /*
Zijun Hu wrote: > On 2024/10/13 06:16, Dan Williams wrote: > > Zijun Hu wrote: > > [..] > >>>> it does NOT deserve, also does NOT need to introduce a new core driver > >>>> API device_for_each_child_reverse_from(). existing > >>>> device_for_each_child_reverse() can do what the _from() wants to do. > >>>> > >>>> we can use similar approach as below link shown: > >>>> https://lore.kernel.org/all/20240815-const_dfc_prepare-v2-2-8316b87b8ff9@quicinc.com/ > >>> > >>> No, just have a simple starting point parameter. I understand that more > >>> logic can be placed around device_for_each_child_reverse() to achieve > >>> the same effect, but the core helpers should be removing logic from > >>> consumers, not forcing them to add more. > >>> > >>> If bloat is a concern, then after your const cleanups go through > >>> device_for_each_child_reverse() can be rewritten in terms of > >>> device_for_each_child_reverse_from() as (untested): > >>> > >> > >> bloat is one aspect, the other aspect is that there are redundant > >> between both driver core APIs, namely, there are a question: > >> > >> why to still need device_for_each_child_reverse() if it is same as > >> _from(..., NULL, ...) ? > > > > To allow access to the new functionality without burdening existing > > callers. With device_for_each_child_reverse() re-written to internally > > use device_for_each_child_reverse_from() Linux gets more functionality > > with no disruption to existing users and no bloat. This is typical API > > evolution. > > > >> So i suggest use existing API now. > > > > No, I am failing to understand your concern. > > > >> if there are more users who have such starting point requirement, then > >> add the parameter into device_for_each_child_reverse() which is > >> consistent with other existing *_for_each_*() core APIs such as > >> (class|driver|bus)_for_each_device() and bus_for_each_drv(), that may > >> need much efforts. > > > > There are ~370 existing device_for_each* callers. Let's not thrash them. > > > > Introduce new superset calls with the additional parameter and then > > rewrite the old routines to just have a simple wrapper that passes a > > NULL @start parameter. > > > > Now, if Greg has the appetite to go touch all ~370 existing callers, so > > be it, but introducing a superset-iterator-helper and a compat-wrapper > > for legacy is the path I would take. > > > > current kernel tree ONLY has 15 usages of > device_for_each_child_reverse(), i would like to > add an extra parameter @start as existing > (class|driver)_for_each_device() and bus_for_each_(dev|drv)() do > if it is required. A new parameter to a new wrapper symbol sounds fine to me. Otherwise, please do not go thrash all the call sites to pass an unused NULL @start parameter. Just accept that device_for_each_* did not follow the {class,driver,bus}_for_each_* example and instead introduce a new symbol to wrap the new functionality that so far only has the single CXL user. [..] > > If I understand your question correctly you are asking how does > > device_for_each_child_reverse_from() get used in > > cxl_region_find_decoder() to enforce in-order allocation? > > > > yes. your recommendation may help me understand it. > > > The recommendation is the following: > > > > -- 8< -- > > diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c > > index 3478d2058303..32cde18ff31b 100644 > > --- a/drivers/cxl/core/region.c > > +++ b/drivers/cxl/core/region.c > > @@ -778,26 +778,50 @@ static size_t show_targetN(struct cxl_region *cxlr, char *buf, int pos) > > return rc; > > } > > > > +static int check_commit_order(struct device *dev, const void *data) > > +{ > > + struct cxl_decoder *cxld = to_cxl_decoder(dev); > > + > > + /* > > + * if port->commit_end is not the only free decoder, then out of > > + * order shutdown has occurred, block further allocations until > > + * that is resolved > > + */ > > + if (((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) > > + return -EBUSY; > > + return 0; > > +} > > + > > static int match_free_decoder(struct device *dev, void *data) > > { > > + struct cxl_port *port = to_cxl_port(dev->parent); > > struct cxl_decoder *cxld; > > - int *id = data; > > + int rc; > > > > if (!is_switch_decoder(dev)) > > return 0; > > > > cxld = to_cxl_decoder(dev); > > > > - /* enforce ordered allocation */ > > - if (cxld->id != *id) > > + if (cxld->id != port->commit_end + 1) > > return 0; > > > > for a port, is it possible that there are multiple CXLDs with same IDs? Not possible, that is also enforced by the driver core, all kernel object names must be unique (at least if they have the same parent), and the local subsystem convention is that all decoders are registered in id-order.
On 2024/10/15 03:32, Dan Williams wrote: > Zijun Hu wrote: >> On 2024/10/13 06:16, Dan Williams wrote: >>> Zijun Hu wrote: >>> [..] >>>>>> it does NOT deserve, also does NOT need to introduce a new core driver >>>>>> API device_for_each_child_reverse_from(). existing >>>>>> device_for_each_child_reverse() can do what the _from() wants to do. >>>>>> [snip] >>> Introduce new superset calls with the additional parameter and then >>> rewrite the old routines to just have a simple wrapper that passes a >>> NULL @start parameter. >>> >>> Now, if Greg has the appetite to go touch all ~370 existing callers, so >>> be it, but introducing a superset-iterator-helper and a compat-wrapper >>> for legacy is the path I would take. >>> >> >> current kernel tree ONLY has 15 usages of >> device_for_each_child_reverse(), i would like to >> add an extra parameter @start as existing >> (class|driver)_for_each_device() and bus_for_each_(dev|drv)() do >> if it is required. > > A new parameter to a new wrapper symbol sounds fine to me. Otherwise, > please do not go thrash all the call sites to pass an unused NULL @start > parameter. Just accept that device_for_each_* did not follow the > {class,driver,bus}_for_each_* example and instead introduce a new symbol > to wrap the new functionality that so far only has the single CXL user. > you maybe regard my idea as a alternative proposal if Greg dislike introducing a new core driver API. (^^) > [..] >>> If I understand your question correctly you are asking how does >>> device_for_each_child_reverse_from() get used in >>> cxl_region_find_decoder() to enforce in-order allocation? >>> >> >> yes. your recommendation may help me understand it. >> below simple solution should have same effect as your recommendation. also have below optimizations: 1) it don't need new core API. 2) it is more efficient since it has minimal iterating. i will submit it if you like it. (^^) index e701e4b04032..37da42329ff3 100644 --- a/drivers/cxl/core/region.c +++ b/drivers/cxl/core/region.c @@ -796,8 +796,9 @@ static size_t show_targetN(struct cxl_region *cxlr, char *buf, int pos) static int match_free_decoder(struct device *dev, void *data) { + struct cxl_port *port = to_cxl_port(dev->parent); struct cxl_decoder *cxld; - int *id = data; + struct device **target_dev = data; if (!is_switch_decoder(dev)) return 0; @@ -805,15 +806,19 @@ static int match_free_decoder(struct device *dev, void *data) cxld = to_cxl_decoder(dev); /* enforce ordered allocation */ - if (cxld->id != *id) - return 0; - - if (!cxld->region) - return 1; - - (*id)++; - - return 0; + if (cxld->id == port->commit_end + 1) { + if (!cxld->region) { + *target_dev = dev; + return 1; + } else { + dev_dbg(dev->parent, + "next decoder to commit is already reserved\n", + dev_name(dev)); + return -ENODEV; + } + } else { + return cxld->flags & CXL_DECODER_F_ENABLE ? 0 : -EBUSY; + } } static int match_auto_decoder(struct device *dev, void *data) @@ -839,7 +844,7 @@ cxl_region_find_decoder(struct cxl_port *port, struct cxl_endpoint_decoder *cxled, struct cxl_region *cxlr) { - struct device *dev; + struct device *dev = NULL; int id = 0; if (port == cxled_to_port(cxled)) @@ -848,8 +853,8 @@ cxl_region_find_decoder(struct cxl_port *port, if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags)) dev = device_find_child(&port->dev, &cxlr->params, match_auto_decoder); - else - dev = device_find_child(&port->dev, &id, match_free_decoder); + else if (device_for_each_child(&port->dev, &dev, match_free_decoder) > 0) + get_device(dev); if (!dev) return NULL; /* >>> The recommendation is the following: >>> >>> -- 8< -- >>> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c >>> index 3478d2058303..32cde18ff31b 100644 >>> --- a/drivers/cxl/core/region.c >>> +++ b/drivers/cxl/core/region.c >>> @@ -778,26 +778,50 @@ static size_t show_targetN(struct cxl_region *cxlr, char *buf, int pos) >>> return rc; >>> } >>> >>> +static int check_commit_order(struct device *dev, const void *data) >>> +{ >>> + struct cxl_decoder *cxld = to_cxl_decoder(dev); >>> + >>> + /* >>> + * if port->commit_end is not the only free decoder, then out of >>> + * order shutdown has occurred, block further allocations until >>> + * that is resolved >>> + */ >>> + if (((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) >>> + return -EBUSY; >>> + return 0; >>> +} >>> + >>> static int match_free_decoder(struct device *dev, void *data) >>> { >>> + struct cxl_port *port = to_cxl_port(dev->parent); >>> struct cxl_decoder *cxld; >>> - int *id = data; >>> + int rc; >>> >>> if (!is_switch_decoder(dev)) >>> return 0; >>> >>> cxld = to_cxl_decoder(dev); >>> >>> - /* enforce ordered allocation */ >>> - if (cxld->id != *id) >>> + if (cxld->id != port->commit_end + 1) >>> return 0; >>> >> >> for a port, is it possible that there are multiple CXLDs with same IDs? > > Not possible, that is also enforced by the driver core, all kernel > object names must be unique (at least if they have the same parent), and > the local subsystem convention is that all decoders are registered in > id-order.
Zijun Hu wrote: > On 2024/10/15 03:32, Dan Williams wrote: > > Zijun Hu wrote: > >> On 2024/10/13 06:16, Dan Williams wrote: > >>> Zijun Hu wrote: > >>> [..] > >>>>>> it does NOT deserve, also does NOT need to introduce a new core driver > >>>>>> API device_for_each_child_reverse_from(). existing > >>>>>> device_for_each_child_reverse() can do what the _from() wants to do. > >>>>>> > > [snip] > > >>> Introduce new superset calls with the additional parameter and then > >>> rewrite the old routines to just have a simple wrapper that passes a > >>> NULL @start parameter. > >>> > >>> Now, if Greg has the appetite to go touch all ~370 existing callers, so > >>> be it, but introducing a superset-iterator-helper and a compat-wrapper > >>> for legacy is the path I would take. > >>> > >> > >> current kernel tree ONLY has 15 usages of > >> device_for_each_child_reverse(), i would like to > >> add an extra parameter @start as existing > >> (class|driver)_for_each_device() and bus_for_each_(dev|drv)() do > >> if it is required. > > > > A new parameter to a new wrapper symbol sounds fine to me. Otherwise, > > please do not go thrash all the call sites to pass an unused NULL @start > > parameter. Just accept that device_for_each_* did not follow the > > {class,driver,bus}_for_each_* example and instead introduce a new symbol > > to wrap the new functionality that so far only has the single CXL user. > > > > you maybe regard my idea as a alternative proposal if Greg dislike > introducing a new core driver API. (^^) If the proposal is to add an unwanted parameter to existing call sites then yes, I would NAK that. > > [..] > >>> If I understand your question correctly you are asking how does > >>> device_for_each_child_reverse_from() get used in > >>> cxl_region_find_decoder() to enforce in-order allocation? > >>> > >> > >> yes. your recommendation may help me understand it. > >> > > below simple solution should have same effect as your recommendation. > also have below optimizations: > > 1) it don't need new core API. > 2) it is more efficient since it has minimal iterating. > > i will submit it if you like it. (^^) See the patch I just submitted, it does not handle the case of competing allocations. The cxld->region check is not sufficient for determining that the decoder is committed.
On Thu, 10 Oct 2024 22:34:26 -0700 Dan Williams <dan.j.williams@intel.com> wrote: > In support of investigating an initialization failure report [1], > cxl_test was updated to register mock memory-devices after the mock > root-port/bus device had been registered. That led to cxl_test crashing > with a use-after-free bug with the following signature: > > cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1 > cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1 > cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0 > 1) cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1 > [..] > cxld_unregister: cxl decoder14.0: > cxl_region_decode_reset: cxl_region region3: > mock_decoder_reset: cxl_port port3: decoder3.0 reset > 2) mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1 > cxl_endpoint_decoder_release: cxl decoder14.0: > [..] > cxld_unregister: cxl decoder7.0: > 3) cxl_region_decode_reset: cxl_region region3: > Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI > [..] > RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core] > [..] > Call Trace: > <TASK> > cxl_region_decode_reset+0x69/0x190 [cxl_core] > cxl_region_detach+0xe8/0x210 [cxl_core] > cxl_decoder_kill_region+0x27/0x40 [cxl_core] > cxld_unregister+0x5d/0x60 [cxl_core] > > At 1) a region has been established with 2 endpoint decoders (7.0 and > 14.0). Those endpoints share a common switch-decoder in the topology > (3.0). At teardown, 2), decoder14.0 is the first to be removed and hits > the "out of order reset case" in the switch decoder. The effect though > is that region3 cleanup is aborted leaving it in-tact and > referencing decoder14.0. At 3) the second attempt to teardown region3 > trips over the stale decoder14.0 object which has long since been > deleted. > > The fix here is to recognize that the CXL specification places no > mandate on in-order shutdown of switch-decoders, the driver enforces > in-order allocation, and hardware enforces in-order commit. So, rather > than fail and leave objects dangling, always remove them. > > In support of making cxl_region_decode_reset() always succeed, > cxl_region_invalidate_memregion() failures are turned into warnings. > Crashing the kernel is ok there since system integrity is at risk if > caches cannot be managed around physical address mutation events like > CXL region destruction. I'm fine with this, but seems like it is worth breaking out as a precursor where we can discuss merits of that change separate from the complexity of the rest. I don't mind that strongly though so if you keep this intact, Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Trivial passing comment inline. > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c > index 3df10517a327..223c273c0cd1 100644 > --- a/drivers/cxl/core/hdm.c > +++ b/drivers/cxl/core/hdm.c > @@ -712,7 +712,44 @@ static int cxl_decoder_commit(struct cxl_decoder *cxld) > return 0; > } > > -static int cxl_decoder_reset(struct cxl_decoder *cxld) > +static int commit_reap(struct device *dev, const void *data) > +{ > + struct cxl_port *port = to_cxl_port(dev->parent); > + struct cxl_decoder *cxld; > + > + if (!is_switch_decoder(dev) && !is_endpoint_decoder(dev)) > + return 0; > + > + cxld = to_cxl_decoder(dev); > + if (port->commit_end == cxld->id && > + ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) { I'd have gone with !(cxld->flags & CXL_DECODER_F_ENABLE) but this is consistent with exiting form, so fine as is. > + port->commit_end--; > + dev_dbg(&port->dev, "reap: %s commit_end: %d\n", > + dev_name(&cxld->dev), port->commit_end); > + } > + > + return 0; > +} > diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c > index 90d5afd52dd0..c5bbd89b3192 100644 > --- a/tools/testing/cxl/test/cxl.c > +++ b/tools/testing/cxl/test/cxl.c > @@ -693,26 +693,22 @@ static int mock_decoder_commit(struct cxl_decoder *cxld) > return 0; > } > > -static int mock_decoder_reset(struct cxl_decoder *cxld) > +static void mock_decoder_reset(struct cxl_decoder *cxld) > { > struct cxl_port *port = to_cxl_port(cxld->dev.parent); > int id = cxld->id; > > if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0) > - return 0; > + return; > > dev_dbg(&port->dev, "%s reset\n", dev_name(&cxld->dev)); > - if (port->commit_end != id) { > + if (port->commit_end == id) > + cxl_port_commit_reap(cxld); > + else > dev_dbg(&port->dev, > "%s: out of order reset, expected decoder%d.%d\n", > dev_name(&cxld->dev), port->id, port->commit_end); > - return -EBUSY; > - } > - > - port->commit_end--; > cxld->flags &= ~CXL_DECODER_F_ENABLE; > - > - return 0; > } > > static void default_mock_decoder(struct cxl_decoder *cxld) > >
Jonathan Cameron wrote: > On Thu, 10 Oct 2024 22:34:26 -0700 > Dan Williams <dan.j.williams@intel.com> wrote: > > > In support of investigating an initialization failure report [1], > > cxl_test was updated to register mock memory-devices after the mock > > root-port/bus device had been registered. That led to cxl_test crashing > > with a use-after-free bug with the following signature: > > > > cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1 > > cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1 > > cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0 > > 1) cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1 > > [..] > > cxld_unregister: cxl decoder14.0: > > cxl_region_decode_reset: cxl_region region3: > > mock_decoder_reset: cxl_port port3: decoder3.0 reset > > 2) mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1 > > cxl_endpoint_decoder_release: cxl decoder14.0: > > [..] > > cxld_unregister: cxl decoder7.0: > > 3) cxl_region_decode_reset: cxl_region region3: > > Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI > > [..] > > RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core] > > [..] > > Call Trace: > > <TASK> > > cxl_region_decode_reset+0x69/0x190 [cxl_core] > > cxl_region_detach+0xe8/0x210 [cxl_core] > > cxl_decoder_kill_region+0x27/0x40 [cxl_core] > > cxld_unregister+0x5d/0x60 [cxl_core] > > > > At 1) a region has been established with 2 endpoint decoders (7.0 and > > 14.0). Those endpoints share a common switch-decoder in the topology > > (3.0). At teardown, 2), decoder14.0 is the first to be removed and hits > > the "out of order reset case" in the switch decoder. The effect though > > is that region3 cleanup is aborted leaving it in-tact and > > referencing decoder14.0. At 3) the second attempt to teardown region3 > > trips over the stale decoder14.0 object which has long since been > > deleted. > > > > The fix here is to recognize that the CXL specification places no > > mandate on in-order shutdown of switch-decoders, the driver enforces > > in-order allocation, and hardware enforces in-order commit. So, rather > > than fail and leave objects dangling, always remove them. > > > > In support of making cxl_region_decode_reset() always succeed, > > cxl_region_invalidate_memregion() failures are turned into warnings. > > Crashing the kernel is ok there since system integrity is at risk if > > caches cannot be managed around physical address mutation events like > > CXL region destruction. > > I'm fine with this, but seems like it is worth breaking out as a precursor > where we can discuss merits of that change separate from the complexity > of the rest. > > I don't mind that strongly though so if you keep this intact, If there are merits to discuss, let's discuss them in this patch (in v2) because if cxl_region_invalidate_memregion() failures are not suitable to be warnings then that invalidates this patch. > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > Trivial passing comment inline. > > > > > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c > > index 3df10517a327..223c273c0cd1 100644 > > --- a/drivers/cxl/core/hdm.c > > +++ b/drivers/cxl/core/hdm.c > > @@ -712,7 +712,44 @@ static int cxl_decoder_commit(struct cxl_decoder *cxld) > > return 0; > > } > > > > -static int cxl_decoder_reset(struct cxl_decoder *cxld) > > +static int commit_reap(struct device *dev, const void *data) > > +{ > > + struct cxl_port *port = to_cxl_port(dev->parent); > > + struct cxl_decoder *cxld; > > + > > + if (!is_switch_decoder(dev) && !is_endpoint_decoder(dev)) > > + return 0; > > + > > + cxld = to_cxl_decoder(dev); > > + if (port->commit_end == cxld->id && > > + ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) { > I'd have gone with !(cxld->flags & CXL_DECODER_F_ENABLE) but > this is consistent with exiting form, so fine as is. I have long had an aversion to negation operators for the small speedbump to left-to-right readability as evidenced by all the other "(cxld->flags & CXL_DECODER_F_ENABLE) == 0" in drivers/cxl/.
diff --git a/drivers/base/core.c b/drivers/base/core.c index a4c853411a6b..e42f1ad73078 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -4037,6 +4037,41 @@ int device_for_each_child_reverse(struct device *parent, void *data, } EXPORT_SYMBOL_GPL(device_for_each_child_reverse); +/** + * device_for_each_child_reverse_from - device child iterator in reversed order. + * @parent: parent struct device. + * @from: optional starting point in child list + * @fn: function to be called for each device. + * @data: data for the callback. + * + * Iterate over @parent's child devices, starting at @from, and call @fn + * for each, passing it @data. This helper is identical to + * device_for_each_child_reverse() when @from is NULL. + * + * @fn is checked each iteration. If it returns anything other than 0, + * iteration stop and that value is returned to the caller of + * device_for_each_child_reverse_from(); + */ +int device_for_each_child_reverse_from(struct device *parent, + struct device *from, const void *data, + int (*fn)(struct device *, const void *)) +{ + struct klist_iter i; + struct device *child; + int error = 0; + + if (!parent->p) + return 0; + + klist_iter_init_node(&parent->p->klist_children, &i, + (from ? &from->p->knode_parent : NULL)); + while ((child = prev_device(&i)) && !error) + error = fn(child, data); + klist_iter_exit(&i); + return error; +} +EXPORT_SYMBOL_GPL(device_for_each_child_reverse_from); + /** * device_find_child - device iterator for locating a particular device. * @parent: parent struct device diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c index 3df10517a327..223c273c0cd1 100644 --- a/drivers/cxl/core/hdm.c +++ b/drivers/cxl/core/hdm.c @@ -712,7 +712,44 @@ static int cxl_decoder_commit(struct cxl_decoder *cxld) return 0; } -static int cxl_decoder_reset(struct cxl_decoder *cxld) +static int commit_reap(struct device *dev, const void *data) +{ + struct cxl_port *port = to_cxl_port(dev->parent); + struct cxl_decoder *cxld; + + if (!is_switch_decoder(dev) && !is_endpoint_decoder(dev)) + return 0; + + cxld = to_cxl_decoder(dev); + if (port->commit_end == cxld->id && + ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) { + port->commit_end--; + dev_dbg(&port->dev, "reap: %s commit_end: %d\n", + dev_name(&cxld->dev), port->commit_end); + } + + return 0; +} + +void cxl_port_commit_reap(struct cxl_decoder *cxld) +{ + struct cxl_port *port = to_cxl_port(cxld->dev.parent); + + lockdep_assert_held_write(&cxl_region_rwsem); + + /* + * Once the highest committed decoder is disabled, free any other + * decoders that were pinned allocated by out-of-order release. + */ + port->commit_end--; + dev_dbg(&port->dev, "reap: %s commit_end: %d\n", dev_name(&cxld->dev), + port->commit_end); + device_for_each_child_reverse_from(&port->dev, &cxld->dev, NULL, + commit_reap); +} +EXPORT_SYMBOL_NS_GPL(cxl_port_commit_reap, CXL); + +static void cxl_decoder_reset(struct cxl_decoder *cxld) { struct cxl_port *port = to_cxl_port(cxld->dev.parent); struct cxl_hdm *cxlhdm = dev_get_drvdata(&port->dev); @@ -721,14 +758,14 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld) u32 ctrl; if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0) - return 0; + return; - if (port->commit_end != id) { + if (port->commit_end == id) + cxl_port_commit_reap(cxld); + else dev_dbg(&port->dev, "%s: out of order reset, expected decoder%d.%d\n", dev_name(&cxld->dev), port->id, port->commit_end); - return -EBUSY; - } down_read(&cxl_dpa_rwsem); ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id)); @@ -741,7 +778,6 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld) writel(0, hdm + CXL_HDM_DECODER0_BASE_LOW_OFFSET(id)); up_read(&cxl_dpa_rwsem); - port->commit_end--; cxld->flags &= ~CXL_DECODER_F_ENABLE; /* Userspace is now responsible for reconfiguring this decoder */ @@ -751,8 +787,6 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld) cxled = to_cxl_endpoint_decoder(&cxld->dev); cxled->state = CXL_DECODER_STATE_MANUAL; } - - return 0; } static int cxl_setup_hdm_decoder_from_dvsec( diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c index e701e4b04032..3478d2058303 100644 --- a/drivers/cxl/core/region.c +++ b/drivers/cxl/core/region.c @@ -232,8 +232,8 @@ static int cxl_region_invalidate_memregion(struct cxl_region *cxlr) "Bypassing cpu_cache_invalidate_memregion() for testing!\n"); return 0; } else { - dev_err(&cxlr->dev, - "Failed to synchronize CPU cache state\n"); + dev_WARN(&cxlr->dev, + "Failed to synchronize CPU cache state\n"); return -ENXIO; } } @@ -242,19 +242,17 @@ static int cxl_region_invalidate_memregion(struct cxl_region *cxlr) return 0; } -static int cxl_region_decode_reset(struct cxl_region *cxlr, int count) +static void cxl_region_decode_reset(struct cxl_region *cxlr, int count) { struct cxl_region_params *p = &cxlr->params; - int i, rc = 0; + int i; /* - * Before region teardown attempt to flush, and if the flush - * fails cancel the region teardown for data consistency - * concerns + * Before region teardown attempt to flush, evict any data cached for + * this region, or scream loudly about missing arch / platform support + * for CXL teardown. */ - rc = cxl_region_invalidate_memregion(cxlr); - if (rc) - return rc; + cxl_region_invalidate_memregion(cxlr); for (i = count - 1; i >= 0; i--) { struct cxl_endpoint_decoder *cxled = p->targets[i]; @@ -277,23 +275,17 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count) cxl_rr = cxl_rr_load(iter, cxlr); cxld = cxl_rr->decoder; if (cxld->reset) - rc = cxld->reset(cxld); - if (rc) - return rc; + cxld->reset(cxld); set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); } endpoint_reset: - rc = cxled->cxld.reset(&cxled->cxld); - if (rc) - return rc; + cxled->cxld.reset(&cxled->cxld); set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); } /* all decoders associated with this region have been torn down */ clear_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); - - return 0; } static int commit_decoder(struct cxl_decoder *cxld) @@ -409,16 +401,8 @@ static ssize_t commit_store(struct device *dev, struct device_attribute *attr, * still pending. */ if (p->state == CXL_CONFIG_RESET_PENDING) { - rc = cxl_region_decode_reset(cxlr, p->interleave_ways); - /* - * Revert to committed since there may still be active - * decoders associated with this region, or move forward - * to active to mark the reset successful - */ - if (rc) - p->state = CXL_CONFIG_COMMIT; - else - p->state = CXL_CONFIG_ACTIVE; + cxl_region_decode_reset(cxlr, p->interleave_ways); + p->state = CXL_CONFIG_ACTIVE; } } @@ -2054,13 +2038,7 @@ static int cxl_region_detach(struct cxl_endpoint_decoder *cxled) get_device(&cxlr->dev); if (p->state > CXL_CONFIG_ACTIVE) { - /* - * TODO: tear down all impacted regions if a device is - * removed out of order - */ - rc = cxl_region_decode_reset(cxlr, p->interleave_ways); - if (rc) - goto out; + cxl_region_decode_reset(cxlr, p->interleave_ways); p->state = CXL_CONFIG_ACTIVE; } diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index 0d8b810a51f0..5406e3ab3d4a 100644 --- a/drivers/cxl/cxl.h +++ b/drivers/cxl/cxl.h @@ -359,7 +359,7 @@ struct cxl_decoder { struct cxl_region *region; unsigned long flags; int (*commit)(struct cxl_decoder *cxld); - int (*reset)(struct cxl_decoder *cxld); + void (*reset)(struct cxl_decoder *cxld); }; /* @@ -730,6 +730,7 @@ static inline bool is_cxl_root(struct cxl_port *port) int cxl_num_decoders_committed(struct cxl_port *port); bool is_cxl_port(const struct device *dev); struct cxl_port *to_cxl_port(const struct device *dev); +void cxl_port_commit_reap(struct cxl_decoder *cxld); struct pci_bus; int devm_cxl_register_pci_bus(struct device *host, struct device *uport_dev, struct pci_bus *bus); diff --git a/include/linux/device.h b/include/linux/device.h index b4bde8d22697..667cb6db9019 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -1078,6 +1078,9 @@ int device_for_each_child(struct device *dev, void *data, int (*fn)(struct device *dev, void *data)); int device_for_each_child_reverse(struct device *dev, void *data, int (*fn)(struct device *dev, void *data)); +int device_for_each_child_reverse_from(struct device *parent, + struct device *from, const void *data, + int (*fn)(struct device *, const void *)); struct device *device_find_child(struct device *dev, void *data, int (*match)(struct device *dev, void *data)); struct device *device_find_child_by_name(struct device *parent, diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c index 90d5afd52dd0..c5bbd89b3192 100644 --- a/tools/testing/cxl/test/cxl.c +++ b/tools/testing/cxl/test/cxl.c @@ -693,26 +693,22 @@ static int mock_decoder_commit(struct cxl_decoder *cxld) return 0; } -static int mock_decoder_reset(struct cxl_decoder *cxld) +static void mock_decoder_reset(struct cxl_decoder *cxld) { struct cxl_port *port = to_cxl_port(cxld->dev.parent); int id = cxld->id; if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0) - return 0; + return; dev_dbg(&port->dev, "%s reset\n", dev_name(&cxld->dev)); - if (port->commit_end != id) { + if (port->commit_end == id) + cxl_port_commit_reap(cxld); + else dev_dbg(&port->dev, "%s: out of order reset, expected decoder%d.%d\n", dev_name(&cxld->dev), port->id, port->commit_end); - return -EBUSY; - } - - port->commit_end--; cxld->flags &= ~CXL_DECODER_F_ENABLE; - - return 0; } static void default_mock_decoder(struct cxl_decoder *cxld)
In support of investigating an initialization failure report [1], cxl_test was updated to register mock memory-devices after the mock root-port/bus device had been registered. That led to cxl_test crashing with a use-after-free bug with the following signature: cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1 cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1 cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0 1) cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1 [..] cxld_unregister: cxl decoder14.0: cxl_region_decode_reset: cxl_region region3: mock_decoder_reset: cxl_port port3: decoder3.0 reset 2) mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1 cxl_endpoint_decoder_release: cxl decoder14.0: [..] cxld_unregister: cxl decoder7.0: 3) cxl_region_decode_reset: cxl_region region3: Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI [..] RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core] [..] Call Trace: <TASK> cxl_region_decode_reset+0x69/0x190 [cxl_core] cxl_region_detach+0xe8/0x210 [cxl_core] cxl_decoder_kill_region+0x27/0x40 [cxl_core] cxld_unregister+0x5d/0x60 [cxl_core] At 1) a region has been established with 2 endpoint decoders (7.0 and 14.0). Those endpoints share a common switch-decoder in the topology (3.0). At teardown, 2), decoder14.0 is the first to be removed and hits the "out of order reset case" in the switch decoder. The effect though is that region3 cleanup is aborted leaving it in-tact and referencing decoder14.0. At 3) the second attempt to teardown region3 trips over the stale decoder14.0 object which has long since been deleted. The fix here is to recognize that the CXL specification places no mandate on in-order shutdown of switch-decoders, the driver enforces in-order allocation, and hardware enforces in-order commit. So, rather than fail and leave objects dangling, always remove them. In support of making cxl_region_decode_reset() always succeed, cxl_region_invalidate_memregion() failures are turned into warnings. Crashing the kernel is ok there since system integrity is at risk if caches cannot be managed around physical address mutation events like CXL region destruction. A new device_for_each_child_reverse_from() is added to cleanup port->commit_end after all dependent decoders have been disabled. In other words if decoders are allocated 0->1->2 and disabled 1->2->0 then port->commit_end only decrements from 2 after 2 has been disabled, and it decrements all the way to zero since 1 was disabled previously. Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1] Cc: <stable@vger.kernel.org> Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware") Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jonathan Cameron <jonathan.cameron@huawei.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Zijun Hu <zijun_hu@icloud.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- drivers/base/core.c | 35 +++++++++++++++++++++++++++++ drivers/cxl/core/hdm.c | 50 +++++++++++++++++++++++++++++++++++------- drivers/cxl/core/region.c | 48 +++++++++++----------------------------- drivers/cxl/cxl.h | 3 ++- include/linux/device.h | 3 +++ tools/testing/cxl/test/cxl.c | 14 ++++-------- 6 files changed, 100 insertions(+), 53 deletions(-)