Message ID | 169596542307.790108.11339208844199665348.stgit@dwillia2-xfh.jf.intel.com |
---|---|
State | Superseded |
Headers | show |
Series | cxl/mem: Fix shutdown order | expand |
On Thu, 28 Sep 2023 22:30:23 -0700 Dan Williams <dan.j.williams@intel.com> wrote: > Ira reports that removing cxl_mock_mem causes a crash with the following > trace: > > BUG: kernel NULL pointer dereference, address: 0000000000000044 > [..] > RIP: 0010:cxl_region_decode_reset+0x7f/0x180 [cxl_core] > [..] > Call Trace: > <TASK> > cxl_region_detach+0xe8/0x210 [cxl_core] > cxl_decoder_kill_region+0x27/0x40 [cxl_core] > cxld_unregister+0x29/0x40 [cxl_core] > devres_release_all+0xb8/0x110 > device_unbind_cleanup+0xe/0x70 > device_release_driver_internal+0x1d2/0x210 > bus_remove_device+0xd7/0x150 > device_del+0x155/0x3e0 > device_unregister+0x13/0x60 > devm_release_action+0x4d/0x90 > ? __pfx_unregister_port+0x10/0x10 [cxl_core] > delete_endpoint+0x121/0x130 [cxl_core] > devres_release_all+0xb8/0x110 > device_unbind_cleanup+0xe/0x70 > device_release_driver_internal+0x1d2/0x210 > bus_remove_device+0xd7/0x150 > device_del+0x155/0x3e0 > ? lock_release+0x142/0x290 > cdev_device_del+0x15/0x50 > cxl_memdev_unregister+0x54/0x70 [cxl_core] > > This crash is due to the clearing out the cxl_memdev's driver context > (@cxlds) before the driver is done with it. Fix it by keeping the driver > context valid until device unregistration completes. Ideally good to have an explicit statement of which bit of context is used in the cdev_device_del() call. > > Fixes: 9cc238c7a526 ("cxl/pci: Introduce cdevm_file_operations") > Reported-by: Ira Weiny <ira.weiny@intel.com> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> I'm not that keen on the fix, because it makes some incidental ordering changes and the ordering in here is messy. What you have feels like it's fixing on symptom whereas there are other issues here. Firstly, as a side note given I'm looking at this code cxl_memdev_security_shutdown() should perhaps take a struct cxl_memdev rather than the struct device that it then uses just to get hold of the cxl_memdev available in cxl_memdev_shutdown(). The equivalent setup code cxl_memdev_security_init() already does take a struct cxl_memdev so this would be desirable for symmetry if nothing else. However, cxl_memdev_security_shutdown() doesn't actually undo the stuff in cxl_memdev_security_init() which previously sent me on a wild goose chase. Offloading that put_santize callback here to devm looks dubious to me: https://elixir.bootlin.com/linux/v6.6-rc3/source/drivers/cxl/core/memdev.c#L1031 .. for a function mid way through a bunch of stuff that is otherwise cleaned up by cxl_memdev_unregister() via devm here https://elixir.bootlin.com/linux/v6.6-rc3/source/drivers/cxl/core/memdev.c#L1066 seems like a recipe for future pain. Could we just move the put_santize() call directly into cxl_memdev_unregister() and the error path in cxl_memdev_add_dev() and make sure the order is as expected. I had a long argument written up on why cxl_memdev_security_shutdown() was in the wrong place in the new sequence before seeing that it didn't balance with what I assumed it did. So probably still in an illogical place given I'd expect all the 'security stuff' to be initialized at same point in setup and tear down sequences (reversed obviously.) Thanks, Jonathan > --- > drivers/cxl/core/memdev.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c > index 14b547c07f54..92d40c5e7efb 100644 > --- a/drivers/cxl/core/memdev.c > +++ b/drivers/cxl/core/memdev.c > @@ -580,8 +580,8 @@ static void cxl_memdev_unregister(void *_cxlmd) > struct cxl_memdev *cxlmd = _cxlmd; > struct device *dev = &cxlmd->dev; > > - cxl_memdev_shutdown(dev); > cdev_device_del(&cxlmd->cdev, dev); > + cxl_memdev_shutdown(dev); > put_device(dev); > } > >
Jonathan Cameron wrote: > On Thu, 28 Sep 2023 22:30:23 -0700 > Dan Williams <dan.j.williams@intel.com> wrote: > > > Ira reports that removing cxl_mock_mem causes a crash with the following > > trace: > > > > BUG: kernel NULL pointer dereference, address: 0000000000000044 > > [..] > > RIP: 0010:cxl_region_decode_reset+0x7f/0x180 [cxl_core] > > [..] > > Call Trace: > > <TASK> > > cxl_region_detach+0xe8/0x210 [cxl_core] > > cxl_decoder_kill_region+0x27/0x40 [cxl_core] > > cxld_unregister+0x29/0x40 [cxl_core] > > devres_release_all+0xb8/0x110 > > device_unbind_cleanup+0xe/0x70 > > device_release_driver_internal+0x1d2/0x210 > > bus_remove_device+0xd7/0x150 > > device_del+0x155/0x3e0 > > device_unregister+0x13/0x60 > > devm_release_action+0x4d/0x90 > > ? __pfx_unregister_port+0x10/0x10 [cxl_core] > > delete_endpoint+0x121/0x130 [cxl_core] > > devres_release_all+0xb8/0x110 > > device_unbind_cleanup+0xe/0x70 > > device_release_driver_internal+0x1d2/0x210 > > bus_remove_device+0xd7/0x150 > > device_del+0x155/0x3e0 > > ? lock_release+0x142/0x290 > > cdev_device_del+0x15/0x50 > > cxl_memdev_unregister+0x54/0x70 [cxl_core] > > > > This crash is due to the clearing out the cxl_memdev's driver context > > (@cxlds) before the driver is done with it. Fix it by keeping the driver > > context valid until device unregistration completes. > Ideally good to have an explicit statement of which bit of context > is used in the cdev_device_del() call. > > > > Fixes: 9cc238c7a526 ("cxl/pci: Introduce cdevm_file_operations") > > Reported-by: Ira Weiny <ira.weiny@intel.com> > > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > > I'm not that keen on the fix, because it makes some incidental ordering > changes and the ordering in here is messy. What you have feels like > it's fixing on symptom whereas there are other issues here. > > Firstly, as a side note given I'm looking at this > code cxl_memdev_security_shutdown() should perhaps take a > struct cxl_memdev rather than the struct device that it then cxl_memdev_security_shutdown() is in the wrong place, it should just be shutdown like any other optional capability. > uses just to get hold of the cxl_memdev available in > cxl_memdev_shutdown(). The equivalent setup code > cxl_memdev_security_init() already does take a struct > cxl_memdev so this would be desirable for symmetry if nothing else. > > However, cxl_memdev_security_shutdown() doesn't actually undo > the stuff in cxl_memdev_security_init() which previously sent me > on a wild goose chase. Yes, that needs to be cleaned up. > > Offloading that put_santize callback here to devm looks dubious to me: > https://elixir.bootlin.com/linux/v6.6-rc3/source/drivers/cxl/core/memdev.c#L1031 > > .. for a function mid way through a bunch of stuff that is otherwise cleaned up > by cxl_memdev_unregister() via devm here > https://elixir.bootlin.com/linux/v6.6-rc3/source/drivers/cxl/core/memdev.c#L1066 > seems like a recipe for future pain. Could we just move the put_santize() > call directly into cxl_memdev_unregister() and the error path in > cxl_memdev_add_dev() and make sure the order is as expected. > > I had a long argument written up on why cxl_memdev_security_shutdown() > was in the wrong place in the new sequence before seeing that it > didn't balance with what I assumed it did. So probably still in an illogical > place given I'd expect all the 'security stuff' to be initialized at same > point in setup and tear down sequences (reversed obviously.) Yes, I tripped over that and should have mentioned the grumbling as I held my nose and fixed the crash.
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c index 14b547c07f54..92d40c5e7efb 100644 --- a/drivers/cxl/core/memdev.c +++ b/drivers/cxl/core/memdev.c @@ -580,8 +580,8 @@ static void cxl_memdev_unregister(void *_cxlmd) struct cxl_memdev *cxlmd = _cxlmd; struct device *dev = &cxlmd->dev; - cxl_memdev_shutdown(dev); cdev_device_del(&cxlmd->cdev, dev); + cxl_memdev_shutdown(dev); put_device(dev); }
Ira reports that removing cxl_mock_mem causes a crash with the following trace: BUG: kernel NULL pointer dereference, address: 0000000000000044 [..] RIP: 0010:cxl_region_decode_reset+0x7f/0x180 [cxl_core] [..] Call Trace: <TASK> cxl_region_detach+0xe8/0x210 [cxl_core] cxl_decoder_kill_region+0x27/0x40 [cxl_core] cxld_unregister+0x29/0x40 [cxl_core] devres_release_all+0xb8/0x110 device_unbind_cleanup+0xe/0x70 device_release_driver_internal+0x1d2/0x210 bus_remove_device+0xd7/0x150 device_del+0x155/0x3e0 device_unregister+0x13/0x60 devm_release_action+0x4d/0x90 ? __pfx_unregister_port+0x10/0x10 [cxl_core] delete_endpoint+0x121/0x130 [cxl_core] devres_release_all+0xb8/0x110 device_unbind_cleanup+0xe/0x70 device_release_driver_internal+0x1d2/0x210 bus_remove_device+0xd7/0x150 device_del+0x155/0x3e0 ? lock_release+0x142/0x290 cdev_device_del+0x15/0x50 cxl_memdev_unregister+0x54/0x70 [cxl_core] This crash is due to the clearing out the cxl_memdev's driver context (@cxlds) before the driver is done with it. Fix it by keeping the driver context valid until device unregistration completes. Fixes: 9cc238c7a526 ("cxl/pci: Introduce cdevm_file_operations") Reported-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- drivers/cxl/core/memdev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)