Message ID | 156029554317.419799.1324389595953183385.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive) |
---|---|
Headers | show |
Series | libnvdimm: Fix async operations and locking | expand |
On 6/11/2019 4:25 PM, Dan Williams wrote: > The libnvdimm subsystem uses async operations to parallelize device > probing operations and to allow sysfs to trigger device_unregister() on > deleted namepsaces. A multithreaded stress test of the libnvdimm sysfs > interface uncovered a case where device_unregister() is triggered > multiple times, and the subsequent investigation uncovered a broken > locking scenario. > > The lack of lockdep coverage for device_lock() stymied the debug. That > is, until patch6 "driver-core, libnvdimm: Let device subsystems add > local lockdep coverage" solved that with a shadow lock, with lockdep > coverage, to mirror device_lock() operations. Given the time saved with > shadow-lock debug-hack, patch6 attempts to generalize device_lock() > debug facility that might be able to be carried upstream. Patch6 is > staged at the end of this fix series in case it is contentious and needs > to be dropped. > > Patch1 "drivers/base: Introduce kill_device()" could be achieved with > local libnvdimm infrastructure. However, the existing 'dead' flag in > 'struct device_private' aims to solve similar async register/unregister > races so the fix in patch2 "libnvdimm/bus: Prevent duplicate > device_unregister() calls" can be implemented with existing driver-core > infrastructure. > > Patch3 is a rare lockdep warning that is intermittent based on > namespaces racing ahead of the completion of probe of their parent > region. It is not related to the other fixes, it just happened to > trigger as a result of the async stress test. > > Patch4 and patch5 address an ABBA deadlock tripped by the stress test. > > These patches pass the failing stress test and the existing libnvdimm > unit tests with CONFIG_PROVE_LOCKING=y and the new "dev->lockdep_mutex" > shadow lock with no lockdep warnings. > > --- > > Dan Williams (6): > drivers/base: Introduce kill_device() > libnvdimm/bus: Prevent duplicate device_unregister() calls > libnvdimm/region: Register badblocks before namespaces > libnvdimm/bus: Stop holding nvdimm_bus_list_mutex over __nd_ioctl() > libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock > driver-core, libnvdimm: Let device subsystems add local lockdep coverage > > > drivers/acpi/nfit/core.c | 28 ++++--- > drivers/acpi/nfit/nfit.h | 24 ++++++ > drivers/base/core.c | 30 ++++++-- > drivers/nvdimm/btt_devs.c | 16 ++-- > drivers/nvdimm/bus.c | 154 +++++++++++++++++++++++++++------------ > drivers/nvdimm/core.c | 10 +-- > drivers/nvdimm/dimm_devs.c | 4 + > drivers/nvdimm/namespace_devs.c | 36 +++++---- > drivers/nvdimm/nd-core.h | 71 ++++++++++++++++++ > drivers/nvdimm/pfn_devs.c | 24 +++--- > drivers/nvdimm/pmem.c | 4 + > drivers/nvdimm/region.c | 24 +++--- > drivers/nvdimm/region_devs.c | 12 ++- > include/linux/device.h | 6 ++ > 14 files changed, 308 insertions(+), 135 deletions(-) > Tested-by: Jane Chu <jane.chu@oracle.com> Specifically, running parallel ndctls creating/destroying namespaces in multiple processes concurrently led to system panic, that has been verified fixed by this patch series. Thanks! -jane