[RFC] nvme: avoid race-conditions when enabling devices

Message ID 744877924.5841545.1521630049567.JavaMail.zimbra@kalray.eu (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas

Commit Message

Marta Rybczynska March 21, 2018, 11 a.m. UTC
The NVMe driver uses threads for the work done at device reset, including
enabling the PCIe device. When multiple NVMe devices are initialized, their
reset works may be scheduled in parallel, so pci_enable_device_mem() can be
called in parallel on multiple cores.

This causes a loop that enables all upstream bridges in
pci_enable_bridge(). pci_enable_bridge() performs multiple operations,
including __pci_set_master() and architecture-specific functions that
call functions such as pci_enable_resources(). Both __pci_set_master()
and pci_enable_resources() read the PCI_COMMAND field in the PCIe
configuration space and change it. This is done as a read/modify/write.

Imagine that the PCIe tree looks like:
A - B - switch -  C - D
               \- E - F

D and F are two NVMe disks, and all devices from B down are not enabled and
bus mastering is not set. If their reset works are scheduled in parallel, the
two modifications of PCI_COMMAND may happen in parallel without locking, and
the system may end up with part of the PCIe tree not enabled.
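
The lost update described above can be sketched in plain C. This is only an
illustration, not kernel code: struct fake_bridge and both functions are
hypothetical names, and the two "paths" are written out in a fixed order
instead of running on real threads. Only the two PCI_COMMAND bit values are
taken from include/uapi/linux/pci_regs.h.

```c
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/pci_regs.h */
#define PCI_COMMAND_MEMORY 0x2 /* Enable response in memory space */
#define PCI_COMMAND_MASTER 0x4 /* Enable bus mastering */

/* Hypothetical stand-in for the PCI_COMMAND register of a shared bridge. */
struct fake_bridge {
	uint16_t command;
};

/*
 * Both paths read the register before either one writes it back.
 * The second write is based on a stale value and drops the first bit.
 */
static uint16_t enable_interleaved(struct fake_bridge *b)
{
	uint16_t cmd_d = b->command;             /* path for D reads */
	uint16_t cmd_f = b->command;             /* path for F reads */

	b->command = cmd_d | PCI_COMMAND_MASTER; /* path for D writes */
	b->command = cmd_f | PCI_COMMAND_MEMORY; /* path for F overwrites it */
	return b->command;
}

/* Each read/modify/write completes before the next one starts. */
static uint16_t enable_serialized(struct fake_bridge *b)
{
	uint16_t cmd;

	cmd = b->command;
	b->command = cmd | PCI_COMMAND_MASTER;

	cmd = b->command;
	b->command = cmd | PCI_COMMAND_MEMORY;
	return b->command;
}
```

Starting from a cleared register, enable_interleaved() leaves only
PCI_COMMAND_MEMORY set, while enable_serialized() leaves both bits set.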

The problem may also happen if another device is initialized in parallel
with an NVMe disk.

This fix calls pci_enable_device_mem() from the probe part of the driver,
which runs sequentially, to avoid the issue.

Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu>
Signed-off-by: Pierre-Yves Kerbrat <pkerbrat@kalray.eu>
---
 drivers/nvme/host/pci.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Ming Lei March 21, 2018, 11:50 a.m. UTC | #1
On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
> NVMe driver uses threads for the work at device reset, including enabling
> the PCIe device. When multiple NVMe devices are initialized, their reset
> works may be scheduled in parallel. Then pci_enable_device_mem can be
> called in parallel on multiple cores.
> 
> This causes a loop of enabling of all upstream bridges in
> pci_enable_bridge(). pci_enable_bridge() causes multiple operations
> including __pci_set_master and architecture-specific functions that
> call ones like and pci_enable_resources(). Both __pci_set_master()
> and pci_enable_resources() read PCI_COMMAND field in the PCIe space
> and change it. This is done as read/modify/write.
> 
> Imagine that the PCIe tree looks like:
> A - B - switch -  C - D
>                \- E - F
> 
> D and F are two NVMe disks and all devices from B are not enabled and bus
> mastering is not set. If their reset work are scheduled in parallel the two
> modifications of PCI_COMMAND may happen in parallel without locking and the
> system may end up with the part of PCIe tree not enabled.

It then looks like serialized reset should be used, and I did see that commit
79c48ccf2fe ("nvme-pci: serialize pci resets") fixes the 'failed to mark
controller state' issue in a reset stress test.

But that commit only covers the case of a PCI reset from the sysfs attribute,
and other cases may need to be dealt with in a similar way too.

Thanks,
Ming
Marta Rybczynska March 21, 2018, 12:10 p.m. UTC | #2
> On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
>> NVMe driver uses threads for the work at device reset, including enabling
>> the PCIe device. When multiple NVMe devices are initialized, their reset
>> works may be scheduled in parallel. Then pci_enable_device_mem can be
>> called in parallel on multiple cores.
>> 
>> This causes a loop of enabling of all upstream bridges in
>> pci_enable_bridge(). pci_enable_bridge() causes multiple operations
>> including __pci_set_master and architecture-specific functions that
>> call ones like and pci_enable_resources(). Both __pci_set_master()
>> and pci_enable_resources() read PCI_COMMAND field in the PCIe space
>> and change it. This is done as read/modify/write.
>> 
>> Imagine that the PCIe tree looks like:
>> A - B - switch -  C - D
>>                \- E - F
>> 
>> D and F are two NVMe disks and all devices from B are not enabled and bus
>> mastering is not set. If their reset work are scheduled in parallel the two
>> modifications of PCI_COMMAND may happen in parallel without locking and the
>> system may end up with the part of PCIe tree not enabled.
> 
> Then looks serialized reset should be used, and I did see the commit
> 79c48ccf2fe ("nvme-pci: serialize pci resets") fixes issue of 'failed
> to mark controller state' in reset stress test.
> 
> But that commit only covers case of PCI reset from sysfs attribute, and
> maybe other cases need to be dealt with in similar way too.
> 

It seems to me that the serialized reset works for multiple resets of the
same device, doesn't it? Our problem is linked to resets of different devices
that share the same PCIe tree.

You're right that the problem we face might also come with manual resets
under certain conditions (I think that all devices in a subtree would need
to be disabled).

Thanks,
Marta
Ming Lei March 21, 2018, 3:48 p.m. UTC | #3
On Wed, Mar 21, 2018 at 01:10:31PM +0100, Marta Rybczynska wrote:
> > On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
> >> NVMe driver uses threads for the work at device reset, including enabling
> >> the PCIe device. When multiple NVMe devices are initialized, their reset
> >> works may be scheduled in parallel. Then pci_enable_device_mem can be
> >> called in parallel on multiple cores.
> >> 
> >> This causes a loop of enabling of all upstream bridges in
> >> pci_enable_bridge(). pci_enable_bridge() causes multiple operations
> >> including __pci_set_master and architecture-specific functions that
> >> call ones like and pci_enable_resources(). Both __pci_set_master()
> >> and pci_enable_resources() read PCI_COMMAND field in the PCIe space
> >> and change it. This is done as read/modify/write.
> >> 
> >> Imagine that the PCIe tree looks like:
> >> A - B - switch -  C - D
> >>                \- E - F
> >> 
> >> D and F are two NVMe disks and all devices from B are not enabled and bus
> >> mastering is not set. If their reset work are scheduled in parallel the two
> >> modifications of PCI_COMMAND may happen in parallel without locking and the
> >> system may end up with the part of PCIe tree not enabled.
> > 
> > Then looks serialized reset should be used, and I did see the commit
> > 79c48ccf2fe ("nvme-pci: serialize pci resets") fixes issue of 'failed
> > to mark controller state' in reset stress test.
> > 
> > But that commit only covers case of PCI reset from sysfs attribute, and
> > maybe other cases need to be dealt with in similar way too.
> > 
> 
> It seems to me that the serialized reset works for multiple resets of the
> same device, doesn't it? Our problem is linked to resets of different devices
> that share the same PCIe tree.

Given that reset shouldn't be a frequent action, it might be fine to serialize
all resets from different devices.

Thanks,
Ming
Patch

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b6f43b7..af53854 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2515,6 +2515,14 @@  static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
+	/*
+	 * Enable the device now to make sure that all accesses to bridges above
+	 * are done without races
+	 */
+	result = pci_enable_device_mem(pdev);
+	if (result)
+		goto release_pools;
+
 	nvme_reset_ctrl(&dev->ctrl);
 
 	return 0;