Message ID | 1501917313-9812-3-git-send-email-dingtianhong@huawei.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
On Sat, Aug 05, 2017 at 03:15:11PM +0800, Ding Tianhong wrote: > When bit4 is set in the PCIe Device Control register, it indicates > whether the device is permitted to use relaxed ordering. > On some platforms using relaxed ordering can have performance issues or > due to erratum can cause data-corruption. In such cases devices must avoid > using relaxed ordering. > > This patch checks if there is any node in the hierarchy that indicates that > using relaxed ordering is not safe. I think you only check the devices between the root port and the target device. For example, you don't check siblings or cousins of the target device. > In such cases the patch turns off the > relaxed ordering by clearing the eapability for this device. s/eapability/capability/ > And if the > device is probably running in a guest machine, we should do nothing. I don't know what this sentence means. "Probably running in a guest machine" doesn't really make sense, and there's nothing in your patch that explicitly checks for being in a guest machine. > Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> > Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> > Acked-by: Ashok Raj <ashok.raj@intel.com> > --- > drivers/pci/pci.c | 29 +++++++++++++++++++++++++++++ > drivers/pci/probe.c | 37 +++++++++++++++++++++++++++++++++++++ > include/linux/pci.h | 2 ++ > 3 files changed, 68 insertions(+) > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index af0cc34..4f9d7c1 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -4854,6 +4854,35 @@ int pcie_set_mps(struct pci_dev *dev, int mps) > EXPORT_SYMBOL(pcie_set_mps); > > /** > + * pcie_clear_relaxed_ordering - clear PCI Express relaxed ordering bit > + * @dev: PCI device to query > + * > + * If possible clear relaxed ordering Why "If possible"? The bit is required to be RW or hardwired to zero, so PCI_EXP_DEVCTL_RELAX_EN should *always* be zero when this returns. > + */ > +int pcie_clear_relaxed_ordering(struct pci_dev *dev) > +{ > + return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL, > + PCI_EXP_DEVCTL_RELAX_EN); > +} > +EXPORT_SYMBOL(pcie_clear_relaxed_ordering); The current series doesn't add any callers of this except pci_configure_relaxed_ordering(), so it doesn't need to be exported to modules. I think I would put both of these functions in drivers/pci/probe.c. Then this one could be static and you'd only have to add pcie_relaxed_ordering_supported() to include/linux/pci.h. > + > +/** > + * pcie_relaxed_ordering_supported - Probe for PCIe relexed ordering support s/relexed/relaxed/ > + * @dev: PCI device to query > + * > + * Returns true if the device support relaxed ordering attribute. > + */ > +bool pcie_relaxed_ordering_supported(struct pci_dev *dev) > +{ > + u16 v; > + > + pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &v); > + > + return !!(v & PCI_EXP_DEVCTL_RELAX_EN); > +} > +EXPORT_SYMBOL(pcie_relaxed_ordering_supported); This is misnamed. This doesn't tell us anything about whether the device *supports* relaxed ordering. It only tells us whether the device is *permitted* to use it. When a device initiates a transaction, the hardware should set the RO bit in the TLP with logic something like this: RO = <this Function supports relaxed ordering> && <this transaction doesn't require strong write ordering> && <PCI_EXP_DEVCTL_RELAX_EN is set> The issue you're fixing is that some Completers don't handle RO correctly. The determining factor is not the Requester, but the Completer (for this series, a Root Port). So I think this should be something like: int pcie_relaxed_ordering_broken(struct pci_dev *completer) { if (!completer) return 0; return completer->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING; } and the caller should do something like this: if (pcie_relaxed_ordering_broken(pci_find_pcie_root_port(pdev))) adapter->flags |= ROOT_NO_RELAXED_ORDERING; That way it's obvious where the issue is, and it's obvious that the answer might be different for peer-to-peer transactions than it is for transactions to the root port, i.e., to coherent memory. > + > +/** > * pcie_get_minimum_link - determine minimum link settings of a PCI device > * @dev: PCI device to query > * @speed: storage for minimum speed > diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c > index c31310d..48df012 100644 > --- a/drivers/pci/probe.c > +++ b/drivers/pci/probe.c > @@ -1762,6 +1762,42 @@ static void pci_configure_extended_tags(struct pci_dev *dev) > PCI_EXP_DEVCTL_EXT_TAG); > } > > +/** > + * pci_dev_should_disable_relaxed_ordering - check if the PCI device > + * should disable the relaxed ordering attribute. > + * @dev: PCI device > + * > + * Return true if any of the PCI devices above us do not support > + * relaxed ordering. > + */ > +static bool pci_dev_should_disable_relaxed_ordering(struct pci_dev *dev) > +{ > + while (dev) { > + if (dev->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING) > + return true; > + > + dev = dev->bus->self; > + } > + > + return false; > +} > + > +static void pci_configure_relaxed_ordering(struct pci_dev *dev) > +{ > + /* We should not alter the relaxed ordering bit for the VF */ > + if (dev->is_virtfn) > + return; > + > + /* If the releaxed ordering enable bit is not set, do nothing. */ s/releaxed/relaxed/ > + if (!pcie_relaxed_ordering_supported(dev)) > + return; > + > + if (pci_dev_should_disable_relaxed_ordering(dev)) { > + pcie_clear_relaxed_ordering(dev); > + dev_info(&dev->dev, "Disable Relaxed Ordering\n"); This associates the message with the Requester that may potentially use relaxed ordering. But there's nothing wrong or unusual about the Requester; the issue is with the *Completer*, so I think the message should be in the quirk where we set PCI_DEV_FLAGS_NO_RELAXED_ORDERING. Maybe it should be both places; I dunno. This implementation assumes the device only initiates transactions to coherent memory, i.e., it assumes the device never does peer-to-peer DMA. I guess we'll have to wait and see if we trip over any peer-to-peer issues, then figure out how to handle them. > + } > +} > + > static void pci_configure_device(struct pci_dev *dev) > { > struct hotplug_params hpp; > @@ -1769,6 +1805,7 @@ static void pci_configure_device(struct pci_dev *dev) > > pci_configure_mps(dev); > pci_configure_extended_tags(dev); > + pci_configure_relaxed_ordering(dev); > > memset(&hpp, 0, sizeof(hpp)); > ret = pci_get_hp_params(dev, &hpp); > diff --git a/include/linux/pci.h b/include/linux/pci.h > index 412ec1c..3aa23a2 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -1127,6 +1127,8 @@ int pci_add_ext_cap_save_buffer(struct pci_dev *dev, > void pci_pme_wakeup_bus(struct pci_bus *bus); > void pci_d3cold_enable(struct pci_dev *dev); > void pci_d3cold_disable(struct pci_dev *dev); > +int pcie_clear_relaxed_ordering(struct pci_dev *dev); > +bool pcie_relaxed_ordering_supported(struct pci_dev *dev); > > /* PCI Virtual Channel */ > int pci_save_vc_state(struct pci_dev *dev); > -- > 1.8.3.1 > > > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
On Tue, Aug 08, 2017 at 09:22:39PM -0500, Bjorn Helgaas wrote: > On Sat, Aug 05, 2017 at 03:15:11PM +0800, Ding Tianhong wrote: > > When bit4 is set in the PCIe Device Control register, it indicates > > whether the device is permitted to use relaxed ordering. > > On some platforms using relaxed ordering can have performance issues or > > due to erratum can cause data-corruption. In such cases devices must avoid > > using relaxed ordering. > > > > This patch checks if there is any node in the hierarchy that indicates that > > using relaxed ordering is not safe. ... > > +EXPORT_SYMBOL(pcie_relaxed_ordering_supported); > > This is misnamed. This doesn't tell us anything about whether the > device *supports* relaxed ordering. It only tells us whether the > device is *permitted* to use it. > > When a device initiates a transaction, the hardware should set the RO > bit in the TLP with logic something like this: > > RO = <this Function supports relaxed ordering> && > <this transaction doesn't require strong write ordering> && > <PCI_EXP_DEVCTL_RELAX_EN is set> > > The issue you're fixing is that some Completers don't handle RO > correctly. The determining factor is not the Requester, but the > Completer (for this series, a Root Port). So I think this should be > something like: > > int pcie_relaxed_ordering_broken(struct pci_dev *completer) > { > if (!completer) > return 0; > > return completer->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING; > } > > and the caller should do something like this: > > if (pcie_relaxed_ordering_broken(pci_find_pcie_root_port(pdev))) > adapter->flags |= ROOT_NO_RELAXED_ORDERING; > > That way it's obvious where the issue is, and it's obvious that the > answer might be different for peer-to-peer transactions than it is for > transactions to the root port, i.e., to coherent memory. After looking at the driver, I wonder if it would be simpler like this: int pcie_relaxed_ordering_enabled(struct pci_dev *dev) { u16 ctl; pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &ctl); return ctl & PCI_EXP_DEVCTL_RELAX_EN; } EXPORT_SYMBOL(pcie_relaxed_ordering_enabled); static void pci_configure_relaxed_ordering(struct pci_dev *dev) { struct pci_dev *root; if (dev->is_virtfn) return; /* PCI_EXP_DEVCTL_RELAX_EN is RsvdP in VFs */ if (!pcie_relaxed_ordering_enabled(dev)) return; /* * For now, we only deal with Relaxed Ordering issues with Root * Ports. Peer-to-peer DMA is another can of worms. */ root = pci_find_pcie_root_port(dev); if (!root) return; if (root->relaxed_ordering_broken) pcie_capability_clear_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_RELAX_EN); } This doesn't check every intervening switch, but I don't think we know about any issues except with root ports. And the driver could do: if (!pcie_relaxed_ordering_enabled(pdev)) adapter->flags |= ROOT_NO_RELAXED_ORDERING; The driver code wouldn't show anything about coherent memory vs. peer-to-peer, but we really don't have a clue about how to handle that yet anyway. I guess this is back to exactly what you proposed, except that I changed the name of pcie_relaxed_ordering_supported() to pcie_relaxed_ordering_enabled(), which I think is slightly more specific from the device's point of view. Bjorn
| From: Bjorn Helgaas <helgaas@kernel.org> | Sent: Tuesday, August 8, 2017 7:22 PM | ... | and the caller should do something like this: | | if (pcie_relaxed_ordering_broken(pci_find_pcie_root_port(pdev))) | adapter->flags |= ROOT_NO_RELAXED_ORDERING; | | That way it's obvious where the issue is, and it's obvious that the | answer might be different for peer-to-peer transactions than it is for | transactions to the root port, i.e., to coherent memory. | ... Which is back to something very close to what I initially suggested in my first patch submission. Because you're right, this isn't about broken Source Devices, it's about broken Completing Devices. Unfortunately, as Alexander Duyck noted, in a Virtual Machine we won't be able to see our Root Port in order to make this determination. (And in some Hypervisor implementations I've seen, there's not even a synthetic Root Port available to the VM at all, let alone read-only access to the real one.) So the current scheme was developed of having the Hypervisor kernel traverse down the PCIe Fabric when it finds a broken Root Port implementation (the issue that we've mostly been primarily focused upon), and turning off the PCIe Capability Device Control[Relaxed Ordering Enable]. This was to serve two purposes: 1. Turn that off in order to prevent sending any Transaction Layer Packets with the Relaxed Ordering Attribute to any Completer. Which unfortunately would also prevent Peer-to-Peer use of the Relaxed Ordering Attribute. 2. Act as a message to Device Drivers for those downstream devices that the they were dealing with a broken Root Port implementation. And this would work even for a driver in a VM with an attached device since it would be able to see the PCIe Configuration Space for the attached device. I haven't been excited about any of this because: A. While so far all of the examples we've talked about are broken Root Port Completers, it's perfectly possible that other devices could be broken -- say an NVMe Device which is not "Coherent Memory". How would this fit into the framework APIs being described? B. I have yet to see a n example of how the currently proposed API framework would be used in a hybrid environment where TLPs to the Root Port would not use Relaxed Ordering, but TLPs to a Peer would use Relaxed Ordering. So far its all been about using a "big hammer" to completely disable the use of Relaxed Ordering. But the VM problem keeps cropping up over and over. A driver in a VM doesn't have access to the Root Port to determine if its on a "Black List" and our only way of communicating with the driver in the VM is to leave the device in a particular state (i.e. initialize the PCIe Capability Device Control[Relaxed Ordering Enable] to "off"). Oh, and also, on the current patch submission's focus on broken Root Port implementations: one could suggest that even if we're stuck with the "Device attached to a VM Conundrum", that what we should really be thinking about is if ANY device within a PCIe Fabric has broken Relaxed Ordering completion problems, and, if so, "poisoning" the entire containing PCIe Fabric by turning off Relaxed Ordering Enable for every device, up, down sideways -- including the Root Port itself. | ... | This associates the message with the Requester that may potentially | use relaxed ordering. But there's nothing wrong or unusual about the | Requester; the issue is with the *Completer*, so I think the message | should be in the quirk where we set PCI_DEV_FLAGS_NO_RELAXED_ORDERING. | Maybe it should be both places; I dunno. | | This implementation assumes the device only initiates transactions to | coherent memory, i.e., it assumes the device never does peer-to-peer | DMA. I guess we'll have to wait and see if we trip over any | peer-to-peer issues, then figure out how to handle them. | ... Yes, as soon as we want to implement the hybrid use of Relaxed Ordering I mentioned above. And that the Intel document mentions itself. Casey
Hi Bjorn: On 2017/8/9 10:22, Bjorn Helgaas wrote: > On Sat, Aug 05, 2017 at 03:15:11PM +0800, Ding Tianhong wrote: >> When bit4 is set in the PCIe Device Control register, it indicates >> whether the device is permitted to use relaxed ordering. >> On some platforms using relaxed ordering can have performance issues or >> due to erratum can cause data-corruption. In such cases devices must avoid >> using relaxed ordering. >> >> This patch checks if there is any node in the hierarchy that indicates that >> using relaxed ordering is not safe. > > I think you only check the devices between the root port and the > target device. For example, you don't check siblings or cousins of > the target device. > OK, update the description. >> In such cases the patch turns off the >> relaxed ordering by clearing the eapability for this device. > > s/eapability/capability/ > >> And if the >> device is probably running in a guest machine, we should do nothing. > > I don't know what this sentence means. "Probably running in a guest > machine" doesn't really make sense, and there's nothing in your patch > that explicitly checks for being in a guest machine. > Alex noticed that we should do nothing if in the virtual machine because the Root Complex is NULL at that time, so I think this word should be more clearly here. >> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> >> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> >> Acked-by: Ashok Raj <ashok.raj@intel.com> >> --- >> drivers/pci/pci.c | 29 +++++++++++++++++++++++++++++ >> drivers/pci/probe.c | 37 +++++++++++++++++++++++++++++++++++++ >> include/linux/pci.h | 2 ++ >> 3 files changed, 68 insertions(+) >> >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >> index af0cc34..4f9d7c1 100644 >> --- a/drivers/pci/pci.c >> +++ b/drivers/pci/pci.c >> @@ -4854,6 +4854,35 @@ int pcie_set_mps(struct pci_dev *dev, int mps) >> EXPORT_SYMBOL(pcie_set_mps); >> >> /** >> + * pcie_clear_relaxed_ordering - clear PCI Express relaxed ordering bit >> + * @dev: PCI device to query >> + * >> + * If possible clear relaxed ordering > > Why "If possible"? The bit is required to be RW or hardwired to zero, > so PCI_EXP_DEVCTL_RELAX_EN should *always* be zero when this returns. > OK >> + */ >> +int pcie_clear_relaxed_ordering(struct pci_dev *dev) >> +{ ... >> >> _______________________________________________ >> linux-arm-kernel mailing list >> linux-arm-kernel@lists.infradead.org >> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel > > . >
On 2017/8/9 11:25, Bjorn Helgaas wrote: > On Tue, Aug 08, 2017 at 09:22:39PM -0500, Bjorn Helgaas wrote: >> On Sat, Aug 05, 2017 at 03:15:11PM +0800, Ding Tianhong wrote: >>> When bit4 is set in the PCIe Device Control register, it indicates > After looking at the driver, I wonder if it would be simpler like > this: > > int pcie_relaxed_ordering_enabled(struct pci_dev *dev) > { > u16 ctl; > > pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &ctl); > return ctl & PCI_EXP_DEVCTL_RELAX_EN; > } > EXPORT_SYMBOL(pcie_relaxed_ordering_enabled); > > static void pci_configure_relaxed_ordering(struct pci_dev *dev) > { > struct pci_dev *root; > > if (dev->is_virtfn) > return; /* PCI_EXP_DEVCTL_RELAX_EN is RsvdP in VFs */ > > if (!pcie_relaxed_ordering_enabled(dev)) > return; > > /* > * For now, we only deal with Relaxed Ordering issues with Root > * Ports. Peer-to-peer DMA is another can of worms. > */ > root = pci_find_pcie_root_port(dev); > if (!root) > return; > > if (root->relaxed_ordering_broken) > pcie_capability_clear_word(dev, PCI_EXP_DEVCTL, > PCI_EXP_DEVCTL_RELAX_EN); > } > > This doesn't check every intervening switch, but I don't think we know > about any issues except with root ports. > Yes > And the driver could do: > > if (!pcie_relaxed_ordering_enabled(pdev)) > adapter->flags |= ROOT_NO_RELAXED_ORDERING; > > The driver code wouldn't show anything about coherent memory vs. > peer-to-peer, but we really don't have a clue about how to handle that > yet anyway. > > I guess this is back to exactly what you proposed, except that I > changed the name of pcie_relaxed_ordering_supported() to > pcie_relaxed_ordering_enabled(), which I think is slightly more > specific from the device's point of view. > OK, looks like we reach a consensus finally, I will follow your new opinion and resend, thanks. Ding > Bjorn > > . >
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index af0cc34..4f9d7c1 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -4854,6 +4854,35 @@ int pcie_set_mps(struct pci_dev *dev, int mps) EXPORT_SYMBOL(pcie_set_mps); /** + * pcie_clear_relaxed_ordering - clear PCI Express relaxed ordering bit + * @dev: PCI device to query + * + * If possible clear relaxed ordering + */ +int pcie_clear_relaxed_ordering(struct pci_dev *dev) +{ + return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL, + PCI_EXP_DEVCTL_RELAX_EN); +} +EXPORT_SYMBOL(pcie_clear_relaxed_ordering); + +/** + * pcie_relaxed_ordering_supported - Probe for PCIe relexed ordering support + * @dev: PCI device to query + * + * Returns true if the device support relaxed ordering attribute. + */ +bool pcie_relaxed_ordering_supported(struct pci_dev *dev) +{ + u16 v; + + pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &v); + + return !!(v & PCI_EXP_DEVCTL_RELAX_EN); +} +EXPORT_SYMBOL(pcie_relaxed_ordering_supported); + +/** * pcie_get_minimum_link - determine minimum link settings of a PCI device * @dev: PCI device to query * @speed: storage for minimum speed diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index c31310d..48df012 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -1762,6 +1762,42 @@ static void pci_configure_extended_tags(struct pci_dev *dev) PCI_EXP_DEVCTL_EXT_TAG); } +/** + * pci_dev_should_disable_relaxed_ordering - check if the PCI device + * should disable the relaxed ordering attribute. + * @dev: PCI device + * + * Return true if any of the PCI devices above us do not support + * relaxed ordering. + */ +static bool pci_dev_should_disable_relaxed_ordering(struct pci_dev *dev) +{ + while (dev) { + if (dev->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING) + return true; + + dev = dev->bus->self; + } + + return false; +} + +static void pci_configure_relaxed_ordering(struct pci_dev *dev) +{ + /* We should not alter the relaxed ordering bit for the VF */ + if (dev->is_virtfn) + return; + + /* If the releaxed ordering enable bit is not set, do nothing. */ + if (!pcie_relaxed_ordering_supported(dev)) + return; + + if (pci_dev_should_disable_relaxed_ordering(dev)) { + pcie_clear_relaxed_ordering(dev); + dev_info(&dev->dev, "Disable Relaxed Ordering\n"); + } +} + static void pci_configure_device(struct pci_dev *dev) { struct hotplug_params hpp; @@ -1769,6 +1805,7 @@ static void pci_configure_device(struct pci_dev *dev) pci_configure_mps(dev); pci_configure_extended_tags(dev); + pci_configure_relaxed_ordering(dev); memset(&hpp, 0, sizeof(hpp)); ret = pci_get_hp_params(dev, &hpp); diff --git a/include/linux/pci.h b/include/linux/pci.h index 412ec1c..3aa23a2 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1127,6 +1127,8 @@ int pci_add_ext_cap_save_buffer(struct pci_dev *dev, void pci_pme_wakeup_bus(struct pci_bus *bus); void pci_d3cold_enable(struct pci_dev *dev); void pci_d3cold_disable(struct pci_dev *dev); +int pcie_clear_relaxed_ordering(struct pci_dev *dev); +bool pcie_relaxed_ordering_supported(struct pci_dev *dev); /* PCI Virtual Channel */ int pci_save_vc_state(struct pci_dev *dev);