Message ID | 758d0e431c732fe133e7b0e660bde5fc1beccdba.1493678834.git.leedom@chelsio.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
hi, Casey: On 2017/5/2 7:13, Casey Leedom wrote: > The new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING indicates that the Relaxed > Ordering Attribute should not be used on Transaction Layer Packets destined > for the PCIe End Node so flagged. Initially flagged this way are Intel > E5-26xx Root Complex Ports which suffer from a Flow Control Credit > Performance Problem and AMD A1100 ARM ("SEATTLE") Root Complex Ports which > don't obey PCIe 3.0 ordering rules which can lead to Data Corruption. > --- > drivers/pci/quirks.c | 38 ++++++++++++++++++++++++++++++++++++++ > include/linux/pci.h | 2 ++ > 2 files changed, 40 insertions(+) > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > index f754453..4ae78b3 100644 > --- a/drivers/pci/quirks.c > +++ b/drivers/pci/quirks.c > @@ -3979,6 +3979,44 @@ static void quirk_tw686x_class(struct pci_dev *pdev) > quirk_tw686x_class); > > /* > + * Some devices have problems with Transaction Layer Packets with the Relaxed > + * Ordering Attribute set. Such devices should mark themselves and other > + * Device Drivers should check before sending TLPs with RO set. > + */ > +static void quirk_relaxedordering_disable(struct pci_dev *dev) > +{ > + dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING; > +} > + > +/* > + * Intel E5-26xx Root Complex has a Flow Control Credit issue which can > + * cause performance problems with Upstream Transaction Layer Packets with > + * Relaxed Ordering set. > + */ > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f02, PCI_CLASS_NOT_DEFINED, 8, > + quirk_relaxedordering_disable); > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f04, PCI_CLASS_NOT_DEFINED, 8, > + quirk_relaxedordering_disable); > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f08, PCI_CLASS_NOT_DEFINED, 8, > + quirk_relaxedordering_disable); > + > +/* > + * The AMD ARM A1100 (AKA "SEATTLE") SoC has a bug in its PCIe Root Complex > + * where Upstream Transaction Layer Packets with the Relaxed Ordering > + * Attribute clear are allowed to bypass earlier TLPs with Relaxed Ordering > + * set. This is a violation of the PCIe 3.0 Transaction Ordering Rules > + * outlined in Section 2.4.1 (PCI Express(r) Base Specification Revision 3.0 > + * November 10, 2010). As a result, on this platform we can't use Relaxed > + * Ordering for Upstream TLPs. > + */ > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a00, PCI_CLASS_NOT_DEFINED, 8, > + quirk_relaxedordering_disable); > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a01, PCI_CLASS_NOT_DEFINED, 8, > + quirk_relaxedordering_disable); > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a02, PCI_CLASS_NOT_DEFINED, 8, > + quirk_relaxedordering_disable); > + > +/* > * Per PCIe r3.0, sec 2.2.9, "Completion headers must supply the same > * values for the Attribute as were supplied in the header of the > * corresponding Request, except as explicitly allowed when IDO is used." > diff --git a/include/linux/pci.h b/include/linux/pci.h > index eb3da1a..6764f66 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -178,6 +178,8 @@ enum pci_dev_flags { > PCI_DEV_FLAGS_NO_PM_RESET = (__force pci_dev_flags_t) (1 << 7), > /* Get VPD from function 0 VPD */ > PCI_DEV_FLAGS_VPD_REF_F0 = (__force pci_dev_flags_t) (1 << 8), > + /* Don't use Relaxed Ordering for TLPs directed at this device */ > + PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 9), > }; What about add a new general func to check the RO for several drivers to use them ? just like: #define pci_dev_support_relaxed_ordering(struct pci_dev *root) \ (!(root->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING)) Thanks Ding > > enum pci_irq_reroute_variant { >
On Mon, May 1, 2017 at 4:13 PM, Casey Leedom <leedom@chelsio.com> wrote: > The new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING indicates that the Relaxed > Ordering Attribute should not be used on Transaction Layer Packets destined > for the PCIe End Node so flagged. Initially flagged this way are Intel > E5-26xx Root Complex Ports which suffer from a Flow Control Credit > Performance Problem and AMD A1100 ARM ("SEATTLE") Root Complex Ports which > don't obey PCIe 3.0 ordering rules which can lead to Data Corruption. So this is a good first step though might I suggest one other change. We may want to add logic to the core PCIe code that clears the "Enable Relaxed Ordering" bit in the device control register for all devices hanging off of this root complex. Assuming the devices conform to the PCIe spec doing that should disable relaxed ordering in a device agnostic way that then enables us at a driver level to just enable the feature always without having to perform any checks for your flag. We could probably do that as a part of probing the PCIe interfaces hanging off of these devices. > --- > drivers/pci/quirks.c | 38 ++++++++++++++++++++++++++++++++++++++ > include/linux/pci.h | 2 ++ > 2 files changed, 40 insertions(+) > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > index f754453..4ae78b3 100644 > --- a/drivers/pci/quirks.c > +++ b/drivers/pci/quirks.c > @@ -3979,6 +3979,44 @@ static void quirk_tw686x_class(struct pci_dev *pdev) > quirk_tw686x_class); > > /* > + * Some devices have problems with Transaction Layer Packets with the Relaxed > + * Ordering Attribute set. Such devices should mark themselves and other > + * Device Drivers should check before sending TLPs with RO set. > + */ > +static void quirk_relaxedordering_disable(struct pci_dev *dev) > +{ > + dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING; > +} > + > +/* > + * Intel E5-26xx Root Complex has a Flow Control Credit issue which can > + * cause performance problems with Upstream Transaction Layer Packets with > + * Relaxed Ordering set. > + */ > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f02, PCI_CLASS_NOT_DEFINED, 8, > + quirk_relaxedordering_disable); > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f04, PCI_CLASS_NOT_DEFINED, 8, > + quirk_relaxedordering_disable); > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f08, PCI_CLASS_NOT_DEFINED, 8, > + quirk_relaxedordering_disable); > + > +/* > + * The AMD ARM A1100 (AKA "SEATTLE") SoC has a bug in its PCIe Root Complex > + * where Upstream Transaction Layer Packets with the Relaxed Ordering > + * Attribute clear are allowed to bypass earlier TLPs with Relaxed Ordering > + * set. This is a violation of the PCIe 3.0 Transaction Ordering Rules > + * outlined in Section 2.4.1 (PCI Express(r) Base Specification Revision 3.0 > + * November 10, 2010). As a result, on this platform we can't use Relaxed > + * Ordering for Upstream TLPs. > + */ > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a00, PCI_CLASS_NOT_DEFINED, 8, > + quirk_relaxedordering_disable); > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a01, PCI_CLASS_NOT_DEFINED, 8, > + quirk_relaxedordering_disable); > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a02, PCI_CLASS_NOT_DEFINED, 8, > + quirk_relaxedordering_disable); > + > +/* > * Per PCIe r3.0, sec 2.2.9, "Completion headers must supply the same > * values for the Attribute as were supplied in the header of the > * corresponding Request, except as explicitly allowed when IDO is used." > diff --git a/include/linux/pci.h b/include/linux/pci.h > index eb3da1a..6764f66 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -178,6 +178,8 @@ enum pci_dev_flags { > PCI_DEV_FLAGS_NO_PM_RESET = (__force pci_dev_flags_t) (1 << 7), > /* Get VPD from function 0 VPD */ > PCI_DEV_FLAGS_VPD_REF_F0 = (__force pci_dev_flags_t) (1 << 8), > + /* Don't use Relaxed Ordering for TLPs directed at this device */ > + PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 9), > }; > > enum pci_irq_reroute_variant { > -- > 1.9.1 >
Hi Casey On Mon, May 01, 2017 at 04:13:50PM -0700, Casey Leedom wrote: > The new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING indicates that the Relaxed > Ordering Attribute should not be used on Transaction Layer Packets destined > for the PCIe End Node so flagged. Initially flagged this way are Intel > E5-26xx Root Complex Ports which suffer from a Flow Control Credit > Performance Problem and AMD A1100 ARM ("SEATTLE") Root Complex Ports which > don't obey PCIe 3.0 ordering rules which can lead to Data Corruption. > --- > drivers/pci/quirks.c | 38 ++++++++++++++++++++++++++++++++++++++ > include/linux/pci.h | 2 ++ > 2 files changed, 40 insertions(+) > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > index f754453..4ae78b3 100644 > --- a/drivers/pci/quirks.c > +++ b/drivers/pci/quirks.c > @@ -3979,6 +3979,44 @@ static void quirk_tw686x_class(struct pci_dev *pdev) > quirk_tw686x_class); > > /* > + * Some devices have problems with Transaction Layer Packets with the Relaxed > + * Ordering Attribute set. Such devices should mark themselves and other > + * Device Drivers should check before sending TLPs with RO set. > + */ > +static void quirk_relaxedordering_disable(struct pci_dev *dev) > +{ > + dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING; > +} > + > +/* > + * Intel E5-26xx Root Complex has a Flow Control Credit issue which can > + * cause performance problems with Upstream Transaction Layer Packets with > + * Relaxed Ordering set. > + */ > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f02, PCI_CLASS_NOT_DEFINED, 8, > + quirk_relaxedordering_disable); > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f04, PCI_CLASS_NOT_DEFINED, 8, > + quirk_relaxedordering_disable); > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f08, PCI_CLASS_NOT_DEFINED, 8, > + quirk_relaxedordering_disable); > + You might want to add the RP ID's for both HSX/BDX. Tne entire range is 2F01H-2F0EH & 6F01H-6F0EH. Cheers, Ashok
On Tue, May 02, 2017 at 09:39:34AM -0700, Alexander Duyck wrote: > On Mon, May 1, 2017 at 4:13 PM, Casey Leedom <leedom@chelsio.com> wrote: > > The new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING indicates that the Relaxed > > Ordering Attribute should not be used on Transaction Layer Packets destined > > for the PCIe End Node so flagged. Initially flagged this way are Intel > > E5-26xx Root Complex Ports which suffer from a Flow Control Credit > > Performance Problem and AMD A1100 ARM ("SEATTLE") Root Complex Ports which > > don't obey PCIe 3.0 ordering rules which can lead to Data Corruption. > > So this is a good first step though might I suggest one other change. > > We may want to add logic to the core PCIe code that clears the "Enable > Relaxed Ordering" bit in the device control register for all devices > hanging off of this root complex. Assuming the devices conform to the > PCIe spec doing that should disable relaxed ordering in a device > agnostic way that then enables us at a driver level to just enable the > feature always without having to perform any checks for your flag. We > could probably do that as a part of probing the PCIe interfaces > hanging off of these devices. I suppose you don't want to turn off RO completely on the device. When traffic is targetted to mmio for peer to peer reasons RO has performance upside. The specific issue with these root ports indicate limitation using RO for traffic targetting coherent memory. > > > --- > > drivers/pci/quirks.c | 38 ++++++++++++++++++++++++++++++++++++++ > > include/linux/pci.h | 2 ++ > > 2 files changed, 40 insertions(+) > > > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > > index f754453..4ae78b3 100644 > > --- a/drivers/pci/quirks.c > > +++ b/drivers/pci/quirks.c > > @@ -3979,6 +3979,44 @@ static void quirk_tw686x_class(struct pci_dev *pdev) > > quirk_tw686x_class); > > > > /* > > + * Some devices have problems with Transaction Layer Packets with the Relaxed > > + * Ordering Attribute set. Such devices should mark themselves and other > > + * Device Drivers should check before sending TLPs with RO set. > > + */ > > +static void quirk_relaxedordering_disable(struct pci_dev *dev) > > +{ > > + dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING; > > +} > > + > > +/* > > + * Intel E5-26xx Root Complex has a Flow Control Credit issue which can > > + * cause performance problems with Upstream Transaction Layer Packets with > > + * Relaxed Ordering set. > > + */ > > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f02, PCI_CLASS_NOT_DEFINED, 8, > > + quirk_relaxedordering_disable); > > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f04, PCI_CLASS_NOT_DEFINED, 8, > > + quirk_relaxedordering_disable); > > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f08, PCI_CLASS_NOT_DEFINED, 8, > > + quirk_relaxedordering_disable); > > + > > +/* > > + * The AMD ARM A1100 (AKA "SEATTLE") SoC has a bug in its PCIe Root Complex > > + * where Upstream Transaction Layer Packets with the Relaxed Ordering > > + * Attribute clear are allowed to bypass earlier TLPs with Relaxed Ordering > > + * set. This is a violation of the PCIe 3.0 Transaction Ordering Rules > > + * outlined in Section 2.4.1 (PCI Express(r) Base Specification Revision 3.0 > > + * November 10, 2010). As a result, on this platform we can't use Relaxed > > + * Ordering for Upstream TLPs. > > + */ > > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a00, PCI_CLASS_NOT_DEFINED, 8, > > + quirk_relaxedordering_disable); > > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a01, PCI_CLASS_NOT_DEFINED, 8, > > + quirk_relaxedordering_disable); > > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a02, PCI_CLASS_NOT_DEFINED, 8, > > + quirk_relaxedordering_disable); > > + > > +/* > > * Per PCIe r3.0, sec 2.2.9, "Completion headers must supply the same > > * values for the Attribute as were supplied in the header of the > > * corresponding Request, except as explicitly allowed when IDO is used." > > diff --git a/include/linux/pci.h b/include/linux/pci.h > > index eb3da1a..6764f66 100644 > > --- a/include/linux/pci.h > > +++ b/include/linux/pci.h > > @@ -178,6 +178,8 @@ enum pci_dev_flags { > > PCI_DEV_FLAGS_NO_PM_RESET = (__force pci_dev_flags_t) (1 << 7), > > /* Get VPD from function 0 VPD */ > > PCI_DEV_FLAGS_VPD_REF_F0 = (__force pci_dev_flags_t) (1 << 8), > > + /* Don't use Relaxed Ordering for TLPs directed at this device */ > > + PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 9), > > }; > > > > enum pci_irq_reroute_variant { > > -- > > 1.9.1 > >
On Tue, May 2, 2017 at 9:53 AM, Raj, Ashok <ashok.raj@intel.com> wrote: > On Tue, May 02, 2017 at 09:39:34AM -0700, Alexander Duyck wrote: >> On Mon, May 1, 2017 at 4:13 PM, Casey Leedom <leedom@chelsio.com> wrote: >> > The new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING indicates that the Relaxed >> > Ordering Attribute should not be used on Transaction Layer Packets destined >> > for the PCIe End Node so flagged. Initially flagged this way are Intel >> > E5-26xx Root Complex Ports which suffer from a Flow Control Credit >> > Performance Problem and AMD A1100 ARM ("SEATTLE") Root Complex Ports which >> > don't obey PCIe 3.0 ordering rules which can lead to Data Corruption. >> >> So this is a good first step though might I suggest one other change. >> >> We may want to add logic to the core PCIe code that clears the "Enable >> Relaxed Ordering" bit in the device control register for all devices >> hanging off of this root complex. Assuming the devices conform to the >> PCIe spec doing that should disable relaxed ordering in a device >> agnostic way that then enables us at a driver level to just enable the >> feature always without having to perform any checks for your flag. We >> could probably do that as a part of probing the PCIe interfaces >> hanging off of these devices. > > I suppose you don't want to turn off RO completely on the device. When > traffic is targetted to mmio for peer to peer reasons RO has performance > upside. The specific issue with these root ports indicate limitation using > RO for traffic targetting coherent memory. Actually my main concern here is virtualization. If I take the PCIe function and direct assign it I have no way of seeing the root complex flag as it is now virtualized away. In the meantime the guest now has the ability to enable the function and sees nothing that says you can't enable relaxed ordering which in turn ends up potentially causing data corruption on the system. I want relaxed ordering disabled before I even consider assigning it to the guest on the systems where this would be an issue. I prefer to err on the side of caution with this. Enabling Relaxed Ordering is technically a performance enhancement, so we function but not as well as we would like, while having it enabled when there are issues can lead to data corruption. I would weigh the risk of data corruption the thing to be avoided and of much higher priority than enabling improved performance. As such I say we should default the relaxed ordering attribute to off in general and look at "white-listing" it in for various architectures and/or chipsets that support/need it rather than having it enabled by default and trying to switch it off after the fact when we find some new issue. So for example, in the case of x86 it seems like there are multiple root complexes that have issues, and the gains for enabling it with standard DMA to host memory are small. As such we may want to default it to off via the architecture specific PCIe code and then look at having "white-list" cases where we enable it for things like peer-to-peer accesses. In the case of SPARC we could look at defaulting it to on, and only "black-list" any cases where there might be issues since SPARC relies on this in a significant way for performance. In the case of ARM and other architectures it is a bit of a toss-up. I would say we could just default it to on for now and "black-list" anything that doesn't work, or we could go the other way like I suggested for x86. It all depends on what the ARM community would want to agree on for this. I would say unless it makes a significant difference like it does in the case of SPARC we are probably better off just defaulting it to off. - Alex
| From: Alexander Duyck <alexander.duyck@gmail.com> | Date: Tuesday, May 2, 2017 11:10 AM | ... | So for example, in the case of x86 it seems like there are multiple | root complexes that have issues, and the gains for enabling it with | standard DMA to host memory are small. As such we may want to default | it to off via the architecture specific PCIe code and then look at | having "white-list" cases where we enable it for things like | peer-to-peer accesses. In the case of SPARC we could look at | defaulting it to on, and only "black-list" any cases where there might | be issues since SPARC relies on this in a significant way for | performance. In the case of ARM and other architectures it is a bit of | a toss-up. I would say we could just default it to on for now and | "black-list" anything that doesn't work, or we could go the other way | like I suggested for x86. It all depends on what the ARM community | would want to agree on for this. I would say unless it makes a | significant difference like it does in the case of SPARC we are | probably better off just defaulting it to off. Sorry, I forgot to respond to this earlier when someone was rushing me out to a customer dinner. I'm unaware of any other x86 Root Complex Port that has a problem with Relaxed Ordering other than the performance issue with the current Intel implementation. Ashok tells me that Intel is in the final stages of putting together a technical notice on this issue but I don't know when that will come out. Hopefully that will shed much more light on the cause and actual use of Relaxed Ordering when directed to Coherent Memory on current and past Intel platforms. (Note that the performance bug seems to limit us to ~75-85Gb/s DMA Write speed to Coherent Host Memory.) The only other Device that I currently know of which has issues with Relaxed Ordering TLPs directed at it, is also a Root Complex Port on the new AMD A1100 ARM SoC ("SEATTLE"). There we have an actual bug which could lead to Data Corruption. But I don't know anything about other x86 Root Complex Ports having issues with this -- we've been using it ever since commit ef306b50b from December 2010. Also, I'm not aware of any performance data which has been gathered on the use of Relaxed Ordering when directed at Host Memory. From your note, it sounds like it's important on SPARC architectures. But it could conceivably be important on any architecture or Root Complex/Memory Controller implementation. We use it to direct Ingress Packet Data to Free List Buffers, followed by a TLP without Relaxed Ordering directed at a Host Memory Message Queue. The idea here is to give the Root Complex options on which DMA Memory Write TLPs to process in order to optimize data placement in memory. And with the next Ethernet speed bump to 200Gb/s this may be even more important. Basically, I'd hate to come up with a solution where we write off the use of Relaxed Ordering for DMA Writes to Host Memory. I don't think you're suggesting that, but there are a number of x86 Root Complex implementations out there -- and some like the new AMD Ryzen have yet to be tested -- as well as other architectures. And, as Ashok and I have both nothed, any solution we come up with needs to cope with a heterogeneous situation where, on the same PCIe Fabric, it may be necessesary/desireable to support Relaxed Ordering TLPs directed at some End Nodes but not others. Casey
On Tue, May 2, 2017 at 9:30 PM, Casey Leedom <leedom@chelsio.com> wrote: > | From: Alexander Duyck <alexander.duyck@gmail.com> > | Date: Tuesday, May 2, 2017 11:10 AM > | ... > | So for example, in the case of x86 it seems like there are multiple > | root complexes that have issues, and the gains for enabling it with > | standard DMA to host memory are small. As such we may want to default > | it to off via the architecture specific PCIe code and then look at > | having "white-list" cases where we enable it for things like > | peer-to-peer accesses. In the case of SPARC we could look at > | defaulting it to on, and only "black-list" any cases where there might > | be issues since SPARC relies on this in a significant way for > | performance. In the case of ARM and other architectures it is a bit of > | a toss-up. I would say we could just default it to on for now and > | "black-list" anything that doesn't work, or we could go the other way > | like I suggested for x86. It all depends on what the ARM community > | would want to agree on for this. I would say unless it makes a > | significant difference like it does in the case of SPARC we are > | probably better off just defaulting it to off. > > Sorry, I forgot to respond to this earlier when someone was rushing me out > to a customer dinner. > > I'm unaware of any other x86 Root Complex Port that has a problem with > Relaxed Ordering other than the performance issue with the current Intel > implementation. Ashok tells me that Intel is in the final stages of putting > together a technical notice on this issue but I don't know when that will > come out. Hopefully that will shed much more light on the cause and actual > use of Relaxed Ordering when directed to Coherent Memory on current and past > Intel platforms. (Note that the performance bug seems to limit us to > ~75-85Gb/s DMA Write speed to Coherent Host Memory.) So my concern isn't so much about existing issues as it is about where is the advantage in enabling it. We have had support in the Intel hardware for enabling relaxed ordering for about 10 years. In all that time I have yet to see an x86 platform that sees any real benefit from enabling it for standard DMA. That is why my preference would be to leave it disabled by default on x86 and we white list it in at some point when hardware shows that there is a benefit to be had for enabling it. > The only other Device that I currently know of which has issues with > Relaxed Ordering TLPs directed at it, is also a Root Complex Port on the new > AMD A1100 ARM SoC ("SEATTLE"). There we have an actual bug which could lead > to Data Corruption. > > But I don't know anything about other x86 Root Complex Ports having issues > with this -- we've been using it ever since commit ef306b50b from December > 2010. So the question I would have for you then, is what benefits have you seen from enabling it on x86? In our case we haven't seen any for transactions that go through the root complex. If you are seeing benefits would I be correct in assuming it is for your peer-to-peer case or were there some x86 platforms that showed gains? > Also, I'm not aware of any performance data which has been gathered on the > use of Relaxed Ordering when directed at Host Memory. From your note, it > sounds like it's important on SPARC architectures. But it could conceivably > be important on any architecture or Root Complex/Memory Controller > implementation. We use it to direct Ingress Packet Data to Free List > Buffers, followed by a TLP without Relaxed Ordering directed at a Host > Memory Message Queue. The idea here is to give the Root Complex options on > which DMA Memory Write TLPs to process in order to optimize data placement > in memory. And with the next Ethernet speed bump to 200Gb/s this may be > even more important. The operative term here is "may be". That is my concern. We are leaving this enabled by default and there are known hardware that have issues that can be pretty serious. I'm not saying we have to disable it and keep it disabled, but I would like to see us intelligently enable this feature so that it is enabled on the platforms that show benefit and disabled on the ones that don't. My biggest concern with all this is introducing regressions as drivers like igb and ixgbe are used on a wide range of platforms beyond even what is covered by x86 and I would prefer not to suddenly have a deluge of bugs to sort out triggered by us enabling relaxed ordering on platforms that historically have not had it enabled on them. > Basically, I'd hate to come up with a solution where we write off the use > of Relaxed Ordering for DMA Writes to Host Memory. I don't think you're > suggesting that, but there are a number of x86 Root Complex implementations > out there -- and some like the new AMD Ryzen have yet to be tested -- as > well as other architectures. I'm not saying that we have to write off the use of Relaxed Ordering, what I am saying is that we should probably be more judicious about how we go about enabling it. If a platform and/or architecture has no benefit to enabling it what is the point in adding the possible risk? Hopefully the AMD Ryzen platform has already been tested and doesn't need a quirk to disable relaxed ordering. Really it shouldn't fall on the likes of us to be testing for those kind of things. > And, as Ashok and I have both nothed, any solution we come up with needs > to cope with a heterogeneous situation where, on the same PCIe Fabric, it > may be necessesary/desireable to support Relaxed Ordering TLPs directed at > some End Nodes but not others. > > Casey It sounds like we are more or less in agreement. My only concern is really what we default this to. On x86 I would say we could probably default this to disabled for existing platforms since my understanding is that relaxed ordering doesn't provide much benefit on what is out there right now when performing DMA through the root complex. As far as peer-to-peer I would say we should probably look at enabling the ability to have Relaxed Ordering enabled for some channels but not others. In those cases the hardware needs to be smart enough to allow for you to indicate you want it disabled by default for most of your DMA channels, and then enabled for the select channels that are handling the peer-to-peer traffic. - Alex
| From: Alexander Duyck <alexander.duyck@gmail.com> | Sent: Wednesday, May 3, 2017 9:02 AM | ... | It sounds like we are more or less in agreement. My only concern is | really what we default this to. On x86 I would say we could probably | default this to disabled for existing platforms since my understanding | is that relaxed ordering doesn't provide much benefit on what is out | there right now when performing DMA through the root complex. As far | as peer-to-peer I would say we should probably look at enabling the | ability to have Relaxed Ordering enabled for some channels but not | others. In those cases the hardware needs to be smart enough to allow | for you to indicate you want it disabled by default for most of your | DMA channels, and then enabled for the select channels that are | handling the peer-to-peer traffic. Yes, I think that we are mostly in agreement. I had just wanted to make sure that whatever scheme was developed would allow for simultaneously supporting non-Relaxed Ordering for some PCIe End Points and Relaxed Ordering for others within the same system. I.e. not simply enabling/disabling/etc. based solely on System Platform Architecture. By the way, I've started our QA folks off looking at what things look like in Linux Virtual Machines under different Hypervisors to see what information they may provide to the VM in the way of what Root Complex Port is being used, etc. So far they've got Windows HyperV done and there there's no PCIe Fabric exposed in any way: just the attached device. I'll have to see what pci_find_pcie_root_port() returns in that environment. Maybe NULL? With your reservations (which I also share), I think that it probably makes sense to have a per-architecture definition of the "Can I Use Relaxed Ordering With TLPs Directed At This End Point" predicate, with the default being "No" for any architecture which doesn't implement the predicate. And if the specified (struct pci_dev *) End Node is NULL, it ought to return False for that as well. I can't see any reason to pass in the Source End Node but I may be missing something. At this point, this is pretty far outside my level of expertise. I'm happy to give it a go, but I'd be even happier if someone with a lot more experience in the PCIe Infrastructure were to want to carry the ball forward. I'm not super familiar with the Linux Kernel "Rules Of Engagement", so let me know what my next step should be. Thanks. Casey
On Thu, May 4, 2017 at 2:01 PM, Casey Leedom <leedom@chelsio.com> wrote: > | From: Alexander Duyck <alexander.duyck@gmail.com> > | Sent: Wednesday, May 3, 2017 9:02 AM > | ... > | It sounds like we are more or less in agreement. My only concern is > | really what we default this to. On x86 I would say we could probably > | default this to disabled for existing platforms since my understanding > | is that relaxed ordering doesn't provide much benefit on what is out > | there right now when performing DMA through the root complex. As far > | as peer-to-peer I would say we should probably look at enabling the > | ability to have Relaxed Ordering enabled for some channels but not > | others. In those cases the hardware needs to be smart enough to allow > | for you to indicate you want it disabled by default for most of your > | DMA channels, and then enabled for the select channels that are > | handling the peer-to-peer traffic. > > Yes, I think that we are mostly in agreement. I had just wanted to make > sure that whatever scheme was developed would allow for simultaneously > supporting non-Relaxed Ordering for some PCIe End Points and Relaxed > Ordering for others within the same system. I.e. not simply > enabling/disabling/etc. based solely on System Platform Architecture. > > By the way, I've started our QA folks off looking at what things look like > in Linux Virtual Machines under different Hypervisors to see what > information they may provide to the VM in the way of what Root Complex Port > is being used, etc. So far they've got Windows HyperV done and there > there's no PCIe Fabric exposed in any way: just the attached device. I'll > have to see what pci_find_pcie_root_port() returns in that environment. > Maybe NULL? I believe NULL is one of the options. It all depends on what qemu is emulating. Most likely you won't find a pcie root port on KVM because the default is to emulate an older system that only supports PCI. > With your reservations (which I also share), I think that it probably > makes sense to have a per-architecture definition of the "Can I Use Relaxed > Ordering With TLPs Directed At This End Point" predicate, with the default > being "No" for any architecture which doesn't implement the predicate. And > if the specified (struct pci_dev *) End Node is NULL, it ought to return > False for that as well. I can't see any reason to pass in the Source End > Node but I may be missing something. > > At this point, this is pretty far outside my level of expertise. I'm > happy to give it a go, but I'd be even happier if someone with a lot more > experience in the PCIe Infrastructure were to want to carry the ball > forward. I'm not super familiar with the Linux Kernel "Rules Of > Engagement", so let me know what my next step should be. Thanks. > > Casey For now we can probably keep this on the linux-pci mailing list. Going that route is the most straight forward for now since step one is probably just making sure we are setting the relaxed ordering bit in the setups that make sense. I would say we could probably keep it simple. We just need to enable relaxed ordering by default for SPARC architectures, on most others we can probably default it to off. I believe this all had started as Ding Tianhong was hoping to enable this for the ARM architecture. That is the only one I can think of where it might be difficult to figure out which way to default as we were attempting to follow the same code that was enabled for SPARC and that is what started this tug-of-war about how this should be done. What we might do is take care of this in two phases. The first one enables the infrastructure generically but leaves it defaulted to off for everyone but SPARC. Then we can go through and start enabling it for other platforms such as some of those on ARM in the platforms that Ding Tianhong was working with. - Alex
On 2017/5/5 22:04, Alexander Duyck wrote: > On Thu, May 4, 2017 at 2:01 PM, Casey Leedom <leedom@chelsio.com> wrote: >> | From: Alexander Duyck <alexander.duyck@gmail.com> >> | Sent: Wednesday, May 3, 2017 9:02 AM >> | ... >> | It sounds like we are more or less in agreement. My only concern is >> | really what we default this to. On x86 I would say we could probably >> | default this to disabled for existing platforms since my understanding >> | is that relaxed ordering doesn't provide much benefit on what is out >> | there right now when performing DMA through the root complex. As far >> | as peer-to-peer I would say we should probably look at enabling the >> | ability to have Relaxed Ordering enabled for some channels but not >> | others. In those cases the hardware needs to be smart enough to allow >> | for you to indicate you want it disabled by default for most of your >> | DMA channels, and then enabled for the select channels that are >> | handling the peer-to-peer traffic. >> >> Yes, I think that we are mostly in agreement. I had just wanted to make >> sure that whatever scheme was developed would allow for simultaneously >> supporting non-Relaxed Ordering for some PCIe End Points and Relaxed >> Ordering for others within the same system. I.e. not simply >> enabling/disabling/etc. based solely on System Platform Architecture. >> >> By the way, I've started our QA folks off looking at what things look like >> in Linux Virtual Machines under different Hypervisors to see what >> information they may provide to the VM in the way of what Root Complex Port >> is being used, etc. So far they've got Windows HyperV done and there >> there's no PCIe Fabric exposed in any way: just the attached device. I'll >> have to see what pci_find_pcie_root_port() returns in that environment. >> Maybe NULL? > > I believe NULL is one of the options. It all depends on what qemu is > emulating. Most likely you won't find a pcie root port on KVM because > the default is to emulate an older system that only supports PCI. > >> With your reservations (which I also share), I think that it probably >> makes sense to have a per-architecture definition of the "Can I Use Relaxed >> Ordering With TLPs Directed At This End Point" predicate, with the default >> being "No" for any architecture which doesn't implement the predicate. And >> if the specified (struct pci_dev *) End Node is NULL, it ought to return >> False for that as well. I can't see any reason to pass in the Source End >> Node but I may be missing something. >> >> At this point, this is pretty far outside my level of expertise. I'm >> happy to give it a go, but I'd be even happier if someone with a lot more >> experience in the PCIe Infrastructure were to want to carry the ball >> forward. I'm not super familiar with the Linux Kernel "Rules Of >> Engagement", so let me know what my next step should be. Thanks. >> >> Casey > > For now we can probably keep this on the linux-pci mailing list. Going > that route is the most straight forward for now since step one is > probably just making sure we are setting the relaxed ordering bit in > the setups that make sense. I would say we could probably keep it > simple. We just need to enable relaxed ordering by default for SPARC > architectures, on most others we can probably default it to off. > Casey, Alexander: Thanks for the wonderful discussion, it is more clearly that what to do next, I agree that enable relaxed ordering by default only for SPARC and ARM64 is more safe for all the other platform, as no one want to break anything. > I believe this all had started as Ding Tianhong was hoping to enable > this for the ARM architecture. That is the only one I can think of > where it might be difficult to figure out which way to default as we > were attempting to follow the same code that was enabled for SPARC and > that is what started this tug-of-war about how this should be done. > What we might do is take care of this in two phases. The first one > enables the infrastructure generically but leaves it defaulted to off > for everyone but SPARC. Then we can go through and start enabling it > for other platforms such as some of those on ARM in the platforms that > Ding Tianhong was working with. > According the suggestion, I could only think of this code: @@ -3979,6 +3979,15 @@ static void quirk_tw686x_class(struct pci_dev *pdev) DECLARE_PCI_FIXUP_CLASS_EARLY(0x1797, 0x6869, PCI_CLASS_NOT_DEFINED, 8, quirk_tw686x_class); +static void quirk_relaxedordering_disable(struct pci_dev *dev) +{ + if (dev->vendor != PCI_VENDOR_ID_HUAWEI && + dev->vendor != PCI_VENDOR_ID_SUN) + dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING; +} +DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_INTEL_ID, PCI_ANY_ID, PCI_CLASS_NOT_DEFINED, 8, + quirk_relaxedordering_disable); + /* * Per PCIe r3.0, sec 2.2.9, "Completion headers must supply the same * values for the Attribute as were supplied in the header of the What do you think of it? Thanks Ding > - Alex > > . >
On Fri, May 5, 2017 at 8:08 PM, Ding Tianhong <dingtianhong@huawei.com> wrote: > > > On 2017/5/5 22:04, Alexander Duyck wrote: >> On Thu, May 4, 2017 at 2:01 PM, Casey Leedom <leedom@chelsio.com> wrote: >>> | From: Alexander Duyck <alexander.duyck@gmail.com> >>> | Sent: Wednesday, May 3, 2017 9:02 AM >>> | ... >>> | It sounds like we are more or less in agreement. My only concern is >>> | really what we default this to. On x86 I would say we could probably >>> | default this to disabled for existing platforms since my understanding >>> | is that relaxed ordering doesn't provide much benefit on what is out >>> | there right now when performing DMA through the root complex. As far >>> | as peer-to-peer I would say we should probably look at enabling the >>> | ability to have Relaxed Ordering enabled for some channels but not >>> | others. In those cases the hardware needs to be smart enough to allow >>> | for you to indicate you want it disabled by default for most of your >>> | DMA channels, and then enabled for the select channels that are >>> | handling the peer-to-peer traffic. >>> >>> Yes, I think that we are mostly in agreement. I had just wanted to make >>> sure that whatever scheme was developed would allow for simultaneously >>> supporting non-Relaxed Ordering for some PCIe End Points and Relaxed >>> Ordering for others within the same system. I.e. not simply >>> enabling/disabling/etc. based solely on System Platform Architecture. >>> >>> By the way, I've started our QA folks off looking at what things look like >>> in Linux Virtual Machines under different Hypervisors to see what >>> information they may provide to the VM in the way of what Root Complex Port >>> is being used, etc. So far they've got Windows HyperV done and there >>> there's no PCIe Fabric exposed in any way: just the attached device. I'll >>> have to see what pci_find_pcie_root_port() returns in that environment. >>> Maybe NULL? >> >> I believe NULL is one of the options. It all depends on what qemu is >> emulating. Most likely you won't find a pcie root port on KVM because >> the default is to emulate an older system that only supports PCI. >> >>> With your reservations (which I also share), I think that it probably >>> makes sense to have a per-architecture definition of the "Can I Use Relaxed >>> Ordering With TLPs Directed At This End Point" predicate, with the default >>> being "No" for any architecture which doesn't implement the predicate. And >>> if the specified (struct pci_dev *) End Node is NULL, it ought to return >>> False for that as well. I can't see any reason to pass in the Source End >>> Node but I may be missing something. >>> >>> At this point, this is pretty far outside my level of expertise. I'm >>> happy to give it a go, but I'd be even happier if someone with a lot more >>> experience in the PCIe Infrastructure were to want to carry the ball >>> forward. I'm not super familiar with the Linux Kernel "Rules Of >>> Engagement", so let me know what my next step should be. Thanks. >>> >>> Casey >> >> For now we can probably keep this on the linux-pci mailing list. Going >> that route is the most straight forward for now since step one is >> probably just making sure we are setting the relaxed ordering bit in >> the setups that make sense. I would say we could probably keep it >> simple. We just need to enable relaxed ordering by default for SPARC >> architectures, on most others we can probably default it to off. >> > > Casey, Alexander: > > Thanks for the wonderful discussion, it is more clearly that what to do next, > I agree that enable relaxed ordering by default only for SPARC and ARM64 > is more safe for all the other platform, as no one want to break anything. > >> I believe this all had started as Ding Tianhong was hoping to enable >> this for the ARM architecture. That is the only one I can think of >> where it might be difficult to figure out which way to default as we >> were attempting to follow the same code that was enabled for SPARC and >> that is what started this tug-of-war about how this should be done. >> What we might do is take care of this in two phases. The first one >> enables the infrastructure generically but leaves it defaulted to off >> for everyone but SPARC. Then we can go through and start enabling it >> for other platforms such as some of those on ARM in the platforms that >> Ding Tianhong was working with. >> > > According the suggestion, I could only think of this code: > > @@ -3979,6 +3979,15 @@ static void quirk_tw686x_class(struct pci_dev *pdev) > DECLARE_PCI_FIXUP_CLASS_EARLY(0x1797, 0x6869, PCI_CLASS_NOT_DEFINED, 8, > quirk_tw686x_class); > > +static void quirk_relaxedordering_disable(struct pci_dev *dev) > +{ > + if (dev->vendor != PCI_VENDOR_ID_HUAWEI && > + dev->vendor != PCI_VENDOR_ID_SUN) > + dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING; > +} > +DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_INTEL_ID, PCI_ANY_ID, PCI_CLASS_NOT_DEFINED, 8, > + quirk_relaxedordering_disable); > + > /* > * Per PCIe r3.0, sec 2.2.9, "Completion headers must supply the same > * values for the Attribute as were supplied in the header of the > > > What do you think of it? > > Thanks > Ding > This is a bit simplistic but it is a start. The other bit I was getting at is that we need to update the core PCIe code so that when we configure devices and the root complex reports no support for relaxed ordering it should be clearing the relaxed ordering bits in the PCIe configuration registers on the upstream facing devices. The last bit we need in all this is a way to allow for setups where peer-to-peer wants to perform relaxed ordering but for writes to the host we have to not use relaxed ordering. For that we need to enable a special case and that isn't handled right now in any of the solutions we have coded up so far. - Alex
On 2017/5/7 2:07, Alexander Duyck wrote: > On Fri, May 5, 2017 at 8:08 PM, Ding Tianhong <dingtianhong@huawei.com> wrote: >> >> >> On 2017/5/5 22:04, Alexander Duyck wrote: >>> On Thu, May 4, 2017 at 2:01 PM, Casey Leedom <leedom@chelsio.com> wrote: >>>> | From: Alexander Duyck <alexander.duyck@gmail.com> >>>> | Sent: Wednesday, May 3, 2017 9:02 AM >>>> | ... >>>> | It sounds like we are more or less in agreement. My only concern is >>>> | really what we default this to. On x86 I would say we could probably >>>> | default this to disabled for existing platforms since my understanding >>>> | is that relaxed ordering doesn't provide much benefit on what is out >>>> | there right now when performing DMA through the root complex. As far >>>> | as peer-to-peer I would say we should probably look at enabling the >>>> | ability to have Relaxed Ordering enabled for some channels but not >>>> | others. In those cases the hardware needs to be smart enough to allow >>>> | for you to indicate you want it disabled by default for most of your >>>> | DMA channels, and then enabled for the select channels that are >>>> | handling the peer-to-peer traffic. >>>> >>>> Yes, I think that we are mostly in agreement. I had just wanted to make >>>> sure that whatever scheme was developed would allow for simultaneously >>>> supporting non-Relaxed Ordering for some PCIe End Points and Relaxed >>>> Ordering for others within the same system. I.e. not simply >>>> enabling/disabling/etc. based solely on System Platform Architecture. >>>> >>>> By the way, I've started our QA folks off looking at what things look like >>>> in Linux Virtual Machines under different Hypervisors to see what >>>> information they may provide to the VM in the way of what Root Complex Port >>>> is being used, etc. So far they've got Windows HyperV done and there >>>> there's no PCIe Fabric exposed in any way: just the attached device. I'll >>>> have to see what pci_find_pcie_root_port() returns in that environment. >>>> Maybe NULL? >>> >>> I believe NULL is one of the options. It all depends on what qemu is >>> emulating. Most likely you won't find a pcie root port on KVM because >>> the default is to emulate an older system that only supports PCI. >>> >>>> With your reservations (which I also share), I think that it probably >>>> makes sense to have a per-architecture definition of the "Can I Use Relaxed >>>> Ordering With TLPs Directed At This End Point" predicate, with the default >>>> being "No" for any architecture which doesn't implement the predicate. And >>>> if the specified (struct pci_dev *) End Node is NULL, it ought to return >>>> False for that as well. I can't see any reason to pass in the Source End >>>> Node but I may be missing something. >>>> >>>> At this point, this is pretty far outside my level of expertise. I'm >>>> happy to give it a go, but I'd be even happier if someone with a lot more >>>> experience in the PCIe Infrastructure were to want to carry the ball >>>> forward. I'm not super familiar with the Linux Kernel "Rules Of >>>> Engagement", so let me know what my next step should be. Thanks. >>>> >>>> Casey >>> >>> For now we can probably keep this on the linux-pci mailing list. Going >>> that route is the most straight forward for now since step one is >>> probably just making sure we are setting the relaxed ordering bit in >>> the setups that make sense. I would say we could probably keep it >>> simple. We just need to enable relaxed ordering by default for SPARC >>> architectures, on most others we can probably default it to off. >>> >> >> Casey, Alexander: >> >> Thanks for the wonderful discussion, it is more clearly that what to do next, >> I agree that enable relaxed ordering by default only for SPARC and ARM64 >> is more safe for all the other platform, as no one want to break anything. >> >>> I believe this all had started as Ding Tianhong was hoping to enable >>> this for the ARM architecture. That is the only one I can think of >>> where it might be difficult to figure out which way to default as we >>> were attempting to follow the same code that was enabled for SPARC and >>> that is what started this tug-of-war about how this should be done. >>> What we might do is take care of this in two phases. The first one >>> enables the infrastructure generically but leaves it defaulted to off >>> for everyone but SPARC. Then we can go through and start enabling it >>> for other platforms such as some of those on ARM in the platforms that >>> Ding Tianhong was working with. >>> >> >> According the suggestion, I could only think of this code: >> >> @@ -3979,6 +3979,15 @@ static void quirk_tw686x_class(struct pci_dev *pdev) >> DECLARE_PCI_FIXUP_CLASS_EARLY(0x1797, 0x6869, PCI_CLASS_NOT_DEFINED, 8, >> quirk_tw686x_class); >> >> +static void quirk_relaxedordering_disable(struct pci_dev *dev) >> +{ >> + if (dev->vendor != PCI_VENDOR_ID_HUAWEI && >> + dev->vendor != PCI_VENDOR_ID_SUN) >> + dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING; >> +} >> +DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_INTEL_ID, PCI_ANY_ID, PCI_CLASS_NOT_DEFINED, 8, >> + quirk_relaxedordering_disable); >> + >> /* >> * Per PCIe r3.0, sec 2.2.9, "Completion headers must supply the same >> * values for the Attribute as were supplied in the header of the >> >> >> What do you think of it? >> >> Thanks >> Ding >> > > This is a bit simplistic but it is a start. > > The other bit I was getting at is that we need to update the core PCIe > code so that when we configure devices and the root complex reports no > support for relaxed ordering it should be clearing the relaxed > ordering bits in the PCIe configuration registers on the upstream > facing devices. How about this: rename the PCI_DEV_FLAGS_NO_RELAXED_ORDERIN to PCI_DEV_FLAGS_RELAXED_ORDERIN, only enable it when pcie root configure if it support the RO mode, otherwise we will not set it to indicate that the pcie dev did not support RO mode. > > The last bit we need in all this is a way to allow for setups where > peer-to-peer wants to perform relaxed ordering but for writes to the > host we have to not use relaxed ordering. For that we need to enable a > special case and that isn't handled right now in any of the solutions > we have coded up so far. > Sorry I am not clear of this way, can you explain more about this or give me a special case, thanks a lot. Ding > - Alex > > . >
On Mon, May 8, 2017 at 7:33 AM, Ding Tianhong <dingtianhong@huawei.com> wrote: > > > On 2017/5/7 2:07, Alexander Duyck wrote: >> On Fri, May 5, 2017 at 8:08 PM, Ding Tianhong <dingtianhong@huawei.com> wrote: >>> >>> >>> On 2017/5/5 22:04, Alexander Duyck wrote: >>>> On Thu, May 4, 2017 at 2:01 PM, Casey Leedom <leedom@chelsio.com> wrote: >>>>> | From: Alexander Duyck <alexander.duyck@gmail.com> >>>>> | Sent: Wednesday, May 3, 2017 9:02 AM >>>>> | ... >>>>> | It sounds like we are more or less in agreement. My only concern is >>>>> | really what we default this to. On x86 I would say we could probably >>>>> | default this to disabled for existing platforms since my understanding >>>>> | is that relaxed ordering doesn't provide much benefit on what is out >>>>> | there right now when performing DMA through the root complex. As far >>>>> | as peer-to-peer I would say we should probably look at enabling the >>>>> | ability to have Relaxed Ordering enabled for some channels but not >>>>> | others. In those cases the hardware needs to be smart enough to allow >>>>> | for you to indicate you want it disabled by default for most of your >>>>> | DMA channels, and then enabled for the select channels that are >>>>> | handling the peer-to-peer traffic. >>>>> >>>>> Yes, I think that we are mostly in agreement. I had just wanted to make >>>>> sure that whatever scheme was developed would allow for simultaneously >>>>> supporting non-Relaxed Ordering for some PCIe End Points and Relaxed >>>>> Ordering for others within the same system. I.e. not simply >>>>> enabling/disabling/etc. based solely on System Platform Architecture. >>>>> >>>>> By the way, I've started our QA folks off looking at what things look like >>>>> in Linux Virtual Machines under different Hypervisors to see what >>>>> information they may provide to the VM in the way of what Root Complex Port >>>>> is being used, etc. So far they've got Windows HyperV done and there >>>>> there's no PCIe Fabric exposed in any way: just the attached device. I'll >>>>> have to see what pci_find_pcie_root_port() returns in that environment. >>>>> Maybe NULL? >>>> >>>> I believe NULL is one of the options. It all depends on what qemu is >>>> emulating. Most likely you won't find a pcie root port on KVM because >>>> the default is to emulate an older system that only supports PCI. >>>> >>>>> With your reservations (which I also share), I think that it probably >>>>> makes sense to have a per-architecture definition of the "Can I Use Relaxed >>>>> Ordering With TLPs Directed At This End Point" predicate, with the default >>>>> being "No" for any architecture which doesn't implement the predicate. And >>>>> if the specified (struct pci_dev *) End Node is NULL, it ought to return >>>>> False for that as well. I can't see any reason to pass in the Source End >>>>> Node but I may be missing something. >>>>> >>>>> At this point, this is pretty far outside my level of expertise. I'm >>>>> happy to give it a go, but I'd be even happier if someone with a lot more >>>>> experience in the PCIe Infrastructure were to want to carry the ball >>>>> forward. I'm not super familiar with the Linux Kernel "Rules Of >>>>> Engagement", so let me know what my next step should be. Thanks. >>>>> >>>>> Casey >>>> >>>> For now we can probably keep this on the linux-pci mailing list. Going >>>> that route is the most straight forward for now since step one is >>>> probably just making sure we are setting the relaxed ordering bit in >>>> the setups that make sense. I would say we could probably keep it >>>> simple. We just need to enable relaxed ordering by default for SPARC >>>> architectures, on most others we can probably default it to off. >>>> >>> >>> Casey, Alexander: >>> >>> Thanks for the wonderful discussion, it is more clearly that what to do next, >>> I agree that enable relaxed ordering by default only for SPARC and ARM64 >>> is more safe for all the other platform, as no one want to break anything. >>> >>>> I believe this all had started as Ding Tianhong was hoping to enable >>>> this for the ARM architecture. That is the only one I can think of >>>> where it might be difficult to figure out which way to default as we >>>> were attempting to follow the same code that was enabled for SPARC and >>>> that is what started this tug-of-war about how this should be done. >>>> What we might do is take care of this in two phases. The first one >>>> enables the infrastructure generically but leaves it defaulted to off >>>> for everyone but SPARC. Then we can go through and start enabling it >>>> for other platforms such as some of those on ARM in the platforms that >>>> Ding Tianhong was working with. >>>> >>> >>> According the suggestion, I could only think of this code: >>> >>> @@ -3979,6 +3979,15 @@ static void quirk_tw686x_class(struct pci_dev *pdev) >>> DECLARE_PCI_FIXUP_CLASS_EARLY(0x1797, 0x6869, PCI_CLASS_NOT_DEFINED, 8, >>> quirk_tw686x_class); >>> >>> +static void quirk_relaxedordering_disable(struct pci_dev *dev) >>> +{ >>> + if (dev->vendor != PCI_VENDOR_ID_HUAWEI && >>> + dev->vendor != PCI_VENDOR_ID_SUN) >>> + dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING; >>> +} >>> +DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_INTEL_ID, PCI_ANY_ID, PCI_CLASS_NOT_DEFINED, 8, >>> + quirk_relaxedordering_disable); >>> + >>> /* >>> * Per PCIe r3.0, sec 2.2.9, "Completion headers must supply the same >>> * values for the Attribute as were supplied in the header of the >>> >>> >>> What do you think of it? >>> >>> Thanks >>> Ding >>> >> >> This is a bit simplistic but it is a start. >> >> The other bit I was getting at is that we need to update the core PCIe >> code so that when we configure devices and the root complex reports no >> support for relaxed ordering it should be clearing the relaxed >> ordering bits in the PCIe configuration registers on the upstream >> facing devices. > > How about this: > rename the PCI_DEV_FLAGS_NO_RELAXED_ORDERIN to PCI_DEV_FLAGS_RELAXED_ORDERIN, only enable it > when pcie root configure if it support the RO mode, otherwise we will not set it to indicate > that the pcie dev did not support RO mode. The problem is we need to have something that can be communicated through a VM. Your change doesn't work in that regard. That was why I suggested just updating the code so that we when we initialized PCIe devices what we do is either set or clear the relaxed ordering bit in the PCIe device control register. That way when we direct assign an interface it could know just based on the bits int the PCIe configuration if it could use relaxed ordering or not. At that point the driver code itself becomes very simple since you could just enable the relaxed ordering by default in the igb/ixgbe driver and if the bit is set or cleared in the PCIe configuration then we are either sending with relaxed ordering requests or not and don't have to try and locate the root complex. >> >> The last bit we need in all this is a way to allow for setups where >> peer-to-peer wants to perform relaxed ordering but for writes to the >> host we have to not use relaxed ordering. For that we need to enable a >> special case and that isn't handled right now in any of the solutions >> we have coded up so far. >> > > Sorry I am not clear of this way, can you explain more about this or give me > a special case, thanks a lot. So from the sound of it Casey has a special use case where he doesn't want to send relaxed ordering frames to the root complex, but instead would like to send them to another PCIe device. To do that he needs to have a way to enable the relaxed ordering bit in the PCIe configuration but then not send any to the root complex. Odds are that is something he might be able to just implement in the driver, but is something that may become a more general case in the future. I don't see our change here impacting it as long as we keep the solution generic and mostly confined to when we instantiate the devices as the driver could likely make the decision to change the behavior later. - Alex
| From: Alexander Duyck <alexander.duyck@gmail.com> | Date: Saturday, May 6, 2017 11:07 AM | | | From: Ding Tianhong <dingtianhong@huawei.com> | | Date: Fri, May 5, 2017 at 8:08 PM | | | | According the suggestion, I could only think of this code: | | .. | | This is a bit simplistic but it is a start. Yes, something tells me that this is going to be more complicated than any of us like ... | The other bit I was getting at is that we need to update the core PCIe | code so that when we configure devices and the root complex reports no | support for relaxed ordering it should be clearing the relaxed | ordering bits in the PCIe configuration registers on the upstream | facing devices. Of course, this can be written to by the Driver at any time ... and is in the case of the cxgb4 Driver ... After a lot of rummaging around, it does look like KVM prohibits writes to the PCIe Capability Device Control register in drivers/xen/xen-pciback/ conf_space_capability.c and conf_space.c simply because writes aren't allowed unless "permissive" is set. So it ~looks~ like a driver running in a Virtual Machine can't turn Enable Relaxed Ordering back on ... | The last bit we need in all this is a way to allow for setups where | peer-to-peer wants to perform relaxed ordering but for writes to the | host we have to not use relaxed ordering. For that we need to enable a | special case and that isn't handled right now in any of the solutions | we have coded up so far. Yes, we do need this. | From: Alexander Duyck <alexander.duyck@gmail.com> | Date: Saturday, May 8, 2017 08:22 AM | | The problem is we need to have something that can be communicated | through a VM. Your change doesn't work in that regard. That was why I | suggested just updating the code so that we when we initialized PCIe | devices what we do is either set or clear the relaxed ordering bit in | the PCIe device control register. That way when we direct assign an | interface it could know just based on the bits int the PCIe | configuration if it could use relaxed ordering or not. | | At that point the driver code itself becomes very simple since you | could just enable the relaxed ordering by default in the igb/ixgbe | driver and if the bit is set or cleared in the PCIe configuration then | we are either sending with relaxed ordering requests or not and don't | have to try and locate the root complex. | | So from the sound of it Casey has a special use case where he doesn't | want to send relaxed ordering frames to the root complex, but instead | would like to send them to another PCIe device. To do that he needs to | have a way to enable the relaxed ordering bit in the PCIe | configuration but then not send any to the root complex. Odds are that | is something he might be able to just implement in the driver, but is | something that may become a more general case in the future. I don't | see our change here impacting it as long as we keep the solution | generic and mostly confined to when we instantiate the devices as the | driver could likely make the decision to change the behavior later. It's not just me. Intel has said that while RO directed at the Root Complex Host Coherent Memory has a performance bug (not Data Corruption), it's a performance win for Peer-to-Peer writes to MMIO Space. (I'll be very interested in hearing what the bug is if we get that much detail. The very same TLPs directed to the Root Complex Port without Relaxed Ordering set get good performance. So this is essentially a bug in the hardware that was ~trying~ to implement a performance win.) Meanwhile, I currently only know of a single PCIe End Point which causes catastrophic results: the AMD A1100 ARM SoC ("SEATTLE"). And it's not even clear that product is even alive anymore since I haven't been able to get any responses from them for several months. What I'm saying is: let's try to architect a solution which doesn't throw the baby out with the bath water ... I think that if a Device's Root Complex Port has problems with Relaxed Ordering, it ~probably~ makes sense to turn off the PCIe Capability Device Control[Enable Relaxed Ordering] when we assign a device to a Virtual Machine since the Device Driver can no longer query the Relaxed Ordering Support of the Root Complex Port. The only down side of this would be if we assigned two Peers to a VM in an application which wanted to do Peer-to-Peer transfers. But that seems like a hard application to support in any case since the PCI Bus:Slot.Function IDs for assigned Devices within a VM don't match the actual values. For Devices running in the base OS/Hypervisor, their Drivers can query the Relaxed Ordering Support for the Root Complex Port or a Peer Device. So a simple flag within the (struct pci_dev *)->dev_flags would serve for that along with a per-Architecture/Platform mechanism for setting it ... Casey
On 2017/5/9 8:48, Casey Leedom wrote: > > | From: Alexander Duyck <alexander.duyck@gmail.com> > | Date: Saturday, May 6, 2017 11:07 AM > | > | | From: Ding Tianhong <dingtianhong@huawei.com> > | | Date: Fri, May 5, 2017 at 8:08 PM > | | > | | According the suggestion, I could only think of this code: > | | .. > | > | This is a bit simplistic but it is a start. > > Yes, something tells me that this is going to be more complicated than any > of us like ... > > | The other bit I was getting at is that we need to update the core PCIe > | code so that when we configure devices and the root complex reports no > | support for relaxed ordering it should be clearing the relaxed > | ordering bits in the PCIe configuration registers on the upstream > | facing devices. > > Of course, this can be written to by the Driver at any time ... and is in > the case of the cxgb4 Driver ... > > After a lot of rummaging around, it does look like KVM prohibits writes to > the PCIe Capability Device Control register in drivers/xen/xen-pciback/ > conf_space_capability.c and conf_space.c simply because writes aren't > allowed unless "permissive" is set. So it ~looks~ like a driver running in > a Virtual Machine can't turn Enable Relaxed Ordering back on ... > > | The last bit we need in all this is a way to allow for setups where > | peer-to-peer wants to perform relaxed ordering but for writes to the > | host we have to not use relaxed ordering. For that we need to enable a > | special case and that isn't handled right now in any of the solutions > | we have coded up so far. > > Yes, we do need this. > > > | From: Alexander Duyck <alexander.duyck@gmail.com> > | Date: Saturday, May 8, 2017 08:22 AM > | > | The problem is we need to have something that can be communicated > | through a VM. Your change doesn't work in that regard. That was why I > | suggested just updating the code so that we when we initialized PCIe > | devices what we do is either set or clear the relaxed ordering bit in > | the PCIe device control register. That way when we direct assign an > | interface it could know just based on the bits int the PCIe > | configuration if it could use relaxed ordering or not. > | > | At that point the driver code itself becomes very simple since you > | could just enable the relaxed ordering by default in the igb/ixgbe > | driver and if the bit is set or cleared in the PCIe configuration then > | we are either sending with relaxed ordering requests or not and don't > | have to try and locate the root complex. > | > | So from the sound of it Casey has a special use case where he doesn't > | want to send relaxed ordering frames to the root complex, but instead > | would like to send them to another PCIe device. To do that he needs to > | have a way to enable the relaxed ordering bit in the PCIe > | configuration but then not send any to the root complex. Odds are that > | is something he might be able to just implement in the driver, but is > | something that may become a more general case in the future. I don't > | see our change here impacting it as long as we keep the solution > | generic and mostly confined to when we instantiate the devices as the > | driver could likely make the decision to change the behavior later. > > It's not just me. Intel has said that while RO directed at the Root > Complex Host Coherent Memory has a performance bug (not Data Corruption), > it's a performance win for Peer-to-Peer writes to MMIO Space. (I'll be very > interested in hearing what the bug is if we get that much detail. The very > same TLPs directed to the Root Complex Port without Relaxed Ordering set get > good performance. So this is essentially a bug in the hardware that was > ~trying~ to implement a performance win.) > > Meanwhile, I currently only know of a single PCIe End Point which causes > catastrophic results: the AMD A1100 ARM SoC ("SEATTLE"). And it's not even > clear that product is even alive anymore since I haven't been able to get > any responses from them for several months. > > What I'm saying is: let's try to architect a solution which doesn't throw > the baby out with the bath water ... > > I think that if a Device's Root Complex Port has problems with Relaxed > Ordering, it ~probably~ makes sense to turn off the PCIe Capability Device > Control[Enable Relaxed Ordering] when we assign a device to a Virtual > Machine since the Device Driver can no longer query the Relaxed Ordering > Support of the Root Complex Port. The only down side of this would be if we > assigned two Peers to a VM in an application which wanted to do Peer-to-Peer > transfers. But that seems like a hard application to support in any case > since the PCI Bus:Slot.Function IDs for assigned Devices within a VM don't > match the actual values. > > For Devices running in the base OS/Hypervisor, their Drivers can query the > Relaxed Ordering Support for the Root Complex Port or a Peer Device. So a > simple flag within the (struct pci_dev *)->dev_flags would serve for that > along with a per-Architecture/Platform mechanism for setting it ... > Hi Casey: Will you continue to work on this solution or send a new version patches? Thanks Ding > Casey > > . >
| From: Ding Tianhong <dingtianhong@huawei.com> | Sent: Wednesday, May 10, 2017 6:15 PM | | Hi Casey: | | Will you continue to work on this solution or send a new version patches? I won't be able to work on this any time soon given several other urgent issues. If you've got a desire to pick this up, I'd be happy to help code review your efforts. Casey
On Tue, May 16, 2017 at 11:38 AM, Casey Leedom <leedom@chelsio.com> wrote: > | From: Ding Tianhong <dingtianhong@huawei.com> > | Sent: Wednesday, May 10, 2017 6:15 PM > | > | Hi Casey: > | > | Will you continue to work on this solution or send a new version patches? > > I won't be able to work on this any time soon given several other urgent > issues. If you've got a desire to pick this up, I'd be happy to help code > review your efforts. > > Casey Thanks for the heads up. I'll see if I can find somebody in my team that might be able to take on the task though I can't make any promises either as we have quite a bit going on. - Alex
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index f754453..4ae78b3 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -3979,6 +3979,44 @@ static void quirk_tw686x_class(struct pci_dev *pdev) quirk_tw686x_class); /* + * Some devices have problems with Transaction Layer Packets with the Relaxed + * Ordering Attribute set. Such devices should mark themselves and other + * Device Drivers should check before sending TLPs with RO set. + */ +static void quirk_relaxedordering_disable(struct pci_dev *dev) +{ + dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING; +} + +/* + * Intel E5-26xx Root Complex has a Flow Control Credit issue which can + * cause performance problems with Upstream Transaction Layer Packets with + * Relaxed Ordering set. + */ +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f02, PCI_CLASS_NOT_DEFINED, 8, + quirk_relaxedordering_disable); +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f04, PCI_CLASS_NOT_DEFINED, 8, + quirk_relaxedordering_disable); +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f08, PCI_CLASS_NOT_DEFINED, 8, + quirk_relaxedordering_disable); + +/* + * The AMD ARM A1100 (AKA "SEATTLE") SoC has a bug in its PCIe Root Complex + * where Upstream Transaction Layer Packets with the Relaxed Ordering + * Attribute clear are allowed to bypass earlier TLPs with Relaxed Ordering + * set. This is a violation of the PCIe 3.0 Transaction Ordering Rules + * outlined in Section 2.4.1 (PCI Express(r) Base Specification Revision 3.0 + * November 10, 2010). As a result, on this platform we can't use Relaxed + * Ordering for Upstream TLPs. + */ +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a00, PCI_CLASS_NOT_DEFINED, 8, + quirk_relaxedordering_disable); +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a01, PCI_CLASS_NOT_DEFINED, 8, + quirk_relaxedordering_disable); +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a02, PCI_CLASS_NOT_DEFINED, 8, + quirk_relaxedordering_disable); + +/* * Per PCIe r3.0, sec 2.2.9, "Completion headers must supply the same * values for the Attribute as were supplied in the header of the * corresponding Request, except as explicitly allowed when IDO is used." diff --git a/include/linux/pci.h b/include/linux/pci.h index eb3da1a..6764f66 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -178,6 +178,8 @@ enum pci_dev_flags { PCI_DEV_FLAGS_NO_PM_RESET = (__force pci_dev_flags_t) (1 << 7), /* Get VPD from function 0 VPD */ PCI_DEV_FLAGS_VPD_REF_F0 = (__force pci_dev_flags_t) (1 << 8), + /* Don't use Relaxed Ordering for TLPs directed at this device */ + PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 9), }; enum pci_irq_reroute_variant {