Message ID | MWHPR12MB1600A6986E4AD13D302F2DC9C8B00@MWHPR12MB1600.namprd12.prod.outlook.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
Hi Casey On Wed, Aug 02, 2017 at 05:53:52PM +0000, Casey Leedom wrote: > Okay, here you go. As you can tell, it's almost a trivial copy of the > cxgb4 patch. > > By the way, I realized that we have yet another hole which is likely not > to be fixable. If we're dealing with a problematic Root Complex, and we > instantiate Virtual Functions and attach them to a Virtual Machine along > with an NVMe device which can deal with Relaxed Ordering TLPs, the VF driver > in the VM will be able to tell that it shouldn't attempt to send RO TLPs to > the RC because it will see the state of its own PCIe Capability Device > Control[Relaxed Ordering Enable] (a copy of the setting in the VF's > corresponding PF), but it won't be able to change that and send non-RO TLPs > to the RC, and RO TLPs to the NVMe device. Oh well. I don't understand this completely.. So your driver would know not to send RO TLP's to root complex. But you want to send RO to the NVMe device? This is the peer-2-peer case correct? The issue in the current patchset is that we device to turn off the device capability for all devices in the hierarchy so one would expect that the NVMe also would have RO turned off i suppose. The other approach is to not turn off the device capabilty, but let the driver do the right thing. i.e for transactions towards system memory vs. peer-2-peer? But since we wanted to take a big hammer approach because some platforms there can be data-corruption and we can't let trust guest drivers to do the right thing. This isn't something we can fix in this current version. One possible approach is to provide a strict flag, where we use this heavy hammer approach only on platforms that have a serious implication, and the other is we let the driver do the right thing depending on the platform. Worst case if the driver doesn't do the right thing, you would see perf issues but nothing bad would happen. It would allow you to select when to turn on RO and when to turn it off. Cheers, Ashok
| From: Raj, Ashok <ashok.raj@intel.com> | Sent: Thursday, August 3, 2017 1:31 AM | | I don't understand this completely.. So your driver would know not to send | RO TLP's to root complex. But you want to send RO to the NVMe device? This | is the peer-2-peer case correct? Yes, this is the "heavy hammer" issue which you alluded to later. There are applications where a device will want to send TLPs to a Root Complex without Relaxed Ordering set, but will want to use it when sending TLPs to a Peer device (say, an NVMe storage device). The current approach doesn't make that easy ... and in fact, I still don't kow how to code a solution for this with the proposed APIs. This means that we may be trading off one performance problem for another and that Relaxed Ordering may be doomed for use under Linux for the foreseeable future. As I've noted a number of times, it would be great if the Intel Hardware Engineers who attempted to implement the Relaxed Ordering semantics in the current generation of Root Complexes had left the ability to turn off the logic which is obviously not working. If there was a way to disable the logic via an undocumented register, then we could have the Linux PCI Quirk do that. Since Relaxed Ordering is just a hint, it's completely legitimate to completely ignore it. Casey
On Fri, Aug 04, 2017 at 08:20:37PM +0000, Casey Leedom wrote: > | From: Raj, Ashok <ashok.raj@intel.com> > | Sent: Thursday, August 3, 2017 1:31 AM > | > | I don't understand this completely.. So your driver would know not to send > | RO TLP's to root complex. But you want to send RO to the NVMe device? This > | is the peer-2-peer case correct? > > Yes, this is the "heavy hammer" issue which you alluded to later. There are > applications where a device will want to send TLPs to a Root Complex without > Relaxed Ordering set, but will want to use it when sending TLPs to a Peer > device (say, an NVMe storage device). The current approach doesn't make > that easy ... and in fact, I still don't kow how to code a solution for this > with the proposed APIs. This means that we may be trading off one > performance problem for another and that Relaxed Ordering may be doomed for > use under Linux for the foreseeable future. > > As I've noted a number of times, it would be great if the Intel Hardware > Engineers who attempted to implement the Relaxed Ordering semantics in the > current generation of Root Complexes had left the ability to turn off the > logic which is obviously not working. If there was a way to disable the > logic via an undocumented register, then we could have the Linux PCI Quirk > do that. Since Relaxed Ordering is just a hint, it's completely legitimate > to completely ignore it. Suppose you are looking for the existence of a chicken bit to instruct the port to ignore RO traffic. So all we would do is turn the chicken bit on but would permit p2p traffic to be allowed since we won't turn off the capability as currently proposed. Let me look into that keep you posted. Cheers, Ashok > > Casey
| From: Raj, Ashok <ashok.raj@intel.com> | Sent: Friday, August 4, 2017 1:21 PM | | On Fri, Aug 04, 2017 at 08:20:37PM +0000, Casey Leedom wrote: | > ... | > As I've noted a number of times, it would be great if the Intel Hardware | > Engineers who attempted to implement the Relaxed Ordering semantics in the | > current generation of Root Complexes had left the ability to turn off the | > logic which is obviously not working. If there was a way to disable the | > logic via an undocumented register, then we could have the Linux PCI Quirk | > do that. Since Relaxed Ordering is just a hint, it's completely legitimate | > to completely ignore it. | | Suppose you are looking for the existence of a chicken bit to instruct the | port to ignore RO traffic. So all we would do is turn the chicken bit on | but would permit p2p traffic to be allowed since we won't turn off the | capability as currently proposed. | | Let me look into that keep you posted. Huh, I'd never heard it called a "chicken bit" before, but yes, that's what I'm talking about. Whenever our Hardware Designers implement new functionality in our hardware, they almost always put in A. several "knobs" which can control fundamental parameters of the new Hardware Feature, and B. a mechanism of completely disabling it if necessary. This stems from the incredibly long Design -> Deployment cyle for Hardware (as opposed to the edit->compile->run cycle for s! It's obvious that handling Relaxed Ordering is a new Hardware Feature for Intel's Root Complexes since previous versions simply ignored it (because that's legal[1]). If I was a Hardware Engineer tasked with implementing Relaxed Ordering semantics for a device, I would certainly have also implemented a switch to turn it off in case there were unintended consequences (performance in this case). And if there is such a mechanism to simply disable processing of Relaxed Ordering semantics in the Root Complex, that would be a far easier "fix" for this problem ... and leave the code in place to continue sending Relaxed Ordering TLPs for a future Root Complex implementation which got it right ... Casey [1] One can't ~quite~ just ignore the Relaxed Ordering Attribute on an incoming Transaction Layer Packet Request: PCIe completion rules (see section 2.2.9 of the PCIe 3.0 specificatin) require that the Relaxed Ordering and No Snoop Attributes in a Request TLP be reflected back verbatim in any corresponding Response TLP. (The rules for ID-Based Ordering are more complex.)
From: Casey Leedom > Sent: 04 August 2017 21:49 ... > Whenever our Hardware Designers implement new functionality in our hardware, > they almost always put in A. several "knobs" which can control fundamental > parameters of the new Hardware Feature, and B. a mechanism of completely > disabling it if necessary. This stems from the incredibly long Design -> > Deployment cyle for Hardware (as opposed to the edit->compile->run cycle for s! Indeed, I'd also expect there to be an undocumented flag to turn it on (broken) in earlier parts to allow testing. David
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h b/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h index 109bc63..08c6ddb 100644 --- a/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h +++ b/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h @@ -408,6 +408,7 @@ enum { /* adapter flags */ USING_MSI = (1UL << 1), USING_MSIX = (1UL << 2), QUEUES_BOUND = (1UL << 3), + ROOT_NO_RELAXED_ORDERING = (1UL << 4), }; /* diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c index ac7a150..59e7639 100644 --- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c +++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c @@ -2888,6 +2888,24 @@ static int cxgb4vf_pci_probe(struct pci_dev *pdev, */ adapter->name = pci_name(pdev); adapter->msg_enable = DFLT_MSG_ENABLE; + + /* If possible, we use PCIe Relaxed Ordering Attribute to deliver + * Ingress Packet Data to Free List Buffers in order to allow for + * chipset performance optimizations between the Root Complex and + * Memory Controllers. (Messages to the associated Ingress Queue + * notifying new Packet Placement in the Free Lists Buffers will be + * send without the Relaxed Ordering Attribute thus guaranteeing that + * all preceding PCIe Transaction Layer Packets will be processed + * first.) But some Root Complexes have various issues with Upstream + * Transaction Layer Packets with the Relaxed Ordering Attribute set. + * The PCIe devices which under the Root Complexes will be cleared the + * Relaxed Ordering bit in the configuration space, So we check our + * PCIe configuration space to see if it's flagged with advice against + * using Relaxed Ordering. + */ + if (!pcie_relaxed_ordering_supported(pdev)) + adapter->flags |= ROOT_NO_RELAXED_ORDERING; + err = adap_init0(adapter); if (err) goto err_unmap_bar; diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c index e37dde2..05498e7 100644 --- a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c +++ b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c @@ -2205,6 +2205,7 @@ int t4vf_sge_alloc_rxq(struct adapter *adapter, struct sge_rspq *rspq, struct port_info *pi = netdev_priv(dev); struct fw_iq_cmd cmd, rpl; int ret, iqandst, flsz = 0; + int relaxed = !(adapter->flags & ROOT_NO_RELAXED_ORDERING); /* * If we're using MSI interrupts and we're not initializing the @@ -2300,6 +2301,8 @@ int t4vf_sge_alloc_rxq(struct adapter *adapter, struct sge_rspq *rspq, cpu_to_be32( FW_IQ_CMD_FL0HOSTFCMODE_V(SGE_HOSTFCMODE_NONE) | FW_IQ_CMD_FL0PACKEN_F | + FW_IQ_CMD_FL0FETCHRO_V(relaxed) | + FW_IQ_CMD_FL0DATARO_V(relaxed) | FW_IQ_CMD_FL0PADEN_F); /* In T6, for egress queue type FL there is internal overhead