Message ID: 20210405052404.213889-1-leon@kernel.org (mailing list archive)
Series: Enable relaxed ordering for ULPs
On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote: > From: Leon Romanovsky <leonro@nvidia.com> > > >From Avihai, > > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering > imposed on PCI transactions, and thus, can improve performance. > > Until now, relaxed ordering could be set only by user space applications > for user MRs. The following patch series enables relaxed ordering for the > kernel ULPs as well. Relaxed ordering is an optional capability, and as > such, it is ignored by vendors that don't support it. > > The following test results show the performance improvement achieved > with relaxed ordering. The test was performed on a NVIDIA A100 in order > to check performance of storage infrastructure over xprtrdma: Isn't the Nvidia A100 a GPU not actually supported by Linux at all? What does that have to do with storage protocols? Also if you enable this for basically all kernel ULPs, why not have an opt-out into strict ordering for the cases that need it (if there are any).
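[Editor's note: for readers who are not RDMA developers, a kernel ULP asking for relaxed ordering on a memory region looks roughly like the fast-registration sketch below. The helper name ulp_fast_reg is hypothetical and the flag placement only illustrates the approach; it is not a quote of the actual patches. HCAs that lack the optional capability simply ignore the bit.]

```c
#include <linux/scatterlist.h>
#include <rdma/ib_verbs.h>

/*
 * Illustrative sketch: build an FRWR registration and request relaxed
 * ordering.  IB_ACCESS_RELAXED_ORDERING is an optional access flag, so
 * devices that do not support it register the MR without it.
 * Completion handling is omitted for brevity.
 */
static int ulp_fast_reg(struct ib_qp *qp, struct ib_mr *mr,
			struct scatterlist *sg, int sg_nents)
{
	struct ib_reg_wr reg_wr = {};
	int n;

	n = ib_map_mr_sg(mr, sg, sg_nents, NULL, PAGE_SIZE);
	if (n != sg_nents)
		return n < 0 ? n : -EINVAL;

	reg_wr.wr.opcode = IB_WR_REG_MR;
	reg_wr.mr = mr;
	reg_wr.key = mr->rkey;
	reg_wr.access = IB_ACCESS_LOCAL_WRITE |
			IB_ACCESS_REMOTE_WRITE |
			IB_ACCESS_RELAXED_ORDERING;	/* the new part */

	return ib_post_send(qp, &reg_wr.wr, NULL);
}
```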
On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote: > On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote: > > From: Leon Romanovsky <leonro@nvidia.com> > > > > >From Avihai, > > > > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering > > imposed on PCI transactions, and thus, can improve performance. > > > > Until now, relaxed ordering could be set only by user space applications > > for user MRs. The following patch series enables relaxed ordering for the > > kernel ULPs as well. Relaxed ordering is an optional capability, and as > > such, it is ignored by vendors that don't support it. > > > > The following test results show the performance improvement achieved > > with relaxed ordering. The test was performed on a NVIDIA A100 in order > > to check performance of storage infrastructure over xprtrdma: > > Isn't the Nvidia A100 a GPU not actually supported by Linux at all? > What does that have to do with storage protocols? This system is in use by our storage-oriented customer who performed the test. He runs the drivers/infiniband/* stack from upstream, simply backported to a specific kernel version. The performance boost is seen on other systems too. > > Also if you enable this for basically all kernel ULPs, why not have > an opt-out into strict ordering for the cases that need it (if there are > any). The RO property is optional; it can only improve performance. In addition, none of the in-kernel ULPs need strict ordering. I may be mistaken here (Jason will correct me), but it is because of two things: the ULP doesn't touch data before the CQE, and the DMA API prohibits it. Thanks
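[Editor's note: Leon's "ULP doesn't touch data before CQE" argument refers to the standard completion-driven pattern used by kernel verbs consumers. A minimal sketch follows; the ulp_* names are hypothetical, but the ib_cqe/done mechanics are the ones in-tree ULPs use, which is why payload is never read before its completion is reaped.]

```c
#include <rdma/ib_verbs.h>

struct ulp_rx {
	struct ib_cqe		cqe;	/* done() is resolved from wr_cqe */
	struct ib_device	*dev;
	void			*buf;
	u64			dma;
	size_t			len;
};

/* Runs only after the HCA has written the CQE for this receive. */
static void ulp_rx_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct ulp_rx *rx = container_of(wc->wr_cqe, struct ulp_rx, cqe);

	if (wc->status != IB_WC_SUCCESS)
		return;

	/* Hand the buffer back to the CPU before looking at the bytes. */
	ib_dma_sync_single_for_cpu(rx->dev, rx->dma, rx->len, DMA_FROM_DEVICE);
	/* ...parse rx->buf here, never before this point... */
}

static int ulp_post_recv(struct ib_qp *qp, struct ulp_rx *rx, u32 lkey)
{
	struct ib_sge sge = {
		.addr = rx->dma, .length = rx->len, .lkey = lkey,
	};
	struct ib_recv_wr wr = {
		.wr_cqe = &rx->cqe, .sg_list = &sge, .num_sge = 1,
	};

	rx->cqe.done = ulp_rx_done;
	return ib_post_recv(qp, &wr, NULL);
}
```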
> On Apr 5, 2021, at 7:08 AM, Leon Romanovsky <leon@kernel.org> wrote: > > On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote: >> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote: >>> From: Leon Romanovsky <leonro@nvidia.com> >>> >>>> From Avihai, >>> >>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering >>> imposed on PCI transactions, and thus, can improve performance. >>> >>> Until now, relaxed ordering could be set only by user space applications >>> for user MRs. The following patch series enables relaxed ordering for the >>> kernel ULPs as well. Relaxed ordering is an optional capability, and as >>> such, it is ignored by vendors that don't support it. >>> >>> The following test results show the performance improvement achieved >>> with relaxed ordering. The test was performed on a NVIDIA A100 in order >>> to check performance of storage infrastructure over xprtrdma: >> >> Isn't the Nvidia A100 a GPU not actually supported by Linux at all? >> What does that have to do with storage protocols? > > This system is in use by our storage oriented customer who performed the > test. He runs drivers/infiniband/* stack from the upstream, simply backported > to specific kernel version. > > The performance boost is seen in other systems too. > >> >> Also if you enable this for basically all kernel ULPs, why not have >> an opt-out into strict ordering for the cases that need it (if there are >> any). > > The RO property is optional, it can only improve. In addition, all in-kernel ULPs > don't need strict ordering. I can be mistaken here and Jason will correct me, it > is because of two things: ULP doesn't touch data before CQE and DMA API prohibits it. > Sticking to in-kernel ULP’s don’t need strict ordering, why don’t you enable this for HCA’s which supports it by default instead of patching very ULPs. Some ULPs in future may forget to specify such flag. Regards, Santosh
On 4/5/2021 10:08 AM, Leon Romanovsky wrote: > On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote: >> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote: >>> From: Leon Romanovsky <leonro@nvidia.com> >>> >>> >From Avihai, >>> >>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering >>> imposed on PCI transactions, and thus, can improve performance. >>> >>> Until now, relaxed ordering could be set only by user space applications >>> for user MRs. The following patch series enables relaxed ordering for the >>> kernel ULPs as well. Relaxed ordering is an optional capability, and as >>> such, it is ignored by vendors that don't support it. >>> >>> The following test results show the performance improvement achieved >>> with relaxed ordering. The test was performed on a NVIDIA A100 in order >>> to check performance of storage infrastructure over xprtrdma: >> >> Isn't the Nvidia A100 a GPU not actually supported by Linux at all? >> What does that have to do with storage protocols? > > This system is in use by our storage oriented customer who performed the > test. He runs drivers/infiniband/* stack from the upstream, simply backported > to specific kernel version. > > The performance boost is seen in other systems too. We need to see more information about this test, and platform. What correctness testing was done, and how was it verified? What PCI bus type(s) were tested, and with what adapters? What storage workload was generated, and were all possible RDMA exchanges by each ULP exercised? >> Also if you enable this for basically all kernel ULPs, why not have >> an opt-out into strict ordering for the cases that need it (if there are >> any). > > The RO property is optional, it can only improve. In addition, all in-kernel ULPs > don't need strict ordering. I can be mistaken here and Jason will correct me, it > is because of two things: ULP doesn't touch data before CQE and DMA API prohibits it. +1 on Christoph's comment. I woud hope most well-architected ULPs will support relaxed ordering, but storage workloads, in my experience, can find ways to cause failure in adapters. I would not suggest making this the default behavior without extensive testing. Tom.
On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote: > On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote: > > From: Leon Romanovsky <leonro@nvidia.com> > > > > >From Avihai, > > > > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering > > imposed on PCI transactions, and thus, can improve performance. > > > > Until now, relaxed ordering could be set only by user space applications > > for user MRs. The following patch series enables relaxed ordering for the > > kernel ULPs as well. Relaxed ordering is an optional capability, and as > > such, it is ignored by vendors that don't support it. > > > > The following test results show the performance improvement achieved > > with relaxed ordering. The test was performed on a NVIDIA A100 in order > > to check performance of storage infrastructure over xprtrdma: > > Isn't the Nvidia A100 a GPU not actually supported by Linux at all? > What does that have to do with storage protocols? I think it is a typo (or at least it makes no sense to be talking about NFS with a GPU chip). Probably it should be a DGX A100, which is a dual-socket AMD server with a lot of PCIe, and xprtrdma is an NFS-RDMA workload. AMD dual-socket systems are well known to benefit from relaxed ordering; people have been doing this in userspace for a while now with the opt-in. What surprises me is the performance difference, I hadn't heard it is 4x! Jason
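[Editor's note: the userspace opt-in Jason mentions is the IBV_ACCESS_RELAXED_ORDERING access flag in rdma-core, used for example by perftest. A minimal sketch, assuming an rdma-core recent enough to define the flag:]

```c
#include <stddef.h>
#include <infiniband/verbs.h>

/*
 * Sketch of the existing userspace opt-in: ask for relaxed ordering when
 * registering a user MR.  The flag is defined as an optional access bit,
 * so providers that cannot honour it are expected to ignore it rather
 * than fail the registration.
 */
static struct ibv_mr *reg_ro_mr(struct ibv_pd *pd, void *buf, size_t len)
{
	return ibv_reg_mr(pd, buf, len,
			  IBV_ACCESS_LOCAL_WRITE |
			  IBV_ACCESS_REMOTE_WRITE |
			  IBV_ACCESS_RELAXED_ORDERING);
}
```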
> On Apr 5, 2021, at 4:07 PM, Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote: >> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote: >>> From: Leon Romanovsky <leonro@nvidia.com> >>> >>>> From Avihai, >>> >>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering >>> imposed on PCI transactions, and thus, can improve performance. >>> >>> Until now, relaxed ordering could be set only by user space applications >>> for user MRs. The following patch series enables relaxed ordering for the >>> kernel ULPs as well. Relaxed ordering is an optional capability, and as >>> such, it is ignored by vendors that don't support it. >>> >>> The following test results show the performance improvement achieved >>> with relaxed ordering. The test was performed on a NVIDIA A100 in order >>> to check performance of storage infrastructure over xprtrdma: >> >> Isn't the Nvidia A100 a GPU not actually supported by Linux at all? >> What does that have to do with storage protocols? > > I think it is a typo (or at least mit makes no sense to be talking > about NFS with a GPU chip) Probably it should be a DGX A100 which is a > dual socket AMD server with alot of PCIe, and xptrtrdma is a NFS-RDMA > workload. We need to get a better idea what correctness testing has been done, and whether positive correctness testing results can be replicated on a variety of platforms. I have an old Haswell dual-socket system in my lab, but otherwise I'm not sure I have a platform that would be interesting for such a test. > AMD dual socket systems are well known to benefit from relaxed > ordering, people have been doing this in userspace for a while now > with the opt in. -- Chuck Lever
On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote: > > On Apr 5, 2021, at 4:07 PM, Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote: > >> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote: > >>> From: Leon Romanovsky <leonro@nvidia.com> > >>> > >>>> From Avihai, > >>> > >>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering > >>> imposed on PCI transactions, and thus, can improve performance. > >>> > >>> Until now, relaxed ordering could be set only by user space applications > >>> for user MRs. The following patch series enables relaxed ordering for the > >>> kernel ULPs as well. Relaxed ordering is an optional capability, and as > >>> such, it is ignored by vendors that don't support it. > >>> > >>> The following test results show the performance improvement achieved > >>> with relaxed ordering. The test was performed on a NVIDIA A100 in order > >>> to check performance of storage infrastructure over xprtrdma: > >> > >> Isn't the Nvidia A100 a GPU not actually supported by Linux at all? > >> What does that have to do with storage protocols? > > > > I think it is a typo (or at least mit makes no sense to be talking > > about NFS with a GPU chip) Probably it should be a DGX A100 which is a > > dual socket AMD server with alot of PCIe, and xptrtrdma is a NFS-RDMA > > workload. > > We need to get a better idea what correctness testing has been done, > and whether positive correctness testing results can be replicated > on a variety of platforms. > > I have an old Haswell dual-socket system in my lab, but otherwise > I'm not sure I have a platform that would be interesting for such a > test. Not sure if Haswell will be useful for such testing. It looks like many of those subscribe to 'quirk_relaxedordering_disable'.
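[Editor's note: the quirk Honggang alludes to lives in drivers/pci/quirks.c and marks the affected Intel root ports so that pcie_relaxed_ordering_enabled() reports false for devices behind them. The sketch below paraphrases that pattern from memory (one representative device ID shown) and may not match the upstream code verbatim.]

```c
#include <linux/pci.h>

/*
 * Paraphrase of the upstream pattern: flag the root port so that
 * pcie_relaxed_ordering_enabled() returns false and drivers refrain
 * from generating TLPs with the RO attribute set.  The real quirks
 * match a long list of Haswell/Broadwell-era root-port device IDs.
 */
static void quirk_relaxedordering_disable(struct pci_dev *dev)
{
	dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING;
	pci_info(dev, "Disabling Relaxed Ordering on this root port\n");
}
DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f01,
			      PCI_CLASS_NOT_DEFINED, 8,
			      quirk_relaxedordering_disable);
```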
On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote: > From: Leon Romanovsky <leonro@nvidia.com> > > From Avihai, > > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering > imposed on PCI transactions, and thus, can improve performance. > > Until now, relaxed ordering could be set only by user space applications > for user MRs. The following patch series enables relaxed ordering for the > kernel ULPs as well. Relaxed ordering is an optional capability, and as > such, it is ignored by vendors that don't support it. > > The following test results show the performance improvement achieved Did you test this patchset with a CPU that does not support relaxed ordering? We observed significant performance degradation when running perftest with relaxed ordering enabled on an old CPU. https://github.com/linux-rdma/perftest/issues/116 thanks
On Tue, Apr 06, 2021 at 10:37:38AM +0800, Honggang LI wrote: > On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote: > > From: Leon Romanovsky <leonro@nvidia.com> > > > > From Avihai, > > > > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering > > imposed on PCI transactions, and thus, can improve performance. > > > > Until now, relaxed ordering could be set only by user space applications > > for user MRs. The following patch series enables relaxed ordering for the > > kernel ULPs as well. Relaxed ordering is an optional capability, and as > > such, it is ignored by vendors that don't support it. > > > > The following test results show the performance improvement achieved > > Did you test this patchset with CPU does not support relaxed ordering? I don't think so; the CPUs that don't support RO are Intel's fourth/fifth generation, and they are not interesting from a performance point of view. > > We observed significantly performance degradation when run perftest with > relaxed ordering enabled over old CPU. > > https://github.com/linux-rdma/perftest/issues/116 The perftest case is slightly different, but you raise a valid point. We forgot to call pcie_relaxed_ordering_enabled() before setting the RO bit, and arguably this needs to be done in perftest too. Thanks > > thanks >
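[Editor's note: pcie_relaxed_ordering_enabled() is the generic PCI helper that folds in both the Device Control RO enable bit and the platform quirks along the path to the root port. A hedged sketch of the gating Leon says was forgotten, with a hypothetical helper name:]

```c
#include <linux/pci.h>

/*
 * Hypothetical helper: only set the RO attribute in device objects
 * (memory keys, QPs, ...) when the device firmware supports relaxed
 * ordering *and* the platform still has it enabled at the PCI level;
 * otherwise behave exactly as if the caller had never asked for it.
 */
static bool hca_can_use_relaxed_ordering(struct pci_dev *pdev, bool fw_cap)
{
	return fw_cap && pcie_relaxed_ordering_enabled(pdev);
}
```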
On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote: > > > > On Apr 5, 2021, at 4:07 PM, Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > On Mon, Apr 05, 2021 at 03:41:15PM +0200, Christoph Hellwig wrote: > >> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote: > >>> From: Leon Romanovsky <leonro@nvidia.com> > >>> > >>>> From Avihai, > >>> > >>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering > >>> imposed on PCI transactions, and thus, can improve performance. > >>> > >>> Until now, relaxed ordering could be set only by user space applications > >>> for user MRs. The following patch series enables relaxed ordering for the > >>> kernel ULPs as well. Relaxed ordering is an optional capability, and as > >>> such, it is ignored by vendors that don't support it. > >>> > >>> The following test results show the performance improvement achieved > >>> with relaxed ordering. The test was performed on a NVIDIA A100 in order > >>> to check performance of storage infrastructure over xprtrdma: > >> > >> Isn't the Nvidia A100 a GPU not actually supported by Linux at all? > >> What does that have to do with storage protocols? > > > > I think it is a typo (or at least mit makes no sense to be talking > > about NFS with a GPU chip) Probably it should be a DGX A100 which is a > > dual socket AMD server with alot of PCIe, and xptrtrdma is a NFS-RDMA > > workload. > > We need to get a better idea what correctness testing has been done, > and whether positive correctness testing results can be replicated > on a variety of platforms. I will ask to provide more details. > > I have an old Haswell dual-socket system in my lab, but otherwise > I'm not sure I have a platform that would be interesting for such a > test. We don't have such old systems too. > > > > AMD dual socket systems are well known to benefit from relaxed > > ordering, people have been doing this in userspace for a while now > > with the opt in. > > > -- > Chuck Lever > > >
On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote: > We need to get a better idea what correctness testing has been done, > and whether positive correctness testing results can be replicated > on a variety of platforms. RO has been rolling out slowly on mlx5 over a few years and storage ULPs are the last to change. eg the mlx5 ethernet driver has had RO turned on for a long time, userspace HPC applications have been using it for a while now too. We know there are platforms with broken RO implementations (like Haswell) but the kernel is supposed to globally turn off RO on all those cases. I'd be a bit surprised if we discover any more from this series. On the other hand there are platforms that get huge speed ups from turning this on, AMD is one example, there are a bunch in the ARM world too. Still, obviously people should test on the platforms they have. Jason
On Tue, Apr 06, 2021 at 08:09:43AM +0300, Leon Romanovsky wrote: > On Tue, Apr 06, 2021 at 10:37:38AM +0800, Honggang LI wrote: > > On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote: > > > From: Leon Romanovsky <leonro@nvidia.com> > > > > > > From Avihai, > > > > > > Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering > > > imposed on PCI transactions, and thus, can improve performance. > > > > > > Until now, relaxed ordering could be set only by user space applications > > > for user MRs. The following patch series enables relaxed ordering for the > > > kernel ULPs as well. Relaxed ordering is an optional capability, and as > > > such, it is ignored by vendors that don't support it. > > > > > > The following test results show the performance improvement achieved > > > > Did you test this patchset with CPU does not support relaxed ordering? > > I don't think so, the CPUs that don't support RO are Intel's fourth/fifth-generation > and they are not interesting from performance point of view. > > > > > We observed significantly performance degradation when run perftest with > > relaxed ordering enabled over old CPU. > > > > https://github.com/linux-rdma/perftest/issues/116 > > The perftest is slightly different, but you pointed to the valid point. > We forgot to call pcie_relaxed_ordering_enabled() before setting RO bit > and arguably this was needed to be done in perftest too. No, the PCI device should not have the RO bit set in this situation. It is something mlx5_core needs to do. We can't push this into applications. There should be no performance difference from asking for IBV_ACCESS_RELAXED_ORDERING when RO is disabled at the PCI config and not asking for it at all. Either the platform has working relaxed ordering that gives a performance gain and the RO config spec bit should be set, or it doesn't and the bit should be clear. This is not something to decide in userspace, or in RDMA. At worst it becomes another platform specific PCI tunable people have to set. I thought the old haswell systems were quirked to disable RO globally anyhow? Jason
On 4/6/2021 7:49 AM, Jason Gunthorpe wrote: > On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote: > >> We need to get a better idea what correctness testing has been done, >> and whether positive correctness testing results can be replicated >> on a variety of platforms. > > RO has been rolling out slowly on mlx5 over a few years and storage > ULPs are the last to change. eg the mlx5 ethernet driver has had RO > turned on for a long time, userspace HPC applications have been using > it for a while now too. I'd love to see RO be used more, it was always something the RDMA specs supported and carefully architected for. My only concern is that it's difficult to get right, especially when the platforms have been running strictly-ordered for so long. The ULPs need testing, and a lot of it. > We know there are platforms with broken RO implementations (like > Haswell) but the kernel is supposed to globally turn off RO on all > those cases. I'd be a bit surprised if we discover any more from this > series. > > On the other hand there are platforms that get huge speed ups from > turning this on, AMD is one example, there are a bunch in the ARM > world too. My belief is that the biggest risk is from situations where completions are batched, and therefore polling is used to detect them without interrupts (which explicitly). The RO pipeline will completely reorder DMA writes, and consumers which infer ordering from memory contents may break. This can even apply within the provider code, which may attempt to poll WR and CQ structures, and be tripped up. The Mellanox adapter, itself, historically has strict in-order DMA semantics, and while it's great to relax that, changing it by default for all consumers is something to consider very cautiously. > Still, obviously people should test on the platforms they have. Yes, and "test" be taken seriously with focus on ULP data integrity. Speedups will mean nothing if the data is ever damaged. Tom.
> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote: > > On 4/6/2021 7:49 AM, Jason Gunthorpe wrote: >> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote: >> >>> We need to get a better idea what correctness testing has been done, >>> and whether positive correctness testing results can be replicated >>> on a variety of platforms. >> RO has been rolling out slowly on mlx5 over a few years and storage >> ULPs are the last to change. eg the mlx5 ethernet driver has had RO >> turned on for a long time, userspace HPC applications have been using >> it for a while now too. > > I'd love to see RO be used more, it was always something the RDMA > specs supported and carefully architected for. My only concern is > that it's difficult to get right, especially when the platforms > have been running strictly-ordered for so long. The ULPs need > testing, and a lot of it. > >> We know there are platforms with broken RO implementations (like >> Haswell) but the kernel is supposed to globally turn off RO on all >> those cases. I'd be a bit surprised if we discover any more from this >> series. >> On the other hand there are platforms that get huge speed ups from >> turning this on, AMD is one example, there are a bunch in the ARM >> world too. > > My belief is that the biggest risk is from situations where completions > are batched, and therefore polling is used to detect them without > interrupts (which explicitly). The RO pipeline will completely reorder > DMA writes, and consumers which infer ordering from memory contents may > break. This can even apply within the provider code, which may attempt > to poll WR and CQ structures, and be tripped up. You are referring specifically to RPC/RDMA depending on Receive completions to guarantee that previous RDMA Writes have been retired? Or is there a particular implementation practice in the Linux RPC/RDMA code that worries you? > The Mellanox adapter, itself, historically has strict in-order DMA > semantics, and while it's great to relax that, changing it by default > for all consumers is something to consider very cautiously. > >> Still, obviously people should test on the platforms they have. > > Yes, and "test" be taken seriously with focus on ULP data integrity. > Speedups will mean nothing if the data is ever damaged. I agree that data integrity comes first. Since I currently don't have facilities to test RO in my lab, the community will have to agree on a set of tests and expected results that specifically exercise the corner cases you are concerned about. -- Chuck Lever
On 4/9/2021 10:45 AM, Chuck Lever III wrote: > > >> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote: >> >> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote: >>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote: >>> >>>> We need to get a better idea what correctness testing has been done, >>>> and whether positive correctness testing results can be replicated >>>> on a variety of platforms. >>> RO has been rolling out slowly on mlx5 over a few years and storage >>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO >>> turned on for a long time, userspace HPC applications have been using >>> it for a while now too. >> >> I'd love to see RO be used more, it was always something the RDMA >> specs supported and carefully architected for. My only concern is >> that it's difficult to get right, especially when the platforms >> have been running strictly-ordered for so long. The ULPs need >> testing, and a lot of it. >> >>> We know there are platforms with broken RO implementations (like >>> Haswell) but the kernel is supposed to globally turn off RO on all >>> those cases. I'd be a bit surprised if we discover any more from this >>> series. >>> On the other hand there are platforms that get huge speed ups from >>> turning this on, AMD is one example, there are a bunch in the ARM >>> world too. >> >> My belief is that the biggest risk is from situations where completions >> are batched, and therefore polling is used to detect them without >> interrupts (which explicitly). The RO pipeline will completely reorder >> DMA writes, and consumers which infer ordering from memory contents may >> break. This can even apply within the provider code, which may attempt >> to poll WR and CQ structures, and be tripped up. > > You are referring specifically to RPC/RDMA depending on Receive > completions to guarantee that previous RDMA Writes have been > retired? Or is there a particular implementation practice in > the Linux RPC/RDMA code that worries you? Nothing in the RPC/RDMA code, which is IMO correct. The worry, which is hopefully unfounded, is that the RO pipeline might not have flushed when a completion is posted *after* posting an interrupt. Something like this... RDMA Write arrives PCIe RO Write for data PCIe RO Write for data ... RDMA Write arrives PCIe RO Write for data ... RDMA Send arrives PCIe RO Write for receive data PCIe RO Write for receive descriptor PCIe interrupt (flushes RO pipeline for all three ops above) RPC/RDMA polls CQ Reaps receive completion RDMA Send arrives PCIe RO Write for receive data PCIe RO write for receive descriptor Does *not* interrupt, since CQ not armed RPC/RDMA continues to poll CQ Reaps receive completion PCIe RO writes not yet flushed Processes incomplete in-memory data Bzzzt Hopefully, the adapter performs a PCIe flushing read, or something to avoid this when an interrupt is not generated. Alternatively, I'm overly paranoid. Tom. >> The Mellanox adapter, itself, historically has strict in-order DMA >> semantics, and while it's great to relax that, changing it by default >> for all consumers is something to consider very cautiously. >> >>> Still, obviously people should test on the platforms they have. >> >> Yes, and "test" be taken seriously with focus on ULP data integrity. >> Speedups will mean nothing if the data is ever damaged. > > I agree that data integrity comes first. 
> > Since I currently don't have facilities to test RO in my lab, the > community will have to agree on a set of tests and expected results > that specifically exercise the corner cases you are concerned about. > > > -- > Chuck Lever > > > >
> On 9 Apr 2021, at 17:32, Tom Talpey <tom@talpey.com> wrote: > > On 4/9/2021 10:45 AM, Chuck Lever III wrote: >>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote: >>> >>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote: >>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote: >>>> >>>>> We need to get a better idea what correctness testing has been done, >>>>> and whether positive correctness testing results can be replicated >>>>> on a variety of platforms. >>>> RO has been rolling out slowly on mlx5 over a few years and storage >>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO >>>> turned on for a long time, userspace HPC applications have been using >>>> it for a while now too. >>> >>> I'd love to see RO be used more, it was always something the RDMA >>> specs supported and carefully architected for. My only concern is >>> that it's difficult to get right, especially when the platforms >>> have been running strictly-ordered for so long. The ULPs need >>> testing, and a lot of it. >>> >>>> We know there are platforms with broken RO implementations (like >>>> Haswell) but the kernel is supposed to globally turn off RO on all >>>> those cases. I'd be a bit surprised if we discover any more from this >>>> series. >>>> On the other hand there are platforms that get huge speed ups from >>>> turning this on, AMD is one example, there are a bunch in the ARM >>>> world too. >>> >>> My belief is that the biggest risk is from situations where completions >>> are batched, and therefore polling is used to detect them without >>> interrupts (which explicitly). The RO pipeline will completely reorder >>> DMA writes, and consumers which infer ordering from memory contents may >>> break. This can even apply within the provider code, which may attempt >>> to poll WR and CQ structures, and be tripped up. >> You are referring specifically to RPC/RDMA depending on Receive >> completions to guarantee that previous RDMA Writes have been >> retired? Or is there a particular implementation practice in >> the Linux RPC/RDMA code that worries you? > > Nothing in the RPC/RDMA code, which is IMO correct. The worry, which > is hopefully unfounded, is that the RO pipeline might not have flushed > when a completion is posted *after* posting an interrupt. > > Something like this... > > RDMA Write arrives > PCIe RO Write for data > PCIe RO Write for data > ... > RDMA Write arrives > PCIe RO Write for data > ... > RDMA Send arrives > PCIe RO Write for receive data > PCIe RO Write for receive descriptor Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then it will shure prior written RO date has global visibility when the CQE can be observed. > PCIe interrupt (flushes RO pipeline for all three ops above) Before the interrupt, the HCA will write the EQE (Event Queue Entry). This has to be a Strongly Ordered write to "push" prior written CQEs so that when the EQE is observed, the prior writes of CQEs have global visibility. And the MSI-X write likewise, to avoid spurious interrupts. 
Thxs, Håkon > > RPC/RDMA polls CQ > Reaps receive completion > > RDMA Send arrives > PCIe RO Write for receive data > PCIe RO write for receive descriptor > Does *not* interrupt, since CQ not armed > > RPC/RDMA continues to poll CQ > Reaps receive completion > PCIe RO writes not yet flushed > Processes incomplete in-memory data > Bzzzt > > Hopefully, the adapter performs a PCIe flushing read, or something > to avoid this when an interrupt is not generated. Alternatively, I'm > overly paranoid. > > Tom. > >>> The Mellanox adapter, itself, historically has strict in-order DMA >>> semantics, and while it's great to relax that, changing it by default >>> for all consumers is something to consider very cautiously. >>> >>>> Still, obviously people should test on the platforms they have. >>> >>> Yes, and "test" be taken seriously with focus on ULP data integrity. >>> Speedups will mean nothing if the data is ever damaged. >> I agree that data integrity comes first. >> Since I currently don't have facilities to test RO in my lab, the >> community will have to agree on a set of tests and expected results >> that specifically exercise the corner cases you are concerned about. >> -- >> Chuck Lever
On Fri, Apr 09, 2021 at 10:26:21AM -0400, Tom Talpey wrote: > My belief is that the biggest risk is from situations where completions > are batched, and therefore polling is used to detect them without > interrupts (which explicitly). We don't do this in the kernel. All kernel ULPs only read data after they observe the CQE. We do not have "last data polling" and our interrupt model does not support some hacky "interrupt means go and use the data" approach. ULPs have to be designed this way to use the DMA API properly. Fencing a DMA before it is completed by the HW will cause IOMMU errors. Userspace is a different story, but that will remain as-is with optional relaxed ordering. Jason
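[Editor's note: the DMA API rule Jason cites is the usual ownership handoff: a streaming mapping belongs to the device until it is unmapped or synced back to the CPU, so reading the buffer before the completion is already a bug regardless of relaxed ordering. A generic sketch:]

```c
#include <linux/dma-mapping.h>

/*
 * Ownership handoff for a streaming DMA buffer: between dma_map_single()
 * and dma_unmap_single() the device owns the memory and the CPU must not
 * read it.  A ULP therefore only looks at the payload after it has
 * observed the completion and returned ownership to the CPU.
 */
static void ulp_complete_rx(struct device *dev, void *buf, size_t len,
			    dma_addr_t dma)
{
	dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);
	/* only from this point on may buf[] be parsed */
}
```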
On 4/9/2021 12:40 PM, Jason Gunthorpe wrote: > On Fri, Apr 09, 2021 at 10:26:21AM -0400, Tom Talpey wrote: > >> My belief is that the biggest risk is from situations where completions >> are batched, and therefore polling is used to detect them without >> interrupts (which explicitly). > > We don't do this in the kernel. > > All kernel ULPs only read data after they observe the CQE. We do not > have "last data polling" and our interrupt model does not support some > hacky "interrupt means go and use the data" approach. > > ULPs have to be designed this way to use the DMA API properly. Yep. Totally agree. My concern was about the data being written as relaxed, and the CQE racing it. I'll reply in the other fork. > Fencing a DMA before it is completed by the HW will cause IOMMU > errors. > > Userspace is a different story, but that will remain as-is with > optional relaxed ordering. > > Jason >
On 4/9/2021 12:27 PM, Haakon Bugge wrote: > > >> On 9 Apr 2021, at 17:32, Tom Talpey <tom@talpey.com> wrote: >> >> On 4/9/2021 10:45 AM, Chuck Lever III wrote: >>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote: >>>> >>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote: >>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote: >>>>> >>>>>> We need to get a better idea what correctness testing has been done, >>>>>> and whether positive correctness testing results can be replicated >>>>>> on a variety of platforms. >>>>> RO has been rolling out slowly on mlx5 over a few years and storage >>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO >>>>> turned on for a long time, userspace HPC applications have been using >>>>> it for a while now too. >>>> >>>> I'd love to see RO be used more, it was always something the RDMA >>>> specs supported and carefully architected for. My only concern is >>>> that it's difficult to get right, especially when the platforms >>>> have been running strictly-ordered for so long. The ULPs need >>>> testing, and a lot of it. >>>> >>>>> We know there are platforms with broken RO implementations (like >>>>> Haswell) but the kernel is supposed to globally turn off RO on all >>>>> those cases. I'd be a bit surprised if we discover any more from this >>>>> series. >>>>> On the other hand there are platforms that get huge speed ups from >>>>> turning this on, AMD is one example, there are a bunch in the ARM >>>>> world too. >>>> >>>> My belief is that the biggest risk is from situations where completions >>>> are batched, and therefore polling is used to detect them without >>>> interrupts (which explicitly). The RO pipeline will completely reorder >>>> DMA writes, and consumers which infer ordering from memory contents may >>>> break. This can even apply within the provider code, which may attempt >>>> to poll WR and CQ structures, and be tripped up. >>> You are referring specifically to RPC/RDMA depending on Receive >>> completions to guarantee that previous RDMA Writes have been >>> retired? Or is there a particular implementation practice in >>> the Linux RPC/RDMA code that worries you? >> >> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which >> is hopefully unfounded, is that the RO pipeline might not have flushed >> when a completion is posted *after* posting an interrupt. >> >> Something like this... >> >> RDMA Write arrives >> PCIe RO Write for data >> PCIe RO Write for data >> ... >> RDMA Write arrives >> PCIe RO Write for data >> ... >> RDMA Send arrives >> PCIe RO Write for receive data >> PCIe RO Write for receive descriptor > > Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then it will shure prior written RO date has global visibility when the CQE can be observed. I wasn't aware that a strongly-ordered PCIe Write will ensure that prior relaxed-ordered writes went first. If that's the case, I'm fine with it - as long as the providers are correctly coded!! >> PCIe interrupt (flushes RO pipeline for all three ops above) > > Before the interrupt, the HCA will write the EQE (Event Queue Entry). This has to be a Strongly Ordered write to "push" prior written CQEs so that when the EQE is observed, the prior writes of CQEs have global visibility. > > And the MSI-X write likewise, to avoid spurious interrupts. Ok, and yes agreed the same principle would apply. Is there any implication if a PCIe switch were present on the motherboard? 
The switch is allowed to do some creative routing if the operation is relaxed, correct? Tom. > Thxs, Håkon > >> >> RPC/RDMA polls CQ >> Reaps receive completion >> >> RDMA Send arrives >> PCIe RO Write for receive data >> PCIe RO write for receive descriptor >> Does *not* interrupt, since CQ not armed >> >> RPC/RDMA continues to poll CQ >> Reaps receive completion >> PCIe RO writes not yet flushed >> Processes incomplete in-memory data >> Bzzzt >> >> Hopefully, the adapter performs a PCIe flushing read, or something >> to avoid this when an interrupt is not generated. Alternatively, I'm >> overly paranoid. >> >> Tom. >> >>>> The Mellanox adapter, itself, historically has strict in-order DMA >>>> semantics, and while it's great to relax that, changing it by default >>>> for all consumers is something to consider very cautiously. >>>> >>>>> Still, obviously people should test on the platforms they have. >>>> >>>> Yes, and "test" be taken seriously with focus on ULP data integrity. >>>> Speedups will mean nothing if the data is ever damaged. >>> I agree that data integrity comes first. >>> Since I currently don't have facilities to test RO in my lab, the >>> community will have to agree on a set of tests and expected results >>> that specifically exercise the corner cases you are concerned about. >>> -- >>> Chuck Lever >
From: Tom Talpey > Sent: 09 April 2021 18:49 > On 4/9/2021 12:27 PM, Haakon Bugge wrote: > > > > > >> On 9 Apr 2021, at 17:32, Tom Talpey <tom@talpey.com> wrote: > >> > >> On 4/9/2021 10:45 AM, Chuck Lever III wrote: > >>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote: > >>>> > >>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote: > >>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote: > >>>>> > >>>>>> We need to get a better idea what correctness testing has been done, > >>>>>> and whether positive correctness testing results can be replicated > >>>>>> on a variety of platforms. > >>>>> RO has been rolling out slowly on mlx5 over a few years and storage > >>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO > >>>>> turned on for a long time, userspace HPC applications have been using > >>>>> it for a while now too. > >>>> > >>>> I'd love to see RO be used more, it was always something the RDMA > >>>> specs supported and carefully architected for. My only concern is > >>>> that it's difficult to get right, especially when the platforms > >>>> have been running strictly-ordered for so long. The ULPs need > >>>> testing, and a lot of it. > >>>> > >>>>> We know there are platforms with broken RO implementations (like > >>>>> Haswell) but the kernel is supposed to globally turn off RO on all > >>>>> those cases. I'd be a bit surprised if we discover any more from this > >>>>> series. > >>>>> On the other hand there are platforms that get huge speed ups from > >>>>> turning this on, AMD is one example, there are a bunch in the ARM > >>>>> world too. > >>>> > >>>> My belief is that the biggest risk is from situations where completions > >>>> are batched, and therefore polling is used to detect them without > >>>> interrupts (which explicitly). The RO pipeline will completely reorder > >>>> DMA writes, and consumers which infer ordering from memory contents may > >>>> break. This can even apply within the provider code, which may attempt > >>>> to poll WR and CQ structures, and be tripped up. > >>> You are referring specifically to RPC/RDMA depending on Receive > >>> completions to guarantee that previous RDMA Writes have been > >>> retired? Or is there a particular implementation practice in > >>> the Linux RPC/RDMA code that worries you? > >> > >> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which > >> is hopefully unfounded, is that the RO pipeline might not have flushed > >> when a completion is posted *after* posting an interrupt. > >> > >> Something like this... > >> > >> RDMA Write arrives > >> PCIe RO Write for data > >> PCIe RO Write for data > >> ... > >> RDMA Write arrives > >> PCIe RO Write for data > >> ... > >> RDMA Send arrives > >> PCIe RO Write for receive data > >> PCIe RO Write for receive descriptor > > > > Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then > it will shure prior written RO date has global visibility when the CQE can be observed. > > I wasn't aware that a strongly-ordered PCIe Write will ensure that > prior relaxed-ordered writes went first. If that's the case, I'm > fine with it - as long as the providers are correctly coded!! I remember trying to read the relevant section of the PCIe spec. (Possibly in a book that was trying to make it easier to understand!) It is about as clear as mud. I presume this is all about allowing PCIe targets (eg ethernet cards) to use relaxed ordering on write requests to host memory. 
And that such writes can be completed out of order? It isn't entirely clear that you aren't talking of letting the cpu do 'relaxed order' writes to PCIe targets! For a typical ethernet driver the receive interrupt just means 'go and look at the receive descriptor ring'. So there is an absolute requirement that the writes for data buffer complete before the write to the receive descriptor. There is no requirement for the interrupt (requested after the descriptor write) to have been seen by the cpu. Quite often the driver will find the 'receive complete' descriptor when processing frames from an earlier interrupt (and nothing to do in response to the interrupt itself). So the write to the receive descriptor would have to have RO clear to ensure that all the buffer writes complete first. (The furthest I've got into PCIe internals was fixing the bug in some vendor-supplied FPGA logic that failed to correctly handle multiple data TLP responses to a single read TLP. Fortunately it wasn't in the hard-IP bit.) David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
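[Editor's note: the Ethernet receive flow David describes has a CPU-side mirror image: the driver first observes the (non-RO) descriptor write, then orders its data reads behind it with dma_rmb(). A sketch with a hypothetical descriptor layout:]

```c
#include <linux/compiler.h>
#include <linux/types.h>
#include <asm/barrier.h>

struct rx_desc {		/* hypothetical descriptor layout */
	u32 len;
	u32 status;		/* bit 0: descriptor done (DD) */
};
#define RX_DESC_DD	0x1u

/*
 * The NIC writes the packet data (possibly as RO TLPs) and only then the
 * descriptor as a non-RO TLP, so seeing DD set means the data is globally
 * visible.  dma_rmb() keeps the CPU from reordering the subsequent data
 * loads ahead of the DD check.
 */
static bool rx_desc_ready(const struct rx_desc *desc)
{
	if (!(READ_ONCE(desc->status) & RX_DESC_DD))
		return false;
	dma_rmb();
	return true;
}
```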
On 4/6/2021 2:53 PM, Jason Gunthorpe wrote: > On Tue, Apr 06, 2021 at 08:09:43AM +0300, Leon Romanovsky wrote: >> On Tue, Apr 06, 2021 at 10:37:38AM +0800, Honggang LI wrote: >>> On Mon, Apr 05, 2021 at 08:23:54AM +0300, Leon Romanovsky wrote: >>>> From: Leon Romanovsky <leonro@nvidia.com> >>>> >>>> From Avihai, >>>> >>>> Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering >>>> imposed on PCI transactions, and thus, can improve performance. >>>> >>>> Until now, relaxed ordering could be set only by user space applications >>>> for user MRs. The following patch series enables relaxed ordering for the >>>> kernel ULPs as well. Relaxed ordering is an optional capability, and as >>>> such, it is ignored by vendors that don't support it. >>>> >>>> The following test results show the performance improvement achieved >>> Did you test this patchset with CPU does not support relaxed ordering? >> I don't think so, the CPUs that don't support RO are Intel's fourth/fifth-generation >> and they are not interesting from performance point of view. >> >>> We observed significantly performance degradation when run perftest with >>> relaxed ordering enabled over old CPU. >>> >>> https://github.com/linux-rdma/perftest/issues/116 >> The perftest is slightly different, but you pointed to the valid point. >> We forgot to call pcie_relaxed_ordering_enabled() before setting RO bit >> and arguably this was needed to be done in perftest too. > No, the PCI device should not have the RO bit set in this situation. > It is something mlx5_core needs to do. We can't push this into > applications. pcie_relaxed_ordering_enabled is called in drivers/net/ethernet/mellanox/mlx5/core/en_common.c so probably need to move it to mlx5_core in this series. > > There should be no performance difference from asking for > IBV_ACCESS_RELAXED_ORDERING when RO is disabled at the PCI config and > not asking for it at all. > > Either the platform has working relaxed ordering that gives a > performance gain and the RO config spec bit should be set, or it > doesn't and the bit should be clear. is this the case today ? > > This is not something to decide in userspace, or in RDMA. At worst it > becomes another platform specific PCI tunable people have to set. > > I thought the old haswell systems were quirked to disable RO globally > anyhow? > > Jason
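[Editor's note: the en_common.c logic Max refers to gates the mkey's relaxed-ordering bits on both the firmware capability and pcie_relaxed_ordering_enabled(). The sketch below is paraphrased from memory of the mlx5 code and may not match the exact upstream helper names or capability layout.]

```c
#include <linux/mlx5/driver.h>
#include <linux/mlx5/mlx5_ifc.h>

/*
 * Paraphrased mlx5 pattern: set the relaxed-ordering bits in a memory
 * key context only when firmware advertises the capability and the
 * platform has not disabled RO at the PCI level.
 */
static void set_mkc_relaxed_ordering(struct mlx5_core_dev *mdev, void *mkc)
{
	bool pci_ro = pcie_relaxed_ordering_enabled(mdev->pdev);

	if (pci_ro && MLX5_CAP_GEN(mdev, relaxed_ordering_write))
		MLX5_SET(mkc, mkc, relaxed_ordering_write, 1);
	if (pci_ro && MLX5_CAP_GEN(mdev, relaxed_ordering_read))
		MLX5_SET(mkc, mkc, relaxed_ordering_read, 1);
}
```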
> On 10 Apr 2021, at 15:30, David Laight <David.Laight@aculab.com> wrote: > > From: Tom Talpey >> Sent: 09 April 2021 18:49 >> On 4/9/2021 12:27 PM, Haakon Bugge wrote: >>> >>> >>>> On 9 Apr 2021, at 17:32, Tom Talpey <tom@talpey.com> wrote: >>>> >>>> On 4/9/2021 10:45 AM, Chuck Lever III wrote: >>>>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote: >>>>>> >>>>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote: >>>>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote: >>>>>>> >>>>>>>> We need to get a better idea what correctness testing has been done, >>>>>>>> and whether positive correctness testing results can be replicated >>>>>>>> on a variety of platforms. >>>>>>> RO has been rolling out slowly on mlx5 over a few years and storage >>>>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO >>>>>>> turned on for a long time, userspace HPC applications have been using >>>>>>> it for a while now too. >>>>>> >>>>>> I'd love to see RO be used more, it was always something the RDMA >>>>>> specs supported and carefully architected for. My only concern is >>>>>> that it's difficult to get right, especially when the platforms >>>>>> have been running strictly-ordered for so long. The ULPs need >>>>>> testing, and a lot of it. >>>>>> >>>>>>> We know there are platforms with broken RO implementations (like >>>>>>> Haswell) but the kernel is supposed to globally turn off RO on all >>>>>>> those cases. I'd be a bit surprised if we discover any more from this >>>>>>> series. >>>>>>> On the other hand there are platforms that get huge speed ups from >>>>>>> turning this on, AMD is one example, there are a bunch in the ARM >>>>>>> world too. >>>>>> >>>>>> My belief is that the biggest risk is from situations where completions >>>>>> are batched, and therefore polling is used to detect them without >>>>>> interrupts (which explicitly). The RO pipeline will completely reorder >>>>>> DMA writes, and consumers which infer ordering from memory contents may >>>>>> break. This can even apply within the provider code, which may attempt >>>>>> to poll WR and CQ structures, and be tripped up. >>>>> You are referring specifically to RPC/RDMA depending on Receive >>>>> completions to guarantee that previous RDMA Writes have been >>>>> retired? Or is there a particular implementation practice in >>>>> the Linux RPC/RDMA code that worries you? >>>> >>>> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which >>>> is hopefully unfounded, is that the RO pipeline might not have flushed >>>> when a completion is posted *after* posting an interrupt. >>>> >>>> Something like this... >>>> >>>> RDMA Write arrives >>>> PCIe RO Write for data >>>> PCIe RO Write for data >>>> ... >>>> RDMA Write arrives >>>> PCIe RO Write for data >>>> ... >>>> RDMA Send arrives >>>> PCIe RO Write for receive data >>>> PCIe RO Write for receive descriptor >>> >>> Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then >> it will shure prior written RO date has global visibility when the CQE can be observed. >> >> I wasn't aware that a strongly-ordered PCIe Write will ensure that >> prior relaxed-ordered writes went first. If that's the case, I'm >> fine with it - as long as the providers are correctly coded!! The PCIe spec (Table Ordering Rules Summary) is quite clear here (A Posted request is Memory Write Request in this context): A Posted Request must not pass another Posted Request unless A2b applies. 
A2b: A Posted Request with RO Set is permitted to pass another Posted Request. Thxs, Håkon > > I remember trying to read the relevant section of the PCIe spec. > (Possibly in a book that was trying to make it easier to understand!) > It is about as clear as mud. > > I presume this is all about allowing PCIe targets (eg ethernet cards) > to use relaxed ordering on write requests to host memory. > And that such writes can be completed out of order? > > It isn't entirely clear that you aren't talking of letting the > cpu do 'relaxed order' writes to PCIe targets! > > For a typical ethernet driver the receive interrupt just means > 'go and look at the receive descriptor ring'. > So there is an absolute requirement that the writes for data > buffer complete before the write to the receive descriptor. > There is no requirement for the interrupt (requested after the > descriptor write) to have been seen by the cpu. > > Quite often the driver will find the 'receive complete' > descriptor when processing frames from an earlier interrupt > (and nothing to do in response to the interrupt itself). > > So the write to the receive descriptor would have to have RO clear > to ensure that all the buffer writes complete first. > > (The furthest I've got into PCIe internals was fixing the bug > in some vendor-supplied FPGA logic that failed to correctly > handle multiple data TLP responses to a single read TLP. > Fortunately it wasn't in the hard-IP bit.) > > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales)
On 4/12/2021 2:32 PM, Haakon Bugge wrote: > > >> On 10 Apr 2021, at 15:30, David Laight <David.Laight@aculab.com> wrote: >> >> From: Tom Talpey >>> Sent: 09 April 2021 18:49 >>> On 4/9/2021 12:27 PM, Haakon Bugge wrote: >>>> >>>> >>>>> On 9 Apr 2021, at 17:32, Tom Talpey <tom@talpey.com> wrote: >>>>> >>>>> On 4/9/2021 10:45 AM, Chuck Lever III wrote: >>>>>>> On Apr 9, 2021, at 10:26 AM, Tom Talpey <tom@talpey.com> wrote: >>>>>>> >>>>>>> On 4/6/2021 7:49 AM, Jason Gunthorpe wrote: >>>>>>>> On Mon, Apr 05, 2021 at 11:42:31PM +0000, Chuck Lever III wrote: >>>>>>>> >>>>>>>>> We need to get a better idea what correctness testing has been done, >>>>>>>>> and whether positive correctness testing results can be replicated >>>>>>>>> on a variety of platforms. >>>>>>>> RO has been rolling out slowly on mlx5 over a few years and storage >>>>>>>> ULPs are the last to change. eg the mlx5 ethernet driver has had RO >>>>>>>> turned on for a long time, userspace HPC applications have been using >>>>>>>> it for a while now too. >>>>>>> >>>>>>> I'd love to see RO be used more, it was always something the RDMA >>>>>>> specs supported and carefully architected for. My only concern is >>>>>>> that it's difficult to get right, especially when the platforms >>>>>>> have been running strictly-ordered for so long. The ULPs need >>>>>>> testing, and a lot of it. >>>>>>> >>>>>>>> We know there are platforms with broken RO implementations (like >>>>>>>> Haswell) but the kernel is supposed to globally turn off RO on all >>>>>>>> those cases. I'd be a bit surprised if we discover any more from this >>>>>>>> series. >>>>>>>> On the other hand there are platforms that get huge speed ups from >>>>>>>> turning this on, AMD is one example, there are a bunch in the ARM >>>>>>>> world too. >>>>>>> >>>>>>> My belief is that the biggest risk is from situations where completions >>>>>>> are batched, and therefore polling is used to detect them without >>>>>>> interrupts (which explicitly). The RO pipeline will completely reorder >>>>>>> DMA writes, and consumers which infer ordering from memory contents may >>>>>>> break. This can even apply within the provider code, which may attempt >>>>>>> to poll WR and CQ structures, and be tripped up. >>>>>> You are referring specifically to RPC/RDMA depending on Receive >>>>>> completions to guarantee that previous RDMA Writes have been >>>>>> retired? Or is there a particular implementation practice in >>>>>> the Linux RPC/RDMA code that worries you? >>>>> >>>>> Nothing in the RPC/RDMA code, which is IMO correct. The worry, which >>>>> is hopefully unfounded, is that the RO pipeline might not have flushed >>>>> when a completion is posted *after* posting an interrupt. >>>>> >>>>> Something like this... >>>>> >>>>> RDMA Write arrives >>>>> PCIe RO Write for data >>>>> PCIe RO Write for data >>>>> ... >>>>> RDMA Write arrives >>>>> PCIe RO Write for data >>>>> ... >>>>> RDMA Send arrives >>>>> PCIe RO Write for receive data >>>>> PCIe RO Write for receive descriptor >>>> >>>> Do you mean the Write of the CQE? It has to be Strongly Ordered for a correct implementation. Then >>> it will shure prior written RO date has global visibility when the CQE can be observed. >>> >>> I wasn't aware that a strongly-ordered PCIe Write will ensure that >>> prior relaxed-ordered writes went first. If that's the case, I'm >>> fine with it - as long as the providers are correctly coded!! 
> > The PCIe spec (Table Ordering Rules Summary) is quite clear here (A Posted request is Memory Write Request in this context): > > A Posted Request must not pass another Posted Request unless A2b applies. > > A2b: A Posted Request with RO Set is permitted to pass another Posted Request. > > > Thxs, Håkon Ok, good - a non-RO write (for example, to a CQE), or an interrupt (which would be similarly non-RO), will "get behind" all prior writes. So the issue is only in testing all the providers and platforms, to be sure this new behavior isn't tickling anything that went unnoticed all along, because no RDMA provider ever issued RO. Honestly, the Haswell sounds like a great first candidate, because if it has a known-broken RO behavior, verifying that it works with this change is highly important. I'd have greater confidence in newer platforms, in other words. They *all* have to work, proveably. Tom. >> I remember trying to read the relevant section of the PCIe spec. >> (Possibly in a book that was trying to make it easier to understand!) >> It is about as clear as mud. >> >> I presume this is all about allowing PCIe targets (eg ethernet cards) >> to use relaxed ordering on write requests to host memory. >> And that such writes can be completed out of order? >> >> It isn't entirely clear that you aren't talking of letting the >> cpu do 'relaxed order' writes to PCIe targets! >> >> For a typical ethernet driver the receive interrupt just means >> 'go and look at the receive descriptor ring'. >> So there is an absolute requirement that the writes for data >> buffer complete before the write to the receive descriptor. >> There is no requirement for the interrupt (requested after the >> descriptor write) to have been seen by the cpu. >> >> Quite often the driver will find the 'receive complete' >> descriptor when processing frames from an earlier interrupt >> (and nothing to do in response to the interrupt itself). >> >> So the write to the receive descriptor would have to have RO clear >> to ensure that all the buffer writes complete first. >> >> (The furthest I've got into PCIe internals was fixing the bug >> in some vendor-supplied FPGA logic that failed to correctly >> handle multiple data TLP responses to a single read TLP. >> Fortunately it wasn't in the hard-IP bit.) >> >> David >> >> - >> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK >> Registration No: 1397386 (Wales) >
On Mon, Apr 12, 2021 at 04:20:47PM -0400, Tom Talpey wrote: > So the issue is only in testing all the providers and platforms, > to be sure this new behavior isn't tickling anything that went > unnoticed all along, because no RDMA provider ever issued RO. The mlx5 ethernet driver has run in RO mode for a long time, and it operates in basically the same way as RDMA. The issues with Haswell have been worked out there already. The only open question is if the ULPs have errors in their implementation, which I don't think we can find out until we apply this series and people start running their tests aggressively. Jason
On 4/12/2021 6:48 PM, Jason Gunthorpe wrote: > On Mon, Apr 12, 2021 at 04:20:47PM -0400, Tom Talpey wrote: > >> So the issue is only in testing all the providers and platforms, >> to be sure this new behavior isn't tickling anything that went >> unnoticed all along, because no RDMA provider ever issued RO. > > The mlx5 ethernet driver has run in RO mode for a long time, and it > operates in basically the same way as RDMA. The issues with Haswell > have been worked out there already. > > The only open question is if the ULPs have errors in their > implementation, which I don't think we can find out until we apply > this series and people start running their tests aggressively. I agree that the core RO support should go in. But turning it on by default for a ULP should be the decision of each ULP maintainer. It's a huge risk to shift all the storage drivers overnight. How do you propose to ensure the aggressive testing happens? One thing that worries me is the patch02 on-by-default for the dma_lkey. There's no way for a ULP to prevent IB_ACCESS_RELAXED_ORDERING from being set in __ib_alloc_pd(). Tom.
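[Editor's note: as Tom describes it, the patch in question changes the access flags the core uses when it has to register the all-of-memory DMA MR for devices without a local_dma_lkey. Conceptually (paraphrased, hypothetical helper name, not the literal diff) it amounts to:]

```c
#include <rdma/ib_verbs.h>

/*
 * Paraphrase of the concern: when __ib_alloc_pd() must create the global
 * DMA MR itself, the series makes it request relaxed ordering
 * unconditionally, and nothing in the PD allocation API lets a ULP clear
 * the bit again.
 */
static unsigned int pd_dma_mr_access_flags(void)
{
	return IB_ACCESS_LOCAL_WRITE | IB_ACCESS_RELAXED_ORDERING;
}
```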
From: Tom Talpey > Sent: 14 April 2021 15:16 > > On 4/12/2021 6:48 PM, Jason Gunthorpe wrote: > > On Mon, Apr 12, 2021 at 04:20:47PM -0400, Tom Talpey wrote: > > > >> So the issue is only in testing all the providers and platforms, > >> to be sure this new behavior isn't tickling anything that went > >> unnoticed all along, because no RDMA provider ever issued RO. > > > > The mlx5 ethernet driver has run in RO mode for a long time, and it > > operates in basically the same way as RDMA. The issues with Haswell > > have been worked out there already. > > > > The only open question is if the ULPs have errors in their > > implementation, which I don't think we can find out until we apply > > this series and people start running their tests aggressively. > > I agree that the core RO support should go in. But turning it on > by default for a ULP should be the decision of each ULP maintainer. > It's a huge risk to shift all the storage drivers overnight. How > do you propose to ensure the aggressive testing happens? > > One thing that worries me is the patch02 on-by-default for the dma_lkey. > There's no way for a ULP to prevent IB_ACCESS_RELAXED_ORDERING > from being set in __ib_alloc_pd(). What is a ULP in this context? I've presumed that this is all about getting PCIe targets (ie cards) to set the RO (relaxed ordering) bit in some of the write TLPs they generate for writing to host memory? So whatever driver initialises the target needs to configure whatever target-specific register enables the RO transfers themselves. After that there could be flags in the PCIe config space of the target and any bridges that clear the RO flag. There could also be flags in the bridges and root complex to ignore the RO flag even if it is set. Then the Linux kernel can have option(s) to tell the driver not to enable RO - even though the driver believes it should all work. This could be a single global flag, or fine-grained in some way. So what exactly is this patch series doing? David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
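On the config-space side David asks about, the PCI core already has the relevant helpers; a small sketch (the two wrapper functions are invented for illustration, while the pcie_* helpers and the PCI_EXP_DEVCTL_RELAX_EN bit are existing kernel interfaces):

#include <linux/pci.h>

/* True if the endpoint's Device Control "Enable Relaxed Ordering" bit
 * is still set. The PCI core clears this bit at enumeration time
 * behind root complexes with known RO problems, so checking it also
 * covers those quirks. */
static bool ro_usable(struct pci_dev *pdev)
{
        return pcie_relaxed_ordering_enabled(pdev);
}

/* Force strict ordering for one device by clearing the enable bit;
 * the device must then emit all its write TLPs with RO clear. */
static void force_strict_ordering(struct pci_dev *pdev)
{
        pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL,
                                   PCI_EXP_DEVCTL_RELAX_EN);
}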
On Wed, Apr 14, 2021 at 10:16:28AM -0400, Tom Talpey wrote: > On 4/12/2021 6:48 PM, Jason Gunthorpe wrote: > > On Mon, Apr 12, 2021 at 04:20:47PM -0400, Tom Talpey wrote: > > > > > So the issue is only in testing all the providers and platforms, > > > to be sure this new behavior isn't tickling anything that went > > > unnoticed all along, because no RDMA provider ever issued RO. > > > > The mlx5 ethernet driver has run in RO mode for a long time, and it > > operates in basically the same way as RDMA. The issues with Haswell > > have been worked out there already. > > > > The only open question is if the ULPs have errors in their > > implementation, which I don't think we can find out until we apply > > this series and people start running their tests aggressively. > > I agree that the core RO support should go in. But turning it on > by default for a ULP should be the decision of each ULP maintainer. > It's a huge risk to shift all the storage drivers overnight. How > do you propose to ensure the aggressive testing happens? Realistically we do test most of the RDMA storage ULPs at NVIDIA over mlx5 which is the only HW that will enable this for now. I disagree it is a "huge risk". Additional wider testing is welcomed and can happen over the 16 week release cycle for a kernel. I would aim to get the relaxed ordering change merged to linux-next a week or so after the merge window. Further testing happens before these changes would get picked up in a distro on something like MLNX_OFED. I don't think we need to make the patch design worse or overthink the submission process for something that, so far, hasn't discovered any issues and already has a proven track record in other ULPs. Any storage ULP that has a problem here is mis-using verbs and the DMA API and thus has an existing data-corruption bug that they are simply lucky to have not yet discovered. > One thing that worries me is the patch02 on-by-default for the dma_lkey. > There's no way for a ULP to prevent IB_ACCESS_RELAXED_ORDERING > from being set in __ib_alloc_pd(). The ULPs are being forced into relaxed_ordering. They don't get to turn it off one by one. The v2 will be more explicit about this as there will be no ULP patches, just the verbs core code being updated. Jason
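The "mis-using verbs and the DMA API" point is worth spelling out: a receive buffer DMA-mapped for the device may only be inspected by the CPU after the corresponding completion has been observed and the mapping unmapped (or synced back). A generic sketch of the pattern that is already required today, RO or not; rx_done(), process() and the parameters are placeholders, not taken from any ULP in the series:

#include <linux/dma-mapping.h>

void process(void *buf, size_t len);    /* placeholder consumer */

/* Called only after the completion for this buffer was observed.
 * Touching buf before this point would already be a bug today,
 * independent of Relaxed Ordering. */
static void rx_done(struct device *dev, dma_addr_t dma,
                    void *buf, size_t len)
{
        dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);
        process(buf, len);
}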
On Wed, Apr 14, 2021 at 02:41:52PM +0000, David Laight wrote: > So whatever driver initialises the target needs to configure whatever > target-specific register enables the RO transfers themselves. RDMA in general, and mlx5 in particular, is a layered design:

mlx5_core <- owns the PCI function, should turn on RO at the PCI function level
mlx5_en   <- Commands the chip to use RO for queues used in ethernet
ib_core
ib_uverbs
mlx5_ib   <- Commands the chip to use RO for some queues used in userspace
ib_srp*   <- A ULP driver built on RDMA - this patch commands the chip to use RO on SRP queues
nvme-rdma <- Ditto
ib_iser   <- Ditto
rds       <- Ditto

So this series is about expanding the set of queues running on mlx5 that have RO turned on when the PCI function is already running with RO enabled. We want as many queues as possible RO enabled because it brings big performance wins. Jason
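For the ULP layer at the bottom of that stack, the first patch title ("RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init()") suggests the opt-in takes the form of an access-flags argument at MR allocation time. Purely as a sketch of that shape; the extra argument and its position are assumed from the title, not copied from the actual diff, and the helper name is invented:

#include <rdma/ib_verbs.h>

/* Hypothetical ULP hunk under this series: request RO when the
 * registration MRs are allocated. The 4-argument ib_alloc_mr()
 * signature is an assumption, not the current upstream API. */
static struct ib_mr *ulp_alloc_reg_mr(struct ib_pd *pd, u32 max_num_sg)
{
        return ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, max_num_sg,
                           IB_ACCESS_RELAXED_ORDERING);
}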
From: Leon Romanovsky <leonro@nvidia.com> From Avihai, Relaxed Ordering is a PCIe mechanism that relaxes the strict ordering imposed on PCI transactions, and thus, can improve performance. Until now, relaxed ordering could be set only by user space applications for user MRs. The following patch series enables relaxed ordering for the kernel ULPs as well. Relaxed ordering is an optional capability, and as such, it is ignored by vendors that don't support it. The following test results show the performance improvement achieved with relaxed ordering. The test was performed on a NVIDIA A100 in order to check performance of storage infrastructure over xprtrdma: Without Relaxed Ordering: READ: bw=16.5GiB/s (17.7GB/s), 16.5GiB/s-16.5GiB/s (17.7GB/s-17.7GB/s), io=1987GiB (2133GB), run=120422-120422msec With relaxed ordering: READ: bw=72.9GiB/s (78.2GB/s), 72.9GiB/s-72.9GiB/s (78.2GB/s-78.2GB/s), io=2367GiB (2542GB), run=32492-32492msec Thanks Avihai Horon (10): RDMA: Add access flags to ib_alloc_mr() and ib_mr_pool_init() RDMA/core: Enable Relaxed Ordering in __ib_alloc_pd() RDMA/iser: Enable Relaxed Ordering RDMA/rtrs: Enable Relaxed Ordering RDMA/srp: Enable Relaxed Ordering nvme-rdma: Enable Relaxed Ordering cifs: smbd: Enable Relaxed Ordering net/rds: Enable Relaxed Ordering net/smc: Enable Relaxed Ordering xprtrdma: Enable Relaxed Ordering drivers/infiniband/core/mr_pool.c | 7 +- drivers/infiniband/core/rw.c | 12 ++-- drivers/infiniband/core/verbs.c | 26 +++++-- drivers/infiniband/hw/bnxt_re/ib_verbs.c | 2 +- drivers/infiniband/hw/bnxt_re/ib_verbs.h | 2 +- drivers/infiniband/hw/cxgb4/iw_cxgb4.h | 2 +- drivers/infiniband/hw/cxgb4/mem.c | 2 +- drivers/infiniband/hw/hns/hns_roce_device.h | 2 +- drivers/infiniband/hw/hns/hns_roce_mr.c | 2 +- drivers/infiniband/hw/i40iw/i40iw_verbs.c | 3 +- drivers/infiniband/hw/mlx4/mlx4_ib.h | 2 +- drivers/infiniband/hw/mlx4/mr.c | 2 +- drivers/infiniband/hw/mlx5/mlx5_ib.h | 12 ++-- drivers/infiniband/hw/mlx5/mr.c | 61 ++++++++-------- drivers/infiniband/hw/mlx5/wr.c | 69 ++++++++++++++----- drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 2 +- drivers/infiniband/hw/ocrdma/ocrdma_verbs.h | 2 +- drivers/infiniband/hw/qedr/verbs.c | 2 +- drivers/infiniband/hw/qedr/verbs.h | 2 +- drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c | 4 +- .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.h | 2 +- drivers/infiniband/sw/rdmavt/mr.c | 3 +- drivers/infiniband/sw/rdmavt/mr.h | 2 +- drivers/infiniband/sw/rxe/rxe_verbs.c | 2 +- drivers/infiniband/sw/siw/siw_verbs.c | 2 +- drivers/infiniband/sw/siw/siw_verbs.h | 2 +- drivers/infiniband/ulp/iser/iser_memory.c | 10 ++- drivers/infiniband/ulp/iser/iser_verbs.c | 6 +- drivers/infiniband/ulp/rtrs/rtrs-clt.c | 6 +- drivers/infiniband/ulp/rtrs/rtrs-srv.c | 15 ++-- drivers/infiniband/ulp/srp/ib_srp.c | 8 +-- drivers/nvme/host/rdma.c | 19 +++-- fs/cifs/smbdirect.c | 17 +++-- include/rdma/ib_verbs.h | 11 ++- include/rdma/mr_pool.h | 3 +- net/rds/ib_frmr.c | 7 +- net/smc/smc_ib.c | 3 +- net/smc/smc_wr.c | 3 +- net/sunrpc/xprtrdma/frwr_ops.c | 10 +-- 39 files changed, 209 insertions(+), 140 deletions(-)