Message ID | 20210805161134.1234-1-tatyana.e.nikolova@intel.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Jason Gunthorpe |
Series | [rdma-core] irdma: Restore full memory barrier for doorbell optimization |
On Thu, Aug 05, 2021 at 11:11:34AM -0500, Tatyana Nikolova wrote:
> During the irdma library upstream submission we agreed to
> replace atomic_thread_fence(memory_order_seq_cst) in the irdma
> doorbell optimization algorithm with udma_to_device_barrier().
> However, further regression testing uncovered cases where in
> absence of a full memory barrier, the algorithm incorrectly
> skips ringing the doorbell.
>
> There has been a discussion about the necessity of a full
> memory barrier for the doorbell optimization in the past:
> https://lore.kernel.org/linux-rdma/20170301172920.GA11340@ssaleem-MOBL4.amr.corp.intel.com/
>
> The algorithm decides whether to ring the doorbell based on input
> from the shadow memory (hw_tail). If the hw_tail is behind the sq_head,
> then the algorithm doesn't ring the doorbell, because it assumes that
> the HW is still actively processing WQEs.
>
> The shadow area indicates the last WQE processed by the HW and it is
> okay if the shadow area value isn't the most current. However there
> can't be a window of time between reading the shadow area and setting
> the valid bit for the last WQE posted, because the HW may go idle and
> the algorithm won't detect this.
>
> The following two cases illustrate this issue and are identical,
> except for ops ordering. The first case is an example of how
> the wrong order results in not ringing the doorbell when the
> HW has gone idle.

I can't really understand this explanation. Since this seems to be
about a concurrency problem can you please explain it using a normal
ladder diagram? (eg 1a3402d93c73 ("posix-cpu-timers: Fix rearm racing
against process tick") to pick an example at random)

> diff --git a/providers/irdma/uk.c b/providers/irdma/uk.c
> index c7053c52..d63996db 100644
> +++ b/providers/irdma/uk.c
> @@ -118,7 +118,7 @@ void irdma_uk_qp_post_wr(struct irdma_qp_uk *qp)
>  	__u32 sw_sq_head;
>
>  	/* valid bit is written and loads completed before reading shadow */
> -	udma_to_device_barrier();
> +	atomic_thread_fence(memory_order_seq_cst);

Because it certainly looks wrong to replace a DMA barrier with
something that is not a DMA barrier.

I'm guessing this problem is that the shadow memory is not locked and
also not using atomics to control concurrent access to it? If so then
the fix is to use atomics for the shadow memory and place the proper
order requirement on the atomic itself.

Jason
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, August 5, 2021 8:29 PM
> To: Nikolova, Tatyana E <tatyana.e.nikolova@intel.com>
> Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH rdma-core] irdma: Restore full memory barrier for
> doorbell optimization
>
> On Thu, Aug 05, 2021 at 11:11:34AM -0500, Tatyana Nikolova wrote:
> > During the irdma library upstream submission we agreed to replace
> > atomic_thread_fence(memory_order_seq_cst) in the irdma doorbell
> > optimization algorithm with udma_to_device_barrier().
> > However, further regression testing uncovered cases where in absence
> > of a full memory barrier, the algorithm incorrectly skips ringing the
> > doorbell.
> >
> > There has been a discussion about the necessity of a full memory
> > barrier for the doorbell optimization in the past:
> > https://lore.kernel.org/linux-rdma/20170301172920.GA11340@ssaleem-MOBL
> > 4.amr.corp.intel.com/
> >
> > The algorithm decides whether to ring the doorbell based on input from
> > the shadow memory (hw_tail). If the hw_tail is behind the sq_head,
> > then the algorithm doesn't ring the doorbell, because it assumes that
> > the HW is still actively processing WQEs.
> >
> > The shadow area indicates the last WQE processed by the HW and it is
> > okay if the shadow area value isn't the most current. However there
> > can't be a window of time between reading the shadow area and setting
> > the valid bit for the last WQE posted, because the HW may go idle and
> > the algorithm won't detect this.
> >
> > The following two cases illustrate this issue and are identical,
> > except for ops ordering. The first case is an example of how the wrong
> > order results in not ringing the doorbell when the HW has gone idle.
>
> I can't really understand this explanation. Since this seems to be about a
> concurrency problem can you please explain it using a normal ladder
> diagram? (eg 1a3402d93c73 ("posix-cpu-timers: Fix rearm racing against
> process tick") to pick an example at random)

Hi Jason,

The doorbell optimization algorithm involves the following operations:

1. Software writing the valid bit in the WQE.
2. Software reading shadow memory (hw_tail) value.
3. Software deciding whether or not to ring the doorbell.
4. Hardware reading that WQE.

Without the MFENCE after step 1, the processor is free to move
instructions around - it can move the write of the valid bit later
(after software reads the hw_tail value). MFENCE ensures the order of
1 and 2.

MFENCE (Vol. 1 11-12): "The MFENCE instruction establishes a memory
fence for both loads and stores. The processor ensures that no load or
store after MFENCE will become globally visible until all loads and
stores before MFENCE are globally visible"

If the order of 1 and 2 is not maintained, the algorithm can fail
because the hardware may not see the valid bit and the doorbell
optimization algorithm does not ring the doorbell. This causes
hardware to go idle even though there is work to do.

The failure scenario without the MFENCE is shown in a ladder diagram:

SOFTWARE CPU                                  DMA DEVICE

Writes the valid bit
SFENCE
Reads the hw_tail value

<Reorders instructions>
Reads hw_tail value
Decides not to ring the doorbell.
                                              Reads WQE - <valid bit is not set
                                              and hardware goes idle>
Writes valid bit
SFENCE

The order of 1 and 2 is not maintained which causes hardware to go idle
even though there is work to do.
Our doorbell optimization algorithm has always required a full memory
barrier because it prevents reordering of stores and loads. There used
to be an MFENCE in the i40iw doorbell algorithm. In a previous
discussion,
https://lore.kernel.org/linux-rdma/20170306194052.GB31672@obsidianresearch.com/,
you suggested the use of atomic_thread_fence(memory_order_seq_cst),
which is a C11 barrier, and this did work for us.

Thank you,
Tatyana

> > diff --git a/providers/irdma/uk.c b/providers/irdma/uk.c index
> > c7053c52..d63996db 100644
> > +++ b/providers/irdma/uk.c
> > @@ -118,7 +118,7 @@ void irdma_uk_qp_post_wr(struct irdma_qp_uk *qp)
> >  	__u32 sw_sq_head;
> >
> >  	/* valid bit is written and loads completed before reading shadow */
> > -	udma_to_device_barrier();
> > +	atomic_thread_fence(memory_order_seq_cst);
>
> Because it certainly looks wrong to replace a DMA barrier with something
> that is not a DMA barrier.
>
> I'm guessing this problem is that the shadow memory is not locked and also
> not using atomics to control concurrent access to it? If so then the fix is to
> use atomics for the shadow memory and place the proper order
> requirement on the atomic itself.
>
> Jason
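[Editor's note: for orientation, below is a minimal C sketch of the posting
sequence described in the message above, with the full barrier sitting between
the valid-bit write (step 1) and the hw_tail read (step 2). All structure,
field, and helper names here are invented for illustration; they are not the
actual irdma provider code.]

#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the real irdma queue pair state. */
struct sketch_qp {
	volatile uint64_t *wqe;          /* header word of the last posted WQE */
	volatile uint64_t *shadow_area;  /* shadow memory updated by HW (hw_tail) */
	uint32_t sq_head;                /* SW ring head before this post */
};

static void sketch_post_wr(struct sketch_qp *qp, uint64_t hdr_with_valid,
			   void (*ring_doorbell)(struct sketch_qp *))
{
	/* Step 1: make the WQE valid for hardware. */
	*qp->wqe = hdr_with_valid;

	/* Full barrier: the hw_tail read below must not be satisfied before
	 * the valid-bit store above is visible, otherwise an idle HW can be
	 * missed and the doorbell is never rung. */
	atomic_thread_fence(memory_order_seq_cst);

	/* Step 2: read the shadow area (hw_tail). */
	uint32_t hw_tail = (uint32_t)*qp->shadow_area;

	/* Step 3: ring only if hw_tail shows HW has caught up (idle); if it
	 * is still behind, HW is assumed to be processing and will see the
	 * new WQE on its own.  (Simplified comparison for illustration.) */
	if (hw_tail == qp->sq_head)
		ring_doorbell(qp);
}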
On Mon, Aug 09, 2021 at 08:07:32PM +0000, Nikolova, Tatyana E wrote:

> 1. Software writing the valid bit in the WQE.
> 2. Software reading shadow memory (hw_tail) value.

You are missing an ordered atomic on this read it looks like

Jason
>> 1. Software writing the valid bit in the WQE.
>> 2. Software reading shadow memory (hw_tail) value.

> You are missing an ordered atomic on this read it looks like

Hi Jason,

Why do you think we need atomic ops in this case? We aren't trying to
protect from multiple threads but CPU re-ordering of a write and a read.

Thank you,
Tatyana
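[Editor's note: the re-ordering described here is a store followed in program
order by a load from a different location. On x86 the younger load may be
satisfied before the older store becomes globally visible unless a full
barrier (MFENCE) or a locked/seq_cst operation separates them. A standalone
sketch of that shape, with hypothetical variables standing in for the WQE
header and the shadow tail; this illustrates the general property only, not a
diagnosis of the irdma failure:]

#include <stdatomic.h>
#include <stdint.h>

_Atomic uint64_t wqe_hdr;      /* stands in for the WQE valid bit  */
_Atomic uint64_t shadow_tail;  /* stands in for the hw_tail shadow */

uint64_t post_then_check(void)
{
	/* Store the valid bit... */
	atomic_store_explicit(&wqe_hdr, 1, memory_order_release);

	/* ...then read hw_tail.  With only release/acquire, x86 may still
	 * satisfy this load before the store above drains from the store
	 * buffer; a seq_cst fence between the two forbids that. */
	return atomic_load_explicit(&shadow_tail, memory_order_acquire);
}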
On Fri, Aug 13, 2021 at 05:25:49PM -0500, Tatyana Nikolova wrote:
> >> 1. Software writing the valid bit in the WQE.
> >> 2. Software reading shadow memory (hw_tail) value.
>
> > You are missing an ordered atomic on this read it looks like
>
> Hi Jason,
>
> Why do you think we need atomic ops in this case? We aren't trying
> to protect from multiple threads but CPU re-ordering of a write and
> a read.

Which is what the atomics will do.

Barriers are only appropriate when you can't add atomic markers to the
actual data that needs ordering.

Jason
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, August 18, 2021 11:50 AM
> To: Nikolova, Tatyana E <tatyana.e.nikolova@intel.com>
> Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH rdma-core] irdma: Restore full memory barrier for
> doorbell optimization
>
> On Fri, Aug 13, 2021 at 05:25:49PM -0500, Tatyana Nikolova wrote:
> > >> 1. Software writing the valid bit in the WQE.
> > >> 2. Software reading shadow memory (hw_tail) value.
> >
> > > You are missing an ordered atomic on this read it looks like
> >
> > Hi Jason,
> >
> > Why do you think we need atomic ops in this case? We aren't trying to
> > protect from multiple threads but CPU re-ordering of a write and a
> > read.
>
> Which is what the atomics will do.
>
> Barriers are only appropriate when you can't add atomic markers to the
> actual data that needs ordering.

Hi Jason,

We aren't sure what you mean by atomic markers. We ran a few experiments
with atomics, but none of the barriers we tried -
smp_mb__{before,after}_atomic(), smp_load_acquire() and
smp_store_release() - translates to a full memory barrier on X86.

Could you give us an example?

Thank you,
Tatyana
On Thu, Aug 19, 2021 at 10:01:50PM +0000, Nikolova, Tatyana E wrote:
> >
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, August 18, 2021 11:50 AM
> > To: Nikolova, Tatyana E <tatyana.e.nikolova@intel.com>
> > Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org
> > Subject: Re: [PATCH rdma-core] irdma: Restore full memory barrier for
> > doorbell optimization
> >
> > On Fri, Aug 13, 2021 at 05:25:49PM -0500, Tatyana Nikolova wrote:
> > > >> 1. Software writing the valid bit in the WQE.
> > > >> 2. Software reading shadow memory (hw_tail) value.
> > >
> > > > You are missing an ordered atomic on this read it looks like
> > >
> > > Hi Jason,
> > >
> > > Why do you think we need atomic ops in this case? We aren't trying to
> > > protect from multiple threads but CPU re-ordering of a write and a
> > > read.
> >
> > Which is what the atomics will do.
> >
> > Barriers are only appropriate when you can't add atomic markers to the
> > actual data that needs ordering.
>
> Hi Jason,
>
> We aren't sure what you mean by atomic markers. We ran a few
> experiments with atomics, but none of the barriers we tried
> smp_mb__{before,after}_atomic(), smp_load_acquire() and
> smp_store_release() translates to a full memory barrier on X86.

Huh? Those are kernel primitives, this is a userspace patch.

Userspace follows the C11 atomics memory model.

So I'd expect

atomic_store_explicit(tail, memory_order_release)
atomic_load_explicit(tail, memory_order_acquire)

To be the atomics you need. This will ensure that the read/writes to
valid before the atomics are sequenced correctly, eg no CPU thread can
observe an updated tail without also observing the set valid.

Jason
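[Editor's note: a minimal sketch of the marking Jason appears to be suggesting,
using the C11 <stdatomic.h> interfaces that rdma-core userspace code follows.
The variable and function names are illustrative, not the irdma API:]

#include <stdatomic.h>
#include <stdint.h>

_Atomic uint64_t shadow_tail;  /* the hw_tail word in the shadow area */
_Atomic uint64_t wqe_hdr;      /* the WQE header word carrying the valid bit */

void publish_wqe(uint64_t hdr)
{
	/* Release store: every write made before this store is visible to
	 * any agent that observes the new header value. */
	atomic_store_explicit(&wqe_hdr, hdr, memory_order_release);
}

uint64_t read_hw_tail(void)
{
	/* Acquire load: pairs with a releasing update of the tail, so the
	 * reader also sees whatever that update was meant to publish. */
	return atomic_load_explicit(&shadow_tail, memory_order_acquire);
}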
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, August 19, 2021 5:44 PM
> To: Nikolova, Tatyana E <tatyana.e.nikolova@intel.com>
> Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH rdma-core] irdma: Restore full memory barrier for
> doorbell optimization
>
> On Thu, Aug 19, 2021 at 10:01:50PM +0000, Nikolova, Tatyana E wrote:
> > >
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, August 18, 2021 11:50 AM
> > > To: Nikolova, Tatyana E <tatyana.e.nikolova@intel.com>
> > > Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org
> > > Subject: Re: [PATCH rdma-core] irdma: Restore full memory barrier
> > > for doorbell optimization
> > >
> > > On Fri, Aug 13, 2021 at 05:25:49PM -0500, Tatyana Nikolova wrote:
> > > > >> 1. Software writing the valid bit in the WQE.
> > > > >> 2. Software reading shadow memory (hw_tail) value.
> > > >
> > > > > You are missing an ordered atomic on this read it looks like
> > > >
> > > > Hi Jason,
> > > >
> > > > Why do you think we need atomic ops in this case? We aren't trying
> > > > to protect from multiple threads but CPU re-ordering of a write
> > > > and a read.
> > >
> > > Which is what the atomics will do.
> > >
> > > Barriers are only appropriate when you can't add atomic markers to
> > > the actual data that needs ordering.
> >
> > Hi Jason,
> >
> > We aren't sure what you mean by atomic markers. We ran a few
> > experiments with atomics, but none of the barriers we tried
> > smp_mb__{before,after}_atomic(), smp_load_acquire() and
> > smp_store_release() translates to a full memory barrier on X86.
>
> Huh? Those are kernel primitives, this is a userspace patch.
>
> Userspace follows the C11 atomics memory model.
>
> So I'd expect
>
> atomic_store_explicit(tail, memory_order_release)
> atomic_load_explicit(tail, memory_order_acquire)
>
> To be the atomics you need. This will ensure that the read/writes to valid
> before the atomics are sequenced correctly, eg no CPU thread can observe
> an updated tail without also observing the set valid.

Hi Jason,

We tried these atomic ops as shown below, but they don't fix the issue.

atomic_store_explicit(hdr, memory_order_release)
atomic_load_explicit(tail, memory_order_acquire)

In assembly they look like this:

//set_64bit_val(wqe, 24, hdr);
atomic_store_explicit((_Atomic(uint64_t) *)(wqe + (24 >> 3)), hdr, memory_order_release);
    2130:  49 89 5f 18             mov    %rbx,0x18(%r15)
/root/CVL-3.0-V26.4C00390/rdma-core-27.0/build/../providers/irdma/uk.c:747

/root/CVL-3.0-V26.4C00390/rdma-core-27.0/build/../providers/irdma/uk.c:123
temp = atomic_load_explicit((_Atomic(__u64) *)qp->shadow_area, memory_order_acquire);
    1c32:  15 00 00 28 84          adc    $0x84280000,%eax

However, the following works:

atomic_store_explicit(hdr, memory_order_seq_cst)

//set_64bit_val(wqe, 24, hdr);
atomic_store_explicit((_Atomic(uint64_t) *)(wqe + (24 >> 3)), hdr, memory_order_seq_cst);
    2130:  49 89 5f 18             mov    %rbx,0x18(%r15)
    2134:  0f ae f0                mfence
/root/CVL-3.0-V26.4C00390/rdma-core-27.0/build/../providers/irdma/uk.c:748

atomic_load_explicit(tail, memory_order_seq_cst) - same assembly as with
memory_order_acquire

Thank you,
Tatyana
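[Editor's note: the code-generation difference reported above can be reproduced
with a small standalone file. Exact instruction selection varies by compiler
and version (GCC typically emits a plain mov plus mfence for a seq_cst store,
clang typically emits xchg), while release stores and acquire loads compile to
plain movs on x86-64. The file name and build commands are only an example:]

/* ordering.c - build with: gcc -O2 -c ordering.c && objdump -d ordering.o */
#include <stdatomic.h>
#include <stdint.h>

_Atomic uint64_t word;

void store_release(uint64_t v)
{
	atomic_store_explicit(&word, v, memory_order_release);   /* plain mov */
}

void store_seq_cst(uint64_t v)
{
	atomic_store_explicit(&word, v, memory_order_seq_cst);   /* mov + mfence, or xchg */
}

uint64_t load_acquire(void)
{
	return atomic_load_explicit(&word, memory_order_acquire); /* plain mov */
}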
On Thu, Sep 02, 2021 at 04:27:37PM +0000, Nikolova, Tatyana E wrote:
> >
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, August 19, 2021 5:44 PM
> > To: Nikolova, Tatyana E <tatyana.e.nikolova@intel.com>
> > Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org
> > Subject: Re: [PATCH rdma-core] irdma: Restore full memory barrier for
> > doorbell optimization
> >
> > On Thu, Aug 19, 2021 at 10:01:50PM +0000, Nikolova, Tatyana E wrote:
> > > >
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Wednesday, August 18, 2021 11:50 AM
> > > > To: Nikolova, Tatyana E <tatyana.e.nikolova@intel.com>
> > > > Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org
> > > > Subject: Re: [PATCH rdma-core] irdma: Restore full memory barrier
> > > > for doorbell optimization
> > > >
> > > > On Fri, Aug 13, 2021 at 05:25:49PM -0500, Tatyana Nikolova wrote:
> > > > > >> 1. Software writing the valid bit in the WQE.
> > > > > >> 2. Software reading shadow memory (hw_tail) value.
> > > > >
> > > > > > You are missing an ordered atomic on this read it looks like
> > > > >
> > > > > Hi Jason,
> > > > >
> > > > > Why do you think we need atomic ops in this case? We aren't trying
> > > > > to protect from multiple threads but CPU re-ordering of a write
> > > > > and a read.
> > > >
> > > > Which is what the atomics will do.
> > > >
> > > > Barriers are only appropriate when you can't add atomic markers to
> > > > the actual data that needs ordering.
> > >
> > > Hi Jason,
> > >
> > > We aren't sure what you mean by atomic markers. We ran a few
> > > experiments with atomics, but none of the barriers we tried
> > > smp_mb__{before,after}_atomic(), smp_load_acquire() and
> > > smp_store_release() translates to a full memory barrier on X86.
> >
> > Huh? Those are kernel primitives, this is a userspace patch.
> >
> > Userspace follows the C11 atomics memory model.
> >
> > So I'd expect
> >
> > atomic_store_explicit(tail, memory_order_release)
> > atomic_load_explicit(tail, memory_order_acquire)
> >
> > To be the atomics you need. This will ensure that the read/writes to valid
> > before the atomics are sequenced correctly, eg no CPU thread can observe
> > an updated tail without also observing the set valid.
>
> Hi Jason,
>
> We tried these atomic ops as shown below, but they don't fix the issue.
>
> atomic_store_explicit(hdr, memory_order_release)
> atomic_load_explicit(tail, memory_order_acquire)
>
> In assembly they look like this:
>
> //set_64bit_val(wqe, 24, hdr);
> atomic_store_explicit((_Atomic(uint64_t) *)(wqe + (24 >> 3)), hdr, memory_order_release);
>     2130:  49 89 5f 18             mov    %rbx,0x18(%r15)
> /root/CVL-3.0-V26.4C00390/rdma-core-27.0/build/../providers/irdma/uk.c:747
>
>
> /root/CVL-3.0-V26.4C00390/rdma-core-27.0/build/../providers/irdma/uk.c:123
> temp = atomic_load_explicit((_Atomic(__u64) *)qp->shadow_area, memory_order_acquire);
>     1c32:  15 00 00 28 84          adc    $0x84280000,%eax
>
>
> However, the following works:
> atomic_store_explicit(hdr, memory_order_seq_cst)
>
> //set_64bit_val(wqe, 24, hdr);
> atomic_store_explicit((_Atomic(uint64_t) *)(wqe + (24 >> 3)), hdr, memory_order_seq_cst);
>     2130:  49 89 5f 18             mov    %rbx,0x18(%r15)
>     2134:  0f ae f0                mfence
> /root/CVL-3.0-V26.4C00390/rdma-core-27.0/build/../providers/irdma/uk.c:748
>
>
> atomic_load_explicit(tail, memory_order_seq_cst) - same assembly as with
> memory_order_acquire

Again, you shouldn't need the mfence unless you are doing something very
special, and so far nothing you've explained reaches to what that
special thing is.

According to the x86 memory model when you do a store release then all
writes made by the current CPU will be observed by the CPU that does
the load acquire. Which matches what you asked for.

The main purpose of the atomic notations on x86 is to force the compiler
to put all load/stores in the correct order when going to assembly. The
x86 architecture takes care of the rest without MFENCE, and from your
prior description it seems the issue was compiler re-ordering of
load/stores around a serializing load/store.

MFENCE is about preventing future stores from becoming visible until
prior stores are, which you shouldn't need with the simple
acquire/release semantics that you described.

Jason
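[Editor's note: to make the release/acquire guarantee Jason describes concrete,
a small message-passing sketch with invented variable names, not irdma code:
an observer that sees the updated tail is guaranteed to also see the value
written before the releasing store.]

#include <stdatomic.h>
#include <stdint.h>

_Atomic uint64_t valid;  /* written before publishing */
_Atomic uint64_t tail;   /* the published value */

void producer(void)
{
	atomic_store_explicit(&valid, 1, memory_order_relaxed);
	/* Release: anyone who observes tail == 1 also observes valid == 1. */
	atomic_store_explicit(&tail, 1, memory_order_release);
}

int consumer(void)
{
	if (atomic_load_explicit(&tail, memory_order_acquire) == 1)
		return (int)atomic_load_explicit(&valid, memory_order_relaxed); /* always 1 here */
	return -1;  /* update not observed yet */
}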
diff --git a/providers/irdma/uk.c b/providers/irdma/uk.c
index c7053c52..d63996db 100644
--- a/providers/irdma/uk.c
+++ b/providers/irdma/uk.c
@@ -118,7 +118,7 @@ void irdma_uk_qp_post_wr(struct irdma_qp_uk *qp)
 	__u32 sw_sq_head;
 
 	/* valid bit is written and loads completed before reading shadow */
-	udma_to_device_barrier();
+	atomic_thread_fence(memory_order_seq_cst);
 
 	/* read the doorbell shadow area */
 	get_64bit_val(qp->shadow_area, 0, &temp);