[0/4] Introducing RDMA shared CQ pool

Message ID 1589122557-88996-1-git-send-email-yaminf@mellanox.com (mailing list archive)

Message

Yamin Friedman May 10, 2020, 2:55 p.m. UTC
This is the fourth re-incarnation of the CQ pool patches proposed
by Sagi and Christoph. I have started with the patches that Sagi last
submitted and built the CQ pool as a new API for acquiring shared CQs.

Our ULPs often want to make smart decisions on completion vector
affinitization when using multiple completion queues spread on
multiple cpu cores. We can see examples for this in iser, srp, nvme-rdma.

This patch set attempts to move this smartness to the rdma core by
introducing per-device CQ pools that by definition spread
across cpu cores. In addition, we make the completion queue
allocation completely transparent to the ULP by adding affinity hints
to create_qp, which tell the rdma core to select (or allocate)
a completion queue that has the needed affinity for it.
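
Below is a minimal usage sketch of how a ULP could sit on top of such a
pool. It is illustrative only: the helper names ib_cq_pool_get() and
ib_cq_pool_put() are assumed from the form this API eventually took
upstream, and the explicit get/put calls stand in for the create_qp
affinity-hint flow described above; the exact signatures in this series
may differ.

#include <linux/err.h>
#include <rdma/ib_verbs.h>

struct ulp_queue {
        struct ib_cq *cq;
        struct ib_qp *qp;
        unsigned int nr_cqe;
};

static int ulp_create_queue(struct ulp_queue *q, struct ib_pd *pd,
                            int comp_vector_hint, unsigned int queue_size)
{
        struct ib_qp_init_attr attr = {};

        q->nr_cqe = queue_size;

        /*
         * Reserve queue_size CQEs on a shared CQ affinitized to the
         * requested completion vector; the pool either reuses an
         * existing CQ with a matching polling context or allocates one.
         */
        q->cq = ib_cq_pool_get(pd->device, q->nr_cqe, comp_vector_hint,
                               IB_POLL_SOFTIRQ);
        if (IS_ERR(q->cq))
                return PTR_ERR(q->cq);

        attr.send_cq = q->cq;
        attr.recv_cq = q->cq;
        attr.qp_type = IB_QPT_RC;
        attr.sq_sig_type = IB_SIGNAL_REQ_WR;
        attr.cap.max_send_wr = queue_size;
        attr.cap.max_recv_wr = queue_size;
        attr.cap.max_send_sge = 1;
        attr.cap.max_recv_sge = 1;

        q->qp = ib_create_qp(pd, &attr);
        if (IS_ERR(q->qp)) {
                /* Return our CQE reservation; the CQ itself stays pooled. */
                ib_cq_pool_put(q->cq, q->nr_cqe);
                return PTR_ERR(q->qp);
        }
        return 0;
}

static void ulp_destroy_queue(struct ulp_queue *q)
{
        ib_destroy_qp(q->qp);
        ib_cq_pool_put(q->cq, q->nr_cqe);
}

The key shift is that the ULP no longer owns CQ lifetimes; it only
accounts for the CQEs it reserved, which is what keeps the per-driver
conversions small.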

This API gives us a similar approach to what's used in the networking
stack where the device completion queues are hidden from the application.
With the affinitization hints, we also do not compromise performance
as the completion queue will be affinitized correctly.

One thing that should be noticed is that now different ULPs using this
API may share completion queues (given that they use the same polling context).
However, even without this API they share interrupt vectors (and CPUs
that are assigned to them). Thus aggregating consumers on fewer completion
queues will result in better overall completion processing efficiency per
completion event (or interrupt).
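
To make the cross-ULP sharing point concrete, here is a tiny hypothetical
illustration using the same assumed pool helpers as in the sketch above:
two consumers asking for the same device, completion vector and polling
context may be handed the same underlying CQ, while each one still
accounts only for its own CQE reservation.

#include <linux/err.h>
#include <rdma/ib_verbs.h>

static void cq_sharing_illustration(struct ib_device *ibdev)
{
        /* Two consumers (think nvme-rdma and iser) on vector 0, softirq polling. */
        struct ib_cq *a, *b;

        a = ib_cq_pool_get(ibdev, 128, 0, IB_POLL_SOFTIRQ);
        if (IS_ERR(a))
                return;
        b = ib_cq_pool_get(ibdev, 128, 0, IB_POLL_SOFTIRQ);
        if (IS_ERR(b)) {
                ib_cq_pool_put(a, 128);
                return;
        }

        /*
         * a and b may well be the same CQ, so both consumers' completions
         * are processed per completion event; each still returns only the
         * CQEs it reserved.
         */
        ib_cq_pool_put(b, 128);
        ib_cq_pool_put(a, 128);
}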

An advantage of this method of using the CQ pool is that changes in the ULP
driver are minimal (around 14 altered lines of code).

The patch set converts nvme-rdma and nvmet-rdma to use the new API.
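
To give a feel for why the per-driver diff stays that small, here is an
illustrative before/after sketch of the kind of substitution involved
(not the actual diff from this series; the pool helper names are assumed
from the form the API later took upstream):

#include <rdma/ib_verbs.h>

/* Before: every queue allocates a private CQ and frees it on teardown. */
static struct ib_cq *queue_cq_private(struct ib_device *ibdev, void *queue,
                                      int cq_size, int comp_vector)
{
        return ib_alloc_cq(ibdev, queue, cq_size, comp_vector,
                           IB_POLL_SOFTIRQ);  /* paired with ib_free_cq() */
}

/* After: the queue reserves cq_size CQEs on a pooled, possibly shared CQ. */
static struct ib_cq *queue_cq_shared(struct ib_device *ibdev,
                                     int cq_size, int comp_vector)
{
        return ib_cq_pool_get(ibdev, cq_size, comp_vector,
                              IB_POLL_SOFTIRQ);  /* paired with ib_cq_pool_put() */
}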

Test setup:
Intel(R) Xeon(R) Platinum 8176M CPU @ 2.10GHz servers.
Running NVMeoF 4KB read IOs over ConnectX-5EX across Spectrum switch.
TX-depth = 32. Number of cores refers to the initiator side. Four disks are
accessed from each core. In the current case we have four CQs per core and
in the shared case we have a single CQ per core. Up to 14 cores there is no
significant change in performance, and the number of interrupts per second
is less than a million in the current case.
==================================================
|Cores|Current KIOPs  |Shared KIOPs  |improvement|
|-----|---------------|--------------|-----------|
|14   |2188           |2620          |19.7%      |
|-----|---------------|--------------|-----------|
|20   |2063           |2308          |11.8%      |
|-----|---------------|--------------|-----------|
|28   |1933           |2235          |15.6%      |
==================================================
|Cores|Current avg lat|Shared avg lat|improvement|
|-----|---------------|--------------|-----------|
|14   |817us          |683us         |16.4%      |
|-----|---------------|--------------|-----------|
|20   |1239us         |1108us        |10.6%      |
|-----|---------------|--------------|-----------|
|28   |1852us         |1601us        |13.5%      |
========================================================
|Cores|Current interrupts|Shared interrupts|improvement|
|-----|------------------|-----------------|-----------|
|14   |2131K/sec         |425K/sec         |80%        |
|-----|------------------|-----------------|-----------|
|20   |2267K/sec         |594K/sec         |73.8%      |
|-----|------------------|-----------------|-----------|
|28   |2370K/sec         |1057K/sec        |55.3%      |
====================================================================
|Cores|Current 99.99th PCTL lat|Shared 99.99th PCTL lat|improvement|
|-----|------------------------|-----------------------|-----------|
|14   |85Kus                   |9Kus                   |88%        |
|-----|------------------------|-----------------------|-----------|
|20   |6Kus                    |5.3Kus                 |14.6%      |
|-----|------------------------|-----------------------|-----------|
|28   |11.6Kus                 |9.5Kus                 |18%        |
====================================================================

Performance improvement with 16 disks (16 CQs per core) is comparable.
What we can see is that once we reach a couple of million interrupts per
second, the Intel CPU can no longer sustain line rate; this feature
mitigates that effect.

Comments and feedback are welcome.

Yamin Friedman (4):
  infiniband/core: Add protection for shared CQs used by ULPs
  RDMA/core: Introduce shared CQ pool API
  nvme-rdma: use new shared CQ mechanism
  nvmet-rdma: use new shared CQ mechanism

 drivers/infiniband/core/core_priv.h |   8 ++
 drivers/infiniband/core/cq.c        | 170 ++++++++++++++++++++++++++++++++++--
 drivers/infiniband/core/device.c    |   3 +-
 drivers/infiniband/core/verbs.c     |  37 +++++++-
 drivers/nvme/host/rdma.c            |  65 +++++++++-----
 drivers/nvme/target/rdma.c          |  14 +--
 include/rdma/ib_verbs.h             |  52 ++++++++++-
 7 files changed, 308 insertions(+), 41 deletions(-)

Comments

Gal Pressman May 10, 2020, 3:04 p.m. UTC | #1
On 10/05/2020 17:55, Yamin Friedman wrote:
> This is the fourth re-incarnation of the CQ pool patches proposed
> by Sagi and Christoph. I have started with the patches that Sagi last
> submitted and built the CQ pool as a new API for acquiring shared CQs.

Can you please share a link to the previous revisions?
Yamin Friedman May 10, 2020, 3:17 p.m. UTC | #2
Sure, here is the last version:

https://marc.info/?l=linux-rdma&m=151013509002219&w=2

On 5/10/2020 6:04 PM, Gal Pressman wrote:
> On 10/05/2020 17:55, Yamin Friedman wrote:
>> This is the fourth re-incarnation of the CQ pool patches proposed
>> by Sagi and Christoph. I have started with the patches that Sagi last
>> submitted and built the CQ pool as a new API for acquiring shared CQs.
> Can you please share a link to the previous revisions?
Sagi Grimberg May 11, 2020, 8:34 a.m. UTC | #3
Hey Yamin, thanks for picking it up!

> This is the fourth re-incarnation of the CQ pool patches proposed
> by Sagi and Christoph. I have started with the patches that Sagi last
> submitted and built the CQ pool as a new API for acquiring shared CQs.

Can you also enumerate what changed in this version?

> Our ULPs often want to make smart decisions on completion vector
> affinitization when using multiple completion queues spread on
> multiple cpu cores. We can see examples for this in iser, srp, nvme-rdma.

I'd really like to see some love for the rest of our ulps. I'm not as
involved as I used to be, but it still annoys me a little. You guys
sometimes have a bad habit of centering improvements around nvme(t)/rdma
instead of improving the rest of the kernel consumers. Just one's
opinion though, so it's up to you if you want to take it or not...

> This patch set attempts to move this smartness to the rdma core by
> introducing per-device CQ pools that by definition spread
> across cpu cores. In addition, we completely make the completion
> queue allocation transparent to the ULP by adding affinity hints
> to create_qp which tells the rdma core to select (or allocate)
> a completion queue that has the needed affinity for it.
> 
> This API gives us a similar approach to whats used in the networking
> stack where the device completion queues are hidden from the application.
> With the affinitization hints, we also do not compromise performance
> as the completion queue will be affinitized correctly.
> 
> One thing that should be noticed is that now different ULPs using this
> API may share completion queues (given that they use the same polling context).
> However, even without this API they share interrupt vectors (and CPUs
> that are assigned to them). Thus aggregating consumers on less completion
> queues will result in better overall completion processing efficiency per
> completion event (or interrupt).

Have you experimented with this? IIRC that was the main feedback from the
last version way back. Have you measured the effect of this? It makes
sense to me that it is beneficial, but it would probably be worth
measuring.

> An advantage of this method of using the CQ pool is that changes in the ULP
> driver are minimal (around 14 altered lines of code).
> 
> The patch set convert nvme-rdma and nvmet-rdma to use the new API.
> 
> Test setup:
> Intel(R) Xeon(R) Platinum 8176M CPU @ 2.10GHz servers.
> Running NVMeoF 4KB read IOs over ConnectX-5EX across Spectrum switch.
> TX-depth = 32. Number of cores refers to the initiator side. Four disks are
> accessed from each core.

What does this mean? 4 namespaces? How does that matter?

> In the current case we have four CQs per core and
> in the shared case we have a single CQ per core.

This makes me think that the host connected to 4 controllers...

> Until 14 cores there is no
> significant change in performance and the number of interrupts per second
> is less than a million in the current case.

This is interesting, what changes between different cores? I'm assuming
that a single core with 4 CQs vs. 1 CQ would have shown the difference,
no?

> ==================================================
> |Cores|Current KIOPs  |Shared KIOPs  |improvement|
> |-----|---------------|--------------|-----------|
> |14   |2188           |2620          |19.7%      |
> |-----|---------------|--------------|-----------|
> |20   |2063           |2308          |11.8%      |
> |-----|---------------|--------------|-----------|
> |28   |1933           |2235          |15.6%      |
> |=================================================
> |Cores|Current avg lat|Shared avg lat|improvement|
> |-----|---------------|--------------|-----------|
> |14   |817us          |683us         |16.4%      |
> |-----|---------------|--------------|-----------|
> |20   |1239us         |1108us        |10.6%      |
> |-----|---------------|--------------|-----------|
> |28   |1852us         |1601us        |13.5%      |
> ========================================================
> |Cores|Current interrupts|Shared interrupts|improvement|
> |-----|------------------|-----------------|-----------|
> |14   |2131K/sec         |425K/sec         |80%        |
> |-----|------------------|-----------------|-----------|
> |20   |2267K/sec         |594K/sec         |73.8%      |
> |-----|------------------|-----------------|-----------|
> |28   |2370K/sec         |1057K/sec        |55.3%      |
> ====================================================================
> |Cores|Current 99.99th PCTL lat|Shared 99.99th PCTL lat|improvement|
> |-----|------------------------|-----------------------|-----------|
> |14   |85Kus                   |9Kus                   |88%        |
> |-----|------------------------|-----------------------|-----------|
> |20   |6Kus                    |5.3Kus                 |14.6%      |
> |-----|------------------------|-----------------------|-----------|
> |28   |11.6Kus                 |9.5Kus                 |18%        |
> |===================================================================
> 
> Performance improvement with 16 disks (16 CQs per core) is comparable.
> What we can see is that once we reach a couple million interrupts per
> second the Intel CPU can no longer sustain line rate and this feature
> mitigates that effect.

I'm not sure I understand what the global effect is here.
Yamin Friedman May 11, 2020, 12:24 p.m. UTC | #4
On 5/11/2020 11:34 AM, Sagi Grimberg wrote:
> Hey Yamin, thanks for picking it up!
No problem :)
>
>> This is the fourth re-incarnation of the CQ pool patches proposed
>> by Sagi and Christoph. I have started with the patches that Sagi last
>> submitted and built the CQ pool as a new API for acquiring shared CQs.
>
> Can you also enumerate what changed in this version?
Good idea. I will add.
>
>> Our ULPs often want to make smart decisions on completion vector
>> affinitization when using multiple completion queues spread on
>> multiple cpu cores. We can see examples for this in iser, srp, 
>> nvme-rdma.
>
> I'd really like to see some love for the rest of our ulps. I'm not as
> involved as I used to be, but it still annoys me a little. You guys
> sometimes have a bad habit of centering improvements around nvme(t)/rdma
> instead of improving the rest of the kernel consumers. Just one's
> opinion though, so its up to you if you want to take it or not...
You are definitely right. I have patches for iSER ready, and if this goes
through I will add any other relevant ULPs.
>
>> This patch set attempts to move this smartness to the rdma core by
>> introducing per-device CQ pools that by definition spread
>> across cpu cores. In addition, we completely make the completion
>> queue allocation transparent to the ULP by adding affinity hints
>> to create_qp which tells the rdma core to select (or allocate)
>> a completion queue that has the needed affinity for it.
>>
>> This API gives us a similar approach to whats used in the networking
>> stack where the device completion queues are hidden from the 
>> application.
>> With the affinitization hints, we also do not compromise performance
>> as the completion queue will be affinitized correctly.
>>
>> One thing that should be noticed is that now different ULPs using this
>> API may share completion queues (given that they use the same polling 
>> context).
>> However, even without this API they share interrupt vectors (and CPUs
>> that are assigned to them). Thus aggregating consumers on less 
>> completion
>> queues will result in better overall completion processing efficiency 
>> per
>> completion event (or interrupt).
>
> Have you experimented this? IIRC that was the main feedback from the
> last version way back. Have you measured the affect of this? It makes
> sense to me that it is beneficial, but probably would be worth to
> measure it.

The main conclusion from the tests I have run so far is that this is
indeed the case: sharing significantly reduces the number of interrupts
in the system without hurting the performance of each queue.

>
>> An advantage of this method of using the CQ pool is that changes in 
>> the ULP
>> driver are minimal (around 14 altered lines of code).
>>
>> The patch set convert nvme-rdma and nvmet-rdma to use the new API.
>>
>> Test setup:
>> Intel(R) Xeon(R) Platinum 8176M CPU @ 2.10GHz servers.
>> Running NVMeoF 4KB read IOs over ConnectX-5EX across Spectrum switch.
>> TX-depth = 32. Number of cores refers to the initiator side. Four 
>> disks are
>> accessed from each core.
>
> What does this mean? 4 namepsaces? How does that matter?
Yes, I meant four controllers. I will make that clearer.
>
>> In the current case we have four CQs per core and
>> in the shared case we have a single CQ per core.
>
> This makes me think that the host connected to 4 controllers...
>
>> Until 14 cores there is no
>> significant change in performance and the number of interrupts per 
>> second
>> is less than a million in the current case.
>
> This is interesting, what changes between different cores? I'm assuming
> that a single core with 4 CQs vs. 1 CQ would have shown the difference
> no?
What I saw is that once you reach enough cores working (including a
higher IOP count) there is a very high number of interrupts in the system
in general. Once the system reaches about 2M interrupts per second the
performance goes down significantly, which is where we see the value of
this feature. On a single core we still reduce the number of interrupts,
but they are not a bottleneck at that point.
>
>> ==================================================
>> |Cores|Current KIOPs  |Shared KIOPs  |improvement|
>> |-----|---------------|--------------|-----------|
>> |14   |2188           |2620          |19.7%      |
>> |-----|---------------|--------------|-----------|
>> |20   |2063           |2308          |11.8%      |
>> |-----|---------------|--------------|-----------|
>> |28   |1933           |2235          |15.6%      |
>> |=================================================
>> |Cores|Current avg lat|Shared avg lat|improvement|
>> |-----|---------------|--------------|-----------|
>> |14   |817us          |683us         |16.4%      |
>> |-----|---------------|--------------|-----------|
>> |20   |1239us         |1108us        |10.6%      |
>> |-----|---------------|--------------|-----------|
>> |28   |1852us         |1601us        |13.5%      |
>> ========================================================
>> |Cores|Current interrupts|Shared interrupts|improvement|
>> |-----|------------------|-----------------|-----------|
>> |14   |2131K/sec         |425K/sec         |80%        |
>> |-----|------------------|-----------------|-----------|
>> |20   |2267K/sec         |594K/sec         |73.8%      |
>> |-----|------------------|-----------------|-----------|
>> |28   |2370K/sec         |1057K/sec        |55.3%      |
>> ====================================================================
>> |Cores|Current 99.99th PCTL lat|Shared 99.99th PCTL lat|improvement|
>> |-----|------------------------|-----------------------|-----------|
>> |14   |85Kus                   |9Kus                   |88%        |
>> |-----|------------------------|-----------------------|-----------|
>> |20   |6Kus                    |5.3Kus                 |14.6%      |
>> |-----|------------------------|-----------------------|-----------|
>> |28   |11.6Kus                 |9.5Kus                 |18%        |
>> |===================================================================
>>
>> Performance improvement with 16 disks (16 CQs per core) is comparable.
>> What we can see is that once we reach a couple million interrupts per
>> second the Intel CPU can no longer sustain line rate and this feature
>> mitigates that effect.
>
> I'm not sure I understand what is the global effect here.
From what we have seen in various tests, if there are too many
interrupts in the system the CPU starts causing backpressure on the PCIe
bus, preventing DMA. The main goal here is to reduce the number of
interrupts.