[v2,0/5] implement nvmf read/write queue maps

Message ID: 20181211233519.9350-1-sagi@grimberg.me

Message

Sagi Grimberg Dec. 11, 2018, 11:35 p.m. UTC
This set implements read/write queue maps for nvmf (implemented in tcp
and rdma). We allow users to pass in an nr_write_queues argument that
maps a separate set of queues to host write I/O (or more correctly,
non-read I/O) and a set of queues to hold read I/O (the latter now
controlled by the familiar nr_io_queues).

A patchset that restores nvme-rdma polling is in the pipeline.
The polling is less trivial because:
1. we can find non-I/O completions in the cq (e.g. memory registration)
2. we need to start with non-polling for a sane connect and
   then switch to polling, which is not trivial behind the
   cq API we use.

Note that read/write separation can be a clear win for rdma, and
especially for tcp, as it minimizes the risk of head-of-line blocking
for mixed workloads over a single tcp byte stream.
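
For readers who want a picture of what the split looks like in a
transport driver, here is a minimal sketch of a map_queues callback
against the new multiple-queue-map API. It follows the general shape of
the nvme-tcp/nvme-rdma patches in this series, but the controller
struct and its io_queues[] field are illustrative placeholders, not the
actual patch code:

#include <linux/blk-mq.h>

/*
 * Illustrative controller struct: only the per-type queue counts
 * matter for this sketch.
 */
struct example_ctrl {
	unsigned int io_queues[HCTX_MAX_TYPES];
};

/*
 * Sketch only: carve the tag set into two maps so that writes (and
 * other non-read I/O) land on the first nr_write_queues hardware
 * queues and reads land on the queues after them.
 */
static int example_map_queues(struct blk_mq_tag_set *set)
{
	struct example_ctrl *ctrl = set->driver_data;

	/* default map serves writes / non-read I/O */
	set->map[HCTX_TYPE_DEFAULT].nr_queues =
			ctrl->io_queues[HCTX_TYPE_DEFAULT];
	set->map[HCTX_TYPE_DEFAULT].queue_offset = 0;

	/* read map starts right after the write queues */
	set->map[HCTX_TYPE_READ].nr_queues =
			ctrl->io_queues[HCTX_TYPE_READ];
	set->map[HCTX_TYPE_READ].queue_offset =
			ctrl->io_queues[HCTX_TYPE_DEFAULT];

	blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
	blk_mq_map_queues(&set->map[HCTX_TYPE_READ]);
	return 0;
}

On the connect side this is driven by the new nr_write_queues option
(and the corresponding nr-write-queues entry added to the nvme-cli
documentation, see the changelog below).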

Changes from v1:
- simplified map_queues in nvme-tcp and nvme-rdma
- improved change logs
- collected review tags
- added nr-write-queues entry in nvme-cli documentation

Sagi Grimberg (5):
  blk-mq-rdma: pass in queue map to blk_mq_rdma_map_queues
  nvme-fabrics: add missing nvmf_ctrl_options documentation
  nvme-fabrics: allow user to set nr_write_queues for separate queue
    maps
  nvme-tcp: support separate queue maps for read and write
  nvme-rdma: support separate queue maps for read and write

 block/blk-mq-rdma.c         |  8 +++----
 drivers/nvme/host/fabrics.c | 15 ++++++++++++-
 drivers/nvme/host/fabrics.h |  6 +++++
 drivers/nvme/host/rdma.c    | 28 ++++++++++++++++++++---
 drivers/nvme/host/tcp.c     | 44 ++++++++++++++++++++++++++++++++-----
 include/linux/blk-mq-rdma.h |  2 +-
 6 files changed, 88 insertions(+), 15 deletions(-)

Comments

James Smart Dec. 12, 2018, 5:23 p.m. UTC | #1
On 12/11/2018 3:35 PM, Sagi Grimberg wrote:
> This set implements read/write queue maps to nvmf (implemented in tcp
> and rdma). We basically allow the users to pass in nr_write_queues
> argument that will basically maps a separate set of queues to host
> write I/O (or more correctly non-read I/O) and a set of queues to
> hold read I/O (which is now controlled by the known nr_io_queues).
>
> A patchset that restores nvme-rdma polling is in the pipe.
> The polling is less trivial because:
> 1. we can find non I/O completions in the cq (i.e. memreg)
> 2. we need to start with non-polling for a sane connect and
>     then switch to polling which is not trivial behind the
>     cq API we use.
>
> Note that read/write separation for rdma but especially tcp this can be
> very clear win as we minimize the risk for head-of-queue blocking for
> mixed workloads over a single tcp byte stream.

Sagi,

What other wins are there for this split?

I'm considering whether it's worthwhile for fc as well, but the hol issue
doesn't exist with fc. What else is being resolved?

-- james
Sagi Grimberg Dec. 12, 2018, 5:58 p.m. UTC | #2
> Sagi,
> 
> What other wins are there for this split ?
> 
> I'm considering whether its worthwhile for fc as well, but the hol issue 
> doesn't exist with fc. What else is being resolved ?

I've pondered it myself, which is why I didn't add it to fc as well
(it would have been easy enough, I think). I guess that with this the
user can limit writes relative to reads by assigning fewer write queues,
as Jens suggested in his patches...
James Smart Dec. 12, 2018, 7:15 p.m. UTC | #3
On 12/12/2018 9:58 AM, Sagi Grimberg wrote:
>
>> Sagi,
>>
>> What other wins are there for this split ?
>>
>> I'm considering whether its worthwhile for fc as well, but the hol 
>> issue doesn't exist with fc. What else is being resolved ?
>
> I've pondered it myself, which is why I didn't add it to fc as well
> (would have been easy enough I think). I guess that with this the
> user can limit writes compared to read by assigning less queues as
> jens suggested in his patches...

Not enough to interest me.  I would also prefer not to require more
queues - it only increases the oversubscription requirements on the
target, which is already a major issue on real shared subsystems.

Thanks

-- james
Hannes Reinecke Dec. 13, 2018, 12:04 p.m. UTC | #4
On 12/12/18 6:58 PM, Sagi Grimberg wrote:
> 
>> Sagi,
>>
>> What other wins are there for this split ?
>>
>> I'm considering whether its worthwhile for fc as well, but the hol 
>> issue doesn't exist with fc. What else is being resolved ?
> 
> I've pondered it myself, which is why I didn't add it to fc as well
> (would have been easy enough I think). I guess that with this the
> user can limit writes compared to read by assigning less queues as
> jens suggested in his patches...
 >
Weelll ... isn't that what blkcg is for?
I do understand if one would want to implement something like this if
there's a real benefit from the _hardware_ side, but if there isn't, we
should try to keep it in the upper layers where it belongs.

Cheers,

Hannes
Jens Axboe Dec. 13, 2018, 2:02 p.m. UTC | #5
On 12/13/18 5:04 AM, Hannes Reinecke wrote:
> On 12/12/18 6:58 PM, Sagi Grimberg wrote:
>>
>>> Sagi,
>>>
>>> What other wins are there for this split ?
>>>
>>> I'm considering whether its worthwhile for fc as well, but the hol 
>>> issue doesn't exist with fc. What else is being resolved ?
>>
>> I've pondered it myself, which is why I didn't add it to fc as well
>> (would have been easy enough I think). I guess that with this the
>> user can limit writes compared to read by assigning less queues as
>> jens suggested in his patches...
>  >
> Weelll ... isn't that what blkcg is for?
> I do understand if one would want to implement something like this if 
> there's a real benefit from the _hardware_ side, but if there isn't we 
> should try to keep it in the upper layers where it belongs.

The point is that nvme round-robins queues: if you have 1 write queue
for N read queues, then you get better read servicing as writes
can't flood more than one queue.

Obviously this isn't a replacement for any kind of real throttling,
but it's cheap and easy to do.

The multiple queue maps are more useful for polled IO, and for
future prioritized IO in general. But it is also handy for not
oversubscribing the write side.
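
(A sketch, not the actual blk-mq code, of the steering Jens describes:
once a driver exposes a separate read map, blk-mq picks the hctx type
per request roughly like this, so all writes are confined to whatever
small set backs HCTX_TYPE_DEFAULT while reads spread over
HCTX_TYPE_READ.)

#include <linux/blk-mq.h>
#include <linux/blk_types.h>

/*
 * Rough illustration of per-request map selection; the real logic
 * lives in blk_mq_map_queue() and also covers polled I/O.
 */
static enum hctx_type example_hctx_type(unsigned int op)
{
	if ((op & REQ_OP_MASK) == REQ_OP_READ)
		return HCTX_TYPE_READ;		/* spreads over the N read queues */
	return HCTX_TYPE_DEFAULT;		/* writes funnel into the write map */
}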
Hannes Reinecke Dec. 13, 2018, 2:19 p.m. UTC | #6
On 12/13/18 3:02 PM, Jens Axboe wrote:
> On 12/13/18 5:04 AM, Hannes Reinecke wrote:
>> On 12/12/18 6:58 PM, Sagi Grimberg wrote:
>>>
>>>> Sagi,
>>>>
>>>> What other wins are there for this split ?
>>>>
>>>> I'm considering whether its worthwhile for fc as well, but the hol
>>>> issue doesn't exist with fc. What else is being resolved ?
>>>
>>> I've pondered it myself, which is why I didn't add it to fc as well
>>> (would have been easy enough I think). I guess that with this the
>>> user can limit writes compared to read by assigning less queues as
>>> jens suggested in his patches...
>>>
>> Weelll ... isn't that what blkcg is for?
>> I do understand if one would want to implement something like this if
>> there's a real benefit from the _hardware_ side, but if there isn't we
>> should try to keep it in the upper layers where it belongs.
> 
> The point is that nvme round robins queues, if you have 1 write queue
> for N read queues, then you would get better read servicing as writes
> can't flood more than one queue.
> 
> Obviously this isn't a replacement for any kind of real throttling,
> but it's cheap and easy to do.
> 
> The multiple queue maps are more useful for the polled IO, and for
> future prioritized IO in general. But it is also handy for not
> oversubscribing the write side.
> 
But how can you tell?
It's not as if the hardware tells us "Oh, incidentally, I'm far slower
at writing than reading, please use only so many queues for writing."
So any number we pick would be pretty much a heuristic to start with.

Cheers,

Hannes