[PATCHv2,0/2] block,nvme: latency-based I/O scheduler

Message ID 20240403141756.88233-1-hare@kernel.org (mailing list archive)

Message

Hannes Reinecke April 3, 2024, 2:17 p.m. UTC
Hi all,

There have been several attempts to implement a latency-based I/O
scheduler for native NVMe multipath, all of which had their issues.

So it's time to start afresh, this time using the QoS framework
already present in the block layer.
The series consists of two parts:
- a new 'blk-nlatency' QoS module, which is just a simple per-node
  latency tracker
- a 'latency' nvme I/O policy (a rough sketch of the idea is included
  below)
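
To make the two parts concrete, here is a minimal user-space sketch of
the concept, not the actual patch code: a per-node latency average that
is folded in at completion time, and a path selector that simply
prefers the node with the lowest current average. The names
(nlat_node, nlat_update, nlat_best_node), the decaying-average formula,
and the sample values are all illustrative assumptions; the real module
plugs into the block layer's rq_qos/blk_stat machinery instead of doing
this in user space.

  /* Conceptual sketch only -- not the code from this series. */
  #include <stdio.h>

  #define NR_NODES 2

  struct nlat_node {
      unsigned int avg_lat_us;    /* smoothed per-node latency */
  };

  static struct nlat_node nodes[NR_NODES];

  /*
   * Fold a completion latency sample into the per-node average.
   * 'decay' stands in for the module parameter from patch 1: a larger
   * value weights history more heavily and smooths out outliers.
   */
  static void nlat_update(struct nlat_node *n, unsigned int lat_us,
                          unsigned int decay)
  {
      if (!n->avg_lat_us)
          n->avg_lat_us = lat_us;
      else
          n->avg_lat_us = (n->avg_lat_us * decay + lat_us) / (decay + 1);
  }

  /* The 'latency' iopolicy idea: prefer the node with the lowest average. */
  static int nlat_best_node(void)
  {
      int i, best = 0;

      for (i = 1; i < NR_NODES; i++)
          if (nodes[i].avg_lat_us < nodes[best].avg_lat_us)
              best = i;
      return best;
  }

  int main(void)
  {
      /* pretend node 0 is the slower path and node 1 the faster one */
      unsigned int slow[] = { 150, 400, 180, 900, 170 };
      unsigned int fast[] = { 140, 150, 145, 160, 150 };
      int i;

      for (i = 0; i < 5; i++) {
          nlat_update(&nodes[0], slow[i], 4);
          nlat_update(&nodes[1], fast[i], 4);
      }
      printf("node 0 avg %u us, node 1 avg %u us -> best node %d\n",
             nodes[0].avg_lat_us, nodes[1].avg_lat_us, nlat_best_node());
      return 0;
  }

The decaying average is attractive precisely because it is a single,
cheap-to-update number per node, which keeps the comparison in the
path selector trivial.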

Using the 'tiobench' fio script with a 512-byte block size, I'm getting
the following latencies (in usecs) as a baseline:
- seq write: avg 186 stddev 331
- rand write: avg 4598 stddev 7903
- seq read: avg 149 stddev 65
- rand read: avg 150 stddev 68

Enabling the 'latency' iopolicy:
- seq write: avg 178 stddev 113
- rand write: avg 3427 stddev 6703
- seq read: avg 140 stddev 59
- rand read: avg 141 stddev 58

Additionally setting the 'decay' parameter to 10:
- seq write: avg 182 stddev 65
- rand write: avg 2619 stddev 5894
- seq read: avg 142 stddev 57
- rand read: avg 140 stddev 57  

That's on a 32G FC testbed running against a brd target, with fio
running 48 threads. So the promise is met: latency goes down, and
we're even able to control the standard deviation via the 'decay'
parameter.
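
As an aside on why 'decay' mainly moves the standard deviation rather
than the average: a decaying average damps outliers. The toy program
below is only an assumption about how the parameter behaves (the actual
blk-nlatency code may average differently), and it models just the
tracker's view of a single spiky latency stream, not the path-selection
feedback that produces the fio numbers above; still, it shows the
spread shrinking while the mean stays put (link with -lm):

  /* Toy model: a decaying average narrows the spread, not the mean. */
  #include <math.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define NR_SAMPLES 100000

  static void run(unsigned int decay)
  {
      double avg = 0.0, sum = 0.0, sumsq = 0.0;
      double mean, var;
      int i;

      srand(1);    /* fixed seed: identical raw stream for every run */
      for (i = 0; i < NR_SAMPLES; i++) {
          /* raw completion latency: ~150 usecs with rare 5 ms spikes */
          double lat = 140 + rand() % 20 + (rand() % 100 ? 0 : 5000);

          if (decay)
              avg = (avg * decay + lat) / (decay + 1);
          else
              avg = lat;    /* decay 0: no smoothing at all */
          sum += avg;
          sumsq += avg * avg;
      }
      mean = sum / NR_SAMPLES;
      var = sumsq / NR_SAMPLES - mean * mean;
      printf("decay %2u: avg %.0f stddev %.0f\n", decay, mean, sqrt(var));
  }

  int main(void)
  {
      run(0);     /* raw latencies */
      run(10);    /* roughly the 'decay=10' setting above */
      return 0;
  }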

As usual, comments and reviews are welcome.

Changes to the original version:
- split the rqos debugfs entries
- modify the commit message to indicate latency
- rename to blk-nlatency

Hannes Reinecke (2):
  block: track per-node I/O latency
  nvme: add 'latency' iopolicy

 block/Kconfig                 |   6 +
 block/Makefile                |   1 +
 block/blk-mq-debugfs.c        |   2 +
 block/blk-nlatency.c          | 388 ++++++++++++++++++++++++++++++++++
 block/blk-rq-qos.h            |   6 +
 drivers/nvme/host/multipath.c |  57 ++++-
 drivers/nvme/host/nvme.h      |   1 +
 include/linux/blk-mq.h        |  11 +
 8 files changed, 465 insertions(+), 7 deletions(-)
 create mode 100644 block/blk-nlatency.c

Comments

Keith Busch April 4, 2024, 9:14 p.m. UTC | #1
On Wed, Apr 03, 2024 at 04:17:54PM +0200, Hannes Reinecke wrote:
> Hi all,
> 
> there had been several attempts to implement a latency-based I/O
> scheduler for native nvme multipath, all of which had its issues.
> 
> So time to start afresh, this time using the QoS framework
> already present in the block layer.
> It consists of two parts:
> - a new 'blk-nlatency' QoS module, which is just a simple per-node
>   latency tracker
> - a 'latency' nvme I/O policy
 
Whatever happened with the io-depth based path selector? That should
naturally align with the lower latency path, and that metric is cheaper
to track.
Hannes Reinecke April 5, 2024, 6:21 a.m. UTC | #2
On 4/4/24 23:14, Keith Busch wrote:
> On Wed, Apr 03, 2024 at 04:17:54PM +0200, Hannes Reinecke wrote:
>> Hi all,
>>
>> there had been several attempts to implement a latency-based I/O
>> scheduler for native nvme multipath, all of which had its issues.
>>
>> So time to start afresh, this time using the QoS framework
>> already present in the block layer.
>> It consists of two parts:
>> - a new 'blk-nlatency' QoS module, which is just a simple per-node
>>    latency tracker
>> - a 'latency' nvme I/O policy
>   
> Whatever happened with the io-depth based path selector? That should
> naturally align with the lower latency path, and that metric is cheaper
> to track.

Turns out that tracking the queue depth (on the NVMe level) always
requires an atomic, and with that comes a performance impact.
The qos/blk-stat framework is already present, and as the numbers
show, it actually leads to a performance improvement.

So I'm not quite sure what the argument 'cheaper to track' buys us here...

Cheers,

Hannes
Keith Busch April 5, 2024, 3:03 p.m. UTC | #3
On Fri, Apr 05, 2024 at 08:21:14AM +0200, Hannes Reinecke wrote:
> On 4/4/24 23:14, Keith Busch wrote:
> > On Wed, Apr 03, 2024 at 04:17:54PM +0200, Hannes Reinecke wrote:
> > > Hi all,
> > > 
> > > there had been several attempts to implement a latency-based I/O
> > > scheduler for native nvme multipath, all of which had its issues.
> > > 
> > > So time to start afresh, this time using the QoS framework
> > > already present in the block layer.
> > > It consists of two parts:
> > > - a new 'blk-nlatency' QoS module, which is just a simple per-node
> > >    latency tracker
> > > - a 'latency' nvme I/O policy
> > Whatever happened with the io-depth based path selector? That should
> > naturally align with the lower latency path, and that metric is cheaper
> > to track.
> 
> Turns out that tracking queue depth (on the NVMe level) always requires
> an atomic, and with that a performance impact.
> The qos/blk-stat framework is already present, and as the numbers show
> actually leads to a performance improvement.
> 
> So I'm not quite sure what the argument 'cheaper to track' buys us here...

I was considering the blk_stat framework compared to those atomic
operations. I usually don't enable stats because all the extra
ktime_get_ns() and indirect calls are relatively costly. If you're
enabling stats anyway though, then yeah, I guess I don't really have a
point and your idea here seems pretty reasonable.
Hannes Reinecke April 5, 2024, 3:36 p.m. UTC | #4
On 4/5/24 17:03, Keith Busch wrote:
> On Fri, Apr 05, 2024 at 08:21:14AM +0200, Hannes Reinecke wrote:
>> On 4/4/24 23:14, Keith Busch wrote:
>>> On Wed, Apr 03, 2024 at 04:17:54PM +0200, Hannes Reinecke wrote:
>>>> Hi all,
>>>>
>>>> there had been several attempts to implement a latency-based I/O
>>>> scheduler for native nvme multipath, all of which had its issues.
>>>>
>>>> So time to start afresh, this time using the QoS framework
>>>> already present in the block layer.
>>>> It consists of two parts:
>>>> - a new 'blk-nlatency' QoS module, which is just a simple per-node
>>>>     latency tracker
>>>> - a 'latency' nvme I/O policy
>>> Whatever happened with the io-depth based path selector? That should
>>> naturally align with the lower latency path, and that metric is cheaper
>>> to track.
>>
>> Turns out that tracking queue depth (on the NVMe level) always requires
>> an atomic, and with that a performance impact.
>> The qos/blk-stat framework is already present, and as the numbers show
>> actually leads to a performance improvement.
>>
>> So I'm not quite sure what the argument 'cheaper to track' buys us here...
> 
> I was considering the blk_stat framework compared to those atomic
> operations. I usually don't enable stats because all the extra
> ktime_get_ns() and indirect calls are relatively costly. If you're
> enabling stats anyway though, then yeah, I guess I don't really have a
> point and your idea here seems pretty reasonable.

Pretty much. Of course you need stats to be enabled.
And the problem with the queue depth is that it's actually quite costly
to compute; the whole sbitmap thingie is precisely there to _avoid_
having to track the queue depth.
I can't really see how one could track the queue depth efficiently;
the beauty of the blk_stat framework is that it runs asynchronously,
and the latency is only calculated after the I/O has completed.
We could do a 'mock' queue depth by calculating the difference between
submitted and completed I/O, but even then you'd have to inject a call
into the hot path to track the number of submissions.
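
To spell out the contrast, here is a rough user-space sketch of the two
bookkeeping schemes, with made-up names and C11 atomics standing in for
the kernel's atomic ops (this is not code from either proposal): the
'mock' queue depth needs an atomic update on every submission and every
completion, while the latency average is only folded in once the I/O
has completed, on top of the start timestamp that blk_stat already
records when stats are enabled.

  /* Conceptual contrast only -- plain user-space C, made-up names. */
  #include <stdatomic.h>
  #include <stdio.h>

  struct path_stats {
      /* 'mock' queue depth: an atomic in BOTH hot paths */
      atomic_long submitted;
      atomic_long completed;

      /* latency tracking: only updated once the I/O has completed */
      unsigned long avg_lat_us;
  };

  static struct path_stats p;

  /* io-depth scheme: one extra atomic op per submission... */
  static void iodepth_submit(void)
  {
      atomic_fetch_add_explicit(&p.submitted, 1, memory_order_relaxed);
  }

  /* ...and another per completion */
  static void iodepth_complete(void)
  {
      atomic_fetch_add_explicit(&p.completed, 1, memory_order_relaxed);
  }

  static long iodepth(void)
  {
      return atomic_load_explicit(&p.submitted, memory_order_relaxed) -
             atomic_load_explicit(&p.completed, memory_order_relaxed);
  }

  /*
   * latency scheme: nothing extra on the submission side; the average
   * is folded in after completion, off the fast path (in the kernel
   * this would happen from the async blk_stat callback).
   */
  static void latency_complete(unsigned long lat_us, unsigned long decay)
  {
      if (!p.avg_lat_us)
          p.avg_lat_us = lat_us;
      else
          p.avg_lat_us = (p.avg_lat_us * decay + lat_us) / (decay + 1);
  }

  int main(void)
  {
      iodepth_submit();
      latency_complete(150, 4);
      iodepth_complete();
      printf("depth %ld, avg latency %lu us\n", iodepth(), p.avg_lat_us);
      return 0;
  }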

In the end, the latency tracker did what I wanted to achieve (namely,
balancing out uneven paths), _and_ turned out faster than round-robin,
so I didn't pursue queue-depth tracking.

Cheers,

Hannes
Sagi Grimberg April 7, 2024, 7:55 p.m. UTC | #5
On 05/04/2024 18:36, Hannes Reinecke wrote:
> On 4/5/24 17:03, Keith Busch wrote:
>> On Fri, Apr 05, 2024 at 08:21:14AM +0200, Hannes Reinecke wrote:
>>> On 4/4/24 23:14, Keith Busch wrote:
>>>> On Wed, Apr 03, 2024 at 04:17:54PM +0200, Hannes Reinecke wrote:
>>>>> Hi all,
>>>>>
>>>>> there had been several attempts to implement a latency-based I/O
>>>>> scheduler for native nvme multipath, all of which had its issues.
>>>>>
>>>>> So time to start afresh, this time using the QoS framework
>>>>> already present in the block layer.
>>>>> It consists of two parts:
>>>>> - a new 'blk-nlatency' QoS module, which is just a simple per-node
>>>>>     latency tracker
>>>>> - a 'latency' nvme I/O policy
>>>> Whatever happened with the io-depth based path selector? That should
>>>> naturally align with the lower latency path, and that metric is 
>>>> cheaper
>>>> to track.
>>>
>>> Turns out that tracking queue depth (on the NVMe level) always requires
>>> an atomic, and with that a performance impact.
>>> The qos/blk-stat framework is already present, and as the numbers show
>>> actually leads to a performance improvement.
>>>
>>> So I'm not quite sure what the argument 'cheaper to track' buys us 
>>> here...
>>
>> I was considering the blk_stat framework compared to those atomic
>> operations. I usually don't enable stats because all the extra
>> ktime_get_ns() and indirect calls are relatively costly. If you're
>> enabling stats anyway though, then yeah, I guess I don't really have a
>> point and your idea here seems pretty reasonable.
>
> Pretty much. Of course you need stats to be enabled.
> And problem with the queue depth is that it's actually quite costly
> to compute; the while sbitmap thingie is precisely there to _avoid_
> having to track the queue depth.
> I can't really see how one could track the queue depth efficiently;
> the beauty of the blk_stat framework is that it's running async, and
> only calculated after I/O is completed.
> We could do a 'mock' queue depth by calculating the difference between
> submitted and completed I/O, but even then you'd have to inject a call
> in the hot path to track the number of submissions.
>
> In the end, the latency tracker did what I wanted to achieve (namely
> balance out uneven paths), _and_ got faster than round-robin, so I 
> didn't care about queue depth tracking.

Hey Hannes,

I think it's a fair claim that a latency tracker is a valid proxy for
an io-depth tracker.

I think we need Ewan to validate whether this solves the original issue
he was trying to solve with his io-depth mpath selector. If so, I don't
see any major issues with this proposal.