Message ID | 20171221204636.2924-4-keith.busch@intel.com (mailing list archive)
---|---
State | New, archived
On 12/21/2017 01:46 PM, Keith Busch wrote:
> When a request completion is polled, the completion task wakes itself
> up. This is unnecessary, as the task can just set itself back to
> running.
>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  fs/block_dev.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 4a181fcb5175..ce67ffe73430 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -181,7 +181,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
>  	struct task_struct *waiter = bio->bi_private;
>
>  	WRITE_ONCE(bio->bi_private, NULL);
> -	wake_up_process(waiter);
> +	if (current != waiter)
> +		wake_up_process(waiter);
> +	else
> +		__set_current_state(TASK_RUNNING);

Do we actually need to set this to TASK_RUNNING? If we get here we're
already running, right?

Everywhere I see uses of __set_current_state(TASK_RUNNING) it's after
we've done a set_current_state(TASK_INTERRUPTIBLE).

On 12/21/17 1:46 PM, Keith Busch wrote:
> When a request completion is polled, the completion task wakes itself
> up. This is unnecessary, as the task can just set itself back to
> running.

Looks good to me, I can take it for 4.16 in the block tree.

On 12/21/17 1:56 PM, Scott Bauer wrote:
> On 12/21/2017 01:46 PM, Keith Busch wrote:
>> When a request completion is polled, the completion task wakes itself
>> up. This is unnecessary, as the task can just set itself back to
>> running.
>>
>> Signed-off-by: Keith Busch <keith.busch@intel.com>
>> ---
>>  fs/block_dev.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 4a181fcb5175..ce67ffe73430 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -181,7 +181,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
>>  	struct task_struct *waiter = bio->bi_private;
>>
>>  	WRITE_ONCE(bio->bi_private, NULL);
>> -	wake_up_process(waiter);
>> +	if (current != waiter)
>> +		wake_up_process(waiter);
>> +	else
>> +		__set_current_state(TASK_RUNNING);
>
> Do we actually need to set this to TASK_RUNNING? If we get here we're
> already running, right?
>
> Everywhere I see uses of __set_current_state(TASK_RUNNING) it's after
> we've done a set_current_state(TASK_INTERRUPTIBLE).

We'd only be TASK_RUNNING if the IRQ got to it first. And that's
something that should be removed as well - I suspect that'd be a much
bigger win, getting rid of the IRQ trigger for polled IO, than most of
the micro optimizations. For Keith's testing, it looks like he reduced
the cost by turning on coalescing, but it'd be cheaper (and better) to
not have to rely on that.

In any case, the __set_current_state() is very cheap. Given that and
the above, I'd rather keep it.

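For context, the waiter side this end_io callback pairs with is the
submit-and-poll loop in __blkdev_direct_IO_simple(). A paraphrased
sketch of that loop as of this era (not a verbatim copy) shows why a
plain state flip is enough: the task marks itself TASK_UNINTERRUPTIBLE
before each poll, and blk_poll() can end up running the end_io callback
in that same task's context.

	qc = submit_bio(&bio);
	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		/* end_io clears bi_private; that is the completion check */
		if (!READ_ONCE(bio.bi_private))
			break;
		/*
		 * blk_poll() reaps completions in this task's context,
		 * so it may invoke blkdev_bio_end_io_simple() with
		 * current == waiter - the case the patch optimizes.
		 */
		if (!(iocb->ki_flags & IOCB_HIPRI) ||
		    !blk_poll(bdev_get_queue(bdev), qc))
			io_schedule();
	}
	__set_current_state(TASK_RUNNING);
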
On Thu, Dec 21, 2017 at 02:00:04PM -0700, Jens Axboe wrote:
> On 12/21/17 1:56 PM, Scott Bauer wrote:
>> On 12/21/2017 01:46 PM, Keith Busch wrote:
>>> @@ -181,7 +181,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
>>>  	struct task_struct *waiter = bio->bi_private;
>>>
>>>  	WRITE_ONCE(bio->bi_private, NULL);
>>> -	wake_up_process(waiter);
>>> +	if (current != waiter)
>>> +		wake_up_process(waiter);
>>> +	else
>>> +		__set_current_state(TASK_RUNNING);
>>
>> Do we actually need to set this to TASK_RUNNING? If we get here we're
>> already running, right?
>>
>> Everywhere I see uses of __set_current_state(TASK_RUNNING) it's after
>> we've done a set_current_state(TASK_INTERRUPTIBLE).
>
> We'd only be TASK_RUNNING if the IRQ got to it first. And that's
> something that should be removed as well - I suspect that'd be a much
> bigger win, getting rid of the IRQ trigger for polled IO, than most of
> the micro optimizations. For Keith's testing, it looks like he reduced
> the cost by turning on coalescing, but it'd be cheaper (and better) to
> not have to rely on that.

It would be nice, but the driver doesn't know a request's completion is
going to be polled. Even if it did, we don't have a spec defined way to
tell the controller not to send an interrupt with this command's
completion, which would be negated anyway if any interrupt driven IO is
mixed in the same queue. We could possibly create a special queue with
interrupts disabled for this purpose if we can pass the HIPRI hint
through the request.

On 12/21/17 2:34 PM, Keith Busch wrote:
> On Thu, Dec 21, 2017 at 02:00:04PM -0700, Jens Axboe wrote:
>> We'd only be TASK_RUNNING if the IRQ got to it first. And that's
>> something that should be removed as well - I suspect that'd be a much
>> bigger win, getting rid of the IRQ trigger for polled IO, than most
>> of the micro optimizations. For Keith's testing, it looks like he
>> reduced the cost by turning on coalescing, but it'd be cheaper (and
>> better) to not have to rely on that.
>
> It would be nice, but the driver doesn't know a request's completion
> is going to be polled.

That's trivially solvable though, since the information is available at
submission time.

> Even if it did, we don't have a spec defined way to tell the
> controller not to send an interrupt with this command's completion,
> which would be negated anyway if any interrupt driven IO is mixed in
> the same queue. We could possibly create a special queue with
> interrupts disabled for this purpose if we can pass the HIPRI hint
> through the request.

There's no way to do it per IO, right. But you can create an sq/cq pair
without interrupts enabled. This would also allow you to scale better
with multiple users of polling, a case where we currently don't perform
as well as spdk, for instance.

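The queue-level knob does already exist in the spec: the Create I/O
Completion Queue command carries an interrupts-enabled bit in its queue
flags. As a rough sketch, modeled on the driver's adapter_alloc_cq()
but untested and purely illustrative, a polled-only CQ would simply
leave that bit clear:

	static int adapter_alloc_polled_cq(struct nvme_dev *dev, u16 qid,
					   struct nvme_queue *nvmeq)
	{
		struct nvme_command c;
		/* No NVME_CQ_IRQ_ENABLED: completions must be polled */
		int flags = NVME_QUEUE_PHYS_CONTIG;

		memset(&c, 0, sizeof(c));
		c.create_cq.opcode = nvme_admin_create_cq;
		c.create_cq.prp1 = cpu_to_le64(nvmeq->cq_dma_addr);
		c.create_cq.cqid = cpu_to_le16(qid);
		c.create_cq.qsize = cpu_to_le16(nvmeq->q_depth - 1);
		c.create_cq.cq_flags = cpu_to_le16(flags);
		/* no IRQ vector needed for a queue that never interrupts */
		c.create_cq.irq_vector = 0;

		return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
	}
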
On Thu, Dec 21, 2017 at 03:17:41PM -0700, Jens Axboe wrote:
> On 12/21/17 2:34 PM, Keith Busch wrote:
>> It would be nice, but the driver doesn't know a request's completion
>> is going to be polled.
>
> That's trivially solvable though, since the information is available
> at submission time.
>
>> Even if it did, we don't have a spec defined way to tell the
>> controller not to send an interrupt with this command's completion,
>> which would be negated anyway if any interrupt driven IO is mixed in
>> the same queue. We could possibly create a special queue with
>> interrupts disabled for this purpose if we can pass the HIPRI hint
>> through the request.
>
> There's no way to do it per IO, right. But you can create an sq/cq
> pair without interrupts enabled. This would also allow you to scale
> better with multiple users of polling, a case where we currently
> don't perform as well as spdk, for instance.

Would you be open to having blk-mq provide special hi-pri hardware
contexts for all these requests to come through? Maybe one per NUMA
node? If not, I don't think we have enough unused bits in the NVMe
command id to stash the hctx id to extract the original request.

On 12/21/17 4:10 PM, Keith Busch wrote:
> On Thu, Dec 21, 2017 at 03:17:41PM -0700, Jens Axboe wrote:
>> On 12/21/17 2:34 PM, Keith Busch wrote:
>>> It would be nice, but the driver doesn't know a request's completion
>>> is going to be polled.
>>
>> That's trivially solvable though, since the information is available
>> at submission time.
>>
>>> Even if it did, we don't have a spec defined way to tell the
>>> controller not to send an interrupt with this command's completion,
>>> which would be negated anyway if any interrupt driven IO is mixed in
>>> the same queue. We could possibly create a special queue with
>>> interrupts disabled for this purpose if we can pass the HIPRI hint
>>> through the request.
>>
>> There's no way to do it per IO, right. But you can create an sq/cq
>> pair without interrupts enabled. This would also allow you to scale
>> better with multiple users of polling, a case where we currently
>> don't perform as well as spdk, for instance.
>
> Would you be open to having blk-mq provide special hi-pri hardware
> contexts for all these requests to come through? Maybe one per NUMA
> node? If not, I don't think we have enough unused bits in the NVMe
> command id to stash the hctx id to extract the original request.

Yeah, in fact I think we HAVE to do it this way. I've been thinking
about this for a while, and ideally I'd really like blk-mq to support
multiple queue "pools". It's basically just a mapping thing. Right now
you hand blk-mq all your queues, and the mappings are defined for one
set of queues. It'd be nifty to support multiple sets, so we could do
things like "reserve X for polling", for example, and just have the
mappings magically work. blk_mq_map_queue() then just needs to take the
bio or request (or just cmd flags) to be able to decide what set the
request belongs to, making the mapping a function of {cpu,type}.

I originally played with this in the context of isolating writes on a
single queue, to reduce the amount of resources they can grab. And it'd
work nicely for this as well. Completions could be configurable to
where the submitter does it (like now, best for a sync single thread),
or to where one or more kernel threads do it (spdk'ish, best for high
queue depth / thread count).

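Nothing like this existed in blk-mq at the time of this thread, but the
shape of the idea is easy to sketch. All names below are hypothetical;
REQ_HIPRI_HINT stands in for whatever per-request flag would carry the
polling hint:

	/* Hypothetical {cpu,type} mapping; none of these names are real. */
	enum hctx_type {
		HCTX_TYPE_DEFAULT,	/* interrupt-driven set */
		HCTX_TYPE_POLL,		/* set reserved for polled IO */
		HCTX_MAX_TYPES,
	};

	struct queue_map {
		unsigned int nr_queues;
		unsigned int *mq_map;	/* per-cpu: cpu -> hw queue index */
	};

	/* One map per set of queues instead of a single global map. */
	static struct queue_map maps[HCTX_MAX_TYPES];

	/*
	 * The mapping becomes a function of {cpu,type}: the cmd flags
	 * select the set, the cpu selects the queue within that set.
	 */
	static unsigned int map_queue(unsigned int cmd_flags, unsigned int cpu)
	{
		enum hctx_type type = (cmd_flags & REQ_HIPRI_HINT) ?
				HCTX_TYPE_POLL : HCTX_TYPE_DEFAULT;

		return maps[type].mq_map[cpu];
	}
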
On Thu, Dec 21, 2017 at 02:34:00PM -0700, Keith Busch wrote:
> It would be nice, but the driver doesn't know a request's completion
> is going to be polled.

We can trivially set a REQ_POLL bit. In fact my initial patch kit had
one on the insistence of Jens, but then I removed it because we had no
users for it.

> Even if it did, we don't have a spec defined way to tell the
> controller not to send an interrupt with this command's completion,
> which would be negated anyway if any interrupt driven IO is mixed in
> the same queue.

Time to add such a flag to the spec then..

> We could possibly create a special queue with interrupts disabled for
> this purpose if we can pass the HIPRI hint through the request.

Eww. That means we double our resource usage both in the host and the
device, which isn't going to be pretty.

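The submission side really is trivial, since the caller's intent is
already in the kiocb. A sketch, with REQ_POLL being the hypothetical
flag mentioned above (it did not exist upstream at this point, while
IOCB_HIPRI did):

	/*
	 * Hypothetical: tag a bio at submission when the caller will
	 * poll for its completion. IOCB_HIPRI is real; REQ_POLL is not.
	 */
	static void bio_set_poll_hint(struct bio *bio, struct kiocb *iocb)
	{
		if (iocb->ki_flags & IOCB_HIPRI)
			bio->bi_opf |= REQ_POLL;
	}
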
This looks fine, but we also need the same in blkdev_bio_end_io and iomap_dio_bio_end_io. Probably best to add a little helper.
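The helper might look something like this (a sketch only; the name is
invented here, and the body just lifts the logic from the patch so all
three end_io paths could share it):

	static inline void blk_wake_io_task(struct task_struct *waiter)
	{
		/*
		 * A polling waiter reaps the completion itself, so a
		 * full wakeup is unnecessary; marking it runnable again
		 * is enough. Anyone else gets a real wakeup.
		 */
		if (waiter == current)
			__set_current_state(TASK_RUNNING);
		else
			wake_up_process(waiter);
	}
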
On Fri, Dec 29, 2017 at 10:50:21AM +0100, Christoph Hellwig wrote:
>> We could possibly create a special queue with interrupts disabled
>> for this purpose if we can pass the HIPRI hint through the request.
>
> Eww. That means we double our resource usage both in the host and
> the device, which isn't going to be pretty.

Yes, this would create some trouble for our simple policy for divvying
up queue resources. We wouldn't be able to use the PCI IRQ affinity for
these queues' CPU map, for example. Maybe if we can figure out a good
policy for WRR queues, that might provide a similar infrastructure for
supporting polled queues as well.

On 12/29/2017 11:50 AM, Christoph Hellwig wrote:
> On Thu, Dec 21, 2017 at 02:34:00PM -0700, Keith Busch wrote:
>> It would be nice, but the driver doesn't know a request's completion
>> is going to be polled.
>
> We can trivially set a REQ_POLL bit. In fact my initial patch kit had
> one on the insistence of Jens, but then I removed it because we had
> no users for it.
>
>> Even if it did, we don't have a spec defined way to tell the
>> controller not to send an interrupt with this command's completion,
>> which would be negated anyway if any interrupt driven IO is mixed in
>> the same queue.
>
> Time to add such a flag to the spec then..

That would be very useful. Ideally we could also hook it into libaio to
submit without triggering an interrupt, and have io_getevents() poll
the underlying bdev (blk_poll), similar to how the net stack implements
low latency sockets [1]. Having the ability to suppress interrupts and
poll for completions would be very useful for file servers or targets
living in userspace.

[1] https://lwn.net/Articles/551284/

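For per-IO polling from userspace, the interface that already exists at
this point is preadv2()/pwritev2() with RWF_HIPRI on an O_DIRECT file
descriptor, which takes the IOCB_HIPRI/blk_poll path discussed in this
thread. A minimal sketch (the device path is only an example):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <sys/uio.h>
	#include <unistd.h>

	#ifndef RWF_HIPRI
	#define RWF_HIPRI 0x00000001	/* from the linux/fs.h uapi */
	#endif

	int main(void)
	{
		struct iovec iov;
		void *buf;
		int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);

		if (fd < 0 || posix_memalign(&buf, 4096, 4096))
			return 1;
		iov.iov_base = buf;
		iov.iov_len = 4096;

		/* RWF_HIPRI: spin in blk_poll() instead of sleeping on an IRQ */
		ssize_t ret = preadv2(fd, &iov, 1, 0, RWF_HIPRI);

		close(fd);
		return ret == 4096 ? 0 : 1;
	}
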
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4a181fcb5175..ce67ffe73430 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -181,7 +181,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
 	struct task_struct *waiter = bio->bi_private;
 
 	WRITE_ONCE(bio->bi_private, NULL);
-	wake_up_process(waiter);
+	if (current != waiter)
+		wake_up_process(waiter);
+	else
+		__set_current_state(TASK_RUNNING);
 }
 
 static ssize_t

When a request completion is polled, the completion task wakes itself
up. This is unnecessary, as the task can just set itself back to
running.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 fs/block_dev.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)