Message ID | 20171221204636.2924-4-keith.busch@intel.com (mailing list archive)
---|---
State | New, archived
On 12/21/2017 01:46 PM, Keith Busch wrote:
> When a request completion is polled, the completion task wakes itself
> up. This is unnecessary, as the task can just set itself back to
> running.
>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  fs/block_dev.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 4a181fcb5175..ce67ffe73430 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -181,7 +181,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
>  	struct task_struct *waiter = bio->bi_private;
>
>  	WRITE_ONCE(bio->bi_private, NULL);
> -	wake_up_process(waiter);
> +	if (current != waiter)
> +		wake_up_process(waiter);
> +	else
> +		__set_current_state(TASK_RUNNING);

Do we actually need to set this to TASK_RUNNING? If we get here we're
already running, right?

Everywhere I see uses of __set_current_state(TASK_RUNNING) it's after
we've done a set_current_state(TASK_INTERRUPTIBLE).

On 12/21/17 1:46 PM, Keith Busch wrote:
> When a request completion is polled, the completion task wakes itself
> up. This is unnecessary, as the task can just set itself back to
> running.

Looks good to me, I can take it for 4.16 in the block tree.

On 12/21/17 1:56 PM, Scott Bauer wrote:
> On 12/21/2017 01:46 PM, Keith Busch wrote:
>> When a request completion is polled, the completion task wakes itself
>> up. This is unnecessary, as the task can just set itself back to
>> running.
>>
>> Signed-off-by: Keith Busch <keith.busch@intel.com>
>> ---
>>  fs/block_dev.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 4a181fcb5175..ce67ffe73430 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -181,7 +181,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
>>  	struct task_struct *waiter = bio->bi_private;
>>
>>  	WRITE_ONCE(bio->bi_private, NULL);
>> -	wake_up_process(waiter);
>> +	if (current != waiter)
>> +		wake_up_process(waiter);
>> +	else
>> +		__set_current_state(TASK_RUNNING);
>
> Do we actually need to set this to TASK_RUNNING? If we get here we're
> already running, right?
>
> Everywhere I see uses of __set_current_state(TASK_RUNNING) it's after
> we've done a set_current_state(TASK_INTERRUPTIBLE).

We'd only be TASK_RUNNING if the IRQ got to it first. And that's
something that should be removed as well - I suspect that'd be a much
bigger win, getting rid of the IRQ trigger for polled IO, than most of
the micro optimizations. For Keith's testing, it looks like he reduced
the cost by turning on coalescing, but it'd be cheaper (and better) to
not have to rely on that.

In any case, the __set_current_state() is very cheap. Given that and
the above, I'd rather keep it.

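For context, the waiter side this end_io callback pairs with is the
submit-and-poll loop in __blkdev_direct_IO_simple(). A paraphrased
sketch of that loop as of this era (not a verbatim copy) shows why a
plain state flip is enough: the task marks itself TASK_UNINTERRUPTIBLE
before each poll, and blk_poll() can end up running the end_io callback
in that same task's context.

	qc = submit_bio(&bio);
	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		/* end_io clears bi_private; that is the completion check */
		if (!READ_ONCE(bio.bi_private))
			break;
		/*
		 * blk_poll() reaps completions in this task's context,
		 * so it may invoke blkdev_bio_end_io_simple() with
		 * current == waiter - the case the patch optimizes.
		 */
		if (!(iocb->ki_flags & IOCB_HIPRI) ||
		    !blk_poll(bdev_get_queue(bdev), qc))
			io_schedule();
	}
	__set_current_state(TASK_RUNNING);
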
On Thu, Dec 21, 2017 at 02:00:04PM -0700, Jens Axboe wrote:
> On 12/21/17 1:56 PM, Scott Bauer wrote:
>> On 12/21/2017 01:46 PM, Keith Busch wrote:
>>> @@ -181,7 +181,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
>>>  	struct task_struct *waiter = bio->bi_private;
>>>
>>>  	WRITE_ONCE(bio->bi_private, NULL);
>>> -	wake_up_process(waiter);
>>> +	if (current != waiter)
>>> +		wake_up_process(waiter);
>>> +	else
>>> +		__set_current_state(TASK_RUNNING);
>>
>> Do we actually need to set this to TASK_RUNNING? If we get here we're
>> already running, right?
>>
>> Everywhere I see uses of __set_current_state(TASK_RUNNING) it's after
>> we've done a set_current_state(TASK_INTERRUPTIBLE).
>
> We'd only be TASK_RUNNING if the IRQ got to it first. And that's
> something that should be removed as well - I suspect that'd be a much
> bigger win, getting rid of the IRQ trigger for polled IO, than most of
> the micro optimizations. For Keith's testing, it looks like he reduced
> the cost by turning on coalescing, but it'd be cheaper (and better) to
> not have to rely on that.

It would be nice, but the driver doesn't know a request's completion is
going to be polled. Even if it did, we don't have a spec defined way to
tell the controller not to send an interrupt with this command's
completion, which would be negated anyway if any interrupt driven IO is
mixed in the same queue. We could possibly create a special queue with
interrupts disabled for this purpose if we can pass the HIPRI hint
through the request.

On 12/21/17 2:34 PM, Keith Busch wrote:
> On Thu, Dec 21, 2017 at 02:00:04PM -0700, Jens Axboe wrote:
>> We'd only be TASK_RUNNING if the IRQ got to it first. And that's
>> something that should be removed as well - I suspect that'd be a much
>> bigger win, getting rid of the IRQ trigger for polled IO, than most
>> of the micro optimizations. For Keith's testing, it looks like he
>> reduced the cost by turning on coalescing, but it'd be cheaper (and
>> better) to not have to rely on that.
>
> It would be nice, but the driver doesn't know a request's completion
> is going to be polled.

That's trivially solvable though, since the information is available at
submission time.

> Even if it did, we don't have a spec defined way to tell the
> controller not to send an interrupt with this command's completion,
> which would be negated anyway if any interrupt driven IO is mixed in
> the same queue. We could possibly create a special queue with
> interrupts disabled for this purpose if we can pass the HIPRI hint
> through the request.

There's no way to do it per IO, right. But you can create an sq/cq pair
without interrupts enabled. This would also allow you to scale better
with multiple users of polling, a case where we currently don't perform
as well as spdk, for instance.

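The queue-level knob does already exist in the spec: the Create I/O
Completion Queue command carries an interrupts-enabled bit in its queue
flags. As a rough sketch, modeled on the driver's adapter_alloc_cq()
but untested and purely illustrative, a polled-only CQ would simply
leave that bit clear:

	static int adapter_alloc_polled_cq(struct nvme_dev *dev, u16 qid,
					   struct nvme_queue *nvmeq)
	{
		struct nvme_command c;
		/* No NVME_CQ_IRQ_ENABLED: completions must be polled */
		int flags = NVME_QUEUE_PHYS_CONTIG;

		memset(&c, 0, sizeof(c));
		c.create_cq.opcode = nvme_admin_create_cq;
		c.create_cq.prp1 = cpu_to_le64(nvmeq->cq_dma_addr);
		c.create_cq.cqid = cpu_to_le16(qid);
		c.create_cq.qsize = cpu_to_le16(nvmeq->q_depth - 1);
		c.create_cq.cq_flags = cpu_to_le16(flags);
		/* no IRQ vector needed for a queue that never interrupts */
		c.create_cq.irq_vector = 0;

		return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
	}
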
On Thu, Dec 21, 2017 at 03:17:41PM -0700, Jens Axboe wrote:
> On 12/21/17 2:34 PM, Keith Busch wrote:
>> It would be nice, but the driver doesn't know a request's completion
>> is going to be polled.
>
> That's trivially solvable though, since the information is available
> at submission time.
>
>> Even if it did, we don't have a spec defined way to tell the
>> controller not to send an interrupt with this command's completion,
>> which would be negated anyway if any interrupt driven IO is mixed in
>> the same queue. We could possibly create a special queue with
>> interrupts disabled for this purpose if we can pass the HIPRI hint
>> through the request.
>
> There's no way to do it per IO, right. But you can create an sq/cq
> pair without interrupts enabled. This would also allow you to scale
> better with multiple users of polling, a case where we currently
> don't perform as well as spdk, for instance.

Would you be open to having blk-mq provide special hi-pri hardware
contexts for all these requests to come through? Maybe one per NUMA
node? If not, I don't think we have enough unused bits in the NVMe
command id to stash the hctx id to extract the original request.

On 12/21/17 4:10 PM, Keith Busch wrote:
> On Thu, Dec 21, 2017 at 03:17:41PM -0700, Jens Axboe wrote:
>> On 12/21/17 2:34 PM, Keith Busch wrote:
>>> It would be nice, but the driver doesn't know a request's completion
>>> is going to be polled.
>>
>> That's trivially solvable though, since the information is available
>> at submission time.
>>
>>> Even if it did, we don't have a spec defined way to tell the
>>> controller not to send an interrupt with this command's completion,
>>> which would be negated anyway if any interrupt driven IO is mixed in
>>> the same queue. We could possibly create a special queue with
>>> interrupts disabled for this purpose if we can pass the HIPRI hint
>>> through the request.
>>
>> There's no way to do it per IO, right. But you can create an sq/cq
>> pair without interrupts enabled. This would also allow you to scale
>> better with multiple users of polling, a case where we currently
>> don't perform as well as spdk, for instance.
>
> Would you be open to having blk-mq provide special hi-pri hardware
> contexts for all these requests to come through? Maybe one per NUMA
> node? If not, I don't think we have enough unused bits in the NVMe
> command id to stash the hctx id to extract the original request.

Yeah, in fact I think we HAVE to do it this way. I've been thinking
about this for a while, and ideally I'd really like blk-mq to support
multiple queue "pools". It's basically just a mapping thing. Right now
you hand blk-mq all your queues, and the mappings are defined for one
set of queues. It'd be nifty to support multiple sets, so we could do
things like "reserve X for polling", for example, and just have the
mappings magically work. blk_mq_map_queue() then just needs to take the
bio or request (or just cmd flags) to be able to decide what set the
request belongs to, making the mapping a function of {cpu,type}.

I originally played with this in the context of isolating writes on a
single queue, to reduce the amount of resources they can grab. And it'd
work nicely for this as well. Completions could be configurable to
where the submitter does it (like now, best for a sync single thread),
or to where one or more kernel threads do it (spdk'ish, best for high
queue depth / thread count).

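Nothing like this existed in blk-mq at the time of this thread, but the
shape of the idea is easy to sketch. All names below are hypothetical;
REQ_HIPRI_HINT stands in for whatever per-request flag would carry the
polling hint:

	/* Hypothetical {cpu,type} mapping; none of these names are real. */
	enum hctx_type {
		HCTX_TYPE_DEFAULT,	/* interrupt-driven set */
		HCTX_TYPE_POLL,		/* set reserved for polled IO */
		HCTX_MAX_TYPES,
	};

	struct queue_map {
		unsigned int nr_queues;
		unsigned int *mq_map;	/* per-cpu: cpu -> hw queue index */
	};

	/* One map per set of queues instead of a single global map. */
	static struct queue_map maps[HCTX_MAX_TYPES];

	/*
	 * The mapping becomes a function of {cpu,type}: the cmd flags
	 * select the set, the cpu selects the queue within that set.
	 */
	static unsigned int map_queue(unsigned int cmd_flags, unsigned int cpu)
	{
		enum hctx_type type = (cmd_flags & REQ_HIPRI_HINT) ?
				HCTX_TYPE_POLL : HCTX_TYPE_DEFAULT;

		return maps[type].mq_map[cpu];
	}
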
On Thu, Dec 21, 2017 at 02:34:00PM -0700, Keith Busch wrote:
> It would be nice, but the driver doesn't know a request's completion
> is going to be polled.

We can trivially set a REQ_POLL bit. In fact my initial patch kit had
one on the insistence of Jens, but then I removed it because we had no
users for it.

> Even if it did, we don't have a spec defined way to tell the
> controller not to send an interrupt with this command's completion,
> which would be negated anyway if any interrupt driven IO is mixed in
> the same queue.

Time to add such a flag to the spec then..

> We could possibly create a special queue with interrupts disabled for
> this purpose if we can pass the HIPRI hint through the request.

Eww. That means we double our resource usage both in the host and the
device, which isn't going to be pretty.

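The submission side really is trivial, since the caller's intent is
already in the kiocb. A sketch, with REQ_POLL being the hypothetical
flag mentioned above (it did not exist upstream at this point, while
IOCB_HIPRI did):

	/*
	 * Hypothetical: tag a bio at submission when the caller will
	 * poll for its completion. IOCB_HIPRI is real; REQ_POLL is not.
	 */
	static void bio_set_poll_hint(struct bio *bio, struct kiocb *iocb)
	{
		if (iocb->ki_flags & IOCB_HIPRI)
			bio->bi_opf |= REQ_POLL;
	}
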
This looks fine, but we also need the same in blkdev_bio_end_io and iomap_dio_bio_end_io. Probably best to add a little helper.
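The helper might look something like this (a sketch only; the name is
invented here, and the body just lifts the logic from the patch so all
three end_io paths could share it):

	static inline void blk_wake_io_task(struct task_struct *waiter)
	{
		/*
		 * A polling waiter reaps the completion itself, so a
		 * full wakeup is unnecessary; marking it runnable again
		 * is enough. Anyone else gets a real wakeup.
		 */
		if (waiter == current)
			__set_current_state(TASK_RUNNING);
		else
			wake_up_process(waiter);
	}
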
On Fri, Dec 29, 2017 at 10:50:21AM +0100, Christoph Hellwig wrote:
>> We could possibly create a special queue with interrupts disabled
>> for this purpose if we can pass the HIPRI hint through the request.
>
> Eww. That means we double our resource usage both in the host and
> the device, which isn't going to be pretty.

Yes, this would create some trouble for our simple policy for divvying
up queue resources. We wouldn't be able to use the PCI IRQ affinity for
these queues' CPU map, for example. Maybe if we can figure out a good
policy for WRR queues, that might provide a similar infrastructure for
supporting polled queues as well.

On 12/29/2017 11:50 AM, Christoph Hellwig wrote:
> On Thu, Dec 21, 2017 at 02:34:00PM -0700, Keith Busch wrote:
>> It would be nice, but the driver doesn't know a request's completion
>> is going to be polled.
>
> We can trivially set a REQ_POLL bit. In fact my initial patch kit had
> one on the insistence of Jens, but then I removed it because we had
> no users for it.
>
>> Even if it did, we don't have a spec defined way to tell the
>> controller not to send an interrupt with this command's completion,
>> which would be negated anyway if any interrupt driven IO is mixed in
>> the same queue.
>
> Time to add such a flag to the spec then..

That would be very useful. Ideally we could also hook it into libaio to
submit without triggering an interrupt, and have io_getevents() poll
the underlying bdev (blk_poll), similar to how the net stack implements
low latency sockets [1]. Having the ability to suppress interrupts and
poll for completions would be very useful for file servers or targets
living in userspace.

[1] https://lwn.net/Articles/551284/

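For per-IO polling from userspace, the interface that already exists at
this point is preadv2()/pwritev2() with RWF_HIPRI on an O_DIRECT file
descriptor, which takes the IOCB_HIPRI/blk_poll path discussed in this
thread. A minimal sketch (the device path is only an example):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <sys/uio.h>
	#include <unistd.h>

	#ifndef RWF_HIPRI
	#define RWF_HIPRI 0x00000001	/* from the linux/fs.h uapi */
	#endif

	int main(void)
	{
		struct iovec iov;
		void *buf;
		int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);

		if (fd < 0 || posix_memalign(&buf, 4096, 4096))
			return 1;
		iov.iov_base = buf;
		iov.iov_len = 4096;

		/* RWF_HIPRI: spin in blk_poll() instead of sleeping on an IRQ */
		ssize_t ret = preadv2(fd, &iov, 1, 0, RWF_HIPRI);

		close(fd);
		return ret == 4096 ? 0 : 1;
	}
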
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4a181fcb5175..ce67ffe73430 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -181,7 +181,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
 	struct task_struct *waiter = bio->bi_private;
 
 	WRITE_ONCE(bio->bi_private, NULL);
-	wake_up_process(waiter);
+	if (current != waiter)
+		wake_up_process(waiter);
+	else
+		__set_current_state(TASK_RUNNING);
 }
 
 static ssize_t

When a request completion is polled, the completion task wakes itself
up. This is unnecessary, as the task can just set itself back to
running.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 fs/block_dev.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)