Message ID | 1474555980-2787-3-git-send-email-axboe@fb.com (mailing list archive)
---|---
State | New, archived
On Thu, Sep 22, 2016 at 08:53:00AM -0600, Jens Axboe wrote:
> If a driver sets BLK_MQ_F_BLOCKING, it is allowed to block in its
> ->queue_rq() handler. For that case, blk-mq ensures that we always
> call it from a safe context.

First, can you provide a more useful definition of blocking? Lots of
current drivers will already block in ->queue_rq, mostly to allocate
memory.

Second, we looked at something similar a few times in the past, most
notably when converting loop and rbd, and came to the conclusion that
performance sucks when we only have that single per-hctx work struct
for a busy device. I think Ming has done a lot of the benchmarking, so
I've added him to Cc.
On 09/22/2016 08:59 AM, Christoph Hellwig wrote:
> On Thu, Sep 22, 2016 at 08:53:00AM -0600, Jens Axboe wrote:
>> If a driver sets BLK_MQ_F_BLOCKING, it is allowed to block in its
>> ->queue_rq() handler. For that case, blk-mq ensures that we always
>> call it from a safe context.
>
> First, can you provide a more useful definition of blocking? Lots of
> current drivers will already block in ->queue_rq, mostly to allocate
> memory.

Having to grab a mutex, for instance. We invoke ->queue_rq() with
preemption disabled, so I'd hope that would not be the case... What
drivers block in ->queue_rq?

> Second, we looked at something similar a few times in the past, most
> notably when converting loop and rbd, and came to the conclusion that
> performance sucks when we only have that single per-hctx work struct
> for a busy device. I think Ming has done a lot of the benchmarking, so
> I've added him to Cc.

Loop was another case that was on my radar to get rid of the queue_work
it currently has to do. Josef is currently testing the nbd driver using
this approach, so we should get some numbers there too. If we have to,
we can always bump up the concurrency to mimic more of the behavior of
having multiple workers running on the hardware queue. I'd prefer to
still do that in blk-mq, instead of having drivers reinventing their
own work offload functionality.
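For context, here is a minimal sketch of the kind of handler the flag
targets: a ->queue_rq() that must take a per-device mutex before
sending, as nbd does. The mydev_* names, the struct, and the send
helper are hypothetical illustrations, not code from this thread:

```c
#include <linux/blk-mq.h>
#include <linux/mutex.h>

struct mydev {
	struct mutex send_lock;		/* serializes sends; locking it may sleep */
};

static void mydev_send(struct mydev *dev, struct request *rq)
{
	/* transmit the request over the device's socket/transport */
}

static int mydev_queue_rq(struct blk_mq_hw_ctx *hctx,
			  const struct blk_mq_queue_data *bd)
{
	struct mydev *dev = hctx->queue->queuedata;

	blk_mq_start_request(bd->rq);

	mutex_lock(&dev->send_lock);	/* sleeps: needs BLK_MQ_F_BLOCKING */
	mydev_send(dev, bd->rq);
	mutex_unlock(&dev->send_lock);

	return BLK_MQ_RQ_QUEUE_OK;
}
```

Without the flag, blk-mq may call this handler with preemption disabled
(see the get_cpu() path in the patch below), and the mutex_lock() trips
the sleeping-in-atomic check.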
On 09/22/2016 10:59 AM, Christoph Hellwig wrote:
> On Thu, Sep 22, 2016 at 08:53:00AM -0600, Jens Axboe wrote:
>> If a driver sets BLK_MQ_F_BLOCKING, it is allowed to block in its
>> ->queue_rq() handler. For that case, blk-mq ensures that we always
>> call it from a safe context.
>
> First, can you provide a more useful definition of blocking? Lots of
> current drivers will already block in ->queue_rq, mostly to allocate
> memory.

NBD needs to take a mutex before it sends. This patch was born out of a
kbuild error I got because of a schedule-while-atomic bug; this was the
backtrace:

[   17.698207] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
[   17.700903] in_atomic(): 1, irqs_disabled(): 0, pid: 1050, name: mount
[   17.702350] 1 lock held by mount/1050:
[   17.703395]  #0:  (&type->s_umount_key#20/1){+.+.+.}, at: [<ffffffff8122627b>] sget_userns+0x2f7/0x4b0
[   17.706064] CPU: 0 PID: 1050 Comm: mount Not tainted 4.8.0-rc4-00007-gfd8383fd #1
[   17.708149] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[   17.710523]  0000000000000000 ffff880034dd39f0 ffffffff8178e23d ffff8800350e4d00
[   17.712773]  000000000000041a ffff880034dd3a18 ffffffff81108c39 ffffffff839f1476
[   17.714978]  000000000000026c 0000000000000000 ffff880034dd3a40 ffffffff81108cb5
[   17.717224] Call Trace:
[   17.718130]  [<ffffffff8178e23d>] dump_stack+0x82/0xb8
[   17.719474]  [<ffffffff81108c39>] ___might_sleep+0x1bd/0x1c4
[   17.720813]  [<ffffffff81108cb5>] __might_sleep+0x75/0x7c
[   17.722100]  [<ffffffff82ef53e2>] mutex_lock_nested+0x3e/0x396
[   17.723449]  [<ffffffff8114aace>] ? mod_timer+0x10/0x12
[   17.724705]  [<ffffffff81770002>] ? blk_add_timer+0xe0/0xe5
[   17.726088]  [<ffffffff81c2a35c>] nbd_queue_rq+0x7b/0x14b
[   17.727367]  [<ffffffff81c2a35c>] ? nbd_queue_rq+0x7b/0x14b
[   17.728653]  [<ffffffff81772b66>] __blk_mq_run_hw_queue+0x1c7/0x2c8
[   17.730118]  [<ffffffff81772942>] blk_mq_run_hw_queue+0x5e/0xbb
[   17.731454]  [<ffffffff81773d53>] blk_sq_make_request+0x1a1/0x1ba
[   17.732806]  [<ffffffff81768e24>] generic_make_request+0xbd/0x160
[   17.734153]  [<ffffffff81768fbd>] submit_bio+0xf6/0xff
[   17.735365]  [<ffffffff81252806>] submit_bh_wbc+0x136/0x143
[   17.736719]  [<ffffffff81252c00>] submit_bh+0x10/0x12
[   17.737888]  [<ffffffff81252c52>] __bread_gfp+0x50/0x6f
[   17.739212]  [<ffffffff812f290a>] ext4_fill_super+0x1f4/0x27ec
[   17.740492]  [<ffffffff81798a59>] ? vsnprintf+0x22d/0x3b7
[   17.741685]  [<ffffffff812265e3>] mount_bdev+0x144/0x197
[   17.742928]  [<ffffffff812f2716>] ? ext4_calculate_overhead+0x2bd/0x2bd
[   17.744275]  [<ffffffff812ede93>] ext4_mount+0x15/0x17
[   17.745406]  [<ffffffff81227049>] mount_fs+0x67/0x131
[   17.746518]  [<ffffffff8123ee2f>] vfs_kern_mount+0x6b/0xdb
[   17.747675]  [<ffffffff81241759>] do_mount+0x708/0x97d
[   17.748833]  [<ffffffff811e87f0>] ? __might_fault+0x7e/0x84
[   17.750074]  [<ffffffff811da511>] ? strndup_user+0x3f/0x53
[   17.751198]  [<ffffffff81268eb6>] compat_SyS_mount+0x185/0x1ae
[   17.752433]  [<ffffffff81003b6f>] do_int80_syscall_32+0x5c/0x6b
[   17.753592]  [<ffffffff82efa6d8>] entry_INT80_compat+0x38/0x50
[   17.754952] block nbd11: Attempted send on closed socket

> Second, we looked at something similar a few times in the past, most
> notably when converting loop and rbd, and came to the conclusion that
> performance sucks when we only have that single per-hctx work struct
> for a busy device. I think Ming has done a lot of the benchmarking, so
> I've added him to Cc.
Yeah, this looks to be about a 5% hit on my microbenchmarks for NBD,
but nobody accuses NBD of being particularly fast either, so I wasn't
too worried about it. If we want to say that we can block in
->queue_rq, that would make me happy too.

Thanks,

Josef
On Thu, Sep 22, 2016 at 09:03:56AM -0600, Jens Axboe wrote:
> Having to grab a mutex, for instance. We invoke ->queue_rq() with
> preemption disabled, so I'd hope that would not be the case... What
> drivers block in ->queue_rq?

I thought I had converted a lot of them to GFP_NOIO instead of
GFP_ATOMIC allocations, but I can't find any evidence of that. Maybe it
was just my imagination, or an unsubmitted patch series. Sorry for the
confusion.

> Loop was another case that was on my radar to get rid of the queue_work
> it currently has to do. Josef is currently testing the nbd driver using
> this approach, so we should get some numbers there too. If we have to,
> we can always bump up the concurrency to mimic more of the behavior of
> having multiple workers running on the hardware queue. I'd prefer to
> still do that in blk-mq, instead of having drivers reinventing their
> own work offload functionality.

There should be a lot of numbers in the list archives from the time
that Ming did the loop conversion, as I've been trying to steer him
that way, and he actually implemented and benchmarked it.

We can't just increase the concurrency, as a single work_struct item
can't be queued multiple times even on a high-concurrency workqueue.
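To illustrate the limitation Christoph is pointing at, a small sketch
(hypothetical names, not blk-mq code): queue_work() on an
already-pending work_struct is a no-op, so a single per-hctx item can
never yield more than one concurrent dispatch context, regardless of
the workqueue's max_active setting:

```c
#include <linux/workqueue.h>

static void hctx_run_work(struct work_struct *work)
{
	/* ... dispatch requests for the hardware queue ... */
}

static DECLARE_WORK(hctx_work, hctx_run_work);

static void kick_hctx(struct workqueue_struct *wq)
{
	/*
	 * queue_work() returns false and does nothing if hctx_work is
	 * already pending, so calling this N times while the worker is
	 * backed up still gives exactly one execution context.
	 */
	if (!queue_work(wq, &hctx_work))
		pr_debug("hctx work already pending, no extra concurrency\n");
}
```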
On 09/22/2016 09:12 AM, Christoph Hellwig wrote:
> On Thu, Sep 22, 2016 at 09:03:56AM -0600, Jens Axboe wrote:
>> Having to grab a mutex, for instance. We invoke ->queue_rq() with
>> preemption disabled, so I'd hope that would not be the case... What
>> drivers block in ->queue_rq?
>
> I thought I had converted a lot of them to GFP_NOIO instead of
> GFP_ATOMIC allocations, but I can't find any evidence of that. Maybe it
> was just my imagination, or an unsubmitted patch series. Sorry for the
> confusion.

OK, that makes more sense. Pretty sure we would have had complaints!

>> Loop was another case that was on my radar to get rid of the queue_work
>> it currently has to do. Josef is currently testing the nbd driver using
>> this approach, so we should get some numbers there too. If we have to,
>> we can always bump up the concurrency to mimic more of the behavior of
>> having multiple workers running on the hardware queue. I'd prefer to
>> still do that in blk-mq, instead of having drivers reinventing their
>> own work offload functionality.
>
> There should be a lot of numbers in the list archives from the time
> that Ming did the loop conversion, as I've been trying to steer him
> that way, and he actually implemented and benchmarked it.
>
> We can't just increase the concurrency, as a single work_struct item
> can't be queued multiple times even on a high-concurrency workqueue.

But we could have more work items, if we had to. Even if loop isn't a
drop-in replacement for this simpler approach, I think it'll work well
enough for nbd. The 5% number from Josef is comparing to not having any
offload at all; I suspect the number from just converting to queue_work
in the driver would be similar.
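One way to read "more work items": a purely illustrative sketch (not
from the patch, hypothetical hctx_* names) of a per-hctx array of
work_structs, round-robined so several dispatch contexts can be pending
at once:

```c
#include <linux/workqueue.h>
#include <linux/atomic.h>

#define HCTX_NR_WORKS	4	/* illustrative, not a tuned value */

struct hctx_works {
	struct work_struct work[HCTX_NR_WORKS];
	atomic_t next;
};

static void hctx_run_work(struct work_struct *work)
{
	/* ... run the hardware queue ... */
}

static void hctx_works_init(struct hctx_works *hw)
{
	int i;

	for (i = 0; i < HCTX_NR_WORKS; i++)
		INIT_WORK(&hw->work[i], hctx_run_work);
	atomic_set(&hw->next, 0);
}

static void hctx_kick(struct workqueue_struct *wq, struct hctx_works *hw)
{
	/*
	 * Round-robin across the items: while work[0] is still pending,
	 * work[1..3] can be queued, so up to HCTX_NR_WORKS dispatch
	 * contexts may run concurrently, unlike a single work_struct.
	 */
	unsigned int i = atomic_inc_return(&hw->next) % HCTX_NR_WORKS;

	queue_work(wq, &hw->work[i]);
}
```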
On Thu, Sep 22, 2016 at 11:12 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Thu, Sep 22, 2016 at 09:03:56AM -0600, Jens Axboe wrote:
>> Having to grab a mutex, for instance. We invoke ->queue_rq() with
>> preemption disabled, so I'd hope that would not be the case... What
>> drivers block in ->queue_rq?
>
> I thought I had converted a lot of them to GFP_NOIO instead of
> GFP_ATOMIC allocations, but I can't find any evidence of that. Maybe it
> was just my imagination, or an unsubmitted patch series. Sorry for the
> confusion.
>
>> Loop was another case that was on my radar to get rid of the queue_work
>> it currently has to do. Josef is currently testing the nbd driver using
>> this approach, so we should get some numbers there too. If we have to,
>> we can always bump up the concurrency to mimic more of the behavior of
>> having multiple workers running on the hardware queue. I'd prefer to
>> still do that in blk-mq, instead of having drivers reinventing their
>> own work offload functionality.
>
> There should be a lot of numbers in the list archives from the time
> that Ming did the loop conversion, as I've been trying to steer him
> that way, and he actually implemented and benchmarked it.
>
> We can't just increase the concurrency, as a single work_struct item
> can't be queued multiple times even on a high-concurrency workqueue.

Yes, that is it, and that is why we couldn't convert to this approach
last time.

Thanks,
On Thu, Sep 22, 2016 at 11:17 PM, Jens Axboe <axboe@fb.com> wrote:
> On 09/22/2016 09:12 AM, Christoph Hellwig wrote:
>>
>> On Thu, Sep 22, 2016 at 09:03:56AM -0600, Jens Axboe wrote:
>>>
>>> Having to grab a mutex, for instance. We invoke ->queue_rq() with
>>> preemption disabled, so I'd hope that would not be the case... What
>>> drivers block in ->queue_rq?
>>
>> I thought I had converted a lot of them to GFP_NOIO instead of
>> GFP_ATOMIC allocations, but I can't find any evidence of that. Maybe
>> it was just my imagination, or an unsubmitted patch series. Sorry for
>> the confusion.
>
> OK, that makes more sense. Pretty sure we would have had complaints!
>
>>> Loop was another case that was on my radar to get rid of the
>>> queue_work it currently has to do. Josef is currently testing the nbd
>>> driver using this approach, so we should get some numbers there too.
>>> If we have to, we can always bump up the concurrency to mimic more of
>>> the behavior of having multiple workers running on the hardware
>>> queue. I'd prefer to still do that in blk-mq, instead of having
>>> drivers reinventing their own work offload functionality.
>>
>> There should be a lot of numbers in the list archives from the time
>> that Ming did the loop conversion, as I've been trying to steer him
>> that way, and he actually implemented and benchmarked it.
>>
>> We can't just increase the concurrency, as a single work_struct item
>> can't be queued multiple times even on a high-concurrency workqueue.
>
> But we could have more work items, if we had to. Even if loop isn't a
> drop-in replacement for this simpler approach, I think it'll work well
> enough for nbd. The 5% number from Josef is comparing to not having any
> offload at all; I suspect the number from just converting to queue_work
> in the driver would be similar.

5% might be noise.

Looking at nbd's .queue_rq(), kthread_work should match its case: one
per-nbd mutex is required and all commands sent to the nbd are
serialized, so using generic work items may hurt performance through
unnecessary context switches.

thanks,
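A sketch of the kthread_work alternative Ming suggests, using the
4.8-era API names (init_kthread_worker and friends, later renamed) and
hypothetical mydev_* identifiers: one dedicated kthread per device
keeps all sends serialized on a single context instead of bouncing
across pool workers:

```c
#include <linux/kthread.h>
#include <linux/err.h>

struct mydev {
	struct kthread_worker worker;
	struct task_struct *worker_task;
	struct kthread_work send_work;
};

static void mydev_send_fn(struct kthread_work *work)
{
	/* drain and transmit all queued commands; serialized by design */
}

static int mydev_start_worker(struct mydev *dev)
{
	init_kthread_worker(&dev->worker);
	init_kthread_work(&dev->send_work, mydev_send_fn);

	dev->worker_task = kthread_run(kthread_worker_fn, &dev->worker,
				       "mydev-send");
	if (IS_ERR(dev->worker_task))
		return PTR_ERR(dev->worker_task);
	return 0;
}

/*
 * Kick from ->queue_rq(): re-queueing while pending is still a no-op,
 * but every command runs on the same dedicated thread, avoiding the
 * context switches between pool workers that Ming is worried about.
 */
static void mydev_kick(struct mydev *dev)
{
	queue_kthread_work(&dev->worker, &dev->send_work);
}
```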
> On Sep 22, 2016, at 9:44 PM, Ming Lei <ming.lei@canonical.com> wrote:
>
>> On Thu, Sep 22, 2016 at 11:17 PM, Jens Axboe <axboe@fb.com> wrote:
>>> On 09/22/2016 09:12 AM, Christoph Hellwig wrote:
>>>
>>>> On Thu, Sep 22, 2016 at 09:03:56AM -0600, Jens Axboe wrote:
>>>>
>>>> Having to grab a mutex, for instance. We invoke ->queue_rq() with
>>>> preemption disabled, so I'd hope that would not be the case... What
>>>> drivers block in ->queue_rq?
>>>
>>> I thought I had converted a lot of them to GFP_NOIO instead of
>>> GFP_ATOMIC allocations, but I can't find any evidence of that. Maybe
>>> it was just my imagination, or an unsubmitted patch series. Sorry for
>>> the confusion.
>>
>> OK, that makes more sense. Pretty sure we would have had complaints!
>>
>>>> Loop was another case that was on my radar to get rid of the
>>>> queue_work it currently has to do. Josef is currently testing the
>>>> nbd driver using this approach, so we should get some numbers there
>>>> too. If we have to, we can always bump up the concurrency to mimic
>>>> more of the behavior of having multiple workers running on the
>>>> hardware queue. I'd prefer to still do that in blk-mq, instead of
>>>> having drivers reinventing their own work offload functionality.
>>>
>>> There should be a lot of numbers in the list archives from the time
>>> that Ming did the loop conversion, as I've been trying to steer him
>>> that way, and he actually implemented and benchmarked it.
>>>
>>> We can't just increase the concurrency, as a single work_struct item
>>> can't be queued multiple times even on a high-concurrency workqueue.
>>
>> But we could have more work items, if we had to. Even if loop isn't a
>> drop-in replacement for this simpler approach, I think it'll work well
>> enough for nbd. The 5% number from Josef is comparing to not having
>> any offload at all; I suspect the number from just converting to
>> queue_work in the driver would be similar.
>
> 5% might be noise.
>
> Looking at nbd's .queue_rq(), kthread_work should match its case: one
> per-nbd mutex is required and all commands sent to the nbd are
> serialized, so using generic work items may hurt performance through
> unnecessary context switches.

This is with my multi-connection patch, with 4 connections per device
(so 4 hw queues). The 5% is real, but I don't think it'll get better by
punting to a worker per command, so this approach is fine for NBD.

Thanks,

Josef
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c29700010b5c..eae2f12f011f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -908,7 +908,7 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 	    !blk_mq_hw_queue_mapped(hctx)))
 		return;
 
-	if (!async) {
+	if (!async && !(hctx->flags & BLK_MQ_F_BLOCKING)) {
 		int cpu = get_cpu();
 		if (cpumask_test_cpu(cpu, hctx->cpumask)) {
 			__blk_mq_run_hw_queue(hctx);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index fbcfdf323243..5daa0ef756dd 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -155,6 +155,7 @@ enum {
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
 	BLK_MQ_F_SG_MERGE	= 1 << 2,
 	BLK_MQ_F_DEFER_ISSUE	= 1 << 4,
+	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
 	BLK_MQ_F_ALLOC_POLICY_BITS = 1,
If a driver sets BLK_MQ_F_BLOCKING, it is allowed to block in its
->queue_rq() handler. For that case, blk-mq ensures that we always call
it from a safe context.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/blk-mq.c         | 2 +-
 include/linux/blk-mq.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)
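For readers wiring this up, a sketch of how a driver would opt in at
tag-set setup, pairing with the hypothetical mydev_queue_rq() sketched
earlier (4.8-era blk_mq_ops with its then-required .map_queue; mydev_*
names are assumptions, not part of the patch):

```c
#include <linux/blk-mq.h>

struct mydev_cmd {
	/* per-request driver data, allocated by blk-mq via cmd_size */
	int status;
};

static struct blk_mq_ops mydev_mq_ops = {
	.queue_rq	= mydev_queue_rq,	/* may sleep, see sketch above */
	.map_queue	= blk_mq_map_queue,	/* required in this kernel era */
};

static int mydev_setup_tags(struct blk_mq_tag_set *set)
{
	set->ops = &mydev_mq_ops;
	set->nr_hw_queues = 1;
	set->queue_depth = 128;
	set->numa_node = NUMA_NO_NODE;
	set->cmd_size = sizeof(struct mydev_cmd);
	/* ask blk-mq to always call ->queue_rq() from process context */
	set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_BLOCKING;

	return blk_mq_alloc_tag_set(set);
}
```

With BLK_MQ_F_BLOCKING set, the patched blk_mq_run_hw_queue() skips the
get_cpu()/direct-dispatch path and always punts to the hctx work item,
so ->queue_rq() runs from the workqueue and may sleep.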