Message ID | dbe42007-8109-2e21-d0f3-0778007cd152@kernel.dk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Jens Axboe <axboe@kernel.dk> writes: >> Can you try this patch? It's not perfect, but I'll be interested if it >> makes a difference for you. > Hi Jens, Sorry for the delay. I just got back to this and have been running your patch on top of 4.8 without a crash for over 1 hour. I wanna give it more time to make sure it's running properly, though. Let me get back to you after a few more rounds of test. > This one should handle the WARN_ON() for running the hw queue on the > wrong CPU as well. On the workaround you added to prevent WARN_ON, we surely need to prevent blk_mq_hctx_next_cpu from scheduling dead cpus in the first place, right.. How do you feel about the following RFC? I know it's not a complete fix, but it feels like a good improvement to me. http://www.spinics.net/lists/linux-scsi/msg98608.html
On 08/29/2016 12:06 PM, Gabriel Krisman Bertazi wrote: > Jens Axboe <axboe@kernel.dk> writes: >>> Can you try this patch? It's not perfect, but I'll be interested if it >>> makes a difference for you. >> > > Hi Jens, > > Sorry for the delay. I just got back to this and have been running your > patch on top of 4.8 without a crash for over 1 hour. I wanna give it > more time to make sure it's running properly, though. > > Let me get back to you after a few more rounds of test. Thanks, sounds good. The patches have landed in mainline too. >> This one should handle the WARN_ON() for running the hw queue on the >> wrong CPU as well. > > On the workaround you added to prevent WARN_ON, we surely need to > prevent blk_mq_hctx_next_cpu from scheduling dead cpus in the first > place, right.. How do you feel about the following RFC? I know it's > not a complete fix, but it feels like a good improvement to me. > > http://www.spinics.net/lists/linux-scsi/msg98608.html But we can't completely prevent it, and I don't think we have to. I just don't want to trigger a warning for something that's a valid condition. I want the warning to trigger if this happens without the CPU going offline, since then it's indicative of a real bug in the mapping. Your patch isn't going to prevent it either - it'll shrink the window, at the expense of making blk_mq_hctx_next_cpu() more expensive. So I don't think it's worthwhile.
diff --git a/block/blk-mq.c b/block/blk-mq.c index 758a9b5..b21a9b9 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -810,11 +810,12 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx) struct list_head *dptr; int queued; - WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)); - if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state))) return; + WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) && + cpu_online(hctx->next_cpu)); + hctx->run++;