Message ID | 20160608204347.GA30146@kernel.dk (mailing list archive) |
---|---
State | New, archived |
Hi, Jens,

Jens Axboe <axboe@fb.com> writes:

> At Facebook, we have a number of cases where people use ionice to set a
> lower priority, then end up having tasks stuck for a long time because
> e.g. metadata updates from an idle-priority task are blocking out
> higher-priority processes. It's bad enough that it will trigger the
> softlockup warning.

I expect a higher-prio process could be blocked on a lower-prio process
reading the same metadata, too.  I had a hard time tracking down where
REQ_META WRITE I/O was issued outside of the journal or writeback paths
(and I hope you're not ionice-ing those!).  Eventually, with the help of
sandeen, I found some oddball cases that I doubt you're running into.
Can you enlighten me as to where this (REQ_META write I/O) is happening?
I don't disagree that it's a problem; I just would like to understand
your problem case better.

Anyway, it seems to me you could just set REQ_PRIO in the code paths you
care about instead of modifying CFQ to treat REQ_META and REQ_PRIO as
the same thing, which essentially undoes commit 65299a3b788bd ("block:
separate priority boosting from REQ_META") from Christoph.

Thanks!
Jeff
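For context, commit 65299a3b788bd made REQ_META a pure annotation and introduced REQ_PRIO as the separate scheduler-boost hint. A sketch of how a filesystem that wants both behaviors tags a metadata read, paraphrased from memory of ext4_bread() of this era (the traces later in this thread go through that path), so treat the exact shape as approximate:

	/*
	 * Sketch (paraphrased, not an exact ext4 quote): REQ_META marks
	 * the I/O as metadata for tracing/accounting; REQ_PRIO separately
	 * asks the I/O scheduler for a priority boost. A filesystem that
	 * wants both behaviors sets both flags on the read.
	 */
	bh = ext4_getblk(handle, inode, block, 0);
	if (bh && !buffer_uptodate(bh)) {
		ll_rw_block(READ | REQ_META | REQ_PRIO, 1, &bh);
		wait_on_buffer(bh);
	}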
On Thu, Jun 09, 2016 at 11:55:56AM -0400, Jeff Moyer wrote:
> I expect a higher-prio process could be blocked on a lower-prio process
> reading the same metadata, too.  I had a hard time tracking down where
> REQ_META WRITE I/O was issued outside of the journal or writeback paths
> (and I hope you're not ionice-ing those!).  Eventually, with the help of
> sandeen, I found some oddball cases that I doubt you're running into.
> Can you enlighten me as to where this (REQ_META write I/O) is happening?
> I don't disagree that it's a problem; I just would like to understand
> your problem case better.

XFS does lots of REQ_META writes from _xfs_buf_ioapply().  But none of
those should be in the critical path, as all the metadata is logged
first and then written back later.

> Anyway, it seems to me you could just set REQ_PRIO in the code paths you
> care about instead of modifying CFQ to treat REQ_META and REQ_PRIO as
> the same thing, which essentially undoes commit 65299a3b788bd ("block:
> separate priority boosting from REQ_META") from Christoph.

And I'm still waiting for someone to explain when exactly REQ_PRIO
should be used...
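The XFS tagging Christoph refers to sits in _xfs_buf_ioapply(); roughly the following, paraphrased from memory of xfs_buf.c of this era with the FUA/flush handling elided:

	/*
	 * Paraphrased from xfs_buf.c:_xfs_buf_ioapply(): the xfs_buf
	 * cache holds only metadata, so every buffer I/O is tagged
	 * REQ_META -- but note that REQ_PRIO is not set here.
	 */
	if (bp->b_flags & XBF_WRITE)
		rw = WRITE;
	else if (bp->b_flags & XBF_READ_AHEAD)
		rw = READA;
	else
		rw = READ;

	/* we only use the buffer cache for meta-data */
	rw |= REQ_META;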
Christoph Hellwig <hch@infradead.org> writes:

> On Thu, Jun 09, 2016 at 11:55:56AM -0400, Jeff Moyer wrote:
>> I expect a higher-prio process could be blocked on a lower-prio process
>> reading the same metadata, too.  I had a hard time tracking down where
>> REQ_META WRITE I/O was issued outside of the journal or writeback paths
>> (and I hope you're not ionice-ing those!).  Eventually, with the help of
>> sandeen, I found some oddball cases that I doubt you're running into.
>> Can you enlighten me as to where this (REQ_META write I/O) is happening?
>> I don't disagree that it's a problem; I just would like to understand
>> your problem case better.
>
> XFS does lots of REQ_META writes from _xfs_buf_ioapply().  But none of
> those should be in the critical path, as all the metadata is logged
> first and then written back later.
>
>> Anyway, it seems to me you could just set REQ_PRIO in the code paths you
>> care about instead of modifying CFQ to treat REQ_META and REQ_PRIO as
>> the same thing, which essentially undoes commit 65299a3b788bd ("block:
>> separate priority boosting from REQ_META") from Christoph.
>
> And I'm still waiting for someone to explain when exactly REQ_PRIO
> should be used...

I think Jens' bug report is exactly that explanation, no?  To avoid
priority inversion?

-Jeff
On 06/09/2016 09:55 AM, Jeff Moyer wrote:
> Hi, Jens,
>
> Jens Axboe <axboe@fb.com> writes:
>
>> At Facebook, we have a number of cases where people use ionice to set a
>> lower priority, then end up having tasks stuck for a long time because
>> e.g. metadata updates from an idle-priority task are blocking out
>> higher-priority processes. It's bad enough that it will trigger the
>> softlockup warning.
>
> I expect a higher-prio process could be blocked on a lower-prio process
> reading the same metadata, too.  I had a hard time tracking down where
> REQ_META WRITE I/O was issued outside of the journal or writeback paths
> (and I hope you're not ionice-ing those!).  Eventually, with the help of
> sandeen, I found some oddball cases that I doubt you're running into.
> Can you enlighten me as to where this (REQ_META write I/O) is happening?
> I don't disagree that it's a problem; I just would like to understand
> your problem case better.
>
> Anyway, it seems to me you could just set REQ_PRIO in the code paths you
> care about instead of modifying CFQ to treat REQ_META and REQ_PRIO as
> the same thing, which essentially undoes commit 65299a3b788bd ("block:
> separate priority boosting from REQ_META") from Christoph.

I personally don't care too much about boosting for just PRIO, or for
both PRIO and META. For the important paths, we basically have both set
anyway. But here's a trace for one such hang:

[478381.198925] ------------[ cut here ]------------
[478381.200315] INFO: task ionice:1168369 blocked for more than 120 seconds.
[478381.201324]       Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
[478381.202278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[478381.203462] ionice          D ffff8803692736a8     0 1168369      1 0x00000080
[478381.203466]  ffff8803692736a8 ffff880399c21300 ffff880276adcc00 ffff880369273698
[478381.204589]  ffff880369273fd8 0000000000000000 7fffffffffffffff 0000000000000002
[478381.205752]  ffffffff8177d5e0 ffff8803692736c8 ffffffff8177cea7 0000000000000000
[478381.206874] Call Trace:
[478381.207253]  [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80
[478381.208175]  [<ffffffff8177cea7>] schedule+0x37/0x90
[478381.208932]  [<ffffffff8177f5fc>] schedule_timeout+0x1dc/0x250
[478381.209805]  [<ffffffff81421c17>] ? __blk_run_queue+0x37/0x50
[478381.210706]  [<ffffffff810ca1c5>] ? ktime_get+0x45/0xb0
[478381.211489]  [<ffffffff8177c407>] io_schedule_timeout+0xa7/0x110
[478381.212402]  [<ffffffff810a8c2b>] ? prepare_to_wait+0x5b/0x90
[478381.213280]  [<ffffffff8177d616>] bit_wait_io+0x36/0x50
[478381.214063]  [<ffffffff8177d325>] __wait_on_bit+0x65/0x90
[478381.214961]  [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80
[478381.215872]  [<ffffffff8177d47c>] out_of_line_wait_on_bit+0x7c/0x90
[478381.216806]  [<ffffffff810a89f0>] ? wake_atomic_t_function+0x40/0x40
[478381.217773]  [<ffffffff811f03aa>] __wait_on_buffer+0x2a/0x30
[478381.218641]  [<ffffffff8123c557>] ext4_bread+0x57/0x70
[478381.219425]  [<ffffffff8124498c>] __ext4_read_dirblock+0x3c/0x380
[478381.220467]  [<ffffffff8124665d>] ext4_dx_find_entry+0x7d/0x170
[478381.221357]  [<ffffffff8114c49e>] ? find_get_entry+0x1e/0xa0
[478381.222208]  [<ffffffff81246bd4>] ext4_find_entry+0x484/0x510
[478381.223090]  [<ffffffff812471a2>] ext4_lookup+0x52/0x160
[478381.223882]  [<ffffffff811c401d>] lookup_real+0x1d/0x60
[478381.224675]  [<ffffffff811c4698>] __lookup_hash+0x38/0x50
[478381.225697]  [<ffffffff817745bd>] lookup_slow+0x45/0xab
[478381.226941]  [<ffffffff811c690e>] link_path_walk+0x7ae/0x820
[478381.227880]  [<ffffffff811c6a42>] path_init+0xc2/0x430
[478381.228677]  [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20
[478381.229776]  [<ffffffff811c8c57>] path_openat+0x77/0x620
[478381.230767]  [<ffffffff81185c6e>] ? page_add_file_rmap+0x2e/0x70
[478381.232019]  [<ffffffff811cb253>] do_filp_open+0x43/0xa0
[478381.233016]  [<ffffffff8108c4a9>] ? creds_are_invalid+0x29/0x70
[478381.234072]  [<ffffffff811c0cb0>] do_open_execat+0x70/0x170
[478381.235039]  [<ffffffff811c1bf8>] do_execveat_common.isra.36+0x1b8/0x6e0
[478381.236051]  [<ffffffff811c214c>] do_execve+0x2c/0x30
[478381.236809]  [<ffffffff811ca392>] ? getname+0x12/0x20
[478381.237564]  [<ffffffff811c23be>] SyS_execve+0x2e/0x40
[478381.238338]  [<ffffffff81780a1d>] stub_execve+0x6d/0xa0
[478381.239126] ------------[ cut here ]------------
[478381.239915] ------------[ cut here ]------------
[478381.240606] INFO: task python2.7:1168375 blocked for more than 120 seconds.
[478381.242673]       Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
[478381.243653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[478381.244902] python2.7       D ffff88005cf8fb98     0 1168375 1168248 0x00000080
[478381.244904]  ffff88005cf8fb98 ffff88016c1f0980 ffffffff81c134c0 ffff88016c1f11a0
[478381.246023]  ffff88005cf8ffd8 ffff880466cd0cbc ffff88016c1f0980 00000000ffffffff
[478381.247138]  ffff880466cd0cc0 ffff88005cf8fbb8 ffffffff8177cea7 ffff88005cf8fcc8
[478381.248252] Call Trace:
[478381.248630]  [<ffffffff8177cea7>] schedule+0x37/0x90
[478381.249382]  [<ffffffff8177d08e>] schedule_preempt_disabled+0xe/0x10
[478381.250465]  [<ffffffff8177e892>] __mutex_lock_slowpath+0x92/0x100
[478381.251409]  [<ffffffff8177e91b>] mutex_lock+0x1b/0x2f
[478381.252199]  [<ffffffff817745ae>] lookup_slow+0x36/0xab
[478381.253023]  [<ffffffff811c690e>] link_path_walk+0x7ae/0x820
[478381.253877]  [<ffffffff811aeb41>] ? try_charge+0xc1/0x700
[478381.254690]  [<ffffffff811c6a42>] path_init+0xc2/0x430
[478381.255525]  [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20
[478381.256450]  [<ffffffff811c8c57>] path_openat+0x77/0x620
[478381.257256]  [<ffffffff8115b2fb>] ? lru_cache_add_active_or_unevictable+0x2b/0xa0
[478381.258390]  [<ffffffff8117b623>] ? handle_mm_fault+0x13f3/0x1720
[478381.259309]  [<ffffffff811cb253>] do_filp_open+0x43/0xa0
[478381.260139]  [<ffffffff811d7ae2>] ? __alloc_fd+0x42/0x120
[478381.260962]  [<ffffffff811b95ac>] do_sys_open+0x13c/0x230
[478381.261779]  [<ffffffff81011393>] ? syscall_trace_enter_phase1+0x113/0x170
[478381.262851]  [<ffffffff811b96c2>] SyS_open+0x22/0x30
[478381.263598]  [<ffffffff81780532>] system_call_fastpath+0x12/0x17
[478381.264551] ------------[ cut here ]------------
[478381.265377] ------------[ cut here ]------------

These are reads. It's a classic case of starting some operation that
ends up holding a file system resource, then making very slow progress
(due to the task being scheduled as idle IO), and hence blocking out
potentially much more important tasks. The IO is marked META|PRIO, so if
folks don't care for boosting on meta, we can just kill that. "Normal"
metadata could be scheduled as idle, but anything scheduled under a fs
lock or similar should be marked PRIO and get boosted.
Jens Axboe <axboe@kernel.dk> writes:

> On 06/09/2016 09:55 AM, Jeff Moyer wrote:
>> Hi, Jens,
>>
>> Jens Axboe <axboe@fb.com> writes:
>>
>>> At Facebook, we have a number of cases where people use ionice to set a
>>> lower priority, then end up having tasks stuck for a long time because
>>> e.g. metadata updates from an idle-priority task are blocking out
>>> higher-priority processes. It's bad enough that it will trigger the
>>> softlockup warning.
>>
>> I expect a higher-prio process could be blocked on a lower-prio process
>> reading the same metadata, too.  I had a hard time tracking down where
>> REQ_META WRITE I/O was issued outside of the journal or writeback paths
>> (and I hope you're not ionice-ing those!).  Eventually, with the help of
>> sandeen, I found some oddball cases that I doubt you're running into.
>> Can you enlighten me as to where this (REQ_META write I/O) is happening?
>> I don't disagree that it's a problem; I just would like to understand
>> your problem case better.
>>
>> Anyway, it seems to me you could just set REQ_PRIO in the code paths you
>> care about instead of modifying CFQ to treat REQ_META and REQ_PRIO as
>> the same thing, which essentially undoes commit 65299a3b788bd ("block:
>> separate priority boosting from REQ_META") from Christoph.
>
> I personally don't care too much about boosting for just PRIO, or for
> both PRIO and META. For the important paths, we basically have both set
> anyway.

Yeah, I had considered that.  I like that REQ_META is separated out for
logging purposes.  I'm not too concerned about whether we imply boosted
priority from that or not.

> These are reads.

I was curious about writes. ;-)  Anyway, it's good to validate that the
read case is also relevant.

> It's a classic case of starting some operation that ends up holding a
> file system resource, then making very slow progress (due to the task
> being scheduled as idle IO), and hence blocking out potentially much
> more important tasks.
>
> The IO is marked META|PRIO, so if folks don't care for boosting on meta,
> we can just kill that. "Normal" metadata could be scheduled as idle,
> but anything scheduled under a fs lock or similar should be marked PRIO
> and get boosted.

Interesting.  I would have thought that the cfqd->active_queue would
have been preempted by a request marked with REQ_PRIO.  But you're
suggesting that did not happen?

Specifically, cfq_insert_request would call cfq_rq_enqueued, and that
would call cfq_preempt_queue if cfq_should_preempt (and I think it
should, since the new request has REQ_PRIO set).

Cheers,
Jeff
On 06/09/2016 12:31 PM, Jeff Moyer wrote:
> Jens Axboe <axboe@kernel.dk> writes:
>
>> On 06/09/2016 09:55 AM, Jeff Moyer wrote:
>>> Hi, Jens,
>>>
>>> Jens Axboe <axboe@fb.com> writes:
>>>
>>>> At Facebook, we have a number of cases where people use ionice to set a
>>>> lower priority, then end up having tasks stuck for a long time because
>>>> e.g. metadata updates from an idle-priority task are blocking out
>>>> higher-priority processes. It's bad enough that it will trigger the
>>>> softlockup warning.
>>>
>>> I expect a higher-prio process could be blocked on a lower-prio process
>>> reading the same metadata, too.  I had a hard time tracking down where
>>> REQ_META WRITE I/O was issued outside of the journal or writeback paths
>>> (and I hope you're not ionice-ing those!).  Eventually, with the help of
>>> sandeen, I found some oddball cases that I doubt you're running into.
>>> Can you enlighten me as to where this (REQ_META write I/O) is happening?
>>> I don't disagree that it's a problem; I just would like to understand
>>> your problem case better.
>>>
>>> Anyway, it seems to me you could just set REQ_PRIO in the code paths you
>>> care about instead of modifying CFQ to treat REQ_META and REQ_PRIO as
>>> the same thing, which essentially undoes commit 65299a3b788bd ("block:
>>> separate priority boosting from REQ_META") from Christoph.
>>
>> I personally don't care too much about boosting for just PRIO, or for
>> both PRIO and META. For the important paths, we basically have both set
>> anyway.
>
> Yeah, I had considered that.  I like that REQ_META is separated out for
> logging purposes.  I'm not too concerned about whether we imply boosted
> priority from that or not.

Not a huge deal to me either, and most of the interesting REQ_META sites
already set REQ_PRIO as well.

>> These are reads.
>
> I was curious about writes. ;-)  Anyway, it's good to validate that the
> read case is also relevant.

You mean O_DIRECT writes? Most of the buffered writes will come out of
the associated threads, so I don't think it's as big an issue there.
We've only seen it for reads.

>> It's a classic case of starting some operation that ends up holding a
>> file system resource, then making very slow progress (due to the task
>> being scheduled as idle IO), and hence blocking out potentially much
>> more important tasks.
>>
>> The IO is marked META|PRIO, so if folks don't care for boosting on meta,
>> we can just kill that. "Normal" metadata could be scheduled as idle,
>> but anything scheduled under a fs lock or similar should be marked PRIO
>> and get boosted.
>
> Interesting.  I would have thought that the cfqd->active_queue would
> have been preempted by a request marked with REQ_PRIO.  But you're
> suggesting that did not happen?
>
> Specifically, cfq_insert_request would call cfq_rq_enqueued, and that
> would call cfq_preempt_queue if cfq_should_preempt (and I think it
> should, since the new request has REQ_PRIO set).

We seem to mostly ignore prio_pending for the idle class. If the new
queue is idle, then we don't look at prio_pending. I'd rather make this
more explicit; the patch is pretty similar to what we had in the past.
It's somewhat of a regression caused by commit 4aede84b33d, except I
like using the request flags for this a lot more than the old
current->fs_excl. REQ_PRIO should always be set for cases where we hold
fs (or even directory) specific resources.
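For reference, the ordering Jens describes, condensed from cfq_should_preempt() in block/cfq-iosched.c of this era (paraphrased from memory, with unrelated checks elided):

	/*
	 * Condensed sketch of cfq_should_preempt(): the idle-class check
	 * on the new queue fires before the REQ_PRIO/prio_pending check
	 * is ever reached, so a REQ_PRIO request issued from an
	 * idle-class queue never preempts the active queue.
	 */
	static bool cfq_should_preempt(struct cfq_data *cfqd,
				       struct cfq_queue *new_cfqq,
				       struct request *rq)
	{
		struct cfq_queue *cfqq = cfqd->active_queue;

		if (!cfqq)
			return false;

		if (cfq_class_idle(new_cfqq))
			return false;	/* idle class bails out here */

		if (cfq_class_idle(cfqq))
			return true;

		/* ... sync/async and workload checks elided ... */

		if ((rq->cmd_flags & REQ_PRIO) && !cfqq->prio_pending)
			return true;	/* never reached for idle queues */

		return false;
	}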
Jens Axboe <axboe@kernel.dk> writes:

>> I was curious about writes. ;-)  Anyway, it's good to validate that the
>> read case is also relevant.
>
> You mean O_DIRECT writes? Most of the buffered writes will come out of
> the associated threads, so I don't think it's as big an issue there.
> We've only seen it for reads.

Well, you had me confused with your initial report: "... because e.g.
metadata updates..."  So I assumed that meant REQ_META WRITES.  My bad.

[snip]

>> Interesting.  I would have thought that the cfqd->active_queue would
>> have been preempted by a request marked with REQ_PRIO.  But you're
>> suggesting that did not happen?

[snip]

> We seem to mostly ignore prio_pending for the idle class. If

Right, I forgot we were talking about idle class.  Sorry.

> the new queue is idle, then we don't look at prio_pending. I'd rather
> make this more explicit; the patch is pretty similar to what we had in
> the past. It's somewhat of a regression caused by commit 4aede84b33d,
> except I like using the request flags for this a lot more than the old
> current->fs_excl. REQ_PRIO should always be set for cases where we
> hold fs (or even directory) specific resources.

Ah, thanks for the reference!  Now I'll go back and finish reviewing the
actual patch.

-Jeff
Jens Axboe <axboe@fb.com> writes:

> At Facebook, we have a number of cases where people use ionice to set a
> lower priority, then end up having tasks stuck for a long time because
> e.g. metadata updates from an idle-priority task are blocking out
> higher-priority processes. It's bad enough that it will trigger the
> softlockup warning.
>
> This patch adds code to CFQ that bumps the priority class and data for
> an idle task, if it is doing IO marked as PRIO or META. With this, we
> no longer see the softlockups.
>
> Signed-off-by: Jens Axboe <axboe@fb.com>
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 32a283eb7274..3cfd67d006fb 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1781,6 +1781,11 @@ get_rq:
>  		rw_flags |= REQ_SYNC;
>  
>  	/*
> +	 * Add in META/PRIO flags, if set, before we get to the IO scheduler
> +	 */
> +	rw_flags |= (bio->bi_rw & (REQ_META | REQ_PRIO));
> +
> +	/*

This needs a docbook update.  It now reads:

 * @rw_flags: RW and SYNC flags

so whatever flags we're adding should be specified, I guess.

Speaking of which, after much waffling, I think I've decided it would be
cleaner to limit the priority boost to REQ_PRIO requests only.

Other than that, I think this looks fine.

Cheers,
Jeff
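The comment in question is the kernel-doc above __get_request()/get_request(); the fix might look something like this (a sketch, not the committed wording):

	/**
	 * __get_request - get a free request
	 * @rl: request list to allocate from
	 * @rw_flags: RW, SYNC and META/PRIO flags
	 * @bio: bio to allocate request for (can be %NULL)
	 * @gfp_mask: allocation mask
	 * ...
	 */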
On 06/09/2016 03:28 PM, Jeff Moyer wrote:
> Jens Axboe <axboe@fb.com> writes:
>
>> At Facebook, we have a number of cases where people use ionice to set a
>> lower priority, then end up having tasks stuck for a long time because
>> e.g. metadata updates from an idle-priority task are blocking out
>> higher-priority processes. It's bad enough that it will trigger the
>> softlockup warning.
>>
>> This patch adds code to CFQ that bumps the priority class and data for
>> an idle task, if it is doing IO marked as PRIO or META. With this, we
>> no longer see the softlockups.
>>
>> Signed-off-by: Jens Axboe <axboe@fb.com>
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 32a283eb7274..3cfd67d006fb 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -1781,6 +1781,11 @@ get_rq:
>>  		rw_flags |= REQ_SYNC;
>>  
>>  	/*
>> +	 * Add in META/PRIO flags, if set, before we get to the IO scheduler
>> +	 */
>> +	rw_flags |= (bio->bi_rw & (REQ_META | REQ_PRIO));
>> +
>> +	/*
>
> This needs a docbook update.  It now reads:
>
>  * @rw_flags: RW and SYNC flags
>
> so whatever flags we're adding should be specified, I guess.
>
> Speaking of which, after much waffling, I think I've decided it would be
> cleaner to limit the priority boost to REQ_PRIO requests only.

I went and checked, but I don't see it. Where is this?

> Other than that, I think this looks fine.

Thanks!
Jens Axboe <axboe@kernel.dk> writes:

> On 06/09/2016 03:28 PM, Jeff Moyer wrote:
>> Jens Axboe <axboe@fb.com> writes:
>>
>>> At Facebook, we have a number of cases where people use ionice to set a
>>> lower priority, then end up having tasks stuck for a long time because
>>> e.g. metadata updates from an idle-priority task are blocking out
>>> higher-priority processes. It's bad enough that it will trigger the
>>> softlockup warning.
>>>
>>> This patch adds code to CFQ that bumps the priority class and data for
>>> an idle task, if it is doing IO marked as PRIO or META. With this, we
>>> no longer see the softlockups.
>>>
>>> Signed-off-by: Jens Axboe <axboe@fb.com>
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 32a283eb7274..3cfd67d006fb 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -1781,6 +1781,11 @@ get_rq:
>>>  		rw_flags |= REQ_SYNC;
>>>  
>>>  	/*
>>> +	 * Add in META/PRIO flags, if set, before we get to the IO scheduler
>>> +	 */
>>> +	rw_flags |= (bio->bi_rw & (REQ_META | REQ_PRIO));
>>> +
>>> +	/*
>>
>> This needs a docbook update.  It now reads:
>>
>>  * @rw_flags: RW and SYNC flags
>>
>> so whatever flags we're adding should be specified, I guess.
>>
>> Speaking of which, after much waffling, I think I've decided it would be
>> cleaner to limit the priority boost to REQ_PRIO requests only.
>
> I went and checked, but I don't see it. Where is this?

Oops, sorry.  I meant that get_request and __get_request need updates to
their documentation.

On the second part (in case there was confusion on what I meant there),
what I meant was: only do the prio boost for REQ_PRIO requests instead
of also doing it for REQ_META.  The way I arrived at that conclusion was
when I was going to ask you to update the documentation for REQ_META to
state that it implied REQ_PRIO, at which point one has to wonder why we
need two flags.

There are cases where REQ_PRIO is used without REQ_META.
There are cases where REQ_META is used without REQ_PRIO.
And of course, there are cases where they're both sent down.

REQ_META itself is useful for tracing, and also makes the code
self-documenting.

REQ_PRIO pretty clearly means that we should boost priority for this
request.  And I think Christoph was making a case for REQ_META that
doesn't require a priority boost (if I read what he said correctly).

So, I think they serve different purposes.  Have I convinced you?  It'll
make your patch smaller! ;-)

-Jeff
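Jeff's three cases, expressed as bio flag combinations (illustrative only, not lifted from any particular call site):

	/* Illustrative only -- the three combinations Jeff enumerates: */
	bio->bi_rw |= REQ_PRIO;             /* boost only: priority-sensitive, not metadata */
	bio->bi_rw |= REQ_META;             /* annotation only: metadata for tracing/accounting */
	bio->bi_rw |= REQ_META | REQ_PRIO;  /* metadata that must not be starved */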
On 06/09/2016 03:47 PM, Jeff Moyer wrote:
> Jens Axboe <axboe@kernel.dk> writes:
>
>> On 06/09/2016 03:28 PM, Jeff Moyer wrote:
>>> Jens Axboe <axboe@fb.com> writes:
>>>
>>>> At Facebook, we have a number of cases where people use ionice to set a
>>>> lower priority, then end up having tasks stuck for a long time because
>>>> e.g. metadata updates from an idle-priority task are blocking out
>>>> higher-priority processes. It's bad enough that it will trigger the
>>>> softlockup warning.
>>>>
>>>> This patch adds code to CFQ that bumps the priority class and data for
>>>> an idle task, if it is doing IO marked as PRIO or META. With this, we
>>>> no longer see the softlockups.
>>>>
>>>> Signed-off-by: Jens Axboe <axboe@fb.com>
>>>>
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index 32a283eb7274..3cfd67d006fb 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -1781,6 +1781,11 @@ get_rq:
>>>>  		rw_flags |= REQ_SYNC;
>>>>  
>>>>  	/*
>>>> +	 * Add in META/PRIO flags, if set, before we get to the IO scheduler
>>>> +	 */
>>>> +	rw_flags |= (bio->bi_rw & (REQ_META | REQ_PRIO));
>>>> +
>>>> +	/*
>>>
>>> This needs a docbook update.  It now reads:
>>>
>>>  * @rw_flags: RW and SYNC flags
>>>
>>> so whatever flags we're adding should be specified, I guess.
>>>
>>> Speaking of which, after much waffling, I think I've decided it would be
>>> cleaner to limit the priority boost to REQ_PRIO requests only.
>>
>> I went and checked, but I don't see it. Where is this?
>
> Oops, sorry.  I meant that get_request and __get_request need updates to
> their documentation.

See followup email, that no longer applies!

> On the second part (in case there was confusion on what I meant there),
> what I meant was: only do the prio boost for REQ_PRIO requests instead
> of also doing it for REQ_META.  The way I arrived at that conclusion was
> when I was going to ask you to update the documentation for REQ_META to
> state that it implied REQ_PRIO, at which point one has to wonder why we
> need two flags.
>
> There are cases where REQ_PRIO is used without REQ_META.
> There are cases where REQ_META is used without REQ_PRIO.
> And of course, there are cases where they're both sent down.
>
> REQ_META itself is useful for tracing, and also makes the code
> self-documenting.
>
> REQ_PRIO pretty clearly means that we should boost priority for this
> request.  And I think Christoph was making a case for REQ_META that
> doesn't require a priority boost (if I read what he said correctly).
>
> So, I think they serve different purposes.  Have I convinced you?  It'll
> make your patch smaller! ;-)

I got what you meant; the updated patch in that same followup email has
it cut down to just REQ_PRIO and drops RQ_PRIO_MASK as a result.
diff --git a/block/blk-core.c b/block/blk-core.c
index 32a283eb7274..3cfd67d006fb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1781,6 +1781,11 @@ get_rq:
 		rw_flags |= REQ_SYNC;
 
 	/*
+	 * Add in META/PRIO flags, if set, before we get to the IO scheduler
+	 */
+	rw_flags |= (bio->bi_rw & (REQ_META | REQ_PRIO));
+
+	/*
 	 * Grab a free request. This is might sleep but can not fail.
 	 * Returns with the queue unlocked.
 	 */
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 4e5978426ee7..7969882e0a2a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -72,6 +72,8 @@ static struct kmem_cache *cfq_pool;
 #define CFQ_WEIGHT_LEGACY_DFL	500
 #define CFQ_WEIGHT_LEGACY_MAX	1000
 
+#define RQ_PRIO_MASK	(REQ_META | REQ_PRIO)
+
 struct cfq_ttime {
 	u64 last_end_request;
 
@@ -141,7 +143,7 @@ struct cfq_queue {
 
 	/* io prio of this group */
 	unsigned short ioprio, org_ioprio;
-	unsigned short ioprio_class;
+	unsigned short ioprio_class, org_ioprio_class;
 
 	pid_t pid;
 
@@ -1114,8 +1116,8 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2,
 	if (rq_is_sync(rq1) != rq_is_sync(rq2))
 		return rq_is_sync(rq1) ? rq1 : rq2;
 
-	if ((rq1->cmd_flags ^ rq2->cmd_flags) & REQ_PRIO)
-		return rq1->cmd_flags & REQ_PRIO ? rq1 : rq2;
+	if ((rq1->cmd_flags ^ rq2->cmd_flags) & RQ_PRIO_MASK)
+		return rq1->cmd_flags & RQ_PRIO_MASK ? rq1 : rq2;
 
 	s1 = blk_rq_pos(rq1);
 	s2 = blk_rq_pos(rq2);
@@ -2530,7 +2532,7 @@ static void cfq_remove_request(struct request *rq)
 	cfqq->cfqd->rq_queued--;
 	cfqg_stats_update_io_remove(RQ_CFQG(rq), req_op(rq), rq->cmd_flags);
 
-	if (rq->cmd_flags & REQ_PRIO) {
+	if (rq->cmd_flags & RQ_PRIO_MASK) {
 		WARN_ON(!cfqq->prio_pending);
 		cfqq->prio_pending--;
 	}
@@ -3700,6 +3702,7 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct cfq_io_cq *cic)
 	 * elevate the priority of this queue
 	 */
 	cfqq->org_ioprio = cfqq->ioprio;
+	cfqq->org_ioprio_class = cfqq->ioprio_class;
 	cfq_clear_cfqq_prio_changed(cfqq);
 }
 
@@ -4012,7 +4015,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	 * So both queues are sync. Let the new request get disk time if
 	 * it's a metadata request and the current queue is doing regular IO.
 	 */
-	if ((rq->cmd_flags & REQ_PRIO) && !cfqq->prio_pending)
+	if ((rq->cmd_flags & RQ_PRIO_MASK) && !cfqq->prio_pending)
 		return true;
 
 	/* An idle queue should not be idle now for some reason */
@@ -4073,7 +4076,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	struct cfq_io_cq *cic = RQ_CIC(rq);
 
 	cfqd->rq_queued++;
-	if (rq->cmd_flags & REQ_PRIO)
+	if (rq->cmd_flags & RQ_PRIO_MASK)
 		cfqq->prio_pending++;
 
 	cfq_update_io_thinktime(cfqd, cfqq, cic);
@@ -4295,6 +4298,20 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 		cfq_schedule_dispatch(cfqd);
 }
 
+static void cfqq_boost_on_meta(struct cfq_queue *cfqq, int op_flags)
+{
+	if (!(op_flags & RQ_PRIO_MASK)) {
+		cfqq->ioprio_class = cfqq->org_ioprio_class;
+		cfqq->ioprio = cfqq->org_ioprio;
+		return;
+	}
+
+	if (cfq_class_idle(cfqq))
+		cfqq->ioprio_class = IOPRIO_CLASS_BE;
+	if (cfqq->ioprio > IOPRIO_NORM)
+		cfqq->ioprio = IOPRIO_NORM;
+}
+
 static inline int __cfq_may_queue(struct cfq_queue *cfqq)
 {
 	if (cfq_cfqq_wait_request(cfqq) && !cfq_cfqq_must_alloc_slice(cfqq)) {
@@ -4325,6 +4342,7 @@ static int cfq_may_queue(struct request_queue *q, int op, int op_flags)
 	cfqq = cic_to_cfqq(cic, rw_is_sync(op, op_flags));
 	if (cfqq) {
 		cfq_init_prio_data(cfqq, cic);
+		cfqq_boost_on_meta(cfqq, op_flags);
 
 		return __cfq_may_queue(cfqq);
 	}
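A note on how the boost behaves: cfqq_boost_on_meta() is called from cfq_may_queue(), so the boost is applied (or reverted) as each new request arrives for the queue. An idle-class queue issuing a META/PRIO request is lifted to the best-effort class at normal priority, and as soon as a request without those flags shows up, the queue falls back to the original class and priority saved in org_ioprio_class/org_ioprio by cfq_init_prio_data().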
At Facebook, we have a number of cases where people use ionice to set a
lower priority, then end up having tasks stuck for a long time because
e.g. metadata updates from an idle-priority task are blocking out
higher-priority processes. It's bad enough that it will trigger the
softlockup warning.

This patch adds code to CFQ that bumps the priority class and data for
an idle task, if it is doing IO marked as PRIO or META. With this, we no
longer see the softlockups.

Signed-off-by: Jens Axboe <axboe@fb.com>
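For anyone wanting to reproduce the setup, the idle class comes from the ioprio_set() syscall that ionice wraps. A minimal userspace sketch of what `ionice -c 3 <cmd>` does before exec'ing (glibc provides no wrapper, so the raw syscall is used; the constants are copied here for self-containment):

	/* Minimal sketch: put the calling process in the idle I/O
	 * scheduling class, as ionice -c 3 does, via ioprio_set().
	 */
	#include <unistd.h>
	#include <sys/syscall.h>

	#define IOPRIO_CLASS_SHIFT		13
	#define IOPRIO_PRIO_VALUE(cl, data)	(((cl) << IOPRIO_CLASS_SHIFT) | (data))
	#define IOPRIO_WHO_PROCESS		1
	#define IOPRIO_CLASS_IDLE		3

	int main(void)
	{
		/* who == 0 means the calling process */
		syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
			IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
		/* ... any I/O issued from here on is idle-class ... */
		return 0;
	}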