Message ID | 0eb94fa3-a1d0-f9b3-fb51-c22eaad225a7@kernel.dk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] block: reduce kblockd_mod_delayed_work_on() CPU consumption | expand |
On Tue, Dec 14, 2021 at 01:49:34PM -0700, Jens Axboe wrote: > Dexuan reports that he's seeing spikes of very heavy CPU utilization when > running 24 disks and using the 'none' scheduler. This happens off the > sched restart path, because SCSI requires the queue to be restarted async, > and hence we're hammering on mod_delayed_work_on() to ensure that the work > item gets run appropriately. > > Avoid hammering on the timer and just use queue_work_on() if no delay > has been specified. > > Reported-and-tested-by: Dexuan Cui <decui@microsoft.com> > Link: https://lore.kernel.org/linux-block/BYAPR21MB1270C598ED214C0490F47400BF719@BYAPR21MB1270.namprd21.prod.outlook.com/ > Signed-off-by: Jens Axboe <axboe@kernel.dk> > > --- > > diff --git a/block/blk-core.c b/block/blk-core.c > index 1378d084c770..c1833f95cb97 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -1484,6 +1484,8 @@ EXPORT_SYMBOL(kblockd_schedule_work); > int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork, > unsigned long delay) > { > + if (!delay) > + return queue_work_on(cpu, kblockd_workqueue, &dwork->work); > return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay); Reviewed-by: Ming Lei <ming.lei@redhat.com> Thanks, Ming
On 14/12/2021 20:49, Jens Axboe wrote: > Dexuan reports that he's seeing spikes of very heavy CPU utilization when > running 24 disks and using the 'none' scheduler. This happens off the > sched restart path, because SCSI requires the queue to be restarted async, > and hence we're hammering on mod_delayed_work_on() to ensure that the work > item gets run appropriately. > > Avoid hammering on the timer and just use queue_work_on() if no delay > has been specified. > > Reported-and-tested-by: Dexuan Cui <decui@microsoft.com> > Link: https://lore.kernel.org/linux-block/BYAPR21MB1270C598ED214C0490F47400BF719@BYAPR21MB1270.namprd21.prod.outlook.com/ > Signed-off-by: Jens Axboe <axboe@kernel.dk> > > --- > > diff --git a/block/blk-core.c b/block/blk-core.c > index 1378d084c770..c1833f95cb97 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -1484,6 +1484,8 @@ EXPORT_SYMBOL(kblockd_schedule_work); > int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork, > unsigned long delay) > { > + if (!delay) > + return queue_work_on(cpu, kblockd_workqueue, &dwork->work); > return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay); > } > EXPORT_SYMBOL(kblockd_mod_delayed_work_on); > Hi Jens, I have a related comment on the current code and interface it uses, if you don't mind, as I did wonder if we are doing a msec_to_jiffies(0 [not built-in const]) call somewhere. So we pass msecs to blk-mq.c, and we do a msec_to_jiffies() call on it before calling kblockd_mod_delayed_work_on(). Now most/all callsites uses const value for the msec value, so if we did the msec_to_jiffies() conversion at the callsites and passed a jiffies value, it should be compiled out by gcc. This is my current __blk_mq_delay_run_hw_queue assembler: 0000000000001ef0 <__blk_mq_delay_run_hw_queue>: [snip] 2024: a942dfb6 ldp x22, x23, [x29, #40] 2028: 2a1503e0 mov w0, w21 202c: 94000000 bl 0 <__msecs_to_jiffies> kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work, 2030: aa0003e2 mov x2, x0 2034: 91010261 add x1, x19, #0x40 2038: 2a1403e0 mov w0, w20 203c: 94000000 bl 0 <kblockd_mod_delayed_work_on> I'm not sure if you would want to change so many APIs or if jiffies is sensible to pass or even any performance gain. Additionally Function blk_mq_delay_kick_requeue_list() would not see so much gain in such a change as msec value is not const. Any thoughts? Maybe testing performance would not do much harm. Thanks, John
On 12/15/21 3:25 AM, John Garry wrote: > On 14/12/2021 20:49, Jens Axboe wrote: >> Dexuan reports that he's seeing spikes of very heavy CPU utilization when >> running 24 disks and using the 'none' scheduler. This happens off the >> sched restart path, because SCSI requires the queue to be restarted async, >> and hence we're hammering on mod_delayed_work_on() to ensure that the work >> item gets run appropriately. >> >> Avoid hammering on the timer and just use queue_work_on() if no delay >> has been specified. >> >> Reported-and-tested-by: Dexuan Cui <decui@microsoft.com> >> Link: https://lore.kernel.org/linux-block/BYAPR21MB1270C598ED214C0490F47400BF719@BYAPR21MB1270.namprd21.prod.outlook.com/ >> Signed-off-by: Jens Axboe <axboe@kernel.dk> >> >> --- >> >> diff --git a/block/blk-core.c b/block/blk-core.c >> index 1378d084c770..c1833f95cb97 100644 >> --- a/block/blk-core.c >> +++ b/block/blk-core.c >> @@ -1484,6 +1484,8 @@ EXPORT_SYMBOL(kblockd_schedule_work); >> int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork, >> unsigned long delay) >> { >> + if (!delay) >> + return queue_work_on(cpu, kblockd_workqueue, &dwork->work); >> return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay); >> } >> EXPORT_SYMBOL(kblockd_mod_delayed_work_on); >> > > Hi Jens, > > I have a related comment on the current code and interface it uses, if > you don't mind, as I did wonder if we are doing a msec_to_jiffies(0 [not > built-in const]) call somewhere. > > So we pass msecs to blk-mq.c, and we do a msec_to_jiffies() call on it > before calling kblockd_mod_delayed_work_on(). Now most/all callsites > uses const value for the msec value, so if we did the msec_to_jiffies() > conversion at the callsites and passed a jiffies value, it should be > compiled out by gcc. This is my current __blk_mq_delay_run_hw_queue > assembler: > > 0000000000001ef0 <__blk_mq_delay_run_hw_queue>: > [snip] > 2024: a942dfb6 ldp x22, x23, [x29, #40] > 2028: 2a1503e0 mov w0, w21 > 202c: 94000000 bl 0 <__msecs_to_jiffies> > kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work, > 2030: aa0003e2 mov x2, x0 > 2034: 91010261 add x1, x19, #0x40 > 2038: 2a1403e0 mov w0, w20 > 203c: 94000000 bl 0 <kblockd_mod_delayed_work_on> > > I'm not sure if you would want to change so many APIs or if jiffies is > sensible to pass or even any performance gain. Additionally Function > blk_mq_delay_kick_requeue_list() would not see so much gain in such a > change as msec value is not const. Any thoughts? Maybe testing > performance would not do much harm. In general I totally agree with you, it'd be smarter to flip the conversion so it can be done in a more efficient manner. At the same time, the queue delay running is not at all a fast path, so shouldn't really matter in practice.
On 15/12/2021 15:47, Jens Axboe wrote: >> 0000000000001ef0 <__blk_mq_delay_run_hw_queue>: >> [snip] >> 2024: a942dfb6 ldp x22, x23, [x29, #40] >> 2028: 2a1503e0 mov w0, w21 >> 202c: 94000000 bl 0 <__msecs_to_jiffies> >> kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work, >> 2030: aa0003e2 mov x2, x0 >> 2034: 91010261 add x1, x19, #0x40 >> 2038: 2a1403e0 mov w0, w20 >> 203c: 94000000 bl 0 <kblockd_mod_delayed_work_on> >> >> I'm not sure if you would want to change so many APIs or if jiffies is >> sensible to pass or even any performance gain. Additionally Function >> blk_mq_delay_kick_requeue_list() would not see so much gain in such a >> change as msec value is not const. Any thoughts? Maybe testing >> performance would not do much harm. > In general I totally agree with you, it'd be smarter to flip the > conversion so it can be done in a more efficient manner. > At the same > time, the queue delay running is not at all a fast path, so shouldn't > really matter in practice. ok, I just thought that from checking your change that we have a frequent msec_to_jiffies(0 [non const]) call in __blk_mq_delay_run_hw_queue() -> kblockd_mod_delayed_work_on(). Thanks, John
diff --git a/block/blk-core.c b/block/blk-core.c index 1378d084c770..c1833f95cb97 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1484,6 +1484,8 @@ EXPORT_SYMBOL(kblockd_schedule_work); int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork, unsigned long delay) { + if (!delay) + return queue_work_on(cpu, kblockd_workqueue, &dwork->work); return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay); } EXPORT_SYMBOL(kblockd_mod_delayed_work_on);
Dexuan reports that he's seeing spikes of very heavy CPU utilization when running 24 disks and using the 'none' scheduler. This happens off the sched restart path, because SCSI requires the queue to be restarted async, and hence we're hammering on mod_delayed_work_on() to ensure that the work item gets run appropriately. Avoid hammering on the timer and just use queue_work_on() if no delay has been specified. Reported-and-tested-by: Dexuan Cui <decui@microsoft.com> Link: https://lore.kernel.org/linux-block/BYAPR21MB1270C598ED214C0490F47400BF719@BYAPR21MB1270.namprd21.prod.outlook.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk> ---