Message ID | 20240122224220.1206234-1-qyousef@layalina.io (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | block/blk-mq: Don't complete locally if capacities are different | expand |
On Mon, Jan 22, 2024 at 10:42:20PM +0000, Qais Yousef wrote: > The logic in blk_mq_complete_need_ipi() assumes SMP systems where all > CPUs have equal capacities What is a capacity here? > + return arch_scale_cpu_capacity(this_cpu) >= arch_scale_cpu_capacity(that_cpu); oerly long line here. Also pleas split patches for different subsystems.
On 1/23/24 1:47 AM, Christoph Hellwig wrote: > On Mon, Jan 22, 2024 at 10:42:20PM +0000, Qais Yousef wrote: >> The logic in blk_mq_complete_need_ipi() assumes SMP systems where all >> CPUs have equal capacities > > What is a capacity here? It seems to be the chosen word to describe the performance potential of the core in question, we use it elsewhere in the kernel. But yes, could do with a bit more of an explanation. >> + return arch_scale_cpu_capacity(this_cpu) >= arch_scale_cpu_capacity(that_cpu); > > oerly long line here. > > Also pleas split patches for different subsystems. Yes please, the sched/topology thing should be a separate prep patch.
On 01/23/24 08:58, Jens Axboe wrote: > On 1/23/24 1:47 AM, Christoph Hellwig wrote: > > On Mon, Jan 22, 2024 at 10:42:20PM +0000, Qais Yousef wrote: > >> The logic in blk_mq_complete_need_ipi() assumes SMP systems where all > >> CPUs have equal capacities > > > > What is a capacity here? > > It seems to be the chosen word to describe the performance potential of > the core in question, we use it elsewhere in the kernel. But yes, could > do with a bit more of an explanation. Is referring to it as compute capacity makes it clearer? Sorry I thought that's a common term. > > >> + return arch_scale_cpu_capacity(this_cpu) >= arch_scale_cpu_capacity(that_cpu); > > > > oerly long line here. This is consistent with similar long lines in the same file and it's more readable as one line. checkpatch doesn't complain about this being long; I think they look for 100 or 120 now. This is 86. > > > > Also pleas split patches for different subsystems. > > Yes please, the sched/topology thing should be a separate prep patch. Okay. I thought the norm to keep such small patches self contained as adding dead code followed by a one liner isn't always seen as better. But a split it is :) I'll give the sched/arm folks time to have a look before I post a new version. Thanks! -- Qais Yousef
On 01/22/24 22:42, Qais Yousef wrote: > The logic in blk_mq_complete_need_ipi() assumes SMP systems where all > CPUs have equal capacities and only LLC cache can make a different on > perceived performance. But this assumption falls apart on HMP systems > where LLC is shared, but the CPUs have different capacities. Staying > local then can have a big performance impact if the IO request was done > from a CPU with higher capacity but the interrupt is serviced on a lower > capacity CPU. > > Introduce new cpus_gte_capacity() function to enable do the additional > check. As I was preparing to send a new version. I thought it is worth noting here that I initially had cpus_equal_capacity() to check if the performance on the two cpus is equal, but then opted to change it to >= so that we favour perf. But now I am having 2nd thoughts again. If the interrupt was on a bigger core but the request comes from a little core, is it better to move it to little core (which can save power) to match the requester, or better honour the fact that someone has put the interrupt on a bigger core and maybe they prefer to keep requests there by default? I am leaning back towards cpus_equal_capacity() so we match the performance level the requester is running at. irqbalancer can end up shuffling interrupts potentially and I believe it is better for the scheduler to handle the performance level the requester should be running at, and then here our job should be to match that level without making more assumptions. -- Qais Yousef > > Without the patch I see the BLOCK softirq always running on little cores > (where the hardirq is serviced). With it I can see it running on all > cores. > > This was noticed after the topology change [1] where now on a big.LITTLE > we truly get that the LLC is shared between all cores where as in the > past it was being misrepresented for historical reasons. The logic > exposed a missing dependency on capacities for such systems where there > can be a big performance difference between the CPUs. > > This of course introduced a noticeable change in behavior depending on > how the topology is presented. Leading to regressions in some workloads > as the performance of the BLOCK softirq on littles can be noticeably > worse. > > [1] https://lpc.events/event/16/contributions/1342/attachments/962/1883/LPC-2022-Android-MC-Phantom-Domains.pdf > > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> > --- > block/blk-mq.c | 5 +++-- > include/linux/sched/topology.h | 6 ++++++ > kernel/sched/core.c | 8 ++++++++ > 3 files changed, 17 insertions(+), 2 deletions(-) > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index ac18f802c027..9b2d278a7ae7 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -1163,10 +1163,11 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq) > if (force_irqthreads()) > return false; > > - /* same CPU or cache domain? Complete locally */ > + /* same CPU or cache domain and capacity? Complete locally */ > if (cpu == rq->mq_ctx->cpu || > (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) && > - cpus_share_cache(cpu, rq->mq_ctx->cpu))) > + cpus_share_cache(cpu, rq->mq_ctx->cpu) && > + cpus_gte_capacity(cpu, rq->mq_ctx->cpu))) > return false; > > /* don't try to IPI to an offline CPU */ > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h > index a6e04b4a21d7..31cef5780ba4 100644 > --- a/include/linux/sched/topology.h > +++ b/include/linux/sched/topology.h > @@ -176,6 +176,7 @@ extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[], > cpumask_var_t *alloc_sched_domains(unsigned int ndoms); > void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms); > > +bool cpus_gte_capacity(int this_cpu, int that_cpu); > bool cpus_share_cache(int this_cpu, int that_cpu); > bool cpus_share_resources(int this_cpu, int that_cpu); > > @@ -226,6 +227,11 @@ partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[], > { > } > > +static inline bool cpus_gte_capacity(int this_cpu, int that_cpu) > +{ > + return true; > +} > + > static inline bool cpus_share_cache(int this_cpu, int that_cpu) > { > return true; > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index db4be4921e7f..db5ab4b3cee7 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -3954,6 +3954,14 @@ void wake_up_if_idle(int cpu) > } > } > > +bool cpus_gte_capacity(int this_cpu, int that_cpu) > +{ > + if (this_cpu == that_cpu) > + return true; > + > + return arch_scale_cpu_capacity(this_cpu) >= arch_scale_cpu_capacity(that_cpu); > +} > + > bool cpus_share_cache(int this_cpu, int that_cpu) > { > if (this_cpu == that_cpu) > -- > 2.34.1 >
diff --git a/block/blk-mq.c b/block/blk-mq.c index ac18f802c027..9b2d278a7ae7 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1163,10 +1163,11 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq) if (force_irqthreads()) return false; - /* same CPU or cache domain? Complete locally */ + /* same CPU or cache domain and capacity? Complete locally */ if (cpu == rq->mq_ctx->cpu || (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) && - cpus_share_cache(cpu, rq->mq_ctx->cpu))) + cpus_share_cache(cpu, rq->mq_ctx->cpu) && + cpus_gte_capacity(cpu, rq->mq_ctx->cpu))) return false; /* don't try to IPI to an offline CPU */ diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index a6e04b4a21d7..31cef5780ba4 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -176,6 +176,7 @@ extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[], cpumask_var_t *alloc_sched_domains(unsigned int ndoms); void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms); +bool cpus_gte_capacity(int this_cpu, int that_cpu); bool cpus_share_cache(int this_cpu, int that_cpu); bool cpus_share_resources(int this_cpu, int that_cpu); @@ -226,6 +227,11 @@ partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[], { } +static inline bool cpus_gte_capacity(int this_cpu, int that_cpu) +{ + return true; +} + static inline bool cpus_share_cache(int this_cpu, int that_cpu) { return true; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index db4be4921e7f..db5ab4b3cee7 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3954,6 +3954,14 @@ void wake_up_if_idle(int cpu) } } +bool cpus_gte_capacity(int this_cpu, int that_cpu) +{ + if (this_cpu == that_cpu) + return true; + + return arch_scale_cpu_capacity(this_cpu) >= arch_scale_cpu_capacity(that_cpu); +} + bool cpus_share_cache(int this_cpu, int that_cpu) { if (this_cpu == that_cpu)
The logic in blk_mq_complete_need_ipi() assumes SMP systems where all CPUs have equal capacities and only LLC cache can make a different on perceived performance. But this assumption falls apart on HMP systems where LLC is shared, but the CPUs have different capacities. Staying local then can have a big performance impact if the IO request was done from a CPU with higher capacity but the interrupt is serviced on a lower capacity CPU. Introduce new cpus_gte_capacity() function to enable do the additional check. Without the patch I see the BLOCK softirq always running on little cores (where the hardirq is serviced). With it I can see it running on all cores. This was noticed after the topology change [1] where now on a big.LITTLE we truly get that the LLC is shared between all cores where as in the past it was being misrepresented for historical reasons. The logic exposed a missing dependency on capacities for such systems where there can be a big performance difference between the CPUs. This of course introduced a noticeable change in behavior depending on how the topology is presented. Leading to regressions in some workloads as the performance of the BLOCK softirq on littles can be noticeably worse. [1] https://lpc.events/event/16/contributions/1342/attachments/962/1883/LPC-2022-Android-MC-Phantom-Domains.pdf Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> --- block/blk-mq.c | 5 +++-- include/linux/sched/topology.h | 6 ++++++ kernel/sched/core.c | 8 ++++++++ 3 files changed, 17 insertions(+), 2 deletions(-)