Message ID | 20230307143558.294354-8-vschneid@redhat.com (mailing list archive) |
---|---|
State | Handled Elsewhere |
Series | Generic IPI sending tracepoint |
Context | Check | Description |
---|---|---|
conchuod/cover_letter | success | Series has a cover letter |
conchuod/tree_selection | success | Guessed tree name to be for-next |
conchuod/fixes_present | success | Fixes tag not required for -next series |
conchuod/maintainers_pattern | success | MAINTAINERS pattern errors before the patch: 1 and now 1 |
conchuod/verify_signedoff | success | Signed-off-by tag matches author and committer |
conchuod/kdoc | success | Errors and warnings before: 0 this patch: 0 |
conchuod/build_rv64_clang_allmodconfig | fail | Failed to build the tree with this patch. |
conchuod/module_param | success | Was 0 now: 0 |
conchuod/build_rv64_gcc_allmodconfig | success | Errors and warnings before: 99 this patch: 99 |
conchuod/alphanumeric_selects | success | Out of order selects before the patch: 728 and now 728 |
conchuod/build_rv32_defconfig | fail | Build failed |
conchuod/dtb_warn_rv64 | success | Errors and warnings before: 3 this patch: 3 |
conchuod/header_inline | success | No static functions without inline keyword in header files |
conchuod/checkpatch | warning | CHECK: extern prototypes should be avoided in .h files |
conchuod/source_inline | success | Was 0 now: 0 |
conchuod/build_rv64_nommu_k210_defconfig | success | Build OK |
conchuod/verify_fixes | success | No Fixes tag |
conchuod/build_rv64_nommu_virt_defconfig | success | Build OK |
On Tue, Mar 07, 2023 at 02:35:58PM +0000, Valentin Schneider wrote:
> @@ -477,6 +490,25 @@ static __always_inline void csd_unlock(struct __call_single_data *csd)
>  	smp_store_release(&csd->node.u_flags, 0);
>  }
>
> +static __always_inline void
> +raw_smp_call_single_queue(int cpu, struct llist_node *node, smp_call_func_t func)
> +{
> +	/*
> +	 * The list addition should be visible to the target CPU when it pops
> +	 * the head of the list to pull the entry off it in the IPI handler
> +	 * because of normal cache coherency rules implied by the underlying
> +	 * llist ops.
> +	 *
> +	 * If IPIs can go out of order to the cache coherency protocol
> +	 * in an architecture, sufficient synchronisation should be added
> +	 * to arch code to make it appear to obey cache coherency WRT
> +	 * locking and barrier primitives. Generic code isn't really
> +	 * equipped to do the right thing...
> +	 */
> +	if (llist_add(node, &per_cpu(call_single_queue, cpu)))
> +		send_call_function_single_ipi(cpu, func);
> +}
> +
>  static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);
>
>  void __smp_call_single_queue(int cpu, struct llist_node *node)
> @@ -493,21 +525,25 @@ void __smp_call_single_queue(int cpu, struct llist_node *node)
>  		}
>  	}
>  #endif
>  	/*
> +	 * We have to check the type of the CSD before queueing it, because
> +	 * once queued it can have its flags cleared by
> +	 * flush_smp_call_function_queue()
> +	 * even if we haven't sent the smp_call IPI yet (e.g. the stopper
> +	 * executes migration_cpu_stop() on the remote CPU).
>  	 */
> +	if (trace_ipi_send_cpumask_enabled()) {
> +		call_single_data_t *csd;
> +		smp_call_func_t func;
> +
> +		csd = container_of(node, call_single_data_t, node.llist);
> +		func = CSD_TYPE(csd) == CSD_TYPE_TTWU ?
> +			sched_ttwu_pending : csd->func;
> +
> +		raw_smp_call_single_queue(cpu, node, func);
> +	} else {
> +		raw_smp_call_single_queue(cpu, node, NULL);
> +	}
>  }

Hurmph... so we only really consume @func when we IPI. Would it not be
more useful to trace this thing for *every* csd enqueued?
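A minimal userspace sketch of the behaviour being questioned here, assuming a single-threaded stand-in for the kernel's lock-free llist. queue_add() mimics llist_add() only in that it reports whether the list was empty beforehand, so just the first entry queued raises the (simulated) IPI and has its callback recorded. The node, queue, queue_add() and enqueue() names are hypothetical and are not part of kernel/smp.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; not the kernel's llist or CSD types. */
struct node {
	struct node *next;
	const char *func;	/* name of the callback, for illustration only */
};

struct queue {
	struct node *head;
	int ipis_sent;		/* how many times an "IPI" would be raised */
	const char *traced;	/* callback recorded when the "IPI" was raised */
};

/* Push onto the list; like llist_add(), report whether it was empty before. */
static bool queue_add(struct queue *q, struct node *n)
{
	bool was_empty = (q->head == NULL);

	n->next = q->head;
	q->head = n;
	return was_empty;
}

/* Same shape as raw_smp_call_single_queue() in the patch under review. */
static void enqueue(struct queue *q, struct node *n)
{
	assert(n->func != NULL);

	if (queue_add(q, n)) {
		q->ipis_sent++;
		q->traced = n->func;	/* only the first entry gets traced */
	}
}
```

Queueing two nodes back to back on an empty queue leaves ipis_sent at 1 and traced pointing at the first node's callback; the second callback never shows up, which is the gap the question above is about.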
On 22/03/23 10:53, Peter Zijlstra wrote: > On Tue, Mar 07, 2023 at 02:35:58PM +0000, Valentin Schneider wrote: > >> @@ -477,6 +490,25 @@ static __always_inline void csd_unlock(struct __call_single_data *csd) >> smp_store_release(&csd->node.u_flags, 0); >> } >> >> +static __always_inline void >> +raw_smp_call_single_queue(int cpu, struct llist_node *node, smp_call_func_t func) >> +{ >> + /* >> + * The list addition should be visible to the target CPU when it pops >> + * the head of the list to pull the entry off it in the IPI handler >> + * because of normal cache coherency rules implied by the underlying >> + * llist ops. >> + * >> + * If IPIs can go out of order to the cache coherency protocol >> + * in an architecture, sufficient synchronisation should be added >> + * to arch code to make it appear to obey cache coherency WRT >> + * locking and barrier primitives. Generic code isn't really >> + * equipped to do the right thing... >> + */ >> + if (llist_add(node, &per_cpu(call_single_queue, cpu))) >> + send_call_function_single_ipi(cpu, func); >> +} >> + >> static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data); >> >> void __smp_call_single_queue(int cpu, struct llist_node *node) >> @@ -493,21 +525,25 @@ void __smp_call_single_queue(int cpu, struct llist_node *node) >> } >> } >> #endif >> /* >> + * We have to check the type of the CSD before queueing it, because >> + * once queued it can have its flags cleared by >> + * flush_smp_call_function_queue() >> + * even if we haven't sent the smp_call IPI yet (e.g. the stopper >> + * executes migration_cpu_stop() on the remote CPU). >> */ >> + if (trace_ipi_send_cpumask_enabled()) { >> + call_single_data_t *csd; >> + smp_call_func_t func; >> + >> + csd = container_of(node, call_single_data_t, node.llist); >> + func = CSD_TYPE(csd) == CSD_TYPE_TTWU ? 
>> + sched_ttwu_pending : csd->func; >> + >> + raw_smp_call_single_queue(cpu, node, func); >> + } else { >> + raw_smp_call_single_queue(cpu, node, NULL); >> + } >> } > > Hurmph... so we only really consume @func when we IPI. Would it not be > more useful to trace this thing for *every* csd enqeued? It's true that any CSD enqueued on that CPU's call_single_queue in the [first CSD llist_add()'ed, IPI IRQ hits] timeframe is a potential source of interference. However, can we be sure that first CSD isn't an indirect cause for the following ones? say the target CPU exits RCU EQS due to the IPI, there's a bit of time before it gets to flush_smp_call_function_queue() where some other CSD could be enqueued *because* of that change in state. I couldn't find a easy example of that, I might be biased as this is where I'd like to go wrt IPI'ing isolated CPUs in usermode. But regardless, when correlating an IPI IRQ with its source, we'd always have to look at the first CSD in that CSD stack.
On Wed, Mar 22, 2023 at 12:20:28PM +0000, Valentin Schneider wrote:
> On 22/03/23 10:53, Peter Zijlstra wrote:

> > Hurmph... so we only really consume @func when we IPI. Would it not be
> > more useful to trace this thing for *every* csd enqeued?
>
> It's true that any CSD enqueued on that CPU's call_single_queue in the
> [first CSD llist_add()'ed, IPI IRQ hits] timeframe is a potential source of
> interference.
>
> However, can we be sure that first CSD isn't an indirect cause for the
> following ones? say the target CPU exits RCU EQS due to the IPI, there's a
> bit of time before it gets to flush_smp_call_function_queue() where some other CSD
> could be enqueued *because* of that change in state.
>
> I couldn't find a easy example of that, I might be biased as this is where
> I'd like to go wrt IPI'ing isolated CPUs in usermode. But regardless, when
> correlating an IPI IRQ with its source, we'd always have to look at the
> first CSD in that CSD stack.

So I was thinking something like this:

---
Subject: trace,smp: Trace all smp_function_call*() invocations
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Mar 22 14:58:36 CET 2023

(Ab)use the trace_ipi_send_cpu*() family to trace all
smp_function_call*() invocations, not only those that result in an
actual IPI.

The queued entries log their callback function while the actual IPIs
are traced on generic_smp_call_function_single_interrupt().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/smp.c | 58 ++++++++++++++++++++++++++++++----------------------------
 1 file changed, 30 insertions(+), 28 deletions(-)

--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -106,18 +106,20 @@ void __init call_function_init(void)
 }

 static __always_inline void
-send_call_function_single_ipi(int cpu, smp_call_func_t func)
+send_call_function_single_ipi(int cpu)
 {
 	if (call_function_single_prep_ipi(cpu)) {
-		trace_ipi_send_cpu(cpu, _RET_IP_, func);
+		trace_ipi_send_cpu(cpu, _RET_IP_,
+				   generic_smp_call_function_single_interrupt);
 		arch_send_call_function_single_ipi(cpu);
 	}
 }

 static __always_inline void
-send_call_function_ipi_mask(const struct cpumask *mask, smp_call_func_t func)
+send_call_function_ipi_mask(const struct cpumask *mask)
 {
-	trace_ipi_send_cpumask(mask, _RET_IP_, func);
+	trace_ipi_send_cpumask(mask, _RET_IP_,
+			       generic_smp_call_function_single_interrupt);
 	arch_send_call_function_ipi_mask(mask);
 }

@@ -318,25 +320,6 @@ static __always_inline void csd_unlock(s
 	smp_store_release(&csd->node.u_flags, 0);
 }

-static __always_inline void
-raw_smp_call_single_queue(int cpu, struct llist_node *node, smp_call_func_t func)
-{
-	/*
-	 * The list addition should be visible to the target CPU when it pops
-	 * the head of the list to pull the entry off it in the IPI handler
-	 * because of normal cache coherency rules implied by the underlying
-	 * llist ops.
-	 *
-	 * If IPIs can go out of order to the cache coherency protocol
-	 * in an architecture, sufficient synchronisation should be added
-	 * to arch code to make it appear to obey cache coherency WRT
-	 * locking and barrier primitives. Generic code isn't really
-	 * equipped to do the right thing...
-	 */
-	if (llist_add(node, &per_cpu(call_single_queue, cpu)))
-		send_call_function_single_ipi(cpu, func);
-}
-
 static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);

 void __smp_call_single_queue(int cpu, struct llist_node *node)
@@ -356,10 +339,23 @@ void __smp_call_single_queue(int cpu, st
 		func = CSD_TYPE(csd) == CSD_TYPE_TTWU ?
 			sched_ttwu_pending : csd->func;

-		raw_smp_call_single_queue(cpu, node, func);
-	} else {
-		raw_smp_call_single_queue(cpu, node, NULL);
+		trace_ipi_send_cpu(cpu, _RET_IP_, func);
 	}
+
+	/*
+	 * The list addition should be visible to the target CPU when it pops
+	 * the head of the list to pull the entry off it in the IPI handler
+	 * because of normal cache coherency rules implied by the underlying
+	 * llist ops.
+	 *
+	 * If IPIs can go out of order to the cache coherency protocol
+	 * in an architecture, sufficient synchronisation should be added
+	 * to arch code to make it appear to obey cache coherency WRT
+	 * locking and barrier primitives. Generic code isn't really
+	 * equipped to do the right thing...
+	 */
+	if (llist_add(node, &per_cpu(call_single_queue, cpu)))
+		send_call_function_single_ipi(cpu);
 }

 /*
@@ -798,14 +794,20 @@ static void smp_call_function_many_cond(
 		}

 		/*
+		 * Trace each smp_function_call_*() as an IPI, actual IPIs
+		 * will be traced with func==generic_smp_call_function_single_ipi().
+		 */
+		trace_ipi_send_cpumask(cfd->cpumask_ipi, _RET_IP_, func);
+
+		/*
 		 * Choose the most efficient way to send an IPI. Note that the
 		 * number of CPUs might be zero due to concurrent changes to the
 		 * provided mask.
 		 */
 		if (nr_cpus == 1)
-			send_call_function_single_ipi(last_cpu, func);
+			send_call_function_single_ipi(last_cpu);
 		else if (likely(nr_cpus > 1))
-			send_call_function_ipi_mask(cfd->cpumask_ipi, func);
+			send_call_function_ipi_mask(cfd->cpumask_ipi);
 	}

 	if (run_local && (!cond_func || cond_func(this_cpu, info))) {
On 22/03/23 15:04, Peter Zijlstra wrote: > On Wed, Mar 22, 2023 at 12:20:28PM +0000, Valentin Schneider wrote: >> On 22/03/23 10:53, Peter Zijlstra wrote: > >> > Hurmph... so we only really consume @func when we IPI. Would it not be >> > more useful to trace this thing for *every* csd enqeued? >> >> It's true that any CSD enqueued on that CPU's call_single_queue in the >> [first CSD llist_add()'ed, IPI IRQ hits] timeframe is a potential source of >> interference. >> >> However, can we be sure that first CSD isn't an indirect cause for the >> following ones? say the target CPU exits RCU EQS due to the IPI, there's a >> bit of time before it gets to flush_smp_call_function_queue() where some other CSD >> could be enqueued *because* of that change in state. >> >> I couldn't find a easy example of that, I might be biased as this is where >> I'd like to go wrt IPI'ing isolated CPUs in usermode. But regardless, when >> correlating an IPI IRQ with its source, we'd always have to look at the >> first CSD in that CSD stack. > > So I was thinking something like this: > > --- > Subject: trace,smp: Trace all smp_function_call*() invocations > From: Peter Zijlstra <peterz@infradead.org> > Date: Wed Mar 22 14:58:36 CET 2023 > > (Ab)use the trace_ipi_send_cpu*() family to trace all > smp_function_call*() invocations, not only those that result in an > actual IPI. > > The queued entries log their callback function while the actual IPIs > are traced on generic_smp_call_function_single_interrupt(). 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > --- > kernel/smp.c | 58 ++++++++++++++++++++++++++++++---------------------------- > 1 file changed, 30 insertions(+), 28 deletions(-) > > --- a/kernel/smp.c > +++ b/kernel/smp.c > @@ -106,18 +106,20 @@ void __init call_function_init(void) > } > > static __always_inline void > -send_call_function_single_ipi(int cpu, smp_call_func_t func) > +send_call_function_single_ipi(int cpu) > { > if (call_function_single_prep_ipi(cpu)) { > - trace_ipi_send_cpu(cpu, _RET_IP_, func); > + trace_ipi_send_cpu(cpu, _RET_IP_, > + generic_smp_call_function_single_interrupt); Hm, this does get rid of the func being passed down the helpers, but this means the trace events are now stateful, i.e. I need the first and last events in a CSD stack to figure out which one actually caused the IPI. It also requires whoever is looking at the trace to be aware of which IPIs are attached to a CSD, and which ones aren't. ATM that's only the resched IPI, but per the cover letter there's more to come (e.g. tick_broadcast() for arm64/riscv and a few others). For instance: hackbench-157 [001] 10.894320: ipi_send_cpu: cpu=3 callsite=check_preempt_curr+0x37 callback=0x0 hackbench-157 [001] 10.895068: ipi_send_cpu: cpu=3 callsite=try_to_wake_up+0x29e callback=sched_ttwu_pending+0x0 hackbench-157 [001] 10.895068: ipi_send_cpu: cpu=3 callsite=try_to_wake_up+0x29e callback=generic_smp_call_function_single_interrupt+0x0 That first one sent a RESCHEDULE IPI, the second one a CALL_FUNCTION one, but you really have to know what you're looking at... Are you worried about the @func being pushed down? Staring at x86 asm is not good for the soul, but AFAICT this does cause an extra register to be popped in the prologue because all of the helpers are __always_inline, so both paths of the static key(s) are in the same stackframe. 
I can "improve" this with: --- diff --git a/kernel/smp.c b/kernel/smp.c index 5cd680a7e78ef..55f120dae1713 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -511,6 +511,26 @@ raw_smp_call_single_queue(int cpu, struct llist_node *node, smp_call_func_t func static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data); +static noinline void __smp_call_single_queue_trace(int cpu, struct llist_node *node) +{ + call_single_data_t *csd; + smp_call_func_t func; + + + /* + * We have to check the type of the CSD before queueing it, because + * once queued it can have its flags cleared by + * flush_smp_call_function_queue() + * even if we haven't sent the smp_call IPI yet (e.g. the stopper + * executes migration_cpu_stop() on the remote CPU). + */ + csd = container_of(node, call_single_data_t, node.llist); + func = CSD_TYPE(csd) == CSD_TYPE_TTWU ? + sched_ttwu_pending : csd->func; + + raw_smp_call_single_queue(cpu, node, func); +} + void __smp_call_single_queue(int cpu, struct llist_node *node) { #ifdef CONFIG_CSD_LOCK_WAIT_DEBUG @@ -525,25 +545,10 @@ void __smp_call_single_queue(int cpu, struct llist_node *node) } } #endif - /* - * We have to check the type of the CSD before queueing it, because - * once queued it can have its flags cleared by - * flush_smp_call_function_queue() - * even if we haven't sent the smp_call IPI yet (e.g. the stopper - * executes migration_cpu_stop() on the remote CPU). - */ - if (trace_ipi_send_cpumask_enabled()) { - call_single_data_t *csd; - smp_call_func_t func; - - csd = container_of(node, call_single_data_t, node.llist); - func = CSD_TYPE(csd) == CSD_TYPE_TTWU ? - sched_ttwu_pending : csd->func; - - raw_smp_call_single_queue(cpu, node, func); - } else { + if (trace_ipi_send_cpumask_enabled()) + __smp_call_single_queue_trace(cpu, node); + else raw_smp_call_single_queue(cpu, node, NULL); - } } /*
On Wed, Mar 22, 2023 at 05:01:13PM +0000, Valentin Schneider wrote:

> > So I was thinking something like this:

> Hm, this does get rid of the func being passed down the helpers, but this
> means the trace events are now stateful, i.e. I need the first and last
> events in a CSD stack to figure out which one actually caused the IPI.

Isn't much of tracing stateful? I mean, why am I always writing awk
programs to parse trace output?

The one that is directly followed by
generic_smp_call_function_single_interrupt() (horrible name that), is
the one that tripped the IPI.

> It also requires whoever is looking at the trace to be aware of which IPIs
> are attached to a CSD, and which ones aren't. ATM that's only the resched
> IPI, but per the cover letter there's more to come (e.g. tick_broadcast()
> for arm64/riscv and a few others). For instance:
>
>   hackbench-157 [001] 10.894320: ipi_send_cpu: cpu=3 callsite=check_preempt_curr+0x37 callback=0x0

Arguably we should be setting callback to scheduler_ipi(), except
of course, that's not an actual function...

Maybe we can do "extern inline" for the actual users and provide a dummy
function for the symbol when tracing.

>   hackbench-157 [001] 10.895068: ipi_send_cpu: cpu=3 callsite=try_to_wake_up+0x29e callback=sched_ttwu_pending+0x0
>   hackbench-157 [001] 10.895068: ipi_send_cpu: cpu=3 callsite=try_to_wake_up+0x29e callback=generic_smp_call_function_single_interrupt+0x0
>
> That first one sent a RESCHEDULE IPI, the second one a CALL_FUNCTION one,
> but you really have to know what you're looking at...

But you have to know that anyway, you can't do tracing and not know wtf
you're doing. Or rather, if you do, I don't give a crap and you can keep
the pieces :-)

Grepping the callback should be pretty quick resolution as to what trips
it, no?

(also, if you *realllllly* can't manage, we can always add yet another
argument that gives a type thingy)

> Are you worried about the @func being pushed down?

Not really, I was finding it odd that only the first csd was being
logged. Either you should log them all (after all, the target CPU will
run them all and you might still wonder where the heck they came from)
or it should log none and always report that hideous long function name
I can't be arsed to type again :-)

> Staring at x86 asm is not good for the soul,

Scarred for life :-) What's worse, due to being exposed to Intel syntax
at a young age, I'm now permanently confused as to the argument order of
x86 asm.
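The "directly followed by" rule above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the ipi_event struct and the ipi_source() helper are hypothetical, the events are assumed to be already parsed from a single sending CPU's ipi_send_cpu output (so nothing from other senders is interleaved), and the callback is compared as a bare symbol name without the +0x0 offset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define IPI_MARKER "generic_smp_call_function_single_interrupt"

/* Hypothetical, already-parsed form of one ipi_send_cpu trace line. */
struct ipi_event {
	int cpu;		/* target CPU */
	const char *callback;	/* symbol name, without the +0x0 offset */
};

/* True if this event records the actual IPI rather than a queued CSD. */
static int is_marker(const struct ipi_event *ev)
{
	return strcmp(ev->callback, IPI_MARKER) == 0;
}

/*
 * Return the index of the enqueue event that tripped the IPI recorded at
 * index 'marker', or -1 if 'marker' is not an IPI event or is not directly
 * preceded by an enqueue event for the same target CPU.
 */
static int ipi_source(const struct ipi_event *ev, size_t n, size_t marker)
{
	assert(ev != NULL);

	if (marker == 0 || marker >= n || !is_marker(&ev[marker]))
		return -1;
	if (is_marker(&ev[marker - 1]) || ev[marker - 1].cpu != ev[marker].cpu)
		return -1;
	return (int)(marker - 1);
}
```

Fed the three hackbench events quoted above (callbacks 0x0, sched_ttwu_pending, then the marker, all with cpu=3), ipi_source() for the last event returns the index of the sched_ttwu_pending entry.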
On 22/03/23 18:22, Peter Zijlstra wrote: > On Wed, Mar 22, 2023 at 05:01:13PM +0000, Valentin Schneider wrote: > >> > So I was thinking something like this: > >> Hm, this does get rid of the func being passed down the helpers, but this >> means the trace events are now stateful, i.e. I need the first and last >> events in a CSD stack to figure out which one actually caused the IPI. > > Isn't much of tracing stateful? I mean, why am I always writing awk > programs to parse trace output? > > The one that is directly followed by > generic_smp_call_function_single_interrupt() (horrible name that), is > the one that tripped the IPI. > Right. >> It also requires whoever is looking at the trace to be aware of which IPIs >> are attached to a CSD, and which ones aren't. ATM that's only the resched >> IPI, but per the cover letter there's more to come (e.g. tick_broadcast() >> for arm64/riscv and a few others). For instance: >> >> hackbench-157 [001] 10.894320: ipi_send_cpu: cpu=3 callsite=check_preempt_curr+0x37 callback=0x0 > > Arguably we should be setting callback to scheduler_ipi(), except > ofcourse, that's not an actual function... > > Maybe we can do "extern inline" for the actual users and provide a dummy > function for the symbol when tracing. > Huh, I wasn't aware that was an option, I'll look into that. I did scribble down a comment next to smp_send_reschedule(), but having a decodable function name would be better! >> hackbench-157 [001] 10.895068: ipi_send_cpu: cpu=3 callsite=try_to_wake_up+0x29e callback=sched_ttwu_pending+0x0 >> hackbench-157 [001] 10.895068: ipi_send_cpu: cpu=3 callsite=try_to_wake_up+0x29e callback=generic_smp_call_function_single_interrupt+0x0 >> >> That first one sent a RESCHEDULE IPI, the second one a CALL_FUNCTION one, >> but you really have to know what you're looking at... > > But you have to know that anyway, you can't do tracing and not know wtf > you're doing. 
Or rather, if you do, I don't give a crap and you can keep > the pieces :-) > > Grepping the callback should be pretty quick resolution at to what trips > it, no? > > (also, if you *realllllly* can't manage, we can always add yet another > argument that gives a type thingy) > Ah, I was a bit unclear here - I don't care too much about the IPI type being used, but rather being able to figure out on IRQ entry where that IPI came from - thinking some more about now, I don't think logging *all* CSDs causes an issue there, as you'd look at the earliest-not-seen-yet event targeting this CPU anyway. That'll be made easy once I get to having cpumask filters for ftrace, so I can just issue something like: trace-cmd record -e 'ipi_send_cpu' -f "cpu == 3" -e 'ipi_send_cpumask' -f "cpus \in {3}" -T hackbench (it's somewhere on the todolist...) TL;DR: I *think* I've convinced myself logging all of them isn't an issue - I'm going to play with this on something "smarter" than just hackbench under QEMU just to drill it in.
On Wed, Mar 22, 2023 at 06:22:28PM +0000, Valentin Schneider wrote:
> On 22/03/23 18:22, Peter Zijlstra wrote:

> >> hackbench-157 [001] 10.894320: ipi_send_cpu: cpu=3 callsite=check_preempt_curr+0x37 callback=0x0
> >
> > Arguably we should be setting callback to scheduler_ipi(), except
> > ofcourse, that's not an actual function...
> >
> > Maybe we can do "extern inline" for the actual users and provide a dummy
> > function for the symbol when tracing.
> >
>
> Huh, I wasn't aware that was an option, I'll look into that. I did scribble
> down a comment next to smp_send_reschedule(), but having a decodable
> function name would be better!

So clang-15 builds the below (and generates the expected code), but
gcc-12 vomits nonsense about a non-static inline calling a static inline
or somesuch bollocks :-/

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1991,7 +1991,7 @@ extern char *__get_task_comm(char *to, s
 })

 #ifdef CONFIG_SMP
-static __always_inline void scheduler_ipi(void)
+extern __always_inline void scheduler_ipi(void)
 {
 	/*
 	 * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -130,9 +130,9 @@ extern void arch_smp_send_reschedule(int
  * scheduler_ipi() is inline so can't be passed as callback reason, but the
  * callsite IP should be sufficient for root-causing IPIs sent from here.
  */
-#define smp_send_reschedule(cpu) ({		  \
-	trace_ipi_send_cpu(cpu, _RET_IP_, NULL);  \
-	arch_smp_send_reschedule(cpu);		  \
+#define smp_send_reschedule(cpu) ({				\
+	trace_ipi_send_cpu(cpu, _RET_IP_, &scheduler_ipi);	\
+	arch_smp_send_reschedule(cpu);				\
 })

 /*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3790,6 +3790,15 @@ static int ttwu_runnable(struct task_str
 }

 #ifdef CONFIG_SMP
+void scheduler_ipi(void)
+{
+	/*
+	 * Actual users should end up using the extern inline, this is only
+	 * here for the symbol.
+	 */
+	BUG();
+}
+
 void sched_ttwu_pending(void *arg)
 {
 	struct llist_node *llist = arg;
On 22/03/23 15:04, Peter Zijlstra wrote:
> @@ -798,14 +794,20 @@ static void smp_call_function_many_cond(
>  		}
>
>  		/*
> +		 * Trace each smp_function_call_*() as an IPI, actual IPIs
> +		 * will be traced with func==generic_smp_call_function_single_ipi().
> +		 */
> +		trace_ipi_send_cpumask(cfd->cpumask_ipi, _RET_IP_, func);

I just got a trace pointing out this can emit an event even though no IPI
is sent if e.g. the cond_func predicate filters all CPUs in the argument
mask:

  ipi_send_cpumask: cpumask= callsite=on_each_cpu_cond_mask+0x3c callback=flush_tlb_func+0x0

Maybe something like so on top?

---
diff --git a/kernel/smp.c b/kernel/smp.c
index ba5478814e677..1dc452017d000 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -791,6 +791,8 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 			}
 		}

+		if (!nr_cpus)
+			goto local;
 		/*
 		 * Trace each smp_function_call_*() as an IPI, actual IPIs
 		 * will be traced with func==generic_smp_call_function_single_ipi().
@@ -804,10 +806,10 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 		 */
 		if (nr_cpus == 1)
 			send_call_function_single_ipi(last_cpu);
-		else if (likely(nr_cpus > 1))
+		else
 			send_call_function_ipi_mask(cfd->cpumask_ipi);
 	}
-
+local:
 	if (run_local && (!cond_func || cond_func(this_cpu, info))) {
 		unsigned long flags;
On Thu, Mar 23, 2023 at 04:25:25PM +0000, Valentin Schneider wrote:
> On 22/03/23 15:04, Peter Zijlstra wrote:
> > @@ -798,14 +794,20 @@ static void smp_call_function_many_cond(
> >  		}
> >
> >  		/*
> > +		 * Trace each smp_function_call_*() as an IPI, actual IPIs
> > +		 * will be traced with func==generic_smp_call_function_single_ipi().
> > +		 */
> > +		trace_ipi_send_cpumask(cfd->cpumask_ipi, _RET_IP_, func);
>
> I just got a trace pointing out this can emit an event even though no IPI
> is sent if e.g. the cond_func predicate filters all CPUs in the argument
> mask:
>
>   ipi_send_cpumask: cpumask= callsite=on_each_cpu_cond_mask+0x3c callback=flush_tlb_func+0x0
>
> Maybe something like so on top?
>
> ---
> diff --git a/kernel/smp.c b/kernel/smp.c
> index ba5478814e677..1dc452017d000 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -791,6 +791,8 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
>  			}
>  		}
>
> +		if (!nr_cpus)
> +			goto local;

Hmm, this isn't right. You can get nr_cpus==0 even though it did add
some to various lists but never was first.

But urgh, even if we were to say count nr_queued we'd never get the mask
right, because we don't track which CPUs have the predicate matched,
only those we need to actually send an IPI to :/

Ooh, I think we can clear those bits from cfd->cpumask, arguably that's
a correctness fix too, because the 'run_remote && wait' case shouldn't
wait on things we didn't queue.

Hmm?


--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -728,9 +728,9 @@ static void smp_call_function_many_cond(
 	int cpu, last_cpu, this_cpu = smp_processor_id();
 	struct call_function_data *cfd;
 	bool wait = scf_flags & SCF_WAIT;
+	int nr_cpus = 0, nr_queued = 0;
 	bool run_remote = false;
 	bool run_local = false;
-	int nr_cpus = 0;

 	lockdep_assert_preemption_disabled();

@@ -772,8 +772,10 @@ static void smp_call_function_many_cond(
 		for_each_cpu(cpu, cfd->cpumask) {
 			call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);

-			if (cond_func && !cond_func(cpu, info))
+			if (cond_func && !cond_func(cpu, info)) {
+				__cpumask_clear_cpu(cpu, cfd->cpumask);
 				continue;
+			}

 			csd_lock(csd);
 			if (wait)
@@ -789,13 +791,15 @@ static void smp_call_function_many_cond(
 				nr_cpus++;
 				last_cpu = cpu;
 			}
+			nr_queued++;
 		}

 		/*
 		 * Trace each smp_function_call_*() as an IPI, actual IPIs
 		 * will be traced with func==generic_smp_call_function_single_ipi().
 		 */
-		trace_ipi_send_cpumask(cfd->cpumask_ipi, _RET_IP_, func);
+		if (nr_queued)
+			trace_ipi_send_cpumask(cfd->cpumask, _RET_IP_, func);

 		/*
 		 * Choose the most efficient way to send an IPI. Note that the
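To make the nr_cpus versus nr_queued distinction concrete, here is a small userspace model of the loop, assuming a plain 32-bit integer in place of struct cpumask. The send_result struct, queue_many() and should_trace() are hypothetical names; the busy argument stands in for "llist_add() returned false because the target's queue already had entries".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 8

/* Hypothetical result of one pass over the requested CPUs. */
struct send_result {
	uint32_t queued_mask;	/* CPUs that actually had a CSD queued */
	int nr_queued;		/* number of CSDs queued */
	int nr_cpus;		/* CPUs whose queue was empty, so need an IPI */
};

/*
 * 'mask' is the requested set of CPUs, 'cond' the CPUs for which the
 * predicate holds, and 'busy' the CPUs whose queue already had entries.
 */
static struct send_result queue_many(uint32_t mask, uint32_t cond, uint32_t busy)
{
	struct send_result res = { .queued_mask = mask };

	assert((mask >> NR_CPUS) == 0);

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		uint32_t bit = UINT32_C(1) << cpu;

		if (!(mask & bit))
			continue;
		if (!(cond & bit)) {
			/* Predicate rejected this CPU: drop it from the mask. */
			res.queued_mask &= ~bit;
			continue;
		}
		if (!(busy & bit))
			res.nr_cpus++;	/* first entry on that queue */
		res.nr_queued++;
	}
	return res;
}

/* Trace when something was queued, regardless of whether an IPI is needed. */
static bool should_trace(const struct send_result *res)
{
	return res->nr_queued != 0;
}
```

With every target already busy, nr_cpus stays 0 while nr_queued and queued_mask are non-zero, which is the case a bare nr_cpus check would wrongly skip. With the predicate rejecting every CPU, both nr_queued and queued_mask end up 0, so no empty-mask event would be emitted.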
On 23/03/23 18:41, Peter Zijlstra wrote: > On Thu, Mar 23, 2023 at 04:25:25PM +0000, Valentin Schneider wrote: >> On 22/03/23 15:04, Peter Zijlstra wrote: >> > @@ -798,14 +794,20 @@ static void smp_call_function_many_cond( >> > } >> > >> > /* >> > + * Trace each smp_function_call_*() as an IPI, actual IPIs >> > + * will be traced with func==generic_smp_call_function_single_ipi(). >> > + */ >> > + trace_ipi_send_cpumask(cfd->cpumask_ipi, _RET_IP_, func); >> >> I just got a trace pointing out this can emit an event even though no IPI >> is sent if e.g. the cond_func predicate filters all CPUs in the argument >> mask: >> >> ipi_send_cpumask: cpumask= callsite=on_each_cpu_cond_mask+0x3c callback=flush_tlb_func+0x0 >> >> Maybe something like so on top? >> >> --- >> diff --git a/kernel/smp.c b/kernel/smp.c >> index ba5478814e677..1dc452017d000 100644 >> --- a/kernel/smp.c >> +++ b/kernel/smp.c >> @@ -791,6 +791,8 @@ static void smp_call_function_many_cond(const struct cpumask *mask, >> } >> } >> >> + if (!nr_cpus) >> + goto local; > > Hmm, this isn't right. You can get nr_cpus==0 even though it did add > some to various lists but never was first. > Duh, glanced over that. > But urgh, even if we were to say count nr_queued we'd never get the mask > right, because we don't track which CPUs have the predicate matched, > only those we need to actually send an IPI to :/ > > Ooh, I think we can clear those bits from cfd->cpumask, arguably that's > a correctness fix too, because the 'run_remote && wait' case shouldn't > wait on things we didn't queue. > Yeah, that makes sense to me. Just one tiny suggestion below. > Hmm? 
> > > --- a/kernel/smp.c > +++ b/kernel/smp.c > @@ -728,9 +728,9 @@ static void smp_call_function_many_cond( > int cpu, last_cpu, this_cpu = smp_processor_id(); > struct call_function_data *cfd; > bool wait = scf_flags & SCF_WAIT; > + int nr_cpus = 0, nr_queued = 0; > bool run_remote = false; > bool run_local = false; > - int nr_cpus = 0; > > lockdep_assert_preemption_disabled(); > > @@ -772,8 +772,10 @@ static void smp_call_function_many_cond( > for_each_cpu(cpu, cfd->cpumask) { > call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu); > > - if (cond_func && !cond_func(cpu, info)) > + if (cond_func && !cond_func(cpu, info)) { > + __cpumask_clear_cpu(cpu, cfd->cpumask); > continue; > + } > > csd_lock(csd); > if (wait) > @@ -789,13 +791,15 @@ static void smp_call_function_many_cond( > nr_cpus++; > last_cpu = cpu; > } > + nr_queued++; > } > > /* > * Trace each smp_function_call_*() as an IPI, actual IPIs > * will be traced with func==generic_smp_call_function_single_ipi(). > */ > - trace_ipi_send_cpumask(cfd->cpumask_ipi, _RET_IP_, func); > + if (nr_queued) With your change to cfd->cpumask, we could ditch nr_queued and make this if (!cpumask_empty(cfd->cpumask)) since cfd->cpumask now only contains CPUs that have had a CSD queued. > + trace_ipi_send_cpumask(cfd->cpumask, _RET_IP_, func); > > /* > * Choose the most efficient way to send an IPI. Note that the
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 85114f75f1c9c..60c79b4e4a5b1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3827,16 +3827,20 @@ void sched_ttwu_pending(void *arg)
 	rq_unlock_irqrestore(rq, &rf);
 }
 
-void send_call_function_single_ipi(int cpu)
+/*
+ * Prepare the scene for sending an IPI for a remote smp_call
+ *
+ * Returns true if the caller can proceed with sending the IPI.
+ * Returns false otherwise.
+ */
+bool call_function_single_prep_ipi(int cpu)
 {
-	struct rq *rq = cpu_rq(cpu);
-
-	if (!set_nr_if_polling(rq->idle)) {
-		trace_ipi_send_cpumask(cpumask_of(cpu), _RET_IP_, NULL);
-		arch_send_call_function_single_ipi(cpu);
-	} else {
+	if (set_nr_if_polling(cpu_rq(cpu)->idle)) {
 		trace_sched_wake_idle_without_ipi(cpu);
+		return false;
 	}
+
+	return true;
 }
 
 /*
diff --git a/kernel/sched/smp.h b/kernel/sched/smp.h
index 2eb23dd0f2856..21ac44428bb02 100644
--- a/kernel/sched/smp.h
+++ b/kernel/sched/smp.h
@@ -6,7 +6,7 @@
 
 extern void sched_ttwu_pending(void *arg);
 
-extern void send_call_function_single_ipi(int cpu);
+extern bool call_function_single_prep_ipi(int cpu);
 
 #ifdef CONFIG_SMP
 extern void flush_smp_call_function_queue(void);
diff --git a/kernel/smp.c b/kernel/smp.c
index 821b5986721ac..5cd680a7e78ef 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -161,9 +161,18 @@ void __init call_function_init(void)
 }
 
 static __always_inline void
-send_call_function_ipi_mask(const struct cpumask *mask)
+send_call_function_single_ipi(int cpu, smp_call_func_t func)
 {
-	trace_ipi_send_cpumask(mask, _RET_IP_, NULL);
+	if (call_function_single_prep_ipi(cpu)) {
+		trace_ipi_send_cpumask(cpumask_of(cpu), _RET_IP_, func);
+		arch_send_call_function_single_ipi(cpu);
+	}
+}
+
+static __always_inline void
+send_call_function_ipi_mask(const struct cpumask *mask, smp_call_func_t func)
+{
+	trace_ipi_send_cpumask(mask, _RET_IP_, func);
 	arch_send_call_function_ipi_mask(mask);
 }
 
@@ -430,12 +439,16 @@ static void __smp_call_single_queue_debug(int cpu, struct llist_node *node)
 	struct cfd_seq_local *seq = this_cpu_ptr(&cfd_seq_local);
 	struct call_function_data *cfd = this_cpu_ptr(&cfd_data);
 	struct cfd_percpu *pcpu = per_cpu_ptr(cfd->pcpu, cpu);
+	struct __call_single_data *csd;
+
+	csd = container_of(node, call_single_data_t, node.llist);
+	WARN_ON_ONCE(!(CSD_TYPE(csd) & (CSD_TYPE_SYNC | CSD_TYPE_ASYNC)));
 
 	cfd_seq_store(pcpu->seq_queue, this_cpu, cpu, CFD_SEQ_QUEUE);
 	if (llist_add(node, &per_cpu(call_single_queue, cpu))) {
 		cfd_seq_store(pcpu->seq_ipi, this_cpu, cpu, CFD_SEQ_IPI);
 		cfd_seq_store(seq->ping, this_cpu, cpu, CFD_SEQ_PING);
-		send_call_function_single_ipi(cpu);
+		send_call_function_single_ipi(cpu, csd->func);
 		cfd_seq_store(seq->pinged, this_cpu, cpu, CFD_SEQ_PINGED);
 	} else {
 		cfd_seq_store(pcpu->seq_noipi, this_cpu, cpu, CFD_SEQ_NOIPI);
@@ -477,6 +490,25 @@ static __always_inline void csd_unlock(struct __call_single_data *csd)
 	smp_store_release(&csd->node.u_flags, 0);
 }
 
+static __always_inline void
+raw_smp_call_single_queue(int cpu, struct llist_node *node, smp_call_func_t func)
+{
+	/*
+	 * The list addition should be visible to the target CPU when it pops
+	 * the head of the list to pull the entry off it in the IPI handler
+	 * because of normal cache coherency rules implied by the underlying
+	 * llist ops.
+	 *
+	 * If IPIs can go out of order to the cache coherency protocol
+	 * in an architecture, sufficient synchronisation should be added
+	 * to arch code to make it appear to obey cache coherency WRT
+	 * locking and barrier primitives. Generic code isn't really
+	 * equipped to do the right thing...
+	 */
+	if (llist_add(node, &per_cpu(call_single_queue, cpu)))
+		send_call_function_single_ipi(cpu, func);
+}
+
 static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);
 
 void __smp_call_single_queue(int cpu, struct llist_node *node)
@@ -493,21 +525,25 @@ void __smp_call_single_queue(int cpu, struct llist_node *node)
 		}
 	}
 #endif
-
 	/*
-	 * The list addition should be visible to the target CPU when it pops
-	 * the head of the list to pull the entry off it in the IPI handler
-	 * because of normal cache coherency rules implied by the underlying
-	 * llist ops.
-	 *
-	 * If IPIs can go out of order to the cache coherency protocol
-	 * in an architecture, sufficient synchronisation should be added
-	 * to arch code to make it appear to obey cache coherency WRT
-	 * locking and barrier primitives. Generic code isn't really
-	 * equipped to do the right thing...
+	 * We have to check the type of the CSD before queueing it, because
+	 * once queued it can have its flags cleared by
+	 * flush_smp_call_function_queue()
+	 * even if we haven't sent the smp_call IPI yet (e.g. the stopper
+	 * executes migration_cpu_stop() on the remote CPU).
 	 */
-	if (llist_add(node, &per_cpu(call_single_queue, cpu)))
-		send_call_function_single_ipi(cpu);
+	if (trace_ipi_send_cpumask_enabled()) {
+		call_single_data_t *csd;
+		smp_call_func_t func;
+
+		csd = container_of(node, call_single_data_t, node.llist);
+		func = CSD_TYPE(csd) == CSD_TYPE_TTWU ?
+			sched_ttwu_pending : csd->func;
+
+		raw_smp_call_single_queue(cpu, node, func);
+	} else {
+		raw_smp_call_single_queue(cpu, node, NULL);
+	}
 }
 
 /*
@@ -976,9 +1012,9 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 		 * provided mask.
 		 */
 		if (nr_cpus == 1)
-			send_call_function_single_ipi(last_cpu);
+			send_call_function_single_ipi(last_cpu, func);
 		else if (likely(nr_cpus > 1))
-			send_call_function_ipi_mask(cfd->cpumask_ipi);
+			send_call_function_ipi_mask(cfd->cpumask_ipi, func);
 
 		cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->pinged, this_cpu, CFD_SEQ_NOCPU, CFD_SEQ_PINGED);
 	}
Context
=======

The newly-introduced ipi_send_cpumask tracepoint has a "callback" parameter
which so far has only been fed with NULL.

While CSD_TYPE_SYNC/ASYNC and CSD_TYPE_IRQ_WORK share a similar backing
struct layout (meaning their callback func can be accessed without caring
about the actual CSD type), CSD_TYPE_TTWU doesn't even have a function
attached to its struct. This means we need to check the type of a CSD
before eventually dereferencing its associated callback.

This isn't as trivial as it sounds: the CSD type is stored in
__call_single_node.u_flags, which get cleared right before the callback is
executed via csd_unlock(). This implies checking the CSD type before it is
enqueued on the call_single_queue, as the target CPU's queue can be flushed
before we get to sending an IPI.

Furthermore, send_call_function_single_ipi() only has a CPU parameter, and
would need to have an additional argument to trickle down the invoked
function. This is somewhat silly, as the extra argument will always be
pushed down to the function even when nothing is being traced, which is
unnecessary overhead.

Changes
=======

send_call_function_single_ipi() is only used by smp.c, and is defined in
sched/core.c as it contains scheduler-specific ops (set_nr_if_polling() of
a CPU's idle task).

Split it into two parts: the scheduler bits remain in sched/core.c, and the
actual IPI emission is moved into smp.c. This lets us define an
__always_inline helper function that can take the related callback as
parameter without creating useless register pressure in the non-traced path
which only gains a (disabled) static branch.

Do the same thing for the multi IPI case.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/sched/core.c | 18 +++++++-----
 kernel/sched/smp.h  |  2 +-
 kernel/smp.c        | 72 +++++++++++++++++++++++++++++++++------------
 3 files changed, 66 insertions(+), 26 deletions(-)
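
The ordering constraint described in the commit message (read the CSD type before queueing, because the flags may be cleared as soon as the entry is visible to the target CPU) can be sketched in userspace as follows. This is a simplified model rather than the kernel code: struct toy_csd, the TOY_CSD_* values and every toy_* function are hypothetical names, and the remote flush is simulated by clearing the type field directly.

```c
#include <stddef.h>

typedef void (*toy_func_t)(void *info);

/* Hypothetical type values; 0 models flags already cleared by the target CPU. */
enum toy_csd_type {
	TOY_CSD_CLEARED = 0,
	TOY_CSD_ASYNC,
	TOY_CSD_SYNC,
	TOY_CSD_IRQ_WORK,
	TOY_CSD_TTWU,
};

struct toy_csd {
	enum toy_csd_type type;	/* stands in for the type bits of u_flags */
	toy_func_t func;	/* not meaningful for TOY_CSD_TTWU entries */
};

/* Stand-in for sched_ttwu_pending(), the callback reported for TTWU entries. */
static void toy_sched_ttwu_pending(void *arg)
{
	(void)arg;
}

/* Example callback (hypothetical) for a regular async entry. */
static void toy_work(void *info)
{
	(void)info;
}

/* Models the target CPU flushing its queue and unlocking the entry. */
static void toy_remote_flush(struct toy_csd *csd)
{
	csd->type = TOY_CSD_CLEARED;
}

/*
 * Models the queueing path. The callback to report is snapshotted before
 * the entry is published; afterwards the type can no longer be trusted.
 * Returns the function that would be handed to the tracepoint, or NULL
 * when tracing is disabled.
 */
static toy_func_t toy_queue(struct toy_csd *csd, int tracing_enabled)
{
	toy_func_t func = NULL;

	if (tracing_enabled) {
		/* csd->func is never read for TTWU entries. */
		func = csd->type == TOY_CSD_TTWU ?
			toy_sched_ttwu_pending : csd->func;
	}

	/* Publish: from here on the remote side may clear the flags. */
	toy_remote_flush(csd);

	return func;
}
```

Had the type been read after the publish step, a TTWU entry would look like a cleared one and the model would read a func field that was never set, which is the hazard the patch avoids by deciding first and queueing second.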