Message ID | 20231222220045.2840714-1-andrew.cooper3@citrix.com (mailing list archive)
---|---
State | Superseded
Series | xen/livepatch: Make check_for_livepatch_work() faster in the common case
On 22.12.2023 23:00, Andrew Cooper wrote:
> When livepatching is enabled, this function is used all the time. Really do
> check the fastpath first, and annotate it likely() as this is the right answer
> 100% of the time (to many significant figures).
>
> This cuts out 3 pointer dereferences in the "nothing to do path", and it seems
> the optimiser has an easier time too. Bloat-o-meter reports:
>
>   add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-57 (-57)
>   Function                         old     new   delta
>   check_for_livepatch_work.cold   1201    1183     -18
>   check_for_livepatch_work        1021     982     -39
>
> which isn't too shabby for no logical change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

> I'm still a little disappointed with the code generation. GCC still chooses
> to set up the full stack frame (6 regs, +3 more slots) intermixed with the
> per-cpu calculations.
>
> In isolation, GCC can check the boolean without creating a stack frame:
>
>   <work_to_to>:
>     48 89 e2                mov    %rsp,%rdx
>     48 8d 05 de e1 37 00    lea    0x37e1de(%rip),%rax   # ffff82d0405b6068 <per_cpu__work_to_do>
>     48 81 ca ff 7f 00 00    or     $0x7fff,%rdx
>     8b 4a c1                mov    -0x3f(%rdx),%ecx
>     48 8d 15 45 aa 39 00    lea    0x39aa45(%rip),%rdx   # ffff82d0405d28e0 <__per_cpu_offset>
>     48 8b 14 ca             mov    (%rdx,%rcx,8),%rdx
>     0f b6 04 02             movzbl (%rdx,%rax,1),%eax
>     c3                      retq
>
> but I can't find a way to convince GCC that it would be worth not setting up a
> stack frame in the common case, and having a few extra mov reg/reg's later in
> the uncommon case.
>
> I haven't tried manually splitting the function into a check() and a do()
> function. Views on whether that might be acceptable? At a guess, do() would
> need to be a static noinline to avoid it turning back into what it currently
> is.

Or maybe move the fast-path check into an inline function, which calls the
(renamed) out-of-line one only when the fast-path check passes? Downside would
be that the per-CPU work_to_do variable then couldn't be static anymore.

Jan
On 04/01/2024 1:24 pm, Jan Beulich wrote:
> On 22.12.2023 23:00, Andrew Cooper wrote:
>> When livepatching is enabled, this function is used all the time. Really do
>> check the fastpath first, and annotate it likely() as this is the right answer
>> 100% of the time (to many significant figures).
>>
>> This cuts out 3 pointer dereferences in the "nothing to do path", and it seems
>> the optimiser has an easier time too. Bloat-o-meter reports:
>>
>>   add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-57 (-57)
>>   Function                         old     new   delta
>>   check_for_livepatch_work.cold   1201    1183     -18
>>   check_for_livepatch_work        1021     982     -39
>>
>> which isn't too shabby for no logical change.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks.

>> I'm still a little disappointed with the code generation. GCC still chooses
>> to set up the full stack frame (6 regs, +3 more slots) intermixed with the
>> per-cpu calculations.
>>
>> In isolation, GCC can check the boolean without creating a stack frame:
>>
>>   <work_to_to>:
>>     48 89 e2                mov    %rsp,%rdx
>>     48 8d 05 de e1 37 00    lea    0x37e1de(%rip),%rax   # ffff82d0405b6068 <per_cpu__work_to_do>
>>     48 81 ca ff 7f 00 00    or     $0x7fff,%rdx
>>     8b 4a c1                mov    -0x3f(%rdx),%ecx
>>     48 8d 15 45 aa 39 00    lea    0x39aa45(%rip),%rdx   # ffff82d0405d28e0 <__per_cpu_offset>
>>     48 8b 14 ca             mov    (%rdx,%rcx,8),%rdx
>>     0f b6 04 02             movzbl (%rdx,%rax,1),%eax
>>     c3                      retq
>>
>> but I can't find a way to convince GCC that it would be worth not setting up a
>> stack frame in the common case, and having a few extra mov reg/reg's later in
>> the uncommon case.
>>
>> I haven't tried manually splitting the function into a check() and a do()
>> function. Views on whether that might be acceptable? At a guess, do() would
>> need to be a static noinline to avoid it turning back into what it currently
>> is.
>
> Or maybe move the fast-path check into an inline function, which calls the
> (renamed) out-of-line one only when the fast-path check passes? Downside
> would be that the per-CPU work_to_do variable then couldn't be static anymore.

We can't do that unfortunately. It's called from assembly in
reset_stack_and_*().

But, I've had another idea. I think attribute cold can help here, as it will
move the majority of the implementation into a separate section.

~Andrew
On Fri, Dec 22, 2023 at 10:01 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> When livepatching is enabled, this function is used all the time. Really do
> check the fastpath first, and annotate it likely() as this is the right answer
> 100% of the time (to many significant figures).
>
> This cuts out 3 pointer dereferences in the "nothing to do path", and it seems
> the optimiser has an easier time too. Bloat-o-meter reports:
>
>   add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-57 (-57)
>   Function                         old     new   delta
>   check_for_livepatch_work.cold   1201    1183     -18
>   check_for_livepatch_work        1021     982     -39
>
> which isn't too shabby for no logical change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
diff --git a/xen/common/livepatch.c b/xen/common/livepatch.c
index 1209fea2566c..b6275339f663 100644
--- a/xen/common/livepatch.c
+++ b/xen/common/livepatch.c
@@ -1706,15 +1706,15 @@ void check_for_livepatch_work(void)
     s_time_t timeout;
     unsigned long flags;
 
+    /* Fast path: no work to do. */
+    if ( likely(!per_cpu(work_to_do, cpu)) )
+        return;
+
     /* Only do any work when invoked in truly idle state. */
     if ( system_state != SYS_STATE_active ||
          !is_idle_domain(current->sched_unit->domain) )
         return;
 
-    /* Fast path: no work to do. */
-    if ( !per_cpu(work_to_do, cpu ) )
-        return;
-
     smp_rmb();
     /* In case we aborted, other CPUs can skip right away. */
     if ( !livepatch_work.do_work )
When livepatching is enabled, this function is used all the time. Really do
check the fastpath first, and annotate it likely() as this is the right answer
100% of the time (to many significant figures).

This cuts out 3 pointer dereferences in the "nothing to do path", and it seems
the optimiser has an easier time too. Bloat-o-meter reports:

  add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-57 (-57)
  Function                         old     new   delta
  check_for_livepatch_work.cold   1201    1183     -18
  check_for_livepatch_work        1021     982     -39

which isn't too shabby for no logical change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: Ross Lagerwall <ross.lagerwall@citrix.com>
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>

I'm still a little disappointed with the code generation. GCC still chooses
to set up the full stack frame (6 regs, +3 more slots) intermixed with the
per-cpu calculations.

In isolation, GCC can check the boolean without creating a stack frame:

  <work_to_to>:
    48 89 e2                mov    %rsp,%rdx
    48 8d 05 de e1 37 00    lea    0x37e1de(%rip),%rax   # ffff82d0405b6068 <per_cpu__work_to_do>
    48 81 ca ff 7f 00 00    or     $0x7fff,%rdx
    8b 4a c1                mov    -0x3f(%rdx),%ecx
    48 8d 15 45 aa 39 00    lea    0x39aa45(%rip),%rdx   # ffff82d0405d28e0 <__per_cpu_offset>
    48 8b 14 ca             mov    (%rdx,%rcx,8),%rdx
    0f b6 04 02             movzbl (%rdx,%rax,1),%eax
    c3                      retq

but I can't find a way to convince GCC that it would be worth not setting up a
stack frame in the common case, and having a few extra mov reg/reg's later in
the uncommon case.

I haven't tried manually splitting the function into a check() and a do()
function. Views on whether that might be acceptable? At a guess, do() would
need to be a static noinline to avoid it turning back into what it currently
is.
---
 xen/common/livepatch.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

base-commit: 49818cde637b5ec20383e46b71f93b2e7d867686