Message ID | d5fd3646-18b3-4dae-8da7-6afa187f930e@suse.com (mailing list archive)
---|---
State | New, archived
Series | x86: increase NMI timer frequency if necessary
On 25/01/2024 4:55 pm, Jan Beulich wrote:
> Since the performance counters used for the NMI watchdog count non-
> halted cycles, they may count at a rate higher than cpu_khz.

Is this in theory, or observed in practice?  It is my understanding that
perf counters count in P0 reference cycles, and not at the Turbo/CBS rate.

> Thus the watchdog tick may occur more frequently than invocations of
> the timer if we don't account for the ratio between nominal and
> maximum CPU clock speeds, which would be a problem in particular when
> "watchdog_timeout=1" is in effect (for high enough ratios even larger
> timeout values may pose a problem).
>
> Leverage the so far display-only data we collect on newer Intel and AMD
> CPUs. On older CPUs we just have to (continue to) hope that the default
> frequency of 1 Hz is okay(-ish) to use.
>
> While adding the new variable, also move the (now adjacent) cpu_khz to
> .data.ro_after_init.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> This renders the "log" in the function names somewhat stale, but I don't
> think this strictly warrants renaming the functions right away.

I'm not comfortable with this change.  It's adding to a complicated
timing problem, rather than simplifying it.

The real problem we've got is that the NMI handler is guessing at the
timeout by counting NMIs, not by actually counting time.  There are
several ways to fix this even with the current rendezvous logic.  When
the NMI handler can actually say "if ( NOW() - last > timeout )", then
the exact frequency of NMIs becomes far less important.

~Andrew
On 19.03.2024 21:51, Andrew Cooper wrote:
> On 25/01/2024 4:55 pm, Jan Beulich wrote:
>> Since the performance counters used for the NMI watchdog count non-
>> halted cycles, they may count at a rate higher than cpu_khz.
>
> Is this in theory, or observed in practice?

It's been over two months since doing the experiments, so I can only go
from memory, but my recollection is that I actually observed higher
rates, just not high enough (yet) for the watchdog (without this change)
to start malfunctioning.

> It is my understanding that perf counters count in P0 reference cycles,
> and not at the Turbo/CBS rate.
>
>> Thus the watchdog tick may occur more frequently than invocations of
>> the timer if we don't account for the ratio between nominal and
>> maximum CPU clock speeds, which would be a problem in particular when
>> "watchdog_timeout=1" is in effect (for high enough ratios even larger
>> timeout values may pose a problem).
>>
>> Leverage the so far display-only data we collect on newer Intel and AMD
>> CPUs. On older CPUs we just have to (continue to) hope that the default
>> frequency of 1 Hz is okay(-ish) to use.
>>
>> While adding the new variable, also move the (now adjacent) cpu_khz to
>> .data.ro_after_init.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> This renders the "log" in the function names somewhat stale, but I don't
>> think this strictly warrants renaming the functions right away.
>
> I'm not comfortable with this change.  It's adding to a complicated
> timing problem, rather than simplifying it.

The actual change to the watchdog logic is minimal - a build-time
constant is replaced by a boot-time determined value.

> The real problem we've got is that the NMI handler is guessing at the
> timeout by counting NMIs, not by actually counting time.  There are
> several ways to fix this even with the current rendezvous logic.  When
> the NMI handler can actually say "if ( NOW() - last > timeout )", then
> the exact frequency of NMIs becomes far less important.

But that would come with its own downsides: The logic within the NMI
handler should be as simple as possible, involving as little as possible
other code.  NOW(), for example, cannot really be used there without
first fiddling with the time rendezvous (to make sure an NMI hitting in
the middle of an update to the scaling values will know how to use
consistent data; that could e.g. be done by flip-flopping between two
instances of the data, with a "selector" always flipped last).

Jan
--- a/xen/arch/x86/cpu/amd.c
+++ b/xen/arch/x86/cpu/amd.c
@@ -657,12 +657,18 @@ void amd_log_freq(const struct cpuinfo_x
 		      : (((v) & 0xff) * 25 * 8) / (((v) >> 8) & 0x3f))
 	if (idx && idx < h &&
 	    !rdmsr_safe(0xC0010064 + idx, val) && (val >> 63) &&
-	    !rdmsr_safe(0xC0010064, hi) && (hi >> 63))
+	    !rdmsr_safe(0xC0010064, hi) && (hi >> 63)) {
+		if (c == &boot_cpu_data)
+			cpu_max_mhz = FREQ(hi);
 		printk("CPU%u: %lu (%lu ... %lu) MHz\n", smp_processor_id(),
 		       FREQ(val), FREQ(lo), FREQ(hi));
-	else if (h && !rdmsr_safe(0xC0010064, hi) && (hi >> 63))
+	}
+	else if (h && !rdmsr_safe(0xC0010064, hi) && (hi >> 63)) {
+		if (c == &boot_cpu_data)
+			cpu_max_mhz = FREQ(hi);
 		printk("CPU%u: %lu ... %lu MHz\n", smp_processor_id(),
 		       FREQ(lo), FREQ(hi));
+	}
 	else
 		printk("CPU%u: %lu MHz\n", smp_processor_id(), FREQ(lo));
 #undef FREQ
--- a/xen/arch/x86/cpu/intel.c
+++ b/xen/arch/x86/cpu/intel.c
@@ -456,7 +456,11 @@ static void intel_log_freq(const struct
         if ( eax )
             printk(" base: %u MHz", eax);
         if ( ebx )
+        {
+            if ( c == &boot_cpu_data )
+                cpu_max_mhz = ebx;
             printk(" max: %u MHz", ebx);
+        }
         printk("\n");
     }
 }
@@ -522,6 +526,8 @@ static void intel_log_freq(const struct
     printk("CPU%u: ", smp_processor_id());
     if ( min_ratio )
         printk("%u ... ", (factor * min_ratio + 50) / 100);
+    if ( c == &boot_cpu_data && !cpu_max_mhz )
+        cpu_max_mhz = (factor * max_ratio + 50) / 100;
     printk("%u MHz\n", (factor * max_ratio + 50) / 100);
 }
--- a/xen/arch/x86/include/asm/time.h
+++ b/xen/arch/x86/include/asm/time.h
@@ -8,6 +8,8 @@ typedef u64 cycles_t;
 
 extern bool disable_tsc_sync;
 
+extern unsigned int cpu_max_mhz;
+
 static inline cycles_t get_cycles(void)
 {
     return rdtsc_ordered();
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -213,10 +213,12 @@ void __init check_nmi_watchdog(void)
     return;
 }
 
+static unsigned int __ro_after_init timer_gap = MILLISECS(1000);
+
 static void cf_check nmi_timer_fn(void *unused)
 {
     this_cpu(nmi_timer_ticks)++;
-    set_timer(&this_cpu(nmi_timer), NOW() + MILLISECS(1000));
+    set_timer(&this_cpu(nmi_timer), NOW() + timer_gap);
 }
 
 void disable_lapic_nmi_watchdog(void)
@@ -477,8 +479,17 @@ bool watchdog_enabled(void)
 
 int __init watchdog_setup(void)
 {
+    unsigned long cpu_mhz = cpu_khz / 1000;
     unsigned int cpu;
 
+    if ( cpu_max_mhz > cpu_mhz )
+    {
+        timer_gap = timer_gap * cpu_mhz / cpu_max_mhz;
+        /* To be on the safe side, bound to 1ms. */
+        if ( timer_gap < MILLISECS(1) )
+            timer_gap = MILLISECS(1);
+    }
+
     /*
      * Activate periodic heartbeats. We cannot do this earlier during
      * setup because the timer infrastructure is not available.
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -47,7 +47,9 @@
 static char __initdata opt_clocksource[10];
 string_param("clocksource", opt_clocksource);
 
-unsigned long __read_mostly cpu_khz; /* CPU clock frequency in kHz. */
+unsigned long __ro_after_init cpu_khz; /* CPU clock frequency in kHz. */
+unsigned int __ro_after_init cpu_max_mhz; /* CPU max (known) clkfreq in MHz. */
+
 DEFINE_SPINLOCK(rtc_lock);
 
 unsigned long pit0_ticks;
Since the performance counters used for the NMI watchdog count non-halted
cycles, they may count at a rate higher than cpu_khz. Thus the watchdog
tick may occur more frequently than invocations of the timer if we don't
account for the ratio between nominal and maximum CPU clock speeds, which
would be a problem in particular when "watchdog_timeout=1" is in effect
(for high enough ratios even larger timeout values may pose a problem).

Leverage the so far display-only data we collect on newer Intel and AMD
CPUs. On older CPUs we just have to (continue to) hope that the default
frequency of 1 Hz is okay(-ish) to use.

While adding the new variable, also move the (now adjacent) cpu_khz to
.data.ro_after_init.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
This renders the "log" in the function names somewhat stale, but I don't
think this strictly warrants renaming the functions right away.