Message ID | Y8WtE7BNJ0gTrqIS@cathedrallabs.org (mailing list archive) |
---|---|
State | New, archived |
Series | [RFC] mce: prevent concurrent polling of MCE events |
> +static DEFINE_RAW_SPINLOCK(timer_fn_lock);
>
>  static unsigned long mce_adjust_timer_default(unsigned long interval)
>  {
> @@ -1628,7 +1629,9 @@ static void mce_timer_fn(struct timer_list *t)
>  	iv = __this_cpu_read(mce_next_interval);
>
>  	if (mce_available(this_cpu_ptr(&cpu_info))) {
> +		raw_spin_lock(&timer_fn_lock);
>  		machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
> +		raw_spin_unlock(&timer_fn_lock);
>
>  		if (mce_intel_cmci_poll()) {
>  			iv = mce_adjust_timer(iv);

If the CMCI polling interrupts on that large number of CPUs are
staggered at different times, then this should be fine. But if a largish
number of CPUs get in lockstep with each other, then this could get
ugly. I.e. 200 CPUs all take the "CMCI poll" interrupt together, then
get stuck in a convoy here as, one at a time, they step through checking
the machine check banks.

I'm not sure if that is just a theoretical issue, or how bad it might
actually be if it happened.

One option to avoid this might be to change from a fixed five-minute
polling interval to "five minutes plus/minus rand(50) jiffies". Then even
if some CPUs did sync up, they'd quickly diverge again.

-Tony
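A minimal sketch of what such a randomized interval could look like, applied
wherever the next poll interval is computed; mce_jitter_interval() is a
hypothetical helper, not part of the posted patch:

#include <linux/random.h>

/*
 * Hypothetical sketch only: skew the polling interval by up to +/- 50
 * jiffies so CPUs that happen to fire together drift apart again.
 */
static unsigned long mce_jitter_interval(unsigned long iv)
{
	/* get_random_u32() % 101 yields 0..100; shift it to -50..+50 */
	long skew = (long)(get_random_u32() % 101) - 50;

	/* keep the interval from collapsing to zero for tiny values */
	if ((long)iv + skew < 1)
		return 1;

	return iv + skew;
}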
On Tue, Jan 17, 2023 at 05:42:36PM +0000, Luck, Tony wrote:
> >
> >  		if (mce_intel_cmci_poll()) {
> >  			iv = mce_adjust_timer(iv);
>
> If the CMCI polling interrupts on that large number of CPUs are
> staggered at different times, then this should be fine. But if a largish
> number of CPUs get in lockstep with each other, then this could get
> ugly. I.e. 200 CPUs all take the "CMCI poll" interrupt together, then
> get stuck in a convoy here as, one at a time, they step through checking
> the machine check banks.

Yes, but we could change the patch to include mce_available() inside the
protection of the lock. It should be cleared once machine_check_poll()
clears the bank state, no?

> One option to avoid this might be to change from a fixed five-minute
> polling interval to "five minutes plus/minus rand(50) jiffies". Then even
> if some CPUs did sync up, they'd quickly diverge again.

That'd be fine for me as well. Perhaps initialize the polling interval
using CPU numbers to space them out, then use randomness for rescheduling?
> Yes, but we could change the patch to include mce_available() inside the
> protection of the lock. It should be cleared once machine_check_poll()
> clears the bank state, no?

Which machines are showing this problem? Most modern systems support
CMCI. So I'm thinking that this case shows up because the sysadmin booted
with "mce=no_cmci". In that case I don't think the mce_available() check
would change anything.

-Tony
On Tue, Jan 17, 2023 at 06:54:02PM +0000, Luck, Tony wrote:
> > Yes, but we could change the patch to include mce_available() inside the
> > protection of the lock. It should be cleared once machine_check_poll()
> > clears the bank state, no?
>
> Which machines are showing this problem? Most modern systems support
> CMCI. So I'm thinking that this case shows up because the sysadmin booted
> with "mce=no_cmci". In that case I don't think the mce_available() check
> would change anything.

That is correct, ignore what I said.
Hi Tony,

On Tue, Jan 17, 2023 at 05:42:36PM +0000, Luck, Tony wrote:
> > +static DEFINE_RAW_SPINLOCK(timer_fn_lock);
> >
> >  static unsigned long mce_adjust_timer_default(unsigned long interval)
> >  {
> > @@ -1628,7 +1629,9 @@ static void mce_timer_fn(struct timer_list *t)
> >  	iv = __this_cpu_read(mce_next_interval);
> >
> >  	if (mce_available(this_cpu_ptr(&cpu_info))) {
> > +		raw_spin_lock(&timer_fn_lock);
> >  		machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
> > +		raw_spin_unlock(&timer_fn_lock);
> >
> >  		if (mce_intel_cmci_poll()) {
> >  			iv = mce_adjust_timer(iv);
>
> If the CMCI polling interrupts on that large number of CPUs are
> staggered at different times, then this should be fine. But if a largish
> number of CPUs get in lockstep with each other, then this could get
> ugly. I.e. 200 CPUs all take the "CMCI poll" interrupt together, then
> get stuck in a convoy here as, one at a time, they step through checking
> the machine check banks.
>
> I'm not sure if that is just a theoretical issue, or how bad it might
> actually be if it happened.
>
> One option to avoid this might be to change from a fixed five-minute
> polling interval to "five minutes plus/minus rand(50) jiffies". Then even
> if some CPUs did sync up, they'd quickly diverge again.

I did experiment with different ranges, including forcing a minimum
difference based on CPU number, but eventually I'd still see repeated
events. Perhaps a combination would be acceptable? Space the first run of
each CPU based on its CPU number, then use a spinlock to serialize them.
That way we minimize the chance of a big number of CPUs getting stuck on
the same lock, but at the same time guarantee we won't have duplicated
events.
> > One option to avoid this might be to change from a fixed five-minute
> > polling interval to "five minutes plus/minus rand(50) jiffies". Then even
> > if some CPUs did sync up, they'd quickly diverge again.
>
> I did experiment with different ranges, including forcing a minimum
> difference based on CPU number, but eventually I'd still see repeated
> events. Perhaps a combination would be acceptable? Space the first run of
> each CPU based on its CPU number, then use a spinlock to serialize them.
> That way we minimize the chance of a big number of CPUs getting stuck on
> the same lock, but at the same time guarantee we won't have duplicated
> events.

Aristeu,

Yes. A variable poll interval (to avoid CPUs all bunching together) with
the spinlock to serialize the cases where some CPUs do line up their
polling would work.

-Tony
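A rough sketch of that combined idea (stagger each CPU's first expiry by its
CPU number, keep the spinlock from the patch for timers that still collide);
mce_start_timer_staggered() is a hypothetical helper, not what was posted:

/*
 * Hypothetical sketch only: offset each CPU's first poll by a few
 * jiffies per CPU number. The raw spinlock around machine_check_poll()
 * from the posted patch still covers any timers that line up anyway.
 */
static void mce_start_timer_staggered(struct timer_list *t)
{
	unsigned long iv = __this_cpu_read(mce_next_interval);

	/* 3 jiffies per CPU, mirroring the skew used by round_jiffies() */
	mod_timer(t, jiffies + iv + raw_smp_processor_id() * 3);
}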
On Fri, Apr 28, 2023 at 04:43:46PM +0000, Luck, Tony wrote:
> > > One option to avoid this might be to change from a fixed five-minute
> > > polling interval to "five minutes plus/minus rand(50) jiffies". Then even
> > > if some CPUs did sync up, they'd quickly diverge again.
> >
> > I did experiment with different ranges, including forcing a minimum
> > difference based on CPU number, but eventually I'd still see repeated
> > events. Perhaps a combination would be acceptable? Space the first run of
> > each CPU based on its CPU number, then use a spinlock to serialize them.
> > That way we minimize the chance of a big number of CPUs getting stuck on
> > the same lock, but at the same time guarantee we won't have duplicated
> > events.
>
> Aristeu,
>
> Yes. A variable poll interval (to avoid CPUs all bunching together) with
> the spinlock to serialize the cases where some CPUs do line up their
> polling would work.

I just ran a test on a system booted with mce=no_cmci and saw the
duplicate reporting. Digging into the code I see that I'm not the
first to think that a little randomness in timers on different CPUs
would be a good idea. The MCE code starts/restarts timers with:

	mod_timer(t, round_jiffies(when));

Looking at the round_jiffies() function I saw that it is grabbing
the current CPU number:

	unsigned long round_jiffies(unsigned long j)
	{
		return round_jiffies_common(j, raw_smp_processor_id(), false);
	}

And round_jiffies_common() explains with this comment:

	/*
	 * We don't want all cpus firing their timers at once hitting the
	 * same lock or cachelines, so we skew each extra cpu with an extra
	 * 3 jiffies. This 3 jiffies came originally from the mm/ code which
	 * already did this.
	 * The skew is done by adding 3*cpunr, then round, then subtract this
	 * extra offset again.
	 */
	j += cpu * 3;

	rem = j % HZ;

So your original patch
https://lore.kernel.org/all/Y8WtE7BNJ0gTrqIS@cathedrallabs.org/
that just added the spinlock should be fine!

-Tony
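To make that skew concrete, here is a small self-contained simulation of the
rounding logic (assuming HZ=1000 and ignoring the force_up and already-expired
handling in the real round_jiffies_common()):

#include <stdio.h>

#define HZ 1000

/*
 * Simplified model of round_jiffies_common(): add 3 jiffies of skew per
 * CPU, round to a whole second, then remove the skew again.
 */
static unsigned long rounded(unsigned long j, int cpu)
{
	unsigned long rem;

	j += cpu * 3;
	rem = j % HZ;
	if (rem < HZ / 4)
		j -= rem;		/* round down */
	else
		j += HZ - rem;		/* round up */
	return j - cpu * 3;
}

int main(void)
{
	/* Every CPU asks for the same expiry but ends up 3 jiffies apart. */
	for (int cpu = 0; cpu < 4; cpu++)
		printf("cpu %d fires at jiffy %lu\n", cpu, rounded(5000, cpu));

	return 0;
}

With the same requested expiry, consecutive CPUs come out 3 jiffies apart
(5000, 4997, 4994, 4991, ...), which is the built-in spreading Tony points to.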
On Thu, May 11, 2023 at 04:38:57PM -0700, Tony Luck wrote:
> I just ran a test on a system booted with mce=no_cmci and saw the
> duplicate reporting. Digging into the code I see that I'm not the
> first to think that a little randomness in timers on different CPUs
> would be a good idea. The MCE code starts/restarts timers with:
>
> 	mod_timer(t, round_jiffies(when));
>
> Looking at the round_jiffies() function I saw that it is grabbing
> the current CPU number:
>
> 	unsigned long round_jiffies(unsigned long j)
> 	{
> 		return round_jiffies_common(j, raw_smp_processor_id(), false);
> 	}
>
> And round_jiffies_common() explains with this comment:
>
> 	/*
> 	 * We don't want all cpus firing their timers at once hitting the
> 	 * same lock or cachelines, so we skew each extra cpu with an extra
> 	 * 3 jiffies. This 3 jiffies came originally from the mm/ code which
> 	 * already did this.
> 	 * The skew is done by adding 3*cpunr, then round, then subtract this
> 	 * extra offset again.
> 	 */
> 	j += cpu * 3;
>
> 	rem = j % HZ;
>
> So your original patch
> https://lore.kernel.org/all/Y8WtE7BNJ0gTrqIS@cathedrallabs.org/
> that just added the spinlock should be fine!

Thanks, Tony. Do you want me to resend it, or is the original enough?
> Thanks, Tony. Do you want me to resend it, or is the original enough?

The original was marked RFC, so send a new version without the RFC tag.
You can include:

Reviewed-by: Tony Luck <tony.luck@intel.com>

-Tony
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1597,6 +1597,7 @@ static unsigned long check_interval = INITIAL_CHECK_INTERVAL;
 
 static DEFINE_PER_CPU(unsigned long, mce_next_interval); /* in jiffies */
 static DEFINE_PER_CPU(struct timer_list, mce_timer);
+static DEFINE_RAW_SPINLOCK(timer_fn_lock);
 
 static unsigned long mce_adjust_timer_default(unsigned long interval)
 {
@@ -1628,7 +1629,9 @@ static void mce_timer_fn(struct timer_list *t)
 	iv = __this_cpu_read(mce_next_interval);
 
 	if (mce_available(this_cpu_ptr(&cpu_info))) {
+		raw_spin_lock(&timer_fn_lock);
 		machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
+		raw_spin_unlock(&timer_fn_lock);
 
 		if (mce_intel_cmci_poll()) {
 			iv = mce_adjust_timer(iv);
I've considered creating an array so there'd be one lock per package, but
that would add excessive complexity for something that by default happens
every 5 minutes. Thoughts?

---------

Error injection on modern HP machines with CMCI disabled will cause the
injected MCE to be found only by polling. Because these newer machines
have a large number of CPUs per package, it becomes a lot more likely for
multiple CPUs to poll the IMC registers (which are shared within the same
package) at the same time, causing multiple reports of the same MCE.

Signed-off-by: Aristeu Rozanski <aris@ruivo.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: linux-edac@vger.kernel.org
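For reference, a rough sketch of the per-package variant that was considered
and rejected; the names and the allocation strategy are hypothetical, and
none of this is in the posted patch:

/*
 * Hypothetical per-package variant (rejected as overkill): one lock per
 * physical package instead of a single global lock. The array would be
 * sized and allocated at boot (allocation not shown here).
 */
static raw_spinlock_t *mce_pkg_locks;

static void mce_poll_serialized(void)
{
	int pkg = topology_physical_package_id(smp_processor_id());

	raw_spin_lock(&mce_pkg_locks[pkg]);
	machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
	raw_spin_unlock(&mce_pkg_locks[pkg]);
}

The single global lock in the posted patch avoids sizing and allocating that
array, at the cost of occasional cross-package contention on a path that runs
only once every five minutes per CPU.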