Message ID | 20240528063836.5248-1-TonyWWang-oc@zhaoxin.com (mailing list archive)
State      | Handled Elsewhere
Series     | x86/hpet: Read HPET directly if panic in progress
On 5/27/24 23:38, Tony W Wang-oc wrote:
...
> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
> index c96ae8fee95e..ecadd0698d6a 100644
> --- a/arch/x86/kernel/hpet.c
> +++ b/arch/x86/kernel/hpet.c
> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>  	if (in_nmi())
>  		return (u64)hpet_readl(HPET_COUNTER);
>
> +	/*
> +	 * Read HPET directly if panic in progress.
> +	 */
> +	if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
> +		return (u64)hpet_readl(HPET_COUNTER);
> +

There is literally one other piece of the code in the kernel doing
something similar: the printk() implementation.  There's no other
clocksource or timekeeping code that does this on any architecture.

Why doesn't this problem apply to any other clock sources?

Why should the problem be fixed in the clock sources themselves?  Why
doesn't printk() deadlock on systems using the HPET?

In other words, I think we should fix pstore to be more like printk
rather than hacking around this in each clock source.
On Tue, May 28 2024 at 07:18, Dave Hansen wrote:
> On 5/27/24 23:38, Tony W Wang-oc wrote:
> ...
>> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
>> index c96ae8fee95e..ecadd0698d6a 100644
>> --- a/arch/x86/kernel/hpet.c
>> +++ b/arch/x86/kernel/hpet.c
>> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>>  	if (in_nmi())
>>  		return (u64)hpet_readl(HPET_COUNTER);
>>
>> +	/*
>> +	 * Read HPET directly if panic in progress.
>> +	 */
>> +	if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
>> +		return (u64)hpet_readl(HPET_COUNTER);
>> +
>
> There is literally one other piece of the code in the kernel doing
> something similar: the printk() implementation.  There's no other
> clocksource or timekeeping code that does this on any architecture.
>
> Why doesn't this problem apply to any other clock sources?

In principle it applies to any clocksource which needs a spinlock to
serialize access.  HPET is not the only insanity here.

Think about i8253 :)

Most real clocksources, like TSC and the majority of the preferred clock
sources on other architectures are perfectly fine.  They just read and
be done.

> Why should the problem be fixed in the clock sources themselves?  Why
> doesn't printk() deadlock on systems using the HPET?

Because regular printk()s are deferred to irq work when in NMI and
similar contexts, but that obviously does not apply to panic situations.
Also NMI is treated special even in the HPET code.  See below.

> In other words, I think we should fix pstore to be more like printk
> rather than hacking around this in each clock source.

pstore is perfectly fine.  It uses a NMI safe time accessor function
which is then tripping over the HPET lock.  That's really a HPET
specific problem.

Though what I read out of the changelog is that the MCE hits the same
CPU 'x' which holds the lock.  But that's fairy tale material as you can
see in the patch above:

	if (in_nmi())
		return (u64)hpet_readl(HPET_COUNTER);

For that particular case the dead lock, which would actually be a live
lock, cannot happen because in kernel MCEs are NMI class exceptions and
therefore in_nmi() evaluates to true and that new voodoo can't be
reached at all.

Now there are two other scenarios which really can make that happen:

 1) A non-NMI class exception within the lock held region

    CPU A
    acquire(hpet_lock);
    ... <- #PF, #GP, #DE, ... -> panic()

    If any of that happens within that lock held section then the live
    lock on the hpet_lock is the least of your worries.  Seriously, I
    don't care about this at all.

 2) The actual scenario is:

    CPU A                           CPU B
    lock(hpet_lock)
                                    MCE hits user space
                                    ...
                                    exc_machine_check_user()
                                      irqentry_enter_from_user_mode(regs);

    irqentry_enter_from_user_mode() obviously does not mark the
    exception as NMI class, so in_nmi() evaluates to false.  That would
    actually dead lock if CPU A is not making progress and releases
    hpet_lock.

    Sounds unlikely to happen, right?  But in reality it can because of
    MCE broadcast.  Assume that both CPUs go into MCE:

    CPU A                           CPU B
    lock(hpet_lock)
                                    exc_machine_check_user()
                                      irqentry_enter_from_user_mode();
    exc_machine_check_kernel()        do_machine_check()
      irqentry_nmi_enter();             mce_panic()
      do_machine_check()                  if (atomic_inc_return(&mce_panicked) > 1)
        mce_panic()                           wait_for_panic();  <- Not taken
          if (atomic_inc_return(&mce_panicked) > 1)
              wait_for_panic();  <- Taken
                                          ....
                                          hpet_read()

    -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
       obviously can't release the lock.

So the proposed patch makes sense to some extent.  But it only cures the
symptom.  The real underlying questions are:

 1) Should we provide a panic mode read callback for clocksources which
    are affected by this?

 2) Is it correct to claim that a MCE which hits user space and ends up
    in mce_panic() is still just a regular exception or should we
    upgrade to NMI class context when we enter mce_panic() or even go as
    far to upgrade to NMI class context for any panic() invocation?

#1 Solves it at the clocksource level.  It still needs HPET specific
   changes.

#2 Solves a whole class of issues

   ... while potentially introducing new ones :)

To me upgrading any panic() invocation to NMI class context makes a lot
of sense because in that case all bets are off.

in_nmi() is used in quite some places to avoid such problems.  IOW, that
would kill a whole class of issues instead of "curing" the HPET problem
locally for the price of an extra conditional.  Not that the extra
conditional matters much if HPET is the clocksource as that's awfully
slow anyway and I really don't care about that.

But I very much care about avoiding to sprinkle panic_cpu checks all
over the place.

Thanks,

        tglx
On Tue, 28 May 2024 at 15:12, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> In principle it applies to any clocksource which needs a spinlock to
> serialize access.  HPET is not the only insanity here.

HPET may be the main / only one we care about.  Because:

> Think about i8253 :)

I see the smiley, but yeah, I don't think we really care about it.

> 1) Should we provide a panic mode read callback for clocksources which
>    are affected by this?

The current patch under discussion may be ugly, but looks workable.
Local ugliness isn't necessarily a show-stopper.

So if the HPET is the *only* case which has this situation, I vote for
just doing the ugly thing.

Now, if *other* cases exist, and can't be worked around in similar ways,
then that argues for a more "proper" fix.

And no, I don't think i8253 is a strong enough argument.  I don't
actually believe you can realistically find a machine that doesn't have
HPET or the TSC and really falls back on the i8253 any more.  And if you
*do* find hw like that, is it SMP-capable?  And can you find somebody
who cares?

> 2) Is it correct to claim that a MCE which hits user space and ends up
>    in mce_panic() is still just a regular exception or should we
>    upgrade to NMI class context when we enter mce_panic() or even go as
>    far to upgrade to NMI class context for any panic() invocation?

I do think that an NMI in user space should be considered mostly just a
normal exception.  From a kernel perspective, the NMI'ness just doesn't
matter.

That said, I find your suggestion of making 'panic()' just basically act
as an NMI context intriguing.  And cleaner than the
atomic_read(&panic_cpu) thing.

Are there any other situations than this odd HPET thing where that would
change semantics?

            Linus
On 2024/5/29 06:12, Thomas Gleixner wrote:
> On Tue, May 28 2024 at 07:18, Dave Hansen wrote:
>> On 5/27/24 23:38, Tony W Wang-oc wrote:
>> ...
>>> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
>>> index c96ae8fee95e..ecadd0698d6a 100644
>>> --- a/arch/x86/kernel/hpet.c
>>> +++ b/arch/x86/kernel/hpet.c
>>> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>>>  	if (in_nmi())
>>>  		return (u64)hpet_readl(HPET_COUNTER);
>>>
>>> +	/*
>>> +	 * Read HPET directly if panic in progress.
>>> +	 */
>>> +	if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
>>> +		return (u64)hpet_readl(HPET_COUNTER);
>>> +
>>
>> There is literally one other piece of the code in the kernel doing
>> something similar: the printk() implementation. There's no other
>> clocksource or timekeeping code that does this on any architecture.
>>
>> Why doesn't this problem apply to any other clock sources?
>
> In principle it applies to any clocksource which needs a spinlock to
> serialize access. HPET is not the only insanity here.
>
> Think about i8253 :)
>
> Most real clocksources, like TSC and the majority of the preferred clock
> sources on other architectures are perfectly fine. They just read and be
> done.
>
>> Why should the problem be fixed in the clock sources themselves? Why
>> doesn't printk() deadlock on systems using the HPET?
>
> Because regular printk()s are deferred to irq work when in NMI and
> similar contexts, but that obviously does not apply to panic
> situations. Also NMI is treated special even in the HPET code. See
> below.
>
>> In other words, I think we should fix pstore to be more like printk
>> rather than hacking around this in each clock source.
>
> pstore is perfectly fine. It uses a NMI safe time accessor function
> which is then tripping over the HPET lock. That's really a HPET specific
> problem.
>
> Though what I read out of the changelog is that the MCE hits the same
> CPU 'x' which holds the lock. But that's fairy tale material as you can
> see in the patch above:
>
>	if (in_nmi())
>		return (u64)hpet_readl(HPET_COUNTER);
>
> For that particular case the dead lock, which would actually be a live
> lock, cannot happen because in kernel MCEs are NMI class exceptions and
> therefore in_nmi() evaluates to true and that new voodoo can't be
> reached at all.
>
> Now there are two other scenarios which really can make that happen:
>
>  1) A non-NMI class exception within the lock held region
>
>     CPU A
>     acquire(hpet_lock);
>     ... <- #PF, #GP, #DE, ... -> panic()
>
>     If any of that happens within that lock held section then the live
>     lock on the hpet_lock is the least of your worries. Seriously, I
>     don't care about this at all.
>
>  2) The actual scenario is:
>
>     CPU A                           CPU B
>     lock(hpet_lock)
>                                     MCE hits user space
>                                     ...
>                                     exc_machine_check_user()
>                                       irqentry_enter_from_user_mode(regs);
>
>     irqentry_enter_from_user_mode() obviously does not mark the
>     exception as NMI class, so in_nmi() evaluates to false. That would
>     actually dead lock if CPU A is not making progress and releases
>     hpet_lock.
>
>     Sounds unlikely to happen, right? But in reality it can because of
>     MCE broadcast. Assume that both CPUs go into MCE:
>
>     CPU A                           CPU B
>     lock(hpet_lock)
>                                     exc_machine_check_user()
>                                       irqentry_enter_from_user_mode();
>     exc_machine_check_kernel()        do_machine_check()
>       irqentry_nmi_enter();             mce_panic()
>       do_machine_check()                  if (atomic_inc_return(&mce_panicked) > 1)
>         mce_panic()                           wait_for_panic(); <- Not taken
>           if (atomic_inc_return(&mce_panicked) > 1)
>               wait_for_panic(); <- Taken
>                                           ....
>                                           hpet_read()
>
>     -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
>        obviously can't release the lock.
>

Because the MCE handler calls printk() before calling panic(), a
printk() deadlock may happen in this scenario:

    CPU A                           CPU B
    printk()
      lock(console_owner_lock)
                                    exc_machine_check_user()
                                      irqentry_enter_from_user_mode()
    exc_machine_check_kernel()        do_machine_check()
      irqentry_nmi_enter()              mce_panic()
      do_machine_check()                  printk_mce()  #A
        mce_panic()                       ...
          wait_for_panic()                panic()

The printk deadlock will happen at #A because in_nmi() evaluates to
false on CPU B and CPU B has not entered panic() yet at #A.

Would upgrading the user space MCE handler to NMI class context be
preferred?

Sincerely
TonyWWang-oc

> So the proposed patch makes sense to some extent. But it only cures the
> symptom. The real underlying questions are:
>
>  1) Should we provide a panic mode read callback for clocksources which
>     are affected by this?
>
>  2) Is it correct to claim that a MCE which hits user space and ends up
>     in mce_panic() is still just a regular exception or should we
>     upgrade to NMI class context when we enter mce_panic() or even go
>     as far to upgrade to NMI class context for any panic() invocation?
>
> #1 Solves it at the clocksource level. It still needs HPET specific
>    changes.
>
> #2 Solves a whole class of issues
>
>    ... while potentially introducing new ones :)
>
> To me upgrading any panic() invocation to NMI class context makes a
> lot of sense because in that case all bets are off.
>
> in_nmi() is used in quite some places to avoid such problems. IOW,
> that would kill a whole class of issues instead of "curing" the HPET
> problem locally for the price of an extra conditional. Not that the
> extra conditional matters much if HPET is the clocksource as that's
> awfully slow anyway and I really don't care about that.
>
> But I very much care about avoiding to sprinkle panic_cpu checks all
> over the place.
>
> Thanks,
>
>         tglx
On 2024/5/29 12:39, Tony W Wang-oc wrote:
> On 2024/5/29 06:12, Thomas Gleixner wrote:
>> On Tue, May 28 2024 at 07:18, Dave Hansen wrote:
>>> On 5/27/24 23:38, Tony W Wang-oc wrote:
>>> ...
>>>> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
>>>> index c96ae8fee95e..ecadd0698d6a 100644
>>>> --- a/arch/x86/kernel/hpet.c
>>>> +++ b/arch/x86/kernel/hpet.c
>>>> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>>>>  	if (in_nmi())
>>>>  		return (u64)hpet_readl(HPET_COUNTER);
>>>>
>>>> +	/*
>>>> +	 * Read HPET directly if panic in progress.
>>>> +	 */
>>>> +	if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
>>>> +		return (u64)hpet_readl(HPET_COUNTER);
>>>> +
>>>
>>> There is literally one other piece of the code in the kernel doing
>>> something similar: the printk() implementation. There's no other
>>> clocksource or timekeeping code that does this on any architecture.
>>>
>>> Why doesn't this problem apply to any other clock sources?
>>
>> In principle it applies to any clocksource which needs a spinlock to
>> serialize access. HPET is not the only insanity here.
>>
>> Think about i8253 :)
>>
>> Most real clocksources, like TSC and the majority of the preferred clock
>> sources on other architectures are perfectly fine. They just read and be
>> done.
>>
>>> Why should the problem be fixed in the clock sources themselves? Why
>>> doesn't printk() deadlock on systems using the HPET?
>>
>> Because regular printk()s are deferred to irq work when in NMI and
>> similar contexts, but that obviously does not apply to panic
>> situations. Also NMI is treated special even in the HPET code. See
>> below.
>>
>>> In other words, I think we should fix pstore to be more like printk
>>> rather than hacking around this in each clock source.
>>
>> pstore is perfectly fine. It uses a NMI safe time accessor function
>> which is then tripping over the HPET lock. That's really a HPET specific
>> problem.
>>
>> Though what I read out of the changelog is that the MCE hits the same
>> CPU 'x' which holds the lock. But that's fairy tale material as you can
>> see in the patch above:
>>
>>	if (in_nmi())
>>		return (u64)hpet_readl(HPET_COUNTER);
>>
>> For that particular case the dead lock, which would actually be a live
>> lock, cannot happen because in kernel MCEs are NMI class exceptions and
>> therefore in_nmi() evaluates to true and that new voodoo can't be
>> reached at all.
>>
>> Now there are two other scenarios which really can make that happen:
>>
>>  1) A non-NMI class exception within the lock held region
>>
>>     CPU A
>>     acquire(hpet_lock);
>>     ... <- #PF, #GP, #DE, ... -> panic()
>>
>>     If any of that happens within that lock held section then the live
>>     lock on the hpet_lock is the least of your worries. Seriously, I
>>     don't care about this at all.
>>

Actually, this scenario is what this patch is trying to solve.

We encountered a hpet_lock deadlock from the call path of the MCE
handler, and this hpet_lock deadlock scenario may happen when other
exception handlers like #PF/#GP/... call panic().  So read_hpet() should
avoid the deadlock if a panic is in progress.

Sincerely
TonyWWang-oc

>>  2) The actual scenario is:
>>
>>     CPU A                           CPU B
>>     lock(hpet_lock)
>>                                     MCE hits user space
>>                                     ...
>>                                     exc_machine_check_user()
>>                                       irqentry_enter_from_user_mode(regs);
>>
>>     irqentry_enter_from_user_mode() obviously does not mark the
>>     exception as NMI class, so in_nmi() evaluates to false. That would
>>     actually dead lock if CPU A is not making progress and releases
>>     hpet_lock.
>>
>>     Sounds unlikely to happen, right? But in reality it can because of
>>     MCE broadcast. Assume that both CPUs go into MCE:
>>
>>     CPU A                           CPU B
>>     lock(hpet_lock)
>>                                     exc_machine_check_user()
>>                                       irqentry_enter_from_user_mode();
>>     exc_machine_check_kernel()        do_machine_check()
>>       irqentry_nmi_enter();             mce_panic()
>>       do_machine_check()                  if (atomic_inc_return(&mce_panicked) > 1)
>>         mce_panic()                           wait_for_panic(); <- Not taken
>>           if (atomic_inc_return(&mce_panicked) > 1)
>>               wait_for_panic(); <- Taken
>>                                           ....
>>                                           hpet_read()
>>
>>     -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
>>        obviously can't release the lock.
>>
>
> Because the MCE handler calls printk() before calling panic(), a
> printk() deadlock may happen in this scenario:
>
>     CPU A                           CPU B
>     printk()
>       lock(console_owner_lock)
>                                     exc_machine_check_user()
>                                       irqentry_enter_from_user_mode()
>     exc_machine_check_kernel()        do_machine_check()
>       irqentry_nmi_enter()              mce_panic()
>       do_machine_check()                  printk_mce()  #A
>         mce_panic()                       ...
>           wait_for_panic()                panic()
>
> The printk deadlock will happen at #A because in_nmi() evaluates to
> false on CPU B and CPU B has not entered panic() yet at #A.
>
> Would upgrading the user space MCE handler to NMI class context be
> preferred?
>
> Sincerely
> TonyWWang-oc
>
>> So the proposed patch makes sense to some extent. But it only cures the
>> symptom. The real underlying questions are:
>>
>>  1) Should we provide a panic mode read callback for clocksources which
>>     are affected by this?
>>
>>  2) Is it correct to claim that a MCE which hits user space and ends
>>     up in mce_panic() is still just a regular exception or should we
>>     upgrade to NMI class context when we enter mce_panic() or even go
>>     as far to upgrade to NMI class context for any panic() invocation?
>>
>> #1 Solves it at the clocksource level. It still needs HPET specific
>>    changes.
>>
>> #2 Solves a whole class of issues
>>
>>    ... while potentially introducing new ones :)
>>
>> To me upgrading any panic() invocation to NMI class context makes a
>> lot of sense because in that case all bets are off.
>>
>> in_nmi() is used in quite some places to avoid such problems. IOW,
>> that would kill a whole class of issues instead of "curing" the HPET
>> problem locally for the price of an extra conditional. Not that the
>> extra conditional matters much if HPET is the clocksource as that's
>> awfully slow anyway and I really don't care about that.
>>
>> But I very much care about avoiding to sprinkle panic_cpu checks all
>> over the place.
>>
>> Thanks,
>>
>>         tglx
Linus!

On Tue, May 28 2024 at 16:22, Linus Torvalds wrote:
> On Tue, 28 May 2024 at 15:12, Thomas Gleixner <tglx@linutronix.de> wrote:
> I see the smiley, but yeah, I don't think we really care about it.

Indeed.  But the same problem exists on other architectures as well.
drivers/clocksource alone has 4 examples aside from i8253.

>> 1) Should we provide a panic mode read callback for clocksources which
>>    are affected by this?
>
> The current patch under discussion may be ugly, but looks workable.
> Local ugliness isn't necessarily a show-stopper.
>
> So if the HPET is the *only* case which has this situation, I vote for
> just doing the ugly thing.
>
> Now, if *other* cases exist, and can't be worked around in similar
> ways, then that argues for a more "proper" fix.
>
> And no, I don't think i8253 is a strong enough argument. I don't
> actually believe you can realistically find a machine that doesn't
> have HPET or the TSC and really falls back on the i8253 any more. And
> if you *do* find hw like that, is it SMP-capable? And can you find
> somebody who cares?

Probably not.

>> 2) Is it correct to claim that a MCE which hits user space and ends up
>>    in mce_panic() is still just a regular exception or should we
>>    upgrade to NMI class context when we enter mce_panic() or even go
>>    as far to upgrade to NMI class context for any panic() invocation?
>
> I do think that an NMI in user space should be considered mostly just
> a normal exception. From a kernel perspective, the NMI'ness just
> doesn't matter.

That's correct.  I don't want to change that at all especially not for
recoverable MCEs.

> That said, I find your suggestion of making 'panic()' just basically
> act as an NMI context intriguing. And cleaner than the
> atomic_read(&panic_cpu) thing.
>
> Are there any other situations than this odd HPET thing where that
> would change semantics?

I need to go and stare at this some more.

Thanks,

        tglx
On Wed, May 29 2024 at 12:39, Tony W Wang-oc wrote:
> The printk deadlock will happen at #A because in_nmi() evaluates to
> false on CPU B and CPU B has not entered panic() yet at #A.
>
> Would upgrading the user space MCE handler to NMI class context be
> preferred?

No.
On Wed, May 29 2024 at 14:44, Tony W Wang-oc wrote:
> Actually, this scenario is what this patch is trying to solve.
>
> We encountered a hpet_lock deadlock from the call path of the MCE
> handler, and this hpet_lock deadlock scenario may happen when other
> exception handlers like #PF/#GP/... call panic(). So read_hpet() should
> avoid the deadlock if a panic is in progress.

That's not what your changelog says.
On 2024/5/29 15:45, Thomas Gleixner wrote:
> On Wed, May 29 2024 at 14:44, Tony W Wang-oc wrote:
>> Actually, this scenario is what this patch is trying to solve.
>>
>> We encountered a hpet_lock deadlock from the call path of the MCE
>> handler, and this hpet_lock deadlock scenario may happen when other
>> exception handlers like #PF/#GP/... call panic(). So read_hpet()
>> should avoid the deadlock if a panic is in progress.
>
> That's not what your changelog says.

Yes.  The example flow I gave with the MCE handler in my changelog is
misleading.
On 2024/5/29 15:42, Thomas Gleixner wrote:
> Linus!
>
> On Tue, May 28 2024 at 16:22, Linus Torvalds wrote:
>> On Tue, 28 May 2024 at 15:12, Thomas Gleixner <tglx@linutronix.de> wrote:
>> I see the smiley, but yeah, I don't think we really care about it.
>
> Indeed. But the same problem exists on other architectures as
> well. drivers/clocksource alone has 4 examples aside from i8253.
>
>>> 1) Should we provide a panic mode read callback for clocksources which
>>>    are affected by this?
>>
>> The current patch under discussion may be ugly, but looks workable.
>> Local ugliness isn't necessarily a show-stopper.
>>
>> So if the HPET is the *only* case which has this situation, I vote for
>> just doing the ugly thing.
>>
>> Now, if *other* cases exist, and can't be worked around in similar
>> ways, then that argues for a more "proper" fix.
>>
>> And no, I don't think i8253 is a strong enough argument. I don't
>> actually believe you can realistically find a machine that doesn't
>> have HPET or the TSC and really falls back on the i8253 any more. And
>> if you *do* find hw like that, is it SMP-capable? And can you find
>> somebody who cares?
>
> Probably not.
>
>>> 2) Is it correct to claim that a MCE which hits user space and ends up
>>>    in mce_panic() is still just a regular exception or should we
>>>    upgrade to NMI class context when we enter mce_panic() or even go
>>>    as far to upgrade to NMI class context for any panic() invocation?
>>

After an MCE has occurred, it is possible for the MCE handler to execute
add_taint() without panicking, for example when fake_panic is
configured.

So the patch method above does not seem to cover the printk deadlock
caused by add_taint() in the MCE handler when an MCE occurs in user
space.

Sincerely
TonyWWang-oc

>> I do think that an NMI in user space should be considered mostly just
>> a normal exception. From a kernel perspective, the NMI'ness just
>> doesn't matter.
>
> That's correct. I don't want to change that at all especially not for
> recoverable MCEs.
>
>> That said, I find your suggestion of making 'panic()' just basically
>> act as an NMI context intriguing. And cleaner than the
>> atomic_read(&panic_cpu) thing.
>>
>> Are there any other situations than this odd HPET thing where that
>> would change semantics?
>
> I need to go and stare at this some more.
>
> Thanks,
>
>         tglx
On Wed, Jun 05, 2024 at 02:23:32PM +0800, Tony W Wang-oc wrote:
> After an MCE has occurred, it is possible for the MCE handler to
> execute add_taint() without panicking, for example when fake_panic is
> configured.

fake_panic is an ancient debugging leftover.  If you set it, you get
what you deserve.
On 2024/6/5 17:36, Borislav Petkov wrote:
> On Wed, Jun 05, 2024 at 02:23:32PM +0800, Tony W Wang-oc wrote:
>> After an MCE has occurred, it is possible for the MCE handler to
>> execute add_taint() without panicking, for example when fake_panic is
>> configured.
>
> fake_panic is an ancient debugging leftover. If you set it, you get
> what you deserve.

It may also happen without setting fake_panic, for example when an MCE
error of the UCNA/SRAO type occurs?

Sincerely
TonyWWang-oc

> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
On Wed, Jun 05, 2024 at 06:10:07PM +0800, Tony W Wang-oc wrote:
> It may also happen without setting fake_panic, for example when an MCE
> error of the UCNA/SRAO type occurs?

Which types exactly do you mean when you're looking at the severities[]
array in severity.c?

And what scenario are you talking about?

To get an #MC exception and detect only UCNA/SRAO errors?  Can that even
happen on any hardware?
On 2024/6/5 19:33, Borislav Petkov wrote:
> On Wed, Jun 05, 2024 at 06:10:07PM +0800, Tony W Wang-oc wrote:
>> It may also happen without setting fake_panic, for example when an MCE
>> error of the UCNA/SRAO type occurs?
>
> Which types exactly do you mean when you're looking at the severities[]
> array in severity.c?
>
> And what scenario are you talking about?
>
> To get an #MC exception and detect only UCNA/SRAO errors? Can that even
> happen on any hardware?

Yes, I mean an #MC exception happens and detects only SRAO errors like
the ones below:

	MCESEV(
		AO, "Action optional: memory scrubbing error",
		SER, MASK(MCI_UC_AR|MCACOD_SCRUBMSK, MCI_STATUS_UC|MCACOD_SCRUB)
		),
	MCESEV(
		AO, "Action optional: last level cache writeback error",
		SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
		),

I think these errors are actually encountered on some platforms that
support reporting these types of errors via #MC.

Sincerely
TonyWWang-oc

> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
> > Which types exactly do you mean when you're looking at the severities[]
> > array in severity.c?
> >
> > And what scenario are you talking about?
> >
> > To get an #MC exception and detect only UCNA/SRAO errors? Can that even
> > happen on any hardware?
>
> Yes, I mean an #MC exception happens and detects only SRAO errors like
> the ones below:
>
> MCESEV(
>	AO, "Action optional: memory scrubbing error",
>	SER, MASK(MCI_UC_AR|MCACOD_SCRUBMSK, MCI_STATUS_UC|MCACOD_SCRUB)
>	),
> MCESEV(
>	AO, "Action optional: last level cache writeback error",
>	SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
>	),
>
> I think these errors are actually encountered on some platforms that
> support reporting these types of errors via #MC.

Intel servers from Nehalem through Cascade Lake reported memory
controller patrol scrub uncorrected errors with #MC and an SRAO
signature.

Icelake and newer use CMCI with a UCNA signature.

-Tony
On 2024/6/5 23:51, Luck, Tony wrote:
>>> Which types exactly do you mean when you're looking at the severities[]
>>> array in severity.c?
>>>
>>> And what scenario are you talking about?
>>>
>>> To get an #MC exception and detect only UCNA/SRAO errors? Can that even
>>> happen on any hardware?
>>
>> Yes, I mean an #MC exception happens and detects only SRAO errors like
>> the ones below:
>>
>> MCESEV(
>>	AO, "Action optional: memory scrubbing error",
>>	SER, MASK(MCI_UC_AR|MCACOD_SCRUBMSK, MCI_STATUS_UC|MCACOD_SCRUB)
>>	),
>> MCESEV(
>>	AO, "Action optional: last level cache writeback error",
>>	SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
>>	),
>>
>> I think these errors are actually encountered on some platforms that
>> support these types of errors being reported via #MC.
>
> Intel servers from Nehalem through Cascade Lake reported memory
> controller patrol scrub uncorrected errors with #MC and an SRAO
> signature.
>
> Icelake and newer use CMCI with a UCNA signature.

I have a question: does Intel use #MC to report UCNA errors?

Sincerely
TonyWWang-oc
>> Icelake and newer use CMCI with a UCNA signature.
>
> I have a question: does Intel use #MC to report UCNA errors?

No.  They are reported with CMCI[1] (assuming it is enabled by
IA32_MCi_CTL2 bit 30).

-Tony

[1] Usage evolved and naming did not keep up.  An "Uncorrected" error is
    being signaled using "Corrected Machine Check Interrupt".
On 2024/5/29 06:12, Thomas Gleixner wrote:
> On Tue, May 28 2024 at 07:18, Dave Hansen wrote:
>> On 5/27/24 23:38, Tony W Wang-oc wrote:
>> ...
>>> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
>>> index c96ae8fee95e..ecadd0698d6a 100644
>>> --- a/arch/x86/kernel/hpet.c
>>> +++ b/arch/x86/kernel/hpet.c
>>> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>>>  	if (in_nmi())
>>>  		return (u64)hpet_readl(HPET_COUNTER);
>>>
>>> +	/*
>>> +	 * Read HPET directly if panic in progress.
>>> +	 */
>>> +	if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
>>> +		return (u64)hpet_readl(HPET_COUNTER);
>>> +
>>
>> There is literally one other piece of the code in the kernel doing
>> something similar: the printk() implementation. There's no other
>> clocksource or timekeeping code that does this on any architecture.
>>
>> Why doesn't this problem apply to any other clock sources?
>
> In principle it applies to any clocksource which needs a spinlock to
> serialize access. HPET is not the only insanity here.
>
> Think about i8253 :)
>
> Most real clocksources, like TSC and the majority of the preferred clock
> sources on other architectures are perfectly fine. They just read and be
> done.
>
>> Why should the problem be fixed in the clock sources themselves? Why
>> doesn't printk() deadlock on systems using the HPET?
>
> Because regular printk()s are deferred to irq work when in NMI and
> similar contexts, but that obviously does not apply to panic
> situations. Also NMI is treated special even in the HPET code. See
> below.
>
>> In other words, I think we should fix pstore to be more like printk
>> rather than hacking around this in each clock source.
>
> pstore is perfectly fine. It uses a NMI safe time accessor function
> which is then tripping over the HPET lock. That's really a HPET specific
> problem.
>
> Though what I read out of the changelog is that the MCE hits the same
> CPU 'x' which holds the lock. But that's fairy tale material as you can
> see in the patch above:
>
>	if (in_nmi())
>		return (u64)hpet_readl(HPET_COUNTER);
>
> For that particular case the dead lock, which would actually be a live
> lock, cannot happen because in kernel MCEs are NMI class exceptions and
> therefore in_nmi() evaluates to true and that new voodoo can't be
> reached at all.
>
> Now there are two other scenarios which really can make that happen:
>
>  1) A non-NMI class exception within the lock held region
>
>     CPU A
>     acquire(hpet_lock);
>     ... <- #PF, #GP, #DE, ... -> panic()
>
>     If any of that happens within that lock held section then the live
>     lock on the hpet_lock is the least of your worries. Seriously, I
>     don't care about this at all.
>
>  2) The actual scenario is:
>
>     CPU A                           CPU B
>     lock(hpet_lock)
>                                     MCE hits user space
>                                     ...
>                                     exc_machine_check_user()
>                                       irqentry_enter_from_user_mode(regs);
>
>     irqentry_enter_from_user_mode() obviously does not mark the
>     exception as NMI class, so in_nmi() evaluates to false. That would
>     actually dead lock if CPU A is not making progress and releases
>     hpet_lock.
>
>     Sounds unlikely to happen, right? But in reality it can because of
>     MCE broadcast. Assume that both CPUs go into MCE:
>
>     CPU A                           CPU B
>     lock(hpet_lock)
>                                     exc_machine_check_user()
>                                       irqentry_enter_from_user_mode();
>     exc_machine_check_kernel()        do_machine_check()
>       irqentry_nmi_enter();             mce_panic()
>       do_machine_check()                  if (atomic_inc_return(&mce_panicked) > 1)
>         mce_panic()                           wait_for_panic(); <- Not taken
>           if (atomic_inc_return(&mce_panicked) > 1)
>               wait_for_panic(); <- Taken
>                                           ....
>                                           hpet_read()
>
>     -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
>        obviously can't release the lock.
>

For this scenario, an experiment was designed for printk:

  a. Install a driver module that repeatedly sends IPIs to multiple
     cores to execute printk().
  b. Run a user-level testing tool like stream on all cores.
  c. Trigger an MCE hardware error.
During burn-in tests a-c, the following case was reproduced:

    CPU A                              CPU B
      printk()
        console_owner <- taken
                                         exc_machine_check_user()
                                           irqentry_enter_from_user_mode()
      exc_machine_check_kernel()            do_machine_check()
        irqentry_nmi_enter()                  mce_panic()
        do_machine_check()                      print_mce()
          ...                                     ...
                                                  while (console_waiter)
                                                      cpu_relax(); <- dead loop
      mce_timed_out() <- timeout
        wait_for_panic()
          panic("Panicing machine check CPU died");

In this case CPU B is the monarch CPU in the MCE handler. CPU B is
waiting to become the console_owner and CPU A cannot release the
console_owner, so the monarch CPU B dead loops. As a result, the other
CPUs waiting on the monarch CPU time out and call panic().

This problem is caused by using printk in the MCE handler. Indeed, the
comments on the MCE handler say:

 * This is executed in #MC context not subject to normal locking rules.
 * This implies that most kernel services cannot be safely used. Don't even
 * think about putting a printk in there!

Should we consider not using printk in the MCE handler?

Sincerely!
TonyWWang-oc

> So the proposed patch makes sense to some extent. But it only cures
> the symptom. The real underlying questions are:
>
>   1) Should we provide a panic mode read callback for clocksources
>      which are affected by this?
>
>   2) Is it correct to claim that a MCE which hits user space and ends
>      up in mce_panic() is still just a regular exception, or should we
>      upgrade to NMI class context when we enter mce_panic(), or even
>      go as far as to upgrade to NMI class context for any panic()
>      invocation?
>
> #1 Solves it at the clocksource level. It still needs HPET specific
>    changes.
>
> #2 Solves a whole class of issues
>
>    ... while potentially introducing new ones :)
>
> To me upgrading any panic() invocation to NMI class context makes a
> lot of sense because in that case all bets are off.
>
> in_nmi() is used in quite some places to avoid such problems.
> IOW, that would kill a whole class of issues instead of "curing" the
> HPET problem locally for the price of an extra conditional. Not that
> the extra conditional matters much if HPET is the clocksource as
> that's awfully slow anyway and I really don't care about that.
>
> But I very much care about avoiding to sprinkle panic_cpu checks all
> over the place.
>
> Thanks,
>
>         tglx
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index c96ae8fee95e..ecadd0698d6a 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
 	if (in_nmi())
 		return (u64)hpet_readl(HPET_COUNTER);
 
+	/*
+	 * Read HPET directly if panic in progress.
+	 */
+	if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
+		return (u64)hpet_readl(HPET_COUNTER);
+
 	/*
 	 * Read the current state of the lock and HPET value atomically.
 	 */
When the clocksource of the system is HPET, a CPU executing read_hpet()
may be interrupted by an exception that ends up calling panic(), which
can lead to a read_hpet() dead loop:

    CPU x
    ----
    read_hpet()
      arch_spin_trylock(&hpet.lock) [CPU x takes hpet.lock]
      #MC happens
        do_machine_check()
          mce_panic()
            panic()
              kmsg_dump()
                pstore_dump()
                  pstore_record_init()
                    ktime_get_real_fast_ns()
                      read_hpet() [dead loop]

To avoid this dead loop, read the HPET directly if a panic is in
progress.

Signed-off-by: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com>
---
 arch/x86/kernel/hpet.c | 6 ++++++
 1 file changed, 6 insertions(+)