Message ID | 20240528063836.5248-1-TonyWWang-oc@zhaoxin.com (mailing list archive)
State      | Handled Elsewhere
Series     | x86/hpet: Read HPET directly if panic in progress
On 5/27/24 23:38, Tony W Wang-oc wrote:
...
> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
> index c96ae8fee95e..ecadd0698d6a 100644
> --- a/arch/x86/kernel/hpet.c
> +++ b/arch/x86/kernel/hpet.c
> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>  	if (in_nmi())
>  		return (u64)hpet_readl(HPET_COUNTER);
>
> +	/*
> +	 * Read HPET directly if panic in progress.
> +	 */
> +	if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
> +		return (u64)hpet_readl(HPET_COUNTER);
> +

There is literally one other piece of the code in the kernel doing
something similar: the printk() implementation.  There's no other
clocksource or timekeeping code that does this on any architecture.

Why doesn't this problem apply to any other clock sources?

Why should the problem be fixed in the clock sources themselves?  Why
doesn't printk() deadlock on systems using the HPET?

In other words, I think we should fix pstore to be more like printk
rather than hacking around this in each clock source.
On Tue, May 28 2024 at 07:18, Dave Hansen wrote:
> On 5/27/24 23:38, Tony W Wang-oc wrote:
> ...
>> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
>> index c96ae8fee95e..ecadd0698d6a 100644
>> --- a/arch/x86/kernel/hpet.c
>> +++ b/arch/x86/kernel/hpet.c
>> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>>  	if (in_nmi())
>>  		return (u64)hpet_readl(HPET_COUNTER);
>>
>> +	/*
>> +	 * Read HPET directly if panic in progress.
>> +	 */
>> +	if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
>> +		return (u64)hpet_readl(HPET_COUNTER);
>> +
>
> There is literally one other piece of the code in the kernel doing
> something similar: the printk() implementation.  There's no other
> clocksource or timekeeping code that does this on any architecture.
>
> Why doesn't this problem apply to any other clock sources?

In principle it applies to any clocksource which needs a spinlock to
serialize access.  HPET is not the only insanity here.

Think about i8253 :)

Most real clocksources, like TSC and the majority of the preferred clock
sources on other architectures are perfectly fine.  They just read and
be done.

> Why should the problem be fixed in the clock sources themselves?  Why
> doesn't printk() deadlock on systems using the HPET?

Because regular printk()s are deferred to irq work when in NMI and
similar contexts, but that obviously does not apply to panic situations.
Also NMI is treated special even in the HPET code.  See below.

> In other words, I think we should fix pstore to be more like printk
> rather than hacking around this in each clock source.

pstore is perfectly fine.  It uses a NMI safe time accessor function
which is then tripping over the HPET lock.  That's really a HPET
specific problem.

Though what I read out of the changelog is that the MCE hits the same
CPU 'x' which holds the lock.  But that's fairy tale material as you can
see in the patch above:

	if (in_nmi())
		return (u64)hpet_readl(HPET_COUNTER);

For that particular case the dead lock, which would actually be a live
lock, cannot happen because in kernel MCEs are NMI class exceptions and
therefore in_nmi() evaluates to true and that new voodoo can't be
reached at all.

Now there are two other scenarios which really can make that happen:

 1) A non-NMI class exception within the lock held region

    CPU A
    acquire(hpet_lock);
    ... <- #PF, #GP, #DE, ... -> panic()

    If any of that happens within that lock held section then the live
    lock on the hpet_lock is the least of your worries.  Seriously, I
    don't care about this at all.

 2) The actual scenario is:

    CPU A                           CPU B
    lock(hpet_lock)
                                    MCE hits user space
                                    ...
                                    exc_machine_check_user()
                                      irqentry_enter_from_user_mode(regs);

    irqentry_enter_from_user_mode() obviously does not mark the
    exception as NMI class, so in_nmi() evaluates to false.  That would
    actually dead lock if CPU A is not making progress and releases
    hpet_lock.

    Sounds unlikely to happen, right?  But in reality it can because of
    MCE broadcast.  Assume that both CPUs go into MCE:

    CPU A                           CPU B
    lock(hpet_lock)
                                    exc_machine_check_user()
                                      irqentry_enter_from_user_mode();
    exc_machine_check_kernel()        do_machine_check()
      irqentry_nmi_enter();             mce_panic()
      do_machine_check()                  if (atomic_inc_return(&mce_panicked) > 1)
        mce_panic()                           wait_for_panic();  <- Not taken
          if (atomic_inc_return(&mce_panicked) > 1)
              wait_for_panic();  <- Taken
                                          ....
                                          hpet_read()

    -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
       obviously can't release the lock.

So the proposed patch makes sense to some extent.  But it only cures the
symptom.  The real underlying questions are:

 1) Should we provide a panic mode read callback for clocksources which
    are affected by this?

 2) Is it correct to claim that a MCE which hits user space and ends up
    in mce_panic() is still just a regular exception or should we
    upgrade to NMI class context when we enter mce_panic() or even go as
    far to upgrade to NMI class context for any panic() invocation?

#1 Solves it at the clocksource level.  It still needs HPET specific
   changes.

#2 Solves a whole class of issues

   ... while potentially introducing new ones :)

To me upgrading any panic() invocation to NMI class context makes a lot
of sense because in that case all bets are off.

in_nmi() is used in quite some places to avoid such problems.  IOW, that
would kill a whole class of issues instead of "curing" the HPET problem
locally for the price of an extra conditional.  Not that the extra
conditional matters much if HPET is the clocksource as that's awfully
slow anyway and I really don't care about that.

But I very much care about avoiding to sprinkle panic_cpu checks all
over the place.

Thanks,

        tglx
On Tue, 28 May 2024 at 15:12, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> In principle it applies to any clocksource which needs a spinlock to
> serialize access.  HPET is not the only insanity here.

HPET may be the main / only one we care about.  Because:

> Think about i8253 :)

I see the smiley, but yeah, I don't think we really care about it.

> 1) Should we provide a panic mode read callback for clocksources which
>    are affected by this?

The current patch under discussion may be ugly, but looks workable.
Local ugliness isn't necessarily a show-stopper.

So if the HPET is the *only* case which has this situation, I vote for
just doing the ugly thing.

Now, if *other* cases exist, and can't be worked around in similar ways,
then that argues for a more "proper" fix.

And no, I don't think i8253 is a strong enough argument.  I don't
actually believe you can realistically find a machine that doesn't have
HPET or the TSC and really falls back on the i8253 any more.  And if you
*do* find hw like that, is it SMP-capable?  And can you find somebody
who cares?

> 2) Is it correct to claim that a MCE which hits user space and ends up
>    in mce_panic() is still just a regular exception or should we
>    upgrade to NMI class context when we enter mce_panic() or even go as
>    far to upgrade to NMI class context for any panic() invocation?

I do think that an NMI in user space should be considered mostly just a
normal exception.  From a kernel perspective, the NMI'ness just doesn't
matter.

That said, I find your suggestion of making 'panic()' just basically act
as an NMI context intriguing.  And cleaner than the
atomic_read(&panic_cpu) thing.

Are there any other situations than this odd HPET thing where that would
change semantics?

            Linus
On 2024/5/29 06:12, Thomas Gleixner wrote:
> On Tue, May 28 2024 at 07:18, Dave Hansen wrote:
>> On 5/27/24 23:38, Tony W Wang-oc wrote:
>> ...
>>> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
>>> index c96ae8fee95e..ecadd0698d6a 100644
>>> --- a/arch/x86/kernel/hpet.c
>>> +++ b/arch/x86/kernel/hpet.c
>>> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>>>  	if (in_nmi())
>>>  		return (u64)hpet_readl(HPET_COUNTER);
>>>
>>> +	/*
>>> +	 * Read HPET directly if panic in progress.
>>> +	 */
>>> +	if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
>>> +		return (u64)hpet_readl(HPET_COUNTER);
>>> +
>>
>> There is literally one other piece of the code in the kernel doing
>> something similar: the printk() implementation. There's no other
>> clocksource or timekeeping code that does this on any architecture.
>>
>> Why doesn't this problem apply to any other clock sources?
>
> In principle it applies to any clocksource which needs a spinlock to
> serialize access. HPET is not the only insanity here.
>
> Think about i8253 :)
>
> Most real clocksources, like TSC and the majority of the preferred clock
> sources on other architectures are perfectly fine. They just read and be
> done.
>
>> Why should the problem be fixed in the clock sources themselves? Why
>> doesn't printk() deadlock on systems using the HPET?
>
> Because regular printk()s are deferred to irq work when in NMI and
> similar contexts, but that obviously does not apply to panic
> situations. Also NMI is treated special even in the HPET code. See
> below.
>
>> In other words, I think we should fix pstore to be more like printk
>> rather than hacking around this in each clock source.
>
> pstore is perfectly fine. It uses a NMI safe time accessor function
> which is then tripping over the HPET lock. That's really a HPET specific
> problem.
>
> Though what I read out of the changelog is that the MCE hits the same
> CPU 'x' which holds the lock. But that's fairy tale material as you can
> see in the patch above:
>
>	if (in_nmi())
>		return (u64)hpet_readl(HPET_COUNTER);
>
> For that particular case the dead lock, which would actually be a live
> lock, cannot happen because in kernel MCEs are NMI class exceptions and
> therefore in_nmi() evaluates to true and that new voodoo can't be
> reached at all.
>
> Now there are two other scenarios which really can make that happen:
>
>  1) A non-NMI class exception within the lock held region
>
>     CPU A
>     acquire(hpet_lock);
>     ... <- #PF, #GP, #DE, ... -> panic()
>
>     If any of that happens within that lock held section then the live
>     lock on the hpet_lock is the least of your worries. Seriously, I
>     don't care about this at all.
>
>  2) The actual scenario is:
>
>     CPU A                           CPU B
>     lock(hpet_lock)
>                                     MCE hits user space
>                                     ...
>                                     exc_machine_check_user()
>                                       irqentry_enter_from_user_mode(regs);
>
>     irqentry_enter_from_user_mode() obviously does not mark the
>     exception as NMI class, so in_nmi() evaluates to false. That would
>     actually dead lock if CPU A is not making progress and releases
>     hpet_lock.
>
>     Sounds unlikely to happen, right? But in reality it can because of
>     MCE broadcast. Assume that both CPUs go into MCE:
>
>     CPU A                           CPU B
>     lock(hpet_lock)
>                                     exc_machine_check_user()
>                                       irqentry_enter_from_user_mode();
>     exc_machine_check_kernel()        do_machine_check()
>       irqentry_nmi_enter();             mce_panic()
>       do_machine_check()                  if (atomic_inc_return(&mce_panicked) > 1)
>         mce_panic()                           wait_for_panic(); <- Not taken
>           if (atomic_inc_return(&mce_panicked) > 1)
>               wait_for_panic(); <- Taken
>                                           ....
>                                           hpet_read()
>
>     -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
>        obviously can't release the lock.
>

Because the MCE handler calls printk() before calling panic(), a
printk() deadlock may happen in this scenario:

    CPU A                           CPU B
    printk()
      lock(console_owner_lock)
                                    exc_machine_check_user()
                                      irqentry_enter_from_user_mode()
    exc_machine_check_kernel()        do_machine_check()
      irqentry_nmi_enter()              mce_panic()
      do_machine_check()                  printk_mce()  #A
        mce_panic()                       ...
          wait_for_panic()                panic()

The printk deadlock will happen at #A because in_nmi() evaluates to
false on CPU B and CPU B has not entered panic() yet at #A.

Would upgrading the user space MCE handler to NMI class context be
preferred?

Sincerely
TonyWWang-oc

> So the proposed patch makes sense to some extent. But it only cures the
> symptom. The real underlying questions are:
>
>  1) Should we provide a panic mode read callback for clocksources which
>     are affected by this?
>
>  2) Is it correct to claim that a MCE which hits user space and ends up
>     in mce_panic() is still just a regular exception or should we
>     upgrade to NMI class context when we enter mce_panic() or even go
>     as far to upgrade to NMI class context for any panic() invocation?
>
> #1 Solves it at the clocksource level. It still needs HPET specific
>    changes.
>
> #2 Solves a whole class of issues
>
>    ... while potentially introducing new ones :)
>
> To me upgrading any panic() invocation to NMI class context makes a
> lot of sense because in that case all bets are off.
>
> in_nmi() is used in quite some places to avoid such problems. IOW,
> that would kill a whole class of issues instead of "curing" the HPET
> problem locally for the price of an extra conditional. Not that the
> extra conditional matters much if HPET is the clocksource as that's
> awfully slow anyway and I really don't care about that.
>
> But I very much care about avoiding to sprinkle panic_cpu checks all
> over the place.
>
> Thanks,
>
>         tglx
On 2024/5/29 12:39, Tony W Wang-oc wrote:
> On 2024/5/29 06:12, Thomas Gleixner wrote:
>> On Tue, May 28 2024 at 07:18, Dave Hansen wrote:
>>> On 5/27/24 23:38, Tony W Wang-oc wrote:
>>> ...
>>>> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
>>>> index c96ae8fee95e..ecadd0698d6a 100644
>>>> --- a/arch/x86/kernel/hpet.c
>>>> +++ b/arch/x86/kernel/hpet.c
>>>> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>>>>  	if (in_nmi())
>>>>  		return (u64)hpet_readl(HPET_COUNTER);
>>>>
>>>> +	/*
>>>> +	 * Read HPET directly if panic in progress.
>>>> +	 */
>>>> +	if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
>>>> +		return (u64)hpet_readl(HPET_COUNTER);
>>>> +
>>>
>>> There is literally one other piece of the code in the kernel doing
>>> something similar: the printk() implementation. There's no other
>>> clocksource or timekeeping code that does this on any architecture.
>>>
>>> Why doesn't this problem apply to any other clock sources?
>>
>> In principle it applies to any clocksource which needs a spinlock to
>> serialize access. HPET is not the only insanity here.
>>
>> Think about i8253 :)
>>
>> Most real clocksources, like TSC and the majority of the preferred clock
>> sources on other architectures are perfectly fine. They just read and be
>> done.
>>
>>> Why should the problem be fixed in the clock sources themselves? Why
>>> doesn't printk() deadlock on systems using the HPET?
>>
>> Because regular printk()s are deferred to irq work when in NMI and
>> similar contexts, but that obviously does not apply to panic
>> situations. Also NMI is treated special even in the HPET code. See
>> below.
>>
>>> In other words, I think we should fix pstore to be more like printk
>>> rather than hacking around this in each clock source.
>>
>> pstore is perfectly fine. It uses a NMI safe time accessor function
>> which is then tripping over the HPET lock. That's really a HPET specific
>> problem.
>>
>> Though what I read out of the changelog is that the MCE hits the same
>> CPU 'x' which holds the lock. But that's fairy tale material as you can
>> see in the patch above:
>>
>>	if (in_nmi())
>>		return (u64)hpet_readl(HPET_COUNTER);
>>
>> For that particular case the dead lock, which would actually be a live
>> lock, cannot happen because in kernel MCEs are NMI class exceptions and
>> therefore in_nmi() evaluates to true and that new voodoo can't be
>> reached at all.
>>
>> Now there are two other scenarios which really can make that happen:
>>
>>  1) A non-NMI class exception within the lock held region
>>
>>     CPU A
>>     acquire(hpet_lock);
>>     ... <- #PF, #GP, #DE, ... -> panic()
>>
>>     If any of that happens within that lock held section then the live
>>     lock on the hpet_lock is the least of your worries. Seriously, I
>>     don't care about this at all.
>>

Actually, this scenario is what this patch is trying to solve.

We encountered a hpet_lock deadlock from the call path of the MCE
handler, and this hpet_lock deadlock scenario may happen when other
exception handlers like #PF/#GP/... call panic().  So read_hpet() should
avoid the deadlock if a panic is in progress.

Sincerely
TonyWWang-oc

>>  2) The actual scenario is:
>>
>>     CPU A                           CPU B
>>     lock(hpet_lock)
>>                                     MCE hits user space
>>                                     ...
>>                                     exc_machine_check_user()
>>                                       irqentry_enter_from_user_mode(regs);
>>
>>     irqentry_enter_from_user_mode() obviously does not mark the
>>     exception as NMI class, so in_nmi() evaluates to false. That would
>>     actually dead lock if CPU A is not making progress and releases
>>     hpet_lock.
>>
>>     Sounds unlikely to happen, right? But in reality it can because of
>>     MCE broadcast. Assume that both CPUs go into MCE:
>>
>>     CPU A                           CPU B
>>     lock(hpet_lock)
>>                                     exc_machine_check_user()
>>                                       irqentry_enter_from_user_mode();
>>     exc_machine_check_kernel()        do_machine_check()
>>       irqentry_nmi_enter();             mce_panic()
>>       do_machine_check()                  if (atomic_inc_return(&mce_panicked) > 1)
>>         mce_panic()                           wait_for_panic(); <- Not taken
>>           if (atomic_inc_return(&mce_panicked) > 1)
>>               wait_for_panic(); <- Taken
>>                                           ....
>>                                           hpet_read()
>>
>>     -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
>>        obviously can't release the lock.
>>
>
> Because the MCE handler calls printk() before calling panic(), a
> printk() deadlock may happen in this scenario:
>
>     CPU A                           CPU B
>     printk()
>       lock(console_owner_lock)
>                                     exc_machine_check_user()
>                                       irqentry_enter_from_user_mode()
>     exc_machine_check_kernel()        do_machine_check()
>       irqentry_nmi_enter()              mce_panic()
>       do_machine_check()                  printk_mce()  #A
>         mce_panic()                       ...
>           wait_for_panic()                panic()
>
> The printk deadlock will happen at #A because in_nmi() evaluates to
> false on CPU B and CPU B has not entered panic() yet at #A.
>
> Would upgrading the user space MCE handler to NMI class context be
> preferred?
>
> Sincerely
> TonyWWang-oc
>
>> So the proposed patch makes sense to some extent. But it only cures the
>> symptom. The real underlying questions are:
>>
>>  1) Should we provide a panic mode read callback for clocksources which
>>     are affected by this?
>>
>>  2) Is it correct to claim that a MCE which hits user space and ends
>>     up in mce_panic() is still just a regular exception or should we
>>     upgrade to NMI class context when we enter mce_panic() or even go
>>     as far to upgrade to NMI class context for any panic() invocation?
>>
>> #1 Solves it at the clocksource level. It still needs HPET specific
>>    changes.
>>
>> #2 Solves a whole class of issues
>>
>>    ... while potentially introducing new ones :)
>>
>> To me upgrading any panic() invocation to NMI class context makes a
>> lot of sense because in that case all bets are off.
>>
>> in_nmi() is used in quite some places to avoid such problems. IOW,
>> that would kill a whole class of issues instead of "curing" the HPET
>> problem locally for the price of an extra conditional. Not that the
>> extra conditional matters much if HPET is the clocksource as that's
>> awfully slow anyway and I really don't care about that.
>>
>> But I very much care about avoiding to sprinkle panic_cpu checks all
>> over the place.
>>
>> Thanks,
>>
>>         tglx
Linus!

On Tue, May 28 2024 at 16:22, Linus Torvalds wrote:
> On Tue, 28 May 2024 at 15:12, Thomas Gleixner <tglx@linutronix.de> wrote:
> I see the smiley, but yeah, I don't think we really care about it.

Indeed.  But the same problem exists on other architectures as well.
drivers/clocksource alone has 4 examples aside from i8253.

>> 1) Should we provide a panic mode read callback for clocksources which
>>    are affected by this?
>
> The current patch under discussion may be ugly, but looks workable.
> Local ugliness isn't necessarily a show-stopper.
>
> So if the HPET is the *only* case which has this situation, I vote for
> just doing the ugly thing.
>
> Now, if *other* cases exist, and can't be worked around in similar
> ways, then that argues for a more "proper" fix.
>
> And no, I don't think i8253 is a strong enough argument. I don't
> actually believe you can realistically find a machine that doesn't
> have HPET or the TSC and really falls back on the i8253 any more. And
> if you *do* find hw like that, is it SMP-capable? And can you find
> somebody who cares?

Probably not.

>> 2) Is it correct to claim that a MCE which hits user space and ends up
>>    in mce_panic() is still just a regular exception or should we
>>    upgrade to NMI class context when we enter mce_panic() or even go
>>    as far to upgrade to NMI class context for any panic() invocation?
>
> I do think that an NMI in user space should be considered mostly just
> a normal exception. From a kernel perspective, the NMI'ness just
> doesn't matter.

That's correct.  I don't want to change that at all especially not for
recoverable MCEs.

> That said, I find your suggestion of making 'panic()' just basically
> act as an NMI context intriguing. And cleaner than the
> atomic_read(&panic_cpu) thing.
>
> Are there any other situations than this odd HPET thing where that
> would change semantics?

I need to go and stare at this some more.

Thanks,

        tglx
On Wed, May 29 2024 at 12:39, Tony W Wang-oc wrote:
> The printk deadlock will happen at #A because in_nmi() evaluates to
> false on CPU B and CPU B has not entered panic() yet at #A.
>
> Would upgrading the user space MCE handler to NMI class context be
> preferred?

No.
On Wed, May 29 2024 at 14:44, Tony W Wang-oc wrote:
> Actually, this scenario is what this patch is trying to solve.
>
> We encountered a hpet_lock deadlock from the call path of the MCE
> handler, and this hpet_lock deadlock scenario may happen when other
> exception handlers like #PF/#GP/... call panic(). So read_hpet() should
> avoid the deadlock if a panic is in progress.

That's not what your changelog says.
On 2024/5/29 15:45, Thomas Gleixner wrote:
> On Wed, May 29 2024 at 14:44, Tony W Wang-oc wrote:
>> Actually, this scenario is what this patch is trying to solve.
>>
>> We encountered a hpet_lock deadlock from the call path of the MCE
>> handler, and this hpet_lock deadlock scenario may happen when other
>> exception handlers like #PF/#GP/... call panic(). So read_hpet()
>> should avoid the deadlock if a panic is in progress.
>
> That's not what your changelog says.

Yes.  The example flow I gave with the MCE handler in my changelog is
misleading.
On 2024/5/29 15:42, Thomas Gleixner wrote:
> Linus!
>
> On Tue, May 28 2024 at 16:22, Linus Torvalds wrote:
>> On Tue, 28 May 2024 at 15:12, Thomas Gleixner <tglx@linutronix.de> wrote:
>> I see the smiley, but yeah, I don't think we really care about it.
>
> Indeed. But the same problem exists on other architectures as
> well. drivers/clocksource alone has 4 examples aside from i8253.
>
>>> 1) Should we provide a panic mode read callback for clocksources which
>>>    are affected by this?
>>
>> The current patch under discussion may be ugly, but looks workable.
>> Local ugliness isn't necessarily a show-stopper.
>>
>> So if the HPET is the *only* case which has this situation, I vote for
>> just doing the ugly thing.
>>
>> Now, if *other* cases exist, and can't be worked around in similar
>> ways, then that argues for a more "proper" fix.
>>
>> And no, I don't think i8253 is a strong enough argument. I don't
>> actually believe you can realistically find a machine that doesn't
>> have HPET or the TSC and really falls back on the i8253 any more. And
>> if you *do* find hw like that, is it SMP-capable? And can you find
>> somebody who cares?
>
> Probably not.
>
>>> 2) Is it correct to claim that a MCE which hits user space and ends up
>>>    in mce_panic() is still just a regular exception or should we
>>>    upgrade to NMI class context when we enter mce_panic() or even go
>>>    as far to upgrade to NMI class context for any panic() invocation?
>>

After an MCE has occurred, it is possible for the MCE handler to execute
add_taint() without panicking, for example when fake_panic is
configured.

So the patch method above does not seem to cover the printk deadlock
caused by add_taint() in the MCE handler when an MCE occurs in user
space.

Sincerely
TonyWWang-oc

>> I do think that an NMI in user space should be considered mostly just
>> a normal exception. From a kernel perspective, the NMI'ness just
>> doesn't matter.
>
> That's correct. I don't want to change that at all especially not for
> recoverable MCEs.
>
>> That said, I find your suggestion of making 'panic()' just basically
>> act as an NMI context intriguing. And cleaner than the
>> atomic_read(&panic_cpu) thing.
>>
>> Are there any other situations than this odd HPET thing where that
>> would change semantics?
>
> I need to go and stare at this some more.
>
> Thanks,
>
>         tglx
On Wed, Jun 05, 2024 at 02:23:32PM +0800, Tony W Wang-oc wrote:
> After an MCE has occurred, it is possible for the MCE handler to
> execute add_taint() without panicking, for example when fake_panic is
> configured.

fake_panic is an ancient debugging leftover.  If you set it, you get
what you deserve.
On 2024/6/5 17:36, Borislav Petkov wrote:
> On Wed, Jun 05, 2024 at 02:23:32PM +0800, Tony W Wang-oc wrote:
>> After an MCE has occurred, it is possible for the MCE handler to
>> execute add_taint() without panicking, for example when fake_panic is
>> configured.
>
> fake_panic is an ancient debugging leftover. If you set it, you get
> what you deserve.

It may also happen without setting fake_panic, for example when an MCE
error of the UCNA/SRAO type occurs?

Sincerely
TonyWWang-oc

> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
On Wed, Jun 05, 2024 at 06:10:07PM +0800, Tony W Wang-oc wrote:
> It may also happen without setting fake_panic, for example when an MCE
> error of the UCNA/SRAO type occurs?

Which types exactly do you mean when you're looking at the severities[]
array in severity.c?

And what scenario are you talking about?

To get an #MC exception and detect only UCNA/SRAO errors?  Can that even
happen on any hardware?
On 2024/6/5 19:33, Borislav Petkov wrote:
> On Wed, Jun 05, 2024 at 06:10:07PM +0800, Tony W Wang-oc wrote:
>> It may also happen without setting fake_panic, for example when an MCE
>> error of the UCNA/SRAO type occurs?
>
> Which types exactly do you mean when you're looking at the severities[]
> array in severity.c?
>
> And what scenario are you talking about?
>
> To get an #MC exception and detect only UCNA/SRAO errors? Can that even
> happen on any hardware?

Yes, I mean an #MC exception happens and detects only SRAO errors like
the ones below:

	MCESEV(
		AO, "Action optional: memory scrubbing error",
		SER, MASK(MCI_UC_AR|MCACOD_SCRUBMSK, MCI_STATUS_UC|MCACOD_SCRUB)
		),
	MCESEV(
		AO, "Action optional: last level cache writeback error",
		SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
		),

I think these errors are actually encountered on some platforms that
support reporting these types of errors via #MC.

Sincerely
TonyWWang-oc

> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
> > Which types exactly do you mean when you're looking at the severities[]
> > array in severity.c?
> >
> > And what scenario are you talking about?
> >
> > To get an #MC exception and detect only UCNA/SRAO errors? Can that even
> > happen on any hardware?
>
> Yes, I mean an #MC exception happens and detects only SRAO errors like
> the ones below:
>
> MCESEV(
>	AO, "Action optional: memory scrubbing error",
>	SER, MASK(MCI_UC_AR|MCACOD_SCRUBMSK, MCI_STATUS_UC|MCACOD_SCRUB)
>	),
> MCESEV(
>	AO, "Action optional: last level cache writeback error",
>	SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
>	),
>
> I think these errors are actually encountered on some platforms that
> support reporting these types of errors via #MC.

Intel servers from Nehalem through Cascade Lake reported memory
controller patrol scrub uncorrected errors with #MC and an SRAO
signature.

Icelake and newer use CMCI with a UCNA signature.

-Tony
On 2024/6/5 23:51, Luck, Tony wrote:
>>> Which types exactly do you mean when you're looking at the severities[]
>>> array in severity.c?
>>>
>>> And what scenario are you talking about?
>>>
>>> To get an #MC exception and detect only UCNA/SRAO errors? Can that even
>>> happen on any hardware?
>>
>> Yes, I mean an #MC exception happens and detects only SRAO errors like
>> the ones below:
>>
>> MCESEV(
>>	AO, "Action optional: memory scrubbing error",
>>	SER, MASK(MCI_UC_AR|MCACOD_SCRUBMSK, MCI_STATUS_UC|MCACOD_SCRUB)
>>	),
>> MCESEV(
>>	AO, "Action optional: last level cache writeback error",
>>	SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
>>	),
>>
>> I think these errors are actually encountered on some platforms that
>> support these types of errors being reported via #MC.
>
> Intel servers from Nehalem through Cascade Lake reported memory
> controller patrol scrub uncorrected errors with #MC and an SRAO
> signature.
>
> Icelake and newer use CMCI with a UCNA signature.

I have a question: does Intel use #MC to report UCNA errors?

Sincerely
TonyWWang-oc
>> Icelake and newer use CMCI with a UCNA signature.
>
> I have a question: does Intel use #MC to report UCNA errors?

No.  They are reported with CMCI[1] (assuming it is enabled by
IA32_MCi_CTL2 bit 30).

-Tony

[1] Usage evolved and naming did not keep up.  An "Uncorrected" error is
    being signaled using "Corrected Machine Check Interrupt".
On 2024/5/29 06:12, Thomas Gleixner wrote:
> On Tue, May 28 2024 at 07:18, Dave Hansen wrote:
>> On 5/27/24 23:38, Tony W Wang-oc wrote:
>> ...
>>> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
>>> index c96ae8fee95e..ecadd0698d6a 100644
>>> --- a/arch/x86/kernel/hpet.c
>>> +++ b/arch/x86/kernel/hpet.c
>>> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>>>  	if (in_nmi())
>>>  		return (u64)hpet_readl(HPET_COUNTER);
>>>
>>> +	/*
>>> +	 * Read HPET directly if panic in progress.
>>> +	 */
>>> +	if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
>>> +		return (u64)hpet_readl(HPET_COUNTER);
>>> +
>>
>> There is literally one other piece of the code in the kernel doing
>> something similar: the printk() implementation. There's no other
>> clocksource or timekeeping code that does this on any architecture.
>>
>> Why doesn't this problem apply to any other clock sources?
>
> In principle it applies to any clocksource which needs a spinlock to
> serialize access. HPET is not the only insanity here.
>
> Think about i8253 :)
>
> Most real clocksources, like TSC and the majority of the preferred clock
> sources on other architectures are perfectly fine. They just read and be
> done.
>
>> Why should the problem be fixed in the clock sources themselves? Why
>> doesn't printk() deadlock on systems using the HPET?
>
> Because regular printk()s are deferred to irq work when in NMI and
> similar contexts, but that obviously does not apply to panic
> situations. Also NMI is treated special even in the HPET code. See
> below.
>
>> In other words, I think we should fix pstore to be more like printk
>> rather than hacking around this in each clock source.
>
> pstore is perfectly fine. It uses a NMI safe time accessor function
> which is then tripping over the HPET lock. That's really a HPET specific
> problem.
>
> Though what I read out of the changelog is that the MCE hits the same
> CPU 'x' which holds the lock. But that's fairy tale material as you can
> see in the patch above:
>
>	if (in_nmi())
>		return (u64)hpet_readl(HPET_COUNTER);
>
> For that particular case the dead lock, which would actually be a live
> lock, cannot happen because in kernel MCEs are NMI class exceptions and
> therefore in_nmi() evaluates to true and that new voodoo can't be
> reached at all.
>
> Now there are two other scenarios which really can make that happen:
>
>  1) A non-NMI class exception within the lock held region
>
>     CPU A
>     acquire(hpet_lock);
>     ... <- #PF, #GP, #DE, ... -> panic()
>
>     If any of that happens within that lock held section then the live
>     lock on the hpet_lock is the least of your worries. Seriously, I
>     don't care about this at all.
>
>  2) The actual scenario is:
>
>     CPU A                           CPU B
>     lock(hpet_lock)
>                                     MCE hits user space
>                                     ...
>                                     exc_machine_check_user()
>                                       irqentry_enter_from_user_mode(regs);
>
>     irqentry_enter_from_user_mode() obviously does not mark the
>     exception as NMI class, so in_nmi() evaluates to false. That would
>     actually dead lock if CPU A is not making progress and releases
>     hpet_lock.
>
>     Sounds unlikely to happen, right? But in reality it can because of
>     MCE broadcast. Assume that both CPUs go into MCE:
>
>     CPU A                           CPU B
>     lock(hpet_lock)
>                                     exc_machine_check_user()
>                                       irqentry_enter_from_user_mode();
>     exc_machine_check_kernel()        do_machine_check()
>       irqentry_nmi_enter();             mce_panic()
>       do_machine_check()                  if (atomic_inc_return(&mce_panicked) > 1)
>         mce_panic()                           wait_for_panic(); <- Not taken
>           if (atomic_inc_return(&mce_panicked) > 1)
>               wait_for_panic(); <- Taken
>                                           ....
>                                           hpet_read()
>
>     -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
>        obviously can't release the lock.
>

For this scenario, an experiment was designed for printk:

  a. Install a driver module that repeatedly sends IPIs to multiple
     cores to execute printk().
  b. Run a user-level testing tool like stream on all cores.
  c. Trigger an MCE hardware error.
During burn-in tests a-c, the following case was reproduced:

    CPU A                              CPU B
      printk()
        console_owner <- taken
                                         exc_machine_check_user()
                                           irqentry_enter_from_user_mode()
      exc_machine_check_kernel()            do_machine_check()
        irqentry_nmi_enter()                  mce_panic()
        do_machine_check()                      print_mce()
          ...                                     ...
                                                  while (console_waiter)
                                                      cpu_relax(); <- dead loop
      mce_timed_out() <- timeout
        wait_for_panic()
          panic("Panicing machine check CPU died");

In this case CPU B is the monarch CPU in the MCE handler. CPU B is
waiting to become the console_owner and CPU A cannot release the
console_owner, so the monarch CPU B dead loops. As a result, the other
CPUs waiting on the monarch CPU time out and call panic().

This problem is caused by using printk in the MCE handler. Indeed, the
comments on the MCE handler say:

 * This is executed in #MC context not subject to normal locking rules.
 * This implies that most kernel services cannot be safely used. Don't even
 * think about putting a printk in there!

Should we consider not using printk in the MCE handler?

Sincerely!
TonyWWang-oc

> So the proposed patch makes sense to some extent. But it only cures
> the symptom. The real underlying questions are:
>
>   1) Should we provide a panic mode read callback for clocksources
>      which are affected by this?
>
>   2) Is it correct to claim that a MCE which hits user space and ends
>      up in mce_panic() is still just a regular exception, or should we
>      upgrade to NMI class context when we enter mce_panic(), or even
>      go as far as to upgrade to NMI class context for any panic()
>      invocation?
>
> #1 Solves it at the clocksource level. It still needs HPET specific
>    changes.
>
> #2 Solves a whole class of issues
>
>    ... while potentially introducing new ones :)
>
> To me upgrading any panic() invocation to NMI class context makes a
> lot of sense because in that case all bets are off.
>
> in_nmi() is used in quite some places to avoid such problems.
> IOW, that would kill a whole class of issues instead of "curing" the
> HPET problem locally for the price of an extra conditional. Not that
> the extra conditional matters much if HPET is the clocksource as
> that's awfully slow anyway and I really don't care about that.
>
> But I very much care about avoiding to sprinkle panic_cpu checks all
> over the place.
>
> Thanks,
>
>         tglx
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index c96ae8fee95e..ecadd0698d6a 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
 	if (in_nmi())
 		return (u64)hpet_readl(HPET_COUNTER);
 
+	/*
+	 * Read HPET directly if panic in progress.
+	 */
+	if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
+		return (u64)hpet_readl(HPET_COUNTER);
+
 	/*
 	 * Read the current state of the lock and HPET value atomically.
 	 */
When the clocksource of the system is HPET, a CPU executing read_hpet()
may be interrupted by an exception that ends up calling panic(), which
can lead to a read_hpet() dead loop:

    CPU x
    ----
    read_hpet()
      arch_spin_trylock(&hpet.lock) [CPU x takes hpet.lock]
      #MC happens
        do_machine_check()
          mce_panic()
            panic()
              kmsg_dump()
                pstore_dump()
                  pstore_record_init()
                    ktime_get_real_fast_ns()
                      read_hpet() [dead loop]

To avoid this dead loop, read the HPET directly if a panic is in
progress.

Signed-off-by: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com>
---
 arch/x86/kernel/hpet.c | 6 ++++++
 1 file changed, 6 insertions(+)