mbox series

[v2,0/5] mm/hwpoison: Fix regressions in memory failure handling

Message ID 20250217063335.22257-1-xueshuai@linux.alibaba.com (mailing list archive)
Headers show
Series mm/hwpoison: Fix regressions in memory failure handling | expand

Message

Shuai Xue Feb. 17, 2025, 6:33 a.m. UTC
changes singce v1:
- Patch 1: Fix cur_sev and sev type to `int` per Tony
- Patch 4: Fix return value to 0 for clean pages per Miaohe
- Patch 5: pick return value comments of memory-failure()

This patch addresses three regressions identified in memory failure
handling, as discovered using ras-tools[1]:

- `./einj_mem_uc copyin -f`
- `./einj_mem_uc futex -f`
- `./einj_mem_uc instr`

The regressions in the copyin and futex cases were caused by the
replacement of `EX_TYPE_UACCESS` with `EX_TYPE_EFAULT_REG` in some
copy-from-user operations, leading to kernel panics. The instr case
regression resulted from the PTE entry not being marked as hwpoison,
causing the system to send unnecessary SIGBUS signals.

These fixes ensure proper handling of memory errors and prevent kernel
panics and unnecessary signal dispatch.

[1]https://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git

Shuai Xue (5):
  x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY
  x86/mce: dump error msg from severities
  x86/mce: add EX_TYPE_EFAULT_REG as in-kernel recovery context to fix
    copy-from-user operations regression
  mm/hwpoison: Fix incorrect "not recovered" report for recovered clean
    pages
  mm: memory-failure: move return value documentation to function
    declaration

 arch/x86/kernel/cpu/mce/core.c     | 26 +++++++++++++-------------
 arch/x86/kernel/cpu/mce/severity.c | 21 ++++++++++++++++-----
 mm/memory-failure.c                | 21 +++++++++++++++------
 3 files changed, 44 insertions(+), 24 deletions(-)

Comments

Andrew Morton Feb. 18, 2025, 3:29 a.m. UTC | #1
On Mon, 17 Feb 2025 14:33:30 +0800 Shuai Xue <xueshuai@linux.alibaba.com> wrote:

> This patch addresses three regressions identified in memory failure
> handling, as discovered using ras-tools[1]:

Thanks.  I added these to mm.git for further exposure and testing. 
They do appear to be more an x86 thing so I'll do the usual thing: if
they later turn up in the x86 tree I shall drop the mm.git copies.
Borislav Petkov Feb. 18, 2025, 8:03 a.m. UTC | #2
On Mon, Feb 17, 2025 at 07:29:54PM -0800, Andrew Morton wrote:
> On Mon, 17 Feb 2025 14:33:30 +0800 Shuai Xue <xueshuai@linux.alibaba.com> wrote:
> 
> > This patch addresses three regressions identified in memory failure
> > handling, as discovered using ras-tools[1]:
> 
> Thanks.  I added these to mm.git for further exposure and testing. 
> They do appear to be more an x86 thing so I'll do the usual thing: if
> they later turn up in the x86 tree I shall drop the mm.git copies.  

Please drop 'em. They look confused and need proper review first.

Also, your commits have

"Cc: Acked-by:Thomas Gleixner <tglx@linutronix.de>"

When did tglx ack them? I don't see anything in my mbox.

Andrew, again, please do not take unreviewed x86 patches through your tree
without synchronizing with us. This keeps happening and it is not helping at
all - the opposite is happening.

:-(

Thx.
Borislav Petkov Feb. 18, 2025, 8:27 a.m. UTC | #3
On Mon, Feb 17, 2025 at 02:33:30PM +0800, Shuai Xue wrote:
> changes singce v1:
> - Patch 1: Fix cur_sev and sev type to `int` per Tony
> - Patch 4: Fix return value to 0 for clean pages per Miaohe
> - Patch 5: pick return value comments of memory-failure()
> 
> This patch addresses three regressions identified in memory failure
> handling, as discovered using ras-tools[1]:
> 
> - `./einj_mem_uc copyin -f`
> - `./einj_mem_uc futex -f`
> - `./einj_mem_uc instr`

This is not how you write a problem statement and explain why your patches
exist.

You need to state:

1. What are you trying to do
2. What is the expected outcome and why
3. What actually happens and why
4. The fix, in your opinion, should be X or Y

Not quote some ras tools commands. Show me that you actually know what you're
doing and explain the problem in human understandable way.  And then we can
talk fixes.

Thx.
Shuai Xue Feb. 18, 2025, 11:31 a.m. UTC | #4
在 2025/2/18 16:27, Borislav Petkov 写道:
> On Mon, Feb 17, 2025 at 02:33:30PM +0800, Shuai Xue wrote:
>> changes singce v1:
>> - Patch 1: Fix cur_sev and sev type to `int` per Tony
>> - Patch 4: Fix return value to 0 for clean pages per Miaohe
>> - Patch 5: pick return value comments of memory-failure()
>>
>> This patch addresses three regressions identified in memory failure
>> handling, as discovered using ras-tools[1]:
>>
>> - `./einj_mem_uc copyin -f`
>> - `./einj_mem_uc futex -f`
>> - `./einj_mem_uc instr`
> 
> This is not how you write a problem statement and explain why your patches
> exist.
> 
> You need to state:
> 
> 1. What are you trying to do
> 2. What is the expected outcome and why
> 3. What actually happens and why
> 4. The fix, in your opinion, should be X or Y
> 
> Not quote some ras tools commands. Show me that you actually know what you're
> doing and explain the problem in human understandable way.  And then we can
> talk fixes.
> 
> Thx.
> 

Sorry for the confusion.

> 1. What are you trying to do

I am tring to fix two memory failure regression in upstream kernel compared
with 5.10 LTS.

- copyin case: poison found while copying from user space.
- instr case: poison found while instruction fetching in user space

> 2. What is the expected outcome and why

For copyin case:

Kernel can recover from poison found while copying from user space.  MCE check
the fixup handler type to decide whether an in kernel #MC can be recovered.
When EX_TYPE_UACCESS is found, the PC jumps to recovery code specified in
_ASM_EXTABLE_FAULT() and return a -EFAULT to user space.

For instr case:

If a poison found while instruction fetching in user space, full recovery is
possible. User process takes #PF, Linux allocates a new page and fills by
reading from storage.

> 3. What actually happens and why

For copyin case: kernel panic since v5.17

Commit 4c132d1d844a ("x86/futex: Remove .fixup usage") introduced a new extable
fixup type, EX_TYPE_EFAULT_REG, and later patches updated the extable fixup
type for copy-from-user operations, changing it from EX_TYPE_UACCESS to
EX_TYPE_EFAULT_REG.

For instr case: user process is killed by a SIGBUS signal

Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported "not
recovered"") introduced a bug that kill_accessing_process() return -EHWPOISON
for instr case, as result, kill_me_maybe() send a SIGBUS to user process.

> 4. The fix, in your opinion, should be X or Y

For copyin case: add EX_TYPE_EFAULT_REG as a recovery type.
For instr case: let kill_accessing_process return 0 to prevent a SIGBUS.

For patch 1 and 2:

While debuging the two regression, I found `msg` in predefined `severities`, e.g.

     MCESEV(
         AO, "Action optional: last level cache writeback error",
         SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
         ),

is helpful for me to know what kind of MCE is happened. For a fatal machine
check, kernel panic use the message and I want to extend to collect the message
and print it out for non-fatal one.

For patch 5:

The return value of memory_failure() is quite important while discussed instr
case regression with Tony and Miaohe for patch 4, so move comment to the place
it belongs to.

I hope the information provided above effectively addresses your concerns.
Please feel free to let me know if you have any further questions or need
additional clarification.

Thanks.
Shuai
Borislav Petkov Feb. 18, 2025, 12:24 p.m. UTC | #5
On Tue, Feb 18, 2025 at 07:31:34PM +0800, Shuai Xue wrote:
> Kernel can recover from poison found while copying from user space.

Where was that poison found? On user pages? So reading them consumes the
poison?

So you're not really seeing real issues on real hw - you're using ras tools to
trigger those, correct?

If so, what guarantees ras tools are doing the right thing?

> MCE check the fixup handler type to decide whether an in kernel #MC can be
> recovered.  When EX_TYPE_UACCESS is found,

Sounds like poison on user memory...

> the PC jumps to recovery code specified in _ASM_EXTABLE_FAULT() and return
> a -EFAULT to user space.

> For instr case:
> 
> If a poison found while instruction fetching in user space, full recovery is
> possible. User process takes #PF, Linux allocates a new page and fills by
> reading from storage.
> 
> > 3. What actually happens and why
> 
> For copyin case: kernel panic since v5.17
> 
> Commit 4c132d1d844a ("x86/futex: Remove .fixup usage") introduced a new extable
> fixup type, EX_TYPE_EFAULT_REG, and later patches updated the extable fixup
> type for copy-from-user operations, changing it from EX_TYPE_UACCESS to
> EX_TYPE_EFAULT_REG.

What do futexes have to do with copying user memory?

> For instr case: user process is killed by a SIGBUS signal
> 
> Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported "not
> recovered"") introduced a bug that kill_accessing_process() return -EHWPOISON
> for instr case, as result, kill_me_maybe() send a SIGBUS to user process.

This makes my head hurt... a race between the CMCI reporting an uncorrected
error... why does the CMCI report uncorrected errors? This sounds like some
nasty confusion.

And you've basically reused the format and wording of 046545a661af for your
commit message and makes staring at those a PITA.

Tony, what's going on with that CMCI and SRAR race?
Shuai Xue Feb. 18, 2025, 1:08 p.m. UTC | #6
在 2025/2/18 20:24, Borislav Petkov 写道:
> On Tue, Feb 18, 2025 at 07:31:34PM +0800, Shuai Xue wrote:
>> Kernel can recover from poison found while copying from user space.
> 
> Where was that poison found? On user pages? So reading them consumes the
> poison?

Yes, the poison is found on user pages.

Form commit log, the mechanism is added by Tony and suggested by you.
https://lkml.kernel.org/r/20210818002942.1607544-3-tony.luck@intel.com

> 
> So you're not really seeing real issues on real hw - you're using ras tools to
> trigger those, correct?
> 
> If so, what guarantees ras tools are doing the right thing?

Ras-tools do it by three step:

- alloc memory in userspace
- inject two bit ECC error (uncorretable error) to the memory (by EINJ interface)
- write the memory to a file fd ( by write(2) )

It's the same as with real issue. There's no magic to it.

Doesn't AMD support it?

> 
>> MCE check the fixup handler type to decide whether an in kernel #MC can be
>> recovered.  When EX_TYPE_UACCESS is found,
> 
> Sounds like poison on user memory...

Yes, sorry for confusion.

> 
>> the PC jumps to recovery code specified in _ASM_EXTABLE_FAULT() and return
>> a -EFAULT to user space.
> 
>> For instr case:
>>
>> If a poison found while instruction fetching in user space, full recovery is
>> possible. User process takes #PF, Linux allocates a new page and fills by
>> reading from storage.
>>
>>> 3. What actually happens and why
>>
>> For copyin case: kernel panic since v5.17
>>
>> Commit 4c132d1d844a ("x86/futex: Remove .fixup usage") introduced a new extable
>> fixup type, EX_TYPE_EFAULT_REG, and later patches updated the extable fixup
>> type for copy-from-user operations, changing it from EX_TYPE_UACCESS to
>> EX_TYPE_EFAULT_REG.
> 
> What do futexes have to do with copying user memory?

Return -EFAULT to userspace.

> 
>> For instr case: user process is killed by a SIGBUS signal
>>
>> Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported "not
>> recovered"") introduced a bug that kill_accessing_process() return -EHWPOISON
>> for instr case, as result, kill_me_maybe() send a SIGBUS to user process.
> 
> This makes my head hurt... a race between the CMCI reporting an uncorrected
> error... why does the CMCI report uncorrected errors? This sounds like some
> nasty confusion.
> 
> And you've basically reused the format and wording of 046545a661af for your
> commit message and makes staring at those a PITA.
> 
> Tony, what's going on with that CMCI and SRAR race?
> 

I try to answer why the CMCI reporting an uncorrected error. Tony, please
correct me if I missed anyting.

When core issue a memory to a memory controller finds a 2 bit ECC error, it
will pass data with a poison flag through bus.

1. Home Agent logs UNCA error and signals CMCI ifenable.
2. Home Agent forwards data with poison indication bit set.
3. DCU detects the posion data, logs SRAR eror and triggers #MCE if recoverable.


Thanks.
Shuai
Borislav Petkov Feb. 18, 2025, 1:17 p.m. UTC | #7
On Tue, Feb 18, 2025 at 09:08:25PM +0800, Shuai Xue wrote:
> Yes, the poison is found on user pages.
> 
> Form commit log, the mechanism is added by Tony and suggested by you.
> https://lkml.kernel.org/r/20210818002942.1607544-3-tony.luck@intel.com

I'm not talking about how it is detected - I'm asking about *what* you're
doing exactly. I want to figure out what and why you're doing what you're
doing.

> It's the same as with real issue. There's no magic to it.

Magic or not, doesn't matter. The only question is whether this can happen in
real life and it is not just you using some tools and "fixing" things that
ain't broke.

> > What do futexes have to do with copying user memory?
> 
> Return -EFAULT to userspace.

This doesn't even begin to answer my question so I'll ask again:

"What do futexes have to do with copying user memory?"
Shuai Xue Feb. 18, 2025, 1:53 p.m. UTC | #8
在 2025/2/18 21:17, Borislav Petkov 写道:
> On Tue, Feb 18, 2025 at 09:08:25PM +0800, Shuai Xue wrote:
>> Yes, the poison is found on user pages.
>>
>> Form commit log, the mechanism is added by Tony and suggested by you.
>> https://lkml.kernel.org/r/20210818002942.1607544-3-tony.luck@intel.com
> 
> I'm not talking about how it is detected - I'm asking about *what* you're
> doing exactly. I want to figure out what and why you're doing what you're
> doing.
> 
>> It's the same as with real issue. There's no magic to it.
> 
> Magic or not, doesn't matter. The only question is whether this can happen in
> real life and it is not just you using some tools and "fixing" things that
> ain't broke.

The regression is reported by end user and we also observed in the production.

[5056863.064239] task: ffff8837d2a2a0c0 task.stack: ffffc90065814000
[5056863.137299] RIP: 0010:[<ffffffff813ad231>]  [<ffffffff813ad231>] __get_user_8+0x21/0x2b
...
[5056864.512018] Call Trace:
[5056864.543440]  [<ffffffff8111c203>] ? exit_robust_list+0x33/0x110
[5056864.616456]  [<ffffffff81088399>] mm_release+0x109/0x140
[5056864.682178]  [<ffffffff8108faf9>] do_exit+0x159/0xb60
[5056864.744785]  [<ffffffff81090583>] do_group_exit+0x43/0xb0
[5056864.811551]  [<ffffffff8109bdc9>] get_signal+0x289/0x630
[5056864.877277]  [<ffffffff8102d227>] do_signal+0x37/0x690
[5056864.940925]  [<ffffffff8111c4e5>] ? do_futex+0x205/0x520
[5056865.006651]  [<ffffffff8111c885>] ? SyS_futex+0x85/0x170
[5056865.072378]  [<ffffffff81003726>] exit_to_usermode_loop+0x76/0xc0
[5056865.147464]  [<ffffffff81003d01>] do_syscall_64+0x171/0x180
[5056865.216311]  [<ffffffff81741c8e>] entry_SYSCALL_64_after_swapgs+0x58/0xc6
> 
>>> What do futexes have to do with copying user memory?
>>
>> Return -EFAULT to userspace.
> 
> This doesn't even begin to answer my question so I'll ask again:
> 
> "What do futexes have to do with copying user memory?"
> 

Sorry, I did not get your point.

Thanks.
Shuai
Borislav Petkov Feb. 18, 2025, 3:31 p.m. UTC | #9
On Tue, Feb 18, 2025 at 09:53:17PM +0800, Shuai Xue wrote:
> The regression is reported by end user and we also observed in the production.

Where is that report? How many times do I have to ask about different aspects
of your patches until you explain the *whole* picture?

> [5056863.064239] task: ffff8837d2a2a0c0 task.stack: ffffc90065814000
> [5056863.137299] RIP: 0010:[<ffffffff813ad231>]  [<ffffffff813ad231>] __get_user_8+0x21/0x2b
> ...
> [5056864.512018] Call Trace:

This tells me exactly 0 - I see some truncated stack trace.

> Sorry, I did not get your point.

I don't get your text either. Until this is explained and debugged properly,
it is not going anywhere.
Luck, Tony Feb. 18, 2025, 5:30 p.m. UTC | #10
> > For instr case: user process is killed by a SIGBUS signal
> >
> > Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported "not
> > recovered"") introduced a bug that kill_accessing_process() return -EHWPOISON
> > for instr case, as result, kill_me_maybe() send a SIGBUS to user process.
>
> This makes my head hurt... a race between the CMCI reporting an uncorrected
> error... why does the CMCI report uncorrected errors? This sounds like some
> nasty confusion.

My head hurts too. The problem is the evolution and subsequent overloading of
limited signal options in Intel error reporting.

Prior to Icelake memory controllers reported patrol scrub events that detected
a previously unseen uncorrected error in memory by signaling a broadcast
machine check with an SRAO (Software Recoverable Action Optional) signature
in the machine check bank.

This was overkill. It's not an urgent problem. No core is on the verge of consuming
that bad data.

But the fix causes the confusion. The machine check bank signature was changed
to UCNA (Uncorrected, No Action required), and signal changed to #CMCI (since
that was the only option available in the toolbox :-(

That's how we ended up with *UN*corrected errors tied to *C*MCI.

Just to add to the confusion, Linux does take an action (in uc_decode_notifier())
to try to offline the page despite the UC*NA* signature name.

> And you've basically reused the format and wording of 046545a661af for your
> commit message and makes staring at those a PITA.
>
> Tony, what's going on with that CMCI and SRAR race?

Now the race ... having decided that CMCI/UCNA is the best action for patrol
scrub errors, the memory controller uses it for reads too. But the memory controller
is executing asynchronously from the core, and can't tell the difference between a
"real" read and a speculative read. So it will do CMCI/UCNA if an error is found in
any read.

Thus:

1) Core is clever and thinks address A is needed soon, issues a speculative read.
2) Core finds it is going to use address A soon after sending the read request
3) The CMCI from the memory controller is in a race with the core that will soon try to retire the load from address A.

Quite often (because speculation has got better) the CMCI from the memory controller
is delivered before the core is committed to the instruction reading address A, so the
interrupt is taken, and Linux offlines the page (marking it as poison).

When the interrupt returns, the core gets to the load instruction, and gets a #PF because
the offline process marked the page not-present and flushed the TLB.

Finally the #PF handler tries to fix the page fault, sees that page is marked as poison
so sends SIGBUS to the process.

Note, AMD might have a similar race with the MCE_DEFERRED_SEVERITY signal?
(but with less confusing naming).

-Tony
Luck, Tony Feb. 18, 2025, 5:59 p.m. UTC | #11
>> What do futexes have to do with copying user memory?
>
> Return -EFAULT to userspace.

Missed this bit. Kernel code for futex does a get_user() to read the
value of the futex from user memory.

-Tony
Shuai Xue Feb. 19, 2025, 6:04 a.m. UTC | #12
在 2025/2/19 01:59, Luck, Tony 写道:
>>> What do futexes have to do with copying user memory?
>>
>> Return -EFAULT to userspace.
> 
> Missed this bit. Kernel code for futex does a get_user() to read the
> value of the futex from user memory.
> 
> -Tony
> 
> 

Tony, you saved me.

Thanks.
Shuai
Shuai Xue Feb. 19, 2025, 7:13 a.m. UTC | #13
在 2025/2/18 23:31, Borislav Petkov 写道:
> On Tue, Feb 18, 2025 at 09:53:17PM +0800, Shuai Xue wrote:
>> The regression is reported by end user and we also observed in the production.
> 
> Where is that report? How many times do I have to ask about different aspects
> of your patches until you explain the *whole* picture?

Sorry, I can not provide the internal report.

Thank you for your continued attention to this matter.

I appreciate your thoroughness, and I'd like to clarify a few points: I've made
every effort to explain the issue comprehensively in my previous responses,
following your previous instructions. And I do not have a more bigger picture.
My picture is not introducing a new feature but addressing a regression caused
by previous commits and improving system reliability. The root cause and its
impact are detailed in patches 3 and 4.

> 
>> [5056863.064239] task: ffff8837d2a2a0c0 task.stack: ffffc90065814000
>> [5056863.137299] RIP: 0010:[<ffffffff813ad231>]  [<ffffffff813ad231>] __get_user_8+0x21/0x2b
>> ...
>> [5056864.512018] Call Trace:
> 
> This tells me exactly 0 - I see some truncated stack trace.
> 
>> Sorry, I did not get your point.
> 
> I don't get your text either. Until this is explained and debugged properly,
> it is not going anywhere.
> 

Tony helped to answer this answer. If you are a asking for how futex trigger a
poison and handle it, hope bellow callstack helps.

futex(2)
     do_futex
         futex_wait
	    __futex_wait
	        futex_wait_setup
    		    get_user // return -EFAULT to top futex(2)

I've strived to provide all the relevant information within the constraints I'm
operating under. If there are specific aspects you feel need further
clarification, please let me know, and I'll do my best to address them.

Thank you for your patience and guidance throughout this process. I look
forward to your feedback and to moving this patch forward.

Shuai
Borislav Petkov Feb. 19, 2025, 8:10 a.m. UTC | #14
On Tue, Feb 18, 2025 at 05:30:19PM +0000, Luck, Tony wrote:
First of all, thanks for explaining - that helps a lot!

> That's how we ended up with *UN*corrected errors tied to *C*MCI.
> 
> Just to add to the confusion, Linux does take an action (in uc_decode_notifier())
> to try to offline the page despite the UC*NA* signature name.

So, AFAIU, hw folks are basically trying to tell us: well, this is
*technically* an uncorrectable error but meh, not really important. We just
met it while fetching some data while scrubbing so who knows whether you'll
consume it or not. Meh...

So why don't we simply do that?

We report the signature but we do not try to offline anything. When we get to
*actually* consume it non-speculatively, *then* we run memory failure and then
we offline the page.

Hmmm?

Would that solve that particular debacle?

Thx.
Luck, Tony Feb. 19, 2025, 5:11 p.m. UTC | #15
> First of all, thanks for explaining - that helps a lot!
>
> > That's how we ended up with *UN*corrected errors tied to *C*MCI.
> >
> > Just to add to the confusion, Linux does take an action (in uc_decode_notifier())
> > to try to offline the page despite the UC*NA* signature name.
>
> So, AFAIU, hw folks are basically trying to tell us: well, this is
> *technically* an uncorrectable error but meh, not really important. We just
> met it while fetching some data while scrubbing so who knows whether you'll
> consume it or not. Meh...
>
> So why don't we simply do that?

We could, but I don't like it much. By taking the page offline from the relatively
kind environment of a regular interrupt, we often avoid taking a machine check
(which is an unfriendly environment for software).

> We report the signature but we do not try to offline anything. When we get to
> *actually* consume it non-speculatively, *then* we run memory failure and then
> we offline the page.
>
> Hmmm?
>
> Would that solve that particular debacle?

Perhaps. It removes the race. But at the cost of always taking a machine
check instead of frequently avoiding it with the uc_decode_notifier() offline.

Modern Intel Xeons (>= SkyLake) support local machine check[1]. Even
newer Xeons report #MC as recoverable from pretty much all user mode
poison consumption. So machine check isn't as painful on modern systems
as it used to be.

We could make the action in uc_decode_notifier() configurable. Default=off
but with a command line option to enable for systems that are stuck with
broadcast machine checks.

On Intel that would mean not registering the notifier at all. What about AMD?
Do you have similar races for MCE_DEFERRED_SEVERITY errors?

-Tony

[1] Some OEMs still do not enable LMCE in their BIOS.
Borislav Petkov Feb. 20, 2025, 11:19 a.m. UTC | #16
On Wed, Feb 19, 2025 at 05:11:00PM +0000, Luck, Tony wrote:
> We could, but I don't like it much. By taking the page offline from the relatively
> kind environment of a regular interrupt, we often avoid taking a machine check
> (which is an unfriendly environment for software).

Right.

> We could make the action in uc_decode_notifier() configurable. Default=off
> but with a command line option to enable for systems that are stuck with
> broadcast machine checks.

So we can figure that out during boot - no need for yet another cmdline
option.

It still doesn't fix the race and I'd like to fix that instead, in the optimal
case.

But looking at Shuai's patch, I guess fixing the reporting is fine too - we
need to fix the commit message to explain why this thing even happens.

I.e., basically what you wrote and Shuai could use that explanation to write
a commit message explaining what the situation is along with the background so
that when we go back to this later, we will actually know what is going on.

But looking at

  046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")

That thing was trying to fix the same reporting fail. Why didn't it do that?

Ooooh, now I see what the issue is. He doesn't want to kill the process which
gets the wrong SIGBUS. Maybe the commit title should've said that:

  mm/hwpoison: Do not send SIGBUS to processes with recovered clean pages

or so.

But how/why is that ok?

Are we confident that

+        * ret = 0 when poison page is a clean page and it's dropped, no
+        * SIGBUS is needed.

can *always* and *only* happen when there's a CMCI *and* a #MC race and the
CMCI has won the race?

Can memory poison return 0 there too, for another reason and we end up *not
killing* a process which we should have?

Hmmm.

> On Intel that would mean not registering the notifier at all. What about AMD?
> Do you have similar races for MCE_DEFERRED_SEVERITY errors?

Probably. Lemme ask around.

> [1] Some OEMs still do not enable LMCE in their BIOS.

Oh, ofc. Gotta love BIOS. They'll get the message when LMCE becomes obsolete,
trust me.

Are we force-enabling LMCE in this case when booting?
Luck, Tony Feb. 20, 2025, 5:50 p.m. UTC | #17
> > We could, but I don't like it much. By taking the page offline from the relatively
> > kind environment of a regular interrupt, we often avoid taking a machine check
> > (which is an unfriendly environment for software).
>
> Right.
>
> > We could make the action in uc_decode_notifier() configurable. Default=off
> > but with a command line option to enable for systems that are stuck with
> > broadcast machine checks.
>
> So we can figure that out during boot - no need for yet another cmdline
> option.

Yup. I think the boot time test might be something like:

	// Enable UCNA offline for systems with broadcast machine check
	if (!(AMD || LMCE))
		mce_register_decode_chain(&mce_uc_nb);
>
> It still doesn't fix the race and I'd like to fix that instead, in the optimal
> case.
>
> But looking at Shuai's patch, I guess fixing the reporting is fine too - we
> need to fix the commit message to explain why this thing even happens.
>
> I.e., basically what you wrote and Shuai could use that explanation to write
> a commit message explaining what the situation is along with the background so
> that when we go back to this later, we will actually know what is going on.

Agreed. Shaui needs to harvest this thread to fill out the details in the commit
messages.

>
> But looking at
>
>   046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
>
> That thing was trying to fix the same reporting fail. Why didn't it do that?
>
> Ooooh, now I see what the issue is. He doesn't want to kill the process which
> gets the wrong SIGBUS. Maybe the commit title should've said that:
>
>   mm/hwpoison: Do not send SIGBUS to processes with recovered clean pages
>
> or so.
>
> But how/why is that ok?
>
> Are we confident that
>
> +        * ret = 0 when poison page is a clean page and it's dropped, no
> +        * SIGBUS is needed.
>
> can *always* and *only* happen when there's a CMCI *and* a #MC race and the
> CMCI has won the race?

There are probably other races. Two CPUs both take local #MC on the same page
(maybe not all that rare in threaded processes ... or even with some hot code in 
a shared library).

> Can memory poison return 0 there too, for another reason and we end up *not
> killing* a process which we should have?
>
> Hmmm.

Hmmm indeed. Needs some thought. Though failing to kill a process likely means
it retries the access and comes right back to try again (without the race this time).

>
> > On Intel that would mean not registering the notifier at all. What about AMD?
> > Do you have similar races for MCE_DEFERRED_SEVERITY errors?
>
> Probably. Lemme ask around.
>
> > [1] Some OEMs still do not enable LMCE in their BIOS.
>
> Oh, ofc. Gotta love BIOS. They'll get the message when LMCE becomes obsolete,
> trust me.
>
> Are we force-enabling LMCE in this case when booting?

Linux tries to enable if LMCE is supported, but BIOS has veto power.
See the bit in lmce_supported() that checks MSR_IA32_FEAT_CTL

-Tony
Shuai Xue Feb. 21, 2025, 6:05 a.m. UTC | #18
在 2025/2/21 01:50, Luck, Tony 写道:
>>> We could, but I don't like it much. By taking the page offline from the relatively
>>> kind environment of a regular interrupt, we often avoid taking a machine check
>>> (which is an unfriendly environment for software).
>>
>> Right.
>>
>>> We could make the action in uc_decode_notifier() configurable. Default=off
>>> but with a command line option to enable for systems that are stuck with
>>> broadcast machine checks.
>>
>> So we can figure that out during boot - no need for yet another cmdline
>> option.
> 
> Yup. I think the boot time test might be something like:
> 
> 	// Enable UCNA offline for systems with broadcast machine check
> 	if (!(AMD || LMCE))
> 		mce_register_decode_chain(&mce_uc_nb);
>>
>> It still doesn't fix the race and I'd like to fix that instead, in the optimal
>> case.
>>
>> But looking at Shuai's patch, I guess fixing the reporting is fine too - we
>> need to fix the commit message to explain why this thing even happens.
>>
>> I.e., basically what you wrote and Shuai could use that explanation to write
>> a commit message explaining what the situation is along with the background so
>> that when we go back to this later, we will actually know what is going on.
> 
> Agreed. Shaui needs to harvest this thread to fill out the details in the commit
> messages.

Sure, I'd like to add more backgroud details with Tony's explanation.

> 
>>
>> But looking at
>>
>>    046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
>>
>> That thing was trying to fix the same reporting fail. Why didn't it do that?
>>
>> Ooooh, now I see what the issue is. He doesn't want to kill the process which
>> gets the wrong SIGBUS. Maybe the commit title should've said that:
>>
>>    mm/hwpoison: Do not send SIGBUS to processes with recovered clean pages
>>
>> or so.
>>
>> But how/why is that ok?
>>
>> Are we confident that
>>
>> +        * ret = 0 when poison page is a clean page and it's dropped, no
>> +        * SIGBUS is needed.
>>
>> can *always* and *only* happen when there's a CMCI *and* a #MC race and the
>> CMCI has won the race?
> 
> There are probably other races. Two CPUs both take local #MC on the same page
> (maybe not all that rare in threaded processes ... or even with some hot code in
> a shared library).
> 
>> Can memory poison return 0 there too, for another reason and we end up *not
>> killing* a process which we should have?
>>
>> Hmmm.
> 
> Hmmm indeed. Needs some thought. Though failing to kill a process likely means
> it retries the access and comes right back to try again (without the race this time).
> 

Emmm, if two threaded processes consume a poisond data, there may three CPUs
race, two of which take local #MC on the same page and one take CMCI. For,
example:

#perf script
kworker/48:1-mm 25516 [048]  1713.893549: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa25aa93 uc_decode_notifier+0x73 ([kernel.kallsyms])
         ffffffffaa3068bb notifier_call_chain+0x5b ([kernel.kallsyms])
         ffffffffaa306ae1 blocking_notifier_call_chain+0x41 ([kernel.kallsyms])
         ffffffffaa25bbfe mce_gen_pool_process+0x3e ([kernel.kallsyms])
         ffffffffaa2f455f process_one_work+0x19f ([kernel.kallsyms])
         ffffffffaa2f509c worker_thread+0x20c ([kernel.kallsyms])
         ffffffffaa2fec89 kthread+0xd9 ([kernel.kallsyms])
         ffffffffaa245131 ret_from_fork+0x31 ([kernel.kallsyms])
         ffffffffaa2076ca ret_from_fork_asm+0x1a ([kernel.kallsyms])

einj_mem_uc 44530 [184]  1713.908089: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 44531 [089]  1713.916319: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

It seems to complicate the issue further.

IMHO, we should focus on three main points:

- kill_accessing_process() is only called when the flags are set to
   MF_ACTION_REQUIRED, which means it is in the MCE path.
- Whether the page is clean determines the behavior of try_to_unmap. For a
   dirty page, try_to_unmap uses TTU_HWPOISON to unmap the PTE and convert the
   PTE entry to a swap entry. For a clean page, try_to_unmap uses ~TTU_HWPOISON
   and simply unmaps the PTE.
- When does walk_page_range() with hwpoison_walk_ops return 1?
   1. If the poison page still exists, we should of course kill the current
      process.
   2. If the poison page does not exist, but is_hwpoison_entry is true, meaning
      it is a dirty page, we should also kill the current process, too.
   3. Otherwise, it returns 0, which means the page is clean.


Thanks.
Shuai