Message ID | 20250217063335.22257-1-xueshuai@linux.alibaba.com (mailing list archive) |
---|---|
Headers | show |
Series | mm/hwpoison: Fix regressions in memory failure handling | expand |
On Mon, 17 Feb 2025 14:33:30 +0800 Shuai Xue <xueshuai@linux.alibaba.com> wrote: > This patch addresses three regressions identified in memory failure > handling, as discovered using ras-tools[1]: Thanks. I added these to mm.git for further exposure and testing. They do appear to be more an x86 thing so I'll do the usual thing: if they later turn up in the x86 tree I shall drop the mm.git copies.
On Mon, Feb 17, 2025 at 07:29:54PM -0800, Andrew Morton wrote: > On Mon, 17 Feb 2025 14:33:30 +0800 Shuai Xue <xueshuai@linux.alibaba.com> wrote: > > > This patch addresses three regressions identified in memory failure > > handling, as discovered using ras-tools[1]: > > Thanks. I added these to mm.git for further exposure and testing. > They do appear to be more an x86 thing so I'll do the usual thing: if > they later turn up in the x86 tree I shall drop the mm.git copies. Please drop 'em. They look confused and need proper review first. Also, your commits have "Cc: Acked-by:Thomas Gleixner <tglx@linutronix.de>" When did tglx ack them? I don't see anything in my mbox. Andrew, again, please do not take unreviewed x86 patches through your tree without synchronizing with us. This keeps happening and it is not helping at all - the opposite is happening. :-( Thx.
On Mon, Feb 17, 2025 at 02:33:30PM +0800, Shuai Xue wrote: > changes singce v1: > - Patch 1: Fix cur_sev and sev type to `int` per Tony > - Patch 4: Fix return value to 0 for clean pages per Miaohe > - Patch 5: pick return value comments of memory-failure() > > This patch addresses three regressions identified in memory failure > handling, as discovered using ras-tools[1]: > > - `./einj_mem_uc copyin -f` > - `./einj_mem_uc futex -f` > - `./einj_mem_uc instr` This is not how you write a problem statement and explain why your patches exist. You need to state: 1. What are you trying to do 2. What is the expected outcome and why 3. What actually happens and why 4. The fix, in your opinion, should be X or Y Not quote some ras tools commands. Show me that you actually know what you're doing and explain the problem in human understandable way. And then we can talk fixes. Thx.
在 2025/2/18 16:27, Borislav Petkov 写道: > On Mon, Feb 17, 2025 at 02:33:30PM +0800, Shuai Xue wrote: >> changes singce v1: >> - Patch 1: Fix cur_sev and sev type to `int` per Tony >> - Patch 4: Fix return value to 0 for clean pages per Miaohe >> - Patch 5: pick return value comments of memory-failure() >> >> This patch addresses three regressions identified in memory failure >> handling, as discovered using ras-tools[1]: >> >> - `./einj_mem_uc copyin -f` >> - `./einj_mem_uc futex -f` >> - `./einj_mem_uc instr` > > This is not how you write a problem statement and explain why your patches > exist. > > You need to state: > > 1. What are you trying to do > 2. What is the expected outcome and why > 3. What actually happens and why > 4. The fix, in your opinion, should be X or Y > > Not quote some ras tools commands. Show me that you actually know what you're > doing and explain the problem in human understandable way. And then we can > talk fixes. > > Thx. > Sorry for the confusion. > 1. What are you trying to do I am tring to fix two memory failure regression in upstream kernel compared with 5.10 LTS. - copyin case: poison found while copying from user space. - instr case: poison found while instruction fetching in user space > 2. What is the expected outcome and why For copyin case: Kernel can recover from poison found while copying from user space. MCE check the fixup handler type to decide whether an in kernel #MC can be recovered. When EX_TYPE_UACCESS is found, the PC jumps to recovery code specified in _ASM_EXTABLE_FAULT() and return a -EFAULT to user space. For instr case: If a poison found while instruction fetching in user space, full recovery is possible. User process takes #PF, Linux allocates a new page and fills by reading from storage. > 3. What actually happens and why For copyin case: kernel panic since v5.17 Commit 4c132d1d844a ("x86/futex: Remove .fixup usage") introduced a new extable fixup type, EX_TYPE_EFAULT_REG, and later patches updated the extable fixup type for copy-from-user operations, changing it from EX_TYPE_UACCESS to EX_TYPE_EFAULT_REG. For instr case: user process is killed by a SIGBUS signal Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"") introduced a bug that kill_accessing_process() return -EHWPOISON for instr case, as result, kill_me_maybe() send a SIGBUS to user process. > 4. The fix, in your opinion, should be X or Y For copyin case: add EX_TYPE_EFAULT_REG as a recovery type. For instr case: let kill_accessing_process return 0 to prevent a SIGBUS. For patch 1 and 2: While debuging the two regression, I found `msg` in predefined `severities`, e.g. MCESEV( AO, "Action optional: last level cache writeback error", SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB) ), is helpful for me to know what kind of MCE is happened. For a fatal machine check, kernel panic use the message and I want to extend to collect the message and print it out for non-fatal one. For patch 5: The return value of memory_failure() is quite important while discussed instr case regression with Tony and Miaohe for patch 4, so move comment to the place it belongs to. I hope the information provided above effectively addresses your concerns. Please feel free to let me know if you have any further questions or need additional clarification. Thanks. Shuai
On Tue, Feb 18, 2025 at 07:31:34PM +0800, Shuai Xue wrote: > Kernel can recover from poison found while copying from user space. Where was that poison found? On user pages? So reading them consumes the poison? So you're not really seeing real issues on real hw - you're using ras tools to trigger those, correct? If so, what guarantees ras tools are doing the right thing? > MCE check the fixup handler type to decide whether an in kernel #MC can be > recovered. When EX_TYPE_UACCESS is found, Sounds like poison on user memory... > the PC jumps to recovery code specified in _ASM_EXTABLE_FAULT() and return > a -EFAULT to user space. > For instr case: > > If a poison found while instruction fetching in user space, full recovery is > possible. User process takes #PF, Linux allocates a new page and fills by > reading from storage. > > > 3. What actually happens and why > > For copyin case: kernel panic since v5.17 > > Commit 4c132d1d844a ("x86/futex: Remove .fixup usage") introduced a new extable > fixup type, EX_TYPE_EFAULT_REG, and later patches updated the extable fixup > type for copy-from-user operations, changing it from EX_TYPE_UACCESS to > EX_TYPE_EFAULT_REG. What do futexes have to do with copying user memory? > For instr case: user process is killed by a SIGBUS signal > > Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported "not > recovered"") introduced a bug that kill_accessing_process() return -EHWPOISON > for instr case, as result, kill_me_maybe() send a SIGBUS to user process. This makes my head hurt... a race between the CMCI reporting an uncorrected error... why does the CMCI report uncorrected errors? This sounds like some nasty confusion. And you've basically reused the format and wording of 046545a661af for your commit message and makes staring at those a PITA. Tony, what's going on with that CMCI and SRAR race?
在 2025/2/18 20:24, Borislav Petkov 写道: > On Tue, Feb 18, 2025 at 07:31:34PM +0800, Shuai Xue wrote: >> Kernel can recover from poison found while copying from user space. > > Where was that poison found? On user pages? So reading them consumes the > poison? Yes, the poison is found on user pages. Form commit log, the mechanism is added by Tony and suggested by you. https://lkml.kernel.org/r/20210818002942.1607544-3-tony.luck@intel.com > > So you're not really seeing real issues on real hw - you're using ras tools to > trigger those, correct? > > If so, what guarantees ras tools are doing the right thing? Ras-tools do it by three step: - alloc memory in userspace - inject two bit ECC error (uncorretable error) to the memory (by EINJ interface) - write the memory to a file fd ( by write(2) ) It's the same as with real issue. There's no magic to it. Doesn't AMD support it? > >> MCE check the fixup handler type to decide whether an in kernel #MC can be >> recovered. When EX_TYPE_UACCESS is found, > > Sounds like poison on user memory... Yes, sorry for confusion. > >> the PC jumps to recovery code specified in _ASM_EXTABLE_FAULT() and return >> a -EFAULT to user space. > >> For instr case: >> >> If a poison found while instruction fetching in user space, full recovery is >> possible. User process takes #PF, Linux allocates a new page and fills by >> reading from storage. >> >>> 3. What actually happens and why >> >> For copyin case: kernel panic since v5.17 >> >> Commit 4c132d1d844a ("x86/futex: Remove .fixup usage") introduced a new extable >> fixup type, EX_TYPE_EFAULT_REG, and later patches updated the extable fixup >> type for copy-from-user operations, changing it from EX_TYPE_UACCESS to >> EX_TYPE_EFAULT_REG. > > What do futexes have to do with copying user memory? Return -EFAULT to userspace. > >> For instr case: user process is killed by a SIGBUS signal >> >> Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported "not >> recovered"") introduced a bug that kill_accessing_process() return -EHWPOISON >> for instr case, as result, kill_me_maybe() send a SIGBUS to user process. > > This makes my head hurt... a race between the CMCI reporting an uncorrected > error... why does the CMCI report uncorrected errors? This sounds like some > nasty confusion. > > And you've basically reused the format and wording of 046545a661af for your > commit message and makes staring at those a PITA. > > Tony, what's going on with that CMCI and SRAR race? > I try to answer why the CMCI reporting an uncorrected error. Tony, please correct me if I missed anyting. When core issue a memory to a memory controller finds a 2 bit ECC error, it will pass data with a poison flag through bus. 1. Home Agent logs UNCA error and signals CMCI ifenable. 2. Home Agent forwards data with poison indication bit set. 3. DCU detects the posion data, logs SRAR eror and triggers #MCE if recoverable. Thanks. Shuai
On Tue, Feb 18, 2025 at 09:08:25PM +0800, Shuai Xue wrote: > Yes, the poison is found on user pages. > > Form commit log, the mechanism is added by Tony and suggested by you. > https://lkml.kernel.org/r/20210818002942.1607544-3-tony.luck@intel.com I'm not talking about how it is detected - I'm asking about *what* you're doing exactly. I want to figure out what and why you're doing what you're doing. > It's the same as with real issue. There's no magic to it. Magic or not, doesn't matter. The only question is whether this can happen in real life and it is not just you using some tools and "fixing" things that ain't broke. > > What do futexes have to do with copying user memory? > > Return -EFAULT to userspace. This doesn't even begin to answer my question so I'll ask again: "What do futexes have to do with copying user memory?"
在 2025/2/18 21:17, Borislav Petkov 写道: > On Tue, Feb 18, 2025 at 09:08:25PM +0800, Shuai Xue wrote: >> Yes, the poison is found on user pages. >> >> Form commit log, the mechanism is added by Tony and suggested by you. >> https://lkml.kernel.org/r/20210818002942.1607544-3-tony.luck@intel.com > > I'm not talking about how it is detected - I'm asking about *what* you're > doing exactly. I want to figure out what and why you're doing what you're > doing. > >> It's the same as with real issue. There's no magic to it. > > Magic or not, doesn't matter. The only question is whether this can happen in > real life and it is not just you using some tools and "fixing" things that > ain't broke. The regression is reported by end user and we also observed in the production. [5056863.064239] task: ffff8837d2a2a0c0 task.stack: ffffc90065814000 [5056863.137299] RIP: 0010:[<ffffffff813ad231>] [<ffffffff813ad231>] __get_user_8+0x21/0x2b ... [5056864.512018] Call Trace: [5056864.543440] [<ffffffff8111c203>] ? exit_robust_list+0x33/0x110 [5056864.616456] [<ffffffff81088399>] mm_release+0x109/0x140 [5056864.682178] [<ffffffff8108faf9>] do_exit+0x159/0xb60 [5056864.744785] [<ffffffff81090583>] do_group_exit+0x43/0xb0 [5056864.811551] [<ffffffff8109bdc9>] get_signal+0x289/0x630 [5056864.877277] [<ffffffff8102d227>] do_signal+0x37/0x690 [5056864.940925] [<ffffffff8111c4e5>] ? do_futex+0x205/0x520 [5056865.006651] [<ffffffff8111c885>] ? SyS_futex+0x85/0x170 [5056865.072378] [<ffffffff81003726>] exit_to_usermode_loop+0x76/0xc0 [5056865.147464] [<ffffffff81003d01>] do_syscall_64+0x171/0x180 [5056865.216311] [<ffffffff81741c8e>] entry_SYSCALL_64_after_swapgs+0x58/0xc6 > >>> What do futexes have to do with copying user memory? >> >> Return -EFAULT to userspace. > > This doesn't even begin to answer my question so I'll ask again: > > "What do futexes have to do with copying user memory?" > Sorry, I did not get your point. Thanks. Shuai
On Tue, Feb 18, 2025 at 09:53:17PM +0800, Shuai Xue wrote: > The regression is reported by end user and we also observed in the production. Where is that report? How many times do I have to ask about different aspects of your patches until you explain the *whole* picture? > [5056863.064239] task: ffff8837d2a2a0c0 task.stack: ffffc90065814000 > [5056863.137299] RIP: 0010:[<ffffffff813ad231>] [<ffffffff813ad231>] __get_user_8+0x21/0x2b > ... > [5056864.512018] Call Trace: This tells me exactly 0 - I see some truncated stack trace. > Sorry, I did not get your point. I don't get your text either. Until this is explained and debugged properly, it is not going anywhere.
> > For instr case: user process is killed by a SIGBUS signal > > > > Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported "not > > recovered"") introduced a bug that kill_accessing_process() return -EHWPOISON > > for instr case, as result, kill_me_maybe() send a SIGBUS to user process. > > This makes my head hurt... a race between the CMCI reporting an uncorrected > error... why does the CMCI report uncorrected errors? This sounds like some > nasty confusion. My head hurts too. The problem is the evolution and subsequent overloading of limited signal options in Intel error reporting. Prior to Icelake memory controllers reported patrol scrub events that detected a previously unseen uncorrected error in memory by signaling a broadcast machine check with an SRAO (Software Recoverable Action Optional) signature in the machine check bank. This was overkill. It's not an urgent problem. No core is on the verge of consuming that bad data. But the fix causes the confusion. The machine check bank signature was changed to UCNA (Uncorrected, No Action required), and signal changed to #CMCI (since that was the only option available in the toolbox :-( That's how we ended up with *UN*corrected errors tied to *C*MCI. Just to add to the confusion, Linux does take an action (in uc_decode_notifier()) to try to offline the page despite the UC*NA* signature name. > And you've basically reused the format and wording of 046545a661af for your > commit message and makes staring at those a PITA. > > Tony, what's going on with that CMCI and SRAR race? Now the race ... having decided that CMCI/UCNA is the best action for patrol scrub errors, the memory controller uses it for reads too. But the memory controller is executing asynchronously from the core, and can't tell the difference between a "real" read and a speculative read. So it will do CMCI/UCNA if an error is found in any read. Thus: 1) Core is clever and thinks address A is needed soon, issues a speculative read. 2) Core finds it is going to use address A soon after sending the read request 3) The CMCI from the memory controller is in a race with the core that will soon try to retire the load from address A. Quite often (because speculation has got better) the CMCI from the memory controller is delivered before the core is committed to the instruction reading address A, so the interrupt is taken, and Linux offlines the page (marking it as poison). When the interrupt returns, the core gets to the load instruction, and gets a #PF because the offline process marked the page not-present and flushed the TLB. Finally the #PF handler tries to fix the page fault, sees that page is marked as poison so sends SIGBUS to the process. Note, AMD might have a similar race with the MCE_DEFERRED_SEVERITY signal? (but with less confusing naming). -Tony
>> What do futexes have to do with copying user memory? > > Return -EFAULT to userspace. Missed this bit. Kernel code for futex does a get_user() to read the value of the futex from user memory. -Tony
在 2025/2/19 01:59, Luck, Tony 写道: >>> What do futexes have to do with copying user memory? >> >> Return -EFAULT to userspace. > > Missed this bit. Kernel code for futex does a get_user() to read the > value of the futex from user memory. > > -Tony > > Tony, you saved me. Thanks. Shuai
在 2025/2/18 23:31, Borislav Petkov 写道: > On Tue, Feb 18, 2025 at 09:53:17PM +0800, Shuai Xue wrote: >> The regression is reported by end user and we also observed in the production. > > Where is that report? How many times do I have to ask about different aspects > of your patches until you explain the *whole* picture? Sorry, I can not provide the internal report. Thank you for your continued attention to this matter. I appreciate your thoroughness, and I'd like to clarify a few points: I've made every effort to explain the issue comprehensively in my previous responses, following your previous instructions. And I do not have a more bigger picture. My picture is not introducing a new feature but addressing a regression caused by previous commits and improving system reliability. The root cause and its impact are detailed in patches 3 and 4. > >> [5056863.064239] task: ffff8837d2a2a0c0 task.stack: ffffc90065814000 >> [5056863.137299] RIP: 0010:[<ffffffff813ad231>] [<ffffffff813ad231>] __get_user_8+0x21/0x2b >> ... >> [5056864.512018] Call Trace: > > This tells me exactly 0 - I see some truncated stack trace. > >> Sorry, I did not get your point. > > I don't get your text either. Until this is explained and debugged properly, > it is not going anywhere. > Tony helped to answer this answer. If you are a asking for how futex trigger a poison and handle it, hope bellow callstack helps. futex(2) do_futex futex_wait __futex_wait futex_wait_setup get_user // return -EFAULT to top futex(2) I've strived to provide all the relevant information within the constraints I'm operating under. If there are specific aspects you feel need further clarification, please let me know, and I'll do my best to address them. Thank you for your patience and guidance throughout this process. I look forward to your feedback and to moving this patch forward. Shuai
On Tue, Feb 18, 2025 at 05:30:19PM +0000, Luck, Tony wrote: First of all, thanks for explaining - that helps a lot! > That's how we ended up with *UN*corrected errors tied to *C*MCI. > > Just to add to the confusion, Linux does take an action (in uc_decode_notifier()) > to try to offline the page despite the UC*NA* signature name. So, AFAIU, hw folks are basically trying to tell us: well, this is *technically* an uncorrectable error but meh, not really important. We just met it while fetching some data while scrubbing so who knows whether you'll consume it or not. Meh... So why don't we simply do that? We report the signature but we do not try to offline anything. When we get to *actually* consume it non-speculatively, *then* we run memory failure and then we offline the page. Hmmm? Would that solve that particular debacle? Thx.
> First of all, thanks for explaining - that helps a lot! > > > That's how we ended up with *UN*corrected errors tied to *C*MCI. > > > > Just to add to the confusion, Linux does take an action (in uc_decode_notifier()) > > to try to offline the page despite the UC*NA* signature name. > > So, AFAIU, hw folks are basically trying to tell us: well, this is > *technically* an uncorrectable error but meh, not really important. We just > met it while fetching some data while scrubbing so who knows whether you'll > consume it or not. Meh... > > So why don't we simply do that? We could, but I don't like it much. By taking the page offline from the relatively kind environment of a regular interrupt, we often avoid taking a machine check (which is an unfriendly environment for software). > We report the signature but we do not try to offline anything. When we get to > *actually* consume it non-speculatively, *then* we run memory failure and then > we offline the page. > > Hmmm? > > Would that solve that particular debacle? Perhaps. It removes the race. But at the cost of always taking a machine check instead of frequently avoiding it with the uc_decode_notifier() offline. Modern Intel Xeons (>= SkyLake) support local machine check[1]. Even newer Xeons report #MC as recoverable from pretty much all user mode poison consumption. So machine check isn't as painful on modern systems as it used to be. We could make the action in uc_decode_notifier() configurable. Default=off but with a command line option to enable for systems that are stuck with broadcast machine checks. On Intel that would mean not registering the notifier at all. What about AMD? Do you have similar races for MCE_DEFERRED_SEVERITY errors? -Tony [1] Some OEMs still do not enable LMCE in their BIOS.
On Wed, Feb 19, 2025 at 05:11:00PM +0000, Luck, Tony wrote: > We could, but I don't like it much. By taking the page offline from the relatively > kind environment of a regular interrupt, we often avoid taking a machine check > (which is an unfriendly environment for software). Right. > We could make the action in uc_decode_notifier() configurable. Default=off > but with a command line option to enable for systems that are stuck with > broadcast machine checks. So we can figure that out during boot - no need for yet another cmdline option. It still doesn't fix the race and I'd like to fix that instead, in the optimal case. But looking at Shuai's patch, I guess fixing the reporting is fine too - we need to fix the commit message to explain why this thing even happens. I.e., basically what you wrote and Shuai could use that explanation to write a commit message explaining what the situation is along with the background so that when we go back to this later, we will actually know what is going on. But looking at 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"") That thing was trying to fix the same reporting fail. Why didn't it do that? Ooooh, now I see what the issue is. He doesn't want to kill the process which gets the wrong SIGBUS. Maybe the commit title should've said that: mm/hwpoison: Do not send SIGBUS to processes with recovered clean pages or so. But how/why is that ok? Are we confident that + * ret = 0 when poison page is a clean page and it's dropped, no + * SIGBUS is needed. can *always* and *only* happen when there's a CMCI *and* a #MC race and the CMCI has won the race? Can memory poison return 0 there too, for another reason and we end up *not killing* a process which we should have? Hmmm. > On Intel that would mean not registering the notifier at all. What about AMD? > Do you have similar races for MCE_DEFERRED_SEVERITY errors? Probably. Lemme ask around. > [1] Some OEMs still do not enable LMCE in their BIOS. Oh, ofc. Gotta love BIOS. They'll get the message when LMCE becomes obsolete, trust me. Are we force-enabling LMCE in this case when booting?
> > We could, but I don't like it much. By taking the page offline from the relatively > > kind environment of a regular interrupt, we often avoid taking a machine check > > (which is an unfriendly environment for software). > > Right. > > > We could make the action in uc_decode_notifier() configurable. Default=off > > but with a command line option to enable for systems that are stuck with > > broadcast machine checks. > > So we can figure that out during boot - no need for yet another cmdline > option. Yup. I think the boot time test might be something like: // Enable UCNA offline for systems with broadcast machine check if (!(AMD || LMCE)) mce_register_decode_chain(&mce_uc_nb); > > It still doesn't fix the race and I'd like to fix that instead, in the optimal > case. > > But looking at Shuai's patch, I guess fixing the reporting is fine too - we > need to fix the commit message to explain why this thing even happens. > > I.e., basically what you wrote and Shuai could use that explanation to write > a commit message explaining what the situation is along with the background so > that when we go back to this later, we will actually know what is going on. Agreed. Shaui needs to harvest this thread to fill out the details in the commit messages. > > But looking at > > 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"") > > That thing was trying to fix the same reporting fail. Why didn't it do that? > > Ooooh, now I see what the issue is. He doesn't want to kill the process which > gets the wrong SIGBUS. Maybe the commit title should've said that: > > mm/hwpoison: Do not send SIGBUS to processes with recovered clean pages > > or so. > > But how/why is that ok? > > Are we confident that > > + * ret = 0 when poison page is a clean page and it's dropped, no > + * SIGBUS is needed. > > can *always* and *only* happen when there's a CMCI *and* a #MC race and the > CMCI has won the race? There are probably other races. Two CPUs both take local #MC on the same page (maybe not all that rare in threaded processes ... or even with some hot code in a shared library). > Can memory poison return 0 there too, for another reason and we end up *not > killing* a process which we should have? > > Hmmm. Hmmm indeed. Needs some thought. Though failing to kill a process likely means it retries the access and comes right back to try again (without the race this time). > > > On Intel that would mean not registering the notifier at all. What about AMD? > > Do you have similar races for MCE_DEFERRED_SEVERITY errors? > > Probably. Lemme ask around. > > > [1] Some OEMs still do not enable LMCE in their BIOS. > > Oh, ofc. Gotta love BIOS. They'll get the message when LMCE becomes obsolete, > trust me. > > Are we force-enabling LMCE in this case when booting? Linux tries to enable if LMCE is supported, but BIOS has veto power. See the bit in lmce_supported() that checks MSR_IA32_FEAT_CTL -Tony
在 2025/2/21 01:50, Luck, Tony 写道: >>> We could, but I don't like it much. By taking the page offline from the relatively >>> kind environment of a regular interrupt, we often avoid taking a machine check >>> (which is an unfriendly environment for software). >> >> Right. >> >>> We could make the action in uc_decode_notifier() configurable. Default=off >>> but with a command line option to enable for systems that are stuck with >>> broadcast machine checks. >> >> So we can figure that out during boot - no need for yet another cmdline >> option. > > Yup. I think the boot time test might be something like: > > // Enable UCNA offline for systems with broadcast machine check > if (!(AMD || LMCE)) > mce_register_decode_chain(&mce_uc_nb); >> >> It still doesn't fix the race and I'd like to fix that instead, in the optimal >> case. >> >> But looking at Shuai's patch, I guess fixing the reporting is fine too - we >> need to fix the commit message to explain why this thing even happens. >> >> I.e., basically what you wrote and Shuai could use that explanation to write >> a commit message explaining what the situation is along with the background so >> that when we go back to this later, we will actually know what is going on. > > Agreed. Shaui needs to harvest this thread to fill out the details in the commit > messages. Sure, I'd like to add more backgroud details with Tony's explanation. > >> >> But looking at >> >> 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"") >> >> That thing was trying to fix the same reporting fail. Why didn't it do that? >> >> Ooooh, now I see what the issue is. He doesn't want to kill the process which >> gets the wrong SIGBUS. Maybe the commit title should've said that: >> >> mm/hwpoison: Do not send SIGBUS to processes with recovered clean pages >> >> or so. >> >> But how/why is that ok? >> >> Are we confident that >> >> + * ret = 0 when poison page is a clean page and it's dropped, no >> + * SIGBUS is needed. >> >> can *always* and *only* happen when there's a CMCI *and* a #MC race and the >> CMCI has won the race? > > There are probably other races. Two CPUs both take local #MC on the same page > (maybe not all that rare in threaded processes ... or even with some hot code in > a shared library). > >> Can memory poison return 0 there too, for another reason and we end up *not >> killing* a process which we should have? >> >> Hmmm. > > Hmmm indeed. Needs some thought. Though failing to kill a process likely means > it retries the access and comes right back to try again (without the race this time). > Emmm, if two threaded processes consume a poisond data, there may three CPUs race, two of which take local #MC on the same page and one take CMCI. For, example: #perf script kworker/48:1-mm 25516 [048] 1713.893549: probe:memory_failure: (ffffffffaa622db4) ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms]) ffffffffaa25aa93 uc_decode_notifier+0x73 ([kernel.kallsyms]) ffffffffaa3068bb notifier_call_chain+0x5b ([kernel.kallsyms]) ffffffffaa306ae1 blocking_notifier_call_chain+0x41 ([kernel.kallsyms]) ffffffffaa25bbfe mce_gen_pool_process+0x3e ([kernel.kallsyms]) ffffffffaa2f455f process_one_work+0x19f ([kernel.kallsyms]) ffffffffaa2f509c worker_thread+0x20c ([kernel.kallsyms]) ffffffffaa2fec89 kthread+0xd9 ([kernel.kallsyms]) ffffffffaa245131 ret_from_fork+0x31 ([kernel.kallsyms]) ffffffffaa2076ca ret_from_fork_asm+0x1a ([kernel.kallsyms]) einj_mem_uc 44530 [184] 1713.908089: probe:memory_failure: (ffffffffaa622db4) ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms]) ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms]) ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms]) ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms]) ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms]) ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms]) 405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc) einj_mem_uc 44531 [089] 1713.916319: probe:memory_failure: (ffffffffaa622db4) ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms]) ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms]) ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms]) ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms]) ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms]) ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms]) 405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc) It seems to complicate the issue further. IMHO, we should focus on three main points: - kill_accessing_process() is only called when the flags are set to MF_ACTION_REQUIRED, which means it is in the MCE path. - Whether the page is clean determines the behavior of try_to_unmap. For a dirty page, try_to_unmap uses TTU_HWPOISON to unmap the PTE and convert the PTE entry to a swap entry. For a clean page, try_to_unmap uses ~TTU_HWPOISON and simply unmaps the PTE. - When does walk_page_range() with hwpoison_walk_ops return 1? 1. If the poison page still exists, we should of course kill the current process. 2. If the poison page does not exist, but is_hwpoison_entry is true, meaning it is a dirty page, we should also kill the current process, too. 3. Otherwise, it returns 0, which means the page is clean. Thanks. Shuai