Message ID | 20231014051754.3759099-1-zhiquan1.li@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v3] x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic | expand |
> Co-developed-by: Youquan Song <youquan.song@intel.com> > Signed-off-by: Youquan Song <youquan.song@intel.com> > Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com> > Signed-off-by: Tony Luck <tony.luck@intel.com> > Signed-off-by: Ingo Molnar <mingo@kernel.org> > Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> > Cc: Borislav Petkov <bp@alien8.de> > Link: https://lore.kernel.org/all/20230719211625.298785-1-tony.luck@intel.com/#t This still has problems. You should have removed my Signed-off-by. You should NOT have added Ingo's Signed-off-by (the only time to add someone else's Signed-off-by is when paired with a Co-developed-by). -Tony
On 2023/10/14 13:12, Luck, Tony wrote: >> Co-developed-by: Youquan Song <youquan.song@intel.com> >> Signed-off-by: Youquan Song <youquan.song@intel.com> >> Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com> >> Signed-off-by: Tony Luck <tony.luck@intel.com> >> Signed-off-by: Ingo Molnar <mingo@kernel.org> >> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> >> Cc: Borislav Petkov <bp@alien8.de> >> Link: https://lore.kernel.org/all/20230719211625.298785-1-tony.luck@intel.com/#t > > This still has problems. You should have removed my Signed-off-by. You should NOT > have added Ingo's Signed-off-by (the only time to add someone else's Signed-off-by > is when paired with a Co-developed-by). > Oh, this is V3, not RESEND, I should reset the SoB chain. Thanks for your reminder, Tony. If my understanding is correct, the version below fixes the tag list. Best Regards, Zhiquan ========> From: Zhiquan Li <zhiquan1.li@intel.com> Date: Sat, 14 Oct 2023 13:03:17 -0600 Subject: [PATCH v3] x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic Memory errors don't happen very often, especially the severity is fatal. However, in large-scale scenarios, such as data centers, it might still happen. For some MCE fatal error cases, the kernel might call mce_panic() to terminate the production kernel directly, thus there is no opportunity to queue a task for calling memory_failure() which will try to make the kernel survive via memory failure handling. Unfortunately, the capture kernel will panic for the same reason if it touches the error memory again. The consequence is that only an incomplete vmcore is left for sustaining engineers, it's a big headache for them to make clear what happened in the past. The main task of kdump kernel is providing a "window" - /proc/vmcore, for the dump program to access old memory. A dump program running in userspace determines the "policy". Which pages need to be dumped is determined by the configuration of dump program, it reads out the pages that the sustaining engineer is interested in and excludes the rest. Likewise, the dump program can exclude the poisoned page to avoid touching the error page again, the prerequisite is the PG_hwpoison page flag is set correctly by kernel. The de facto dump program (makedumpfile) already supports this function in a decade ago. To set the PG_hwpoison flag in the production kernel just before it panics is the only missing step to make everything work. And it would not introduce additional overhead in capture kernel or conflict with other hwpoision-related code in production kernel. It leverages the already existing mechanisms to fix the issue as much as possible, so the code changes are lightweight. Co-developed-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Borislav Petkov <bp@alien8.de> --- V2: https://lore.kernel.org/all/20230914030539.1622477-1-zhiquan1.li@intel.com/ Changes since V2: - Rebased to v6.6-rc5. - Explained full scenario in commit message per Boris's suggestion. - Included Ingo's fixes. Link: https://lore.kernel.org/all/ZRsUpM%2FXtPAE50Rm@gmail.com/ V1: https://lore.kernel.org/all/20230127015030.30074-1-tony.luck@intel.com/ Changes since V1: - Revised the commit message as per Naoya's suggestion. - Replaced "TODO" comment in code with comments based on mailing list discussion on the lack of value in covering other page types. - Added the tag from Naoya. Link: https://lore.kernel.org/all/20230327083739.GA956278@hori.linux.bs1.fc.nec.co.jp/ --- arch/x86/kernel/cpu/mce/core.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 6f35f724cc14..905e80c776b8 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -233,6 +233,7 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp) struct llist_node *pending; struct mce_evt_llist *l; int apei_err = 0; + struct page *p; /* * Allow instrumentation around external facilities usage. Not that it @@ -286,6 +287,17 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp) if (!fake_panic) { if (panic_timeout == 0) panic_timeout = mca_cfg.panic_timeout; + /* + * Kdump can exclude the HWPoison page to avoid touching the error + * page again, the prerequisite is that the PG_hwpoison page flag is + * set. However, for some MCE fatal error cases, there is no + * opportunity to queue a task for calling memory_failure(), and as a + * result, the capture kernel panics. So mark the page as HWPoison + * before kernel panic() for MCE. + */ + p = pfn_to_online_page(final->addr >> PAGE_SHIFT); + if (final && (final->status & MCI_STATUS_ADDRV) && p) + SetPageHWPoison(p); panic(msg); } else pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);
On Sat, Oct 14, 2023 at 05:34:12PM +0800, Zhiquan Li wrote:
> Oh, this is V3, not RESEND, I should reset the SoB chain.
You should slow down and take the time to read the document I pointed
you at.
On Sat, Oct 14, 2023 at 05:34:12PM +0800, Zhiquan Li wrote: > Memory errors don't happen very often, especially the severity is fatal. > However, in large-scale scenarios, such as data centers, it might still > happen. For some MCE fatal error cases, the kernel might call > mce_panic() to terminate the production kernel directly, thus there is > no opportunity to queue a task for calling memory_failure() which will > try to make the kernel survive via memory failure handling. You can't "make the kernel survive" if the error has been deemed critical. That's mce_severity()'s job. If it grades the error's severity wrongly and memory_failure() should run after all, then this is a different story. > @@ -286,6 +287,17 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp) > if (!fake_panic) { > if (panic_timeout == 0) > panic_timeout = mca_cfg.panic_timeout; This whole thing... > + /* > + * Kdump can exclude the HWPoison page to avoid touching the error > + * page again, the prerequisite is that the PG_hwpoison page flag is > + * set. However, for some MCE fatal error cases, there is no > + * opportunity to queue a task for calling memory_failure(), and as a > + * result, the capture kernel panics. So mark the page as HWPoison > + * before kernel panic() for MCE. > + */ > + p = pfn_to_online_page(final->addr >> PAGE_SHIFT); > + if (final && (final->status & MCI_STATUS_ADDRV) && p) > + SetPageHWPoison(p); ... needs to be inside: if (kexec_crash_loaded() { ... } otherwise it'll be useless work on the panic path. Thx.
On 2023/10/16 17:11, Borislav Petkov wrote: >> For some MCE fatal error cases, the kernel might call >> mce_panic() to terminate the production kernel directly, thus there is >> no opportunity to queue a task for calling memory_failure() which will >> try to make the kernel survive via memory failure handling. > You can't "make the kernel survive" if the error has been deemed > critical. That's mce_severity()'s job. If it grades the error's severity > wrongly and memory_failure() should run after all, then this is > a different story. > I understand what you mean. Looks I didn't express myself well on this point and caused ambiguity. Maybe removing the attributive clause would make it brief and clear? Such as, For some MCE fatal error cases, the kernel might call mce_panic() to terminate the production kernel directly, there is no opportunity to queue a task for calling memory_failure(). Or do you have other better suggestions? Thanks. >> @@ -286,6 +287,17 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp) >> if (!fake_panic) { >> if (panic_timeout == 0) >> panic_timeout = mca_cfg.panic_timeout; > This whole thing... > >> + /* >> + * Kdump can exclude the HWPoison page to avoid touching the error >> + * page again, the prerequisite is that the PG_hwpoison page flag is >> + * set. However, for some MCE fatal error cases, there is no >> + * opportunity to queue a task for calling memory_failure(), and as a >> + * result, the capture kernel panics. So mark the page as HWPoison >> + * before kernel panic() for MCE. >> + */ >> + p = pfn_to_online_page(final->addr >> PAGE_SHIFT); >> + if (final && (final->status & MCI_STATUS_ADDRV) && p) >> + SetPageHWPoison(p); > ... needs to be inside: > > if (kexec_crash_loaded() { > ... > } > > otherwise it'll be useless work on the panic path. Good idea! The condition makes it more robust. I'll validate it and send V4. Best Regards, Zhiquan
> I understand what you mean. Looks I didn't express myself well on this > point and caused ambiguity. Maybe removing the attributive clause would > make it brief and clear? Such as, > > For some MCE fatal error cases, the kernel might call > mce_panic() to terminate the production kernel directly, there > is no opportunity to queue a task for calling memory_failure(). How about: When there is a fatal machine check Linux calls mce_panic() without checking to see if bad data at some memory address was reported in the machine check banks. If kexec is enabled, check for memory errors and mark the page as poisoned so that the kexec'd kernel can avoid accessing the page. -Tony
On 2023/10/14 18:18, Borislav Petkov wrote: > You should slow down and take the time to read the document I pointed > you at. I'd like to revise the tag list as below for next version, and reference rules are following each action. Please correct me if I still understand the rules in submitting-patches.rst wrongly. Co-developed-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Borislav Petkov <bp@alien8.de> 1) As the author of the initial patch and the person who submitting the patch, I put my SoB following Youquan's tags per below rule: Notably, the last Signed-off-by: must always be that of the developer submitting the patch. 2) Remove Tony's SoB. I had confirmed with him, he needn't Co-developed-by: tag, so the SoB added by himself in V1 and V2 should be removed. In fact, I'm not sure how to deal with such SoB for "RESEND" case. While resent V2 I retained the SoB to reflect the chain. According to my understanding, the RESEND process emphasizes "not modifying". 3) Remove Ingo's SoB. Because a new version means a new review cycle, the SoB added in V2 should be reset to reflect the *new* real route, unless Ingo needs a Co-developed-by: tag. Relative rule is following: Any further SoBs (Signed-off-by:'s) following the author's SoB are from people handling and transporting the patch, but were not involved in its development. SoB chains should reflect the *real* route a patch took as it was propagated to the maintainers and ultimately to Linus, with the first SoB entry signalling primary authorship of a single author. I missed this point while I sent V3 :-( 4) Keep Naoya's Reviewed-by: according to below rule, because there is no substantial change in V3. Both Tested-by and Reviewed-by tags, once received on mailing list from tester or reviewer, should be added by author to the applicable patches when sending next versions. However if the patch has changed substantially in following version, these tags might not be applicable anymore and thus should be removed. 5) Add Cc: tag to you per below rule :-) This is the only tag which might be added without an explicit action by the person it names - but it should indicate that this person was copied on the patch. This tag documents that potentially interested parties have been included in the discussion. Best Regards, Zhiquan
On Tue, Oct 17, 2023 at 01:24:53AM +0000, Luck, Tony wrote: > How about: > > When there is a fatal machine check Linux calls mce_panic() > without checking to see if bad data at some memory address > was reported in the machine check banks. ... for the simple reason that the kernel cannot allow itself to do any unnecessary work but panic immediately so that it can stop the propagation of bad data. Now, it's a whole different story whether that's the right thing to do and whether the data has already propagated so that the panic is moot. The whole point I'm trying to make is that the machine panics because the error severity dictates it to do so. And there's no opportunity to queue recovery work because it simply cannot in that case. So the commit message should simply state that we're marking the page as poison for the kexec'ed kernel's sake and not because of anything else. > If kexec is enabled, check for memory errors and mark the > page as poisoned so that the kexec'd kernel can avoid accessing > the page. Yap, yours makes sense. Thx.
On 2023/10/17 19:18, Borislav Petkov wrote: > ... for the simple reason that the kernel cannot allow itself to do any > unnecessary work but panic immediately so that it can stop the > propagation of bad data. > > Now, it's a whole different story whether that's the right thing to do > and whether the data has already propagated so that the panic is moot. > > The whole point I'm trying to make is that the machine panics because > the error severity dictates it to do so. And there's no opportunity to > queue recovery work because it simply cannot in that case. So the commit > message should simply state that we're marking the page as poison for > the kexec'ed kernel's sake and not because of anything else. > Wonderful! Thanks for your detail explanation, Boris! I think I got the point why you emphasized "can't make the kernel survive" before. In such scenario the consideration for recovery doesn’t make sense at all, even thought there is opportunity it shouldn’t do that, the only choice is panic ASAP. >> If kexec is enabled, check for memory errors and mark the >> page as poisoned so that the kexec'd kernel can avoid accessing >> the page. > Yap, yours makes sense. Tony, your commit message made me realize how verbose my commit message is. May I simplify the whole commit message as following for next version? ---start--- Memory errors don't happen very often, especially the severity is fatal. However, in large-scale scenarios, such as data centers, it might still happen. When there is a fatal machine check Linux calls mce_panic() without checking to see if bad data at some memory address was reported in the machine check banks. If kexec is enabled, check for memory errors and mark the page as poisoned so that the kexec'ed kernel can avoid accessing the page. ---end--- It already covers the scenario, root cause and solution, and focuses on kernel. No need to talk something else. Thanks to both of you for great insights. Best Regards, Zhiquan
> Tony, your commit message made me realize how verbose my commit message > is. May I simplify the whole commit message as following for next version? > > ---start--- > Memory errors don't happen very often, especially the severity is fatal. > However, in large-scale scenarios, such as data centers, it might still > happen. When there is a fatal machine check Linux calls mce_panic() > without checking to see if bad data at some memory address > was reported in the machine check banks. > > If kexec is enabled, check for memory errors and mark the page as > poisoned so that the kexec'ed kernel can avoid accessing the page. > ---end--- > > It already covers the scenario, root cause and solution, and focuses on > kernel. No need to talk something else. Yes, you can use that. Being concise (but keeping the important details) is the art of a good commit message. -Tony
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 6f35f724cc14..905e80c776b8 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -233,6 +233,7 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp) struct llist_node *pending; struct mce_evt_llist *l; int apei_err = 0; + struct page *p; /* * Allow instrumentation around external facilities usage. Not that it @@ -286,6 +287,17 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp) if (!fake_panic) { if (panic_timeout == 0) panic_timeout = mca_cfg.panic_timeout; + /* + * Kdump can exclude the HWPoison page to avoid touching the error + * page again, the prerequisite is that the PG_hwpoison page flag is + * set. However, for some MCE fatal error cases, there is no + * opportunity to queue a task for calling memory_failure(), and as a + * result, the capture kernel panics. So mark the page as HWPoison + * before kernel panic() for MCE. + */ + p = pfn_to_online_page(final->addr >> PAGE_SHIFT); + if (final && (final->status & MCI_STATUS_ADDRV) && p) + SetPageHWPoison(p); panic(msg); } else pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);