[v3,0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE

Message ID	20210421005728.1994268-1-nao.horiguchi@gmail.com (mailing list archive)
Headers	show Return-Path: <SRS0=3Ur0=JS=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org BE1C961409 From: Naoya Horiguchi <nao.horiguchi@gmail.com> To: linux-mm@kvack.org, Tony Luck <tony.luck@intel.com>, Aili Yao <yaoaili@kingsoft.com> Cc: Andrew Morton <akpm@linux-foundation.org>, Oscar Salvador <osalvador@suse.de>, David Hildenbrand <david@redhat.com>, Borislav Petkov <bp@alien8.de>, Andy Lutomirski <luto@kernel.org>, Naoya Horiguchi <naoya.horiguchi@nec.com>, Jue Wang <juew@google.com>, linux-kernel@vger.kernel.org Subject: [PATCH v3 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE Date: Wed, 21 Apr 2021 09:57:25 +0900 Message-Id: <20210421005728.1994268-1-nao.horiguchi@gmail.com> MIME-Version: 1.0 Received-SPF: none (gmail.com>: No applicable sender policy available) receiver=imf04; identity=mailfrom; envelope-from="<nao.horiguchi@gmail.com>"; helo=mail-pg1-f175.google.com; client-ip=209.85.215.175 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm,hwpoison: fix sending SIGBUS for Action Required MCE \| expand [v3,0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE [v3,1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races [v3,2/3] mm,hwpoison: return -EHWPOISON when page already [v3,3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address

Naoya Horiguchi April 21, 2021, 12:57 a.m. UTC

Hi,

I updated the series based on previous feedbacks/discussions.

Changelog v2 -> v3:
- 1/3 aligned returning code with "goto out" approach,
- 3/3 added one Tested-by tag,
- 3/3 fixed comment on kill_accessing_process().

Changelog v1 -> v2:
- 3/3 fixed on initializing local variables and calculation logic
  of error virtual address,

v1: https://lore.kernel.org/linux-mm/20210412224320.1747638-1-nao.horiguchi@gmail.com/
v2 (only 3/3 is posted): https://lore.kernel.org/linux-mm/20210419023658.GA1962954@u2004/

Thanks,
Naoya Horiguchi

--- quote from cover letter of v1 ---

I wrote this patchset to materialize what I think is the current
allowable solution mentioned by the previous discussion [1].
I simply borrowed Tony's mutex patch and Aili's return code patch,
then I queued another one to find error virtual address in the best
effort manner.  I know that this is not a perfect solution, but
should work for some typical case.

[1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/
---
Summary:

Aili Yao (1):
      mm,hwpoison: return -EHWPOISON when page already

Naoya Horiguchi (1):
      mm,hwpoison: add kill_accessing_process() to find error virtual address

Tony Luck (1):
      mm/memory-failure: Use a mutex to avoid memory_failure() races

 arch/x86/kernel/cpu/mce/core.c |  13 ++-
 include/linux/swapops.h        |   5 ++
 mm/memory-failure.c            | 181 +++++++++++++++++++++++++++++++++++++----
 3 files changed, 183 insertions(+), 16 deletions(-)

Luck, Tony May 7, 2021, 2:10 a.m. UTC | #1

> I updated the series based on previous feedbacks/discussions.

Tested-by: Tony Luck <tony.luck@intel.com>

-Tony

Luck, Tony May 7, 2021, 3:37 a.m. UTC | #2

>> I updated the series based on previous feedbacks/discussions.
>
> Tested-by: Tony Luck <tony.luck@intel.com>

Maybe this series should get a "Cc: stable" tag too?

I don't have a specific commit to provide for Fixes:, but it deals
with an issue where sometimes processes return from the machine
check handler and hit the same machine check again.

It's hard[1] to hit in v5.8, but somehow gets progressively easier in
newer versions.

-Tony

[1] Initially I thought it didn't happen at all in v5.8. Then wasted a bunch
of time trying to bisect. But my "git bisect good" responses turned out to
be randomly assigned when I just hadn't tried enough times to make the
problem show up.

HORIGUCHI NAOYA(堀口直也) May 7, 2021, 5:24 a.m. UTC | #3

On Fri, May 07, 2021 at 03:37:07AM +0000, Luck, Tony wrote:
> >> I updated the series based on previous feedbacks/discussions.
> >
> > Tested-by: Tony Luck <tony.luck@intel.com>
> 
> Maybe this series should get a "Cc: stable" tag too?

The second patch (I squashed 2/3 and 3/3 into one in v4) is a little
too large (more than 100 lines of change), which doesn't meet stable rule.
But the first patch looks suitable to me for stable.

> 
> I don't have a specific commit to provide for Fixes:, but it deals
> with an issue where sometimes processes return from the machine
> check handler and hit the same machine check again.

If we consider that this issue is visible after LMCE is supported,
the following commit could be the target of Fixes tag.

    commit 243d657eaf540db882f73497060da5a4f7d86a90
    Author: Ashok Raj <ashok.raj@intel.com>
    Date:   Thu Jun 4 18:55:24 2015 +0200

        x86/mce: Handle Local MCE events

but according to your test result, any other factor might affect when
the issue actually got reproducible.

> 
> It's hard[1] to hit in v5.8, but somehow gets progressively easier in
> newer versions.
> 
> -Tony
> 
> [1] Initially I thought it didn't happen at all in v5.8. Then wasted a bunch
> of time trying to bisect. But my "git bisect good" responses turned out to
> be randomly assigned when I just hadn't tried enough times to make the
> problem show up.

Thanks for time and effort for testing.

- Naoya

yaoaili [么爱利] May 7, 2021, 11:14 a.m. UTC | #4

On Fri, 7 May 2021 05:24:22 +0000
HORIGUCHI NAOYA(堀口　直也) <naoya.horiguchi@nec.com> wrote:

> On Fri, May 07, 2021 at 03:37:07AM +0000, Luck, Tony wrote:
> > >> I updated the series based on previous feedbacks/discussions.  
> > >
> > > Tested-by: Tony Luck <tony.luck@intel.com>  
> > 
> > Maybe this series should get a "Cc: stable" tag too?  
> 
> The second patch (I squashed 2/3 and 3/3 into one in v4) is a little
> too large (more than 100 lines of change), which doesn't meet stable rule.
> But the first patch looks suitable to me for stable.

Before cc:stable, would you please do one stress test first?
It failed in my server, but I didn't dig into it, maybe the fail is meaningless.
Just a small suggestion.

Thanks!
Aili Yao

Luck, Tony May 7, 2021, 6:02 p.m. UTC | #5

> Before cc:stable, would you please do one stress test first?
> It failed in my server, but I didn't dig into it, maybe the fail is meaningless.
> Just a small suggestion.

Upstream plus these three patches passed an overnight 8-hour stress test
on four machines here (running a test that's been failing with hung kernels
and kernel crashes w/o these patches).

What were the symptoms of your failed server?

-Tony

yaoaili [么爱利] May 8, 2021, 2:38 a.m. UTC | #6

On Fri, 7 May 2021 18:02:13 +0000
"Luck, Tony" <tony.luck@intel.com> wrote:

> > Before cc:stable, would you please do one stress test first?
> > It failed in my server, but I didn't dig into it, maybe the fail is meaningless.
> > Just a small suggestion.  
> 
> Upstream plus these three patches passed an overnight 8-hour stress test
> on four machines here (running a test that's been failing with hung kernels
> and kernel crashes w/o these patches).
> 
> What were the symptoms of your failed server?
> 

Sorry, I am not sure if the stress test in my server is same with yours.
Usually, I will do one RAS stress test from mce-test/cases/stress before post something.
Maybe this test is not general useful or you have tested it and passed or you may think
the test is not proper though or you tested and you don't think it's one real issue,
then just ignore it please.

Thanks!
Aili Yao

Luck, Tony May 10, 2021, 3:28 a.m. UTC | #7

> Sorry, I am not sure if the stress test in my server is same with yours.

It's better if you test different things ... that's how we get better coverage.

> Usually, I will do one RAS stress test from mce-test/cases/stress before post something.
> Maybe this test is not general useful or you have tested it and passed or you may think
> the test is not proper though or you tested and you don't think it's one real issue,
> then just ignore it please.

I haven't re-run the mce-test cases for a while. The test that was run on these patches
involves a multi-threaded application running on all logical CPUs. When poison is injected
many of the CPUs can hit the poison close together. So it's a good stress test for the new
series.

-Tony

[v3,0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE

Message

Comments