diff mbox series

[v3] mm,hwpoison: return -EHWPOISON when page already poisoned

Message ID 20210331192540.2141052f@alex-virtual-machine (mailing list archive)
State New, archived
Headers show
Series [v3] mm,hwpoison: return -EHWPOISON when page already poisoned | expand

Commit Message

yaoaili [么爱利] March 31, 2021, 11:25 a.m. UTC
When the page is already poisoned, another memory_failure() call in the
same page now return 0, meaning OK. For nested memory mce handling, this
behavior may lead to one mce looping, Example:

1.When LCME is enabled, and there are two processes A && B running on
different core X && Y separately, which will access one same page, then
the page corrupted when process A access it, a MCE will be rasied to
core X and the error process is just underway.

2.Then B access the page and trigger another MCE to core Y, it will also
do error process, it will see TestSetPageHWPoison be true, and 0 is
returned.

3.The kill_me_maybe will check the return:

1244 static void kill_me_maybe(struct callback_head *cb)
1245 {

1254         if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
1255             !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
1256                 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,
p->mce_whole_page);
1257                 sync_core();
1258                 return;
1259         }

1267 }

4. The error process for B will end, and may nothing happened if
kill-early is not set, The process B will re-excute instruction and get
into mce again and then loop happens. And also the set_mce_nospec()
here is not proper, may refer to commit fd0e786d9d09 ("x86/mm,
mm/hwpoison: Don't unconditionally unmap kernel 1:1 pages").

For other cases which care the return value of memory_failure() should
check why they want to process a memory error which have already been
processed. This behavior seems reasonable.

Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
---
 mm/memory-failure.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Tony Luck April 1, 2021, 3:33 p.m. UTC | #1
On Wed, Mar 31, 2021 at 07:25:40PM +0800, Aili Yao wrote:
> When the page is already poisoned, another memory_failure() call in the
> same page now return 0, meaning OK. For nested memory mce handling, this
> behavior may lead to one mce looping, Example:
> 
> 1.When LCME is enabled, and there are two processes A && B running on
> different core X && Y separately, which will access one same page, then
> the page corrupted when process A access it, a MCE will be rasied to
> core X and the error process is just underway.
> 
> 2.Then B access the page and trigger another MCE to core Y, it will also
> do error process, it will see TestSetPageHWPoison be true, and 0 is
> returned.
> 
> 3.The kill_me_maybe will check the return:
> 
> 1244 static void kill_me_maybe(struct callback_head *cb)
> 1245 {
> 
> 1254         if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> 1255             !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> 1256                 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,
> p->mce_whole_page);
> 1257                 sync_core();
> 1258                 return;
> 1259         }
> 
> 1267 }

With your change memory_failure() will return -EHWPOISON for the
second task that consumes poison ... so that "if" statement won't
be true and so we fall into the following code:

1273         if (p->mce_vaddr != (void __user *)-1l) {
1274                 force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
1275         } else {
1276                 pr_err("Memory error not recovered");
1277                 kill_me_now(cb);
1278         }

If this was a copy_from_user() machine check, p->mce_vaddr is set and
the task gets a BUS_MCEERR_AR SIGBUS, otherwise we print that

	"Memory error not recovered"

message and send a generic SIGBUS.  I don't think either of those options
is right.

Combined with my "mutex" patch (to get rid of races where 2nd process returns
early, but first process is still looking for mappings to unmap and tasks
to signal) this patch moves forward a bit. But I think it needs an
additional change here in kill_me_maybe() to just "return" if there is a
EHWPOISON return from memory_failure()

-Tony
yaoaili [么爱利] April 2, 2021, 1:18 a.m. UTC | #2
On Thu, 1 Apr 2021 08:33:20 -0700
"Luck, Tony" <tony.luck@intel.com> wrote:

> On Wed, Mar 31, 2021 at 07:25:40PM +0800, Aili Yao wrote:
> > When the page is already poisoned, another memory_failure() call in the
> > same page now return 0, meaning OK. For nested memory mce handling, this
> > behavior may lead to one mce looping, Example:
> > 
> > 1.When LCME is enabled, and there are two processes A && B running on
> > different core X && Y separately, which will access one same page, then
> > the page corrupted when process A access it, a MCE will be rasied to
> > core X and the error process is just underway.
> > 
> > 2.Then B access the page and trigger another MCE to core Y, it will also
> > do error process, it will see TestSetPageHWPoison be true, and 0 is
> > returned.
> > 
> > 3.The kill_me_maybe will check the return:
> > 
> > 1244 static void kill_me_maybe(struct callback_head *cb)
> > 1245 {
> > 
> > 1254         if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> > 1255             !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> > 1256                 set_mce_nospec(p->mce_addr >> PAGE_SHIFT,
> > p->mce_whole_page);
> > 1257                 sync_core();
> > 1258                 return;
> > 1259         }
> > 
> > 1267 }
> 
> With your change memory_failure() will return -EHWPOISON for the
> second task that consumes poison ... so that "if" statement won't
> be true and so we fall into the following code:
> 
> 1273         if (p->mce_vaddr != (void __user *)-1l) {
> 1274                 force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
> 1275         } else {
> 1276                 pr_err("Memory error not recovered");
> 1277                 kill_me_now(cb);
> 1278         }
> 
> If this was a copy_from_user() machine check, p->mce_vaddr is set and
> the task gets a BUS_MCEERR_AR SIGBUS, otherwise we print that
> 
> 	"Memory error not recovered"
> 
> message and send a generic SIGBUS.  I don't think either of those options
> is right.
> 
> Combined with my "mutex" patch (to get rid of races where 2nd process returns
> early, but first process is still looking for mappings to unmap and tasks
> to signal) this patch moves forward a bit. But I think it needs an
> additional change here in kill_me_maybe() to just "return" if there is a
> EHWPOISON return from memory_failure()
> 
Got this, Thanks for your reply!
I will dig into this!
Tony Luck April 2, 2021, 3:11 p.m. UTC | #3
>> Combined with my "mutex" patch (to get rid of races where 2nd process returns
>> early, but first process is still looking for mappings to unmap and tasks
>> to signal) this patch moves forward a bit. But I think it needs an
>> additional change here in kill_me_maybe() to just "return" if there is a
>> EHWPOISON return from memory_failure()
>> 
> Got this, Thanks for your reply!
> I will dig into this!

One problem with this approach is when the first task to find poison
fails to complete actions. Then the poison pages are not unmapped,
and just returning from kill_me_maybe() gets into a loop :-(

-Tony
HORIGUCHI NAOYA(堀口 直也) April 5, 2021, 1:50 p.m. UTC | #4
On Fri, Apr 02, 2021 at 03:11:20PM +0000, Luck, Tony wrote:
> >> Combined with my "mutex" patch (to get rid of races where 2nd process returns
> >> early, but first process is still looking for mappings to unmap and tasks
> >> to signal) this patch moves forward a bit. But I think it needs an
> >> additional change here in kill_me_maybe() to just "return" if there is a
> >> EHWPOISON return from memory_failure()
> >> 
> > Got this, Thanks for your reply!
> > I will dig into this!
> 
> One problem with this approach is when the first task to find poison
> fails to complete actions. Then the poison pages are not unmapped,
> and just returning from kill_me_maybe() gets into a loop :-(

Yes, that's the pain point.  We need send SIGBUS to the current process in
"already haredware poisoned" case of memory_failure().  SIGBUS should
contain the error virtual address, but unfortunately walking the page table
or using p->mce_vaddr is not always reliable now.

So as a second-best approach, we can extend the "walking page table"
approach such that we walk over the whole virtual address space to make sure
that the number of entries pointing to the error page is exactly 1.
If that's the case, then we can confidently send SIGBUS with it.  If we find
multiple entries pointing to the error page, then we give up guessing, then
send a nomral SIGBUS to the current process.  That's not worse than now,
and I think we need wait in the hope that the virtual address will be
available in MCE handler.

Anyway I'll try to write a patch for this.

Thanks,
Naoya Horiguchi
yaoaili [么爱利] April 6, 2021, 1:04 a.m. UTC | #5
On Mon, 5 Apr 2021 13:50:18 +0000
HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com> wrote:

> On Fri, Apr 02, 2021 at 03:11:20PM +0000, Luck, Tony wrote:
> > >> Combined with my "mutex" patch (to get rid of races where 2nd process returns
> > >> early, but first process is still looking for mappings to unmap and tasks
> > >> to signal) this patch moves forward a bit. But I think it needs an
> > >> additional change here in kill_me_maybe() to just "return" if there is a
> > >> EHWPOISON return from memory_failure()
> > >> 
> > > Got this, Thanks for your reply!
> > > I will dig into this!
> > 
> > One problem with this approach is when the first task to find poison
> > fails to complete actions. Then the poison pages are not unmapped,
> > and just returning from kill_me_maybe() gets into a loop :-(
> 
> Yes, that's the pain point.  We need send SIGBUS to the current process in
> "already haredware poisoned" case of memory_failure().  SIGBUS should
> contain the error virtual address, but unfortunately walking the page table
> or using p->mce_vaddr is not always reliable now.
> 
> So as a second-best approach, we can extend the "walking page table"
> approach such that we walk over the whole virtual address space to make sure
> that the number of entries pointing to the error page is exactly 1.
> If that's the case, then we can confidently send SIGBUS with it.  If we find
> multiple entries pointing to the error page, then we give up guessing, then
> send a nomral SIGBUS to the current process.  That's not worse than now,
> and I think we need wait in the hope that the virtual address will be
> available in MCE handler.
> 
> Anyway I'll try to write a patch for this.

Yeah, previous patch didn't adress the multiple virtual address issue, If there is a way to fix that,
That would be great!
diff mbox series

Patch

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..5cd42144b67c 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1228,7 +1228,7 @@  static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	if (TestSetPageHWPoison(head)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 		       pfn);
-		return 0;
+		return -EHWPOISON;
 	}
 
 	num_poisoned_pages_inc();
@@ -1430,7 +1430,7 @@  int memory_failure(unsigned long pfn, int flags)
 	if (TestSetPageHWPoison(p)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
-		return 0;
+		return -EHWPOISON;
 	}
 
 	orig_head = hpage = compound_head(p);