mm/memory-failure: Use a mutex to avoid memory_failure() races

Message ID 20210308225504.GA233893@agluck-desk2.amr.corp.intel.com

Commit Message

Tony Luck March 8, 2021, 10:55 p.m. UTC
There can be races when multiple CPUs consume poison from the same
page. The first into memory_failure() atomically sets the HWPoison
page flag and begins hunting for tasks that map this page. Eventually
it invalidates those mappings and may send a SIGBUS to the affected
tasks.

But while all that work is going on, other CPUs see a "success"
return code from memory_failure() and so they believe the error
has been handled and continue executing.

Fix by wrapping most of the internal parts of memory_failure() in
a mutex.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 mm/memory-failure.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)
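
Conceptually, the patch serializes the whole handler. A minimal sketch of
the locking pattern (illustration only; __memory_failure_locked() is a
hypothetical stand-in for the handler body, and the actual patch below
unlocks at each existing return path instead of factoring out a helper):

	static DEFINE_MUTEX(mf_mutex);

	int memory_failure(unsigned long pfn, int flags)
	{
		int res;

		/*
		 * A second CPU that consumed poison from the same page
		 * waits here until the first CPU has finished invalidating
		 * mappings and signaling tasks, instead of racing ahead
		 * on a "success" return code.
		 */
		mutex_lock(&mf_mutex);
		res = __memory_failure_locked(pfn, flags);
		mutex_unlock(&mf_mutex);

		return res;
	}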

Comments

HORIGUCHI NAOYA(堀口 直也) March 8, 2021, 11:42 p.m. UTC | #1
On Mon, Mar 08, 2021 at 02:55:04PM -0800, Luck, Tony wrote:
> There can be races when multiple CPUs consume poison from the same
> page. The first into memory_failure() atomically sets the HWPoison
> page flag and begins hunting for tasks that map this page. Eventually
> it invalidates those mappings and may send a SIGBUS to the affected
> tasks.
> 
> But while all that work is going on, other CPUs see a "success"
> return code from memory_failure() and so they believe the error
> has been handled and continue executing.
> 
> Fix by wrapping most of the internal parts of memory_failure() in
> a mutex.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>

Thanks!

Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
HORIGUCHI NAOYA(堀口 直也) March 9, 2021, 6:04 a.m. UTC | #2
On Tue, Mar 09, 2021 at 10:04:21AM +0800, Aili Yao wrote:
> On Mon, 8 Mar 2021 14:55:04 -0800
> "Luck, Tony" <tony.luck@intel.com> wrote:
> 
> > ...
> 
> If others are OK with this method, then I am OK too,
> but I have two concerns; may you please take them into account:
> 
> 1. When memory_failure() returns 0 in the race condition, kill_me_maybe() goes into this branch:
> 	if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) &&
> 	    !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
> 		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
> 		sync_core();
> 		return;
> 	}
> 
> We place set_mce_nospec() here for a reason; please see commit fd0e786d9d09024f67b.
> 
> 2. When memory_failure() returns 0, control may return to the user process, which may then re-execute the instruction that
> triggered the previous fault. This behavior assumes an implicit dependency that the related PTE has been correctly set;
> if it has not, it will lead to an infinite loop again.

These seem to be separate issues from memory_failure()'s concurrency issue,
so I still expect your patch to be merged. Maybe you want to update it
based on the discussion (once it concludes)?

Thanks,
Naoya Horiguchi
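
To make concern #2 concrete, the failure sequence would look like this
(an illustrative timeline, not kernel code; it assumes the racing CPU's
0 return leaves the PTE unchanged):

	/*
	 *   user task                        kernel
	 *   ---------                        ------
	 *   load from poisoned page   --->   #MC, memory_failure() == 0
	 *                             <---   return to user, PTE intact
	 *   re-execute the same load  --->   #MC again ... and so on,
	 *                                    an infinite loop
	 */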
yaoaili [么爱利] March 9, 2021, 6:38 a.m. UTC | #3
On Tue, 9 Mar 2021 06:04:41 +0000
HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com> wrote:

> ...
> These seem to be separate issues from memory_failure()'s concurrency issue,
> so I still expect your patch to be merged. Maybe you want to update it
> based on the discussion (once it concludes)?
> 
> Thanks,
> Naoya Horiguchi

I have submitted a v2 patch; please help review it.

Thanks!

Patch

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..c1509f4b565e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	return rc;
 }
 
+static DEFINE_MUTEX(mf_mutex);
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags)
 		return -ENXIO;
 	}
 
+	mutex_lock(&mf_mutex);
+
 try_again:
-	if (PageHuge(p))
-		return memory_failure_hugetlb(pfn, flags);
+	if (PageHuge(p)) {
+		res = memory_failure_hugetlb(pfn, flags);
+		goto out2;
+	}
+
 	if (TestSetPageHWPoison(p)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
+		mutex_unlock(&mf_mutex);
 		return 0;
 	}
 
@@ -1463,9 +1471,11 @@ int memory_failure(unsigned long pfn, int flags)
 				res = MF_FAILED;
 			}
 			action_result(pfn, MF_MSG_BUDDY, res);
+			mutex_unlock(&mf_mutex);
 			return res == MF_RECOVERED ? 0 : -EBUSY;
 		} else {
 			action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
+			mutex_unlock(&mf_mutex);
 			return -EBUSY;
 		}
 	}
@@ -1473,6 +1483,7 @@ int memory_failure(unsigned long pfn, int flags)
 	if (PageTransHuge(hpage)) {
 		if (try_to_split_thp_page(p, "Memory Failure") < 0) {
 			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
+			mutex_unlock(&mf_mutex);
 			return -EBUSY;
 		}
 		VM_BUG_ON_PAGE(!page_count(p), p);
@@ -1517,6 +1528,7 @@ int memory_failure(unsigned long pfn, int flags)
 		num_poisoned_pages_dec();
 		unlock_page(p);
 		put_page(p);
+		mutex_unlock(&mf_mutex);
 		return 0;
 	}
 	if (hwpoison_filter(p)) {
@@ -1524,6 +1536,7 @@ int memory_failure(unsigned long pfn, int flags)
 			num_poisoned_pages_dec();
 		unlock_page(p);
 		put_page(p);
+		mutex_unlock(&mf_mutex);
 		return 0;
 	}
 
@@ -1559,6 +1572,8 @@ int memory_failure(unsigned long pfn, int flags)
 	res = identify_page_state(pfn, p, page_flags);
 out:
 	unlock_page(p);
+out2:
+	mutex_unlock(&mf_mutex);
 	return res;
 }
 EXPORT_SYMBOL_GPL(memory_failure);
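
As a side note for anyone who wants to exercise memory_failure() without
injecting real machine checks: madvise(MADV_HWPOISON) invokes it directly
on the given page (requires CONFIG_MEMORY_FAILURE and CAP_SYS_ADMIN).
A minimal sketch, not part of this series:

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		long psz = sysconf(_SC_PAGESIZE);
		char *p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		*(volatile char *)p = 1;	/* fault the page in */

		/* Hand the page to memory_failure() as if it were poisoned. */
		if (madvise(p, psz, MADV_HWPOISON) != 0) {
			perror("madvise(MADV_HWPOISON)");
			return 1;
		}

		puts("page handed to memory_failure()");
		return 0;
	}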