[v4] mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()

Message ID 20220316120701.394061-1-naoya.horiguchi@linux.dev (mailing list archive)
State New
Series [v4] mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()

Commit Message

Naoya Horiguchi March 16, 2022, 12:07 p.m. UTC
From: Miaohe Lin <linmiaohe@huawei.com>

There is a race condition between memory_failure_hugetlb() and hugetlb
free/demotion, which can cause the PageHWPoison flag to be set on the
wrong page. One simple consequence is that the wrong processes can be
killed, but a more serious one is that the actual error is left unhandled,
so nothing prevents later accesses to it, which might lead to worse
outcomes such as consuming corrupted data.

Think about the below race window:

  CPU 1                                   CPU 2
  memory_failure_hugetlb
  struct page *head = compound_head(p);
                                          hugetlb page might be freed to
                                          buddy, or even changed to another
                                          compound page.

  get_hwpoison_page -- page is not what we want now...

compound_head() is called outside hugetlb_lock, so the head is not
reliable.

So set the PageHWPoison flag only after the prechecks have passed. To
detect potential violations, this patch also introduces a new action
type, MF_MSG_DIFFERENT_PAGE_SIZE.
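
As an illustration, a simplified sketch of the patched helper (mirroring
the mm/hugetlb.c hunk below): the head page is resolved only while
holding hugetlb_lock, where hugetlb free/demotion cannot race with us.

	int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
	{
		struct page *head;
		int ret = 0;

		*hugetlb = false;
		spin_lock_irq(&hugetlb_lock);
		/* The head cannot change while hugetlb_lock is held. */
		head = compound_head(page);
		if (PageHeadHuge(head)) {
			*hugetlb = true;
			if (HPageFreed(head) || HPageMigratable(head))
				ret = get_page_unless_zero(head);
			else
				ret = -EBUSY;
		}
		spin_unlock_irq(&hugetlb_lock);
		return ret;
	}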

Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: <stable@vger.kernel.org>
---
ChangeLog v3 -> v4:
- squash with "mm/memory-failure.c: fix race with changing page
  compound again".
- update patch subject and description based on it.

ChangeLog v2 -> v3:
- rename the patch because page lock is not the primary factor to
  solve the reported issue.
- updated description in the same manner.
- call page_handle_poison() instead of __page_handle_poison() for
  free hugepage case.
- reorder put_page and unlock_page (thanks to Miaohe Lin)

ChangeLog v1 -> v2:
- pass subpage to get_hwpoison_huge_page() instead of head page.
- call compound_head() in hugetlb_lock to avoid race with hugetlb
  demotion/free.
---
 include/linux/mm.h      |  1 +
 include/ras/ras_event.h |  1 +
 mm/hugetlb.c            |  8 +++---
 mm/memory-failure.c     | 55 ++++++++++++++++++++++++-----------------
 4 files changed, 40 insertions(+), 25 deletions(-)

Comments

Mike Kravetz March 16, 2022, 10:51 p.m. UTC | #1
On 3/16/22 05:07, Naoya Horiguchi wrote:
> From: Miaohe Lin <linmiaohe@huawei.com>
> 
> There is a race condition between memory_failure_hugetlb() and hugetlb
> free/demotion, which can cause the PageHWPoison flag to be set on the
> wrong page. One simple consequence is that the wrong processes can be
> killed, but a more serious one is that the actual error is left unhandled,
> so nothing prevents later accesses to it, which might lead to worse
> outcomes such as consuming corrupted data.
> 
> Think about the below race window:
> 
>   CPU 1                                   CPU 2
>   memory_failure_hugetlb
>   struct page *head = compound_head(p);
>                                           hugetlb page might be freed to
>                                           buddy, or even changed to another
>                                           compound page.
> 
>   get_hwpoison_page -- page is not what we want now...
> 
> compound_head() is called outside hugetlb_lock, so the head is not
> reliable.
> 
> So set the PageHWPoison flag only after the prechecks have passed. To
> detect potential violations, this patch also introduces a new action
> type, MF_MSG_DIFFERENT_PAGE_SIZE.

Thanks for squashing these patches.

In my testing, there is a change in behavior that may not be intended.

My test strategy is:
- allocate two hugetlb pages
- create a mapping which reserves those two pages, but does not fault them in
  - as a result, the pages are on the free list but can not be freed
- inject error on a subpage of one of the huge pages
  - echo 0xYYY > /sys/kernel/debug/hwpoison/corrupt-pfn
- memory error code will call dissolve_free_huge_page
  - dissolve_free_huge_page returns -EBUSY because
    h->free_huge_pages - h->resv_huge_pages == 0
- We never end up setting Poison on the page with error or head page
- Huge page sitting on free list with error in subpage and not marked
- huge page with error could be given to an application or returned to buddy

Prior to this change, Poison would be set on the head page

I do not think this was an intended change in behavior.  But, perhaps it is
all we can do in this case?  Sorry for not being able to look more closely
at the code right now.
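
For reference, here is a minimal userspace sketch of the reservation
setup above (hypothetical and untested; it assumes a pool of two 2MB
default-sized hugepages, with the debugfs injection done manually while
the program waits):

	/* Hypothetical reproducer sketch, not a tested program. */
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define HPAGE_SIZE	(2UL << 20)	/* assumes 2MB default hugepages */

	int main(void)
	{
		/*
		 * MAP_HUGETLB reserves both pool pages at mmap() time, but
		 * nothing is faulted in, so they sit on the free list while
		 * h->free_huge_pages - h->resv_huge_pages == 0.
		 */
		void *p = mmap(NULL, 2 * HPAGE_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/*
		 * Now inject the error on a subpage of one of the reserved
		 * pages:
		 *   echo 0xYYY > /sys/kernel/debug/hwpoison/corrupt-pfn
		 * dissolve_free_huge_page() should return -EBUSY.
		 */
		pause();
		return 0;
	}
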
Miaohe Lin March 17, 2022, 9:28 a.m. UTC | #2
On 2022/3/17 6:51, Mike Kravetz wrote:
> On 3/16/22 05:07, Naoya Horiguchi wrote:
>> From: Miaohe Lin <linmiaohe@huawei.com>
>>
>> There is a race condition between memory_failure_hugetlb() and hugetlb
>> free/demotion, which can cause the PageHWPoison flag to be set on the
>> wrong page. One simple consequence is that the wrong processes can be
>> killed, but a more serious one is that the actual error is left unhandled,
>> so nothing prevents later accesses to it, which might lead to worse
>> outcomes such as consuming corrupted data.
>>
>> Think about the below race window:
>>
>>   CPU 1                                   CPU 2
>>   memory_failure_hugetlb
>>   struct page *head = compound_head(p);
>>                                           hugetlb page might be freed to
>>                                           buddy, or even changed to another
>>                                           compound page.
>>
>>   get_hwpoison_page -- page is not what we want now...
>>
>> compound_head() is called outside hugetlb_lock, so the head is not
>> reliable.
>>
>> So set the PageHWPoison flag only after the prechecks have passed. To
>> detect potential violations, this patch also introduces a new action
>> type, MF_MSG_DIFFERENT_PAGE_SIZE.
> 
> Thanks for squashing these patches.
> 
> In my testing, there is a change in behavior that may not be intended.
> 
> My test strategy is:
> - allocate two hugetlb pages
> - create a mapping which reserves those two pages, but does not fault them in
>   - as a result, the pages are on the free list but can not be freed
> - inject error on a subpage of one of the huge pages
>   - echo 0xYYY > /sys/kernel/debug/hwpoison/corrupt-pfn
> - memory error code will call dissolve_free_huge_page
>   - dissolve_free_huge_page returns -EBUSY because
>     h->free_huge_pages - h->resv_huge_pages == 0
> - We never end up setting Poison on the page with error or head page
> - Huge page sitting on free list with error in subpage and not marked
> - huge page with error could be given to an application or returned to buddy
> 
> Prior to this change, Poison would be set on the head page
> 

Many thanks for pointing this out. IIUC, this change in behavior is indeed
unintended. We're trying to avoid setting the PageHWPoison flag on the
wrong page, so we have to set it only after the prechecks have passed, as
the commit log says. But there is room for improvement, e.g. when the page
has changed to a single page or a compound page of a different size after
we grab the page refcount, we could still set PageHWPoison before bailing
out. Is there something more we can do?

> I do not think this was an intended change in behavior.  But, perhaps it is
> all we can do in this case?  Sorry for not being able to look more closely
> at the code right now.   
> 

Thanks.
HORIGUCHI NAOYA(堀口 直也) March 17, 2022, 1:31 p.m. UTC | #3
On Wed, Mar 16, 2022 at 03:51:35PM -0700, Mike Kravetz wrote:
> On 3/16/22 05:07, Naoya Horiguchi wrote:
> > From: Miaohe Lin <linmiaohe@huawei.com>
> > 
> > There is a race condition between memory_failure_hugetlb() and hugetlb
> > free/demotion, which can cause the PageHWPoison flag to be set on the
> > wrong page. One simple consequence is that the wrong processes can be
> > killed, but a more serious one is that the actual error is left unhandled,
> > so nothing prevents later accesses to it, which might lead to worse
> > outcomes such as consuming corrupted data.
> > 
> > Think about the below race window:
> > 
> >   CPU 1                                   CPU 2
> >   memory_failure_hugetlb
> >   struct page *head = compound_head(p);
> >                                           hugetlb page might be freed to
> >                                           buddy, or even changed to another
> >                                           compound page.
> > 
> >   get_hwpoison_page -- page is not what we want now...
> > 
> > compound_head() is called outside hugetlb_lock, so the head is not
> > reliable.
> > 
> > So set the PageHWPoison flag only after the prechecks have passed. To
> > detect potential violations, this patch also introduces a new action
> > type, MF_MSG_DIFFERENT_PAGE_SIZE.
> 
> Thanks for squashing these patches.
> 
> In my testing, there is a change in behavior that may not be intended.
> 
> My test strategy is:
> - allocate two hugetlb pages
> - create a mapping which reserves those two pages, but does not fault them in
>   - as a result, the pages are on the free list but can not be freed
> - inject error on a subpage of one of the huge pages
>   - echo 0xYYY > /sys/kernel/debug/hwpoison/corrupt-pfn
> - memory error code will call dissolve_free_huge_page
>   - dissolve_free_huge_page returns -EBUSY because
>     h->free_huge_pages - h->resv_huge_pages == 0
> - We never end up setting Poison on the page with error or head page
> - Huge page sitting on free list with error in subpage and not marked
> - huge page with error could be given to an application or returned to buddy
> 
> Prior to this change, Poison would be set on the head page

You're right, this is not intended.
I should've kept the current behavior for this case (set the PageHWPoison
flag on the head page, with no refcount taken), so I'll update the patch.

In this case the hwpoisoned hugepage remains on the free list, but
dequeue_huge_page_node_exact() checks the PageHWPoison flag, so the page
is never allocated, which is OK.  It might not be optimal that the
hwpoisoned free hugepage stays on the list for long, though, so some
mechanism to handle it in a delayed manner would help.  One way is to
call dissolve_free_huge_page() when a hwpoisoned page is found in
dequeue_huge_page_node_exact().  If the dissolve succeeds, we're done;
if it fails, keep the page as-is, expecting it to be dissolved on a
later attempt.
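
Roughly, the idea could look like this (an untested sketch;
queue_deferred_dissolve() is a hypothetical helper, needed because
dissolve_free_huge_page() takes hugetlb_lock itself and so cannot be
called directly here with the lock held):

	static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
	{
		struct page *page;

		lockdep_assert_held(&hugetlb_lock);
		list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
			if (PageHWPoison(page)) {
				/*
				 * Hand the page to a deferred worker that
				 * calls dissolve_free_huge_page() after
				 * hugetlb_lock is dropped; on -EBUSY the
				 * page stays on the free list and is
				 * retried the next time it is seen here.
				 */
				queue_deferred_dissolve(page);	/* hypothetical */
				continue;
			}
			/* ... normal dequeue of a clean hugepage ... */
		}
		return NULL;
	}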

Another related problem is that when a hwpoisoned hugepage sits on the
free list, the information about the raw error offset is lost.  So if the
hugepage pool is shrunk and the hwpoisoned hugepage is freed to buddy, the
PageHWPoison flag remains on the ex-head page.  So we need to somehow
keep the error offset.

Anyway, this will be done as separate work; for now I'll just fix the
problem you mentioned.

Thank you very much,
Naoya Horiguchi

> 
> I do not think this was an intended change in behavior.  But, perhaps it is
> all we can do in this case?  Sorry for not being able to look more closely
> at the code right now.   
> -- 
> Mike Kravetz
HORIGUCHI NAOYA(堀口 直也) March 17, 2022, 2:44 p.m. UTC | #4
On Thu, Mar 17, 2022 at 05:28:13PM +0800, Miaohe Lin wrote:
> On 2022/3/17 6:51, Mike Kravetz wrote:
> > On 3/16/22 05:07, Naoya Horiguchi wrote:
> >> From: Miaohe Lin <linmiaohe@huawei.com>
...
> > 
> > In my testing, there is a change in behavior that may not be intended.
> > 
> > My test strategy is:
> > - allocate two hugetlb pages
> > - create a mapping which reserves those two pages, but does not fault them in
> >   - as a result, the pages are on the free list but can not be freed
> > - inject error on a subpage of one of the huge pages
> >   - echo 0xYYY > /sys/kernel/debug/hwpoison/corrupt-pfn
> > - memory error code will call dissolve_free_huge_page
> >   - dissolve_free_huge_page returns -EBUSY because
> >     h->free_huge_pages - h->resv_huge_pages == 0
> > - We never end up setting Poison on the page with error or head page
> > - Huge page sitting on free list with error in subpage and not marked
> > - huge page with error could be given to an application or returned to buddy
> > 
> > Prior to this change, Poison would be set on the head page
> > 
> 
> Many thanks for pointing this out. IIUC, this change in behavior is indeed
> unintended. We're trying to avoid setting the PageHWPoison flag on the
> wrong page, so we have to set it only after the prechecks have passed, as
> the commit log says. But there is room for improvement, e.g. when the page
> has changed to a single page or a compound page of a different size after
> we grab the page refcount, we could still set PageHWPoison before bailing
> out. Is there something more we can do?

Yeah, we could improve this further.  I think we can add SetPageHWPoison
near action_result(MF_MSG_DIFFERENT_PAGE_SIZE), but on which page?  Maybe
setting PageHWPoison on the raw page (not the head page of the new
compound page) is better, because the new compound page should not be
hugetlb.  What about action_result(MF_MSG_UNKNOWN)?  Maybe we should do
the same there.  IOW, setting PageHWPoison on the head page is justified
only when the page is surely a hugepage.
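
For example (an untested sketch on top of the new precheck in this
patch, poisoning the raw page before bailing out):

	if (!PageHuge(p) || compound_head(p) != head) {
		/*
		 * Sketch only: the page is no longer the expected
		 * hugepage, so poison the raw page rather than a head
		 * page that may now belong to an unrelated compound page.
		 */
		if (!TestSetPageHWPoison(p))
			num_poisoned_pages_inc();
		action_result(pfn, MF_MSG_DIFFERENT_PAGE_SIZE, MF_IGNORED);
		res = -EBUSY;
		goto out;
	}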

Thanks,
Naoya Horiguchi

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5744a3fc4716..68c101b399b7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3244,6 +3244,7 @@  enum mf_action_page_type {
 	MF_MSG_BUDDY,
 	MF_MSG_DAX,
 	MF_MSG_UNSPLIT_THP,
+	MF_MSG_DIFFERENT_PAGE_SIZE,
 	MF_MSG_UNKNOWN,
 };
 
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index d0337a41141c..1e694fd239b9 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -374,6 +374,7 @@  TRACE_EVENT(aer_event,
 	EM ( MF_MSG_BUDDY, "free buddy page" )				\
 	EM ( MF_MSG_DAX, "dax page" )					\
 	EM ( MF_MSG_UNSPLIT_THP, "unsplit thp" )			\
+	EM ( MF_MSG_DIFFERENT_PAGE_SIZE, "different page size" )	\
 	EMe ( MF_MSG_UNKNOWN, "unknown page" )
 
 /*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f294db835f4b..345fed90842e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6761,14 +6761,16 @@  bool isolate_huge_page(struct page *page, struct list_head *list)
 
 int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
 {
+	struct page *head;
 	int ret = 0;
 
 	*hugetlb = false;
 	spin_lock_irq(&hugetlb_lock);
-	if (PageHeadHuge(page)) {
+	head = compound_head(page);
+	if (PageHeadHuge(head)) {
 		*hugetlb = true;
-		if (HPageFreed(page) || HPageMigratable(page))
-			ret = get_page_unless_zero(page);
+		if (HPageFreed(head) || HPageMigratable(head))
+			ret = get_page_unless_zero(head);
 		else
 			ret = -EBUSY;
 	}
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 59d0de9beb60..5842aaadcc99 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -732,6 +732,7 @@  static const char * const action_page_types[] = {
 	[MF_MSG_BUDDY]			= "free buddy page",
 	[MF_MSG_DAX]			= "dax page",
 	[MF_MSG_UNSPLIT_THP]		= "unsplit thp",
+	[MF_MSG_DIFFERENT_PAGE_SIZE]	= "different page size",
 	[MF_MSG_UNKNOWN]		= "unknown page",
 };
 
@@ -1192,7 +1193,7 @@  static int __get_hwpoison_page(struct page *page, unsigned long flags)
 	int ret = 0;
 	bool hugetlb = false;
 
-	ret = get_hwpoison_huge_page(head, &hugetlb);
+	ret = get_hwpoison_huge_page(page, &hugetlb);
 	if (hugetlb)
 		return ret;
 
@@ -1279,11 +1280,10 @@  static int get_any_page(struct page *p, unsigned long flags)
 
 static int __get_unpoison_page(struct page *page)
 {
-	struct page *head = compound_head(page);
 	int ret = 0;
 	bool hugetlb = false;
 
-	ret = get_hwpoison_huge_page(head, &hugetlb);
+	ret = get_hwpoison_huge_page(page, &hugetlb);
 	if (hugetlb)
 		return ret;
 
@@ -1501,34 +1501,21 @@  static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	struct page *head = compound_head(p);
 	int res;
 	unsigned long page_flags;
-
-	if (TestSetPageHWPoison(head)) {
-		pr_err("Memory failure: %#lx: already hardware poisoned\n",
-		       pfn);
-		res = -EHWPOISON;
-		if (flags & MF_ACTION_REQUIRED)
-			res = kill_accessing_process(current, page_to_pfn(head), flags);
-		return res;
-	}
-
-	num_poisoned_pages_inc();
+	bool put = false;
+	bool already_hwpoisoned = false;
 
 	if (!(flags & MF_COUNT_INCREASED)) {
 		res = get_hwpoison_page(p, flags);
 		if (!res) {
 			lock_page(head);
 			if (hwpoison_filter(p)) {
-				if (TestClearPageHWPoison(head))
-					num_poisoned_pages_dec();
 				unlock_page(head);
 				return -EOPNOTSUPP;
 			}
 			unlock_page(head);
 			res = MF_FAILED;
-			if (__page_handle_poison(p)) {
-				page_ref_inc(p);
+			if (page_handle_poison(p, true, false))
 				res = MF_RECOVERED;
-			}
 			action_result(pfn, MF_MSG_FREE_HUGE, res);
 			return res == MF_RECOVERED ? 0 : -EBUSY;
 		} else if (res < 0) {
@@ -1538,16 +1525,32 @@  static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	}
 
 	lock_page(head);
+
+	/*
+	 * The page could have changed compound pages due to race window.
+	 * If this happens just bail out.
+	 */
+	if (!PageHuge(p) || compound_head(p) != head) {
+		action_result(pfn, MF_MSG_DIFFERENT_PAGE_SIZE, MF_IGNORED);
+		res = -EBUSY;
+		goto out;
+	}
+
 	page_flags = head->flags;
 
 	if (hwpoison_filter(p)) {
-		if (TestClearPageHWPoison(head))
-			num_poisoned_pages_dec();
-		put_page(p);
+		put = true;
 		res = -EOPNOTSUPP;
 		goto out;
 	}
 
+	if (TestSetPageHWPoison(head)) {
+		put = already_hwpoisoned = true;
+		goto out;
+	}
+
+	num_poisoned_pages_inc();
+
 	/*
 	 * TODO: hwpoison for pud-sized hugetlb doesn't work right now, so
 	 * simply disable it. In order to make it work properly, we need
@@ -1572,6 +1575,14 @@  static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	return identify_page_state(pfn, p, page_flags);
 out:
 	unlock_page(head);
+	if (put)
+		put_page(p);
+	if (already_hwpoisoned) {
+		pr_err("Memory failure: %#lx: already hardware poisoned\n", pfn);
+		res = -EHWPOISON;
+		if (flags & MF_ACTION_REQUIRED)
+			res = kill_accessing_process(current, page_to_pfn(head), flags);
+	}
 	return res;
 }