diff mbox series

[RFC] mm/memory-failure.c: fix memory failure race with memory offline

Message ID 20220226094034.23938-1-linmiaohe@huawei.com (mailing list archive)
State New
Headers show
Series [RFC] mm/memory-failure.c: fix memory failure race with memory offline | expand

Commit Message

Miaohe Lin Feb. 26, 2022, 9:40 a.m. UTC
There is a theoretical race window between memory failure and memory
offline. Think about the below scene:

  CPU A					  CPU B
memory_failure				offline_pages
  mutex_lock(&mf_mutex);
  TestSetPageHWPoison(p)
					  start_isolate_page_range
					    has_unmovable_pages
					      --PageHWPoison is movable
					  do {
					    scan_movable_pages
					    do_migrate_range
					      --PageHWPoison isn't migrated
					  }
					  test_pages_isolated
					    --PageHWPoison is isolated
					remove_memory
  access page... bang
  ...

When PageHWPoison is set, the page could be offlined anytime regardless
of the page refcnt. It's bacause start_isolate_page_range treats HWPoison
page as movable and already isolated, so the page range can be successfully
isolated. soft_offline_page and unpoison_memory have the similar race. Fix
this by using get_online_mems + put_online_mems pair to guard aginst memory
offline when doing memory failure.

There is a even worse race window. If the page refcnt is held, then memory
failure happens, the page could be offlined while it's still in use. So The
assumption that a page can not be offlined when the page refcnt is held is
now broken. This theoretical race window could happen in every vm activity.
But this race window might be too small to fix.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/memory-failure.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

Comments

Naoya Horiguchi Feb. 28, 2022, 12:04 p.m. UTC | #1
On Sat, Feb 26, 2022 at 05:40:34PM +0800, Miaohe Lin wrote:
> There is a theoretical race window between memory failure and memory
> offline. Think about the below scene:
> 
>   CPU A					  CPU B
> memory_failure				offline_pages
>   mutex_lock(&mf_mutex);
>   TestSetPageHWPoison(p)
> 					  start_isolate_page_range
> 					    has_unmovable_pages
> 					      --PageHWPoison is movable
> 					  do {
> 					    scan_movable_pages
> 					    do_migrate_range
> 					      --PageHWPoison isn't migrated
> 					  }
> 					  test_pages_isolated
> 					    --PageHWPoison is isolated
> 					remove_memory
>   access page... bang
>   ...
> 
> When PageHWPoison is set, the page could be offlined anytime regardless
> of the page refcnt. It's bacause start_isolate_page_range treats HWPoison
> page as movable and already isolated, so the page range can be successfully
> isolated. soft_offline_page and unpoison_memory have the similar race. Fix
> this by using get_online_mems + put_online_mems pair to guard aginst memory
> offline when doing memory failure.

Sounds convincing to me. Thanks for identifying. Is this problem reproduced
in your testing, or just picked out by code inspection?

> 
> There is a even worse race window. If the page refcnt is held, then memory
> failure happens, the page could be offlined while it's still in use. So The
> assumption that a page can not be offlined when the page refcnt is held is
> now broken. This theoretical race window could happen in every vm activity.
> But this race window might be too small to fix.

Yes, hwpoisoned pages can now be offlined while they have refcount > 0.
I think that they need to be categorize into two groups based on
whether the error page was successfully handled or not.

If a error page is successfully handled, then page_handle_poison() succeeded
and the refcount should be one (which is held only by hwpoison subsystem
itself, so there should be no other reference to it), so it should be safely
offlined as we do now.  But if error handling failed, the hwpoisoned page
have any value and there could remain some reference to it.  So we might
better to make offline_pages() fail for such "failed handling" pages, or for
all hwpoisoned pages with refcount more than one?

> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/memory-failure.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 5444a8ef4867..b85232a64104 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1702,6 +1702,7 @@ int memory_failure(unsigned long pfn, int flags)
>  	if (!sysctl_memory_failure_recovery)
>  		panic("Memory failure on page %lx", pfn);
> 
> +	get_online_mems();
>  	mutex_lock(&mf_mutex);
> 
>  	p = pfn_to_online_page(pfn);
> @@ -1894,11 +1895,13 @@ int memory_failure(unsigned long pfn, int flags)
>  identify_page_state:
>  	res = identify_page_state(pfn, p, page_flags);
>  	mutex_unlock(&mf_mutex);
> +	put_online_mems();
>  	return res;
>  unlock_page:
>  	unlock_page(p);
>  unlock_mutex:
>  	mutex_unlock(&mf_mutex);
> +	put_online_mems();
>  	return res;
>  }
>  EXPORT_SYMBOL_GPL(memory_failure);
> @@ -2058,6 +2061,7 @@ int unpoison_memory(unsigned long pfn)
>  	if (!pfn_valid(pfn))
>  		return -ENXIO;
> 
> +	get_online_mems();
>  	p = pfn_to_page(pfn);
>  	page = compound_head(p);
> 
> @@ -2114,6 +2118,7 @@ int unpoison_memory(unsigned long pfn)
> 
>  unlock_mutex:
>  	mutex_unlock(&mf_mutex);
> +	put_online_mems();
>  	return ret;
>  }
>  EXPORT_SYMBOL(unpoison_memory);
> @@ -2278,10 +2283,12 @@ int soft_offline_page(unsigned long pfn, int flags)
>  	if (flags & MF_COUNT_INCREASED)
>  		ref_page = pfn_to_page(pfn);
> 
> +	get_online_mems();

I felt that {get,put}_online_mems() can be put together with takeing/freeing
mf_mutex with some wrapper lock/unlock function, which slightly improves
code readability (developers won't have to care about two locking separately).
That requires moving mutex_lock(&mf_mutex), which could have non-trivial
change, but maybe that's not so bad.

Thanks,
Naoya Horiguchi

>  	/* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
>  	page = pfn_to_online_page(pfn);
>  	if (!page) {
>  		put_ref_page(ref_page);
> +		put_online_mems();
>  		return -EIO;
>  	}
> 
> @@ -2291,13 +2298,12 @@ int soft_offline_page(unsigned long pfn, int flags)
>  		pr_info("%s: %#lx page already poisoned\n", __func__, pfn);
>  		put_ref_page(ref_page);
>  		mutex_unlock(&mf_mutex);
> +		put_online_mems();
>  		return 0;
>  	}
> 
>  retry:
> -	get_online_mems();
>  	ret = get_hwpoison_page(page, flags);
> -	put_online_mems();
> 
>  	if (ret > 0) {
>  		ret = soft_offline_in_use_page(page);
> @@ -2310,6 +2316,7 @@ int soft_offline_page(unsigned long pfn, int flags)
>  	}
> 
>  	mutex_unlock(&mf_mutex);
> +	put_online_mems();
> 
>  	return ret;
>  }
> ---
> 2.23.0
>
Miaohe Lin March 1, 2022, 3:32 a.m. UTC | #2
On 2022/2/28 20:04, Naoya Horiguchi wrote:
> On Sat, Feb 26, 2022 at 05:40:34PM +0800, Miaohe Lin wrote:
>> There is a theoretical race window between memory failure and memory
>> offline. Think about the below scene:
>>
>>   CPU A					  CPU B
>> memory_failure				offline_pages
>>   mutex_lock(&mf_mutex);
>>   TestSetPageHWPoison(p)
>> 					  start_isolate_page_range
>> 					    has_unmovable_pages
>> 					      --PageHWPoison is movable
>> 					  do {
>> 					    scan_movable_pages
>> 					    do_migrate_range
>> 					      --PageHWPoison isn't migrated
>> 					  }
>> 					  test_pages_isolated
>> 					    --PageHWPoison is isolated
>> 					remove_memory
>>   access page... bang
>>   ...
>>
>> When PageHWPoison is set, the page could be offlined anytime regardless
>> of the page refcnt. It's bacause start_isolate_page_range treats HWPoison
>> page as movable and already isolated, so the page range can be successfully
>> isolated. soft_offline_page and unpoison_memory have the similar race. Fix
>> this by using get_online_mems + put_online_mems pair to guard aginst memory
>> offline when doing memory failure.
> 
> Sounds convincing to me. Thanks for identifying. Is this problem reproduced
> in your testing, or just picked out by code inspection?

It's picked out by code inspection. I am investigating the memory failure and
memory offline code again recently.

> 
>>
>> There is a even worse race window. If the page refcnt is held, then memory
>> failure happens, the page could be offlined while it's still in use. So The
>> assumption that a page can not be offlined when the page refcnt is held is
>> now broken. This theoretical race window could happen in every vm activity.
>> But this race window might be too small to fix.
> 
> Yes, hwpoisoned pages can now be offlined while they have refcount > 0.
> I think that they need to be categorize into two groups based on
> whether the error page was successfully handled or not.
> 
> If a error page is successfully handled, then page_handle_poison() succeeded
> and the refcount should be one (which is held only by hwpoison subsystem
> itself, so there should be no other reference to it), so it should be safely
> offlined as we do now.  But if error handling failed, the hwpoisoned page
> have any value and there could remain some reference to it.  So we might
> better to make offline_pages() fail for such "failed handling" pages, or for
> all hwpoisoned pages with refcount more than one?

I'am tending to agree with you. If error handling failed, the hwpoisoned page
could reside in page cache, swap cache and so on. And if they are offlined,
there could be many problems.

But it seems there are some corner cases. If we failed to get_hwpoison_page,
PageHWPoison is set but page refcnt is not increased. So the hwpoisoned page
might be still in-use while refcnt = 1. Thus this in-use page could be offlined
as refcnt is not more than one. :(

> 
>>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> ---
>>  mm/memory-failure.c | 11 +++++++++--
>>  1 file changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 5444a8ef4867..b85232a64104 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1702,6 +1702,7 @@ int memory_failure(unsigned long pfn, int flags)
>>  	if (!sysctl_memory_failure_recovery)
>>  		panic("Memory failure on page %lx", pfn);
>>
>> +	get_online_mems();
>>  	mutex_lock(&mf_mutex);
>>
>>  	p = pfn_to_online_page(pfn);
>> @@ -1894,11 +1895,13 @@ int memory_failure(unsigned long pfn, int flags)
>>  identify_page_state:
>>  	res = identify_page_state(pfn, p, page_flags);
>>  	mutex_unlock(&mf_mutex);
>> +	put_online_mems();
>>  	return res;
>>  unlock_page:
>>  	unlock_page(p);
>>  unlock_mutex:
>>  	mutex_unlock(&mf_mutex);
>> +	put_online_mems();
>>  	return res;
>>  }
>>  EXPORT_SYMBOL_GPL(memory_failure);
>> @@ -2058,6 +2061,7 @@ int unpoison_memory(unsigned long pfn)
>>  	if (!pfn_valid(pfn))
>>  		return -ENXIO;
>>
>> +	get_online_mems();
>>  	p = pfn_to_page(pfn);
>>  	page = compound_head(p);
>>
>> @@ -2114,6 +2118,7 @@ int unpoison_memory(unsigned long pfn)
>>
>>  unlock_mutex:
>>  	mutex_unlock(&mf_mutex);
>> +	put_online_mems();
>>  	return ret;
>>  }
>>  EXPORT_SYMBOL(unpoison_memory);
>> @@ -2278,10 +2283,12 @@ int soft_offline_page(unsigned long pfn, int flags)
>>  	if (flags & MF_COUNT_INCREASED)
>>  		ref_page = pfn_to_page(pfn);
>>
>> +	get_online_mems();
> 
> I felt that {get,put}_online_mems() can be put together with takeing/freeing
> mf_mutex with some wrapper lock/unlock function, which slightly improves
> code readability (developers won't have to care about two locking separately).
> That requires moving mutex_lock(&mf_mutex), which could have non-trivial
> change, but maybe that's not so bad.

Sounds good. Developers shouldn't have to care about two locking separately. Will
try to do it. Thanks.

> 
> Thanks,
> Naoya Horiguchi

Many thanks for your reply and great suggestion. :)

> 
>>  	/* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
>>  	page = pfn_to_online_page(pfn);
>>  	if (!page) {
>>  		put_ref_page(ref_page);
>> +		put_online_mems();
>>  		return -EIO;
>>  	}
>>
>> @@ -2291,13 +2298,12 @@ int soft_offline_page(unsigned long pfn, int flags)
>>  		pr_info("%s: %#lx page already poisoned\n", __func__, pfn);
>>  		put_ref_page(ref_page);
>>  		mutex_unlock(&mf_mutex);
>> +		put_online_mems();
>>  		return 0;
>>  	}
>>
>>  retry:
>> -	get_online_mems();
>>  	ret = get_hwpoison_page(page, flags);
>> -	put_online_mems();
>>
>>  	if (ret > 0) {
>>  		ret = soft_offline_in_use_page(page);
>> @@ -2310,6 +2316,7 @@ int soft_offline_page(unsigned long pfn, int flags)
>>  	}
>>
>>  	mutex_unlock(&mf_mutex);
>> +	put_online_mems();
>>
>>  	return ret;
>>  }
>> ---
>> 2.23.0
>>
> .
>
David Hildenbrand March 1, 2022, 9:53 a.m. UTC | #3
On 26.02.22 10:40, Miaohe Lin wrote:
> There is a theoretical race window between memory failure and memory
> offline. Think about the below scene:
> 
>   CPU A					  CPU B
> memory_failure				offline_pages
>   mutex_lock(&mf_mutex);
>   TestSetPageHWPoison(p)
> 					  start_isolate_page_range
> 					    has_unmovable_pages
> 					      --PageHWPoison is movable
> 					  do {
> 					    scan_movable_pages
> 					    do_migrate_range
> 					      --PageHWPoison isn't migrated
> 					  }
> 					  test_pages_isolated
> 					    --PageHWPoison is isolated
> 					remove_memory
>   access page... bang
>   ...

I think the motivation for the offlining code was to not block memory
hotunplug (especially on ZONE_MOVABLE) just because there is a
HWpoisoned page. But how often does that happen?

It's all semi-broken either way. Assume you just offlined a memory block
with a hwpoisoned page. The memmap is stale and the information about
hwpoison is lost. You can happily re-online that memory block and use
*all* memory, including previously hwpoisoned memory. Note that this
used to be different in the past, when the memmap was initialized when
adding memory, not when onlining that memory.


IMHO, we should stop special casing hwpoison. Either fail offlining
completely if we stumble over a hwpoisoned page, or allow offlining only
if the refcount==0 -- just as any other page.
Miaohe Lin March 1, 2022, 1:22 p.m. UTC | #4
On 2022/3/1 17:53, David Hildenbrand wrote:
> On 26.02.22 10:40, Miaohe Lin wrote:
>> There is a theoretical race window between memory failure and memory
>> offline. Think about the below scene:
>>
>>   CPU A					  CPU B
>> memory_failure				offline_pages
>>   mutex_lock(&mf_mutex);
>>   TestSetPageHWPoison(p)
>> 					  start_isolate_page_range
>> 					    has_unmovable_pages
>> 					      --PageHWPoison is movable
>> 					  do {
>> 					    scan_movable_pages
>> 					    do_migrate_range
>> 					      --PageHWPoison isn't migrated
>> 					  }
>> 					  test_pages_isolated
>> 					    --PageHWPoison is isolated
>> 					remove_memory
>>   access page... bang
>>   ...
> 
> I think the motivation for the offlining code was to not block memory
> hotunplug (especially on ZONE_MOVABLE) just because there is a
> HWpoisoned page. But how often does that happen?

This should be really race. The memory failure itself shouldn't be common
otherwise we have other problems.

> 
> It's all semi-broken either way. Assume you just offlined a memory block
> with a hwpoisoned page. The memmap is stale and the information about
> hwpoison is lost. You can happily re-online that memory block and use
> *all* memory, including previously hwpoisoned memory. Note that this

Agree. This is how it works now. But it seems the hwpoisoned memory might can
be used again as normal memory after offline+online.

> used to be different in the past, when the memmap was initialized when
> adding memory, not when onlining that memory.
> 
> 
> IMHO, we should stop special casing hwpoison. Either fail offlining
> completely if we stumble over a hwpoisoned page, or allow offlining only
> if the refcount==0 -- just as any other page.
> 

I'm not sure whether this "rare" race condition worth fixing. But the problem
is there and we might come across it. Failing offlining completely sounds not
that good but it looks hard to reliably detect the "offline-safe" hwpoisoned page.
I can't come out a solution...

Many thanks for reply and comment. :)

>
Miaohe Lin March 10, 2022, 1:04 p.m. UTC | #5
On 2022/3/1 17:53, David Hildenbrand wrote:
> On 26.02.22 10:40, Miaohe Lin wrote:
>> There is a theoretical race window between memory failure and memory
>> offline. Think about the below scene:
>>
>>   CPU A					  CPU B
>> memory_failure				offline_pages
>>   mutex_lock(&mf_mutex);
>>   TestSetPageHWPoison(p)
>> 					  start_isolate_page_range
>> 					    has_unmovable_pages
>> 					      --PageHWPoison is movable
>> 					  do {
>> 					    scan_movable_pages
>> 					    do_migrate_range
>> 					      --PageHWPoison isn't migrated
>> 					  }
>> 					  test_pages_isolated
>> 					    --PageHWPoison is isolated
>> 					remove_memory
>>   access page... bang
>>   ...
> 
> I think the motivation for the offlining code was to not block memory
> hotunplug (especially on ZONE_MOVABLE) just because there is a
> HWpoisoned page. But how often does that happen?
> 
> It's all semi-broken either way. Assume you just offlined a memory block
> with a hwpoisoned page. The memmap is stale and the information about
> hwpoison is lost. You can happily re-online that memory block and use
> *all* memory, including previously hwpoisoned memory. Note that this
> used to be different in the past, when the memmap was initialized when
> adding memory, not when onlining that memory.
> 
> 
> IMHO, we should stop special casing hwpoison. Either fail offlining
> completely if we stumble over a hwpoisoned page, or allow offlining only
> if the refcount==0 -- just as any other page.
> 
> 

IIUC, there is no easy way to found out whether a hwpoinsoned page could be
safely offlined. If memory_failure succeeds, page refcnt should be 1. But if
failed, page refcnt is unknown. So it seems failing offlining completely if
we stumble over a hwpoisoned page is most suitable way to close the race. But
is this too overkill for such rare cases? Any suggestions?

Many thanks!
diff mbox series

Patch

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5444a8ef4867..b85232a64104 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1702,6 +1702,7 @@  int memory_failure(unsigned long pfn, int flags)
 	if (!sysctl_memory_failure_recovery)
 		panic("Memory failure on page %lx", pfn);
 
+	get_online_mems();
 	mutex_lock(&mf_mutex);
 
 	p = pfn_to_online_page(pfn);
@@ -1894,11 +1895,13 @@  int memory_failure(unsigned long pfn, int flags)
 identify_page_state:
 	res = identify_page_state(pfn, p, page_flags);
 	mutex_unlock(&mf_mutex);
+	put_online_mems();
 	return res;
 unlock_page:
 	unlock_page(p);
 unlock_mutex:
 	mutex_unlock(&mf_mutex);
+	put_online_mems();
 	return res;
 }
 EXPORT_SYMBOL_GPL(memory_failure);
@@ -2058,6 +2061,7 @@  int unpoison_memory(unsigned long pfn)
 	if (!pfn_valid(pfn))
 		return -ENXIO;
 
+	get_online_mems();
 	p = pfn_to_page(pfn);
 	page = compound_head(p);
 
@@ -2114,6 +2118,7 @@  int unpoison_memory(unsigned long pfn)
 
 unlock_mutex:
 	mutex_unlock(&mf_mutex);
+	put_online_mems();
 	return ret;
 }
 EXPORT_SYMBOL(unpoison_memory);
@@ -2278,10 +2283,12 @@  int soft_offline_page(unsigned long pfn, int flags)
 	if (flags & MF_COUNT_INCREASED)
 		ref_page = pfn_to_page(pfn);
 
+	get_online_mems();
 	/* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
 	page = pfn_to_online_page(pfn);
 	if (!page) {
 		put_ref_page(ref_page);
+		put_online_mems();
 		return -EIO;
 	}
 
@@ -2291,13 +2298,12 @@  int soft_offline_page(unsigned long pfn, int flags)
 		pr_info("%s: %#lx page already poisoned\n", __func__, pfn);
 		put_ref_page(ref_page);
 		mutex_unlock(&mf_mutex);
+		put_online_mems();
 		return 0;
 	}
 
 retry:
-	get_online_mems();
 	ret = get_hwpoison_page(page, flags);
-	put_online_mems();
 
 	if (ret > 0) {
 		ret = soft_offline_in_use_page(page);
@@ -2310,6 +2316,7 @@  int soft_offline_page(unsigned long pfn, int flags)
 	}
 
 	mutex_unlock(&mf_mutex);
+	put_online_mems();
 
 	return ret;
 }