
[v4,4/6] mm: migrate: support poisoned recover from migrate folio

Message ID 20240603092439.3360652-5-wangkefeng.wang@huawei.com (mailing list archive)
State New
Series mm: migrate: support poison recover from migrate folio | expand

Commit Message

Kefeng Wang June 3, 2024, 9:24 a.m. UTC
Folio migration is widely used in the kernel: memory compaction, memory
hotplug, soft offline, NUMA balancing, memory demotion/promotion, etc.
However, once a poisoned source folio is accessed during migration, the
kernel panics.

There is a mechanism in the kernel to recover from uncorrectable memory
errors, ARCH_HAS_COPY_MC, which is already used in other core-mm paths,
eg, CoW, khugepaged, coredump, ksm copy, see copy_mc_to_{user,kernel},
copy_mc_{user_}highpage callers.

In order to support recovery when copying a poisoned folio during
migration, make folio migration tolerant of memory failures and return
an error instead; since folio migration is never guaranteed to succeed
anyway, this avoids panics like the one shown below.

  CPU: 1 PID: 88343 Comm: test_softofflin Kdump: loaded Not tainted 6.6.0
  pc : copy_page+0x10/0xc0
  lr : copy_highpage+0x38/0x50
  ...
  Call trace:
   copy_page+0x10/0xc0
   folio_copy+0x78/0x90
   migrate_folio_extra+0x54/0xa0
   move_to_new_folio+0xd8/0x1f0
   migrate_folio_move+0xb8/0x300
   migrate_pages_batch+0x528/0x788
   migrate_pages_sync+0x8c/0x258
   migrate_pages+0x440/0x528
   soft_offline_in_use_page+0x2ec/0x3c0
   soft_offline_page+0x238/0x310
   soft_offline_page_store+0x6c/0xc0
   dev_attr_store+0x20/0x40
   sysfs_kf_write+0x4c/0x68
   kernfs_fop_write_iter+0x130/0x1c8
   new_sync_write+0xa4/0x138
   vfs_write+0x238/0x2d8
   ksys_write+0x74/0x110

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 mm/migrate.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

Comments

Jane Chu June 6, 2024, 9:27 p.m. UTC | #1
On 6/3/2024 2:24 AM, Kefeng Wang wrote:

> The folio migration is widely used in kernel, memory compaction, memory
> hotplug, soft offline page, numa balance, memory demote/promotion, etc,
> but once access a poisoned source folio when migrating, the kerenl will
> panic.
>
> There is a mechanism in the kernel to recover from uncorrectable memory
> errors, ARCH_HAS_COPY_MC, which is already used in other core-mm paths,
> eg, CoW, khugepaged, coredump, ksm copy, see copy_mc_to_{user,kernel},
> copy_mc_{user_}highpage callers.
>
> In order to support poisoned folio copy recover from migrate folio, we
> chose to make folio migration tolerant of memory failures and return
> error for folio migration, because folio migration is no guarantee
> of success, this could avoid the similar panic shown below.
>
>    CPU: 1 PID: 88343 Comm: test_softofflin Kdump: loaded Not tainted 6.6.0
>    pc : copy_page+0x10/0xc0
>    lr : copy_highpage+0x38/0x50

I'm curious how you managed to test this case. I mean, you trigger a
soft_offline, and while a source page with a UE is being migrated,
folio_copy() triggers an MCE and the system panics. Did you use a bad
DIMM?

>    ...
>    Call trace:
>     copy_page+0x10/0xc0
>     folio_copy+0x78/0x90
>     migrate_folio_extra+0x54/0xa0
>     move_to_new_folio+0xd8/0x1f0
>     migrate_folio_move+0xb8/0x300
>     migrate_pages_batch+0x528/0x788
>     migrate_pages_sync+0x8c/0x258
>     migrate_pages+0x440/0x528
>     soft_offline_in_use_page+0x2ec/0x3c0
>     soft_offline_page+0x238/0x310
>     soft_offline_page_store+0x6c/0xc0
>     dev_attr_store+0x20/0x40
>     sysfs_kf_write+0x4c/0x68
>     kernfs_fop_write_iter+0x130/0x1c8
>     new_sync_write+0xa4/0x138
>     vfs_write+0x238/0x2d8
>     ksys_write+0x74/0x110
>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
>   mm/migrate.c | 23 ++++++++++++++++++-----
>   1 file changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e930376c261a..28aa9da95781 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -663,16 +663,29 @@ static int __migrate_folio(struct address_space *mapping, struct folio *dst,
>   			   struct folio *src, void *src_private,
>   			   enum migrate_mode mode)
>   {
> -	int rc;
> +	int ret, expected_cnt = folio_expected_refs(mapping, src);
>   
> -	rc = folio_migrate_mapping(mapping, dst, src, 0);
> -	if (rc != MIGRATEPAGE_SUCCESS)
> -		return rc;
> +	if (!mapping) {
> +		if (folio_ref_count(src) != expected_cnt)
> +			return -EAGAIN;
> +	} else {
> +		if (!folio_ref_freeze(src, expected_cnt))
> +			return -EAGAIN;
> +	}
> +

Let me take a guess: the reason you split up folio_migrate_copy() is
that folio_mc_copy() should be done before the 'src' folio's ->flags is
changed, right?

Is there any other reason? Could you add a comment please?

> +	ret = folio_mc_copy(dst, src);
> +	if (unlikely(ret)) {
> +		if (mapping)
> +			folio_ref_unfreeze(src, expected_cnt);
> +		return ret;
> +	}
> +
> +	__folio_migrate_mapping(mapping, dst, src, expected_cnt);
>   
>   	if (src_private)
>   		folio_attach_private(dst, folio_detach_private(src));
>   
> -	folio_migrate_copy(dst, src);
> +	folio_migrate_flags(dst, src);
>   	return MIGRATEPAGE_SUCCESS;
>   }
>   

thanks,

-jane
Jane Chu June 6, 2024, 10:28 p.m. UTC | #2
On 6/6/2024 2:27 PM, Jane Chu wrote:

> On 6/3/2024 2:24 AM, Kefeng Wang wrote:
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index e930376c261a..28aa9da95781 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -663,16 +663,29 @@ static int __migrate_folio(struct address_space 
>> *mapping, struct folio *dst,
>>                  struct folio *src, void *src_private,
>>                  enum migrate_mode mode)
>>   {
>> -    int rc;
>> +    int ret, expected_cnt = folio_expected_refs(mapping, src);
>>   -    rc = folio_migrate_mapping(mapping, dst, src, 0);
>> -    if (rc != MIGRATEPAGE_SUCCESS)
>> -        return rc;
>> +    if (!mapping) {
>> +        if (folio_ref_count(src) != expected_cnt)
>> +            return -EAGAIN;
>> +    } else {
>> +        if (!folio_ref_freeze(src, expected_cnt))
>> +            return -EAGAIN;
>> +    }
>> +
>
> Let me take a guess, the reason you split up folio_migrate_copy() is that
>
> folio_mc_copy() should be done before the 'src' folio's ->flags is 
> changed, right?
>
> Is there any other reason?  Could you add a comment please?

I see: both the clearing of the 'dirty' bit in the source folio and the
xas_store of the new folio into the mapping need to be done after
folio_mc_copy(), considering that in the event of a UE, memory_failure()
is called to handle the poison in the source page.

That said, since the poisoned page was queued up and handling is
asynchronous, in theory there is an extremely unlikely chance that
memory_failure() is invoked after folio_migrate_mapping(). Do you think
things would still be cool?


thanks,

-jane

>
>> +    ret = folio_mc_copy(dst, src);
>> +    if (unlikely(ret)) {
>> +        if (mapping)
>> +            folio_ref_unfreeze(src, expected_cnt);
>> +        return ret;
>> +    }
>> +
>> +    __folio_migrate_mapping(mapping, dst, src, expected_cnt);
>>         if (src_private)
>>           folio_attach_private(dst, folio_detach_private(src));
>>   -    folio_migrate_copy(dst, src);
>> +    folio_migrate_flags(dst, src);
>>       return MIGRATEPAGE_SUCCESS;
>>   }
>
> thanks,
>
> -jane
>
>
Jane Chu June 6, 2024, 10:31 p.m. UTC | #3
On 6/6/2024 3:28 PM, Jane Chu wrote:
> On 6/6/2024 2:27 PM, Jane Chu wrote:
>
>> On 6/3/2024 2:24 AM, Kefeng Wang wrote:
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index e930376c261a..28aa9da95781 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -663,16 +663,29 @@ static int __migrate_folio(struct 
>>> address_space *mapping, struct folio *dst,
>>>                  struct folio *src, void *src_private,
>>>                  enum migrate_mode mode)
>>>   {
>>> -    int rc;
>>> +    int ret, expected_cnt = folio_expected_refs(mapping, src);
>>>   -    rc = folio_migrate_mapping(mapping, dst, src, 0);
>>> -    if (rc != MIGRATEPAGE_SUCCESS)
>>> -        return rc;
>>> +    if (!mapping) {
>>> +        if (folio_ref_count(src) != expected_cnt)
>>> +            return -EAGAIN;
>>> +    } else {
>>> +        if (!folio_ref_freeze(src, expected_cnt))
>>> +            return -EAGAIN;
>>> +    }
>>> +
>>
>> Let me take a guess, the reason you split up folio_migrate_copy() is 
>> that
>>
>> folio_mc_copy() should be done before the 'src' folio's ->flags is 
>> changed, right?
>>
>> Is there any other reason?  Could you add a comment please?
>
> I see: both the clearing of the 'dirty' bit in the source folio and
> the xas_store of the new folio into the mapping need to be done after
> folio_mc_copy(), considering that in the event of a UE,
> memory_failure() is called to handle the poison in the source page.
>
> That said, since the poisoned page was queued up and handling is
> asynchronous, in theory there is an extremely unlikely chance that
> memory_failure() is invoked after folio_migrate_mapping(). Do you
> think things would still be cool?

Hmm, perhaps after xas_store, the source folio->mapping should be set
to NULL.

thanks,

-jane

>
>
> thanks,
>
> -jane
>
>>
>>> +    ret = folio_mc_copy(dst, src);
>>> +    if (unlikely(ret)) {
>>> +        if (mapping)
>>> +            folio_ref_unfreeze(src, expected_cnt);
>>> +        return ret;
>>> +    }
>>> +
>>> +    __folio_migrate_mapping(mapping, dst, src, expected_cnt);
>>>         if (src_private)
>>>           folio_attach_private(dst, folio_detach_private(src));
>>>   -    folio_migrate_copy(dst, src);
>>> +    folio_migrate_flags(dst, src);
>>>       return MIGRATEPAGE_SUCCESS;
>>>   }
>>
>> thanks,
>>
>> -jane
>>
>>
>
Kefeng Wang June 7, 2024, 4:01 a.m. UTC | #4
On 2024/6/7 6:31, Jane Chu wrote:
> 
> On 6/6/2024 3:28 PM, Jane Chu wrote:
>> On 6/6/2024 2:27 PM, Jane Chu wrote:
>>
>>> On 6/3/2024 2:24 AM, Kefeng Wang wrote:
>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index e930376c261a..28aa9da95781 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -663,16 +663,29 @@ static int __migrate_folio(struct 
>>>> address_space *mapping, struct folio *dst,
>>>>                  struct folio *src, void *src_private,
>>>>                  enum migrate_mode mode)
>>>>   {
>>>> -    int rc;
>>>> +    int ret, expected_cnt = folio_expected_refs(mapping, src);
>>>>   -    rc = folio_migrate_mapping(mapping, dst, src, 0);
>>>> -    if (rc != MIGRATEPAGE_SUCCESS)
>>>> -        return rc;
>>>> +    if (!mapping) {
>>>> +        if (folio_ref_count(src) != expected_cnt)
>>>> +            return -EAGAIN;
>>>> +    } else {
>>>> +        if (!folio_ref_freeze(src, expected_cnt))
>>>> +            return -EAGAIN;
>>>> +    }
>>>> +
>>>
>>> Let me take a guess, the reason you split up folio_migrate_copy() is 
>>> that
>>>
>>> folio_mc_copy() should be done before the 'src' folio's ->flags is 
>>> changed, right?
>>>
>>> Is there any other reason?  Could you add a comment please?
>>
>> I see, both the clearing of the 'dirty' bit in the source folio, and 
>> the xas_store of the
>>
>> new folio to the mapping, these need to be done after folio_mc_copy 
>> considering in the

Yes, a lot of metadata is changed, and also some statistics
(lruvec state), so we have to move the folio copy ahead.


>>
>> event of UE, memory_failure() is called to handle the poison in the 
>> source page.
>>
>> That said, since the poisoned page was queued up and handling is 
>> asynchronous, so in
>>
>> theory, there is an extremely unlikely chance that memory_failure() is 
>> invoked after
>>
>> folio_migrate_mapping(), do you think things would still be cool?
> 
> Hmm, perhaps after xas_store, the source folio->mapping should be set to 
> NULL.

When folio_mc_copy() returns -EHWPOISON, we never call
folio_migrate_mapping(); the source folio is not changed, so it should
be safe to handle the source folio via an asynchronous
memory_failure(). Maybe I'm missing something?

PS: we test it via error injection into a DIMM and then soft-offlining
the memory.
Thanks.
Jane Chu June 7, 2024, 3:59 p.m. UTC | #5
On 6/6/2024 9:01 PM, Kefeng Wang wrote:

>
>
> On 2024/6/7 6:31, Jane Chu wrote:
>>
>> On 6/6/2024 3:28 PM, Jane Chu wrote:
>>> On 6/6/2024 2:27 PM, Jane Chu wrote:
>>>
>>>> On 6/3/2024 2:24 AM, Kefeng Wang wrote:
>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>> index e930376c261a..28aa9da95781 100644
>>>>> --- a/mm/migrate.c
>>>>> +++ b/mm/migrate.c
>>>>> @@ -663,16 +663,29 @@ static int __migrate_folio(struct 
>>>>> address_space *mapping, struct folio *dst,
>>>>>                  struct folio *src, void *src_private,
>>>>>                  enum migrate_mode mode)
>>>>>   {
>>>>> -    int rc;
>>>>> +    int ret, expected_cnt = folio_expected_refs(mapping, src);
>>>>>   -    rc = folio_migrate_mapping(mapping, dst, src, 0);
>>>>> -    if (rc != MIGRATEPAGE_SUCCESS)
>>>>> -        return rc;
>>>>> +    if (!mapping) {
>>>>> +        if (folio_ref_count(src) != expected_cnt)
>>>>> +            return -EAGAIN;
>>>>> +    } else {
>>>>> +        if (!folio_ref_freeze(src, expected_cnt))
>>>>> +            return -EAGAIN;
>>>>> +    }
>>>>> +
>>>>
>>>> Let me take a guess, the reason you split up folio_migrate_copy() 
>>>> is that
>>>>
>>>> folio_mc_copy() should be done before the 'src' folio's ->flags is 
>>>> changed, right?
>>>>
>>>> Is there any other reason?  Could you add a comment please?
>>>
>>> I see, both the clearing of the 'dirty' bit in the source folio, and 
>>> the xas_store of the
>>>
>>> new folio to the mapping, these need to be done after folio_mc_copy 
>>> considering in the
>
> Yes, many metadata are changed, and also some statistic(lruvec_state), 
> so we have to move folio_copy() ahead.
>
>
>>>
>>> event of UE, memory_failure() is called to handle the poison in the 
>>> source page.
>>>
>>> That said, since the poisoned page was queued up and handling is 
>>> asynchronous, so in
>>>
>>> theory, there is an extremely unlikely chance that memory_failure() 
>>> is invoked after
>>>
>>> folio_migrate_mapping(), do you think things would still be cool?
>>
>> Hmm, perhaps after xas_store, the source folio->mapping should be set 
>> to NULL.
>
> When the folio_mc_copy() return -EHWPOISON, we never call
> folio_migrate_mapping(), the source folio is not changed, so
> it should be safe to handle the source folio by a asynchronous
> memory_failure(), 
Right, I omitted this part, thanks!
> maybe I'm missing something?
>
> PS: we test it via error injection to dimm and then soft offline memory.

Got it.

Reviewed-by: Jane Chu <jane.chu@oracle.com>

thanks,

-jane

>
> Thanks.

Patch

diff --git a/mm/migrate.c b/mm/migrate.c
index e930376c261a..28aa9da95781 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -663,16 +663,29 @@  static int __migrate_folio(struct address_space *mapping, struct folio *dst,
 			   struct folio *src, void *src_private,
 			   enum migrate_mode mode)
 {
-	int rc;
+	int ret, expected_cnt = folio_expected_refs(mapping, src);
 
-	rc = folio_migrate_mapping(mapping, dst, src, 0);
-	if (rc != MIGRATEPAGE_SUCCESS)
-		return rc;
+	if (!mapping) {
+		if (folio_ref_count(src) != expected_cnt)
+			return -EAGAIN;
+	} else {
+		if (!folio_ref_freeze(src, expected_cnt))
+			return -EAGAIN;
+	}
+
+	ret = folio_mc_copy(dst, src);
+	if (unlikely(ret)) {
+		if (mapping)
+			folio_ref_unfreeze(src, expected_cnt);
+		return ret;
+	}
+
+	__folio_migrate_mapping(mapping, dst, src, expected_cnt);
 
 	if (src_private)
 		folio_attach_private(dst, folio_detach_private(src));
 
-	folio_migrate_copy(dst, src);
+	folio_migrate_flags(dst, src);
 	return MIGRATEPAGE_SUCCESS;
 }