| Message ID | 20240603092439.3360652-5-wangkefeng.wang@huawei.com |
|---|---|
| State | New |
| Series | mm: migrate: support poison recover from migrate folio |
On 6/3/2024 2:24 AM, Kefeng Wang wrote:
> The folio migration is widely used in kernel, memory compaction, memory
> hotplug, soft offline page, numa balance, memory demote/promotion, etc,
> but once access a poisoned source folio when migrating, the kerenl will
> panic.
>
> There is a mechanism in the kernel to recover from uncorrectable memory
> errors, ARCH_HAS_COPY_MC, which is already used in other core-mm paths,
> eg, CoW, khugepaged, coredump, ksm copy, see copy_mc_to_{user,kernel},
> copy_mc_{user_}highpage callers.
>
> In order to support poisoned folio copy recover from migrate folio, we
> chose to make folio migration tolerant of memory failures and return
> error for folio migration, because folio migration is no guarantee
> of success, this could avoid the similar panic shown below.
>
> CPU: 1 PID: 88343 Comm: test_softofflin Kdump: loaded Not tainted 6.6.0
> pc : copy_page+0x10/0xc0
> lr : copy_highpage+0x38/0x50

I'm curious how you managed to test this case. I mean, you trigger a
soft_offline while a source page with UE is being migrated, and then
folio_copy() triggers an MCE and the system panics. Did you use a bad dimm?

> ...
> Call trace:
> copy_page+0x10/0xc0
> folio_copy+0x78/0x90
> migrate_folio_extra+0x54/0xa0
> move_to_new_folio+0xd8/0x1f0
> migrate_folio_move+0xb8/0x300
> migrate_pages_batch+0x528/0x788
> migrate_pages_sync+0x8c/0x258
> migrate_pages+0x440/0x528
> soft_offline_in_use_page+0x2ec/0x3c0
> soft_offline_page+0x238/0x310
> soft_offline_page_store+0x6c/0xc0
> dev_attr_store+0x20/0x40
> sysfs_kf_write+0x4c/0x68
> kernfs_fop_write_iter+0x130/0x1c8
> new_sync_write+0xa4/0x138
> vfs_write+0x238/0x2d8
> ksys_write+0x74/0x110
>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
> mm/migrate.c | 23 ++++++++++++++++++-----
> 1 file changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e930376c261a..28aa9da95781 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -663,16 +663,29 @@ static int __migrate_folio(struct address_space *mapping, struct folio *dst,
> struct folio *src, void *src_private,
> enum migrate_mode mode)
> {
> - int rc;
> + int ret, expected_cnt = folio_expected_refs(mapping, src);
>
> - rc = folio_migrate_mapping(mapping, dst, src, 0);
> - if (rc != MIGRATEPAGE_SUCCESS)
> - return rc;
> + if (!mapping) {
> + if (folio_ref_count(src) != expected_cnt)
> + return -EAGAIN;
> + } else {
> + if (!folio_ref_freeze(src, expected_cnt))
> + return -EAGAIN;
> + }
> +

Let me take a guess, the reason you split up folio_migrate_copy() is that
folio_mc_copy() should be done before the 'src' folio's ->flags is
changed, right?

Is there any other reason? Could you add a comment please?

> + ret = folio_mc_copy(dst, src);
> + if (unlikely(ret)) {
> + if (mapping)
> + folio_ref_unfreeze(src, expected_cnt);
> + return ret;
> + }
> +
> + __folio_migrate_mapping(mapping, dst, src, expected_cnt);
> if (src_private)
> folio_attach_private(dst, folio_detach_private(src));
> - folio_migrate_copy(dst, src);
> + folio_migrate_flags(dst, src);
> return MIGRATEPAGE_SUCCESS;
> }

thanks,
-jane
On 6/6/2024 2:27 PM, Jane Chu wrote:
> On 6/3/2024 2:24 AM, Kefeng Wang wrote:
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index e930376c261a..28aa9da95781 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -663,16 +663,29 @@ static int __migrate_folio(struct address_space
>> *mapping, struct folio *dst,
>> struct folio *src, void *src_private,
>> enum migrate_mode mode)
>> {
>> - int rc;
>> + int ret, expected_cnt = folio_expected_refs(mapping, src);
>> - rc = folio_migrate_mapping(mapping, dst, src, 0);
>> - if (rc != MIGRATEPAGE_SUCCESS)
>> - return rc;
>> + if (!mapping) {
>> + if (folio_ref_count(src) != expected_cnt)
>> + return -EAGAIN;
>> + } else {
>> + if (!folio_ref_freeze(src, expected_cnt))
>> + return -EAGAIN;
>> + }
>> +
>
> Let me take a guess, the reason you split up folio_migrate_copy() is that
>
> folio_mc_copy() should be done before the 'src' folio's ->flags is
> changed, right?
>
> Is there any other reason? Could you add a comment please?

I see, both the clearing of the 'dirty' bit in the source folio, and the
xas_store of the new folio to the mapping, these need to be done after
folio_mc_copy considering in the event of UE, memory_failure() is called
to handle the poison in the source page.

That said, since the poisoned page was queued up and handling is
asynchronous, so in theory, there is an extremely unlikely chance that
memory_failure() is invoked after folio_migrate_mapping(), do you think
things would still be cool?

thanks,
-jane

>
>> + ret = folio_mc_copy(dst, src);
>> + if (unlikely(ret)) {
>> + if (mapping)
>> + folio_ref_unfreeze(src, expected_cnt);
>> + return ret;
>> + }
>> +
>> + __folio_migrate_mapping(mapping, dst, src, expected_cnt);
>> if (src_private)
>> folio_attach_private(dst, folio_detach_private(src));
>> - folio_migrate_copy(dst, src);
>> + folio_migrate_flags(dst, src);
>> return MIGRATEPAGE_SUCCESS;
>> }
>
> thanks,
>
> -jane
On 6/6/2024 3:28 PM, Jane Chu wrote:
> On 6/6/2024 2:27 PM, Jane Chu wrote:
>
>> On 6/3/2024 2:24 AM, Kefeng Wang wrote:
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index e930376c261a..28aa9da95781 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -663,16 +663,29 @@ static int __migrate_folio(struct
>>> address_space *mapping, struct folio *dst,
>>> struct folio *src, void *src_private,
>>> enum migrate_mode mode)
>>> {
>>> - int rc;
>>> + int ret, expected_cnt = folio_expected_refs(mapping, src);
>>> - rc = folio_migrate_mapping(mapping, dst, src, 0);
>>> - if (rc != MIGRATEPAGE_SUCCESS)
>>> - return rc;
>>> + if (!mapping) {
>>> + if (folio_ref_count(src) != expected_cnt)
>>> + return -EAGAIN;
>>> + } else {
>>> + if (!folio_ref_freeze(src, expected_cnt))
>>> + return -EAGAIN;
>>> + }
>>> +
>>
>> Let me take a guess, the reason you split up folio_migrate_copy() is
>> that
>>
>> folio_mc_copy() should be done before the 'src' folio's ->flags is
>> changed, right?
>>
>> Is there any other reason? Could you add a comment please?
>
> I see, both the clearing of the 'dirty' bit in the source folio, and
> the xas_store of the
>
> new folio to the mapping, these need to be done after folio_mc_copy
> considering in the
>
> event of UE, memory_failure() is called to handle the poison in the
> source page.
>
> That said, since the poisoned page was queued up and handling is
> asynchronous, so in
>
> theory, there is an extremely unlikely chance that memory_failure() is
> invoked after
>
> folio_migrate_mapping(), do you think things would still be cool?

Hmm, perhaps after xas_store, the source folio->mapping should be set to NULL.

thanks,
-jane

>
>
> thanks,
>
> -jane
>
>>
>>> + ret = folio_mc_copy(dst, src);
>>> + if (unlikely(ret)) {
>>> + if (mapping)
>>> + folio_ref_unfreeze(src, expected_cnt);
>>> + return ret;
>>> + }
>>> +
>>> + __folio_migrate_mapping(mapping, dst, src, expected_cnt);
>>> if (src_private)
>>> folio_attach_private(dst, folio_detach_private(src));
>>> - folio_migrate_copy(dst, src);
>>> + folio_migrate_flags(dst, src);
>>> return MIGRATEPAGE_SUCCESS;
>>> }
>>
>> thanks,
>>
>> -jane
On 2024/6/7 6:31, Jane Chu wrote:
>
> On 6/6/2024 3:28 PM, Jane Chu wrote:
>> On 6/6/2024 2:27 PM, Jane Chu wrote:
>>
>>> On 6/3/2024 2:24 AM, Kefeng Wang wrote:
>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index e930376c261a..28aa9da95781 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -663,16 +663,29 @@ static int __migrate_folio(struct
>>>> address_space *mapping, struct folio *dst,
>>>> struct folio *src, void *src_private,
>>>> enum migrate_mode mode)
>>>> {
>>>> - int rc;
>>>> + int ret, expected_cnt = folio_expected_refs(mapping, src);
>>>> - rc = folio_migrate_mapping(mapping, dst, src, 0);
>>>> - if (rc != MIGRATEPAGE_SUCCESS)
>>>> - return rc;
>>>> + if (!mapping) {
>>>> + if (folio_ref_count(src) != expected_cnt)
>>>> + return -EAGAIN;
>>>> + } else {
>>>> + if (!folio_ref_freeze(src, expected_cnt))
>>>> + return -EAGAIN;
>>>> + }
>>>> +
>>>
>>> Let me take a guess, the reason you split up folio_migrate_copy() is
>>> that
>>>
>>> folio_mc_copy() should be done before the 'src' folio's ->flags is
>>> changed, right?
>>>
>>> Is there any other reason? Could you add a comment please?
>>
>> I see, both the clearing of the 'dirty' bit in the source folio, and
>> the xas_store of the
>>
>> new folio to the mapping, these need to be done after folio_mc_copy
>> considering in the

Yes, many metadata are changed, and also some statistic(lruvec_state),
so we have to move folio_copy() ahead.

>>
>> event of UE, memory_failure() is called to handle the poison in the
>> source page.
>>
>> That said, since the poisoned page was queued up and handling is
>> asynchronous, so in
>>
>> theory, there is an extremely unlikely chance that memory_failure() is
>> invoked after
>>
>> folio_migrate_mapping(), do you think things would still be cool?
>
> Hmm, perhaps after xas_store, the source folio->mapping should be set to
> NULL.

When the folio_mc_copy() return -EHWPOISON, we never call
folio_migrate_mapping(), the source folio is not changed, so
it should be safe to handle the source folio by a asynchronous
memory_failure(),
maybe I'm missing something?

PS: we test it via error injection to dimm and then soft offline memory.

Thanks.
On 6/6/2024 9:01 PM, Kefeng Wang wrote:
>
>
> On 2024/6/7 6:31, Jane Chu wrote:
>>
>> On 6/6/2024 3:28 PM, Jane Chu wrote:
>>> On 6/6/2024 2:27 PM, Jane Chu wrote:
>>>
>>>> On 6/3/2024 2:24 AM, Kefeng Wang wrote:
>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>> index e930376c261a..28aa9da95781 100644
>>>>> --- a/mm/migrate.c
>>>>> +++ b/mm/migrate.c
>>>>> @@ -663,16 +663,29 @@ static int __migrate_folio(struct
>>>>> address_space *mapping, struct folio *dst,
>>>>> struct folio *src, void *src_private,
>>>>> enum migrate_mode mode)
>>>>> {
>>>>> - int rc;
>>>>> + int ret, expected_cnt = folio_expected_refs(mapping, src);
>>>>> - rc = folio_migrate_mapping(mapping, dst, src, 0);
>>>>> - if (rc != MIGRATEPAGE_SUCCESS)
>>>>> - return rc;
>>>>> + if (!mapping) {
>>>>> + if (folio_ref_count(src) != expected_cnt)
>>>>> + return -EAGAIN;
>>>>> + } else {
>>>>> + if (!folio_ref_freeze(src, expected_cnt))
>>>>> + return -EAGAIN;
>>>>> + }
>>>>> +
>>>>
>>>> Let me take a guess, the reason you split up folio_migrate_copy()
>>>> is that
>>>>
>>>> folio_mc_copy() should be done before the 'src' folio's ->flags is
>>>> changed, right?
>>>>
>>>> Is there any other reason? Could you add a comment please?
>>>
>>> I see, both the clearing of the 'dirty' bit in the source folio, and
>>> the xas_store of the
>>>
>>> new folio to the mapping, these need to be done after folio_mc_copy
>>> considering in the
>
> Yes, many metadata are changed, and also some statistic(lruvec_state),
> so we have to move folio_copy() ahead.
>
>
>>>
>>> event of UE, memory_failure() is called to handle the poison in the
>>> source page.
>>>
>>> That said, since the poisoned page was queued up and handling is
>>> asynchronous, so in
>>>
>>> theory, there is an extremely unlikely chance that memory_failure()
>>> is invoked after
>>>
>>> folio_migrate_mapping(), do you think things would still be cool?
>>
>> Hmm, perhaps after xas_store, the source folio->mapping should be set
>> to NULL.
>
> When the folio_mc_copy() return -EHWPOISON, we never call
> folio_migrate_mapping(), the source folio is not changed, so
> it should be safe to handle the source folio by a asynchronous
> memory_failure(),

Right, I omitted this part, thanks!

> maybe I'm missing something?
>
> PS: we test it via error injection to dimm and then soft offline memory.

Got it.

Reviewed-by: Jane Chu <jane.chu@oracle.com>

thanks,
-jane

>
> Thanks.
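The folio_mc_copy() helper that the discussion above revolves around is introduced elsewhere in this series and is not part of the hunk below. As a rough sketch only (not necessarily the actual implementation), assuming the existing copy_mc_highpage() primitive, which returns -EHWPOISON when the copy hits an uncorrectable error, such a helper could look like this:

/*
 * Illustrative sketch only: copy a folio page by page with a
 * machine-check-safe copy and report -EHWPOISON instead of letting
 * the poisoned access panic the kernel.
 */
static int folio_mc_copy(struct folio *dst, struct folio *src)
{
	long i, nr = folio_nr_pages(src);

	for (i = 0; i < nr; i++) {
		if (copy_mc_highpage(folio_page(dst, i), folio_page(src, i)))
			return -EHWPOISON;
		cond_resched();
	}

	return 0;
}

On the error path, __migrate_folio() below only unfreezes the source folio's reference count; the source folio itself has not been modified, which is what makes the deferred, asynchronous memory_failure() handling safe, as discussed in the thread.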
diff --git a/mm/migrate.c b/mm/migrate.c
index e930376c261a..28aa9da95781 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -663,16 +663,29 @@ static int __migrate_folio(struct address_space *mapping, struct folio *dst,
 			   struct folio *src, void *src_private,
 			   enum migrate_mode mode)
 {
-	int rc;
+	int ret, expected_cnt = folio_expected_refs(mapping, src);
 
-	rc = folio_migrate_mapping(mapping, dst, src, 0);
-	if (rc != MIGRATEPAGE_SUCCESS)
-		return rc;
+	if (!mapping) {
+		if (folio_ref_count(src) != expected_cnt)
+			return -EAGAIN;
+	} else {
+		if (!folio_ref_freeze(src, expected_cnt))
+			return -EAGAIN;
+	}
+
+	ret = folio_mc_copy(dst, src);
+	if (unlikely(ret)) {
+		if (mapping)
+			folio_ref_unfreeze(src, expected_cnt);
+		return ret;
+	}
+
+	__folio_migrate_mapping(mapping, dst, src, expected_cnt);
 
 	if (src_private)
 		folio_attach_private(dst, folio_detach_private(src));
 
-	folio_migrate_copy(dst, src);
+	folio_migrate_flags(dst, src);
 	return MIGRATEPAGE_SUCCESS;
 }
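The expected_cnt used for the refcount check and freeze in the hunk above comes from folio_expected_refs(), an existing helper in mm/migrate.c. From memory it computes roughly the following; it is shown here only as a reminder of where that number comes from:

static int folio_expected_refs(struct address_space *mapping,
			       struct folio *folio)
{
	int refs = 1;	/* the reference held by the migration caller */

	if (!mapping)
		return refs;

	refs += folio_nr_pages(folio);	/* one reference per page cache slot */
	if (folio_test_private(folio))
		refs++;			/* attached private data, e.g. buffer heads */

	return refs;
}

The !mapping case in the patch only checks the count instead of freezing it, mirroring what folio_migrate_mapping() already does for folios without a mapping.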
The folio migration is widely used in the kernel: memory compaction, memory
hotplug, soft offline page, NUMA balancing, memory demotion/promotion, etc.
But once a poisoned source folio is accessed while migrating, the kernel
will panic.

There is a mechanism in the kernel to recover from uncorrectable memory
errors, ARCH_HAS_COPY_MC, which is already used in other core-mm paths,
e.g. CoW, khugepaged, coredump and ksm copy; see the copy_mc_to_{user,kernel}
and copy_mc_{user_}highpage callers.

In order to support recovery from a poisoned folio copy when migrating a
folio, we chose to make folio migration tolerant of memory failures and
return an error from folio migration; since folio migration is not
guaranteed to succeed anyway, this avoids panics similar to the one shown
below.

CPU: 1 PID: 88343 Comm: test_softofflin Kdump: loaded Not tainted 6.6.0
pc : copy_page+0x10/0xc0
lr : copy_highpage+0x38/0x50
...
Call trace:
 copy_page+0x10/0xc0
 folio_copy+0x78/0x90
 migrate_folio_extra+0x54/0xa0
 move_to_new_folio+0xd8/0x1f0
 migrate_folio_move+0xb8/0x300
 migrate_pages_batch+0x528/0x788
 migrate_pages_sync+0x8c/0x258
 migrate_pages+0x440/0x528
 soft_offline_in_use_page+0x2ec/0x3c0
 soft_offline_page+0x238/0x310
 soft_offline_page_store+0x6c/0xc0
 dev_attr_store+0x20/0x40
 sysfs_kf_write+0x4c/0x68
 kernfs_fop_write_iter+0x130/0x1c8
 new_sync_write+0xa4/0x138
 vfs_write+0x238/0x2d8
 ksys_write+0x74/0x110

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 mm/migrate.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)
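For context, the ARCH_HAS_COPY_MC machinery mentioned above is what turns an uncorrectable memory error hit during a copy into an error return instead of a fatal machine check. From memory, the copy_mc_highpage() wrapper in include/linux/highmem.h (available when CONFIG_ARCH_HAS_COPY_MC is set) looks roughly like this, simplified:

/*
 * Simplified sketch of the existing wrapper: copy_mc_to_kernel()
 * returns the number of bytes left uncopied when the machine-check-safe
 * copy trips over poisoned memory, and that is turned into -EHWPOISON.
 */
static inline int copy_mc_highpage(struct page *to, struct page *from)
{
	unsigned long ret;
	char *vfrom, *vto;

	vto = kmap_local_page(to);
	vfrom = kmap_local_page(from);
	ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
	kunmap_local(vfrom);
	kunmap_local(vto);

	return ret ? -EHWPOISON : 0;
}

This is the same per-page primitive the CoW, khugepaged, coredump and ksm paths listed above rely on, and the natural building block for a folio-wide machine-check-safe copy.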