Message ID | 20210224151619.67c29731@alex-virtual-machine (mailing list archive)
---|---
State | New, archived
Series | mm,hwpoison: return -EBUSY when page already poisoned
On 24.02.21 08:16, Aili Yao wrote: > When the page is already poisoned, another memory_failure() call in the > same page now return 0, meaning OK. For nested memory mce handling, this > behavior may lead real serious problem, Example: > > 1.When LCME is enabled, and there are two processes A && B running on > different core X && Y separately, which will access one same page, then > the page corrupted when process A access it, a MCE will be rasied to > core X and the error process is just underway. > > 2.Then B access the page and trigger another MCE to core Y, it will also > do error process, it will see TestSetPageHWPoison be true, and 0 is > returned. > > 3.The kill_me_maybe will check the return: > > 1244 static void kill_me_maybe(struct callback_head *cb) > 1245 { > > 1254 if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) && > 1255 !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) { > 1256 set_mce_nospec(p->mce_addr >> PAGE_SHIFT, > p->mce_whole_page); > 1257 sync_core(); > 1258 return; > 1259 } > > 1267 } > > 4. The error process for B will end, and may nothing happened if > kill-early is not set, We may let the wrong data go into effect. > > For other cases which care the return value of memory_failure() should > check why they want to process a memory error which have already been > processed. This behavior seems reasonable. > > In kill_me_maybe, log the fact about the memory may not recovered, and > we will kill the related process. > Is -EBUSY then the right return value? I'd expect if it's already poisoned that we would get something like EHWPOISON. Does this affect existing user space interfaces (especially, via madvise?)?
On Wed, Feb 24, 2021 at 03:16:19PM +0800, Aili Yao wrote: > When the page is already poisoned, another memory_failure() call in the > same page now return 0, meaning OK. For nested memory mce handling, this > behavior may lead real serious problem, Example: I have some questions: > 1.When LCME is enabled, and there are two processes A && B running on > different core X && Y separately, which will access one same page, then > the page corrupted when process A access it, a MCE will be rasied to > core X and the error process is just underway. When !LMCE, that is not a problem because new MCE needs to wait for the ongoing MCE? > 2.Then B access the page and trigger another MCE to core Y, it will also > do error process, it will see TestSetPageHWPoison be true, and 0 is > returned. For non-nested calls, that is no problem because the page will be taken out of business(unmapped from the processes), right? So no more MCE are possible. > > 3.The kill_me_maybe will check the return: > > 1244 static void kill_me_maybe(struct callback_head *cb) > 1245 { > > 1254 if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) && > 1255 !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) { > 1256 set_mce_nospec(p->mce_addr >> PAGE_SHIFT, So, IIUC, in case of a LMCE nested call, the second MCE will reach here. set_mce_nospec() will either mark the underlying page as not mapped/cached. Should not have memory_failure()->hwpoison_user_mappings() unmapped the page from both process A and B? Or this is in case the ongoing MCE(process A) has not still unmapped anything, so process B can still access this page. So with your change, process B will be sent a SIGBUG, while process A is still handling the MCE, right? > p->mce_whole_page); > 1257 sync_core(); > 1258 return; > 1259 } > > 1267 } > > 4. The error process for B will end, and may nothing happened if > kill-early is not set, We may let the wrong data go into effect. > > For other cases which care the return value of memory_failure() should > check why they want to process a memory error which have already been > processed. This behavior seems reasonable. > > In kill_me_maybe, log the fact about the memory may not recovered, and > we will kill the related process. > > Signed-off-by: Aili Yao <yaoaili@kingsoft.com> > --- > arch/x86/kernel/cpu/mce/core.c | 2 ++ > mm/memory-failure.c | 4 ++-- > 2 files changed, 4 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c > index e133ce1e562b..db4afc5bf15a 100644 > --- a/arch/x86/kernel/cpu/mce/core.c > +++ b/arch/x86/kernel/cpu/mce/core.c > @@ -1259,6 +1259,8 @@ static void kill_me_maybe(struct callback_head *cb) > } > > if (p->mce_vaddr != (void __user *)-1l) { > + pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", > + p->mce_addr >> PAGE_SHIFT, p->comm, p->pid); > force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT); > } else { > pr_err("Memory error not recovered"); > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index e9481632fcd1..06f006174b8c 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -1224,7 +1224,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags) > if (TestSetPageHWPoison(head)) { > pr_err("Memory failure: %#lx: already hardware poisoned\n", > pfn); > - return 0; > + return -EBUSY; As David said, madvise_inject_error() will start returning -EBUSY now in case we madvise(MADV_HWPOISON) on an already hwpoisoned page. 
AFAICS, memory_failure() can return 0, -Eerrors, and MF_XXX. Would it make sense to unify that? That way we could declare error codes that make some sense (like MF_ALREADY_HWPOISONED).
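A rough caller-side sketch of what such a unified convention could look like; MF_ALREADY_HWPOISONED is only a name invented in this discussion and does not exist in the kernel:

/*
 * Hypothetical sketch, assuming memory_failure() reported the "already
 * poisoned" race with a dedicated code.  MF_ALREADY_HWPOISONED is an
 * invented name for illustration only.
 */
static void handle_mf_result(unsigned long pfn, int flags, bool whole_page)
{
	int ret = memory_failure(pfn, flags);

	if (ret == 0) {
		/* This caller handled the page: safe to change its cache mode. */
		set_mce_nospec(pfn, whole_page);
	} else if (ret == MF_ALREADY_HWPOISONED) {
		/*
		 * Another CPU won the race.  The page is poisoned but this
		 * task may still have it mapped, so it still has to be
		 * signalled; do not touch the cache mode here.
		 */
		force_sig(SIGBUS);
	} else {
		/* Genuine failure (-EBUSY, -ENXIO, ...): also fall back to killing. */
		force_sig(SIGBUS);
	}
}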
On Wed, 24 Feb 2021 11:31:55 +0100 Oscar Salvador <osalvador@suse.de> wrote: > I have some questions: > > > 1.When LCME is enabled, and there are two processes A && B running on > > different core X && Y separately, which will access one same page, then > > the page corrupted when process A access it, a MCE will be rasied to > > core X and the error process is just underway. > > When !LMCE, that is not a problem because new MCE needs to wait for the ongoing MCE? I am not sure whether this case will happen when !LMCE, when I realized this place may be an issue I tried to reproduce it and my configuration is LMCE enabled. > > 2.Then B access the page and trigger another MCE to core Y, it will also > > do error process, it will see TestSetPageHWPoison be true, and 0 is > > returned. > > For non-nested calls, that is no problem because the page will be taken out > of business(unmapped from the processes), right? So no more MCE are possible. Yes, I think after the recovery jod is finished, other processes still access the page will meet a page fault and error will be returned; > > > > 3.The kill_me_maybe will check the return: > > > > 1244 static void kill_me_maybe(struct callback_head *cb) > > 1245 { > > > > 1254 if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) && > > 1255 !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) { > > 1256 set_mce_nospec(p->mce_addr >> PAGE_SHIFT, > > So, IIUC, in case of a LMCE nested call, the second MCE will reach here. > set_mce_nospec() will either mark the underlying page as not mapped/cached. > This set_mce_nospec() is not proper when the recovery job is on the fly. In my test this function failed. > Should not have memory_failure()->hwpoison_user_mappings() unmapped the page > from both process A and B? Or this is in case the ongoing MCE(process A) has > not still unmapped anything, so process B can still access this page. > What I care is the process B triggered the error again after process A, I don't know how it return and proceed. > So with your change, process B will be sent a SIGBUG, while process A is still > handling the MCE, right? Right! > > p->mce_whole_page); > > 1257 sync_core(); > > 1258 return; > > 1259 } > > > > 1267 } > > > > 4. The error process for B will end, and may nothing happened if > > kill-early is not set, We may let the wrong data go into effect. > > > > For other cases which care the return value of memory_failure() should > > check why they want to process a memory error which have already been > > processed. This behavior seems reasonable. > > > > In kill_me_maybe, log the fact about the memory may not recovered, and > > we will kill the related process. 
> > > > Signed-off-by: Aili Yao <yaoaili@kingsoft.com> > > --- > > arch/x86/kernel/cpu/mce/core.c | 2 ++ > > mm/memory-failure.c | 4 ++-- > > 2 files changed, 4 insertions(+), 2 deletions(-) > > > > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c > > index e133ce1e562b..db4afc5bf15a 100644 > > --- a/arch/x86/kernel/cpu/mce/core.c > > +++ b/arch/x86/kernel/cpu/mce/core.c > > @@ -1259,6 +1259,8 @@ static void kill_me_maybe(struct callback_head *cb) > > } > > > > if (p->mce_vaddr != (void __user *)-1l) { > > + pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", > > + p->mce_addr >> PAGE_SHIFT, p->comm, p->pid); > > force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT); > > } else { > > pr_err("Memory error not recovered"); > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > index e9481632fcd1..06f006174b8c 100644 > > --- a/mm/memory-failure.c > > +++ b/mm/memory-failure.c > > @@ -1224,7 +1224,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags) > > if (TestSetPageHWPoison(head)) { > > pr_err("Memory failure: %#lx: already hardware poisoned\n", > > pfn); > > - return 0; > > + return -EBUSY; > > As David said, madvise_inject_error() will start returning -EBUSY now in case > we madvise(MADV_HWPOISON) on an already hwpoisoned page. > > AFAICS, memory_failure() can return 0, -Eerrors, and MF_XXX. > Would it make sense to unify that? That way we could declare error codes that > make somse sense (like MF_ALREADY_HWPOISONED). > @David: I checked the code again, and find a few places will care the exact return value, like: 1: drivers/base/memory.c:483: ret = memory_failure(pfn, 0); This is for hard page offline, I see the code in mcelog: static void offline_action(struct mempage *mp, u64 addr) { if (offline <= OFFLINE_ACCOUNT) return; Lprintf("Offlining page %llx\n", addr); if (memory_offline(addr) < 0) { Lprintf("Offlining page %llx failed: %s\n", addr, strerror(errno)); mp->offlined = PAGE_OFFLINE_FAILED; } else mp->offlined = PAGE_OFFLINE; } I think return an negative value will be more proper? As the related killing function may not be performed, and we can't say it's a success operation? 2:mm/hwpoison-inject.c:51: return memory_failure(pfn, 0); mm/madvise.c:910: ret = memory_failure(pfn, MF_COUNT_INCREASED); These two cases are mainly for error injections, I checked the test codes, mostly it only care if the value is 0 or < 0; I do the related test, normally it work well, but for stress test, sometimes in some case, I do meet some fail cases along with the -EBUSY return. I will dig more. Other place will only care if the return value is 0. or just ignore it. Hi naoya, what's your opnion for this possible issue, I need your inputs! Thanks Aili Yao
On Thu, Feb 25, 2021 at 11:43:29AM +0800, Aili Yao wrote: > On Wed, 24 Feb 2021 11:31:55 +0100 Oscar Salvador <osalvador@suse.de> wrote: ... > > > > > > > 3.The kill_me_maybe will check the return: > > > > > > 1244 static void kill_me_maybe(struct callback_head *cb) > > > 1245 { > > > > > > 1254 if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) && > > > 1255 !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) { > > > 1256 set_mce_nospec(p->mce_addr >> PAGE_SHIFT, > > > > So, IIUC, in case of a LMCE nested call, the second MCE will reach here. > > set_mce_nospec() will either mark the underlying page as not mapped/cached. > > > This set_mce_nospec() is not proper when the recovery job is on the fly. In my test > this function failed. Hi Aili, I agree that this set_mce_nospec() is not expected to be called for "already hwpoisoned" page because in the reported case the error page is already contained and no need to resort changing cache mode. ... > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > index e9481632fcd1..06f006174b8c 100644 > > > --- a/mm/memory-failure.c > > > +++ b/mm/memory-failure.c > > > @@ -1224,7 +1224,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags) > > > if (TestSetPageHWPoison(head)) { > > > pr_err("Memory failure: %#lx: already hardware poisoned\n", > > > pfn); > > > - return 0; > > > + return -EBUSY; > > > > As David said, madvise_inject_error() will start returning -EBUSY now in case > > we madvise(MADV_HWPOISON) on an already hwpoisoned page. > > > > AFAICS, memory_failure() can return 0, -Eerrors, and MF_XXX. > > Would it make sense to unify that? That way we could declare error codes that > > make somse sense (like MF_ALREADY_HWPOISONED). It seems to me that memory_failure() does not return MF_XXX. But yes, returning some positive value for the reported case could be a solution. > > > > @David: > > I checked the code again, and find a few places will care the exact return value, like: > > 1: drivers/base/memory.c:483: ret = memory_failure(pfn, 0); > This is for hard page offline, I see the code in mcelog: > static void offline_action(struct mempage *mp, u64 addr) > { > if (offline <= OFFLINE_ACCOUNT) > return; > Lprintf("Offlining page %llx\n", addr); > if (memory_offline(addr) < 0) { > Lprintf("Offlining page %llx failed: %s\n", addr, strerror(errno)); > mp->offlined = PAGE_OFFLINE_FAILED; > } else > mp->offlined = PAGE_OFFLINE; > } > I think return an negative value will be more proper? As the related killing function may not be performed, and we can't say > it's a success operation? > > 2:mm/hwpoison-inject.c:51: return memory_failure(pfn, 0); > mm/madvise.c:910: ret = memory_failure(pfn, MF_COUNT_INCREASED); > > These two cases are mainly for error injections, I checked the test codes, mostly it only care if the value is 0 or < 0; > I do the related test, normally it work well, but for stress test, sometimes in some case, I do meet some fail cases along with the -EBUSY return. > I will dig more. > > Other place will only care if the return value is 0. or just ignore it. > > Hi naoya, what's your opnion for this possible issue, I need your inputs! We could use some negative value (error code) to report the reported case, then as you mentioned above, some callers need change to handle the new case, and the same is true if you use some positive value. My preference is -EHWPOISON, but other options are fine if justified well. Thanks, Naoya Horiguchi
On Thu, Feb 25, 2021 at 11:28:18AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote: > Hi Aili, > > I agree that this set_mce_nospec() is not expected to be called for > "already hwpoisoned" page because in the reported case the error > page is already contained and no need to resort changing cache mode. Out of curiosity, what is the current behaviour now? Say we have an ongoing MCE which has marked the page as HWPoison but memory_failure did not take any action on the page yet. And then, we have another MCE, which ends up there. set_mce_nospec might clear _PAGE_PRESENT bit. Does that have any impact on the first MCE? > It seems to me that memory_failure() does not return MF_XXX. But yes, > returning some positive value for the reported case could be a solution. No, you are right. I somehow managed to confuse myself. I see now that MF_XXX return codes are filtered out in page_action. > We could use some negative value (error code) to report the reported case, > then as you mentioned above, some callers need change to handle the > new case, and the same is true if you use some positive value. > My preference is -EHWPOISON, but other options are fine if justified well. -EHWPOISON seems like a good fit.
On Thu, Feb 25, 2021 at 12:39:30PM +0100, Oscar Salvador wrote: > On Thu, Feb 25, 2021 at 11:28:18AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote: > > Hi Aili, > > > > I agree that this set_mce_nospec() is not expected to be called for > > "already hwpoisoned" page because in the reported case the error > > page is already contained and no need to resort changing cache mode. > > Out of curiosity, what is the current behavour now? > Say we have an ongoing MCE which has marked the page as HWPoison but > memory_failure did not take any action on the page yet. > And then, we have another MCE, which ends up there. > set_mce_nospec might clear _PAGE_PRESENT bit. > > Does that have any impact on the first MCE? Hi Oscar, Thank you for shedding light on this, this race looks worrisome to me. We call try_to_unmap() inside memory_failure(), where we find affected ptes by page_vma_mapped_walk() and convert into hwpoison entires in try_to_unmap_one(). So there seems two racy cases: 1) CPU 0 CPU 1 page_vma_mapped_walk clear _PAGE_PRESENT bit // skipped the entry 2) CPU 0 CPU 1 page_vma_mapped_walk try_to_unmap_one clear _PAGE_PRESENT bit convert the entry set_pte_at In case 1, the affected processes get signals on later access, so although the info in SIGBUS could be different, that's OK. And we have no impact in case 2. Thanks, Naoya Horiguchi
On Thu, Feb 25, 2021 at 12:38:06PM +0000, HORIGUCHI NAOYA(堀口 直也) wrote: > Thank you for shedding light on this, this race looks worrisome to me. > We call try_to_unmap() inside memory_failure(), where we find affected > ptes by page_vma_mapped_walk() and convert into hwpoison entires in > try_to_unmap_one(). So there seems two racy cases: > > 1) > CPU 0 CPU 1 > page_vma_mapped_walk > clear _PAGE_PRESENT bit > // skipped the entry > > 2) > CPU 0 CPU 1 > page_vma_mapped_walk > try_to_unmap_one > clear _PAGE_PRESENT bit > convert the entry > set_pte_at > > In case 1, the affected processes get signals on later access, > so although the info in SIGBUS could be different, that's OK. > And we have no impact in case 2. I've been debugging a similar issue for a few days and finally got enough traces to partially understand what happened. The test case is a multi-threaded pointer chasing micro-benchmark running on all logical CPUs. We then inject poison into the address space of the process. All works fine if one thread consumes poison and completes all Linux machine check processing before any other threads read the poison. The page is unmapped, a SIGBUS is sent (which kills all threads). But in the problem case I see: CPU1 reads poison, takes a machine check. Gets to the kill_me_maybe() task work, which calls memory_failure() this CPU sets the page poison flag, but is still executing the rest of the flow to hunt down tasks/mappings to invalidate pages and send SIGBUS if required. CPU2 reads the poison. When it gets to memory_failure() there's an early return because the poison flag is already set. So in current code it returns and takes the machine check again. CPU3 reads the poison and starts along same path that CPU2 did. Meanwhile CPU1 gets far enough along in memory failure and hits a problem. It prints: [ 1867.409837] Memory failure: 0x42a9ff6: reserved kernel page still referenced by 1 users [ 1867.409850] Memory failure: 0x42a9ff6: recovery action for reserved kernel page: Failed and doesn't complete unmapping the page that CPU2 and CPU3 are touching. Other CPUs gradually reach the poison and join in the fun of repeatedly taking machine checks. I have not yet tracked why this user access is reporting as a "reserved kernel page". Some traces showed that futex(2) syscall was in use from this benchmark, so maybe the kernel locked a user page that was a contended futex??? Idea for what we should do next ... Now that x86 is calling memory_failure() from user context ... maybe parallel calls for the same page should be blocked until the first caller completes so we can: a) know that pages are unmapped (if that happens) b) all get the same success/fail status -Tony
On Thu, Feb 25, 2021 at 10:15:42AM -0800, Luck, Tony wrote: > On Thu, Feb 25, 2021 at 12:38:06PM +0000, HORIGUCHI NAOYA(堀口 直也) wrote: > > Thank you for shedding light on this, this race looks worrisome to me. > > We call try_to_unmap() inside memory_failure(), where we find affected > > ptes by page_vma_mapped_walk() and convert into hwpoison entires in > > try_to_unmap_one(). So there seems two racy cases: > > > > 1) > > CPU 0 CPU 1 > > page_vma_mapped_walk > > clear _PAGE_PRESENT bit > > // skipped the entry > > > > 2) > > CPU 0 CPU 1 > > page_vma_mapped_walk > > try_to_unmap_one > > clear _PAGE_PRESENT bit > > convert the entry > > set_pte_at > > > > In case 1, the affected processes get signals on later access, > > so although the info in SIGBUS could be different, that's OK. > > And we have no impact in case 2. > > I've been debugging a similar issue for a few days and finally > got enough traces to partially understand what happened. > > The test case is a multi-threaded pointer chasing micro-benchmark > running on all logical CPUs. We then inject poison into the address > space of the process. > > All works fine if one thread consumes poison and completes all > Linux machine check processing before any other threads read the > poison. The page is unmapped, a SIGBUS is sent (which kills all > threads). > > But in the problem case I see: Thanks for the description, it's helpful to understand the problem. > > CPU1 reads poison, takes a machine check. Gets to the > kill_me_maybe() task work, which calls memory_failure() > this CPU sets the page poison flag, but is still executing the > rest of the flow to hunt down tasks/mappings to invalidate pages > and send SIGBUS if required. > > CPU2 reads the poison. When it gets to memory_failure() > there's an early return because the poison flag is already > set. So in current code it returns and takes the machine > check again. > > CPU3 reads the poison and starts along same path that CPU2 > did. I think that the MCE loop happening on CPU2 and CPU3 is unexpected and these threads should immediately kill the current process on each CPU. force_sig_mceerr() in kill_me_maybe() is supposed to do it, so Aili's patch would fix this issue too? > > Meanwhile CPU1 gets far enough along in memory failure and hits > a problem. It prints: > > [ 1867.409837] Memory failure: 0x42a9ff6: reserved kernel page still referenced by 1 users > [ 1867.409850] Memory failure: 0x42a9ff6: recovery action for reserved kernel page: Failed > > and doesn't complete unmapping the page that CPU2 and CPU3 are touching. > > Other CPUs gradually reach the poison and join in the fun of repeatedly > taking machine checks. > > I have not yet tracked why this user access is reporting as a "reserved kernel page". > Some traces showed that futex(2) syscall was in use from this benchmark, > so maybe the kernel locked a user page that was a contended futex??? This might imply that current logic to identify page state does not work properly on this exotic type of user page, I'll take a look on this from futex's viewpoint. > > Idea for what we should do next ... Now that x86 is calling memory_failure() > from user context ... 
maybe parallel calls for the same page should > be blocked until the first caller completes so we can: > a) know that pages are unmapped (if that happens) > b) all get the same success/fail status One memory_failure() call changes the target page's status and affects all mappings to all affected processes, so I think that (ideally) we don't have to block other threads (letting them early return seems fine). Sometimes memory_failure() fails, but even in such case, PG_hwpoison is set on the page and other threads properly get SIGBUSs with this patch, so I think that we can avoid the worst scenario (like system stall by MCE loop). Thanks, Naoya Horiguchi
Hi Naoya, Oscar, David: > > > We could use some negative value (error code) to report the reported case, > > then as you mentioned above, some callers need change to handle the > > new case, and the same is true if you use some positive value. > > My preference is -EHWPOISON, but other options are fine if justified well. > > -EHWPOISON seems like a good fit. > I am OK with the -EHWPOISON error code, but I have one doubt here: when we return this -EHWPOISON error code, does this mean we have to add a new error code to error-base.h or errno.h? Is this easily realized? Thanks! Aili Yao
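As far as I can tell, no new definition should be needed for this; the generic errno header already carries EHWPOISON (reproduced here from memory, so treat the exact location as an assumption):

/* Already present in include/uapi/asm-generic/errno.h: */
#define EHWPOISON	133	/* Memory page has hardware error */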
Hi Naoya, Tony: > > > > Idea for what we should do next ... Now that x86 is calling memory_failure() > > from user context ... maybe parallel calls for the same page should > > be blocked until the first caller completes so we can: > > a) know that pages are unmapped (if that happens) > > b) all get the same success/fail status > > One memory_failure() call changes the target page's status and > affects all mappings to all affected processes, so I think that > (ideally) we don't have to block other threads (letting them > early return seems fine). Sometimes memory_failure() fails, > but even in such case, PG_hwpoison is set on the page and other > threads properly get SIGBUSs with this patch, so I think that > we can avoid the worst scenario (like system stall by MCE loop). > I agree with Naoya's point. If we block for this issue, does that change the outcome that the process should be killed? Or is there something else that still needs to be considered? Thanks! Aili Yao
On Thu, Feb 25, 2021 at 6:23 PM HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com> wrote: > > On Thu, Feb 25, 2021 at 10:15:42AM -0800, Luck, Tony wrote: > > CPU3 reads the poison and starts along same path that CPU2 > > did. > > I think that the MCE loop happening on CPU2 and CPU3 is unexpected > and these threads should immediately kill the current process on > each CPU. force_sig_mceerr() in kill_me_maybe() is supposed to do it, > so Aili's patch would fix this issue too? It would stop the looping. But for the case where the error came from user code we don't have the virtual address that was accessed at this point (normally this address is found during the reverse lookup from the physical address inside memory_failure()). So we can send a generic SIGBUS, but not one with the usual extra information about the location of the error. -Tony
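For reference, this is the distinction being drawn, paraphrased from the kill_me_maybe() snippet quoted earlier in the thread (a sketch, not a verbatim copy):

/*
 * Sketch of the two signalling options discussed above.
 */
if (p->mce_vaddr != (void __user *)-1l) {
	/* Virtual address known (kernel copy-in case): the SIGBUS carries
	 * si_addr/si_addr_lsb so the handler knows exactly what was lost. */
	force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
} else {
	/* Error hit in user code: only the physical address is known here,
	 * so all that can be sent is a bare SIGBUS with no fault address. */
	force_sig(SIGBUS);
}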
On Fri, Feb 26, 2021 at 10:52:50AM +0800, Aili Yao wrote: > Hi naoya,Oscar,david: > > > > > We could use some negative value (error code) to report the reported case, > > > then as you mentioned above, some callers need change to handle the > > > new case, and the same is true if you use some positive value. > > > My preference is -EHWPOISON, but other options are fine if justified well. > > > > -EHWPOISON seems like a good fit. > > > I am OK with the -EHWPOISON error code, But I have one doubt here: > When we return this -EHWPOISON error code, Does this means we have to add a new error code > to error-base.h or errno.h? Is this easy realized? The page already poisoned isn't really an error though. Just the result of a race condition. What if we added an extra argument to memory_failure() so it can tell the caller that the specific reason for the early successful return is that the page was already poisoned? Something like this (untested - patch against v5.11): -Tony --- diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index e133ce1e562b..0e32c4d879fb 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -637,6 +637,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val, { struct mce *mce = (struct mce *)data; unsigned long pfn; + int already = 0; if (!mce || !mce_usable_address(mce)) return NOTIFY_DONE; @@ -646,8 +647,9 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val, return NOTIFY_DONE; pfn = mce->addr >> PAGE_SHIFT; - if (!memory_failure(pfn, 0)) { - set_mce_nospec(pfn, whole_page(mce)); + if (!memory_failure(pfn, 0, &already)) { + if (!already) + set_mce_nospec(pfn, whole_page(mce)); mce->kflags |= MCE_HANDLED_UC; } @@ -1245,15 +1247,19 @@ static void kill_me_maybe(struct callback_head *cb) { struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me); int flags = MF_ACTION_REQUIRED; + int already = 0; pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr); if (!p->mce_ripv) flags |= MF_MUST_KILL; - if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) && + if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) && !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) { - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); + if (already) + force_sig(SIGBUS); // BETTER CODE NEEDED HERE!!! + else + set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); sync_core(); return; } @@ -1440,7 +1446,7 @@ noinstr void do_machine_check(struct pt_regs *regs) EXPORT_SYMBOL_GPL(do_machine_check); #ifndef CONFIG_MEMORY_FAILURE -int memory_failure(unsigned long pfn, int flags) +int memory_failure(unsigned long pfn, int flags, int *already) { /* mce_severity() should not hand us an ACTION_REQUIRED error */ BUG_ON(flags & MF_ACTION_REQUIRED); diff --git a/drivers/base/memory.c b/drivers/base/memory.c index eef4ffb6122c..24c36623e492 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -480,7 +480,7 @@ static ssize_t hard_offline_page_store(struct device *dev, if (kstrtoull(buf, 0, &pfn) < 0) return -EINVAL; pfn >>= PAGE_SHIFT; - ret = memory_failure(pfn, 0); + ret = memory_failure(pfn, 0, NULL); return ret ? 
ret : count; } diff --git a/include/linux/mm.h b/include/linux/mm.h index ecdf8a8cd6ae..88b92820465c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3045,7 +3045,7 @@ enum mf_flags { MF_MUST_KILL = 1 << 2, MF_SOFT_OFFLINE = 1 << 3, }; -extern int memory_failure(unsigned long pfn, int flags); +extern int memory_failure(unsigned long pfn, int flags, int *already); extern void memory_failure_queue(unsigned long pfn, int flags); extern void memory_failure_queue_kick(int cpu); extern int unpoison_memory(unsigned long pfn); diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c index 1ae1ebc2b9b1..bfd5151dcd3f 100644 --- a/mm/hwpoison-inject.c +++ b/mm/hwpoison-inject.c @@ -48,7 +48,7 @@ static int hwpoison_inject(void *data, u64 val) inject: pr_info("Injecting memory failure at pfn %#lx\n", pfn); - return memory_failure(pfn, 0); + return memory_failure(pfn, 0, NULL); } static int hwpoison_unpoison(void *data, u64 val) diff --git a/mm/madvise.c b/mm/madvise.c index 6a660858784b..ade1956632aa 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -907,7 +907,7 @@ static int madvise_inject_error(int behavior, } else { pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n", pfn, start); - ret = memory_failure(pfn, MF_COUNT_INCREASED); + ret = memory_failure(pfn, MF_COUNT_INCREASED, NULL); } if (ret) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index e9481632fcd1..e8508e4d70e5 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1388,7 +1388,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, * Must run in process context (e.g. a work queue) with interrupts * enabled and no spinlocks hold. */ -int memory_failure(unsigned long pfn, int flags) +int memory_failure(unsigned long pfn, int flags, int *already) { struct page *p; struct page *hpage; @@ -1418,6 +1418,8 @@ int memory_failure(unsigned long pfn, int flags) if (PageHuge(p)) return memory_failure_hugetlb(pfn, flags); if (TestSetPageHWPoison(p)) { + if (already) + *already = 1; pr_err("Memory failure: %#lx: already hardware poisoned\n", pfn); return 0; @@ -1624,7 +1626,7 @@ static void memory_failure_work_func(struct work_struct *work) if (entry.flags & MF_SOFT_OFFLINE) soft_offline_page(entry.pfn, entry.flags); else - memory_failure(entry.pfn, entry.flags); + memory_failure(entry.pfn, entry.flags, NULL); } }
On Fri, 26 Feb 2021 09:58:37 -0800 "Luck, Tony" <tony.luck@intel.com> wrote: > On Fri, Feb 26, 2021 at 10:52:50AM +0800, Aili Yao wrote: > > Hi naoya,Oscar,david: > > > > > > > We could use some negative value (error code) to report the reported case, > > > > then as you mentioned above, some callers need change to handle the > > > > new case, and the same is true if you use some positive value. > > > > My preference is -EHWPOISON, but other options are fine if justified well. > > > > > > -EHWPOISON seems like a good fit. > > > > > I am OK with the -EHWPOISON error code, But I have one doubt here: > > When we return this -EHWPOISON error code, Does this means we have to add a new error code > > to error-base.h or errno.h? Is this easy realized? > > The page already poisoned isn't really an error though. Just the result > of a race condition. What if we added an extra argument to memory_failure() > so it can tell the caller that the specific reason for the early successful > return is that the page was already poisoned? > It may not be an error, but is it reasonable to return a positive value like MF_HWPOISON? It seems the 0 return code doesn't tell the whole story. Your patch seems safer, but I don't know if it's worth such multi-module modifications for this case. It really should be referred to other maintainers and reviewers, and they can give more expert suggestions. Thanks! Aili Yao
On Fri, Feb 26, 2021 at 10:59:15AM +0800, Aili Yao wrote: > Hi naoya, tony: > > > > > > Idea for what we should do next ... Now that x86 is calling memory_failure() > > > from user context ... maybe parallel calls for the same page should > > > be blocked until the first caller completes so we can: > > > a) know that pages are unmapped (if that happens) > > > b) all get the same success/fail status > > > > One memory_failure() call changes the target page's status and > > affects all mappings to all affected processes, so I think that > > (ideally) we don't have to block other threads (letting them > > early return seems fine). Sometimes memory_failure() fails, > > but even in such case, PG_hwpoison is set on the page and other > > threads properly get SIGBUSs with this patch, so I think that > > we can avoid the worst scenario (like system stall by MCE loop). > > > I agree with naoya's point, if we block for this issue, Does this change the result > that the process should be killed? Or is there something other still need to be considered? Ok ... no blocking ... I think someone in this thread suggested scanning the page tables to make sure the poisoned page had been unmapped. There's a walk_page_range() function that does all the work for that. Just need to supply some callback routines that check whether a mapping includes the bad PFN and clear the PRESENT bit. RFC patch below against v5.12-rc1 -Tony From 8de23b7f1be00ad38e129690dfe0b1558fad5ff8 Mon Sep 17 00:00:00 2001 From: Tony Luck <tony.luck@intel.com> Date: Tue, 2 Mar 2021 15:06:33 -0800 Subject: [PATCH] x86/mce: Handle races between machine checks When multiple CPUs hit the same poison memory there is a race. The first CPU into memory_failure() atomically marks the page as poison and continues processing to hunt down all the tasks that map this page so that the virtual addresses can be marked not-present and SIGBUS sent to the task that did the access. Later CPUs get an early return from memory_failure() and may return to user mode and access the poison again. Add a new argument to memory_failure() so that it can indicate when the race has been lost. Fix kill_me_maybe() to scan page tables in this case to unmap pages. 
--- arch/x86/kernel/cpu/mce/core.c | 61 +++++++++++++++++++++++++++++++--- drivers/base/memory.c | 2 +- include/linux/mm.h | 2 +- mm/hwpoison-inject.c | 2 +- mm/madvise.c | 2 +- mm/memory-failure.c | 6 ++-- 6 files changed, 64 insertions(+), 11 deletions(-) diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 7962355436da..2c6c560f3f92 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -30,6 +30,7 @@ #include <linux/slab.h> #include <linux/init.h> #include <linux/kmod.h> +#include <linux/pagewalk.h> #include <linux/poll.h> #include <linux/nmi.h> #include <linux/cpu.h> @@ -637,6 +638,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val, { struct mce *mce = (struct mce *)data; unsigned long pfn; + int already = 0; if (!mce || !mce_usable_address(mce)) return NOTIFY_DONE; @@ -646,8 +648,9 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val, return NOTIFY_DONE; pfn = mce->addr >> PAGE_SHIFT; - if (!memory_failure(pfn, 0)) { - set_mce_nospec(pfn, whole_page(mce)); + if (!memory_failure(pfn, 0, &already)) { + if (!already) + set_mce_nospec(pfn, whole_page(mce)); mce->kflags |= MCE_HANDLED_UC; } @@ -1248,6 +1251,50 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin *m = *final; } +static int pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk) +{ + u64 pfn = (u64)walk->private; + + if (pte_pfn(*pte) == pfn) + pte->pte = pte->pte & ~_PAGE_PRESENT; + + return 0; +} + +static int pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk) +{ + int shift = PMD_SHIFT - PAGE_SHIFT; + u64 pfn = (u64)walk->private; + + if (!pmd_large(*pmd)) + return 0; + + if (pmd_pfn(*pmd) >> shift == pfn >> shift) + pmd->pmd = pmd->pmd & ~_PAGE_PRESENT; + + return 0; +} + +static int pud_entry(pud_t *pud, unsigned long addr, unsigned long next, struct mm_walk *walk) +{ + int shift = PUD_SHIFT - PAGE_SHIFT; + u64 pfn = (u64)walk->private; + + if (!pud_large(*pud)) + return 0; + + if (pud_pfn(*pud) >> shift == pfn >> shift) + pud->pud = pud->pud & ~_PAGE_PRESENT; + + return 0; +} + +static struct mm_walk_ops walk = { + .pte_entry = pte_entry, + .pmd_entry = pmd_entry, + .pud_entry = pud_entry +}; + static void kill_me_now(struct callback_head *ch) { force_sig(SIGBUS); @@ -1257,15 +1304,19 @@ static void kill_me_maybe(struct callback_head *cb) { struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me); int flags = MF_ACTION_REQUIRED; + int already = 0; pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr); if (!p->mce_ripv) flags |= MF_MUST_KILL; - if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) && + if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) && !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) { - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); + if (already) + walk_page_range(current->mm, 0, TASK_SIZE_MAX, &walk, (void *)(p->mce_addr >> PAGE_SHIFT)); + else + set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); sync_core(); return; } @@ -1452,7 +1503,7 @@ noinstr void do_machine_check(struct pt_regs *regs) EXPORT_SYMBOL_GPL(do_machine_check); #ifndef CONFIG_MEMORY_FAILURE -int memory_failure(unsigned long pfn, int flags) +int memory_failure(unsigned long pfn, int flags, int *already) { /* mce_severity() should not hand us an ACTION_REQUIRED error */ BUG_ON(flags & MF_ACTION_REQUIRED); diff --git a/drivers/base/memory.c 
b/drivers/base/memory.c index f35298425575..144500983656 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -480,7 +480,7 @@ static ssize_t hard_offline_page_store(struct device *dev, if (kstrtoull(buf, 0, &pfn) < 0) return -EINVAL; pfn >>= PAGE_SHIFT; - ret = memory_failure(pfn, 0); + ret = memory_failure(pfn, 0, NULL); return ret ? ret : count; } diff --git a/include/linux/mm.h b/include/linux/mm.h index 77e64e3eac80..beaa6e871cbe 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3003,7 +3003,7 @@ enum mf_flags { MF_MUST_KILL = 1 << 2, MF_SOFT_OFFLINE = 1 << 3, }; -extern int memory_failure(unsigned long pfn, int flags); +extern int memory_failure(unsigned long pfn, int flags, int *already); extern void memory_failure_queue(unsigned long pfn, int flags); extern void memory_failure_queue_kick(int cpu); extern int unpoison_memory(unsigned long pfn); diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c index 1ae1ebc2b9b1..bfd5151dcd3f 100644 --- a/mm/hwpoison-inject.c +++ b/mm/hwpoison-inject.c @@ -48,7 +48,7 @@ static int hwpoison_inject(void *data, u64 val) inject: pr_info("Injecting memory failure at pfn %#lx\n", pfn); - return memory_failure(pfn, 0); + return memory_failure(pfn, 0, NULL); } static int hwpoison_unpoison(void *data, u64 val) diff --git a/mm/madvise.c b/mm/madvise.c index df692d2e35d4..09f569fed68d 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -908,7 +908,7 @@ static int madvise_inject_error(int behavior, } else { pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n", pfn, start); - ret = memory_failure(pfn, MF_COUNT_INCREASED); + ret = memory_failure(pfn, MF_COUNT_INCREASED, NULL); } if (ret) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 24210c9bd843..9a8911aa5fc9 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1398,7 +1398,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, * Must run in process context (e.g. a work queue) with interrupts * enabled and no spinlocks hold. */ -int memory_failure(unsigned long pfn, int flags) +int memory_failure(unsigned long pfn, int flags, int *already) { struct page *p; struct page *hpage; @@ -1428,6 +1428,8 @@ int memory_failure(unsigned long pfn, int flags) if (PageHuge(p)) return memory_failure_hugetlb(pfn, flags); if (TestSetPageHWPoison(p)) { + if (already) + *already = 1; pr_err("Memory failure: %#lx: already hardware poisoned\n", pfn); return 0; @@ -1634,7 +1636,7 @@ static void memory_failure_work_func(struct work_struct *work) if (entry.flags & MF_SOFT_OFFLINE) soft_offline_page(entry.pfn, entry.flags); else - memory_failure(entry.pfn, entry.flags); + memory_failure(entry.pfn, entry.flags, NULL); } }
On Tue, 2 Mar 2021 19:39:53 -0800 "Luck, Tony" <tony.luck@intel.com> wrote: > On Fri, Feb 26, 2021 at 10:59:15AM +0800, Aili Yao wrote: > > Hi naoya, tony: > > > > > > > > Idea for what we should do next ... Now that x86 is calling memory_failure() > > > > from user context ... maybe parallel calls for the same page should > > > > be blocked until the first caller completes so we can: > > > > a) know that pages are unmapped (if that happens) > > > > b) all get the same success/fail status > > > > > > One memory_failure() call changes the target page's status and > > > affects all mappings to all affected processes, so I think that > > > (ideally) we don't have to block other threads (letting them > > > early return seems fine). Sometimes memory_failure() fails, > > > but even in such case, PG_hwpoison is set on the page and other > > > threads properly get SIGBUSs with this patch, so I think that > > > we can avoid the worst scenario (like system stall by MCE loop). > > > > > I agree with naoya's point, if we block for this issue, Does this change the result > > that the process should be killed? Or is there something other still need to be considered? > > Ok ... no blocking ... I think someone in this thread suggested > scanning the page tables to make sure the poisoned page had been > unmapped. > > There's a walk_page_range() function that does all the work for that. > Just need to supply some callback routines that check whether a > mapping includes the bad PFN and clear the PRESENT bit. > > RFC patch below against v5.12-rc1 > > -Tony > > From 8de23b7f1be00ad38e129690dfe0b1558fad5ff8 Mon Sep 17 00:00:00 2001 > From: Tony Luck <tony.luck@intel.com> > Date: Tue, 2 Mar 2021 15:06:33 -0800 > Subject: [PATCH] x86/mce: Handle races between machine checks > > When multiple CPUs hit the same poison memory there is a race. The > first CPU into memory_failure() atomically marks the page as poison > and continues processing to hunt down all the tasks that map this page > so that the virtual addresses can be marked not-present and SIGBUS > sent to the task that did the access. > > Later CPUs get an early return from memory_failure() and may return > to user mode and access the poison again. > > Add a new argument to memory_failure() so that it can indicate when > the race has been lost. Fix kill_me_maybe() to scan page tables in > this case to unmap pages. > + > static void kill_me_now(struct callback_head *ch) > { > force_sig(SIGBUS); > @@ -1257,15 +1304,19 @@ static void kill_me_maybe(struct callback_head *cb) > { > struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me); > int flags = MF_ACTION_REQUIRED; > + int already = 0; > > pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr); > > if (!p->mce_ripv) > flags |= MF_MUST_KILL; > > - if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) && > + if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) && > !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) { > - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); > + if (already) > + walk_page_range(current->mm, 0, TASK_SIZE_MAX, &walk, (void *)(p->mce_addr >> PAGE_SHIFT)); > + else > + set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); > sync_core(); > return; > MF_MUST_KILL = 1 << 2, > MF_SOFT_OFFLINE = 1 << 3, > }; I have one doubt here, when "walk_page_range(current->mm, 0, TASK_SIZE_MAX, &walk, (void *)(p->mce_addr >> PAGE_SHIFT));" is done, so how is the process triggering this error returned if it have taken the wrong data?
Hi Tony: > On Tue, 2 Mar 2021 19:39:53 -0800 > "Luck, Tony" <tony.luck@intel.com> wrote: > > > On Fri, Feb 26, 2021 at 10:59:15AM +0800, Aili Yao wrote: > > > Hi naoya, tony: > > > > > > > > > > Idea for what we should do next ... Now that x86 is calling memory_failure() > > > > > from user context ... maybe parallel calls for the same page should > > > > > be blocked until the first caller completes so we can: > > > > > a) know that pages are unmapped (if that happens) > > > > > b) all get the same success/fail status > > > > > > > > One memory_failure() call changes the target page's status and > > > > affects all mappings to all affected processes, so I think that > > > > (ideally) we don't have to block other threads (letting them > > > > early return seems fine). Sometimes memory_failure() fails, > > > > but even in such case, PG_hwpoison is set on the page and other > > > > threads properly get SIGBUSs with this patch, so I think that > > > > we can avoid the worst scenario (like system stall by MCE loop). > > > > > > > I agree with naoya's point, if we block for this issue, Does this change the result > > > that the process should be killed? Or is there something other still need to be considered? > > > > Ok ... no blocking ...

I did think about the blocking method and the error-address issue with the SIGBUS; here is my opinion, in case it helps:

For blocking: if we block here, there is some undefined work that I think still has to be done. Since we don't know whether process B, which triggered this error again, is early-kill or not, the previous memory_failure() call may not have added B to the kill list; and even if B is on the kill list, the error level for B is not set properly, as B should get an AR SIGBUS. So we can't just wait; we would need some logic that adds process B to the kill list, and since this is an AR error, another change to the current code is needed, with more logic in kill_proc() or some other place.

Even if all of that work is done right, there is one more serious scenario: we don't even know which step the previous memory_failure() call has reached, so the earlier modification may not be useful at all. When this scenario happens, what can we do, block or return? If we finally return, an error code has to be passed back, so we still end up in the error-processing logic and a signal without the right address will be sent.

For the error address in the SIGBUS: I think this is not an issue introduced by the patch I posted; the issue was already there before my patch. I don't see a realizable way to get the correct address, for the same reason: we don't know whether the page mapping is still there when we get to kill_me_maybe(). In some cases we may get it, but there are a lot of parallel issues to consider, and if that fails we have to fall back to the error branch again, so keeping the current code may be the easy option.

Any method or patch that solves the issue in a better way is OK with me; I want this issue fixed, and in a more complete way!
> For error address with sigbus, i think this is not an issue resulted by the patch i post, before my patch, the issue is already there. > I don't find a realizable way to get the correct address for same reason --- we don't know whether the page mapping is there or not when > we got to kill_me_maybe(), in some case, we may get it, but there are a lot of parallel issue need to consider, and if failed we have to fallback > to the error brach again, remaining current code may be an easy option; My RFC patch from yesterday removes the uncertainty about whether the page is there or not. After it walks the page tables we know that the poison page isn't mapped (note that patch is RFC for a reason ... I'm 90% sure that it should do a bit more than just clear the PRESENT bit). So perhaps memory_failure() has queued a SIGBUS for this task; if so, we take it when we return from kill_me_maybe(). If not, we will return to user mode and re-execute the failing instruction ... but because the page is unmapped we will take a #PF. The x86 page fault handler will see that the page for this physical address is marked HWPOISON, and it will send the SIGBUS (just like it does if the page had been removed by an earlier UCNA/SRAO error). At least that's the theory. -Tony
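A paraphrased sketch of the fault path behind that theory, assuming the zapped PTE ends up as a hwpoison swap entry (as the try_to_unmap_one() snippet quoted in the next message produces); this is a sketch from memory, not verbatim kernel code:

/*
 * mm/memory.c:do_swap_page(), paraphrased: a hwpoison swap entry makes the
 * refault fail with VM_FAULT_HWPOISON instead of swapping anything in.
 */
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry)) && is_hwpoison_entry(entry))
	ret = VM_FAULT_HWPOISON;

/*
 * arch/x86/mm/fault.c:do_sigbus(), paraphrased: VM_FAULT_HWPOISON is then
 * turned into a SIGBUS carrying BUS_MCEERR_AR and the faulting virtual
 * address, so the task that re-executes the load gets the usual
 * "action required" signal.
 */
if (fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
	force_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb);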
On Wed, 3 Mar 2021 15:41:35 +0000 "Luck, Tony" <tony.luck@intel.com> wrote: > > For error address with sigbus, i think this is not an issue resulted by the patch i post, before my patch, the issue is already there. > > I don't find a realizable way to get the correct address for same reason --- we don't know whether the page mapping is there or not when > > we got to kill_me_maybe(), in some case, we may get it, but there are a lot of parallel issue need to consider, and if failed we have to fallback > > to the error brach again, remaining current code may be an easy option; > > My RFC patch from yesterday removes the uncertainty about whether the page is there or not. After it walks the page > tables we know that the poison page isn't mapped (note that patch is RFC for a reason ... I'm 90% sure that it should > do a bit more that just clear the PRESENT bit). > > So perhaps memory_failure() has queued a SIGBUS for this task, if so, we take it when we return from kill_me_maybe() > > If not, we will return to user mode and re-execute the failing instruction ... but because the page is unmapped we will take a #PF Got this, I have some error thoughts here. > The x86 page fault handler will see that the page for this physical address is marked HWPOISON, and it will send the SIGBUS > (just like it does if the page had been removed by an earlier UCNA/SRAO error). if your methods works, should it be like this? 1582 pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); 1583 if (PageHuge(page)) { 1584 hugetlb_count_sub(compound_nr(page), mm); 1585 set_huge_swap_pte_at(mm, address, 1586 pvmw.pte, pteval, 1587 vma_mmu_pagesize(vma)); 1588 } else { 1589 dec_mm_counter(mm, mm_counter(page)); 1590 set_pte_at(mm, address, pvmw.pte, pteval); 1591 } the page fault check if it's a poison page using is_hwpoison_entry(),
On Thu, 4 Mar 2021 10:16:53 +0800 Aili Yao <yaoaili@kingsoft.com> wrote: > On Wed, 3 Mar 2021 15:41:35 +0000 > "Luck, Tony" <tony.luck@intel.com> wrote: > > > > For error address with sigbus, i think this is not an issue resulted by the patch i post, before my patch, the issue is already there. > > > I don't find a realizable way to get the correct address for same reason --- we don't know whether the page mapping is there or not when > > > we got to kill_me_maybe(), in some case, we may get it, but there are a lot of parallel issue need to consider, and if failed we have to fallback > > > to the error brach again, remaining current code may be an easy option; > > > > My RFC patch from yesterday removes the uncertainty about whether the page is there or not. After it walks the page > > tables we know that the poison page isn't mapped (note that patch is RFC for a reason ... I'm 90% sure that it should > > do a bit more that just clear the PRESENT bit). > > > > So perhaps memory_failure() has queued a SIGBUS for this task, if so, we take it when we return from kill_me_maybe() And when this happen, the process will receive an SIGBUS with AO level, is it proper as not an AR? > > If not, we will return to user mode and re-execute the failing instruction ... but because the page is unmapped we will take a #PF > > Got this, I have some error thoughts here. > > > > The x86 page fault handler will see that the page for this physical address is marked HWPOISON, and it will send the SIGBUS > > (just like it does if the page had been removed by an earlier UCNA/SRAO error). > > if your methods works, should it be like this? > > 1582 pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); > 1583 if (PageHuge(page)) { > 1584 hugetlb_count_sub(compound_nr(page), mm); > 1585 set_huge_swap_pte_at(mm, address, > 1586 pvmw.pte, pteval, > 1587 vma_mmu_pagesize(vma)); > 1588 } else { > 1589 dec_mm_counter(mm, mm_counter(page)); > 1590 set_pte_at(mm, address, pvmw.pte, pteval); > 1591 } > > the page fault check if it's a poison page using is_hwpoison_entry(), > And if it works, does we need some locking mechanism before we call walk_page_range(); if we lock, does we need to process the blocking interrupted error as other places will do?
On Thu, 4 Mar 2021 12:19:41 +0800 Aili Yao <yaoaili@kingsoft.com> wrote: > On Thu, 4 Mar 2021 10:16:53 +0800 > Aili Yao <yaoaili@kingsoft.com> wrote: > > > On Wed, 3 Mar 2021 15:41:35 +0000 > > "Luck, Tony" <tony.luck@intel.com> wrote: > > > > > > For error address with sigbus, i think this is not an issue resulted by the patch i post, before my patch, the issue is already there. > > > > I don't find a realizable way to get the correct address for same reason --- we don't know whether the page mapping is there or not when > > > > we got to kill_me_maybe(), in some case, we may get it, but there are a lot of parallel issue need to consider, and if failed we have to fallback > > > > to the error brach again, remaining current code may be an easy option; > > > > > > My RFC patch from yesterday removes the uncertainty about whether the page is there or not. After it walks the page > > > tables we know that the poison page isn't mapped (note that patch is RFC for a reason ... I'm 90% sure that it should > > > do a bit more that just clear the PRESENT bit). > > > > > > So perhaps memory_failure() has queued a SIGBUS for this task, if so, we take it when we return from kill_me_maybe() > > And when this happen, the process will receive an SIGBUS with AO level, is it proper as not an AR? > > > > If not, we will return to user mode and re-execute the failing instruction ... but because the page is unmapped we will take a #PF > > > > Got this, I have some error thoughts here. > > > > > > > The x86 page fault handler will see that the page for this physical address is marked HWPOISON, and it will send the SIGBUS > > > (just like it does if the page had been removed by an earlier UCNA/SRAO error). > > > > if your methods works, should it be like this? > > > > 1582 pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); > > 1583 if (PageHuge(page)) { > > 1584 hugetlb_count_sub(compound_nr(page), mm); > > 1585 set_huge_swap_pte_at(mm, address, > > 1586 pvmw.pte, pteval, > > 1587 vma_mmu_pagesize(vma)); > > 1588 } else { > > 1589 dec_mm_counter(mm, mm_counter(page)); > > 1590 set_pte_at(mm, address, pvmw.pte, pteval); > > 1591 } > > > > the page fault check if it's a poison page using is_hwpoison_entry(), > > > > And if it works, does we need some locking mechanism before we call walk_page_range(); > if we lock, does we need to process the blocking interrupted error as other places will do? > And another thing: Do we need a call to flush_tlb_page(vma, address) to make the pte changes into effect?
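For comparison, the usual pattern when rmap unmaps a poisoned page looks roughly like the following, paraphrased from the try_to_unmap_one() hwpoison path (a sketch only; the deferred-flush variant is omitted):

/*
 * Paraphrased: the old PTE is cleared and the TLB flushed before the
 * poison entry is installed, so no CPU keeps using a stale translation
 * of the bad page.
 */
flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
pteval = ptep_clear_flush(vma, address, pvmw.pte);
dec_mm_counter(mm, mm_counter(page));
set_pte_at(mm, address, pvmw.pte, swp_entry_to_pte(make_hwpoison_entry(page)));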
On Thu, Mar 04, 2021 at 02:45:24PM +0800, Aili Yao wrote: > > > if your methods works, should it be like this? > > > > > > 1582 pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); > > > 1583 if (PageHuge(page)) { > > > 1584 hugetlb_count_sub(compound_nr(page), mm); > > > 1585 set_huge_swap_pte_at(mm, address, > > > 1586 pvmw.pte, pteval, > > > 1587 vma_mmu_pagesize(vma)); > > > 1588 } else { > > > 1589 dec_mm_counter(mm, mm_counter(page)); > > > 1590 set_pte_at(mm, address, pvmw.pte, pteval); > > > 1591 } > > > > > > the page fault check if it's a poison page using is_hwpoison_entry(), > > > > > > > And if it works, does we need some locking mechanism before we call walk_page_range(); > > if we lock, does we need to process the blocking interrupted error as other places will do? > > > > And another thing: > Do we need a call to flush_tlb_page(vma, address) to make the pte changes into effect? Thanks for all the pointers. I added them to the patch (see below). [The pmd/pud cases may need some tweaking ... but just trying to get the 4K page case working first] I tried testing by skipping the call to memory_failure() and just using this new code to search the page tables for current page and marking it hwpoison (to simulate the case where 2nd process gets the early return from memory_failure(). Something is still missing because I get: [ 481.911298] mce: pte_entry: matched pfn - mark poison & zap pte [ 481.917935] MCE: Killing einj_mem_uc:5555 due to hardware memory corruption fault at 7fe64b33b400 [ 482.933775] BUG: Bad page cache in process einj_mem_uc pfn:408b6d6 [ 482.940777] page:0000000013ea6e96 refcount:3 mapcount:1 mapping:00000000e3a069d9 index:0x0 pfn:0x408b6d6 [ 482.951355] memcg:ffff94a809834000 [ 482.955153] aops:shmem_aops ino:3c04 [ 482.959142] flags: 0x97ffffc0880015(locked|uptodate|lru|swapbacked|hwpoison) [ 482.967018] raw: 0097ffffc0880015 ffff94c80e93ec00 ffff94c80e93ec00 ffff94c80a9b25a8 [ 482.975666] raw: 0000000000000000 0000000000000000 0000000300000000 ffff94a809834000 [ 482.984310] page dumped because: still mapped when deleted -Tony commit e5de44560b33e2d407704243566253a70f858a59 Author: Tony Luck <tony.luck@intel.com> Date: Tue Mar 2 15:06:33 2021 -0800 x86/mce: Handle races between machine checks When multiple CPUs hit the same poison memory there is a race. The first CPU into memory_failure() atomically marks the page as poison and continues processing to hunt down all the tasks that map this page so that the virtual addresses can be marked not-present and SIGBUS sent to the task that did the access. Later CPUs get an early return from memory_failure() and may return to user mode and access the poison again. Add a new argument to memory_failure() so that it can indicate when the race has been lost. Fix kill_me_maybe() to scan page tables in this case to unmap pages. 
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 7962355436da..a52c6a772de2 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -28,8 +28,12 @@ #include <linux/sysfs.h> #include <linux/types.h> #include <linux/slab.h> +#include <linux/hugetlb.h> +#include <linux/swap.h> +#include <linux/swapops.h> #include <linux/init.h> #include <linux/kmod.h> +#include <linux/pagewalk.h> #include <linux/poll.h> #include <linux/nmi.h> #include <linux/cpu.h> @@ -637,6 +641,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val, { struct mce *mce = (struct mce *)data; unsigned long pfn; + int already = 0; if (!mce || !mce_usable_address(mce)) return NOTIFY_DONE; @@ -646,8 +651,9 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val, return NOTIFY_DONE; pfn = mce->addr >> PAGE_SHIFT; - if (!memory_failure(pfn, 0)) { - set_mce_nospec(pfn, whole_page(mce)); + if (!memory_failure(pfn, 0, &already)) { + if (!already) + set_mce_nospec(pfn, whole_page(mce)); mce->kflags |= MCE_HANDLED_UC; } @@ -1248,6 +1254,79 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin *m = *final; } +static int pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk) +{ + u64 pfn = (u64)walk->private; + struct page *page; + pte_t pteval; + + if (pte_pfn(*pte) == pfn) { +pr_info("pte_entry: matched pfn - mark poison & zap pte\n"); + page = pfn_to_page(pfn); + lock_page(page); +SetPageHWPoison(page); + pteval = swp_entry_to_pte(make_hwpoison_entry(page)); + dec_mm_counter(walk->mm, mm_counter(page)); + set_pte_at(current->mm, addr, pte, pteval); + unlock_page(page); + flush_tlb_page(walk->vma, addr); + } + + return 0; +} + +static int pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk) +{ + int shift = PMD_SHIFT - PAGE_SHIFT; + u64 pfn = (u64)walk->private; + struct page *page; + pte_t pteval; + + if (!pmd_large(*pmd)) + return 0; + + if (pmd_pfn(*pmd) >> shift == pfn >> shift) { + page = pfn_to_page(pfn); + lock_page(page); + pteval = swp_entry_to_pte(make_hwpoison_entry(page)); + hugetlb_count_sub(compound_nr(page), walk->mm); + set_huge_swap_pte_at(walk->mm, addr, (pte_t *)pmd, pteval, vma_mmu_pagesize(walk->vma)); + unlock_page(page); + flush_tlb_page(walk->vma, addr); + } + + return 0; +} + +static int pud_entry(pud_t *pud, unsigned long addr, unsigned long next, struct mm_walk *walk) +{ + int shift = PUD_SHIFT - PAGE_SHIFT; + u64 pfn = (u64)walk->private; + struct page *page; + pte_t pteval; + + if (!pud_large(*pud)) + return 0; + + if (pud_pfn(*pud) >> shift == pfn >> shift) { + page = pfn_to_page(pfn); + lock_page(page); + pteval = swp_entry_to_pte(make_hwpoison_entry(page)); + hugetlb_count_sub(compound_nr(page), walk->mm); + set_huge_swap_pte_at(walk->mm, addr, (pte_t *)pud, pteval, vma_mmu_pagesize(walk->vma)); + unlock_page(page); + flush_tlb_page(walk->vma, addr); + } + + return 0; +} + +static struct mm_walk_ops walk = { + .pte_entry = pte_entry, + .pmd_entry = pmd_entry, + .pud_entry = pud_entry +}; + static void kill_me_now(struct callback_head *ch) { force_sig(SIGBUS); @@ -1257,15 +1336,22 @@ static void kill_me_maybe(struct callback_head *cb) { struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me); int flags = MF_ACTION_REQUIRED; + int already = 0; pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr); if (!p->mce_ripv) flags |= MF_MUST_KILL; - if 
(!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) && + if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) && !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) { - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); + if (already) { + mmap_read_lock(current->mm); + walk_page_range(current->mm, 0, TASK_SIZE_MAX, &walk, (void *)(p->mce_addr >> PAGE_SHIFT)); + mmap_read_unlock(current->mm); + } else { + set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); + } sync_core(); return; } @@ -1452,7 +1538,7 @@ noinstr void do_machine_check(struct pt_regs *regs) EXPORT_SYMBOL_GPL(do_machine_check); #ifndef CONFIG_MEMORY_FAILURE -int memory_failure(unsigned long pfn, int flags) +int memory_failure(unsigned long pfn, int flags, int *already) { /* mce_severity() should not hand us an ACTION_REQUIRED error */ BUG_ON(flags & MF_ACTION_REQUIRED); diff --git a/drivers/base/memory.c b/drivers/base/memory.c index f35298425575..144500983656 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -480,7 +480,7 @@ static ssize_t hard_offline_page_store(struct device *dev, if (kstrtoull(buf, 0, &pfn) < 0) return -EINVAL; pfn >>= PAGE_SHIFT; - ret = memory_failure(pfn, 0); + ret = memory_failure(pfn, 0, NULL); return ret ? ret : count; } diff --git a/include/linux/mm.h b/include/linux/mm.h index 77e64e3eac80..beaa6e871cbe 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3003,7 +3003,7 @@ enum mf_flags { MF_MUST_KILL = 1 << 2, MF_SOFT_OFFLINE = 1 << 3, }; -extern int memory_failure(unsigned long pfn, int flags); +extern int memory_failure(unsigned long pfn, int flags, int *already); extern void memory_failure_queue(unsigned long pfn, int flags); extern void memory_failure_queue_kick(int cpu); extern int unpoison_memory(unsigned long pfn); diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c index 1ae1ebc2b9b1..bfd5151dcd3f 100644 --- a/mm/hwpoison-inject.c +++ b/mm/hwpoison-inject.c @@ -48,7 +48,7 @@ static int hwpoison_inject(void *data, u64 val) inject: pr_info("Injecting memory failure at pfn %#lx\n", pfn); - return memory_failure(pfn, 0); + return memory_failure(pfn, 0, NULL); } static int hwpoison_unpoison(void *data, u64 val) diff --git a/mm/madvise.c b/mm/madvise.c index df692d2e35d4..09f569fed68d 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -908,7 +908,7 @@ static int madvise_inject_error(int behavior, } else { pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n", pfn, start); - ret = memory_failure(pfn, MF_COUNT_INCREASED); + ret = memory_failure(pfn, MF_COUNT_INCREASED, NULL); } if (ret) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 24210c9bd843..9a8911aa5fc9 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1398,7 +1398,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, * Must run in process context (e.g. a work queue) with interrupts * enabled and no spinlocks hold. 
*/ -int memory_failure(unsigned long pfn, int flags) +int memory_failure(unsigned long pfn, int flags, int *already) { struct page *p; struct page *hpage; @@ -1428,6 +1428,8 @@ int memory_failure(unsigned long pfn, int flags) if (PageHuge(p)) return memory_failure_hugetlb(pfn, flags); if (TestSetPageHWPoison(p)) { + if (already) + *already = 1; pr_err("Memory failure: %#lx: already hardware poisoned\n", pfn); return 0; @@ -1634,7 +1636,7 @@ static void memory_failure_work_func(struct work_struct *work) if (entry.flags & MF_SOFT_OFFLINE) soft_offline_page(entry.pfn, entry.flags); else - memory_failure(entry.pfn, entry.flags); + memory_failure(entry.pfn, entry.flags, NULL); } }
On Thu, 4 Mar 2021 15:57:20 -0800 "Luck, Tony" <tony.luck@intel.com> wrote: > On Thu, Mar 04, 2021 at 02:45:24PM +0800, Aili Yao wrote: > > > > if your methods works, should it be like this? > > > > > > > > 1582 pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); > > > > 1583 if (PageHuge(page)) { > > > > 1584 hugetlb_count_sub(compound_nr(page), mm); > > > > 1585 set_huge_swap_pte_at(mm, address, > > > > 1586 pvmw.pte, pteval, > > > > 1587 vma_mmu_pagesize(vma)); > > > > 1588 } else { > > > > 1589 dec_mm_counter(mm, mm_counter(page)); > > > > 1590 set_pte_at(mm, address, pvmw.pte, pteval); > > > > 1591 } > > > > > > > > the page fault check if it's a poison page using is_hwpoison_entry(), > > > > > > > > > > And if it works, does we need some locking mechanism before we call walk_page_range(); > > > if we lock, does we need to process the blocking interrupted error as other places will do? > > > > > > > And another thing: > > Do we need a call to flush_tlb_page(vma, address) to make the pte changes into effect? > > Thanks for all the pointers. I added them to the patch (see below). > [The pmd/pud cases may need some tweaking ... but just trying to get > the 4K page case working first] > > I tried testing by skipping the call to memory_failure() and just > using this new code to search the page tables for current page and > marking it hwpoison (to simulate the case where 2nd process gets the > early return from memory_failure(). Something is still missing because I get: > > [ 481.911298] mce: pte_entry: matched pfn - mark poison & zap pte > [ 481.917935] MCE: Killing einj_mem_uc:5555 due to hardware memory corruption fault at 7fe64b33b400 > [ 482.933775] BUG: Bad page cache in process einj_mem_uc pfn:408b6d6 > [ 482.940777] page:0000000013ea6e96 refcount:3 mapcount:1 mapping:00000000e3a069d9 index:0x0 pfn:0x408b6d6 > [ 482.951355] memcg:ffff94a809834000 > [ 482.955153] aops:shmem_aops ino:3c04 > [ 482.959142] flags: 0x97ffffc0880015(locked|uptodate|lru|swapbacked|hwpoison) > [ 482.967018] raw: 0097ffffc0880015 ffff94c80e93ec00 ffff94c80e93ec00 ffff94c80a9b25a8 > [ 482.975666] raw: 0000000000000000 0000000000000000 0000000300000000 ffff94a809834000 > [ 482.984310] page dumped because: still mapped when deleted From the walk, it seems we have got the virtual address, can we just send a SIGBUS with it? > commit e5de44560b33e2d407704243566253a70f858a59 > Author: Tony Luck <tony.luck@intel.com> > Date: Tue Mar 2 15:06:33 2021 -0800 > > x86/mce: Handle races between machine checks > > When multiple CPUs hit the same poison memory there is a race. The > first CPU into memory_failure() atomically marks the page as poison > and continues processing to hunt down all the tasks that map this page > so that the virtual addresses can be marked not-present and SIGBUS > sent to the task that did the access. > > Later CPUs get an early return from memory_failure() and may return > to user mode and access the poison again. > > Add a new argument to memory_failure() so that it can indicate when > the race has been lost. Fix kill_me_maybe() to scan page tables in > this case to unmap pages. 
> > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c > index 7962355436da..a52c6a772de2 100644 > --- a/arch/x86/kernel/cpu/mce/core.c > +++ b/arch/x86/kernel/cpu/mce/core.c > @@ -28,8 +28,12 @@ > #include <linux/sysfs.h> > #include <linux/types.h> > #include <linux/slab.h> > +#include <linux/hugetlb.h> > +#include <linux/swap.h> > +#include <linux/swapops.h> > #include <linux/init.h> > #include <linux/kmod.h> > +#include <linux/pagewalk.h> > #include <linux/poll.h> > #include <linux/nmi.h> > #include <linux/cpu.h> > @@ -637,6 +641,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val, > { > struct mce *mce = (struct mce *)data; > unsigned long pfn; > + int already = 0; > > if (!mce || !mce_usable_address(mce)) > return NOTIFY_DONE; > @@ -646,8 +651,9 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val, > return NOTIFY_DONE; > > pfn = mce->addr >> PAGE_SHIFT; > - if (!memory_failure(pfn, 0)) { > - set_mce_nospec(pfn, whole_page(mce)); > + if (!memory_failure(pfn, 0, &already)) { > + if (!already) > + set_mce_nospec(pfn, whole_page(mce)); > mce->kflags |= MCE_HANDLED_UC; > } > > @@ -1248,6 +1254,79 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin > *m = *final; > } > > +static int pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk) > +{ > + u64 pfn = (u64)walk->private; > + struct page *page; > + pte_t pteval; > + > + if (pte_pfn(*pte) == pfn) { > +pr_info("pte_entry: matched pfn - mark poison & zap pte\n"); > + page = pfn_to_page(pfn); > + lock_page(page); > +SetPageHWPoison(page); > + pteval = swp_entry_to_pte(make_hwpoison_entry(page)); > + dec_mm_counter(walk->mm, mm_counter(page)); > + set_pte_at(current->mm, addr, pte, pteval); > + unlock_page(page); > + flush_tlb_page(walk->vma, addr); > + } > + > + return 0; > +} > + > +static int pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk) > +{ > + int shift = PMD_SHIFT - PAGE_SHIFT; > + u64 pfn = (u64)walk->private; > + struct page *page; > + pte_t pteval; > + > + if (!pmd_large(*pmd)) > + return 0; > + > + if (pmd_pfn(*pmd) >> shift == pfn >> shift) { > + page = pfn_to_page(pfn); > + lock_page(page); > + pteval = swp_entry_to_pte(make_hwpoison_entry(page)); > + hugetlb_count_sub(compound_nr(page), walk->mm); > + set_huge_swap_pte_at(walk->mm, addr, (pte_t *)pmd, pteval, vma_mmu_pagesize(walk->vma)); > + unlock_page(page); > + flush_tlb_page(walk->vma, addr); > + } > + > + return 0; > +} > + > +static int pud_entry(pud_t *pud, unsigned long addr, unsigned long next, struct mm_walk *walk) > +{ > + int shift = PUD_SHIFT - PAGE_SHIFT; > + u64 pfn = (u64)walk->private; > + struct page *page; > + pte_t pteval; > + > + if (!pud_large(*pud)) > + return 0; > + > + if (pud_pfn(*pud) >> shift == pfn >> shift) { > + page = pfn_to_page(pfn); > + lock_page(page); > + pteval = swp_entry_to_pte(make_hwpoison_entry(page)); > + hugetlb_count_sub(compound_nr(page), walk->mm); > + set_huge_swap_pte_at(walk->mm, addr, (pte_t *)pud, pteval, vma_mmu_pagesize(walk->vma)); > + unlock_page(page); > + flush_tlb_page(walk->vma, addr); > + } > + > + return 0; > +} > + > +static struct mm_walk_ops walk = { > + .pte_entry = pte_entry, > + .pmd_entry = pmd_entry, > + .pud_entry = pud_entry > +}; > + > static void kill_me_now(struct callback_head *ch) > { > force_sig(SIGBUS); > @@ -1257,15 +1336,22 @@ static void kill_me_maybe(struct callback_head *cb) > { > struct task_struct *p = 
container_of(cb, struct task_struct, mce_kill_me); > int flags = MF_ACTION_REQUIRED; > + int already = 0; > > pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr); > > if (!p->mce_ripv) > flags |= MF_MUST_KILL; > > - if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) && > + if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) && > !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) { > - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); > + if (already) { > + mmap_read_lock(current->mm); > + walk_page_range(current->mm, 0, TASK_SIZE_MAX, &walk, (void *)(p->mce_addr >> PAGE_SHIFT)); > + mmap_read_unlock(current->mm); > + } else { > + set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); > + } > sync_core(); > return; > } > @@ -1452,7 +1538,7 @@ noinstr void do_machine_check(struct pt_regs *regs) > EXPORT_SYMBOL_GPL(do_machine_check); > > #ifndef CONFIG_MEMORY_FAILURE > -int memory_failure(unsigned long pfn, int flags) > +int memory_failure(unsigned long pfn, int flags, int *already) > { > /* mce_severity() should not hand us an ACTION_REQUIRED error */ > BUG_ON(flags & MF_ACTION_REQUIRED); > diff --git a/drivers/base/memory.c b/drivers/base/memory.c > index f35298425575..144500983656 100644 > --- a/drivers/base/memory.c > +++ b/drivers/base/memory.c > @@ -480,7 +480,7 @@ static ssize_t hard_offline_page_store(struct device *dev, > if (kstrtoull(buf, 0, &pfn) < 0) > return -EINVAL; > pfn >>= PAGE_SHIFT; > - ret = memory_failure(pfn, 0); > + ret = memory_failure(pfn, 0, NULL); > return ret ? ret : count; > } > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 77e64e3eac80..beaa6e871cbe 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -3003,7 +3003,7 @@ enum mf_flags { > MF_MUST_KILL = 1 << 2, > MF_SOFT_OFFLINE = 1 << 3, > }; > -extern int memory_failure(unsigned long pfn, int flags); > +extern int memory_failure(unsigned long pfn, int flags, int *already); > extern void memory_failure_queue(unsigned long pfn, int flags); > extern void memory_failure_queue_kick(int cpu); > extern int unpoison_memory(unsigned long pfn); > diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c > index 1ae1ebc2b9b1..bfd5151dcd3f 100644 > --- a/mm/hwpoison-inject.c > +++ b/mm/hwpoison-inject.c > @@ -48,7 +48,7 @@ static int hwpoison_inject(void *data, u64 val) > > inject: > pr_info("Injecting memory failure at pfn %#lx\n", pfn); > - return memory_failure(pfn, 0); > + return memory_failure(pfn, 0, NULL); > } > > static int hwpoison_unpoison(void *data, u64 val) > diff --git a/mm/madvise.c b/mm/madvise.c > index df692d2e35d4..09f569fed68d 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -908,7 +908,7 @@ static int madvise_inject_error(int behavior, > } else { > pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n", > pfn, start); > - ret = memory_failure(pfn, MF_COUNT_INCREASED); > + ret = memory_failure(pfn, MF_COUNT_INCREASED, NULL); > } > > if (ret) > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 24210c9bd843..9a8911aa5fc9 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -1398,7 +1398,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, > * Must run in process context (e.g. a work queue) with interrupts > * enabled and no spinlocks hold. 
> */ > -int memory_failure(unsigned long pfn, int flags) > +int memory_failure(unsigned long pfn, int flags, int *already) > { > struct page *p; > struct page *hpage; > @@ -1428,6 +1428,8 @@ int memory_failure(unsigned long pfn, int flags) > if (PageHuge(p)) > return memory_failure_hugetlb(pfn, flags); > if (TestSetPageHWPoison(p)) { > + if (already) > + *already = 1; > pr_err("Memory failure: %#lx: already hardware poisoned\n", > pfn); > return 0; > @@ -1634,7 +1636,7 @@ static void memory_failure_work_func(struct work_struct *work) > if (entry.flags & MF_SOFT_OFFLINE) > soft_offline_page(entry.pfn, entry.flags); > else > - memory_failure(entry.pfn, entry.flags); > + memory_failure(entry.pfn, entry.flags, NULL); > } > } >
On Fri, 5 Mar 2021 09:30:16 +0800 Aili Yao <yaoaili@kingsoft.com> wrote: > On Thu, 4 Mar 2021 15:57:20 -0800 > "Luck, Tony" <tony.luck@intel.com> wrote: > > > On Thu, Mar 04, 2021 at 02:45:24PM +0800, Aili Yao wrote: > > > > > if your methods works, should it be like this? > > > > > > > > > > 1582 pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); > > > > > 1583 if (PageHuge(page)) { > > > > > 1584 hugetlb_count_sub(compound_nr(page), mm); > > > > > 1585 set_huge_swap_pte_at(mm, address, > > > > > 1586 pvmw.pte, pteval, > > > > > 1587 vma_mmu_pagesize(vma)); > > > > > 1588 } else { > > > > > 1589 dec_mm_counter(mm, mm_counter(page)); > > > > > 1590 set_pte_at(mm, address, pvmw.pte, pteval); > > > > > 1591 } > > > > > > > > > > the page fault check if it's a poison page using is_hwpoison_entry(), > > > > > > > > > > > > > And if it works, does we need some locking mechanism before we call walk_page_range(); > > > > if we lock, does we need to process the blocking interrupted error as other places will do? > > > > > > > > > > And another thing: > > > Do we need a call to flush_tlb_page(vma, address) to make the pte changes into effect? > > > > Thanks for all the pointers. I added them to the patch (see below). > > [The pmd/pud cases may need some tweaking ... but just trying to get > > the 4K page case working first] > > > > I tried testing by skipping the call to memory_failure() and just > > using this new code to search the page tables for current page and > > marking it hwpoison (to simulate the case where 2nd process gets the > > early return from memory_failure(). Something is still missing because I get: > > > > [ 481.911298] mce: pte_entry: matched pfn - mark poison & zap pte > > [ 481.917935] MCE: Killing einj_mem_uc:5555 due to hardware memory corruption fault at 7fe64b33b400 > > [ 482.933775] BUG: Bad page cache in process einj_mem_uc pfn:408b6d6 > > [ 482.940777] page:0000000013ea6e96 refcount:3 mapcount:1 mapping:00000000e3a069d9 index:0x0 pfn:0x408b6d6 > > [ 482.951355] memcg:ffff94a809834000 > > [ 482.955153] aops:shmem_aops ino:3c04 > > [ 482.959142] flags: 0x97ffffc0880015(locked|uptodate|lru|swapbacked|hwpoison) > > [ 482.967018] raw: 0097ffffc0880015 ffff94c80e93ec00 ffff94c80e93ec00 ffff94c80a9b25a8 > > [ 482.975666] raw: 0000000000000000 0000000000000000 0000000300000000 ffff94a809834000 > > [ 482.984310] page dumped because: still mapped when deleted > > From the walk, it seems we have got the virtual address, can we just send a SIGBUS with it? Does this walk proper for other memory-failure error cases, can it be applied to if (p->mce_vaddr != (void __user *)-1l) branch in kill_me_maybe()? > > commit e5de44560b33e2d407704243566253a70f858a59 > > Author: Tony Luck <tony.luck@intel.com> > > Date: Tue Mar 2 15:06:33 2021 -0800 > > > > x86/mce: Handle races between machine checks > > > > When multiple CPUs hit the same poison memory there is a race. The > > first CPU into memory_failure() atomically marks the page as poison > > and continues processing to hunt down all the tasks that map this page > > so that the virtual addresses can be marked not-present and SIGBUS > > sent to the task that did the access. > > > > Later CPUs get an early return from memory_failure() and may return > > to user mode and access the poison again. > > > > Add a new argument to memory_failure() so that it can indicate when > > the race has been lost. Fix kill_me_maybe() to scan page tables in > > this case to unmap pages. 
> > > > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c > > index 7962355436da..a52c6a772de2 100644 > > --- a/arch/x86/kernel/cpu/mce/core.c > > +++ b/arch/x86/kernel/cpu/mce/core.c > > @@ -28,8 +28,12 @@ > > #include <linux/sysfs.h> > > #include <linux/types.h> > > #include <linux/slab.h> > > +#include <linux/hugetlb.h> > > +#include <linux/swap.h> > > +#include <linux/swapops.h> > > #include <linux/init.h> > > #include <linux/kmod.h> > > +#include <linux/pagewalk.h> > > #include <linux/poll.h> > > #include <linux/nmi.h> > > #include <linux/cpu.h> > > @@ -637,6 +641,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val, > > { > > struct mce *mce = (struct mce *)data; > > unsigned long pfn; > > + int already = 0; > > > > if (!mce || !mce_usable_address(mce)) > > return NOTIFY_DONE; > > @@ -646,8 +651,9 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val, > > return NOTIFY_DONE; > > > > pfn = mce->addr >> PAGE_SHIFT; > > - if (!memory_failure(pfn, 0)) { > > - set_mce_nospec(pfn, whole_page(mce)); > > + if (!memory_failure(pfn, 0, &already)) { > > + if (!already) > > + set_mce_nospec(pfn, whole_page(mce)); > > mce->kflags |= MCE_HANDLED_UC; > > } > > > > @@ -1248,6 +1254,79 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin > > *m = *final; > > } > > > > +static int pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk) > > +{ > > + u64 pfn = (u64)walk->private; > > + struct page *page; > > + pte_t pteval; > > + > > + if (pte_pfn(*pte) == pfn) { > > +pr_info("pte_entry: matched pfn - mark poison & zap pte\n"); > > + page = pfn_to_page(pfn); > > + lock_page(page); > > +SetPageHWPoison(page); > > + pteval = swp_entry_to_pte(make_hwpoison_entry(page)); > > + dec_mm_counter(walk->mm, mm_counter(page)); > > + set_pte_at(current->mm, addr, pte, pteval); > > + unlock_page(page); > > + flush_tlb_page(walk->vma, addr); > > + } > > + > > + return 0; > > +} > > + > > +static int pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk) > > +{ > > + int shift = PMD_SHIFT - PAGE_SHIFT; > > + u64 pfn = (u64)walk->private; > > + struct page *page; > > + pte_t pteval; > > + > > + if (!pmd_large(*pmd)) > > + return 0; > > + > > + if (pmd_pfn(*pmd) >> shift == pfn >> shift) { > > + page = pfn_to_page(pfn); > > + lock_page(page); > > + pteval = swp_entry_to_pte(make_hwpoison_entry(page)); > > + hugetlb_count_sub(compound_nr(page), walk->mm); > > + set_huge_swap_pte_at(walk->mm, addr, (pte_t *)pmd, pteval, vma_mmu_pagesize(walk->vma)); > > + unlock_page(page); > > + flush_tlb_page(walk->vma, addr); > > + } > > + > > + return 0; > > +} > > + > > +static int pud_entry(pud_t *pud, unsigned long addr, unsigned long next, struct mm_walk *walk) > > +{ > > + int shift = PUD_SHIFT - PAGE_SHIFT; > > + u64 pfn = (u64)walk->private; > > + struct page *page; > > + pte_t pteval; > > + > > + if (!pud_large(*pud)) > > + return 0; > > + > > + if (pud_pfn(*pud) >> shift == pfn >> shift) { > > + page = pfn_to_page(pfn); > > + lock_page(page); > > + pteval = swp_entry_to_pte(make_hwpoison_entry(page)); > > + hugetlb_count_sub(compound_nr(page), walk->mm); > > + set_huge_swap_pte_at(walk->mm, addr, (pte_t *)pud, pteval, vma_mmu_pagesize(walk->vma)); > > + unlock_page(page); > > + flush_tlb_page(walk->vma, addr); > > + } > > + > > + return 0; > > +} > > + > > +static struct mm_walk_ops walk = { > > + .pte_entry = pte_entry, > > + .pmd_entry = pmd_entry, > > + 
.pud_entry = pud_entry > > +}; > > + > > static void kill_me_now(struct callback_head *ch) > > { > > force_sig(SIGBUS); > > @@ -1257,15 +1336,22 @@ static void kill_me_maybe(struct callback_head *cb) > > { > > struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me); > > int flags = MF_ACTION_REQUIRED; > > + int already = 0; > > > > pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr); > > > > if (!p->mce_ripv) > > flags |= MF_MUST_KILL; > > > > - if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) && > > + if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags, &already) && > > !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) { > > - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); > > + if (already) { > > + mmap_read_lock(current->mm); > > + walk_page_range(current->mm, 0, TASK_SIZE_MAX, &walk, (void *)(p->mce_addr >> PAGE_SHIFT)); > > + mmap_read_unlock(current->mm); > > + } else { > > + set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); > > + } > > sync_core(); > > return; > > } > > @@ -1452,7 +1538,7 @@ noinstr void do_machine_check(struct pt_regs *regs) > > EXPORT_SYMBOL_GPL(do_machine_check); > > > > #ifndef CONFIG_MEMORY_FAILURE > > -int memory_failure(unsigned long pfn, int flags) > > +int memory_failure(unsigned long pfn, int flags, int *already) > > { > > /* mce_severity() should not hand us an ACTION_REQUIRED error */ > > BUG_ON(flags & MF_ACTION_REQUIRED); > > diff --git a/drivers/base/memory.c b/drivers/base/memory.c > > index f35298425575..144500983656 100644 > > --- a/drivers/base/memory.c > > +++ b/drivers/base/memory.c > > @@ -480,7 +480,7 @@ static ssize_t hard_offline_page_store(struct device *dev, > > if (kstrtoull(buf, 0, &pfn) < 0) > > return -EINVAL; > > pfn >>= PAGE_SHIFT; > > - ret = memory_failure(pfn, 0); > > + ret = memory_failure(pfn, 0, NULL); > > return ret ? 
ret : count; > > } > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index 77e64e3eac80..beaa6e871cbe 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -3003,7 +3003,7 @@ enum mf_flags { > > MF_MUST_KILL = 1 << 2, > > MF_SOFT_OFFLINE = 1 << 3, > > }; > > -extern int memory_failure(unsigned long pfn, int flags); > > +extern int memory_failure(unsigned long pfn, int flags, int *already); > > extern void memory_failure_queue(unsigned long pfn, int flags); > > extern void memory_failure_queue_kick(int cpu); > > extern int unpoison_memory(unsigned long pfn); > > diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c > > index 1ae1ebc2b9b1..bfd5151dcd3f 100644 > > --- a/mm/hwpoison-inject.c > > +++ b/mm/hwpoison-inject.c > > @@ -48,7 +48,7 @@ static int hwpoison_inject(void *data, u64 val) > > > > inject: > > pr_info("Injecting memory failure at pfn %#lx\n", pfn); > > - return memory_failure(pfn, 0); > > + return memory_failure(pfn, 0, NULL); > > } > > > > static int hwpoison_unpoison(void *data, u64 val) > > diff --git a/mm/madvise.c b/mm/madvise.c > > index df692d2e35d4..09f569fed68d 100644 > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -908,7 +908,7 @@ static int madvise_inject_error(int behavior, > > } else { > > pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n", > > pfn, start); > > - ret = memory_failure(pfn, MF_COUNT_INCREASED); > > + ret = memory_failure(pfn, MF_COUNT_INCREASED, NULL); > > } > > > > if (ret) > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > index 24210c9bd843..9a8911aa5fc9 100644 > > --- a/mm/memory-failure.c > > +++ b/mm/memory-failure.c > > @@ -1398,7 +1398,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, > > * Must run in process context (e.g. a work queue) with interrupts > > * enabled and no spinlocks hold. > > */ > > -int memory_failure(unsigned long pfn, int flags) > > +int memory_failure(unsigned long pfn, int flags, int *already) > > { > > struct page *p; > > struct page *hpage; > > @@ -1428,6 +1428,8 @@ int memory_failure(unsigned long pfn, int flags) > > if (PageHuge(p)) > > return memory_failure_hugetlb(pfn, flags); > > if (TestSetPageHWPoison(p)) { > > + if (already) > > + *already = 1; > > pr_err("Memory failure: %#lx: already hardware poisoned\n", > > pfn); > > return 0; > > @@ -1634,7 +1636,7 @@ static void memory_failure_work_func(struct work_struct *work) > > if (entry.flags & MF_SOFT_OFFLINE) > > soft_offline_page(entry.pfn, entry.flags); > > else > > - memory_failure(entry.pfn, entry.flags); > > + memory_failure(entry.pfn, entry.flags, NULL); > > } > > } > > >
> From the walk, it seems we have got the virtual address, can we just send a SIGBUS with it?
If the walk wins the race and the pte for the poisoned page is still valid, then yes.
But we could have:
CPU1: memory_failure sets poison bit for struct page
CPU1: rmap finds page in task on CPU2 and sets PTE to not-valid-poison
CPU2: memory_failure returns early because struct page already marked as poison
CPU2: walk page tables looking for mapping - don't find it
-Tony
This whole page table walking patch is trying to work around the races caused by multiple calls to memory_failure() for the same page. Maybe better to just avoid the races. The comment right above memory_failure says: * Must run in process context (e.g. a work queue) with interrupts * enabled and no spinlocks hold. So it should be safe to grab and hold a mutex. See patch below. -Tony commit 8dd0dbe7d595e02647e9c2c76c03341a9f6bd7b9 Author: Tony Luck <tony.luck@intel.com> Date: Fri Mar 5 10:40:48 2021 -0800 Use a mutex to avoid memory_failure() races diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 24210c9bd843..c1509f4b565e 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, return rc; } +static DEFINE_MUTEX(mf_mutex); + /** * memory_failure - Handle memory failure of a page. * @pfn: Page Number of the corrupted page @@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags) return -ENXIO; } + mutex_lock(&mf_mutex); + try_again: - if (PageHuge(p)) - return memory_failure_hugetlb(pfn, flags); + if (PageHuge(p)) { + res = memory_failure_hugetlb(pfn, flags); + goto out2; + } + if (TestSetPageHWPoison(p)) { pr_err("Memory failure: %#lx: already hardware poisoned\n", pfn); + mutex_unlock(&mf_mutex); return 0; } @@ -1463,9 +1471,11 @@ int memory_failure(unsigned long pfn, int flags) res = MF_FAILED; } action_result(pfn, MF_MSG_BUDDY, res); + mutex_unlock(&mf_mutex); return res == MF_RECOVERED ? 0 : -EBUSY; } else { action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED); + mutex_unlock(&mf_mutex); return -EBUSY; } } @@ -1473,6 +1483,7 @@ int memory_failure(unsigned long pfn, int flags) if (PageTransHuge(hpage)) { if (try_to_split_thp_page(p, "Memory Failure") < 0) { action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED); + mutex_unlock(&mf_mutex); return -EBUSY; } VM_BUG_ON_PAGE(!page_count(p), p); @@ -1517,6 +1528,7 @@ int memory_failure(unsigned long pfn, int flags) num_poisoned_pages_dec(); unlock_page(p); put_page(p); + mutex_unlock(&mf_mutex); return 0; } if (hwpoison_filter(p)) { @@ -1524,6 +1536,7 @@ int memory_failure(unsigned long pfn, int flags) num_poisoned_pages_dec(); unlock_page(p); put_page(p); + mutex_unlock(&mf_mutex); return 0; } @@ -1559,6 +1572,8 @@ int memory_failure(unsigned long pfn, int flags) res = identify_page_state(pfn, p, page_flags); out: unlock_page(p); +out2: + mutex_unlock(&mf_mutex); return res; } EXPORT_SYMBOL_GPL(memory_failure);
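The point of the mutex, reduced to a minimal sketch (not the actual diff above, which has many more exit paths that each need to drop the lock): a second caller for the same pfn now blocks until the first has finished unmapping, so an early "already poisoned" return really does mean the page has been handled.

static DEFINE_MUTEX(mf_mutex);

int memory_failure(unsigned long pfn, int flags)
{
        struct page *p = pfn_to_page(pfn);
        int res = 0;

        mutex_lock(&mf_mutex);
        if (TestSetPageHWPoison(p)) {
                /* Loser of the race: the winner held the mutex for its whole
                 * memory_failure() call, so by now the page is unmapped and
                 * the affected tasks have been signalled. */
                mutex_unlock(&mf_mutex);
                return 0;
        }
        /* ... winner: hwpoison_user_mappings(), signal the mappers, etc. ... */
        mutex_unlock(&mf_mutex);
        return res;
}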
On Fri, Mar 05, 2021 at 02:11:43PM -0800, Luck, Tony wrote: > This whole page table walking patch is trying to work around the > races caused by multiple calls to memory_failure() for the same > page. > > Maybe better to just avoid the races. The comment right above > memory_failure says: > > * Must run in process context (e.g. a work queue) with interrupts > * enabled and no spinlocks hold. > > So it should be safe to grab and hold a mutex. See patch below. The mutex approach looks simpler and safer, so I'm fine with it. > > -Tony > > commit 8dd0dbe7d595e02647e9c2c76c03341a9f6bd7b9 > Author: Tony Luck <tony.luck@intel.com> > Date: Fri Mar 5 10:40:48 2021 -0800 > > Use a mutex to avoid memory_failure() races > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 24210c9bd843..c1509f4b565e 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -1381,6 +1381,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, > return rc; > } > > +static DEFINE_MUTEX(mf_mutex); > + > /** > * memory_failure - Handle memory failure of a page. > * @pfn: Page Number of the corrupted page > @@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags) > return -ENXIO; > } > > + mutex_lock(&mf_mutex); Is it better to take mutex before memory_failure_dev_pagemap() block? Or we don't have to protect against race for device memory? Thanks, Naoya Horiguchi
>> So it should be safe to grab and hold a mutex. See patch below. > > The mutex approach looks simpler and safer, so I'm fine with it. Thanks. Is that an "Acked-by:"? >> /** >> * memory_failure - Handle memory failure of a page. >> * @pfn: Page Number of the corrupted page >> @@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags) >> return -ENXIO; >> } >> >> + mutex_lock(&mf_mutex); > > Is it better to take mutex before memory_failure_dev_pagemap() block? > Or we don't have to protect against race for device memory? No races (recovery is only attempted for errors in normal memory). -Tony
On Mon, Mar 08, 2021 at 06:54:02PM +0000, Luck, Tony wrote: > >> So it should be safe to grab and hold a mutex. See patch below. > > > > The mutex approach looks simpler and safer, so I'm fine with it. > > Thanks. Is that an "Acked-by:"? Not yet, I intended to add it after full patch is submitted (with your Signed-off-by and commit log). > > >> /** > >> * memory_failure - Handle memory failure of a page. > >> * @pfn: Page Number of the corrupted page > >> @@ -1424,12 +1426,18 @@ int memory_failure(unsigned long pfn, int flags) > >> return -ENXIO; > >> } > >> > >> + mutex_lock(&mf_mutex); > > > > Is it better to take mutex before memory_failure_dev_pagemap() block? > > Or we don't have to protect against race for device memory? > > No races (recovery is only attempted for errors in normal memory). OK, thanks. - Naoya
On Fri, 5 Mar 2021 15:55:25 +0000 "Luck, Tony" <tony.luck@intel.com> wrote: > > From the walk, it seems we have got the virtual address, can we just send a SIGBUS with it? > > If the walk wins the race and the pte for the poisoned page is still valid, then yes. > > But we could have: > > CPU1 CPU2 > memory_failure sets poison > bit for struct page > > > rmap finds page in task > on CPU2 and sets PTE > to not-valid-poison > > memory_failure returns > early because struct page > already marked as poison > > walk page tables looking > for mapping - don't find it > > -Tony While I don't think there is a race condition, and if you really think the pfn with SIGBUS is not proper, I think following patch maybe one way. I copy your abandon code, and make a little modification, and just now it pass my simple test. And also this is a RFC version, only valid if you think the pfn with SIGBUS is not right. Thanks! From a522ab8856e3a332a2318d57bb19f3c59594d462 Mon Sep 17 00:00:00 2001 From: Aili Yao <yaoaili@kingsoft.com> Date: Wed, 10 Mar 2021 13:59:18 +0800 Subject: [PATCH] x86/mce: fix invalid SIGBUS address walk the current process pte and compare with the pfn; 1. only test for normal page and 2M hugetlb page; 2. 1G hugetlb and transparentHuge is not support currently; 3. May other fails is not recognized, This is a RFC version. --- arch/x86/kernel/cpu/mce/core.c | 83 ++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 80 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index db4afc5..65d7ef7 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -28,8 +28,12 @@ #include <linux/sysfs.h> #include <linux/types.h> #include <linux/slab.h> +#include <linux/hugetlb.h> +#include <linux/swap.h> +#include <linux/swapops.h> #include <linux/init.h> #include <linux/kmod.h> +#include <linux/pagewalk.h> #include <linux/poll.h> #include <linux/nmi.h> #include <linux/cpu.h> @@ -1235,6 +1239,81 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin /* mce_clear_state will clear *final, save locally for use later */ *m = *final; } +static int mc_pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk) +{ + u64 *buff = (u64 *)walk->private; + u64 pfn = buff[0]; + + if (!pte_present(*pte) && is_hwpoison_entry(pte_to_swp_entry(*pte))) + goto find; + else if (pte_pfn(*pte) == pfn) + goto find; + + return 0; +find: + buff[0] = addr; + buff[1] = PAGE_SHIFT; + return true; +} + +extern bool is_hugetlb_entry_hwpoisoned(pte_t pte); + +static int mc_hugetlb_range(pte_t *ptep, unsigned long hmask, + unsigned long addr, unsigned long end, + struct mm_walk *walk) +{ + u64 *buff = (u64 *)walk->private; + u64 pfn = buff[0]; + int shift = PMD_SHIFT; + pte_t pte = huge_ptep_get(ptep); + + if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) + goto find; + + if (pte_pfn(*ptep) == pfn) + goto find; + + return 0; +find: + buff[0] = addr; + buff[1] = shift; + return true; +} + +static struct mm_walk_ops walk = { + .pte_entry = mc_pte_entry, + .hugetlb_entry = mc_hugetlb_range +}; + +void mc_memory_failure_error(struct task_struct *p, unsigned long pfn) +{ + u64 buff[2] = {pfn, 0}; + struct page *page; + int ret = -1; + + page = pfn_to_page(pfn); + if (!page) + goto force_sigbus; + + if (is_zone_device_page(page)) + goto force_sigbus; + + mmap_read_lock(p->mm); + ret = walk_page_range(p->mm, 0, TASK_SIZE_MAX, &walk, (void *)buff); + mmap_read_unlock(p->mm); + + if (ret && buff[0]) { + 
pr_err("Memory error may not recovered: %#llx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", + buff[0], p->comm, p->pid); + force_sig_mceerr(BUS_MCEERR_AR, (void __user *)buff[0], buff[1]); + } else { +force_sigbus: + pr_err("Memory error may not recovered, pfn: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", + pfn, p->comm, p->pid); + force_sig_mceerr(BUS_MCEERR_AR, (void __user *)pfn, PAGE_SHIFT); + } + +} static void kill_me_now(struct callback_head *ch) { @@ -1259,9 +1338,7 @@ static void kill_me_maybe(struct callback_head *cb) } if (p->mce_vaddr != (void __user *)-1l) { - pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", - p->mce_addr >> PAGE_SHIFT, p->comm, p->pid); - force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT); + mc_memory_failure_error(current, p->mce_addr >> PAGE_SHIFT); } else { pr_err("Memory error not recovered"); kill_me_now(cb);
On Wed, Mar 10, 2021 at 02:10:42PM +0800, Aili Yao wrote: > On Fri, 5 Mar 2021 15:55:25 +0000 > "Luck, Tony" <tony.luck@intel.com> wrote: > > > > From the walk, it seems we have got the virtual address, can we just send a SIGBUS with it? > > > > If the walk wins the race and the pte for the poisoned page is still valid, then yes. > > > > But we could have: > > > > CPU1 CPU2 > > memory_failure sets poison > > bit for struct page > > > > > > rmap finds page in task > > on CPU2 and sets PTE > > to not-valid-poison > > > > memory_failure returns > > early because struct page > > already marked as poison > > > > walk page tables looking > > for mapping - don't find it > > > > -Tony > > While I don't think there is a race condition, and if you really think the pfn with SIGBUS is not > proper, I think following patch maybe one way. > I copy your abandon code, and make a little modification, and just now it pass > my simple test. > > And also this is a RFC version, only valid if you think the pfn with SIGBUS is not right. > > Thanks! > > From a522ab8856e3a332a2318d57bb19f3c59594d462 Mon Sep 17 00:00:00 2001 > From: Aili Yao <yaoaili@kingsoft.com> > Date: Wed, 10 Mar 2021 13:59:18 +0800 > Subject: [PATCH] x86/mce: fix invalid SIGBUS address > > walk the current process pte and compare with the pfn; > 1. only test for normal page and 2M hugetlb page; > 2. 1G hugetlb and transparentHuge is not support currently; > 3. May other fails is not recognized, This is a RFC version. > > --- > arch/x86/kernel/cpu/mce/core.c | 83 ++++++++++++++++++++++++++++++++++++++++-- > 1 file changed, 80 insertions(+), 3 deletions(-) > > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c > index db4afc5..65d7ef7 100644 > --- a/arch/x86/kernel/cpu/mce/core.c > +++ b/arch/x86/kernel/cpu/mce/core.c > @@ -28,8 +28,12 @@ > #include <linux/sysfs.h> > #include <linux/types.h> > #include <linux/slab.h> > +#include <linux/hugetlb.h> > +#include <linux/swap.h> > +#include <linux/swapops.h> > #include <linux/init.h> > #include <linux/kmod.h> > +#include <linux/pagewalk.h> > #include <linux/poll.h> > #include <linux/nmi.h> > #include <linux/cpu.h> Maybe requiring many dependencies like this implies that you might be better to do below in mm/memory-failure.c instead of in this file. > @@ -1235,6 +1239,81 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin > /* mce_clear_state will clear *final, save locally for use later */ > *m = *final; > } > +static int mc_pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk) > +{ > + u64 *buff = (u64 *)walk->private; > + u64 pfn = buff[0]; > + > + if (!pte_present(*pte) && is_hwpoison_entry(pte_to_swp_entry(*pte))) > + goto find; > + else if (pte_pfn(*pte) == pfn) > + goto find; > + > + return 0; > +find: > + buff[0] = addr; > + buff[1] = PAGE_SHIFT; > + return true; Returning true means you stop walking when you find the first entry pointing to a given pfn. But there could be multiple such entries, so if MCE SRAR is triggered by memory access to the larger address in hwpoisoned entries, the returned virtual address might be wrong. 
> +} > + > +extern bool is_hugetlb_entry_hwpoisoned(pte_t pte); > + > +static int mc_hugetlb_range(pte_t *ptep, unsigned long hmask, > + unsigned long addr, unsigned long end, > + struct mm_walk *walk) > +{ > + u64 *buff = (u64 *)walk->private; > + u64 pfn = buff[0]; > + int shift = PMD_SHIFT; > + pte_t pte = huge_ptep_get(ptep); > + > + if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) > + goto find; > + > + if (pte_pfn(*ptep) == pfn) > + goto find; > + > + return 0; > +find: > + buff[0] = addr; > + buff[1] = shift; > + return true; > +} > + > +static struct mm_walk_ops walk = { > + .pte_entry = mc_pte_entry, > + .hugetlb_entry = mc_hugetlb_range > +}; > + > +void mc_memory_failure_error(struct task_struct *p, unsigned long pfn) > +{ > + u64 buff[2] = {pfn, 0}; > + struct page *page; > + int ret = -1; > + > + page = pfn_to_page(pfn); > + if (!page) > + goto force_sigbus; > + > + if (is_zone_device_page(page)) > + goto force_sigbus; > + > + mmap_read_lock(p->mm); > + ret = walk_page_range(p->mm, 0, TASK_SIZE_MAX, &walk, (void *)buff); > + mmap_read_unlock(p->mm); > + > + if (ret && buff[0]) { > + pr_err("Memory error may not recovered: %#llx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", > + buff[0], p->comm, p->pid); > + force_sig_mceerr(BUS_MCEERR_AR, (void __user *)buff[0], buff[1]); > + } else { > +force_sigbus: > + pr_err("Memory error may not recovered, pfn: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", > + pfn, p->comm, p->pid); > + force_sig_mceerr(BUS_MCEERR_AR, (void __user *)pfn, PAGE_SHIFT); > + } > + > +} > > static void kill_me_now(struct callback_head *ch) > { > @@ -1259,9 +1338,7 @@ static void kill_me_maybe(struct callback_head *cb) > } > > if (p->mce_vaddr != (void __user *)-1l) { > - pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", > - p->mce_addr >> PAGE_SHIFT, p->comm, p->pid); > - force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT); > + mc_memory_failure_error(current, p->mce_addr >> PAGE_SHIFT); I guess that p->mce_vaddr stores the virtual address of the error here. If so, sending SIGBUS with the address looks enough as we do now, so why do you walk page table to find the error virtual address? Thanks, Naoya Horiguchi
On Thu, 11 Mar 2021 08:55:30 +0000 HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com> wrote: > On Wed, Mar 10, 2021 at 02:10:42PM +0800, Aili Yao wrote: > > On Fri, 5 Mar 2021 15:55:25 +0000 > > "Luck, Tony" <tony.luck@intel.com> wrote: > > > > > > From the walk, it seems we have got the virtual address, can we just send a SIGBUS with it? > > > > > > If the walk wins the race and the pte for the poisoned page is still valid, then yes. > > > > > > But we could have: > > > > > > CPU1 CPU2 > > > memory_failure sets poison > > > bit for struct page > > > > > > > > > rmap finds page in task > > > on CPU2 and sets PTE > > > to not-valid-poison > > > > > > memory_failure returns > > > early because struct page > > > already marked as poison > > > > > > walk page tables looking > > > for mapping - don't find it > > > > > > -Tony > > > > While I don't think there is a race condition, and if you really think the pfn with SIGBUS is not > > proper, I think following patch maybe one way. > > I copy your abandon code, and make a little modification, and just now it pass > > my simple test. > > > > And also this is a RFC version, only valid if you think the pfn with SIGBUS is not right. > > > > Thanks! > > > > From a522ab8856e3a332a2318d57bb19f3c59594d462 Mon Sep 17 00:00:00 2001 > > From: Aili Yao <yaoaili@kingsoft.com> > > Date: Wed, 10 Mar 2021 13:59:18 +0800 > > Subject: [PATCH] x86/mce: fix invalid SIGBUS address > > > > walk the current process pte and compare with the pfn; > > 1. only test for normal page and 2M hugetlb page; > > 2. 1G hugetlb and transparentHuge is not support currently; > > 3. May other fails is not recognized, This is a RFC version. > > > > --- > > arch/x86/kernel/cpu/mce/core.c | 83 ++++++++++++++++++++++++++++++++++++++++-- > > 1 file changed, 80 insertions(+), 3 deletions(-) > > > > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c > > index db4afc5..65d7ef7 100644 > > --- a/arch/x86/kernel/cpu/mce/core.c > > +++ b/arch/x86/kernel/cpu/mce/core.c > > @@ -28,8 +28,12 @@ > > #include <linux/sysfs.h> > > #include <linux/types.h> > > #include <linux/slab.h> > > +#include <linux/hugetlb.h> > > +#include <linux/swap.h> > > +#include <linux/swapops.h> > > #include <linux/init.h> > > #include <linux/kmod.h> > > +#include <linux/pagewalk.h> > > #include <linux/poll.h> > > #include <linux/nmi.h> > > #include <linux/cpu.h> > > Maybe requiring many dependencies like this implies that you might be better > to do below in mm/memory-failure.c instead of in this file. Yes, agree, I will change this, Thanks! > > @@ -1235,6 +1239,81 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin > > /* mce_clear_state will clear *final, save locally for use later */ > > *m = *final; > > } > > +static int mc_pte_entry(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk) > > +{ > > + u64 *buff = (u64 *)walk->private; > > + u64 pfn = buff[0]; > > + > > + if (!pte_present(*pte) && is_hwpoison_entry(pte_to_swp_entry(*pte))) > > + goto find; > > + else if (pte_pfn(*pte) == pfn) > > + goto find; > > + > > + return 0; > > +find: > > + buff[0] = addr; > > + buff[1] = PAGE_SHIFT; > > + return true; > > Returning true means you stop walking when you find the first entry pointing > to a given pfn. But there could be multiple such entries, so if MCE SRAR is > triggered by memory access to the larger address in hwpoisoned entries, the > returned virtual address might be wrong. 
Yes, We need to consider multiple posion page entries, I will fix this. Thanks for you sugguestion! > > +} > > + > > +extern bool is_hugetlb_entry_hwpoisoned(pte_t pte); > > + > > +static int mc_hugetlb_range(pte_t *ptep, unsigned long hmask, > > + unsigned long addr, unsigned long end, > > + struct mm_walk *walk) > > +{ > > + u64 *buff = (u64 *)walk->private; > > + u64 pfn = buff[0]; > > + int shift = PMD_SHIFT; > > + pte_t pte = huge_ptep_get(ptep); > > + > > + if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) > > + goto find; > > + > > + if (pte_pfn(*ptep) == pfn) > > + goto find; > > + > > + return 0; > > +find: > > + buff[0] = addr; > > + buff[1] = shift; > > + return true; > > +} > > + > > +static struct mm_walk_ops walk = { > > + .pte_entry = mc_pte_entry, > > + .hugetlb_entry = mc_hugetlb_range > > +}; > > + > > +void mc_memory_failure_error(struct task_struct *p, unsigned long pfn) > > +{ > > + u64 buff[2] = {pfn, 0}; > > + struct page *page; > > + int ret = -1; > > + > > + page = pfn_to_page(pfn); > > + if (!page) > > + goto force_sigbus; > > + > > + if (is_zone_device_page(page)) > > + goto force_sigbus; > > + > > + mmap_read_lock(p->mm); > > + ret = walk_page_range(p->mm, 0, TASK_SIZE_MAX, &walk, (void *)buff); > > + mmap_read_unlock(p->mm); > > + > > + if (ret && buff[0]) { > > + pr_err("Memory error may not recovered: %#llx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", > > + buff[0], p->comm, p->pid); > > + force_sig_mceerr(BUS_MCEERR_AR, (void __user *)buff[0], buff[1]); > > + } else { > > +force_sigbus: > > + pr_err("Memory error may not recovered, pfn: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", > > + pfn, p->comm, p->pid); > > + force_sig_mceerr(BUS_MCEERR_AR, (void __user *)pfn, PAGE_SHIFT); > > + } > > + > > +} > > > > static void kill_me_now(struct callback_head *ch) > > { > > @@ -1259,9 +1338,7 @@ static void kill_me_maybe(struct callback_head *cb) > > } > > > > if (p->mce_vaddr != (void __user *)-1l) { > > - pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", > > - p->mce_addr >> PAGE_SHIFT, p->comm, p->pid); > > - force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT); > > + mc_memory_failure_error(current, p->mce_addr >> PAGE_SHIFT); > > I guess that p->mce_vaddr stores the virtual address of the error here. > If so, sending SIGBUS with the address looks enough as we do now, so why > do you walk page table to find the error virtual address? I check the code again, yes, I should have placed mc_memory_failure_error in else branch, but it seems p->mce_vaddr is not correctly initialized and for my test, it has a zero value then code goes into if (p->mce_vaddr != (void __user *)-1l) branch; It seems this is another issue needing to fix; And from the p->mce_vaddr, possibly there is a better way to get vaddr, i will dig more about this.
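One possible way to handle the multiple-mapping point discussed above -- a sketch only, not part of the posted RFC, using the swapops/pagewalk helpers the RFC already pulls in; the list structure and its fixed size are made up for illustration -- is to keep walking and record every user address that maps the poisoned pfn rather than stopping at the first hit:

struct mce_vaddr_list {
        unsigned long pfn;      /* poisoned pfn we are looking for */
        unsigned long addr[16]; /* every user address found to map it */
        int nr;
};

static int mc_pte_entry(pte_t *pte, unsigned long addr, unsigned long next,
                        struct mm_walk *walk)
{
        struct mce_vaddr_list *v = walk->private;

        if ((pte_present(*pte) && pte_pfn(*pte) == v->pfn) ||
            (!pte_present(*pte) && is_hwpoison_entry(pte_to_swp_entry(*pte)))) {
                if (v->nr < ARRAY_SIZE(v->addr))
                        v->addr[v->nr++] = addr;
        }

        return 0;       /* keep walking: do not stop at the first match */
}

The caller could then signal each recorded address, or fall back to the pfn-based SIGBUS when the walk records nothing.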
> I guess that p->mce_vaddr stores the virtual address of the error here. > If so, sending SIGBUS with the address looks enough as we do now, so why > do you walk page table to find the error virtual address? p->mce_vaddr only has the virtual address for the COPYIN case. In that code path we decode the kernel instruction that hit the fault in order to find the virtual address. That's easy because: 1) The kernel RIP is known to be good (can't page fault etc. on kernel address). 2) There are only a half dozen instructions used by the kernel for get_user() or copy_from_user(). When the machine check happens during user execution accessing poison data we only have the physical address (from MCi_ADDR). -Tony
On Thu, 11 Mar 2021 17:05:53 +0000 "Luck, Tony" <tony.luck@intel.com> wrote: > > I guess that p->mce_vaddr stores the virtual address of the error here. > > If so, sending SIGBUS with the address looks enough as we do now, so why > > do you walk page table to find the error virtual address? > > p->mce_vaddr only has the virtual address for the COPYIN case. In that code > path we decode the kernel instruction that hit the fault in order to find the virtual > address. That's easy because: > > 1) The kernel RIP is known to be good (can't page fault etc. on kernel address). > 2) There are only a half dozen instructions used by the kernel for get_user() or > copy_from_user(). > > When the machine check happens during user execution accessing poison data > we only have the physical address (from MCi_ADDR). > > -Tony Sorry to interrupt as I am really confused here: If it's a copyin case, has the page been mapped for the current process? will memory_failure() find it and unmap it? if succeed, then the current will be signaled with correct vaddr and shift? Maybe the mce_vaddr is set correctly, but we may lost the correct page shift? And for copyin case, we don't need to call set_mce_nospec()?
> Sorry to interrupt as I am really confused here: > If it's a copyin case, has the page been mapped for the current process? Yes. The kernel has the current process mapped to the low address range and code like get_user(), put_user() "simply" reaches down to those addresses (maybe not so simply, as the access needs to be within a STAC/CLAC section of code on modern CPUs, and the access instruction needs an entry in EXTABLE so that the page fault handler can fix it if there isn't a page mapped at the user address). > will memory_failure() find it and unmap it? if succeed, then the current will be > signaled with correct vaddr and shift? That's a very good question. I didn't see a SIGBUS when I first wrote this code, hence all the p->mce_vaddr. But now I'm a) not sure why there wasn't a signal b) if we are to fix the problems noted by AndyL, need to make sure that there isn't a SIGBUS > Maybe the mce_vaddr is set correctly, but we may lost the correct page shift? Yes. There is a hard-coded PAGE_SHIFT for this case - which may not match the actual page size. > And for copyin case, we don't need to call set_mce_nospec()? Yes. We should. Thanks for your good questions. -Tony
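As a generic illustration of the copy-in pattern described above (not a specific kernel call site): the user access goes through a helper with a fixup entry, so a machine check raised while reading the poisoned user page is recovered in kernel context rather than crashing, and that is the KERNEL_COPYIN situation being discussed.

static int read_user_word(const int __user *uptr, int *out)
{
        int val;

        if (get_user(val, uptr))        /* poison consumed here is recovered in kernel context */
                return -EFAULT;

        *out = val;
        return 0;
}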
>> will memory_failure() find it and unmap it? if succeed, then the current will be >> signaled with correct vaddr and shift? > > That's a very good question. I didn't see a SIGBUS when I first wrote this code, > hence all the p->mce_vaddr. But now I'm > a) not sure why there wasn't a signal > b) if we are to fix the problems noted by AndyL, need to make sure that there isn't a SIGBUS Tests on upstream kernel today show that memory_failure() is both unmapping the page and sending a SIGBUS. My biggest issue with the KERNEL_COPYIN recovery path is that we don't have code to mark the page not present while we are still in do_machine_check(). That's resulted in recovery working for simple cases where there is a single get_user() call followed by an error return if that failed. But more complex cases require more machine checks and a touching faith that the kernel will eventually give up trying (spoiler: it sometimes doesn't). Thanks to the decode of the instruction we do have the virtual address. So we just need a safe walk of pgd->p4d->pud->pmd->pte (truncated if we hit a huge page) with a write of a "not-present" value. Maybe a different poison type from the one we get from memory_failure() so that the #PF code can recognize this as a special case and do any other work that we avoided because we were in #MC context. -Tony
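A rough sketch of that descent (assumptions: x86-64, 'vaddr' already decoded from the kernel instruction, 'poison_pte' standing in for whatever not-present marker value gets chosen; the pgd/p4d/pud/pmd none checks, TLB flushing and the actual huge-page write are all elided, and real code running in #MC context would have to be far more careful):

static void mark_user_addr_not_present(unsigned long vaddr, pte_t poison_pte)
{
        pgd_t *pgd = pgd_offset(current->mm, vaddr);
        p4d_t *p4d = p4d_offset(pgd, vaddr);
        pud_t *pud = pud_offset(p4d, vaddr);
        pmd_t *pmd;
        pte_t *pte;

        if (pud_large(*pud))
                return;                 /* 1G mapping: truncate walk at pud level */

        pmd = pmd_offset(pud, vaddr);
        if (pmd_large(*pmd))
                return;                 /* 2M mapping: truncate walk at pmd level */

        pte = pte_offset_kernel(pmd, vaddr);
        set_pte(pte, poison_pte);
}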
On Fri, Mar 12, 2021 at 11:48:31PM +0000, Luck, Tony wrote: > >> will memory_failure() find it and unmap it? if succeed, then the current will be > >> signaled with correct vaddr and shift? > > > > That's a very good question. I didn't see a SIGBUS when I first wrote this code, > > hence all the p->mce_vaddr. But now I'm > > a) not sure why there wasn't a signal We have a recent change around this behavior, which might be an answer for you. See commit 30c9cf492704 ("mm,hwpoison: send SIGBUS to PF_MCE_EARLY processes on action required events"). > > b) if we are to fix the problems noted by AndyL, need to make sure that there isn't a SIGBUS > > Tests on upstream kernel today show that memory_failure() is both unmapping the page > and sending a SIGBUS. Sorry if I misunderstood the exact problem, but if the point is to allow user processes to find poisoned virtual address without SIGBUS, one possible way is to expose hwpoison entry via /proc/pid/pagemap (attached a draft patch below). With this patch, processes easily find hwpoison entries without any new interface. > > My biggest issue with the KERNEL_COPYIN recovery path is that we don't have code > to mark the page not present while we are still in do_machine_check(). It sounds to me that even if we have such code, it just narrows down the race window between multiple MCEs on different CPUs. Or does it completely prevent the race? (Another thread could touch the poisoned page just before the first thread marks the page non-present?) > That's resulted > in recovery working for simple cases where there is a single get_user() call followed by > an error return if that failed. But more complex cases require more machine checks and > a touching faith that the kernel will eventually give up trying (spoiler: it sometimes doesn't). > > Thanks to the decode of the instruction we do have the virtual address. So we just need > a safe walk of pgd->p4d->pud->pmd->pte (truncated if we hit a huge page) with a write > of a "not-present" value. Maybe a different poison type from the one we get from > memory_failure() so that the #PF code can recognize this as a special case and do any > other work that we avoided because we were in #MC context. As you answered in another email, p->mce_vaddr is set only on KERNEL_COPYIN case, then if p->mce_vaddr is available also for generic SRAR MCE (I'm assuming that we are talking about issues on race between generic SRAR MCE not only for KERNEL_COPYIN case), that might be helpful, although it might be hard to implement. And I'm afraid that walking page table could find the wrong virtual address if a process have multiple entries to the single page. We could send multiple SIGBUSs for such case, but maybe that's not an optimal solution. 
Thanks, Naoya Horiguchi ---- From 147449a97e2ea3420ac3523f13523f5d30a13614 Mon Sep 17 00:00:00 2001 From: Naoya Horiguchi <naoya.horiguchi@nec.com> Date: Tue, 16 Mar 2021 14:22:21 +0900 Subject: [PATCH] pagemap: expose hwpoison entry not-signed-off-by-yet: Naoya Horiguchi <naoya.horiguchi@nec.com> --- fs/proc/task_mmu.c | 6 ++++++ include/linux/swapops.h | 12 ++++++++++++ tools/vm/page-types.c | 5 ++++- 3 files changed, 22 insertions(+), 1 deletion(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 602e3a52884d..08cea209bae7 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1300,6 +1300,7 @@ struct pagemapread { #define PM_PFRAME_MASK GENMASK_ULL(PM_PFRAME_BITS - 1, 0) #define PM_SOFT_DIRTY BIT_ULL(55) #define PM_MMAP_EXCLUSIVE BIT_ULL(56) +#define PM_HWPOISON BIT_ULL(60) #define PM_FILE BIT_ULL(61) #define PM_SWAP BIT_ULL(62) #define PM_PRESENT BIT_ULL(63) @@ -1385,6 +1386,11 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm, if (is_migration_entry(entry)) page = migration_entry_to_page(entry); + if (is_hwpoison_entry(entry)) { + page = hwpoison_entry_to_page(entry); + flags |= PM_HWPOISON; + } + if (is_device_private_entry(entry)) page = device_private_entry_to_page(entry); } diff --git a/include/linux/swapops.h b/include/linux/swapops.h index d9b7c9132c2f..1b9dedbd06ab 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -323,6 +323,13 @@ static inline int is_hwpoison_entry(swp_entry_t entry) return swp_type(entry) == SWP_HWPOISON; } +static inline struct page *hwpoison_entry_to_page(swp_entry_t entry) +{ + struct page *p = pfn_to_page(swp_offset(entry)); + WARN_ON(!PageHWPoison(p)); + return p; +} + static inline void num_poisoned_pages_inc(void) { atomic_long_inc(&num_poisoned_pages); @@ -345,6 +352,11 @@ static inline int is_hwpoison_entry(swp_entry_t swp) return 0; } +static inline struct page *hwpoison_entry_to_page(swp_entry_t entry) +{ + return NULL; +} + static inline void num_poisoned_pages_inc(void) { } diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c index 0517c744b04e..1160d5a14955 100644 --- a/tools/vm/page-types.c +++ b/tools/vm/page-types.c @@ -53,6 +53,7 @@ #define PM_SWAP_OFFSET(x) (((x) & PM_PFRAME_MASK) >> MAX_SWAPFILES_SHIFT) #define PM_SOFT_DIRTY (1ULL << 55) #define PM_MMAP_EXCLUSIVE (1ULL << 56) +#define PM_HWPOISON (1ULL << 60) #define PM_FILE (1ULL << 61) #define PM_SWAP (1ULL << 62) #define PM_PRESENT (1ULL << 63) @@ -311,6 +312,8 @@ static unsigned long pagemap_pfn(uint64_t val) if (val & PM_PRESENT) pfn = PM_PFRAME(val); + else if (val & PM_HWPOISON) + pfn = PM_SWAP_OFFSET(val); else pfn = 0; @@ -742,7 +745,7 @@ static void walk_vma(unsigned long index, unsigned long count) pfn = pagemap_pfn(buf[i]); if (pfn) walk_pfn(index + i, pfn, 1, buf[i]); - if (buf[i] & PM_SWAP) + else if (buf[i] & PM_SWAP) walk_swap(index + i, buf[i]); }
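For reference, user-space consumption of the proposed bit would be straightforward. A minimal sketch follows, assuming the draft above were applied: the PM_HWPOISON bit in position 60 comes only from that draft and is not part of any released kernel; the rest is the existing pagemap format of one 64-bit entry per virtual page.

#include <sys/types.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define PM_HWPOISON	(1ULL << 60)	/* proposed in the draft patch above, not upstream */

/* Return 1 if the page backing vaddr is recorded as hwpoisoned, 0 if not, -1 on error. */
static int vaddr_is_hwpoisoned(const void *vaddr)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	uint64_t entry;
	off_t off = (uintptr_t)vaddr / pagesize * sizeof(entry);
	int fd = open("/proc/self/pagemap", O_RDONLY);
	int ret = -1;

	if (fd < 0)
		return -1;
	if (pread(fd, &entry, sizeof(entry), off) == sizeof(entry))
		ret = !!(entry & PM_HWPOISON);
	close(fd);
	return ret;
}

int main(void)
{
	static char buf[1 << 12];

	buf[0] = 1;	/* fault the page in so it has a pagemap entry */
	printf("hwpoison? %d\n", vaddr_is_hwpoisoned(buf));
	return 0;
}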
> As you answered in another email, p->mce_vaddr is set only on KERNEL_COPYIN case, > then if p->mce_vaddr is available also for generic SRAR MCE (I'm assuming that we are > talking about issues on race between generic SRAR MCE not only for KERNEL_COPYIN case), > that might be helpful, although it might be hard to implement. > And I'm afraid that walking page table could find the wrong virtual address if a process > have multiple entries to the single page. We could send multiple SIGBUSs for such case, > but maybe that's not an optimal solution. Yes, agree, I can't find one way to address this multiple entries case, to make sure we get the exact correct virtual address. But also I will post a v2 RFC patch for this issue for only discussion purpose! Thanks! > ---- > From 147449a97e2ea3420ac3523f13523f5d30a13614 Mon Sep 17 00:00:00 2001 > From: Naoya Horiguchi <naoya.horiguchi@nec.com> > Date: Tue, 16 Mar 2021 14:22:21 +0900 > Subject: [PATCH] pagemap: expose hwpoison entry > > not-signed-off-by-yet: Naoya Horiguchi <naoya.horiguchi@nec.com> > --- > fs/proc/task_mmu.c | 6 ++++++ > include/linux/swapops.h | 12 ++++++++++++ > tools/vm/page-types.c | 5 ++++- > 3 files changed, 22 insertions(+), 1 deletion(-) > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index 602e3a52884d..08cea209bae7 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -1300,6 +1300,7 @@ struct pagemapread { > #define PM_PFRAME_MASK GENMASK_ULL(PM_PFRAME_BITS - 1, 0) > #define PM_SOFT_DIRTY BIT_ULL(55) > #define PM_MMAP_EXCLUSIVE BIT_ULL(56) > +#define PM_HWPOISON BIT_ULL(60) > #define PM_FILE BIT_ULL(61) > #define PM_SWAP BIT_ULL(62) > #define PM_PRESENT BIT_ULL(63) > @@ -1385,6 +1386,11 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm, > if (is_migration_entry(entry)) > page = migration_entry_to_page(entry); > > + if (is_hwpoison_entry(entry)) { > + page = hwpoison_entry_to_page(entry); > + flags |= PM_HWPOISON; > + } > + > if (is_device_private_entry(entry)) > page = device_private_entry_to_page(entry); > } > diff --git a/include/linux/swapops.h b/include/linux/swapops.h > index d9b7c9132c2f..1b9dedbd06ab 100644 > --- a/include/linux/swapops.h > +++ b/include/linux/swapops.h > @@ -323,6 +323,13 @@ static inline int is_hwpoison_entry(swp_entry_t entry) > return swp_type(entry) == SWP_HWPOISON; > } > > +static inline struct page *hwpoison_entry_to_page(swp_entry_t entry) > +{ > + struct page *p = pfn_to_page(swp_offset(entry)); > + WARN_ON(!PageHWPoison(p)); > + return p; > +} > + > static inline void num_poisoned_pages_inc(void) > { > atomic_long_inc(&num_poisoned_pages); > @@ -345,6 +352,11 @@ static inline int is_hwpoison_entry(swp_entry_t swp) > return 0; > } > > +static inline struct page *hwpoison_entry_to_page(swp_entry_t entry) > +{ > + return NULL; > +} > + > static inline void num_poisoned_pages_inc(void) > { > } > diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c > index 0517c744b04e..1160d5a14955 100644 > --- a/tools/vm/page-types.c > +++ b/tools/vm/page-types.c > @@ -53,6 +53,7 @@ > #define PM_SWAP_OFFSET(x) (((x) & PM_PFRAME_MASK) >> MAX_SWAPFILES_SHIFT) > #define PM_SOFT_DIRTY (1ULL << 55) > #define PM_MMAP_EXCLUSIVE (1ULL << 56) > +#define PM_HWPOISON (1ULL << 60) > #define PM_FILE (1ULL << 61) > #define PM_SWAP (1ULL << 62) > #define PM_PRESENT (1ULL << 63) > @@ -311,6 +312,8 @@ static unsigned long pagemap_pfn(uint64_t val) > > if (val & PM_PRESENT) > pfn = PM_PFRAME(val); > + else if (val & PM_HWPOISON) > + pfn = PM_SWAP_OFFSET(val); > else > pfn = 0; 
> > @@ -742,7 +745,7 @@ static void walk_vma(unsigned long index, unsigned long count) > pfn = pagemap_pfn(buf[i]); > if (pfn) > walk_pfn(index + i, pfn, 1, buf[i]); > - if (buf[i] & PM_SWAP) > + else if (buf[i] & PM_SWAP) > walk_swap(index + i, buf[i]); > } >
On Fri, Mar 12, 2021 at 11:48:31PM +0000, Luck, Tony wrote:
> Thanks to the decode of the instruction we do have the virtual address. So we just need
> a safe walk of pgd->p4d->pud->pmd->pte (truncated if we hit a huge page) with a write
> of a "not-present" value. Maybe a different poison type from the one we get from
> memory_failure() so that the #PF code can recognize this as a special case and do any
> other work that we avoided because we were in #MC context.

There is no such thing as the safe walk I describe above. In a multi-threaded application other threads might munmap(2) bits of the address space, which could free some of the p4d->pud->pmd->pte tables.

But the pgd table is guaranteed to exist as long as any of the threads in the process exist. Which gave rise to a new approach (partial credit to Dave Hansen, who helped me understand why a more extreme patch that I wanted to use wasn't working ... and he pointed out the pgd persistence).

N.B. The code below DOES NOT WORK. My test application tried to do a write(2) syscall with some poison in the buffer. The goal is that write should return a short count (with only the bytes from the buffer leading up to the poison written to the file). Currently the process gets a surprise SIGSEGV in some glibc code :-(

The code would also need a bunch more work to handle the fact that in a multi-threaded application multiple threads might consume the poison, and some innocent threads not consuming the poison may also end up drawn into the melee in the page fault handler.

The big picture.
---------------

This approach passes the buck for the recovery from the #MC handler (which has very limited options to do anything useful) to the #PF handler (which is a much friendlier environment where locks and mutexes can be obtained as needed).

The mechanism for this buck passing is for the #MC handler to set a reserved bit in the PGD entry for the user address where the machine check happened, flush the TLB for that address, and then return from the #MC handler to the kernel get_user()/copy_from_user() code, which will re-execute the instruction to access the poison address, but this time take a #PF because of the reserved bit in the PGD.

There are a couple of changes bundled in here to help my debugging:
1) Turn off UCNA calls to memory_failure() ... just avoids races and makes tests more deterministic for now
2) Disable "fast string" support ... avoid "REP MOVS" copy routines since I'm testing on an old system that treats poison consumed by REP MOVS as fatal.

Posted here for general comments on the approach. On the plus side it avoids taking multiple machine checks in the kernel when it is accessing user data. I think it can also meet the goal set by Andy Lutomirski of avoiding SIGBUS from syscalls that happen to touch user poison. They should see either short counts for calls like write(2) that may partially succeed, or -EFAULT if the system call totally failed.
-Tony --- diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index f24d7ef8fffa..e533eaf20834 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -23,6 +23,7 @@ #define _PAGE_BIT_SOFTW2 10 /* " */ #define _PAGE_BIT_SOFTW3 11 /* " */ #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */ +#define _PAGE_BIT_RESERVED1 51 /* Reserved bit */ #define _PAGE_BIT_SOFTW4 58 /* available for programmer */ #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */ #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */ @@ -56,6 +57,7 @@ #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE) #define _PAGE_SPECIAL (_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL) #define _PAGE_CPA_TEST (_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST) +#define _PAGE_RESERVED1 (_AT(pteval_t, 1) << _PAGE_BIT_RESERVED1) #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS #define _PAGE_PKEY_BIT0 (_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT0) #define _PAGE_PKEY_BIT1 (_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT1) diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h index 7f7200021bd1..269e8ee04c11 100644 --- a/arch/x86/include/asm/traps.h +++ b/arch/x86/include/asm/traps.h @@ -45,4 +45,24 @@ void __noreturn handle_stack_overflow(const char *message, unsigned long fault_address); #endif +static inline void pgd_set_reserved(pgd_t *pgdp, unsigned long addr) +{ + pgd_t pgd; + + pgdp += pgd_index(addr); + pgd = *pgdp; + pgd.pgd |= _PAGE_RESERVED1; + WRITE_ONCE(*pgdp, pgd); +} + +static inline void pgd_clr_reserved(pgd_t *pgdp, unsigned long addr) +{ + pgd_t pgd; + + pgdp += pgd_index(addr); + pgd = *pgdp; + pgd.pgd &= ~_PAGE_RESERVED1; + WRITE_ONCE(*pgdp, pgd); +} + #endif /* _ASM_X86_TRAPS_H */ diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index 0e422a544835..974e1eb5d1aa 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -287,6 +287,7 @@ static void early_init_intel(struct cpuinfo_x86 *c) */ if (c->x86 > 6 || (c->x86 == 6 && c->x86_model >= 0xd)) { rdmsrl(MSR_IA32_MISC_ENABLE, misc_enable); +misc_enable = 0; if (!(misc_enable & MSR_IA32_MISC_ENABLE_FAST_STRING)) { pr_info("Disabled fast string operations\n"); setup_clear_cpu_cap(X86_FEATURE_REP_GOOD); diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 7962355436da..41bedc961928 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -635,6 +635,7 @@ static struct notifier_block early_nb = { static int uc_decode_notifier(struct notifier_block *nb, unsigned long val, void *data) { +#if 0 struct mce *mce = (struct mce *)data; unsigned long pfn; @@ -650,7 +651,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val, set_mce_nospec(pfn, whole_page(mce)); mce->kflags |= MCE_HANDLED_UC; } - +#endif return NOTIFY_OK; } @@ -1443,8 +1444,18 @@ noinstr void do_machine_check(struct pt_regs *regs) mce_panic("Failed kernel mode recovery", &m, msg); } - if (m.kflags & MCE_IN_KERNEL_COPYIN) - queue_task_work(&m, kill_current_task); + /* + * Machine check while copying from user space. Note that we + * do not call fixup_exception(). Instead we ensure a page fault + * when we return to the faulting instruction. + * Let the page fault handler do the rest of the + * recovery. 
+ */ + if (m.kflags & MCE_IN_KERNEL_COPYIN) { + flush_tlb_one_user((unsigned long)current->mce_vaddr); + pgd_set_reserved(current->mm->pgd, (unsigned long)current->mce_vaddr); + current->mce_addr = m.addr; + } } out: mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c index 83df991314c5..da917945150d 100644 --- a/arch/x86/kernel/cpu/mce/severity.c +++ b/arch/x86/kernel/cpu/mce/severity.c @@ -282,7 +282,6 @@ static int error_context(struct mce *m, struct pt_regs *regs) return IN_KERNEL_RECOV; } if (t == EX_HANDLER_UACCESS && regs && is_copy_from_user(regs)) { - m->kflags |= MCE_IN_KERNEL_RECOV; m->kflags |= MCE_IN_KERNEL_COPYIN; return IN_KERNEL_RECOV; } diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S index 77b9b2a3b5c8..e0e71ca023ce 100644 --- a/arch/x86/lib/copy_user_64.S +++ b/arch/x86/lib/copy_user_64.S @@ -234,24 +234,11 @@ EXPORT_SYMBOL(copy_user_enhanced_fast_string) */ SYM_CODE_START_LOCAL(.Lcopy_user_handle_tail) movl %edx,%ecx - cmp $X86_TRAP_MC,%eax /* check if X86_TRAP_MC */ - je 3f 1: rep movsb 2: mov %ecx,%eax ASM_CLAC ret - /* - * Return zero to pretend that this copy succeeded. This - * is counter-intuitive, but needed to prevent the code - * in lib/iov_iter.c from retrying and running back into - * the poison cache line again. The machine check handler - * will ensure that a SIGBUS is sent to the task. - */ -3: xorl %eax,%eax - ASM_CLAC - ret - _ASM_EXTABLE_CPY(1b, 2b) SYM_CODE_END(.Lcopy_user_handle_tail) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index a73347e2cdfc..49232264e893 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1245,9 +1245,13 @@ void do_user_addr_fault(struct pt_regs *regs, /* * Reserved bits are never expected to be set on * entries in the user portion of the page tables. + * Except when machine check handling of a copy from + * user passed the buck to #PF. */ - if (unlikely(error_code & X86_PF_RSVD)) - pgtable_bad(regs, error_code, address); + if (unlikely(error_code & X86_PF_RSVD)) { + if (!current->mce_vaddr) + pgtable_bad(regs, error_code, address); + } /* * If SMAP is on, check for invalid kernel (supervisor) access to user @@ -1291,6 +1295,15 @@ void do_user_addr_fault(struct pt_regs *regs, local_irq_enable(); } + if (current->mce_vaddr) { + pgd_clr_reserved(current->mm->pgd, + (unsigned long)current->mce_vaddr); + memory_failure(current->mce_addr >> PAGE_SHIFT, 0); + current->mce_vaddr = 0; + error_code &= ~X86_PF_RSVD; + pr_info("#PF: maybe fixed? Try to continue\n"); + } + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); if (error_code & X86_PF_WRITE)
> > Returning true means you stop walking when you find the first entry pointing > to a given pfn. But there could be multiple such entries, so if MCE SRAR is > triggered by memory access to the larger address in hwpoisoned entries, the > returned virtual address might be wrong. > I can't find the way to fix this, maybe the virtual address is contained in related register, but this is really beyong my knowledge. This is a v2 RFC patch, add support for thp and 1G huge page errors. Thanks Aili Yao From 31b685609610b3b06c8fd98d866913dbfeb7e159 Mon Sep 17 00:00:00 2001 From: Aili Yao <yaoaili@kingsoft.com> Date: Wed, 17 Mar 2021 15:34:15 +0800 Subject: [PATCH] fix invalid SIGBUS address for recovery fail Walk the current process pages and compare with the pfn, then get the user address and related page_shift. For thp pages, we can only split anonoums thp page, so I think there may be no race condition for walking and searching the thp error page for such case; For non anonymous thp, the page flag and pte will not be change. so when code goes into this place, it may be race condition for non-anonoums thp page or from a recovery fail for anonoums thp, the page status will not change, i am not so sure about this; For the case we don't find the related virtual address, Maybe sending one BUS_MCEERR_AR signal with invalid address NULL is a better option, but i am not sure. And this may get the wrong virtual address if process have multiple entry for a same page, I don't find a way to get it correct. Maybe other issues is not recognized. --- arch/x86/kernel/cpu/mce/core.c | 16 ++--- include/linux/mm.h | 1 + mm/memory-failure.c | 145 ++++++++++++++++++++++++++++++++++++++++- 3 files changed, 152 insertions(+), 10 deletions(-) diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index db4afc5..1bf21cc 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -28,8 +28,12 @@ #include <linux/sysfs.h> #include <linux/types.h> #include <linux/slab.h> +#include <linux/hugetlb.h> +#include <linux/swap.h> +#include <linux/swapops.h> #include <linux/init.h> #include <linux/kmod.h> +#include <linux/pagewalk.h> #include <linux/poll.h> #include <linux/nmi.h> #include <linux/cpu.h> @@ -1246,7 +1250,7 @@ static void kill_me_maybe(struct callback_head *cb) struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me); int flags = MF_ACTION_REQUIRED; - pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr); + pr_err("Uncorrected hardware memory error in user-access at %llx, %llx", p->mce_addr, p->mce_vaddr); if (!p->mce_ripv) flags |= MF_MUST_KILL; @@ -1258,14 +1262,8 @@ static void kill_me_maybe(struct callback_head *cb) return; } - if (p->mce_vaddr != (void __user *)-1l) { - pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", - p->mce_addr >> PAGE_SHIFT, p->comm, p->pid); - force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT); - } else { - pr_err("Memory error not recovered"); - kill_me_now(cb); - } + memory_failure_error(current, p->mce_addr >> PAGE_SHIFT); + } static void queue_task_work(struct mce *m, int kill_current_task) diff --git a/include/linux/mm.h b/include/linux/mm.h index ecdf8a8..cff2f02 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3046,6 +3046,7 @@ enum mf_flags { MF_SOFT_OFFLINE = 1 << 3, }; extern int memory_failure(unsigned long pfn, int flags); +extern void memory_failure_error(struct task_struct *p, unsigned long pfn); extern void 
memory_failure_queue(unsigned long pfn, int flags); extern void memory_failure_queue_kick(int cpu); extern int unpoison_memory(unsigned long pfn); diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 06f0061..aaf99a7 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -56,8 +56,10 @@ #include <linux/kfifo.h> #include <linux/ratelimit.h> #include <linux/page-isolation.h> +#include <linux/pagewalk.h> #include "internal.h" #include "ras/ras_event.h" +#include <linux/delay.h> int sysctl_memory_failure_early_kill __read_mostly = 0; @@ -240,7 +242,7 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags) int ret = 0; pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", - pfn, t->comm, t->pid); + tk->addr, t->comm, t->pid); if (flags & MF_ACTION_REQUIRED) { WARN_ON_ONCE(t != current); @@ -1417,6 +1419,7 @@ int memory_failure(unsigned long pfn, int flags) try_again: if (PageHuge(p)) return memory_failure_hugetlb(pfn, flags); + if (TestSetPageHWPoison(p)) { pr_err("Memory failure: %#lx: already hardware poisoned\n", pfn); @@ -1553,6 +1556,146 @@ int memory_failure(unsigned long pfn, int flags) } EXPORT_SYMBOL_GPL(memory_failure); +static int pte_range(pte_t *ptep, unsigned long addr, unsigned long next, struct mm_walk *walk) +{ + u64 *buff = (u64 *)walk->private; + u64 pfn = buff[0]; + pte_t pte = *ptep; + + if (!pte_none(pte) && !pte_present(pte)) { + swp_entry_t swp_temp = pte_to_swp_entry(pte); + + printk("%s, %d \n", __FUNCTION__, __LINE__); + if (is_hwpoison_entry(swp_temp) && swp_offset(swp_temp) == pfn) { + printk("%s, %d \n", __FUNCTION__, __LINE__); + goto find; + } + } else if (pte_pfn(pte) == pfn) { + printk("%s, %d \n", __FUNCTION__, __LINE__); + goto find; + } + + return 0; + +find: + buff[0] = addr; + buff[1] = PAGE_SHIFT; + return true; +} + +static int pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, + struct mm_walk *walk) +{ + u64 *buff = (u64 *)walk->private; + struct page *page = (struct page *)buff[0]; + u64 pfn = buff[1]; + pmd_t pmd = *pmdp; + + if (likely(!pmd_trans_huge(pmd))) + return 0; + + if (pmd_none(pmd) || !pmd_present(pmd)) + return 0; + + if (pmd_page(pmd) != page) + return 0; + + for (; addr != end; page++, addr += PAGE_SIZE) { + if (page_to_pfn(page) == pfn) { + printk("%s, %d \n", __FUNCTION__, __LINE__); + buff[0] = addr; + buff[1] = PAGE_SHIFT; + return true; + } + } + + return 0; +} + +static int hugetlb_range(pte_t *ptep, unsigned long hmask, + unsigned long addr, unsigned long end, + struct mm_walk *walk) +{ + u64 *buff = (u64 *)walk->private; + u64 pfn = buff[0]; + pte_t pte = huge_ptep_get(ptep); + struct page *page = pfn_to_page(pfn); + + if (!huge_pte_none(pte) && !pte_present(pte)) { + swp_entry_t swp_temp = pte_to_swp_entry(pte); + printk("%s, %d \n", __FUNCTION__, __LINE__); + if (is_hwpoison_entry(swp_temp) && swp_offset(swp_temp) == pfn) { + printk("%s, %d \n", __FUNCTION__, __LINE__); + goto find; + } + } + if (pte_pfn(pte) == pfn) { + printk("%s, %d \n", __FUNCTION__, __LINE__); + goto find; + } + return 0; + +find: + buff[0] = addr; + buff[1] = (huge_page_size(page_hstate(page)) > PMD_SIZE) ? 
PUD_SHIFT : PMD_SHIFT; + return true; +} + +void memory_failure_error(struct task_struct *p, unsigned long pfn) +{ + u64 buff[2] = {0}; + struct page *page; + int ret = -1; + struct mm_walk_ops walk = {0}; + + if (p->mce_vaddr != (void __user *)-1l && p->mce_vaddr != (void __user *)0) { + printk("%s, %d \n", __FUNCTION__, __LINE__); + force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT); + return; + } + + page = pfn_to_page(pfn); + if (!page) + goto force_sigbus; + + if (is_zone_device_page(page)) + goto force_sigbus; + + page = compound_head(page); + + if (PageHuge(page)) { + printk("%s, %d \n", __FUNCTION__, __LINE__); + walk.hugetlb_entry = hugetlb_range; + buff[0] = page_to_pfn(page); + } else if (PageTransHuge(page)) { + printk("%s, %d \n", __FUNCTION__, __LINE__); + walk.pmd_entry = pmd_range; + buff[0] = (u64)page; + buff[1] = pfn; + } else { + printk("%s, %d \n", __FUNCTION__, __LINE__); + walk.pte_entry = pte_range; + buff[0] = pfn; + } + msleep(1000); + mmap_read_lock(p->mm); + ret = walk_page_range(p->mm, 0, TASK_SIZE_MAX, &walk, (void *)buff); + mmap_read_unlock(p->mm); + + pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", + pfn, p->comm, p->pid); + + if (ret) { + printk("%s, %d \n", __FUNCTION__, __LINE__); + force_sig_mceerr(BUS_MCEERR_AR, (void __user *)buff[0], buff[1]); + } else { +force_sigbus: + printk("%s, %d \n", __FUNCTION__, __LINE__); + force_sig_mceerr(BUS_MCEERR_AR, (void __user *)0, PAGE_SHIFT); + } +} +EXPORT_SYMBOL_GPL(memory_failure_error); + #define MEMORY_FAILURE_FIFO_ORDER 4 #define MEMORY_FAILURE_FIFO_SIZE (1 << MEMORY_FAILURE_FIFO_ORDER)
> > > > Returning true means you stop walking when you find the first entry pointing > > to a given pfn. But there could be multiple such entries, so if MCE SRAR is > > triggered by memory access to the larger address in hwpoisoned entries, the > > returned virtual address might be wrong. > > > > I can't find the way to fix this, maybe the virtual address is contained in > related register, but this is really beyong my knowledge. > > This is a v2 RFC patch, add support for thp and 1G huge page errors. > Sorry for the debug info and other unclean modifications. Post a clean one. Thanks Aili Yao From 2289276ba943cdcddbf3b5b2cdbcaff78690e2e8 Mon Sep 17 00:00:00 2001 From: Aili Yao <yaoaili@kingsoft.com> Date: Wed, 17 Mar 2021 16:12:41 +0800 Subject: [PATCH] fix invalid SIGBUS address for recovery fail Walk the current process pages and compare with the pfn, then get the user address and related page_shift. For thp pages, we can only split anonoums thp page, so I think there may be no race condition for walking and searching the thp error page for such case; For non anonymous thp, the page flag and pte will not be change. so when code goes into this place, it may be race condition for non-anonoums thp page or from a recovery fail for anonoums thp, the page status will not change, i am not so sure about this; For the case we don't find the related virtual address, Maybe sending one BUS_MCEERR_AR signal with invalid address NULL is a better option, but i am not sure. And this may get the wrong virtual address if process have multiple entry for a same page, I don't find a way to get it correct. Maybe other issues is not recognized. --- arch/x86/kernel/cpu/mce/core.c | 12 +--- include/linux/mm.h | 1 + mm/memory-failure.c | 127 +++++++++++++++++++++++++++++++++++++++++ 3 files changed, 131 insertions(+), 9 deletions(-) diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index db4afc5..4cb873c 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -1246,7 +1246,7 @@ static void kill_me_maybe(struct callback_head *cb) struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me); int flags = MF_ACTION_REQUIRED; - pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr); + pr_err("Uncorrected hardware memory error in user-access at %llx\n", p->mce_addr); if (!p->mce_ripv) flags |= MF_MUST_KILL; @@ -1258,14 +1258,8 @@ static void kill_me_maybe(struct callback_head *cb) return; } - if (p->mce_vaddr != (void __user *)-1l) { - pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", - p->mce_addr >> PAGE_SHIFT, p->comm, p->pid); - force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT); - } else { - pr_err("Memory error not recovered"); - kill_me_now(cb); - } + memory_failure_error(current, p->mce_addr >> PAGE_SHIFT); + } static void queue_task_work(struct mce *m, int kill_current_task) diff --git a/include/linux/mm.h b/include/linux/mm.h index ecdf8a8..cff2f02 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3046,6 +3046,7 @@ enum mf_flags { MF_SOFT_OFFLINE = 1 << 3, }; extern int memory_failure(unsigned long pfn, int flags); +extern void memory_failure_error(struct task_struct *p, unsigned long pfn); extern void memory_failure_queue(unsigned long pfn, int flags); extern void memory_failure_queue_kick(int cpu); extern int unpoison_memory(unsigned long pfn); diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 06f0061..359b42f 100644 --- 
a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -56,6 +56,7 @@ #include <linux/kfifo.h> #include <linux/ratelimit.h> #include <linux/page-isolation.h> +#include <linux/pagewalk.h> #include "internal.h" #include "ras/ras_event.h" @@ -1553,6 +1554,132 @@ int memory_failure(unsigned long pfn, int flags) } EXPORT_SYMBOL_GPL(memory_failure); +static int pte_range(pte_t *ptep, unsigned long addr, unsigned long next, struct mm_walk *walk) +{ + u64 *buff = (u64 *)walk->private; + u64 pfn = buff[0]; + pte_t pte = *ptep; + + if (!pte_none(pte) && !pte_present(pte)) { + swp_entry_t swp_temp = pte_to_swp_entry(pte); + + if (is_hwpoison_entry(swp_temp) && swp_offset(swp_temp) == pfn) + goto find; + } else if (pte_pfn(pte) == pfn) { + goto find; + } + + return 0; + +find: + buff[0] = addr; + buff[1] = PAGE_SHIFT; + return true; +} + +static int pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, + struct mm_walk *walk) +{ + u64 *buff = (u64 *)walk->private; + struct page *page = (struct page *)buff[0]; + u64 pfn = buff[1]; + pmd_t pmd = *pmdp; + + if (likely(!pmd_trans_huge(pmd))) + return 0; + + if (pmd_none(pmd) || !pmd_present(pmd)) + return 0; + + if (pmd_page(pmd) != page) + return 0; + + for (; addr != end; page++, addr += PAGE_SIZE) { + if (page_to_pfn(page) == pfn) { + buff[0] = addr; + buff[1] = PAGE_SHIFT; + return true; + } + } + + return 0; +} + +static int hugetlb_range(pte_t *ptep, unsigned long hmask, + unsigned long addr, unsigned long end, + struct mm_walk *walk) +{ + u64 *buff = (u64 *)walk->private; + u64 pfn = buff[0]; + pte_t pte = huge_ptep_get(ptep); + struct page *page = pfn_to_page(pfn); + + if (!huge_pte_none(pte) && !pte_present(pte)) { + swp_entry_t swp_temp = pte_to_swp_entry(pte); + + if (is_hwpoison_entry(swp_temp) && swp_offset(swp_temp) == pfn) + goto find; + } + if (pte_pfn(pte) == pfn) + goto find; + + return 0; + +find: + buff[0] = addr; + buff[1] = (huge_page_size(page_hstate(page)) > PMD_SIZE) ? PUD_SHIFT : PMD_SHIFT; + return true; +} + +void memory_failure_error(struct task_struct *p, unsigned long pfn) +{ + u64 buff[2] = {0}; + struct page *page; + int ret = -1; + struct mm_walk_ops walk = {0}; + + if (p->mce_vaddr != (void __user *)-1l && p->mce_vaddr != (void __user *)0) { + force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT); + return; + } + + page = pfn_to_page(pfn); + if (!page) + goto force_sigbus; + + if (is_zone_device_page(page)) + goto force_sigbus; + + page = compound_head(page); + + if (PageHuge(page)) { + walk.hugetlb_entry = hugetlb_range; + buff[0] = page_to_pfn(page); + } else if (PageTransHuge(page)) { + walk.pmd_entry = pmd_range; + buff[0] = (u64)page; + buff[1] = pfn; + } else { + walk.pte_entry = pte_range; + buff[0] = pfn; + } + + mmap_read_lock(p->mm); + ret = walk_page_range(p->mm, 0, TASK_SIZE_MAX, &walk, (void *)buff); + mmap_read_unlock(p->mm); + + pr_err("Memory error may not recovered: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", + pfn, p->comm, p->pid); + + if (ret) { + force_sig_mceerr(BUS_MCEERR_AR, (void __user *)buff[0], buff[1]); + } else { +force_sigbus: + force_sig_mceerr(BUS_MCEERR_AR, (void __user *)0, PAGE_SHIFT); + } +} +EXPORT_SYMBOL_GPL(memory_failure_error); + #define MEMORY_FAILURE_FIFO_ORDER 4 #define MEMORY_FAILURE_FIFO_SIZE (1 << MEMORY_FAILURE_FIFO_ORDER)
On Tue, 16 Mar 2021 17:29:39 -0700 "Luck, Tony" <tony.luck@intel.com> wrote: > On Fri, Mar 12, 2021 at 11:48:31PM +0000, Luck, Tony wrote: > > Thanks to the decode of the instruction we do have the virtual address. So we just need > > a safe walk of pgd->p4d->pud->pmd->pte (truncated if we hit a huge page) with a write > > of a "not-present" value. Maybe a different poison type from the one we get from > > memory_failure() so that the #PF code can recognize this as a special case and do any > > other work that we avoided because we were in #MC context. > > There is no such thing as the safe walk I describe above. In a multi > threaded application other threads might munmap(2) bits of the address > space which could free some of the p4d->pud->pmd->pte tables. > > But the pgd table is guaranteed to exist as long as any of the threads > in the process exist. Which gave rise to a new approach (partial credit > to Dave Hansen who helped me understand why a more extreme patch that > I wanted to use wasn't working ... and he pointed out the pgd persistence). > > N.B. The code below DOES NOT WORK. My test application tried to do a write(2) > syscall with some poison in the buffer. Goal is that write should return a > short count (with only the bytes from the buffer leading up to the poison > written to the file). Currently the process gets a suprise SIGSEGV in > some glibc code :-( > > The code would also need a bunch more work to handle the fact that > in a multi-threaded application multiple threads might consume the > poison, and some innocent threads not consuming the poison may also end > up drawn into the melee in the page fault handler. > > > The big picture. > --------------- > > This approach passes the buck for the recovery from the #MC handler > (which has very limited options to do anything useful) to the #PF > handler (which is a much friendlier environment where locks and mutexes > can be obtained as needed). > > The mechanism for this buck passing is for the #MC handler to set a > reserved bit in the PGD entry for the user address where the machine > check happened, flush the TLB for that address, and then return from > the #MC handler to the kernel get_user()/copy_from_user() code which > will re-execute the instruction to access the poison address, but this > time take a #PF because of the reserved bit in the PGD. > > There's a couple of changes bundled in here to help my debugging: > 1) Turn off UCNA calls to memory_failure() ... just avoids races > and make tests more determinstic for now > 2) Disable "fast string" support ... avoid "REP MOVS" copy routines > since I'm testing on an old system that treats poison consumed > by REP MOVS as fatal. > > Posted here for general comments on the approach. On the plus > side it avoids taking multiple machine checks in the kernel when > it is accessing user data. I think it can also meet the goal set > by Andy Lutomirski of avoiding SIGBUS from syscalls that happen > to touch user poison. They should see either short counts for > calls like write(2) that may partially succeed or -EFAULT if > the system call totally failed. > I have another view here, but maybe wrong, post here for discussion: When the process is signaled SIGBUS with BUS_MCEERR_AR, i think it should be from a read, with the data read, the process can proceed, like process the data, or make decision. When for poison, the data can't be returned, the process don't know what to do, and kill is the first option. 
while for a copyin case, the read is excuted by kernel, and for poison kernel will refuse to excute current read and further operation. For the process, it seems it have a change to proceed. if just error code is returned, the process may care or not, it may not correctly process the error. It seems the worst case here is the process will touch the poison page again, trigger another MCE and accordingly be killed. It's not that bad? Thanks Aili Yao > > diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h > index f24d7ef8fffa..e533eaf20834 100644 > --- a/arch/x86/include/asm/pgtable_types.h > +++ b/arch/x86/include/asm/pgtable_types.h > @@ -23,6 +23,7 @@ > #define _PAGE_BIT_SOFTW2 10 /* " */ > #define _PAGE_BIT_SOFTW3 11 /* " */ > #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */ > +#define _PAGE_BIT_RESERVED1 51 /* Reserved bit */ > #define _PAGE_BIT_SOFTW4 58 /* available for programmer */ > #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */ > #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */ > @@ -56,6 +57,7 @@ > #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE) > #define _PAGE_SPECIAL (_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL) > #define _PAGE_CPA_TEST (_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST) > +#define _PAGE_RESERVED1 (_AT(pteval_t, 1) << _PAGE_BIT_RESERVED1) > #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS > #define _PAGE_PKEY_BIT0 (_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT0) > #define _PAGE_PKEY_BIT1 (_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT1) > diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h > index 7f7200021bd1..269e8ee04c11 100644 > --- a/arch/x86/include/asm/traps.h > +++ b/arch/x86/include/asm/traps.h > @@ -45,4 +45,24 @@ void __noreturn handle_stack_overflow(const char *message, > unsigned long fault_address); > #endif > > +static inline void pgd_set_reserved(pgd_t *pgdp, unsigned long addr) > +{ > + pgd_t pgd; > + > + pgdp += pgd_index(addr); > + pgd = *pgdp; > + pgd.pgd |= _PAGE_RESERVED1; > + WRITE_ONCE(*pgdp, pgd); > +} > + > +static inline void pgd_clr_reserved(pgd_t *pgdp, unsigned long addr) > +{ > + pgd_t pgd; > + > + pgdp += pgd_index(addr); > + pgd = *pgdp; > + pgd.pgd &= ~_PAGE_RESERVED1; > + WRITE_ONCE(*pgdp, pgd); > +} > + > #endif /* _ASM_X86_TRAPS_H */ > diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c > index 0e422a544835..974e1eb5d1aa 100644 > --- a/arch/x86/kernel/cpu/intel.c > +++ b/arch/x86/kernel/cpu/intel.c > @@ -287,6 +287,7 @@ static void early_init_intel(struct cpuinfo_x86 *c) > */ > if (c->x86 > 6 || (c->x86 == 6 && c->x86_model >= 0xd)) { > rdmsrl(MSR_IA32_MISC_ENABLE, misc_enable); > +misc_enable = 0; > if (!(misc_enable & MSR_IA32_MISC_ENABLE_FAST_STRING)) { > pr_info("Disabled fast string operations\n"); > setup_clear_cpu_cap(X86_FEATURE_REP_GOOD); > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c > index 7962355436da..41bedc961928 100644 > --- a/arch/x86/kernel/cpu/mce/core.c > +++ b/arch/x86/kernel/cpu/mce/core.c > @@ -635,6 +635,7 @@ static struct notifier_block early_nb = { > static int uc_decode_notifier(struct notifier_block *nb, unsigned long val, > void *data) > { > +#if 0 > struct mce *mce = (struct mce *)data; > unsigned long pfn; > > @@ -650,7 +651,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val, > set_mce_nospec(pfn, whole_page(mce)); > mce->kflags |= MCE_HANDLED_UC; > } > - > +#endif > return NOTIFY_OK; > } > > @@ -1443,8 +1444,18 @@ noinstr void 
do_machine_check(struct pt_regs *regs) > mce_panic("Failed kernel mode recovery", &m, msg); > } > > - if (m.kflags & MCE_IN_KERNEL_COPYIN) > - queue_task_work(&m, kill_current_task); > + /* > + * Machine check while copying from user space. Note that we > + * do not call fixup_exception(). Instead we ensure a page fault > + * when we return to the faulting instruction. > + * Let the page fault handler do the rest of the > + * recovery. > + */ > + if (m.kflags & MCE_IN_KERNEL_COPYIN) { > + flush_tlb_one_user((unsigned long)current->mce_vaddr); > + pgd_set_reserved(current->mm->pgd, (unsigned long)current->mce_vaddr); > + current->mce_addr = m.addr; > + } > } > out: > mce_wrmsrl(MSR_IA32_MCG_STATUS, 0); > diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c > index 83df991314c5..da917945150d 100644 > --- a/arch/x86/kernel/cpu/mce/severity.c > +++ b/arch/x86/kernel/cpu/mce/severity.c > @@ -282,7 +282,6 @@ static int error_context(struct mce *m, struct pt_regs *regs) > return IN_KERNEL_RECOV; > } > if (t == EX_HANDLER_UACCESS && regs && is_copy_from_user(regs)) { > - m->kflags |= MCE_IN_KERNEL_RECOV; > m->kflags |= MCE_IN_KERNEL_COPYIN; > return IN_KERNEL_RECOV; > } > diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S > index 77b9b2a3b5c8..e0e71ca023ce 100644 > --- a/arch/x86/lib/copy_user_64.S > +++ b/arch/x86/lib/copy_user_64.S > @@ -234,24 +234,11 @@ EXPORT_SYMBOL(copy_user_enhanced_fast_string) > */ > SYM_CODE_START_LOCAL(.Lcopy_user_handle_tail) > movl %edx,%ecx > - cmp $X86_TRAP_MC,%eax /* check if X86_TRAP_MC */ > - je 3f > 1: rep movsb > 2: mov %ecx,%eax > ASM_CLAC > ret > > - /* > - * Return zero to pretend that this copy succeeded. This > - * is counter-intuitive, but needed to prevent the code > - * in lib/iov_iter.c from retrying and running back into > - * the poison cache line again. The machine check handler > - * will ensure that a SIGBUS is sent to the task. > - */ > -3: xorl %eax,%eax > - ASM_CLAC > - ret > - > _ASM_EXTABLE_CPY(1b, 2b) > SYM_CODE_END(.Lcopy_user_handle_tail) > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index a73347e2cdfc..49232264e893 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -1245,9 +1245,13 @@ void do_user_addr_fault(struct pt_regs *regs, > /* > * Reserved bits are never expected to be set on > * entries in the user portion of the page tables. > + * Except when machine check handling of a copy from > + * user passed the buck to #PF. > */ > - if (unlikely(error_code & X86_PF_RSVD)) > - pgtable_bad(regs, error_code, address); > + if (unlikely(error_code & X86_PF_RSVD)) { > + if (!current->mce_vaddr) > + pgtable_bad(regs, error_code, address); > + } > > /* > * If SMAP is on, check for invalid kernel (supervisor) access to user > @@ -1291,6 +1295,15 @@ void do_user_addr_fault(struct pt_regs *regs, > local_irq_enable(); > } > > + if (current->mce_vaddr) { > + pgd_clr_reserved(current->mm->pgd, > + (unsigned long)current->mce_vaddr); > + memory_failure(current->mce_addr >> PAGE_SHIFT, 0); > + current->mce_vaddr = 0; > + error_code &= ~X86_PF_RSVD; > + pr_info("#PF: maybe fixed? Try to continue\n"); > + } > + > perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); > > if (error_code & X86_PF_WRITE)
On Thu, 25 Feb 2021 12:39:30 +0100 Oscar Salvador <osalvador@suse.de> wrote:
> On Thu, Feb 25, 2021 at 11:28:18AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > Hi Aili,
> >
> > I agree that this set_mce_nospec() is not expected to be called for
> > "already hwpoisoned" page because in the reported case the error
> > page is already contained and no need to resort changing cache mode.
>
> Out of curiosity, what is the current behavour now?
> Say we have an ongoing MCE which has marked the page as HWPoison but
> memory_failure did not take any action on the page yet.
> And then, we have another MCE, which ends up there.
> set_mce_nospec might clear _PAGE_PRESENT bit.
>
> Does that have any impact on the first MCE?
>
> > It seems to me that memory_failure() does not return MF_XXX. But yes,
> > returning some positive value for the reported case could be a solution.
>
> No, you are right. I somehow managed to confuse myself.
> I see now that MF_XXX return codes are filtered out in page_action.
>
> > We could use some negative value (error code) to report the reported case,
> > then as you mentioned above, some callers need change to handle the
> > new case, and the same is true if you use some positive value.
> > My preference is -EHWPOISON, but other options are fine if justified well.
>
> -EHWPOISON seems like a good fit.

Hi Oscar, David:

It has been a long time since this topic, but I noticed today that I made a stupid mistake: EHWPOISON is in fact already declared, so we had better return EHWPOISON for this case.

Really sorry for this!

As the patch is still under review, I will post a new version. If I make this change, may I add your review tags there, please?
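To make the direction concrete, the caller side might end up looking roughly like the sketch below. This is not the v2 patch, just an illustration under the assumption that memory_failure() is changed to return -EHWPOISON for an already-poisoned page; what exactly to do in that branch is what the review still has to settle.

static void kill_me_maybe(struct callback_head *cb)
{
	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
	int flags = MF_ACTION_REQUIRED;
	int ret;

	if (!p->mce_ripv)
		flags |= MF_MUST_KILL;

	ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
	if (!ret && !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
		/* Recovery succeeded: mark the page not-present/uncacheable and carry on. */
		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
		sync_core();
		return;
	}

	if (ret == -EHWPOISON) {
		/*
		 * The page was already poisoned by an earlier #MC whose
		 * recovery may still be in flight.  Recovery did not really
		 * succeed for *this* task, so don't pretend it did: log it
		 * and kill the task before it consumes the corrupted data.
		 */
		pr_err("Memory failure: %#lx: already poisoned, killing %s:%d\n",
		       p->mce_addr >> PAGE_SHIFT, p->comm, p->pid);
		kill_me_now(cb);
		return;
	}

	pr_err("Memory error not recovered");
	kill_me_now(cb);
}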
On 31.03.21 12:56, Aili Yao wrote: > On Thu, 25 Feb 2021 12:39:30 +0100 > Oscar Salvador <osalvador@suse.de> wrote: > >> On Thu, Feb 25, 2021 at 11:28:18AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote: >>> Hi Aili, >>> >>> I agree that this set_mce_nospec() is not expected to be called for >>> "already hwpoisoned" page because in the reported case the error >>> page is already contained and no need to resort changing cache mode. >> >> Out of curiosity, what is the current behavour now? >> Say we have an ongoing MCE which has marked the page as HWPoison but >> memory_failure did not take any action on the page yet. >> And then, we have another MCE, which ends up there. >> set_mce_nospec might clear _PAGE_PRESENT bit. >> >> Does that have any impact on the first MCE? >> >>> It seems to me that memory_failure() does not return MF_XXX. But yes, >>> returning some positive value for the reported case could be a solution. >> >> No, you are right. I somehow managed to confuse myself. >> I see now that MF_XXX return codes are filtered out in page_action. >> >>> We could use some negative value (error code) to report the reported case, >>> then as you mentioned above, some callers need change to handle the >>> new case, and the same is true if you use some positive value. >>> My preference is -EHWPOISON, but other options are fine if justified well. >> >> -EHWPOISON seems like a good fit. >> > > Hi Oscar, david: > > Long away fron this topic, but i noticed today I made a stupid mistake that EHWPOISON is already > been declared, so we should better return EHWPOISON for this case. > > Really sorry for this! > > As the patch is still under review, I will post a new version for this, if I change this, may I add > your review tag here please? Just resend as v2. We will review and post our RBs there.