Message ID | 20210521030156.2612074-3-nao.horiguchi@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | mm,hwpoison: fix sending SIGBUS for Action Required MCE |

On Fri, May 21, 2021 at 12:01:55PM +0900, Naoya Horiguchi wrote:
> From: Aili Yao <yaoaili@kingsoft.com>
>
> When memory_failure() is called with MF_ACTION_REQUIRED on the
> page that has already been hwpoisoned, memory_failure() could fail
> to send SIGBUS to the affected process, which results in infinite
> loop of MCEs.
>
> Currently memory_failure() returns 0 if it's called for already
> hwpoisoned page, then the caller, kill_me_maybe(), could return
> without sending SIGBUS to current process. An action required MCE
> is raised when the current process accesses to the broken memory,
> so no SIGBUS means that the current process continues to run and
> access to the error page again soon, so running into MCE loop.
>
> This issue can arise for example in the following scenarios:
>
> - Two or more threads access to the poisoned page concurrently.
>   If local MCE is enabled, MCE handler independently handles the
>   MCE events. So there's a race among MCE events, and the
>   second or latter threads fall into the situation in question.
>
> - If there was a precedent memory error event and memory_failure()
>   for the event failed to unmap the error page for some reason,
>   the subsequent memory access to the error page triggers the
>   MCE loop situation.
>
> To fix the issue, make memory_failure() return an error code when the
> error page has already been hwpoisoned. This allows memory error
> handler to control how it sends signals to userspace. And make sure
> that any process touching a hwpoisoned page should get a SIGBUS even
> in "already hwpoisoned" path of memory_failure() as is done in page
> fault path.
>
> Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

diff --git v5.13-rc2/mm/memory-failure.c v5.13-rc2_patched/mm/memory-failure.c
index 0f0b932ccbca..8add7cafad5e 100644
--- v5.13-rc2/mm/memory-failure.c
+++ v5.13-rc2_patched/mm/memory-failure.c
@@ -1247,7 +1247,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	if (TestSetPageHWPoison(head)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 		       pfn);
-		return 0;
+		return -EHWPOISON;
 	}
 
 	num_poisoned_pages_inc();
@@ -1456,6 +1456,7 @@ int memory_failure(unsigned long pfn, int flags)
 
 	if (TestSetPageHWPoison(p)) {
 		pr_err("Memory failure: %#lx: already hardware poisoned\n",
 			pfn);
+		res = -EHWPOISON;
 		goto unlock_mutex;
 	}
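
For readers who want to see the effect of the new return value in isolation, here is a minimal userspace sketch, not kernel code. It models only the "already hwpoisoned" branch changed above. Every `sketch_*` name is hypothetical, the poison flags are a plain array rather than page flags, and the caller is an illustration of how a handler could react to -EHWPOISON, not the actual kill_me_maybe() logic.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#ifndef EHWPOISON
#define EHWPOISON 133	/* fallback for libcs that do not expose it */
#endif

#define SKETCH_NPAGES 16	/* hypothetical tiny "physical memory" */

static bool sketch_poisoned[SKETCH_NPAGES];
static int sketch_sigbus_sent;	/* counts signals the caller chose to send */

/* Stand-in for TestSetPageHWPoison(): set the flag, return its old value. */
static bool sketch_test_set_poison(unsigned long pfn)
{
	bool old = sketch_poisoned[pfn];

	sketch_poisoned[pfn] = true;
	return old;
}

/* Models only the branch touched by the patch. */
static int sketch_memory_failure(unsigned long pfn)
{
	assert(pfn < SKETCH_NPAGES);

	if (sketch_test_set_poison(pfn))
		return -EHWPOISON;	/* before the patch: 0 */

	return 0;	/* assumption: the first report is handled successfully */
}

/* Hypothetical caller: with a distinct error code it can tell the cases apart. */
static void sketch_handle_mce(unsigned long pfn)
{
	int ret = sketch_memory_failure(pfn);

	if (ret == -EHWPOISON)
		sketch_sigbus_sent++;	/* stand-in for delivering SIGBUS */
}
```

Calling sketch_memory_failure() twice on the same pfn returns 0 the first time and -EHWPOISON the second. With the old `return 0` the two calls would be indistinguishable to the caller, which is the situation the commit message describes as leading to the MCE loop.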