[v3,1/2] mm, hwpoison: Try to recover from copy-on-write faults

Message ID: 20221021200120.175753-2-tony.luck@intel.com
State: New
Series: Copy-on-write poison recovery

Commit Message

Tony Luck Oct. 21, 2022, 8:01 p.m. UTC
If the kernel is copying a page as the result of a copy-on-write
fault and runs into an uncorrectable error, Linux will crash because
it does not have recovery code for this case where poison is consumed
by the kernel.

It is easy to set up a test case. Just inject an error into a private
page, fork(2), and have the child process write to the page.

I wrapped that neatly into a test at:

  git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git

just enable ACPI error injection and run:

  # ./einj_mem-uc -f copy-on-write
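
For readers who want to see only the copy-on-write half of that recipe, here is a minimal userspace sketch. It is not taken from ras-tools and does not inject any error; the function name cow_write_in_child() is made up for illustration. It maps a private anonymous page, forks, and lets the child store to the page, which is the access that makes the kernel copy the page.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: map one private page, fork, and write to the page
 * in the child. Returns 0 if the child's write succeeded and the parent's
 * copy was left untouched, -1 on setup failure or if the child was killed
 * by a signal (for example SIGBUS).
 */
static int cow_write_in_child(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char *page;
	int status, parent_ok;
	pid_t pid;

	if (page_size <= 0)
		return -1;

	page = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		return -1;

	/* Touch the page so it is populated before fork(). */
	memset(page, 0xaa, (size_t)page_size);

	/* A real test would inject the memory error here (not shown). */

	pid = fork();
	if (pid < 0) {
		munmap(page, (size_t)page_size);
		return -1;
	}
	if (pid == 0) {
		/* Child: this store takes the copy-on-write fault. */
		*(volatile unsigned char *)page = 0x55;
		_exit(page[0] == 0x55 ? 0 : 1);
	}

	if (waitpid(pid, &status, 0) < 0) {
		munmap(page, (size_t)page_size);
		return -1;
	}

	/* The parent must still see its own data after the child's write. */
	parent_ok = (page[0] == 0xaa);
	munmap(page, (size_t)page_size);

	if (!WIFEXITED(status) || !parent_ok)
		return -1;
	return WEXITSTATUS(status);
}
```

On a healthy page the helper returns 0. With poison in the source page and this patch applied, the expectation from the description above is that the child is killed by SIGBUS instead, which the helper reports as -1.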

Add a new copy_mc_user_highpage() function that uses copy_mc_to_kernel()
on architectures where that is available (currently x86 and powerpc).
When an error is detected during the page copy, return VM_FAULT_HWPOISON
to the caller of wp_page_copy(). This propagates up the call stack. Both
x86 and powerpc have code in their fault handlers to deal with this return
code by sending a SIGBUS to the application.
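
The return-code convention described above can be modelled in a few lines of standalone C. This is only a sketch of the mapping, not kernel code: every identifier prefixed with model_ or MODEL_ is invented for the example, the flag value is local to the model, and the real __wp_page_copy_user() reaches its -EAGAIN cases through PTE checks that are not reproduced here.

```c
#include <assert.h>
#include <errno.h>

#ifndef EHWPOISON
#define EHWPOISON 133	/* fallback for systems whose errno.h lacks it */
#endif

/* Model-local stand-in for the kernel's VM_FAULT_HWPOISON flag. */
#define MODEL_VM_FAULT_HWPOISON 0x10u

/* Hypothetical outcomes of the simulated page copy. */
enum model_copy_outcome {
	MODEL_COPY_OK,		/* whole page copied */
	MODEL_COPY_POISON,	/* machine check while reading the source */
	MODEL_COPY_PTE_CHANGED,	/* PTE changed under us, fault will retry */
};

/* Same convention as the patched helper: 0, -EHWPOISON or -EAGAIN. */
static int model_wp_page_copy_user(enum model_copy_outcome outcome)
{
	switch (outcome) {
	case MODEL_COPY_POISON:
		return -EHWPOISON;
	case MODEL_COPY_PTE_CHANGED:
		return -EAGAIN;
	default:
		return 0;
	}
}

/* Only the poison case is turned into a fault code for the caller. */
static unsigned int model_wp_page_copy(enum model_copy_outcome outcome)
{
	int ret = model_wp_page_copy_user(outcome);

	if (ret)
		return ret == -EHWPOISON ? MODEL_VM_FAULT_HWPOISON : 0;

	return 0;	/* the real function goes on to install the new page */
}
```

The point of the three-way return value is that a plain bool could not distinguish "retry the fault" from "the source page is poisoned, do not retry".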

Note that this patch avoids a system crash and signals the process that
triggered the copy-on-write action. It does not take any action for the
memory error that is still in the shared page. To handle that, a call to
memory_failure() is needed. But this cannot be done from wp_page_copy()
because it runs with mmap_lock held. Perhaps the architecture fault
handlers can deal with this loose end in a subsequent patch?

On Intel/x86 this loose end will often be handled automatically because
the memory controller provides an additional notification of the h/w
poison in memory, and the handler for that notification will call
memory_failure(). This isn't a 100% solution. If there are multiple
errors, not all may be logged in this way.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>

---
Changes in V3:
    Dan Williams
	Rename copy_user_highpage_mc() to copy_mc_user_highpage() for
	consistency with Linus' discussion on names of functions that
	check for machine check.
	Write complete functions for the have/have-not copy_mc_to_kernel
	cases (so grep shows there are two versions)
	Change __wp_page_copy_user() to return 0 for success, negative for fail
	[I picked -EAGAIN for both non-EHWPOISON cases]

Changes in V2:
   Naoya Horiguchi:
	1) Use -EHWPOISON error code instead of minus one.
	2) Poison path needs also to deal with old_page
   Tony Luck:
	Rewrote commit message
	Added some powerpc folks to Cc: list
---
 include/linux/highmem.h | 24 ++++++++++++++++++++++++
 mm/memory.c             | 30 ++++++++++++++++++++----------
 2 files changed, 44 insertions(+), 10 deletions(-)

Comments

HORIGUCHI NAOYA(堀口 直也) Oct. 25, 2022, 5:46 a.m. UTC | #1
On Fri, Oct 21, 2022 at 01:01:19PM -0700, Tony Luck wrote:
> If the kernel is copying a page as the result of a copy-on-write
> fault and runs into an uncorrectable error, Linux will crash because
> it does not have recovery code for this case where poison is consumed
> by the kernel.
> 
> It is easy to set up a test case. Just inject an error into a private
> page, fork(2), and have the child process write to the page.
> 
> I wrapped that neatly into a test at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
> 
> just enable ACPI error injection and run:
> 
>   # ./einj_mem-uc -f copy-on-write
> 
> Add a new copy_mc_user_highpage() function that uses copy_mc_to_kernel()
> on architectures where that is available (currently x86 and powerpc).
> When an error is detected during the page copy, return VM_FAULT_HWPOISON
> to caller of wp_page_copy(). This propagates up the call stack. Both x86
> and powerpc have code in their fault handler to deal with this code by
> sending a SIGBUS to the application.
> 
> Note that this patch avoids a system crash and signals the process that
> triggered the copy-on-write action. It does not take any action for the
> memory error that is still in the shared page. To handle that a call to
> memory_failure() is needed. But this cannot be done from wp_page_copy()
> because it holds mmap_lock(). Perhaps the architecture fault handlers
> can deal with this loose end in a subsequent patch?
> 
> On Intel/x86 this loose end will often be handled automatically because
> the memory controller provides an additional notification of the h/w
> poison in memory, the handler for this will call memory_failure(). This
> isn't a 100% solution. If there are multiple errors, not all may be
> logged in this way.
> 
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>

Thank you for the update. Looks good to me.

Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Miaohe Lin Oct. 28, 2022, 2:11 a.m. UTC | #2
On 2022/10/22 4:01, Tony Luck wrote:
> If the kernel is copying a page as the result of a copy-on-write
> fault and runs into an uncorrectable error, Linux will crash because
> it does not have recovery code for this case where poison is consumed
> by the kernel.
> 
> It is easy to set up a test case. Just inject an error into a private
> page, fork(2), and have the child process write to the page.
> 
> I wrapped that neatly into a test at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
> 
> just enable ACPI error injection and run:
> 
>   # ./einj_mem-uc -f copy-on-write
> 
> Add a new copy_mc_user_highpage() function that uses copy_mc_to_kernel()
> on architectures where that is available (currently x86 and powerpc).
> When an error is detected during the page copy, return VM_FAULT_HWPOISON
> to caller of wp_page_copy(). This propagates up the call stack. Both x86
> and powerpc have code in their fault handler to deal with this code by
> sending a SIGBUS to the application.
> 
> Note that this patch avoids a system crash and signals the process that
> triggered the copy-on-write action. It does not take any action for the
> memory error that is still in the shared page. To handle that a call to
> memory_failure() is needed. But this cannot be done from wp_page_copy()
> because it holds mmap_lock(). Perhaps the architecture fault handlers
> can deal with this loose end in a subsequent patch?
> 
> On Intel/x86 this loose end will often be handled automatically because
> the memory controller provides an additional notification of the h/w
> poison in memory, the handler for this will call memory_failure(). This
> isn't a 100% solution. If there are multiple errors, not all may be
> logged in this way.
> 
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>

Thanks for your work, Tony.

> 
> ---
> Changes in V3:
>     Dan Williams
> 	Rename copy_user_highpage_mc() to copy_mc_user_highpage() for
> 	consistency with Linus' discussion on names of functions that
> 	check for machine check.
> 	Write complete functions for the have/have-not copy_mc_to_kernel
> 	cases (so grep shows there are two versions)
> 	Change __wp_page_copy_user() to return 0 for success, negative for fail
> 	[I picked -EAGAIN for both non-EHWPOISON cases]
> 
> Changes in V2:
>    Naoya Horiguchi:
> 	1) Use -EHWPOISON error code instead of minus one.
> 	2) Poison path needs also to deal with old_page
>    Tony Luck:
> 	Rewrote commit message
> 	Added some powerpc folks to Cc: list
> ---
>  include/linux/highmem.h | 24 ++++++++++++++++++++++++
>  mm/memory.c             | 30 ++++++++++++++++++++----------
>  2 files changed, 44 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index e9912da5441b..a32c64681f03 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -319,6 +319,30 @@ static inline void copy_user_highpage(struct page *to, struct page *from,
>  
>  #endif
>  
> +#ifdef copy_mc_to_kernel
> +static inline int copy_mc_user_highpage(struct page *to, struct page *from,
> +					unsigned long vaddr, struct vm_area_struct *vma)
> +{
> +	unsigned long ret;
> +	char *vfrom, *vto;
> +
> +	vfrom = kmap_local_page(from);
> +	vto = kmap_local_page(to);
> +	ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);

In copy_user_highpage(), kmsan_unpoison_memory(page_address(to), PAGE_SIZE)
is called after the copy when __HAVE_ARCH_COPY_USER_HIGHPAGE isn't defined.
Do we need to do something similar here? I'm not familiar with KMSAN,
though, so I could easily be wrong.

Anyway, this patch looks good to me. Thanks.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks,
Miaohe Lin
Tony Luck Oct. 28, 2022, 4:09 p.m. UTC | #3
>> +	vfrom = kmap_local_page(from);
>> +	vto = kmap_local_page(to);
>> +	ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
>
> In copy_user_highpage(), kmsan_unpoison_memory(page_address(to), PAGE_SIZE) is done after the copy when
> __HAVE_ARCH_COPY_USER_HIGHPAGE isn't defined. Do we need to do something similar here? But I'm not familiar
> with kmsan, so I can easy be wrong.

It looks like that kmsan_unpoison_memory() call was added recently, after I copied
copy_user_highpage() to create copy_mc_user_highpage(). I'm not familiar with
kmsan either. Adding Alexander to this thread since they added that code.

> Anyway, this patch looks good to me. Thanks.
>
> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks for the review.

-Tony
Alexander Potapenko Nov. 2, 2022, 2:27 p.m. UTC | #4
On Fri, Oct 28, 2022 at 6:14 PM Luck, Tony <tony.luck@intel.com> wrote:
>
> >> +    vfrom = kmap_local_page(from);
> >> +    vto = kmap_local_page(to);
> >> +    ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
> >
> > In copy_user_highpage(), kmsan_unpoison_memory(page_address(to), PAGE_SIZE) is done after the copy when
> > __HAVE_ARCH_COPY_USER_HIGHPAGE isn't defined. Do we need to do something similar here? But I'm not familiar
> > with kmsan, so I can easy be wrong.
>
> It looks like that kmsan_unpoison_memory() call was added recently, after I copied
> copy_user_highpage() to create copy_mc_user_highpage(). I'm not familiar with
> kmsan either. Adding Alexander to this thread since they added that code.
>

Given that copy_mc_user_highpage() replaces one of the calls to
copy_user_highpage(), it sure makes sense to call
kmsan_unpoison_memory() here.

KMSAN tracks the status (initialized/uninitialized) of the kernel
memory. Newly allocated memory is marked uninitialized, copying memory
preserves its status, and writing constants to that memory makes it
initialized.
Userspace memory does not have its status tracked by KMSAN, so when
values are copied from the userspace, KMSAN does nothing with their
status.
That's why every (successful) copy_from_user event should be followed
by kmsan_unpoison_memory(), which marks the corresponding kernel
buffer initialized - otherwise the status of that buffer may get
stale.
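
As a rough illustration of that rule, the toy model below tracks a per-byte "initialized" flag next to the data and marks the destination initialized only when the whole copy succeeded. Everything here is hypothetical and lives in userspace: toy_unpoison() stands in for kmsan_unpoison_memory(), toy_copy_mc() stands in for copy_mc_to_kernel() (it likewise returns the number of bytes not copied), and the fail_at parameter exists only to simulate hitting poison partway through.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 16

/* Toy page: data plus a KMSAN-like per-byte status. */
struct toy_page {
	unsigned char data[TOY_PAGE_SIZE];
	bool initialized[TOY_PAGE_SIZE];
};

/* Stand-in for kmsan_unpoison_memory(): mark the bytes initialized. */
static void toy_unpoison(struct toy_page *p, size_t len)
{
	for (size_t i = 0; i < len && i < TOY_PAGE_SIZE; i++)
		p->initialized[i] = true;
}

/*
 * Stand-in for copy_mc_to_kernel(): returns the number of bytes NOT
 * copied. A fail_at below len simulates poison at that offset.
 */
static size_t toy_copy_mc(unsigned char *dst, const unsigned char *src,
			  size_t len, size_t fail_at)
{
	size_t n = fail_at < len ? fail_at : len;

	memcpy(dst, src, n);
	return len - n;
}

/* Copy a page and update the status only after a complete copy. */
static int toy_copy_mc_page(struct toy_page *to, const struct toy_page *from,
			    size_t fail_at)
{
	size_t rem = toy_copy_mc(to->data, from->data, TOY_PAGE_SIZE, fail_at);

	if (!rem)
		toy_unpoison(to, TOY_PAGE_SIZE);

	return (int)rem;	/* 0 on success, non-zero on failure */
}
```

In the patch under review, the analogous change would be a kmsan_unpoison_memory(page_address(to), PAGE_SIZE) call guarded by a successful return from copy_mc_to_kernel(); whether it should also run on a partial copy is not settled by this model.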

> > Anyway, this patch looks good to me. Thanks.
> >
> > Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
>
> Thanks for the review.
>
> -Tony
Alexander Potapenko Nov. 2, 2022, 2:30 p.m. UTC | #5
On Wed, Nov 2, 2022 at 3:27 PM Alexander Potapenko <glider@google.com> wrote:
>
> On Fri, Oct 28, 2022 at 6:14 PM Luck, Tony <tony.luck@intel.com> wrote:
> >
> > >> +    vfrom = kmap_local_page(from);
> > >> +    vto = kmap_local_page(to);
> > >> +    ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
> > >
> > > In copy_user_highpage(), kmsan_unpoison_memory(page_address(to), PAGE_SIZE) is done after the copy when
> > > __HAVE_ARCH_COPY_USER_HIGHPAGE isn't defined. Do we need to do something similar here? But I'm not familiar
> > > with kmsan, so I can easy be wrong.
> >
> > It looks like that kmsan_unpoison_memory() call was added recently, after I copied
> > copy_user_highpage() to create copy_mc_user_highpage(). I'm not familiar with
> > kmsan either. Adding Alexander to this thread since they added that code.
> >
>
> Given that copy_mc_user_highpage() replaces one of the calls to
> copy_user_highpage(), it sure makes sense to call
> kmsan_unpoison_memory() here.
>
> KMSAN tracks the status (initialized/uninitialized) of the kernel
> memory. Newly allocated memory is marked uninitialized, copying memory
> preserves its status, and writing constants to that memory makes it
> initialized.
> Userspace memory does not have its status tracked by KMSAN, so when
> values are copied from the userspace, KMSAN does nothing with their
> status.
> That's why every (successful) copy_from_user event should be followed
> by kmsan_unpoison_memory(), which marks the corresponding kernel
> buffer initialized - otherwise the status of that buffer may get
> stale.
>
> > > Anyway, this patch looks good to me. Thanks.
> > >
> > > Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Alexander Potapenko <glider@google.com>

> >
> > Thanks for the review.
> >
> > -Tony

Patch

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index e9912da5441b..a32c64681f03 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -319,6 +319,30 @@  static inline void copy_user_highpage(struct page *to, struct page *from,
 
 #endif
 
+#ifdef copy_mc_to_kernel
+static inline int copy_mc_user_highpage(struct page *to, struct page *from,
+					unsigned long vaddr, struct vm_area_struct *vma)
+{
+	unsigned long ret;
+	char *vfrom, *vto;
+
+	vfrom = kmap_local_page(from);
+	vto = kmap_local_page(to);
+	ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
+	kunmap_local(vto);
+	kunmap_local(vfrom);
+
+	return ret;
+}
+#else
+static inline int copy_mc_user_highpage(struct page *to, struct page *from,
+					unsigned long vaddr, struct vm_area_struct *vma)
+{
+	copy_user_highpage(to, from, vaddr, vma);
+	return 0;
+}
+#endif
+
 #ifndef __HAVE_ARCH_COPY_HIGHPAGE
 
 static inline void copy_highpage(struct page *to, struct page *from)
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..b6056eef2f72 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2848,10 +2848,16 @@  static inline int pte_unmap_same(struct vm_fault *vmf)
 	return same;
 }
 
-static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
-				       struct vm_fault *vmf)
+/*
+ * Return:
+ *	0:		copy succeeded
+ *	-EHWPOISON:	copy failed due to hwpoison in source page
+ *	-EAGAIN:	copy failed (some other reason)
+ */
+static inline int __wp_page_copy_user(struct page *dst, struct page *src,
+				      struct vm_fault *vmf)
 {
-	bool ret;
+	int ret;
 	void *kaddr;
 	void __user *uaddr;
 	bool locked = false;
@@ -2860,8 +2866,9 @@  static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
 	unsigned long addr = vmf->address;
 
 	if (likely(src)) {
-		copy_user_highpage(dst, src, addr, vma);
-		return true;
+		if (copy_mc_user_highpage(dst, src, addr, vma))
+			return -EHWPOISON;
+		return 0;
 	}
 
 	/*
@@ -2888,7 +2895,7 @@  static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
 			 * and update local tlb only
 			 */
 			update_mmu_tlb(vma, addr, vmf->pte);
-			ret = false;
+			ret = -EAGAIN;
 			goto pte_unlock;
 		}
 
@@ -2913,7 +2920,7 @@  static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
 		if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
 			/* The PTE changed under us, update local tlb */
 			update_mmu_tlb(vma, addr, vmf->pte);
-			ret = false;
+			ret = -EAGAIN;
 			goto pte_unlock;
 		}
 
@@ -2932,7 +2939,7 @@  static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
 		}
 	}
 
-	ret = true;
+	ret = 0;
 
 pte_unlock:
 	if (locked)
@@ -3104,6 +3111,7 @@  static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	pte_t entry;
 	int page_copied = 0;
 	struct mmu_notifier_range range;
+	int ret;
 
 	delayacct_wpcopy_start();
 
@@ -3121,19 +3129,21 @@  static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		if (!new_page)
 			goto oom;
 
-		if (!__wp_page_copy_user(new_page, old_page, vmf)) {
+		ret = __wp_page_copy_user(new_page, old_page, vmf);
+		if (ret) {
 			/*
 			 * COW failed, if the fault was solved by other,
 			 * it's fine. If not, userspace would re-fault on
 			 * the same address and we will handle the fault
 			 * from the second attempt.
+			 * The -EHWPOISON case will not be retried.
 			 */
 			put_page(new_page);
 			if (old_page)
 				put_page(old_page);
 
 			delayacct_wpcopy_end();
-			return 0;
+			return ret == -EHWPOISON ? VM_FAULT_HWPOISON : 0;
 		}
 		kmsan_copy_page_meta(new_page, old_page);
 	}