diff mbox series

[v1] mm, hwpoison: enable error handling on shmem thp

Message ID 20210209062128.453814-1-nao.horiguchi@gmail.com (mailing list archive)
State New, archived
Headers show
Series [v1] mm, hwpoison: enable error handling on shmem thp | expand

Commit Message

Naoya Horiguchi Feb. 9, 2021, 6:21 a.m. UTC
From: Naoya Horiguchi <naoya.horiguchi@nec.com>

Currently hwpoison code checks PageAnon() for thp and refuses to handle
errors on non-anonymous thps (just for historical reason).  We now
support non-anonymou thp like shmem one, so this patch suggests to enable
to handle shmem thps. Fortunately, we already have can_split_huge_page()
to check if a give thp is splittable, so this patch relies on it.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 mm/memory-failure.c | 34 +++++++++-------------------------
 1 file changed, 9 insertions(+), 25 deletions(-)

Comments

Andrew Morton Feb. 9, 2021, 7:46 p.m. UTC | #1
On Tue,  9 Feb 2021 15:21:28 +0900 Naoya Horiguchi <nao.horiguchi@gmail.com> wrote:

> Currently hwpoison code checks PageAnon() for thp and refuses to handle
> errors on non-anonymous thps (just for historical reason).  We now
> support non-anonymou thp like shmem one, so this patch suggests to enable
> to handle shmem thps. Fortunately, we already have can_split_huge_page()
> to check if a give thp is splittable, so this patch relies on it.

Thanks.  We're at -rc7 so I think I'll park this one for 5.12-rc1,
unless it is more urgent than I believe it to be?
Naoya Horiguchi Feb. 9, 2021, 11:51 p.m. UTC | #2
On Tue, Feb 09, 2021 at 11:46:40AM -0800, Andrew Morton wrote:
> On Tue,  9 Feb 2021 15:21:28 +0900 Naoya Horiguchi <nao.horiguchi@gmail.com> wrote:
> 
> > Currently hwpoison code checks PageAnon() for thp and refuses to handle
> > errors on non-anonymous thps (just for historical reason).  We now
> > support non-anonymou thp like shmem one, so this patch suggests to enable
> > to handle shmem thps. Fortunately, we already have can_split_huge_page()
> > to check if a give thp is splittable, so this patch relies on it.
> 
> Thanks.  We're at -rc7 so I think I'll park this one for 5.12-rc1,
> unless it is more urgent than I believe it to be?

This patch is not so urgent, so I'm fine to make this targeted at 5.12-rc1.

Thanks,
Naoya Horiguchi
Oscar Salvador Feb. 11, 2021, 8:06 a.m. UTC | #3
On Tue, Feb 09, 2021 at 03:21:28PM +0900, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> Currently hwpoison code checks PageAnon() for thp and refuses to handle
> errors on non-anonymous thps (just for historical reason).  We now
> support non-anonymou thp like shmem one, so this patch suggests to enable
> to handle shmem thps. Fortunately, we already have can_split_huge_page()
> to check if a give thp is splittable, so this patch relies on it.
> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

Looks good to me, thanks Naoya:

Reviewed-by: Oscar Salvador <osalvador@suse.de>
Hugh Dickins March 11, 2021, 7:22 a.m. UTC | #4
On Tue, 9 Feb 2021, Naoya Horiguchi wrote:

> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> Currently hwpoison code checks PageAnon() for thp and refuses to handle
> errors on non-anonymous thps (just for historical reason).  We now
> support non-anonymou thp like shmem one, so this patch suggests to enable
> to handle shmem thps. Fortunately, we already have can_split_huge_page()
> to check if a give thp is splittable, so this patch relies on it.

Fortunately? I don't understand. Why call can_split_huge_page()
at all, instead of simply trying split_huge_page() directly?
And could it do better than -EBUSY when split_huge_page() fails?

> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

Thanks for trying to add shmem+file THP support, but I think this
does not work as intended - Andrew, if Naoya agrees, please drop from
mmotm for now, the fixup needed will be more than a line or two.

I'm not much into memory-failure myself, but Jue discovered that the
SIGBUS never arrives: because split_huge_page() on a shmem or file
THP unmaps all its pmds and ptes, and (unlike with anon) leaves them
unmapped - in normal circumstances, to be faulted back on demand.
So the page_mapped() check in hwpoison_user_mappings() fails,
and the intended SIGBUS is not delivered.

(Or, is it acceptable that the SIGBUS is not delivered to those who
have the huge page mapped: should it get delivered later, to anyone
who faults back in the bad 4k?)

We believe the tokill list has to be set up earlier, before
split_huge_page() is called, then passed in to hwpoison_user_mappings().

Sorry, we don't have a proper patch for that right now, but I expect
you can see what needs to be done.  But something we found on the way,
we do have a patch for: add_to_kill() uses page_address_in_vma(), but
that has not been used on file THP tails before - fix appended at the
end below, so as not to waste your time on that bit.

> ---
>  mm/memory-failure.c | 34 +++++++++-------------------------
>  1 file changed, 9 insertions(+), 25 deletions(-)
> 
> diff --git v5.11-rc6/mm/memory-failure.c v5.11-rc6_patched/mm/memory-failure.c
> index e9481632fcd1..8c1575de0507 100644
> --- v5.11-rc6/mm/memory-failure.c
> +++ v5.11-rc6_patched/mm/memory-failure.c
> @@ -956,20 +956,6 @@ static int __get_hwpoison_page(struct page *page)
>  {
>  	struct page *head = compound_head(page);
>  
> -	if (!PageHuge(head) && PageTransHuge(head)) {
> -		/*
> -		 * Non anonymous thp exists only in allocation/free time. We
> -		 * can't handle such a case correctly, so let's give it up.
> -		 * This should be better than triggering BUG_ON when kernel
> -		 * tries to touch the "partially handled" page.
> -		 */
> -		if (!PageAnon(head)) {
> -			pr_err("Memory failure: %#lx: non anonymous thp\n",
> -				page_to_pfn(page));
> -			return 0;
> -		}
> -	}
> -
>  	if (get_page_unless_zero(head)) {
>  		if (head == compound_head(page))
>  			return 1;
> @@ -1197,21 +1183,19 @@ static int identify_page_state(unsigned long pfn, struct page *p,
>  
>  static int try_to_split_thp_page(struct page *page, const char *msg)
>  {
> -	lock_page(page);
> -	if (!PageAnon(page) || unlikely(split_huge_page(page))) {
> -		unsigned long pfn = page_to_pfn(page);
> +	struct page *head;
>  
> +	lock_page(page);
> +	head = compound_head(page);
> +	if (PageTransHuge(head) && can_split_huge_page(head, NULL) &&
> +	    !split_huge_page(page)) {
>  		unlock_page(page);
> -		if (!PageAnon(page))
> -			pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
> -		else
> -			pr_info("%s: %#lx: thp split failed\n", msg, pfn);
> -		put_page(page);
> -		return -EBUSY;
> +		return 0;
>  	}
> +	pr_info("%s: %#lx: thp split failed\n", msg, page_to_pfn(page));
>  	unlock_page(page);
> -
> -	return 0;
> +	put_page(page);
> +	return -EBUSY;
>  }
>  
>  static int memory_failure_hugetlb(unsigned long pfn, int flags)
> -- 
> 2.25.1

[PATCH] mm: fix page_address_in_vma() on file THP tails
From: Jue Wang <juew@google.com>

Anon THP tails were already supported, but memory-failure now needs to use
page_address_in_vma() on file THP tails, which its page->mapping check did
not permit: fix it.

Signed-off-by: Jue Wang <juew@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/rmap.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- 5.12-rc2/mm/rmap.c	2021-02-28 16:58:57.950450151 -0800
+++ linux/mm/rmap.c	2021-03-10 20:29:21.591475177 -0800
@@ -717,11 +717,11 @@ unsigned long page_address_in_vma(struct
 		if (!vma->anon_vma || !page__anon_vma ||
 		    vma->anon_vma->root != page__anon_vma->root)
 			return -EFAULT;
-	} else if (page->mapping) {
-		if (!vma->vm_file || vma->vm_file->f_mapping != page->mapping)
-			return -EFAULT;
-	} else
+	} else if (!vma->vm_file) {
+		return -EFAULT;
+	} else if (vma->vm_file->f_mapping != compound_head(page)->mapping) {
 		return -EFAULT;
+	}
 	address = __vma_address(page, vma);
 	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
 		return -EFAULT;
Matthew Wilcox March 11, 2021, 2:45 p.m. UTC | #5
On Wed, Mar 10, 2021 at 11:22:18PM -0800, Hugh Dickins wrote:
> But something we found on the way,
> we do have a patch for: add_to_kill() uses page_address_in_vma(), but
> that has not been used on file THP tails before - fix appended at the
> end below, so as not to waste your time on that bit.

> [PATCH] mm: fix page_address_in_vma() on file THP tails
> From: Jue Wang <juew@google.com>
> 
> Anon THP tails were already supported, but memory-failure now needs to use
> page_address_in_vma() on file THP tails, which its page->mapping check did
> not permit: fix it.
> 
> Signed-off-by: Jue Wang <juew@google.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
> 
>  mm/rmap.c |    8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> --- 5.12-rc2/mm/rmap.c	2021-02-28 16:58:57.950450151 -0800
> +++ linux/mm/rmap.c	2021-03-10 20:29:21.591475177 -0800
> @@ -717,11 +717,11 @@ unsigned long page_address_in_vma(struct
>  		if (!vma->anon_vma || !page__anon_vma ||
>  		    vma->anon_vma->root != page__anon_vma->root)
>  			return -EFAULT;
> -	} else if (page->mapping) {
> -		if (!vma->vm_file || vma->vm_file->f_mapping != page->mapping)
> -			return -EFAULT;
> -	} else
> +	} else if (!vma->vm_file) {
> +		return -EFAULT;
> +	} else if (vma->vm_file->f_mapping != compound_head(page)->mapping) {
>  		return -EFAULT;
> +	}

This is a common bug I'm seeing; people just assume they can dereference
page->mapping.  Below is how I would convert this patch into folios, but
it doesn't remove the problem that people just use page->mapping.

Do we want to set ourselves the goal of making this bug impossible?  That
is, eventually, do something like this ...

struct folio {
	union {
		struct page page;
		struct {
			unsigned long flags;
			struct list_head lru;
			struct address_space *mapping;
			pgoff_t index;
			unsigned long private;
		};
	};
};

... and remove the names 'mapping' & 'index' from struct page.
If so, now would be a good time to prepare for that, so we write
folio->mapping instead of folio->page.mapping.

diff --git a/mm/rmap.c b/mm/rmap.c
index 3142ea1dd071..fcad8c6a417a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -707,9 +707,11 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
  */
 unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 {
+	struct folio *folio = page_folio(page);
 	unsigned long address;
-	if (PageAnon(page)) {
-		struct anon_vma *page__anon_vma = page_anon_vma(page);
+
+	if (FolioAnon(folio)) {
+		struct anon_vma *page__anon_vma = folio_anon_vma(folio);
 		/*
 		 * Note: swapoff's unuse_vma() is more efficient with this
 		 * check, and needs it to match anon_vma when KSM is active.
@@ -717,9 +719,10 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 		if (!vma->anon_vma || !page__anon_vma ||
 		    vma->anon_vma->root != page__anon_vma->root)
 			return -EFAULT;
-	} else if (page->mapping) {
-		if (!vma->vm_file || vma->vm_file->f_mapping != page->mapping)
-			return -EFAULT;
+	} else if (!vma->vm_file) {
+		return -EFAULT;
+	} else if (vma->vm_file->f_mapping != folio->page.mapping) {
+		return -EFAULT;
 	} else
 		return -EFAULT;
 	address = __vma_address(page, vma);
diff --git a/mm/util.c b/mm/util.c
index 9ab72cfa4aa1..91ac4fc90a07 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -635,11 +635,11 @@ void kvfree_sensitive(const void *addr, size_t len)
 }
 EXPORT_SYMBOL(kvfree_sensitive);
 
-static inline void *__page_rmapping(struct page *page)
+static inline void *folio_rmapping(struct folio *folio)
 {
 	unsigned long mapping;
 
-	mapping = (unsigned long)page->mapping;
+	mapping = (unsigned long)folio->page.mapping;
 	mapping &= ~PAGE_MAPPING_FLAGS;
 
 	return (void *)mapping;
@@ -648,8 +648,7 @@ static inline void *__page_rmapping(struct page *page)
 /* Neutral page->mapping pointer to address_space or anon_vma or other */
 void *page_rmapping(struct page *page)
 {
-	page = compound_head(page);
-	return __page_rmapping(page);
+	return folio_rmapping(page_folio(page));
 }
 
 /*
@@ -675,15 +674,12 @@ bool page_mapped(struct page *page)
 }
 EXPORT_SYMBOL(page_mapped);
 
-struct anon_vma *page_anon_vma(struct page *page)
+struct anon_vma *folio_anon_vma(struct folio *folio)
 {
-	unsigned long mapping;
-
-	page = compound_head(page);
-	mapping = (unsigned long)page->mapping;
+	unsigned long mapping = (unsigned long)folio->page.mapping;
 	if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
 		return NULL;
-	return __page_rmapping(page);
+	return folio_rmapping(folio);
 }
 
 struct address_space *folio_mapping(struct folio *folio)
HORIGUCHI NAOYA(堀口 直也) March 11, 2021, 3:14 p.m. UTC | #6
On Wed, Mar 10, 2021 at 11:22:18PM -0800, Hugh Dickins wrote:
> On Tue, 9 Feb 2021, Naoya Horiguchi wrote:
> 
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > Currently hwpoison code checks PageAnon() for thp and refuses to handle
> > errors on non-anonymous thps (just for historical reason).  We now
> > support non-anonymou thp like shmem one, so this patch suggests to enable
> > to handle shmem thps. Fortunately, we already have can_split_huge_page()
> > to check if a give thp is splittable, so this patch relies on it.
> 
> Fortunately? I don't understand. Why call can_split_huge_page()
> at all, instead of simply trying split_huge_page() directly?

The background of this change was that I've experienced triggering
VM_BUG_ON_PAGE() at the beginning of split_huge_page_to_list() in the older
kernel (I forgot specific condition of the BUG_ON).  I thought that that was
caused by race between thp allocation and memory_failure.  So I wanted to
have some rigid way to confirm that a given thp is splittable.  Then I found
can_split_huge_page(), which sounds suitable to me because I expected the API
to be maintained by thp subsystem.

But I rethink that split_huge_page_to_list() seems to have different set of
VM_BUG_ON_PAGE()s now, and anyway split_huge_page_to_list() calls
can_split_huge_page() internally, so I might have wrongly read the code.

> And could it do better than -EBUSY when split_huge_page() fails?

Yes it could.

> 
> > 
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> Thanks for trying to add shmem+file THP support, but I think this
> does not work as intended - Andrew, if Naoya agrees, please drop from
> mmotm for now, the fixup needed will be more than a line or two.

I agree to drop it. I need research more to address the following comments.

> 
> I'm not much into memory-failure myself, but Jue discovered that the
> SIGBUS never arrives: because split_huge_page() on a shmem or file
> THP unmaps all its pmds and ptes, and (unlike with anon) leaves them
> unmapped - in normal circumstances, to be faulted back on demand.
> So the page_mapped() check in hwpoison_user_mappings() fails,
> and the intended SIGBUS is not delivered.

Thanks for the information.  The split behaves quite differently between
for anon thp and for shmem thp.  I saw some unexpected behavior in my
testing, maybe that's due to the difference.

> 
> (Or, is it acceptable that the SIGBUS is not delivered to those who
> have the huge page mapped: should it get delivered later, to anyone
> who faults back in the bad 4k?)

Later access should report error in page fault, so the worst scenario
of consuming corrupted data does not happen, but precautionary signal
does not work so it's not acceptable.

> 
> We believe the tokill list has to be set up earlier, before
> split_huge_page() is called, then passed in to hwpoison_user_mappings().
> 
> Sorry, we don't have a proper patch for that right now, but I expect
> you can see what needs to be done.  But something we found on the way,
> we do have a patch for: add_to_kill() uses page_address_in_vma(), but
> that has not been used on file THP tails before - fix appended at the
> end below, so as not to waste your time on that bit.
> 

Thank you very much, I'll work on top of it.

Thanks,
Naoya Horiguchi

...
> 
> [PATCH] mm: fix page_address_in_vma() on file THP tails
> From: Jue Wang <juew@google.com>
> 
> Anon THP tails were already supported, but memory-failure now needs to use
> page_address_in_vma() on file THP tails, which its page->mapping check did
> not permit: fix it.
> 
> Signed-off-by: Jue Wang <juew@google.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
> 
>  mm/rmap.c |    8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> --- 5.12-rc2/mm/rmap.c	2021-02-28 16:58:57.950450151 -0800
> +++ linux/mm/rmap.c	2021-03-10 20:29:21.591475177 -0800
> @@ -717,11 +717,11 @@ unsigned long page_address_in_vma(struct
>  		if (!vma->anon_vma || !page__anon_vma ||
>  		    vma->anon_vma->root != page__anon_vma->root)
>  			return -EFAULT;
> -	} else if (page->mapping) {
> -		if (!vma->vm_file || vma->vm_file->f_mapping != page->mapping)
> -			return -EFAULT;
> -	} else
> +	} else if (!vma->vm_file) {
> +		return -EFAULT;
> +	} else if (vma->vm_file->f_mapping != compound_head(page)->mapping) {
>  		return -EFAULT;
> +	}
>  	address = __vma_address(page, vma);
>  	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
>  		return -EFAULT;
>
Jue Wang March 11, 2021, 7:32 p.m. UTC | #7
Huge thanks to Hugh for his expertise in shmem thps and forwarding this message!

Hi Naoya,

This is the first time I reply to an email in a Linux upstream thread,
apologies for any technical issues due to my email client. And my
apologies for the state of the whole patch not quite shareable with
upstream due to some kernel differences.

Some fyi comment inline.

Thanks,
-Jue





On Thu, Mar 11, 2021 at 7:14 AM HORIGUCHI NAOYA(堀口 直也)
<naoya.horiguchi@nec.com> wrote:
>
> On Wed, Mar 10, 2021 at 11:22:18PM -0800, Hugh Dickins wrote:
> > On Tue, 9 Feb 2021, Naoya Horiguchi wrote:
> >
> > > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > >
> > > Currently hwpoison code checks PageAnon() for thp and refuses to handle
> > > errors on non-anonymous thps (just for historical reason).  We now
> > > support non-anonymou thp like shmem one, so this patch suggests to enable
> > > to handle shmem thps. Fortunately, we already have can_split_huge_page()
> > > to check if a give thp is splittable, so this patch relies on it.
> >
> > Fortunately? I don't understand. Why call can_split_huge_page()
> > at all, instead of simply trying split_huge_page() directly?
>
> The background of this change was that I've experienced triggering
> VM_BUG_ON_PAGE() at the beginning of split_huge_page_to_list() in the older
> kernel (I forgot specific condition of the BUG_ON).  I thought that that was
> caused by race between thp allocation and memory_failure.  So I wanted to
> have some rigid way to confirm that a given thp is splittable.  Then I found
> can_split_huge_page(), which sounds suitable to me because I expected the API
> to be maintained by thp subsystem.
>
> But I rethink that split_huge_page_to_list() seems to have different set of
> VM_BUG_ON_PAGE()s now, and anyway split_huge_page_to_list() calls
> can_split_huge_page() internally, so I might have wrongly read the code.
>
> > And could it do better than -EBUSY when split_huge_page() fails?
>
> Yes it could.
>
> >
> > >
> > > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> >
> > Thanks for trying to add shmem+file THP support, but I think this
> > does not work as intended - Andrew, if Naoya agrees, please drop from
> > mmotm for now, the fixup needed will be more than a line or two.
>
> I agree to drop it. I need research more to address the following comments.
>
> >
> > I'm not much into memory-failure myself, but Jue discovered that the
> > SIGBUS never arrives: because split_huge_page() on a shmem or file
> > THP unmaps all its pmds and ptes, and (unlike with anon) leaves them
> > unmapped - in normal circumstances, to be faulted back on demand.
> > So the page_mapped() check in hwpoison_user_mappings() fails,
> > and the intended SIGBUS is not delivered.
>
> Thanks for the information.  The split behaves quite differently between
> for anon thp and for shmem thp.  I saw some unexpected behavior in my
> testing, maybe that's due to the difference.
>
> >
> > (Or, is it acceptable that the SIGBUS is not delivered to those who
> > have the huge page mapped: should it get delivered later, to anyone
> > who faults back in the bad 4k?)
>
> Later access should report error in page fault, so the worst scenario
> of consuming corrupted data does not happen, but precautionary signal
> does not work so it's not acceptable.
In our experiment with SHMEM THPs, later accesses resulted in a zero
page allocated instead of a SIGBUS with BUS_MCEERR_AR reported by the
page fault handler. That part might be an opportunity to prevent some
silent data corruption just in case.
>
> >
> > We believe the tokill list has to be set up earlier, before
> > split_huge_page() is called, then passed in to hwpoison_user_mappings().
> >
> > Sorry, we don't have a proper patch for that right now, but I expect
> > you can see what needs to be done.  But something we found on the way,
> > we do have a patch for: add_to_kill() uses page_address_in_vma(), but
> > that has not been used on file THP tails before - fix appended at the
> > end below, so as not to waste your time on that bit.
> >
>
> Thank you very much, I'll work on top of it.
>
> Thanks,
> Naoya Horiguchi
>
> ...
> >
> > [PATCH] mm: fix page_address_in_vma() on file THP tails
> > From: Jue Wang <juew@google.com>
> >
> > Anon THP tails were already supported, but memory-failure now needs to use
> > page_address_in_vma() on file THP tails, which its page->mapping check did
> > not permit: fix it.
> >
> > Signed-off-by: Jue Wang <juew@google.com>
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > ---
> >
> >  mm/rmap.c |    8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > --- 5.12-rc2/mm/rmap.c        2021-02-28 16:58:57.950450151 -0800
> > +++ linux/mm/rmap.c   2021-03-10 20:29:21.591475177 -0800
> > @@ -717,11 +717,11 @@ unsigned long page_address_in_vma(struct
> >               if (!vma->anon_vma || !page__anon_vma ||
> >                   vma->anon_vma->root != page__anon_vma->root)
> >                       return -EFAULT;
> > -     } else if (page->mapping) {
> > -             if (!vma->vm_file || vma->vm_file->f_mapping != page->mapping)
> > -                     return -EFAULT;
> > -     } else
> > +     } else if (!vma->vm_file) {
> > +             return -EFAULT;
> > +     } else if (vma->vm_file->f_mapping != compound_head(page)->mapping) {
> >               return -EFAULT;
> > +     }
> >       address = __vma_address(page, vma);
> >       if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> >               return -EFAULT;
> >
Hugh Dickins March 11, 2021, 10 p.m. UTC | #8
On Thu, 11 Mar 2021, Jue Wang wrote:
> On Thu, Mar 11, 2021 at 7:14 AM HORIGUCHI NAOYA(堀口 直也)
> <naoya.horiguchi@nec.com> wrote:
> > On Wed, Mar 10, 2021 at 11:22:18PM -0800, Hugh Dickins wrote:
> > >
> > > I'm not much into memory-failure myself, but Jue discovered that the
> > > SIGBUS never arrives: because split_huge_page() on a shmem or file
> > > THP unmaps all its pmds and ptes, and (unlike with anon) leaves them
> > > unmapped - in normal circumstances, to be faulted back on demand.
> > > So the page_mapped() check in hwpoison_user_mappings() fails,
> > > and the intended SIGBUS is not delivered.
> >
> > Thanks for the information.  The split behaves quite differently between
> > for anon thp and for shmem thp.  I saw some unexpected behavior in my
> > testing, maybe that's due to the difference.
> >
> > >
> > > (Or, is it acceptable that the SIGBUS is not delivered to those who
> > > have the huge page mapped: should it get delivered later, to anyone
> > > who faults back in the bad 4k?)
> >
> > Later access should report error in page fault, so the worst scenario
> > of consuming corrupted data does not happen, but precautionary signal
> > does not work so it's not acceptable.

On the other hand, if split_huge_page() does succeed, then there is an
argument that it would be better not to SIGBUS all mappers of parts of
the THP, but wait to select only those re-accessing the one bad 4k.

> In our experiment with SHMEM THPs, later accesses resulted in a zero
> page allocated instead of a SIGBUS with BUS_MCEERR_AR reported by the
> page fault handler. That part might be an opportunity to prevent some
> silent data corruption just in case.

Thanks for filling in more detail, Jue: I understand better now.

Maybe mm/shmem.c is wrong to be using generic_error_remove_page(),
the function which punches a hole on memory-failure.

That works well for filesystems backed by storage (at least when the
page had not been modified), because it does not (I think) actually
punch a hole in the stored object; and the next touch at that offset of
the file will allocate a new cache page to be filled from good storage.

But in the case of shmem (if we ignore the less likely swap cache case)
there is no storage to read back good data from, so the next touch just
fills a new cache page with zeroes (as you report above).

I don't know enough of the philosophy of memory-failure to say, but
I can see there's an argument for leaving the bad page in cache, to
give SIGBUS or EFAULT or EIO (whether by observation of PageHWPoison,
or by another MCE) to whoever accesses it - until the file or that
part of it is deleted (then that page never returned to use again).

Hugh
Matthew Wilcox March 18, 2021, 6:33 p.m. UTC | #9
On Thu, Mar 11, 2021 at 02:00:40PM -0800, Hugh Dickins wrote:
> On Thu, 11 Mar 2021, Jue Wang wrote:
> > In our experiment with SHMEM THPs, later accesses resulted in a zero
> > page allocated instead of a SIGBUS with BUS_MCEERR_AR reported by the
> > page fault handler. That part might be an opportunity to prevent some
> > silent data corruption just in case.
> 
> Thanks for filling in more detail, Jue: I understand better now.
> 
> Maybe mm/shmem.c is wrong to be using generic_error_remove_page(),
> the function which punches a hole on memory-failure.
> 
> That works well for filesystems backed by storage (at least when the
> page had not been modified), because it does not (I think) actually
> punch a hole in the stored object; and the next touch at that offset of
> the file will allocate a new cache page to be filled from good storage.
> 
> But in the case of shmem (if we ignore the less likely swap cache case)
> there is no storage to read back good data from, so the next touch just
> fills a new cache page with zeroes (as you report above).
> 
> I don't know enough of the philosophy of memory-failure to say, but
> I can see there's an argument for leaving the bad page in cache, to
> give SIGBUS or EFAULT or EIO (whether by observation of PageHWPoison,
> or by another MCE) to whoever accesses it - until the file or that
> part of it is deleted (then that page never returned to use again).

I think you have it right here.  If the page hasn't been modified since
being read in from swap, it's perfectly fine to discard it (because the
data can be read back from swap if accessed again).  But if it has been
modified, the user has lost data and must be made aware of this.

It seems to me that other filesystems also get this wrong.  If the page
is dirty (eg for ext2/ext4/xfs), we'll silently throw away the user's
data and reload it from storage.  That's spelled "silent data corruption".

So I think this is part of the solution here:

+++ b/mm/truncate.c
@@ -236,6 +236,8 @@ int generic_error_remove_page(struct address_space *mapping, struct page *page)
         */
        if (!S_ISREG(mapping->host->i_mode))
                return -EIO;
+       if (PageDirty(page))
+               return -EBUSY;
        return truncate_inode_page(mapping, page);
 }
 EXPORT_SYMBOL(generic_error_remove_page);

Things I don't know ...

 - Does shmem mark pages dirty in the right circumstances for this to work?
 - How does memory-failure cope with an -EBUSY return from
   ->error_remove_page() today?  It seems to treat all errors the same
   and return MF_FAILED, but I don't know whether that's going to do
   the right thing.
 - Do we correctly decline to evict a HWPoison page from the page cache
   until the page is truncated (hole punched, whatever).
diff mbox series

Patch

diff --git v5.11-rc6/mm/memory-failure.c v5.11-rc6_patched/mm/memory-failure.c
index e9481632fcd1..8c1575de0507 100644
--- v5.11-rc6/mm/memory-failure.c
+++ v5.11-rc6_patched/mm/memory-failure.c
@@ -956,20 +956,6 @@  static int __get_hwpoison_page(struct page *page)
 {
 	struct page *head = compound_head(page);
 
-	if (!PageHuge(head) && PageTransHuge(head)) {
-		/*
-		 * Non anonymous thp exists only in allocation/free time. We
-		 * can't handle such a case correctly, so let's give it up.
-		 * This should be better than triggering BUG_ON when kernel
-		 * tries to touch the "partially handled" page.
-		 */
-		if (!PageAnon(head)) {
-			pr_err("Memory failure: %#lx: non anonymous thp\n",
-				page_to_pfn(page));
-			return 0;
-		}
-	}
-
 	if (get_page_unless_zero(head)) {
 		if (head == compound_head(page))
 			return 1;
@@ -1197,21 +1183,19 @@  static int identify_page_state(unsigned long pfn, struct page *p,
 
 static int try_to_split_thp_page(struct page *page, const char *msg)
 {
-	lock_page(page);
-	if (!PageAnon(page) || unlikely(split_huge_page(page))) {
-		unsigned long pfn = page_to_pfn(page);
+	struct page *head;
 
+	lock_page(page);
+	head = compound_head(page);
+	if (PageTransHuge(head) && can_split_huge_page(head, NULL) &&
+	    !split_huge_page(page)) {
 		unlock_page(page);
-		if (!PageAnon(page))
-			pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
-		else
-			pr_info("%s: %#lx: thp split failed\n", msg, pfn);
-		put_page(page);
-		return -EBUSY;
+		return 0;
 	}
+	pr_info("%s: %#lx: thp split failed\n", msg, page_to_pfn(page));
 	unlock_page(page);
-
-	return 0;
+	put_page(page);
+	return -EBUSY;
 }
 
 static int memory_failure_hugetlb(unsigned long pfn, int flags)