diff mbox series

mm: clean up hwpoison page cache page in fault path

Message ID 20220211170557.7964a301@imladris.surriel.com (mailing list archive)
State New
Headers show
Series mm: clean up hwpoison page cache page in fault path | expand

Commit Message

Rik van Riel Feb. 11, 2022, 10:05 p.m. UTC
Sometimes the page offlining code can leave behind a hwpoisoned clean
page cache page. This can lead to programs being killed over and over
and over again as they fault in the hwpoisoned page, get killed, and
then get re-spawned by whatever wanted to run them.
    
This is particularly embarrassing when the page was offlined due to
having too many corrected memory errors. Now we are killing tasks
due to them trying to access memory that probably isn't even corrupted.
    
This problem can be avoided by invalidating the page from the page
fault handler, which already has a branch for dealing with these
kinds of pages. With this patch we simply pretend the page fault
was successful if the page was invalidated, return to userspace,
incur another page fault, read in the file from disk (to a new
memory page), and then everything works again.
    
Signed-off-by: Rik van Riel <riel@surriel.com>

Comments

Miaohe Lin Feb. 12, 2022, 3:10 a.m. UTC | #1
On 2022/2/12 6:05, Rik van Riel wrote:
> Sometimes the page offlining code can leave behind a hwpoisoned clean
> page cache page. This can lead to programs being killed over and over

Yep, __soft_offline_page tries to invalidate_inode_page in a lightway.

> and over again as they fault in the hwpoisoned page, get killed, and
> then get re-spawned by whatever wanted to run them.
>     
> This is particularly embarrassing when the page was offlined due to
> having too many corrected memory errors. Now we are killing tasks
> due to them trying to access memory that probably isn't even corrupted.
>     
> This problem can be avoided by invalidating the page from the page
> fault handler, which already has a branch for dealing with these
> kinds of pages. With this patch we simply pretend the page fault
> was successful if the page was invalidated, return to userspace,
> incur another page fault, read in the file from disk (to a new
> memory page), and then everything works again.
>     
> Signed-off-by: Rik van Riel <riel@surriel.com>

Good catch! This looks good to me. Thanks.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

> 
> diff --git a/mm/memory.c b/mm/memory.c
> index c125c4969913..2300358e268c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3871,11 +3871,16 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
>  		return ret;
>  
>  	if (unlikely(PageHWPoison(vmf->page))) {
> -		if (ret & VM_FAULT_LOCKED)
> +		int poisonret = VM_FAULT_HWPOISON;
> +		if (ret & VM_FAULT_LOCKED) {
> +			/* Retry if a clean page was removed from the cache. */
> +			if (invalidate_inode_page(vmf->page))
> +				poisonret = 0;
>  			unlock_page(vmf->page);
> +		}
>  		put_page(vmf->page);
>  		vmf->page = NULL;
> -		return VM_FAULT_HWPOISON;
> +		return poisonret;
>  	}
>  
>  	if (unlikely(!(ret & VM_FAULT_LOCKED)))
> 
>
kernel test robot Feb. 12, 2022, 6:17 p.m. UTC | #2
Hi Rik,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on hnaz-mm/master]

url:    https://github.com/0day-ci/linux/commits/Rik-van-Riel/mm-clean-up-hwpoison-page-cache-page-in-fault-path/20220212-060643
base:   https://github.com/hnaz/linux-mm master
config: sparc-randconfig-s031-20220211 (https://download.01.org/0day-ci/archive/20220212/202202122306.S9ByO64R-lkp@intel.com/config)
compiler: sparc-linux-gcc (GCC) 11.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # apt-get install sparse
        # sparse version: v0.6.4-dirty
        # https://github.com/0day-ci/linux/commit/abd960cdbc9487dbf0a0dc3b2395825a38f8fa44
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Rik-van-Riel/mm-clean-up-hwpoison-page-cache-page-in-fault-path/20220212-060643
        git checkout abd960cdbc9487dbf0a0dc3b2395825a38f8fa44
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=sparc SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
>> mm/memory.c:3913:33: sparse: sparse: incorrect type in initializer (different base types) @@     expected int poisonret @@     got restricted vm_fault_t @@
   mm/memory.c:3913:33: sparse:     expected int poisonret
   mm/memory.c:3913:33: sparse:     got restricted vm_fault_t
>> mm/memory.c:3922:24: sparse: sparse: incorrect type in return expression (different base types) @@     expected restricted vm_fault_t @@     got int [assigned] poisonret @@
   mm/memory.c:3922:24: sparse:     expected restricted vm_fault_t
   mm/memory.c:3922:24: sparse:     got int [assigned] poisonret
   mm/memory.c:1024:17: sparse: sparse: context imbalance in 'copy_pte_range' - different lock contexts for basic block
   mm/memory.c:1740:16: sparse: sparse: context imbalance in '__get_locked_pte' - different lock contexts for basic block
   mm/memory.c:1788:9: sparse: sparse: context imbalance in 'insert_page' - different lock contexts for basic block
   mm/memory.c:2290:17: sparse: sparse: context imbalance in 'remap_pte_range' - different lock contexts for basic block
   mm/memory.c:2546:17: sparse: sparse: context imbalance in 'apply_to_pte_range' - unexpected unlock
   mm/memory.c:2834:17: sparse: sparse: context imbalance in 'wp_page_copy' - unexpected unlock
   mm/memory.c:3173:17: sparse: sparse: context imbalance in 'wp_pfn_shared' - unexpected unlock
   mm/memory.c:3236:19: sparse: sparse: context imbalance in 'do_wp_page' - different lock contexts for basic block
   mm/memory.c:4939:5: sparse: sparse: context imbalance in 'follow_invalidate_pte' - different lock contexts for basic block
   mm/memory.c: note: in included file (through arch/sparc/include/asm/pgtable.h, include/linux/pgtable.h, include/linux/mm.h):
   arch/sparc/include/asm/pgtable_32.h:275:29: sparse: sparse: context imbalance in 'follow_pfn' - unexpected unlock

vim +3913 mm/memory.c

  3875	
  3876	/*
  3877	 * The mmap_lock must have been held on entry, and may have been
  3878	 * released depending on flags and vma->vm_ops->fault() return value.
  3879	 * See filemap_fault() and __lock_page_retry().
  3880	 */
  3881	static vm_fault_t __do_fault(struct vm_fault *vmf)
  3882	{
  3883		struct vm_area_struct *vma = vmf->vma;
  3884		vm_fault_t ret;
  3885	
  3886		/*
  3887		 * Preallocate pte before we take page_lock because this might lead to
  3888		 * deadlocks for memcg reclaim which waits for pages under writeback:
  3889		 *				lock_page(A)
  3890		 *				SetPageWriteback(A)
  3891		 *				unlock_page(A)
  3892		 * lock_page(B)
  3893		 *				lock_page(B)
  3894		 * pte_alloc_one
  3895		 *   shrink_page_list
  3896		 *     wait_on_page_writeback(A)
  3897		 *				SetPageWriteback(B)
  3898		 *				unlock_page(B)
  3899		 *				# flush A, B to clear the writeback
  3900		 */
  3901		if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
  3902			vmf->prealloc_pte = pte_alloc_one(vma->vm_mm);
  3903			if (!vmf->prealloc_pte)
  3904				return VM_FAULT_OOM;
  3905		}
  3906	
  3907		ret = vma->vm_ops->fault(vmf);
  3908		if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
  3909				    VM_FAULT_DONE_COW)))
  3910			return ret;
  3911	
  3912		if (unlikely(PageHWPoison(vmf->page))) {
> 3913			int poisonret = VM_FAULT_HWPOISON;
  3914			if (ret & VM_FAULT_LOCKED) {
  3915				/* Retry if a clean page was removed from the cache. */
  3916				if (invalidate_inode_page(vmf->page))
  3917					poisonret = 0;
  3918				unlock_page(vmf->page);
  3919			}
  3920			put_page(vmf->page);
  3921			vmf->page = NULL;
> 3922			return poisonret;
  3923		}
  3924	
  3925		if (unlikely(!(ret & VM_FAULT_LOCKED)))
  3926			lock_page(vmf->page);
  3927		else
  3928			VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
  3929	
  3930		return ret;
  3931	}
  3932	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
John Hubbard Feb. 13, 2022, 8:56 a.m. UTC | #3
On Fri, 11 Feb 2022, Rik van Riel wrote:
> Sometimes the page offlining code can leave behind a hwpoisoned clean
> page cache page. This can lead to programs being killed over and over
> and over again as they fault in the hwpoisoned page, get killed, and
> then get re-spawned by whatever wanted to run them.

Hi Rik,

This looks good. Some minor suggestions below:

>     
> This is particularly embarrassing when the page was offlined due to
> having too many corrected memory errors. Now we are killing tasks
> due to them trying to access memory that probably isn't even corrupted.

I'd recommend deleting that paragraph entirely. It's a separate
question, and it is not necessarily an accurate assessment of that
question either: the engineers who set the thresholds for "too many
corrected errors" may not--in fact, probably *will not*--agree with your
feeling that the memory is still working and reliable!

>     
> This problem can be avoided by invalidating the page from the page
> fault handler, which already has a branch for dealing with these
> kinds of pages. With this patch we simply pretend the page fault
> was successful if the page was invalidated, return to userspace,
> incur another page fault, read in the file from disk (to a new
> memory page), and then everything works again.

Very nice write-up here, it's rare that I get to read such a short, yet
clear explanation. Thank you for that. :)

>     
> Signed-off-by: Rik van Riel <riel@surriel.com>
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index c125c4969913..2300358e268c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3871,11 +3871,16 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
>  		return ret;
>  
>  	if (unlikely(PageHWPoison(vmf->page))) {
> -		if (ret & VM_FAULT_LOCKED)
> +		int poisonret = VM_FAULT_HWPOISON;
> +		if (ret & VM_FAULT_LOCKED) {

How about changing those two lines to these three (note the newline
after the declaration):

		vm_fault_t poison_ret = VM_FAULT_HWPOISON;

		if (ret & VM_FAULT_LOCKED) {

...which should fix the krobot complaint, and the underscore is just a
very tiny readability improvement while we're there.

> +			/* Retry if a clean page was removed from the cache. */

This is more cryptic than necessary, and in fact you've already provided
a write-up, so how about something like this, instead of the above:

		/*
		 * Avoid refaulting on the same poisoned page
		 * forever: attempt to invalidate the page. If that
		 * succeeds, then pretend the page fault was successful,
		 * return to userspace, incur another page fault, read
		 * in the file from disk (to a new page),and then
		 * everything works again.
		 */

> +			if (invalidate_inode_page(vmf->page))
> +				poisonret = 0;
>  			unlock_page(vmf->page);
> +		}
>  		put_page(vmf->page);
>  		vmf->page = NULL;
> -		return VM_FAULT_HWPOISON;
> +		return poisonret;
>  	}
>  
>  	if (unlikely(!(ret & VM_FAULT_LOCKED)))
> 
> 
> -- 
> All rights reversed.

Anyway, those are all documentation nits, except for the vm_fault_t,
which is a declaration nit :) , so I'm comfortable saying:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
Rik van Riel Feb. 13, 2022, 7:21 p.m. UTC | #4
On Sun, 2022-02-13 at 00:56 -0800, John Hubbard wrote:
> On Fri, 11 Feb 2022, Rik van Riel wrote:
> 
> >     
> > This is particularly embarrassing when the page was offlined due to
> > having too many corrected memory errors. Now we are killing tasks
> > due to them trying to access memory that probably isn't even
> > corrupted.
> 
> I'd recommend deleting that paragraph entirely. It's a separate
> question, and it is not necessarily an accurate assessment of that
> question either: the engineers who set the thresholds for "too many
> corrected errors" may not--in fact, probably *will not*--agree with
> your
> feeling that the memory is still working and reliable!

Fair enough. We try to offline pages before we get to
a point where the error correction might no longer be
able to correct the error correctly, but I am pretty
sure I have seen a few odd kernel crashes following a
stream of corrected errors that strongly suggested
corruption had in fact happened.

I'll take that paragraph out if anybody else asks
for further changes for v3 of the patch.
diff mbox series

Patch

diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..2300358e268c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3871,11 +3871,16 @@  static vm_fault_t __do_fault(struct vm_fault *vmf)
 		return ret;
 
 	if (unlikely(PageHWPoison(vmf->page))) {
-		if (ret & VM_FAULT_LOCKED)
+		int poisonret = VM_FAULT_HWPOISON;
+		if (ret & VM_FAULT_LOCKED) {
+			/* Retry if a clean page was removed from the cache. */
+			if (invalidate_inode_page(vmf->page))
+				poisonret = 0;
 			unlock_page(vmf->page);
+		}
 		put_page(vmf->page);
 		vmf->page = NULL;
-		return VM_FAULT_HWPOISON;
+		return poisonret;
 	}
 
 	if (unlikely(!(ret & VM_FAULT_LOCKED)))