Message ID | 20220211170557.7964a301@imladris.surriel.com (mailing list archive) |
---|---
State | New |
Series | mm: clean up hwpoison page cache page in fault path |
On 2022/2/12 6:05, Rik van Riel wrote:
> Sometimes the page offlining code can leave behind a hwpoisoned clean
> page cache page. This can lead to programs being killed over and over

Yep, __soft_offline_page tries to invalidate_inode_page in a lightweight way.

> and over again as they fault in the hwpoisoned page, get killed, and
> then get re-spawned by whatever wanted to run them.
>
> This is particularly embarrassing when the page was offlined due to
> having too many corrected memory errors. Now we are killing tasks
> due to them trying to access memory that probably isn't even corrupted.
>
> This problem can be avoided by invalidating the page from the page
> fault handler, which already has a branch for dealing with these
> kinds of pages. With this patch we simply pretend the page fault
> was successful if the page was invalidated, return to userspace,
> incur another page fault, read in the file from disk (to a new
> memory page), and then everything works again.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>

Good catch! This looks good to me. Thanks.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

>
> diff --git a/mm/memory.c b/mm/memory.c
> index c125c4969913..2300358e268c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3871,11 +3871,16 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
>  		return ret;
>  
>  	if (unlikely(PageHWPoison(vmf->page))) {
> -		if (ret & VM_FAULT_LOCKED)
> +		int poisonret = VM_FAULT_HWPOISON;
> +		if (ret & VM_FAULT_LOCKED) {
> +			/* Retry if a clean page was removed from the cache. */
> +			if (invalidate_inode_page(vmf->page))
> +				poisonret = 0;
>  			unlock_page(vmf->page);
> +		}
>  		put_page(vmf->page);
>  		vmf->page = NULL;
> -		return VM_FAULT_HWPOISON;
> +		return poisonret;
>  	}
>  
>  	if (unlikely(!(ret & VM_FAULT_LOCKED)))
Hi Rik,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on hnaz-mm/master]

url:    https://github.com/0day-ci/linux/commits/Rik-van-Riel/mm-clean-up-hwpoison-page-cache-page-in-fault-path/20220212-060643
base:   https://github.com/hnaz/linux-mm master
config: sparc-randconfig-s031-20220211 (https://download.01.org/0day-ci/archive/20220212/202202122306.S9ByO64R-lkp@intel.com/config)
compiler: sparc-linux-gcc (GCC) 11.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # apt-get install sparse
        # sparse version: v0.6.4-dirty
        # https://github.com/0day-ci/linux/commit/abd960cdbc9487dbf0a0dc3b2395825a38f8fa44
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Rik-van-Riel/mm-clean-up-hwpoison-page-cache-page-in-fault-path/20220212-060643
        git checkout abd960cdbc9487dbf0a0dc3b2395825a38f8fa44
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=sparc SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

sparse warnings: (new ones prefixed by >>)

>> mm/memory.c:3913:33: sparse: sparse: incorrect type in initializer (different base types) @@     expected int poisonret @@     got restricted vm_fault_t @@
   mm/memory.c:3913:33: sparse:     expected int poisonret
   mm/memory.c:3913:33: sparse:     got restricted vm_fault_t
>> mm/memory.c:3922:24: sparse: sparse: incorrect type in return expression (different base types) @@     expected restricted vm_fault_t @@     got int [assigned] poisonret @@
   mm/memory.c:3922:24: sparse:     expected restricted vm_fault_t
   mm/memory.c:3922:24: sparse:     got int [assigned] poisonret
   mm/memory.c:1024:17: sparse: sparse: context imbalance in 'copy_pte_range' - different lock contexts for basic block
   mm/memory.c:1740:16: sparse: sparse: context imbalance in '__get_locked_pte' - different lock contexts for basic block
   mm/memory.c:1788:9: sparse: sparse: context imbalance in 'insert_page' - different lock contexts for basic block
   mm/memory.c:2290:17: sparse: sparse: context imbalance in 'remap_pte_range' - different lock contexts for basic block
   mm/memory.c:2546:17: sparse: sparse: context imbalance in 'apply_to_pte_range' - unexpected unlock
   mm/memory.c:2834:17: sparse: sparse: context imbalance in 'wp_page_copy' - unexpected unlock
   mm/memory.c:3173:17: sparse: sparse: context imbalance in 'wp_pfn_shared' - unexpected unlock
   mm/memory.c:3236:19: sparse: sparse: context imbalance in 'do_wp_page' - different lock contexts for basic block
   mm/memory.c:4939:5: sparse: sparse: context imbalance in 'follow_invalidate_pte' - different lock contexts for basic block
   mm/memory.c: note: in included file (through arch/sparc/include/asm/pgtable.h, include/linux/pgtable.h, include/linux/mm.h):
   arch/sparc/include/asm/pgtable_32.h:275:29: sparse: sparse: context imbalance in 'follow_pfn' - unexpected unlock

vim +3913 mm/memory.c

  3875	
  3876	/*
  3877	 * The mmap_lock must have been held on entry, and may have been
  3878	 * released depending on flags and vma->vm_ops->fault() return value.
  3879	 * See filemap_fault() and __lock_page_retry().
  3880	 */
  3881	static vm_fault_t __do_fault(struct vm_fault *vmf)
  3882	{
  3883		struct vm_area_struct *vma = vmf->vma;
  3884		vm_fault_t ret;
  3885	
  3886		/*
  3887		 * Preallocate pte before we take page_lock because this might lead to
  3888		 * deadlocks for memcg reclaim which waits for pages under writeback:
  3889		 *				lock_page(A)
  3890		 *				SetPageWriteback(A)
  3891		 *				unlock_page(A)
  3892		 * lock_page(B)
  3893		 *				lock_page(B)
  3894		 * pte_alloc_one
  3895		 *   shrink_page_list
  3896		 *     wait_on_page_writeback(A)
  3897		 *				SetPageWriteback(B)
  3898		 *				unlock_page(B)
  3899		 *				# flush A, B to clear the writeback
  3900		 */
  3901		if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
  3902			vmf->prealloc_pte = pte_alloc_one(vma->vm_mm);
  3903			if (!vmf->prealloc_pte)
  3904				return VM_FAULT_OOM;
  3905		}
  3906	
  3907		ret = vma->vm_ops->fault(vmf);
  3908		if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
  3909			    VM_FAULT_DONE_COW)))
  3910			return ret;
  3911	
  3912		if (unlikely(PageHWPoison(vmf->page))) {
> 3913			int poisonret = VM_FAULT_HWPOISON;
  3914			if (ret & VM_FAULT_LOCKED) {
  3915				/* Retry if a clean page was removed from the cache. */
  3916				if (invalidate_inode_page(vmf->page))
  3917					poisonret = 0;
  3918				unlock_page(vmf->page);
  3919			}
  3920			put_page(vmf->page);
  3921			vmf->page = NULL;
> 3922			return poisonret;
  3923		}
  3924	
  3925		if (unlikely(!(ret & VM_FAULT_LOCKED)))
  3926			lock_page(vmf->page);
  3927		else
  3928			VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
  3929	
  3930		return ret;
  3931	}
  3932	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
On Fri, 11 Feb 2022, Rik van Riel wrote:

> Sometimes the page offlining code can leave behind a hwpoisoned clean
> page cache page. This can lead to programs being killed over and over
> and over again as they fault in the hwpoisoned page, get killed, and
> then get re-spawned by whatever wanted to run them.

Hi Rik,

This looks good. Some minor suggestions below:

>
> This is particularly embarrassing when the page was offlined due to
> having too many corrected memory errors. Now we are killing tasks
> due to them trying to access memory that probably isn't even corrupted.

I'd recommend deleting that paragraph entirely. It's a separate
question, and it is not necessarily an accurate assessment of that
question either: the engineers who set the thresholds for "too many
corrected errors" may not--in fact, probably *will not*--agree with your
feeling that the memory is still working and reliable!

>
> This problem can be avoided by invalidating the page from the page
> fault handler, which already has a branch for dealing with these
> kinds of pages. With this patch we simply pretend the page fault
> was successful if the page was invalidated, return to userspace,
> incur another page fault, read in the file from disk (to a new
> memory page), and then everything works again.

Very nice write-up here, it's rare that I get to read such a short, yet
clear explanation. Thank you for that. :)

>
> Signed-off-by: Rik van Riel <riel@surriel.com>
>
> diff --git a/mm/memory.c b/mm/memory.c
> index c125c4969913..2300358e268c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3871,11 +3871,16 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
>  		return ret;
>  
>  	if (unlikely(PageHWPoison(vmf->page))) {
> -		if (ret & VM_FAULT_LOCKED)
> +		int poisonret = VM_FAULT_HWPOISON;
> +		if (ret & VM_FAULT_LOCKED) {

How about changing those two lines to these three (note the newline
after the declaration):

		vm_fault_t poison_ret = VM_FAULT_HWPOISON;

		if (ret & VM_FAULT_LOCKED) {

...which should fix the krobot complaint, and the underscore is just a
very tiny readability improvement while we're there.

> +			/* Retry if a clean page was removed from the cache. */

This is more cryptic than necessary, and in fact you've already provided
a write-up, so how about something like this, instead of the above:

			/*
			 * Avoid refaulting on the same poisoned page
			 * forever: attempt to invalidate the page. If that
			 * succeeds, then pretend the page fault was successful,
			 * return to userspace, incur another page fault, read
			 * in the file from disk (to a new page), and then
			 * everything works again.
			 */

> +			if (invalidate_inode_page(vmf->page))
> +				poisonret = 0;
>  			unlock_page(vmf->page);
> +		}
>  		put_page(vmf->page);
>  		vmf->page = NULL;
> -		return VM_FAULT_HWPOISON;
> +		return poisonret;
>  	}
>
>  	if (unlikely(!(ret & VM_FAULT_LOCKED)))
>
>
> --
> All rights reversed.

Anyway, those are all documentation nits, except for the vm_fault_t,
which is a declaration nit :) , so I'm comfortable saying:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
On Sun, 2022-02-13 at 00:56 -0800, John Hubbard wrote:
> On Fri, 11 Feb 2022, Rik van Riel wrote:
>
> > This is particularly embarrassing when the page was offlined due to
> > having too many corrected memory errors. Now we are killing tasks
> > due to them trying to access memory that probably isn't even
> > corrupted.
>
> I'd recommend deleting that paragraph entirely. It's a separate
> question, and it is not necessarily an accurate assessment of that
> question either: the engineers who set the thresholds for "too many
> corrected errors" may not--in fact, probably *will not*--agree with
> your feeling that the memory is still working and reliable!

Fair enough. We try to offline pages before we get to a point where
the error correction might no longer be able to correct the error
correctly, but I am pretty sure I have seen a few odd kernel crashes
following a stream of corrected errors that strongly suggested
corruption had in fact happened.

I'll take that paragraph out if anybody else asks for further changes
for v3 of the patch.
diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..2300358e268c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3871,11 +3871,16 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 		return ret;
 
 	if (unlikely(PageHWPoison(vmf->page))) {
-		if (ret & VM_FAULT_LOCKED)
+		int poisonret = VM_FAULT_HWPOISON;
+		if (ret & VM_FAULT_LOCKED) {
+			/* Retry if a clean page was removed from the cache. */
+			if (invalidate_inode_page(vmf->page))
+				poisonret = 0;
 			unlock_page(vmf->page);
+		}
 		put_page(vmf->page);
 		vmf->page = NULL;
-		return VM_FAULT_HWPOISON;
+		return poisonret;
 	}
 
 	if (unlikely(!(ret & VM_FAULT_LOCKED)))
Sometimes the page offlining code can leave behind a hwpoisoned clean
page cache page. This can lead to programs being killed over and over
and over again as they fault in the hwpoisoned page, get killed, and
then get re-spawned by whatever wanted to run them.

This is particularly embarrassing when the page was offlined due to
having too many corrected memory errors. Now we are killing tasks
due to them trying to access memory that probably isn't even corrupted.

This problem can be avoided by invalidating the page from the page
fault handler, which already has a branch for dealing with these
kinds of pages. With this patch we simply pretend the page fault
was successful if the page was invalidated, return to userspace,
incur another page fault, read in the file from disk (to a new
memory page), and then everything works again.

Signed-off-by: Rik van Riel <riel@surriel.com>