[0/6] hwpoison, shmem, hugetlb: fix data loss issue 5.10.y

Message ID 20221123195408.135161-1-mike.kravetz@oracle.com (mailing list archive)

Mike Kravetz Nov. 23, 2022, 7:54 p.m. UTC
This is a request for adding the following patches to stable 5.10.y.

Poisoned shmem and hugetlb pages are removed from the pagecache.
Subsequent access to the offset in the file results in a NEW zero
filled page.  Application code does not get notified of the data
loss, and the only 'clue' is a message in the system log.  Data
loss has been experienced by real users.
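
As an illustration only (this is not kernel code), the difference between the
old and the fixed behavior can be modeled with a toy page cache. Every name
below (toy_file, toy_page, toy_memory_failure, toy_read, keep_poisoned) is
hypothetical, and returning -EIO from the read path is an assumption about how
the fixed kernel reports the error to a read-style access.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_NPAGES    4
#define TOY_PAGE_SIZE 16

/* Hypothetical stand-in for one cached shmem/hugetlb page. */
struct toy_page {
	bool present;   /* page is in the toy page cache */
	bool poisoned;  /* a memory error was reported for this page */
	char data[TOY_PAGE_SIZE];
};

/* Hypothetical stand-in for a memory-backed file. */
struct toy_file {
	struct toy_page cache[TOY_NPAGES];
	bool keep_poisoned; /* false: old behavior, true: fixed behavior */
};

static void toy_file_init(struct toy_file *f, bool keep_poisoned)
{
	memset(f, 0, sizeof(*f));
	f->keep_poisoned = keep_poisoned;
}

/* Store a short string in page idx, instantiating the page if needed. */
static void toy_write(struct toy_file *f, size_t idx, const char *s)
{
	assert(idx < TOY_NPAGES);
	struct toy_page *p = &f->cache[idx];

	p->present = true;
	strncpy(p->data, s, TOY_PAGE_SIZE - 1);
	p->data[TOY_PAGE_SIZE - 1] = '\0';
}

/* Model a memory-failure event hitting page idx. */
static void toy_memory_failure(struct toy_file *f, size_t idx)
{
	assert(idx < TOY_NPAGES);
	struct toy_page *p = &f->cache[idx];

	if (!p->present)
		return;
	if (f->keep_poisoned)
		p->poisoned = true;        /* page stays cached, marked bad */
	else
		memset(p, 0, sizeof(*p));  /* page dropped, data silently gone */
}

/* Copy page idx into buf. Returns 0 on success or -EIO for a poisoned page. */
static int toy_read(struct toy_file *f, size_t idx, char *buf)
{
	assert(idx < TOY_NPAGES);
	struct toy_page *p = &f->cache[idx];

	if (p->present && p->poisoned)
		return -EIO;
	if (!p->present) {
		/* Cache miss on a memory-backed file: hand out a NEW zero page. */
		memset(p, 0, sizeof(*p));
		p->present = true;
	}
	memcpy(buf, p->data, TOY_PAGE_SIZE);
	return 0;
}
```

With keep_poisoned set to false, a read after toy_memory_failure() succeeds
and returns zeroes, which is the silent data loss described above. With it set
to true, the same read fails, so the caller learns that the data is gone.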

This was addressed upstream.  Most commits were marked for backports,
but some were not.  This was discussed here [1] and here [2].

Patches apply cleanly to v5.4.224 and pass tests checking for this
specific data loss issue.  LTP mm tests show no regressions.

All patches except patch 4, "mm: hwpoison: handle non-anonymous THP
correctly", required small changes to apply correctly, mostly for context.

linux-mm Cc'ed as it would be great to get at least an ACK from others
familiar with this issue.

[1] https://lore.kernel.org/linux-mm/Y2UTUNBHVY5U9si2@monkey/
[2] https://lore.kernel.org/stable/20221114131403.GA3807058@u2004/

James Houghton (1):
  hugetlbfs: don't delete error page from pagecache

Yang Shi (5):
  mm: hwpoison: remove the unnecessary THP check
  mm: filemap: check if THP has hwpoisoned subpage for PMD page fault
  mm: hwpoison: refactor refcount check handling
  mm: hwpoison: handle non-anonymous THP correctly
  mm: shmem: don't truncate page if memory failure happens

 fs/hugetlbfs/inode.c       |  13 ++--
 include/linux/page-flags.h |  23 ++++++
 mm/huge_memory.c           |   2 +
 mm/hugetlb.c               |   4 +
 mm/memory-failure.c        | 153 ++++++++++++++++++++++++-------------
 mm/memory.c                |   9 +++
 mm/page_alloc.c            |   4 +-
 mm/shmem.c                 |  51 +++++++++++--
 8 files changed, 191 insertions(+), 68 deletions(-)

Shuai Xue Feb. 20, 2023, 11:38 a.m. UTC | #1
On 2022/11/24 3:54 AM, Mike Kravetz wrote:
> This is a request for adding the following patches to stable 5.10.y.
> 
> Poisoned shmem and hugetlb pages are removed from the pagecache.
> Subsequent access to the offset in the file results in a NEW zero
> filled page.  Application code does not get notified of the data
> loss, and the only 'clue' is a message in the system log.  Data
> loss has been experienced by real users.
> 
> This was addressed upstream.  Most commits were marked for backports,
> but some were not.  This was discussed here [1] and here [2].
> 
> Patches apply cleanly to v5.4.224 and pass tests checking for this
> specific data loss issue.  LTP mm tests show no regressions.
> 
> All patches except 4 "mm: hwpoison: handle non-anonymous THP correctly"
> required a small bit of change to apply correctly: mostly for context.
> 
> linux-mm Cc'ed as it would be great to get at least an ACK from others
> familiar with this issue.
> 
> [1] https://lore.kernel.org/linux-mm/Y2UTUNBHVY5U9si2@monkey/
> [2] https://lore.kernel.org/stable/20221114131403.GA3807058@u2004/
> 
> James Houghton (1):
>   hugetlbfs: don't delete error page from pagecache
> 
> Yang Shi (5):
>   mm: hwpoison: remove the unnecessary THP check
>   mm: filemap: check if THP has hwpoisoned subpage for PMD page fault
>   mm: hwpoison: refactor refcount check handling
>   mm: hwpoison: handle non-anonymous THP correctly
>   mm: shmem: don't truncate page if memory failure happens
> 
>  fs/hugetlbfs/inode.c       |  13 ++--
>  include/linux/page-flags.h |  23 ++++++
>  mm/huge_memory.c           |   2 +
>  mm/hugetlb.c               |   4 +
>  mm/memory-failure.c        | 153 ++++++++++++++++++++++++-------------
>  mm/memory.c                |   9 +++
>  mm/page_alloc.c            |   4 +-
>  mm/shmem.c                 |  51 +++++++++++--
>  8 files changed, 191 insertions(+), 68 deletions(-)
> 

Hi, folks

Thank you for your effort. Data loss breaks data consistency for end
users, so it is critical that they are notified.

I applied this patch set to the 5.10.168 stable release [1] and ran the
mm_regression [2] test cases following the steps [3] provided by Naoya.
All four cases passed.

	#./run.sh project summary -p
	Project Name: debug
	PASS mm/hwpoison/shmem_link/link-hard.auto3
	PASS mm/hwpoison/shmem_link/link-sym.auto3
	PASS mm/hwpoison/shmem_rw/thp-always.auto3
	PASS mm/hwpoison/shmem_rw/thp-never.auto3
	Progress: 4 / 4 (100%)

Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>

Cheers,
Shuai

[1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tag/?h=v5.10.168
[2] https://github.com/nhoriguchi/mm_regression
[3] https://lore.kernel.org/stable/20221116235842.GA62826@u2004/