[v5] mm/gup: check page hwposion status for coredump.

Message ID	20210322193318.377c9ce9@alex-virtual-machine (mailing list archive)
State	New, archived
Headers	From: Aili Yao <yaoaili@kingsoft.com>
	Date: Mon, 22 Mar 2021 19:33:18 +0800
	To: Matthew Wilcox <willy@infradead.org>, David Hildenbrand <david@redhat.com>, <akpm@linux-foundation.org>, <naoya.horiguchi@nec.com>
	CC: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>, <yangfeng1@kingsoft.com>, <sunhao2@kingsoft.com>, Oscar Salvador <osalvador@suse.de>, Mike Kravetz <mike.kravetz@oracle.com>, <yaoaili@kingsoft.com>
	Subject: [PATCH v5] mm/gup: check page hwposion status for coredump.
	In-Reply-To: <20210320003516.GC3420@casper.infradead.org>

yaoaili [么爱利] March 22, 2021, 11:33 a.m. UTC

When we dump core for a process killed by a signal, the signal may be a
SIGBUS with code BUS_MCEERR_AR or BUS_MCEERR_AO, which means it resulted
from an ECC memory failure such as SRAR or SRAO. We expect the memory
recovery work to have finished correctly, so that get_dump_page() will not
return the error page, because the process's pte has been set invalid by
memory_failure().

But memory_failure() may fail, and the process's related pte may not be
correctly set invalid. With the current code, we will return the poisoned
page and dump it, which leads to a system panic because the access happens
in kernel code.

So check the hwpoison status in get_dump_page(), and if it is set, return
NULL.

There may be other scenarios where it is also better to check the hwpoison
status and not panic, so make a wrapper for this check. Thanks to David
(<david@redhat.com>) for the suggestion.

Link: https://lkml.kernel.org/r/20210319104437.6f30e80d@alex-virtual-machine
Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 mm/gup.c      |  4 ++++
 mm/internal.h | 20 ++++++++++++++++++++
 2 files changed, 24 insertions(+)

David Hildenbrand March 26, 2021, 2:09 p.m. UTC | #1

On 22.03.21 12:33, Aili Yao wrote:
> diff --git a/mm/gup.c b/mm/gup.c
> index e4c224c..6f7e1aa 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
>   				      FOLL_FORCE | FOLL_DUMP | FOLL_GET);
>   	if (locked)
>   		mmap_read_unlock(mm);

Thinking again, wouldn't we get -EFAULT from __get_user_pages_locked() 
when stumbling over a hwpoisoned page?

See __get_user_pages_locked()->__get_user_pages()->faultin_page():

handle_mm_fault()->vm_fault_to_errno(), which translates 
VM_FAULT_HWPOISON to -EFAULT, unless FOLL_HWPOISON is set (-> -EHWPOISON)

?
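
The errno translation described above can be modelled in user space. This is only a minimal sketch, not the kernel source: the MODEL_* bit values and the model_vm_fault_to_errno() name are stand-ins invented for illustration, and only the hwpoison branch discussed here is represented.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative stand-in bit values; not the kernel's real definitions. */
#define MODEL_VM_FAULT_HWPOISON 0x1u
#define MODEL_FOLL_HWPOISON     0x2u

#ifndef EHWPOISON
#define EHWPOISON 133 /* asm-generic Linux value; fallback for other libcs */
#endif

/*
 * Model of the translation step: a hwpoison fault becomes -EHWPOISON only
 * for callers that passed the hwpoison-aware flag, and -EFAULT otherwise.
 */
static int model_vm_fault_to_errno(unsigned int vm_fault,
				   unsigned int foll_flags)
{
	/* Only the hwpoison fault bit is modelled here. */
	assert((vm_fault & ~MODEL_VM_FAULT_HWPOISON) == 0);

	if (vm_fault & MODEL_VM_FAULT_HWPOISON)
		return (foll_flags & MODEL_FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
	return 0;
}
```

Under this model, a caller that does not pass the flag sees the same -EFAULT it would get for any other bad address, which is the behaviour the question above assumes for get_dump_page().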

> +
> +	if (ret == 1 && is_page_hwpoison(page))
> +		return NULL;
> +
>   	return (ret == 1) ? page : NULL;
>   }
>   #endif /* CONFIG_ELF_CORE */
> diff --git a/mm/internal.h b/mm/internal.h
> index 25d2b2439..b751cef 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -97,6 +97,26 @@ static inline void set_page_refcounted(struct page *page)
>   	set_page_count(page, 1);
>   }
>   
> +/*
> + * When kernel touch the user page, the user page may be have been marked
> + * poison but still mapped in user space, if without this page, the kernel
> + * can guarantee the data integrity and operation success, the kernel is
> + * better to check the posion status and avoid touching it, be good not to
> + * panic, coredump for process fatal signal is a sample case matching this
> + * scenario. Or if kernel can't guarantee the data integrity, it's better
> + * not to call this function, let kernel touch the poison page and get to
> + * panic.
> + */
> +static inline bool is_page_hwpoison(struct page *page)
> +{
> +	if (PageHWPoison(page))
> +		return true;
> +	else if (PageHuge(page) && PageHWPoison(compound_head(page)))
> +		return true;
> +
> +	return false;
> +}
> +
>   extern unsigned long highest_memmap_pfn;
>   
>   /*
>
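
For readers without a kernel tree at hand, the logic of the quoted helper and of the new get_dump_page() tail can be exercised in user space. This is a simplified sketch under stated assumptions: struct model_page and the model_* functions are hypothetical stand-ins for struct page, PageHWPoison(), PageHuge() and compound_head(), and reference counting is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct page. */
struct model_page {
	bool hwpoison;           /* models PageHWPoison() */
	bool huge;               /* models PageHuge() */
	struct model_page *head; /* models compound_head(); points to itself
				  * for a head page or a small page */
};

/*
 * Same shape as the quoted helper: a page counts as poisoned if it is
 * flagged itself, or if it is part of a huge page whose head is flagged.
 */
static bool model_is_page_hwpoison(const struct model_page *page)
{
	assert(page != NULL && page->head != NULL);

	if (page->hwpoison)
		return true;
	return page->huge && page->head->hwpoison;
}

/*
 * Models the end of get_dump_page() after the patch: ret == 1 means the
 * lookup returned exactly one page. Refcounting is intentionally left out.
 */
static const struct model_page *
model_dump_page_result(long ret, const struct model_page *page)
{
	if (ret == 1 && model_is_page_hwpoison(page))
		return NULL;
	return (ret == 1) ? page : NULL;
}
```

In this model a clean page is returned unchanged, while a tail page of a poisoned huge page yields NULL, so the dump would skip it instead of reading it.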

David Hildenbrand March 26, 2021, 2:22 p.m. UTC | #2

On 26.03.21 15:09, David Hildenbrand wrote:
> Thinking again, wouldn't we get -EFAULT from __get_user_pages_locked()
> when stumbling over a hwpoisoned page?
> 
> See __get_user_pages_locked()->__get_user_pages()->faultin_page():
> 
> handle_mm_fault()->vm_fault_to_errno(), which translates
> VM_FAULT_HWPOISON to -EFAULT, unless FOLL_HWPOISON is set (-> -EHWPOISON)
> 
> ?

Or doesn't that happen as you describe "But memory_failure() may fail, 
and the process's related pte may not be correctly set invalid" -- but 
why does that happen?

On a similar thought, should get_user_pages() never return a page that 
has HWPoison set? E.g., check also for existing PTEs if the page is 
hwpoisoned?

@Naoya, Oscar

HORIGUCHI NAOYA(堀口直也) March 31, 2021, 1:52 a.m. UTC | #3

On Fri, Mar 26, 2021 at 03:22:49PM +0100, David Hildenbrand wrote:
> On 26.03.21 15:09, David Hildenbrand wrote:
> > Thinking again, wouldn't we get -EFAULT from __get_user_pages_locked()
> > when stumbling over a hwpoisoned page?
> > 
> > See __get_user_pages_locked()->__get_user_pages()->faultin_page():
> > 
> > handle_mm_fault()->vm_fault_to_errno(), which translates
> > VM_FAULT_HWPOISON to -EFAULT, unless FOLL_HWPOISON is set (-> -EHWPOISON)
> > 
> > ?

We could get -EFAULT, but sometimes not (it depends on how memory_failure() fails).

If we failed to unmap, the page table entry is not converted to a hwpoison entry,
so __get_user_pages_locked() gets the hwpoisoned page.

If we successfully unmapped but failed in truncate_error_page() for example,
the processes mapping the page would get -EFAULT as expected.  But even in
this case, other processes could reach the error page via page cache and
__get_user_pages_locked() for them could return the hwpoisoned page.

> 
> Or doesn't that happen as you describe "But memory_failure() may fail, and
> the process's related pte may not be correctly set invalid" -- but why does
> that happen?

Simply because memory_failure() doesn't handle some page types, such as KSM pages
and the zero page. Shmem THP may also belong to this class.

> 
> On a similar thought, should get_user_pages() never return a page that has
> HWPoison set? E.g., check also for existing PTEs if the page is hwpoisoned?

Makes sense to me. Inserting a hwpoison check into follow_page_pte() and
follow_huge_pmd() might work well.

Thanks,
Naoya Horiguchi
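
The difference between the "failed to unmap" case and a successful unmap, as described above, can be illustrated with a small user-space model. It is a sketch only: struct model_pte, the two PTE kinds and model_gup_one() are hypothetical names, and the real lookup path is far more involved.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page. */
struct model_page {
	bool hwpoison;
};

enum model_pte_kind {
	MODEL_PTE_PRESENT,        /* still maps the page */
	MODEL_PTE_HWPOISON_ENTRY  /* replaced by a hwpoison marker after
				   * a successful unmap */
};

struct model_pte {
	enum model_pte_kind kind;
	struct model_page *page;  /* only meaningful for MODEL_PTE_PRESENT */
};

/*
 * Model of a lookup that does no hwpoison check on present PTEs.
 * Returns 1 and stores the page on success, or a negative errno.
 */
static int model_gup_one(const struct model_pte *pte, struct model_page **out)
{
	assert(pte != NULL && out != NULL);

	*out = NULL;
	if (pte->kind == MODEL_PTE_HWPOISON_ENTRY)
		return -EFAULT;  /* the fault path reports the poison */

	*out = pte->page;        /* present PTE: page is handed back as-is */
	return 1;
}
```

With a present PTE the model hands back the poisoned page unchanged, which is the situation the patch tries to catch; only the converted PTE produces -EFAULT.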

yaoaili [么爱利] March 31, 2021, 2:43 a.m. UTC | #4

On Wed, 31 Mar 2021 01:52:59 +0000
HORIGUCHI NAOYA(堀口　直也) <naoya.horiguchi@nec.com> wrote:

> On Fri, Mar 26, 2021 at 03:22:49PM +0100, David Hildenbrand wrote:
> > On 26.03.21 15:09, David Hildenbrand wrote:  
> > > Thinking again, wouldn't we get -EFAULT from __get_user_pages_locked()
> > > when stumbling over a hwpoisoned page?
> > > 
> > > See __get_user_pages_locked()->__get_user_pages()->faultin_page():
> > > 
> > > handle_mm_fault()->vm_fault_to_errno(), which translates
> > > VM_FAULT_HWPOISON to -EFAULT, unless FOLL_HWPOISON is set (-> -EHWPOISON)
> > > 
> > > ?  
> 
> We could get -EFAULT, but sometimes not (depends on how memory_failure() fails).
> 
> If we failed to unmap, the page table is not converted to hwpoison entry,
> so __get_user_pages_locked() get the hwpoisoned page.
> 
> If we successfully unmapped but failed in truncate_error_page() for example,
> the processes mapping the page would get -EFAULT as expected.  But even in
> this case, other processes could reach the error page via page cache and
> __get_user_pages_locked() for them could return the hwpoisoned page.
> 
> > 
> > Or doesn't that happen as you describe "But memory_failure() may fail, and
> > the process's related pte may not be correctly set invalid" -- but why does
> > that happen?  
> 
> Simply because memory_failure() doesn't handle some page types like ksm page
> and zero page. Or maybe shmem thp also belongs to this class.
> 
> > 
> > On a similar thought, should get_user_pages() never return a page that has
> > HWPoison set? E.g., check also for existing PTEs if the page is hwpoisoned?  
> 
> Make sense to me. Maybe inserting hwpoison check into follow_page_pte() and
> follow_huge_pmd() would work well.

I think we should be careful about extending the hwpoison check to other cases.
The SIGBUS coredump is a case that is not supposed to touch the poisoned page,
and if we return NULL for it, the coredump will finish successfully.

Other cases may meet the same requirements as coredump, but we need to identify
them; that is the purpose of the poison check wrapper. Otherwise, we may break the
integrity of the related operation, which may be no better than a panic.

HORIGUCHI NAOYA(堀口直也) March 31, 2021, 4:32 a.m. UTC | #5

On Wed, Mar 31, 2021 at 10:43:36AM +0800, Aili Yao wrote:
> On Wed, 31 Mar 2021 01:52:59 +0000 HORIGUCHI NAOYA(堀口　直也) <naoya.horiguchi@nec.com> wrote:
> > On Fri, Mar 26, 2021 at 03:22:49PM +0100, David Hildenbrand wrote:
> > > On 26.03.21 15:09, David Hildenbrand wrote:  
> > > > Thinking again, wouldn't we get -EFAULT from __get_user_pages_locked()
> > > > when stumbling over a hwpoisoned page?
> > > > 
> > > > See __get_user_pages_locked()->__get_user_pages()->faultin_page():
> > > > 
> > > > handle_mm_fault()->vm_fault_to_errno(), which translates
> > > > VM_FAULT_HWPOISON to -EFAULT, unless FOLL_HWPOISON is set (-> -EHWPOISON)
> > > > 
> > > > ?  
> > 
> > We could get -EFAULT, but sometimes not (depends on how memory_failure() fails).
> > 
> > If we failed to unmap, the page table is not converted to hwpoison entry,
> > so __get_user_pages_locked() get the hwpoisoned page.
> > 
> > If we successfully unmapped but failed in truncate_error_page() for example,
> > the processes mapping the page would get -EFAULT as expected.  But even in
> > this case, other processes could reach the error page via page cache and
> > __get_user_pages_locked() for them could return the hwpoisoned page.
> > 
> > > 
> > > Or doesn't that happen as you describe "But memory_failure() may fail, and
> > > the process's related pte may not be correctly set invalid" -- but why does
> > > that happen?  
> > 
> > Simply because memory_failure() doesn't handle some page types like ksm page
> > and zero page. Or maybe shmem thp also belongs to this class.
> > 
> > > 
> > > On a similar thought, should get_user_pages() never return a page that has
> > > HWPoison set? E.g., check also for existing PTEs if the page is hwpoisoned?  
> > 
> > Make sense to me. Maybe inserting hwpoison check into follow_page_pte() and
> > follow_huge_pmd() would work well.
> 
> I think we should take more care to broadcast the hwpoison check to other cases,
> SIGBUS coredump is such a case that it is supposed to not touch the poison page, 
> and if we return NULL for this, the coredump process will get a successful finish.
> 
> Other cases may also meet the requirements like coredump, but we need to identify it,
> that's the poison check wrapper's purpose. If not, we may break the integrity of the
> related action, which may be no better than panic.

If you are worried about regressions and would like to make this new behavior
conditional, we could use FOLL_HWPOISON to specify that the caller is
hwpoison-aware, so that any !FOLL_HWPOISON caller ignores the hwpoison check and
works as it does now. This approach looks helpful to me because it would encourage
developers touching gup code to pay attention to FOLL_HWPOISON code.

Thanks,
Naoya Horiguchi
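
The conditional behaviour suggested above might look roughly like the following user-space model. It is a sketch under assumptions, not a proposed patch: model_follow_page(), struct model_page and the MODEL_FOLL_HWPOISON bit are illustrative names, and returning -EHWPOISON for the flagged case is an assumption, since the thread does not say which error a hwpoison-aware caller should see.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in bit value; not the kernel's real definition. */
#define MODEL_FOLL_HWPOISON 0x2u

#ifndef EHWPOISON
#define EHWPOISON 133 /* asm-generic Linux value; fallback for other libcs */
#endif

/* Hypothetical stand-in for struct page. */
struct model_page {
	bool hwpoison;
};

/*
 * Only callers that declare themselves hwpoison-aware get the new check;
 * every other caller keeps today's behaviour and receives the page.
 * Returns 0 and stores the page, or a negative errno.
 */
static int model_follow_page(struct model_page *page, unsigned int foll_flags,
			     struct model_page **out)
{
	assert(page != NULL && out != NULL);

	*out = NULL;
	if ((foll_flags & MODEL_FOLL_HWPOISON) && page->hwpoison)
		return -EHWPOISON; /* assumed error code, see lead-in */

	*out = page;
	return 0;
}
```

The design point is that the check is opt-in: a caller that never passes the flag cannot regress, because its path through the function is unchanged.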

Matthew Wilcox March 31, 2021, 6:07 a.m. UTC | #6

On Wed, Mar 31, 2021 at 01:52:59AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> If we successfully unmapped but failed in truncate_error_page() for example,
> the processes mapping the page would get -EFAULT as expected.  But even in
> this case, other processes could reach the error page via page cache and
> __get_user_pages_locked() for them could return the hwpoisoned page.

How would that happen?  We check PageHWPoison before inserting a page
into the page tables.  See, eg, filemap_map_pages() and __do_fault().
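
The check-before-insert rule mentioned above can be pictured with a small user-space model. This is a sketch only: model_insert_page(), struct model_pte and MODEL_VM_FAULT_HWPOISON are hypothetical names, and the real fault paths named above differ in detail from this single function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in bit value; not the kernel's real definition. */
#define MODEL_VM_FAULT_HWPOISON 0x1u

/* Hypothetical stand-ins for struct page and a page table entry. */
struct model_page {
	bool hwpoison;
};

struct model_pte {
	struct model_page *page; /* NULL while nothing is mapped */
};

/*
 * Model of a fault path that refuses to map a poisoned page: the PTE is
 * left untouched and a hwpoison fault code is reported instead.
 */
static unsigned int model_insert_page(struct model_pte *pte,
				      struct model_page *page)
{
	assert(pte != NULL && page != NULL);

	if (page->hwpoison)
		return MODEL_VM_FAULT_HWPOISON;

	pte->page = page;
	return 0;
}
```

In this model a page that is already poisoned in the page cache can never become reachable through a new mapping, which is the reason given above for doubting the page-cache scenario.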

David Hildenbrand March 31, 2021, 6:44 a.m. UTC | #7

On 31.03.21 06:32, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Wed, Mar 31, 2021 at 10:43:36AM +0800, Aili Yao wrote:
>> On Wed, 31 Mar 2021 01:52:59 +0000 HORIGUCHI NAOYA(堀口　直也) <naoya.horiguchi@nec.com> wrote:
>>> On Fri, Mar 26, 2021 at 03:22:49PM +0100, David Hildenbrand wrote:
>>>> On 26.03.21 15:09, David Hildenbrand wrote:
>>>>> Thinking again, wouldn't we get -EFAULT from __get_user_pages_locked()
>>>>> when stumbling over a hwpoisoned page?
>>>>>
>>>>> See __get_user_pages_locked()->__get_user_pages()->faultin_page():
>>>>>
>>>>> handle_mm_fault()->vm_fault_to_errno(), which translates
>>>>> VM_FAULT_HWPOISON to -EFAULT, unless FOLL_HWPOISON is set (-> -EHWPOISON)
>>>>>
>>>>> ?
>>>
>>> We could get -EFAULT, but sometimes not (depends on how memory_failure() fails).
>>>
>>> If we failed to unmap, the page table is not converted to hwpoison entry,
>>> so __get_user_pages_locked() get the hwpoisoned page.
>>>
>>> If we successfully unmapped but failed in truncate_error_page() for example,
>>> the processes mapping the page would get -EFAULT as expected.  But even in
>>> this case, other processes could reach the error page via page cache and
>>> __get_user_pages_locked() for them could return the hwpoisoned page.
>>>
>>>>
>>>> Or doesn't that happen as you describe "But memory_failure() may fail, and
>>>> the process's related pte may not be correctly set invalid" -- but why does
>>>> that happen?
>>>
>>> Simply because memory_failure() doesn't handle some page types like ksm page
>>> and zero page. Or maybe shmem thp also belongs to this class.

Thanks for that info!

>>>
>>>>
>>>> On a similar thought, should get_user_pages() never return a page that has
>>>> HWPoison set? E.g., check also for existing PTEs if the page is hwpoisoned?
>>>
>>> Make sense to me. Maybe inserting hwpoison check into follow_page_pte() and
>>> follow_huge_pmd() would work well.
>>
>> I think we should take more care to broadcast the hwpoison check to other cases,
>> SIGBUS coredump is such a case that it is supposed to not touch the poison page,
>> and if we return NULL for this, the coredump process will get a successful finish.
>>
>> Other cases may also meet the requirements like coredump, but we need to identify it,
>> that's the poison check wrapper's purpose. If not, we may break the integrity of the
>> related action, which may be no better than panic.
> 
> If you worry about regression and would like to make this new behavior conditional,
> we could use FOLL_HWPOISON to specify that the caller is hwpoison-aware so that
> any !FOLL_HWPOISON caller ignores the hwpoison check and works as it does now.
> This approach looks to me helpful because it would encourage developers touching
> gup code to pay attention to FOLL_HWPOISON code.

FOLL_HWPOISON might be the right start, indeed.

HORIGUCHI NAOYA(堀口直也) March 31, 2021, 6:53 a.m. UTC | #8

On Wed, Mar 31, 2021 at 07:07:39AM +0100, Matthew Wilcox wrote:
> On Wed, Mar 31, 2021 at 01:52:59AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > If we successfully unmapped but failed in truncate_error_page() for example,
> > the processes mapping the page would get -EFAULT as expected.  But even in
> > this case, other processes could reach the error page via page cache and
> > __get_user_pages_locked() for them could return the hwpoisoned page.
> 
> How would that happen?  We check PageHWPoison before inserting a page
> into the page tables.  See, eg, filemap_map_pages() and __do_fault().

Ah, you're right, that never happens. I misread the code.
Thanks for correcting me.

David Hildenbrand March 31, 2021, 7:05 a.m. UTC | #9

On 31.03.21 08:53, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Wed, Mar 31, 2021 at 07:07:39AM +0100, Matthew Wilcox wrote:
>> On Wed, Mar 31, 2021 at 01:52:59AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
>>> If we successfully unmapped but failed in truncate_error_page() for example,
>>> the processes mapping the page would get -EFAULT as expected.  But even in
>>> this case, other processes could reach the error page via page cache and
>>> __get_user_pages_locked() for them could return the hwpoisoned page.
>>
>> How would that happen?  We check PageHWPoison before inserting a page
>> into the page tables.  See, eg, filemap_map_pages() and __do_fault().
> 
> Ah, you're right, that never happens. I misread the code.
> Thanks for correcting me.
> 

I'm wondering if there is a small race window, if we poison a page while 
inserting it.

yaoaili [么爱利] March 31, 2021, 7:07 a.m. UTC | #10

On Wed, 31 Mar 2021 08:44:53 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 31.03.21 06:32, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Wed, Mar 31, 2021 at 10:43:36AM +0800, Aili Yao wrote:  
> >> On Wed, 31 Mar 2021 01:52:59 +0000 HORIGUCHI NAOYA(堀口　直也) <naoya.horiguchi@nec.com> wrote:  
> >>> On Fri, Mar 26, 2021 at 03:22:49PM +0100, David Hildenbrand wrote:  
> >>>> On 26.03.21 15:09, David Hildenbrand wrote:  
> >>>>> Thinking again, wouldn't we get -EFAULT from __get_user_pages_locked()
> >>>>> when stumbling over a hwpoisoned page?
> >>>>>
> >>>>> See __get_user_pages_locked()->__get_user_pages()->faultin_page():
> >>>>>
> >>>>> handle_mm_fault()->vm_fault_to_errno(), which translates
> >>>>> VM_FAULT_HWPOISON to -EFAULT, unless FOLL_HWPOISON is set (-> -EHWPOISON)
> >>>>>
> >>>>> ?  
> >>>
> >>> We could get -EFAULT, but sometimes not (depends on how memory_failure() fails).
> >>>
> >>> If we failed to unmap, the page table is not converted to hwpoison entry,
> >>> so __get_user_pages_locked() get the hwpoisoned page.
> >>>
> >>> If we successfully unmapped but failed in truncate_error_page() for example,
> >>> the processes mapping the page would get -EFAULT as expected.  But even in
> >>> this case, other processes could reach the error page via page cache and
> >>> __get_user_pages_locked() for them could return the hwpoisoned page.
> >>>  
> >>>>
> >>>> Or doesn't that happen as you describe "But memory_failure() may fail, and
> >>>> the process's related pte may not be correctly set invalid" -- but why does
> >>>> that happen?  
> >>>
> >>> Simply because memory_failure() doesn't handle some page types like ksm page
> >>> and zero page. Or maybe shmem thp also belongs to this class.  
> 
> Thanks for that info!
> 
> >>>  
> >>>>
> >>>> On a similar thought, should get_user_pages() never return a page that has
> >>>> HWPoison set? E.g., check also for existing PTEs if the page is hwpoisoned?  
> >>>
> >>> Makes sense to me. Maybe inserting a hwpoison check into follow_page_pte() and
> >>> follow_huge_pmd() would work well.  
> >>
> >> I think we should take more care to broadcast the hwpoison check to other cases,
> >> SIGBUS coredump is such a case that it is supposed to not touch the poison page,
> >> and if we return NULL for this, the coredump process will get a successful finish.
> >>
> >> Other cases may also meet the requirements like coredump, but we need to identify it,
> >> that's the poison check wrapper's purpose. If not, we may break the integrity of the
> >> related action, which may be no better than panic.  
> > 
> > If you worry about regression and would like to make this new behavior conditional,
> > we could use FOLL_HWPOISON to specify that the caller is hwpoison-aware so that
> > any !FOLL_HWPOISON caller ignores the hwpoison check and works as it does now.
> > This approach looks to me helpful because it would encourage developers touching
> > gup code to pay attention to FOLL_HWPOISON code.  
> 
> FOLL_HWPOISON might be the right start, indeed.
> 

Got it, thanks!
I will dig into this further.
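
The errno translation David refers to above can be modelled in userspace roughly as follows. This is only a sketch under assumptions: the flag values are illustrative stand-ins rather than the kernel's definitions, and model_vm_fault_to_errno() and model_demo() are hypothetical names used here for illustration. It shows that a hwpoison fault becomes -EFAULT unless the caller passed FOLL_HWPOISON, in which case it becomes -EHWPOISON.

```c
#include <assert.h>
#include <errno.h>

#ifndef EHWPOISON
#define EHWPOISON 133	/* Linux value, defined here for other libcs */
#endif

/* Illustrative stand-ins for the kernel's fault and gup flag bits. */
#define VM_FAULT_OOM		0x01u
#define VM_FAULT_SIGBUS		0x02u
#define VM_FAULT_HWPOISON	0x10u
#define VM_FAULT_HWPOISON_LARGE	0x20u
#define VM_FAULT_SIGSEGV	0x40u
#define FOLL_HWPOISON		0x100u

/*
 * Hypothetical userspace model of the fault-to-errno translation
 * discussed in the thread. Returns 0 when the fault result carries
 * no error, otherwise a negative errno.
 */
int model_vm_fault_to_errno(unsigned int vm_fault, unsigned int foll_flags)
{
	if (vm_fault & VM_FAULT_OOM)
		return -ENOMEM;
	if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
		return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
	if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
		return -EFAULT;
	return 0;
}

/* Usage: a hwpoison fault with and without a hwpoison-aware caller. */
void model_demo(void)
{
	assert(model_vm_fault_to_errno(VM_FAULT_HWPOISON, 0) == -EFAULT);
	assert(model_vm_fault_to_errno(VM_FAULT_HWPOISON, FOLL_HWPOISON) == -EHWPOISON);
}
```

Note that this path is only taken when a fault actually happens. As Naoya points out, if memory_failure() failed to unmap the page, the pte is still valid, no fault occurs, and the poisoned page is returned directly.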

yaoaili [么爱利] April 1, 2021, 2:31 a.m. UTC | #11

On Wed, 31 Mar 2021 08:44:53 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 31.03.21 06:32, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Wed, Mar 31, 2021 at 10:43:36AM +0800, Aili Yao wrote:  
> >> On Wed, 31 Mar 2021 01:52:59 +0000 HORIGUCHI NAOYA(堀口　直也) <naoya.horiguchi@nec.com> wrote:  
> >>> On Fri, Mar 26, 2021 at 03:22:49PM +0100, David Hildenbrand wrote:  
> >>>> On 26.03.21 15:09, David Hildenbrand wrote:  
> >>>>> On 22.03.21 12:33, Aili Yao wrote:  
> >>>>>> When we do a coredump for a user process signal, it may be a SIGBUS signal
> >>>>>> with a BUS_MCEERR_AR or BUS_MCEERR_AO code, which means the signal
> >>>>>> resulted from an ECC memory failure such as SRAR or SRAO. We expect the memory
> >>>>>> recovery work to have finished correctly, so that get_dump_page() will not
> >>>>>> return the error page, as its process pte is set invalid by
> >>>>>> memory_failure().
> >>>>>>
> >>>>>> But memory_failure() may fail, and the process's related pte may not be
> >>>>>> correctly set invalid. With the current code, we then return the poison page,
> >>>>>> get it dumped, and that leads to a system panic as it's in kernel code.
> >>>>>>
> >>>>>> So check the hwpoison status in get_dump_page(), and if TRUE, return NULL.
> >>>>>>
> >>>>>> There may be other scenarios where it is also better to check the hwpoison
> >>>>>> status and not panic, so make a wrapper for this check. Thanks to David's
> >>>>>> suggestion (<david@redhat.com>).
> >>>>>>
> >>>>>> Link: https://lkml.kernel.org/r/20210319104437.6f30e80d@alex-virtual-machine
> >>>>>> Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
> >>>>>> Cc: David Hildenbrand <david@redhat.com>
> >>>>>> Cc: Matthew Wilcox <willy@infradead.org>
> >>>>>> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
> >>>>>> Cc: Oscar Salvador <osalvador@suse.de>
> >>>>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> >>>>>> Cc: Aili Yao <yaoaili@kingsoft.com>
> >>>>>> Cc: stable@vger.kernel.org
> >>>>>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> >>>>>> ---
> >>>>>>     mm/gup.c      |  4 ++++
> >>>>>>     mm/internal.h | 20 ++++++++++++++++++++
> >>>>>>     2 files changed, 24 insertions(+)
> >>>>>>
> >>>>>> diff --git a/mm/gup.c b/mm/gup.c
> >>>>>> index e4c224c..6f7e1aa 100644
> >>>>>> --- a/mm/gup.c
> >>>>>> +++ b/mm/gup.c
> >>>>>> @@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
> >>>>>>     				      FOLL_FORCE | FOLL_DUMP | FOLL_GET);
> >>>>>>     	if (locked)
> >>>>>>     		mmap_read_unlock(mm);  
> >>>>>
> >>>>> Thinking again, wouldn't we get -EFAULT from __get_user_pages_locked()
> >>>>> when stumbling over a hwpoisoned page?
> >>>>>
> >>>>> See __get_user_pages_locked()->__get_user_pages()->faultin_page():
> >>>>>
> >>>>> handle_mm_fault()->vm_fault_to_errno(), which translates
> >>>>> VM_FAULT_HWPOISON to -EFAULT, unless FOLL_HWPOISON is set (-> -EHWPOISON)
> >>>>>
> >>>>> ?  
> >>>
> >>> We could get -EFAULT, but sometimes not (depends on how memory_failure() fails).
> >>>
> >>> If we failed to unmap, the page table entry is not converted to a hwpoison entry,
> >>> so __get_user_pages_locked() gets the hwpoisoned page.
> >>>
> >>> If we successfully unmapped but failed in truncate_error_page() for example,
> >>> the processes mapping the page would get -EFAULT as expected.  But even in
> >>> this case, other processes could reach the error page via page cache and
> >>> __get_user_pages_locked() for them could return the hwpoisoned page.
> >>>  
> >>>>
> >>>> Or doesn't that happen as you describe "But memory_failure() may fail, and
> >>>> the process's related pte may not be correctly set invalid" -- but why does
> >>>> that happen?  
> >>>
> >>> Simply because memory_failure() doesn't handle some page types like ksm page
> >>> and zero page. Or maybe shmem thp also belongs to this class.  
> 
> Thanks for that info!
> 
> >>>  
> >>>>
> >>>> On a similar thought, should get_user_pages() never return a page that has
> >>>> HWPoison set? E.g., check also for existing PTEs if the page is hwpoisoned?  
> >>>
> >>> Makes sense to me. Maybe inserting a hwpoison check into follow_page_pte() and
> >>> follow_huge_pmd() would work well.  
> >>
> >> I think we should take more care to broadcast the hwpoison check to other cases,
> >> SIGBUS coredump is such a case that it is supposed to not touch the poison page,
> >> and if we return NULL for this, the coredump process will get a successful finish.
> >>
> >> Other cases may also meet the requirements like coredump, but we need to identify it,
> >> that's the poison check wrapper's purpose. If not, we may break the integrity of the
> >> related action, which may be no better than panic.  

I think my logic here was wrong. Before this patch, the code already returns an error for
pages whose user pte has been set invalid because of hwpoison, and this patch only adds a
missing scenario for the same purpose. Without this patch, callers can already fail in gup.c
in the hwpoison case, and I think that's OK as that behavior is already there. The same rule
should then apply to this missing case. I was wrong; David, Naoya, you are right!

Thanks!

> > If you worry about regression and would like to make this new behavior conditional,
> > we could use FOLL_HWPOISON to specify that the caller is hwpoison-aware so that
> > any !FOLL_HWPOISON caller ignores the hwpoison check and works as it does now.
> > This approach looks to me helpful because it would encourage developers touching
> > gup code to pay attention to FOLL_HWPOISON code.  
> 
> FOLL_HWPOISON might be the right start, indeed.
> 
I think we may still need this flag so that a different error code can be returned for this case.
I will change the patch accordingly!
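
The conditional behavior Naoya suggests above could look roughly like the following userspace model. It is a sketch under assumptions, not the patch itself: struct model_page, model_check_mapped_page() and the flag value are hypothetical, and returning -EHWPOISON for hwpoison-aware callers is an assumption that mirrors the existing fault-path convention.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef EHWPOISON
#define EHWPOISON 133	/* Linux value, defined here for other libcs */
#endif

#define FOLL_HWPOISON 0x100u	/* illustrative value, not the kernel header */

/* Hypothetical stand-in for struct page: only the poison bit is modelled. */
struct model_page {
	bool hwpoison;
};

/*
 * Hypothetical check for a page found through a still-valid pte.
 * Returns 0 if the page may be handed to the caller, or a negative
 * errno if a hwpoison-aware caller must not receive it.
 */
int model_check_mapped_page(const struct model_page *page,
			    unsigned int foll_flags)
{
	assert(page != NULL);

	if (!page->hwpoison)
		return 0;

	/* Only callers that opted in see the new behavior. */
	if (foll_flags & FOLL_HWPOISON)
		return -EHWPOISON;

	/* !FOLL_HWPOISON callers keep working as they do now. */
	return 0;
}
```

With this shape, a caller such as get_dump_page() would opt in by passing the flag and could then skip the page, while existing callers that do not pass it see no change.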
