Message ID | 20230517160948.811355-2-jiaqiyan@google.com (mailing list archive) |
---|---|
State | New |
Series | Improve hugetlbfs read on HWPOISON hugepages |
On 05/17/23 16:09, Jiaqi Yan wrote: > Adds the functionality to search a subpage's corresponding raw_hwp_page > in hugetlb page's HWPOISON list. This functionality can also tell if a > subpage is a raw HWPOISON page. > > Exports this functionality to be immediately used in the read operation > for hugetlbfs. > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > --- > include/linux/mm.h | 23 +++++++++++++++++++++++ > mm/memory-failure.c | 26 ++++++++++++++++---------- > 2 files changed, 39 insertions(+), 10 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 27ce77080c79..f191a4119719 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h Any reason why you decided to add the following to linux/mm.h instead of linux/hugetlb.h? Since it is hugetlb specific I would have thought hugetlb.h was more appropriate. > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > */ > extern const struct attribute_group memory_failure_attr_group; > > +#ifdef CONFIG_HUGETLB_PAGE > +/* > + * Struct raw_hwp_page represents information about "raw error page", > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > + */ > +struct raw_hwp_page { > + struct llist_node node; > + struct page *page; > +}; > + > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > +{ > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > +} > + > +/* > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. > + */ > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > + struct page *subpage); > +#endif > + > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > extern void clear_huge_page(struct page *page, > unsigned long addr_hint, > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 5b663eca1f29..c49e6c2d1f07 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > #endif /* CONFIG_FS_DAX */ > > #ifdef CONFIG_HUGETLB_PAGE > -/* > - * Struct raw_hwp_page represents information about "raw error page", > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > - */ > -struct raw_hwp_page { > - struct llist_node node; > - struct page *page; > -}; > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > + struct page *subpage) > { > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > + struct llist_node *t, *tnode; > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > + struct raw_hwp_page *hwp_page = NULL; > + struct raw_hwp_page *p; > + > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a raw_hwp_list. This is indicated by the hugetlb page specific flag RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). Looks like this routine does not consider that case. Seems like it should always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() is true?
On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 05/17/23 16:09, Jiaqi Yan wrote: > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > subpage is a raw HWPOISON page. > > > > Exports this functionality to be immediately used in the read operation > > for hugetlbfs. > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > --- > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index 27ce77080c79..f191a4119719 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > Any reason why you decided to add the following to linux/mm.h instead of > linux/hugetlb.h? Since it is hugetlb specific I would have thought > hugetlb.h was more appropriate. > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > */ > > extern const struct attribute_group memory_failure_attr_group; > > > > +#ifdef CONFIG_HUGETLB_PAGE > > +/* > > + * Struct raw_hwp_page represents information about "raw error page", > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > + */ > > +struct raw_hwp_page { > > + struct llist_node node; > > + struct page *page; > > +}; > > + > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > +{ > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > +} > > + > > +/* > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. > > + */ > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > + struct page *subpage); > > +#endif > > + > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > extern void clear_huge_page(struct page *page, > > unsigned long addr_hint, > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > index 5b663eca1f29..c49e6c2d1f07 100644 > > --- a/mm/memory-failure.c > > +++ b/mm/memory-failure.c > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > #endif /* CONFIG_FS_DAX */ > > > > #ifdef CONFIG_HUGETLB_PAGE > > -/* > > - * Struct raw_hwp_page represents information about "raw error page", > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > - */ > > -struct raw_hwp_page { > > - struct llist_node node; > > - struct page *page; > > -}; > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > + struct page *subpage) > > { > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > + struct llist_node *t, *tnode; > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > + struct raw_hwp_page *hwp_page = NULL; > > + struct raw_hwp_page *p; > > + > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > raw_hwp_list. This is indicated by the hugetlb page specific flag > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > Looks like this routine does not consider that case. Seems like it should > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > is true? Thanks for catching this. I wonder should this routine consider RawHwpUnreliable or should the caller do. 
find_raw_hwp_page now returns the raw_hwp_page* from the llist entry to the caller (valid at the moment of the call), but once RawHwpUnreliable is set, all the raw_hwp_page entries in the llist will be kfree()d, and the returned value becomes a dangling pointer for the caller (if the caller holds on to that pointer long enough). Maybe returning a bool would be safer for the caller? If the routine returns bool, then checking RawHwpUnreliable can definitely be within the routine. Another option is that this routine simply does one thing: find a raw_hwp_page in raw_hwp_list for a subpage. But then the caller needs to 1) test RawHwpUnreliable before calling into the routine, and 2) test RawHwpUnreliable before accessing the returned raw_hwp_page*. I think the 2nd option will be error-prone and the 1st option is the better one. Maybe I am over-thinking. What do you think? > -- > Mike Kravetz > > > + p = container_of(tnode, struct raw_hwp_page, node); > > + if (subpage == p->page) { > > + hwp_page = p; > > + break; > > + } > > + } > > + > > + return hwp_page; > > } > > > > static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag) > > -- > > 2.40.1.606.ga4b1b128d6-goog > >
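A minimal sketch of the bool-returning variant being discussed, assuming it would live in mm/memory-failure.c next to the existing raw_hwp_list helpers; the name is_raw_hwp_subpage() is only illustrative, and the synchronization questions raised below still apply:

static bool is_raw_hwp_subpage(struct folio *folio, struct page *subpage)
{
	struct llist_head *raw_hwp_head;
	struct llist_node *tnode, *t;
	struct raw_hwp_page *p;

	/*
	 * If the raw error pages could not be recorded reliably, treat
	 * every subpage as possibly poisoned, per Mike's suggestion.
	 */
	if (folio_test_hugetlb_raw_hwp_unreliable(folio))
		return true;

	raw_hwp_head = raw_hwp_list_head(folio);
	llist_for_each_safe(tnode, t, raw_hwp_head->first) {
		p = container_of(tnode, struct raw_hwp_page, node);
		if (subpage == p->page)
			return true;
	}

	return false;
}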
On 05/19/23 13:54, Jiaqi Yan wrote: > On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > On 05/17/23 16:09, Jiaqi Yan wrote: > > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > > subpage is a raw HWPOISON page. > > > > > > Exports this functionality to be immediately used in the read operation > > > for hugetlbfs. > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > --- > > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > index 27ce77080c79..f191a4119719 100644 > > > --- a/include/linux/mm.h > > > +++ b/include/linux/mm.h > > > > Any reason why you decided to add the following to linux/mm.h instead of > > linux/hugetlb.h? Since it is hugetlb specific I would have thought > > hugetlb.h was more appropriate. > > > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > > */ > > > extern const struct attribute_group memory_failure_attr_group; > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > +/* > > > + * Struct raw_hwp_page represents information about "raw error page", > > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > + */ > > > +struct raw_hwp_page { > > > + struct llist_node node; > > > + struct page *page; > > > +}; > > > + > > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > +{ > > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > +} > > > + > > > +/* > > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. > > > + */ > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > + struct page *subpage); > > > +#endif > > > + > > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > > extern void clear_huge_page(struct page *page, > > > unsigned long addr_hint, > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > index 5b663eca1f29..c49e6c2d1f07 100644 > > > --- a/mm/memory-failure.c > > > +++ b/mm/memory-failure.c > > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > > #endif /* CONFIG_FS_DAX */ > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > -/* > > > - * Struct raw_hwp_page represents information about "raw error page", > > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > - */ > > > -struct raw_hwp_page { > > > - struct llist_node node; > > > - struct page *page; > > > -}; > > > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > + struct page *subpage) > > > { > > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > + struct llist_node *t, *tnode; > > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > > + struct raw_hwp_page *hwp_page = NULL; > > > + struct raw_hwp_page *p; > > > + > > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > > raw_hwp_list. This is indicated by the hugetlb page specific flag > > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > > > Looks like this routine does not consider that case. 
Seems like it should > > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > > is true? > > Thanks for catching this. I wonder should this routine consider > RawHwpUnreliable or should the caller do. > > find_raw_hwp_page now returns raw_hwp_page* in the llist entry to > caller (valid one at the moment), but once RawHwpUnreliable is set, > all the raw_hwp_page in the llist will be kfree(), and the returned > value becomes dangling pointer to caller (if the caller holds that > caller long enough). Maybe returning a bool would be safer to the > caller? If the routine returns bool, then checking RawHwpUnreliable > can definitely be within the routine. I think the check for RawHwpUnreliable should be within this routine. Looking closer at the code, I do not see any way to synchronize this. It looks like manipulation in the memory-failure code would be synchronized via the mf_mutex. However, I do not see how traversal and freeing of the raw_hwp_list called from __update_and_free_hugetlb_folio is synchronized against memory-failure code modifying the list. Naoya, can you provide some thoughts? > > Another option is, this routine simply doesn one thing: find a > raw_hwp_page in raw_hwp_list for a subpage. But the caller needs to 1) > test RawHwpUnreliable before calls into the routine, and 2) test > RawHwpUnreliable before access returned raw_hwp_page*. I think 2nd > option will be error-prone and the 1st option is a better one. > > Maybe I am over-thinking. What do you think? I think racing code accessing the raw_hwp_list is very unlikely. However, it is possible and should be considered.
On Fri, May 19, 2023 at 03:42:14PM -0700, Mike Kravetz wrote: > On 05/19/23 13:54, Jiaqi Yan wrote: > > On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > On 05/17/23 16:09, Jiaqi Yan wrote: > > > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > > > subpage is a raw HWPOISON page. > > > > > > > > Exports this functionality to be immediately used in the read operation > > > > for hugetlbfs. > > > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > > --- > > > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > index 27ce77080c79..f191a4119719 100644 > > > > --- a/include/linux/mm.h > > > > +++ b/include/linux/mm.h > > > > > > Any reason why you decided to add the following to linux/mm.h instead of > > > linux/hugetlb.h? Since it is hugetlb specific I would have thought > > > hugetlb.h was more appropriate. > > > > > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > > > */ > > > > extern const struct attribute_group memory_failure_attr_group; > > > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > > +/* > > > > + * Struct raw_hwp_page represents information about "raw error page", > > > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > + */ > > > > +struct raw_hwp_page { > > > > + struct llist_node node; > > > > + struct page *page; > > > > +}; > > > > + > > > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > +{ > > > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > +} > > > > + > > > > +/* > > > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. > > > > + */ > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > + struct page *subpage); > > > > +#endif > > > > + > > > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > > > extern void clear_huge_page(struct page *page, > > > > unsigned long addr_hint, > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > > index 5b663eca1f29..c49e6c2d1f07 100644 > > > > --- a/mm/memory-failure.c > > > > +++ b/mm/memory-failure.c > > > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > > > #endif /* CONFIG_FS_DAX */ > > > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > > -/* > > > > - * Struct raw_hwp_page represents information about "raw error page", > > > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. 
> > > > - */ > > > > -struct raw_hwp_page { > > > > - struct llist_node node; > > > > - struct page *page; > > > > -}; > > > > > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > + struct page *subpage) > > > > { > > > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > + struct llist_node *t, *tnode; > > > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > > > + struct raw_hwp_page *hwp_page = NULL; > > > > + struct raw_hwp_page *p; > > > > + > > > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > > > > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > > > raw_hwp_list. This is indicated by the hugetlb page specific flag > > > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > > > > > Looks like this routine does not consider that case. Seems like it should > > > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > > > is true? > > > > Thanks for catching this. I wonder should this routine consider > > RawHwpUnreliable or should the caller do. > > > > find_raw_hwp_page now returns raw_hwp_page* in the llist entry to > > caller (valid one at the moment), but once RawHwpUnreliable is set, > > all the raw_hwp_page in the llist will be kfree(), and the returned > > value becomes dangling pointer to caller (if the caller holds that > > caller long enough). Maybe returning a bool would be safer to the > > caller? If the routine returns bool, then checking RawHwpUnreliable > > can definitely be within the routine. > > I think the check for RawHwpUnreliable should be within this routine. > Looking closer at the code, I do not see any way to synchronize this. > It looks like manipulation in the memory-failure code would be > synchronized via the mf_mutex. However, I do not see how traversal and > freeing of the raw_hwp_list called from __update_and_free_hugetlb_folio > is synchronized against memory-failure code modifying the list. > > Naoya, can you provide some thoughts? Thanks for elaborating the issue. I think that making find_raw_hwp_page() and folio_clear_hugetlb_hwpoison() do their works within mf_mutex can be one solution. try_memory_failure_hugetlb(), one of the callers of folio_clear_hugetlb_hwpoison(), already calls it within mf_mutex, so some wrapper might be needed to implement calling path from __update_and_free_hugetlb_folio() to take mf_mutex. It might be a concern that mf_mutex is a big lock to cover overall hwpoison subsystem, but I think that the impact is not so big if the changed code paths take mf_mutex only after checking hwpoisoned hugepage. Maybe using folio_lock to synchronize accesses to the raw_hwp_list could be possible, but currently __get_huge_page_for_hwpoison() calls folio_set_hugetlb_hwpoison() without folio_lock, so this approach needs update on locking rule and it sounds more error-prone to me. Thanks, Naoya Horiguchi > > > > > Another option is, this routine simply doesn one thing: find a > > raw_hwp_page in raw_hwp_list for a subpage. But the caller needs to 1) > > test RawHwpUnreliable before calls into the routine, and 2) test > > RawHwpUnreliable before access returned raw_hwp_page*. I think 2nd > > option will be error-prone and the 1st option is a better one. > > > > Maybe I am over-thinking. What do you think? > > I think racing code accessing the raw_hwp_list is very unlikely. > However, it is possible and should be considered. 
> -- > Mike Kravetz > > > > > > -- > > > Mike Kravetz > > > > > > > + p = container_of(tnode, struct raw_hwp_page, node); > > > > + if (subpage == p->page) { > > > > + hwp_page = p; > > > > + break; > > > > + } > > > > + } > > > > + > > > > + return hwp_page; > > > > } > > > > > > > > static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag) > > > > -- > > > > 2.40.1.606.ga4b1b128d6-goog > > > >
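A rough sketch of the wrapper Naoya describes, assuming it lives in mm/memory-failure.c where the static mf_mutex is visible; the function name is hypothetical, and following the note above it takes mf_mutex only after checking that the hugepage is actually hwpoisoned:

/* Hypothetical entry point for the __update_and_free_hugetlb_folio() path. */
void folio_clear_hugetlb_hwpoison_locked(struct folio *folio)
{
	if (!folio_test_hwpoison(folio))
		return;

	mutex_lock(&mf_mutex);
	folio_clear_hugetlb_hwpoison(folio);
	mutex_unlock(&mf_mutex);
}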
On Sun, May 21, 2023 at 9:50 PM HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com> wrote: > > On Fri, May 19, 2023 at 03:42:14PM -0700, Mike Kravetz wrote: > > On 05/19/23 13:54, Jiaqi Yan wrote: > > > On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > > > On 05/17/23 16:09, Jiaqi Yan wrote: > > > > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > > > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > > > > subpage is a raw HWPOISON page. > > > > > > > > > > Exports this functionality to be immediately used in the read operation > > > > > for hugetlbfs. > > > > > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > > > --- > > > > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > > > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > > > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > index 27ce77080c79..f191a4119719 100644 > > > > > --- a/include/linux/mm.h > > > > > +++ b/include/linux/mm.h > > > > > > > > Any reason why you decided to add the following to linux/mm.h instead of > > > > linux/hugetlb.h? Since it is hugetlb specific I would have thought > > > > hugetlb.h was more appropriate. > > > > > > > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > > > > */ > > > > > extern const struct attribute_group memory_failure_attr_group; > > > > > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > > > +/* > > > > > + * Struct raw_hwp_page represents information about "raw error page", > > > > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > + */ > > > > > +struct raw_hwp_page { > > > > > + struct llist_node node; > > > > > + struct page *page; > > > > > +}; > > > > > + > > > > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > +{ > > > > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > +} > > > > > + > > > > > +/* > > > > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > > > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. > > > > > + */ > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > + struct page *subpage); > > > > > +#endif > > > > > + > > > > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > > > > extern void clear_huge_page(struct page *page, > > > > > unsigned long addr_hint, > > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > > > index 5b663eca1f29..c49e6c2d1f07 100644 > > > > > --- a/mm/memory-failure.c > > > > > +++ b/mm/memory-failure.c > > > > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > > > > #endif /* CONFIG_FS_DAX */ > > > > > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > > > -/* > > > > > - * Struct raw_hwp_page represents information about "raw error page", > > > > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. 
> > > > > - */ > > > > > -struct raw_hwp_page { > > > > > - struct llist_node node; > > > > > - struct page *page; > > > > > -}; > > > > > > > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > + struct page *subpage) > > > > > { > > > > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > + struct llist_node *t, *tnode; > > > > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > > > > + struct raw_hwp_page *hwp_page = NULL; > > > > > + struct raw_hwp_page *p; > > > > > + > > > > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > > > > > > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > > > > raw_hwp_list. This is indicated by the hugetlb page specific flag > > > > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > > > > > > > Looks like this routine does not consider that case. Seems like it should > > > > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > > > > is true? > > > > > > Thanks for catching this. I wonder should this routine consider > > > RawHwpUnreliable or should the caller do. > > > > > > find_raw_hwp_page now returns raw_hwp_page* in the llist entry to > > > caller (valid one at the moment), but once RawHwpUnreliable is set, > > > all the raw_hwp_page in the llist will be kfree(), and the returned > > > value becomes dangling pointer to caller (if the caller holds that > > > caller long enough). Maybe returning a bool would be safer to the > > > caller? If the routine returns bool, then checking RawHwpUnreliable > > > can definitely be within the routine. > > > > I think the check for RawHwpUnreliable should be within this routine. > > Looking closer at the code, I do not see any way to synchronize this. > > It looks like manipulation in the memory-failure code would be > > synchronized via the mf_mutex. However, I do not see how traversal and > > freeing of the raw_hwp_list called from __update_and_free_hugetlb_folio > > is synchronized against memory-failure code modifying the list. > > > > Naoya, can you provide some thoughts? > > Thanks for elaborating the issue. I think that making find_raw_hwp_page() and > folio_clear_hugetlb_hwpoison() do their works within mf_mutex can be one solution. > try_memory_failure_hugetlb(), one of the callers of folio_clear_hugetlb_hwpoison(), > already calls it within mf_mutex, so some wrapper might be needed to implement > calling path from __update_and_free_hugetlb_folio() to take mf_mutex. > > It might be a concern that mf_mutex is a big lock to cover overall hwpoison > subsystem, but I think that the impact is not so big if the changed code paths > take mf_mutex only after checking hwpoisoned hugepage. Maybe using folio_lock > to synchronize accesses to the raw_hwp_list could be possible, but currently > __get_huge_page_for_hwpoison() calls folio_set_hugetlb_hwpoison() without > folio_lock, so this approach needs update on locking rule and it sounds more > error-prone to me. Thanks Naoya, since memory_failure is the main user of raw_hwp_list, I agree mf_mutex could help to sync its two del_all operations (one from try_memory_failure_hugetlb and one from __update_and_free_hugetlb_folio). I still want to ask a perhaps stupid question, somewhat related to how to implement find_raw_hwp_page() correctly. It seems llist_for_each_safe should only be used to traverse after list entries already *deleted* via llist_del_all. 
But the llist_for_each_safe calls in memory_failure today is used *directly* on the raw_hwp_list. This is quite different from other users of llist_for_each_safe (for example, kernel/bpf/memalloc.c). Why is it correct? I guess mostly because they are sync-ed under mf_mutex (except the missing coverage on __update_and_free_hugetlb_folio)? > > Thanks, > Naoya Horiguchi > > > > > > > > > Another option is, this routine simply doesn one thing: find a > > > raw_hwp_page in raw_hwp_list for a subpage. But the caller needs to 1) > > > test RawHwpUnreliable before calls into the routine, and 2) test > > > RawHwpUnreliable before access returned raw_hwp_page*. I think 2nd > > > option will be error-prone and the 1st option is a better one. > > > > > > Maybe I am over-thinking. What do you think? > > > > I think racing code accessing the raw_hwp_list is very unlikely. > > However, it is possible and should be considered. > > -- > > Mike Kravetz > > > > > > > > > -- > > > > Mike Kravetz > > > > > > > > > + p = container_of(tnode, struct raw_hwp_page, node); > > > > > + if (subpage == p->page) { > > > > > + hwp_page = p; > > > > > + break; > > > > > + } > > > > > + } > > > > > + > > > > > + return hwp_page; > > > > > } > > > > > > > > > > static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag) > > > > > -- > > > > > 2.40.1.606.ga4b1b128d6-goog > > > > >
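For contrast, a generic sketch (not code from the patch) of the detach-then-walk consumer pattern the llist API expects, where llist_del_all() privatizes the whole chain before it is traversed and freed; drain_raw_hwp_llist() is just an illustrative name:

static void drain_raw_hwp_llist(struct llist_head *head)
{
	struct llist_node *first, *tnode, *t;
	struct raw_hwp_page *p;

	/* Atomically detach every node; the detached chain is now private. */
	first = llist_del_all(head);
	llist_for_each_safe(tnode, t, first) {
		p = container_of(tnode, struct raw_hwp_page, node);
		kfree(p);
	}
}

The existing memory-failure code instead walks raw_hwp_head->first in place, which is why the mf_mutex coverage question above matters.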
On Mon, May 22, 2023 at 11:22:49AM -0700, Jiaqi Yan wrote: > On Sun, May 21, 2023 at 9:50 PM HORIGUCHI NAOYA(堀口 直也) > <naoya.horiguchi@nec.com> wrote: > > > > On Fri, May 19, 2023 at 03:42:14PM -0700, Mike Kravetz wrote: > > > On 05/19/23 13:54, Jiaqi Yan wrote: > > > > On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > > > > > On 05/17/23 16:09, Jiaqi Yan wrote: > > > > > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > > > > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > > > > > subpage is a raw HWPOISON page. > > > > > > > > > > > > Exports this functionality to be immediately used in the read operation > > > > > > for hugetlbfs. > > > > > > > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > > > > --- > > > > > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > > > > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > > > > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > > index 27ce77080c79..f191a4119719 100644 > > > > > > --- a/include/linux/mm.h > > > > > > +++ b/include/linux/mm.h > > > > > > > > > > Any reason why you decided to add the following to linux/mm.h instead of > > > > > linux/hugetlb.h? Since it is hugetlb specific I would have thought > > > > > hugetlb.h was more appropriate. > > > > > > > > > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > > > > > */ > > > > > > extern const struct attribute_group memory_failure_attr_group; > > > > > > > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > > > > +/* > > > > > > + * Struct raw_hwp_page represents information about "raw error page", > > > > > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > + */ > > > > > > +struct raw_hwp_page { > > > > > > + struct llist_node node; > > > > > > + struct page *page; > > > > > > +}; > > > > > > + > > > > > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > +{ > > > > > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > +} > > > > > > + > > > > > > +/* > > > > > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > > > > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. > > > > > > + */ > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > + struct page *subpage); > > > > > > +#endif > > > > > > + > > > > > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > > > > > extern void clear_huge_page(struct page *page, > > > > > > unsigned long addr_hint, > > > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > > > > index 5b663eca1f29..c49e6c2d1f07 100644 > > > > > > --- a/mm/memory-failure.c > > > > > > +++ b/mm/memory-failure.c > > > > > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > > > > > #endif /* CONFIG_FS_DAX */ > > > > > > > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > > > > -/* > > > > > > - * Struct raw_hwp_page represents information about "raw error page", > > > > > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. 
> > > > > > - */ > > > > > > -struct raw_hwp_page { > > > > > > - struct llist_node node; > > > > > > - struct page *page; > > > > > > -}; > > > > > > > > > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > + struct page *subpage) > > > > > > { > > > > > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > + struct llist_node *t, *tnode; > > > > > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > > > > > + struct raw_hwp_page *hwp_page = NULL; > > > > > > + struct raw_hwp_page *p; > > > > > > + > > > > > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > > > > > > > > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > > > > > raw_hwp_list. This is indicated by the hugetlb page specific flag > > > > > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > > > > > > > > > Looks like this routine does not consider that case. Seems like it should > > > > > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > > > > > is true? > > > > > > > > Thanks for catching this. I wonder should this routine consider > > > > RawHwpUnreliable or should the caller do. > > > > > > > > find_raw_hwp_page now returns raw_hwp_page* in the llist entry to > > > > caller (valid one at the moment), but once RawHwpUnreliable is set, > > > > all the raw_hwp_page in the llist will be kfree(), and the returned > > > > value becomes dangling pointer to caller (if the caller holds that > > > > caller long enough). Maybe returning a bool would be safer to the > > > > caller? If the routine returns bool, then checking RawHwpUnreliable > > > > can definitely be within the routine. > > > > > > I think the check for RawHwpUnreliable should be within this routine. > > > Looking closer at the code, I do not see any way to synchronize this. > > > It looks like manipulation in the memory-failure code would be > > > synchronized via the mf_mutex. However, I do not see how traversal and > > > freeing of the raw_hwp_list called from __update_and_free_hugetlb_folio > > > is synchronized against memory-failure code modifying the list. > > > > > > Naoya, can you provide some thoughts? > > > > Thanks for elaborating the issue. I think that making find_raw_hwp_page() and > > folio_clear_hugetlb_hwpoison() do their works within mf_mutex can be one solution. > > try_memory_failure_hugetlb(), one of the callers of folio_clear_hugetlb_hwpoison(), > > already calls it within mf_mutex, so some wrapper might be needed to implement > > calling path from __update_and_free_hugetlb_folio() to take mf_mutex. > > > > It might be a concern that mf_mutex is a big lock to cover overall hwpoison > > subsystem, but I think that the impact is not so big if the changed code paths > > take mf_mutex only after checking hwpoisoned hugepage. Maybe using folio_lock > > to synchronize accesses to the raw_hwp_list could be possible, but currently > > __get_huge_page_for_hwpoison() calls folio_set_hugetlb_hwpoison() without > > folio_lock, so this approach needs update on locking rule and it sounds more > > error-prone to me. > > Thanks Naoya, since memory_failure is the main user of raw_hwp_list, I > agree mf_mutex could help to sync its two del_all operations (one from > try_memory_failure_hugetlb and one from > __update_and_free_hugetlb_folio). 
> > I still want to ask a perhaps stupid question, somewhat related to how > to implement find_raw_hwp_page() correctly. It seems > llist_for_each_safe should only be used to traverse after list entries > already *deleted* via llist_del_all. But the llist_for_each_safe calls > in memory_failure today is used *directly* on the raw_hwp_list. This > is quite different from other users of llist_for_each_safe (for > example, kernel/bpf/memalloc.c). Oh, I didn't notice that when writing the original code. (I just chose llist_for_each_safe because I just wanted struct llist_node as a singly linked list.) > Why is it correct? I guess mostly > because they are sync-ed under mf_mutex (except the missing coverage > on __update_and_free_hugetlb_folio)? Yes, and there seems to be no good reason to use the macro llist_for_each_safe here. I think it's OK to switch to a simpler one like list_for_each, which is supposed to be called directly. To do this, struct raw_hwp_page needs to have @node typed as struct list_head instead of struct llist_node. Thanks, Naoya Horiguchi
On Mon, May 22, 2023 at 7:43 PM HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com> wrote: > > On Mon, May 22, 2023 at 11:22:49AM -0700, Jiaqi Yan wrote: > > On Sun, May 21, 2023 at 9:50 PM HORIGUCHI NAOYA(堀口 直也) > > <naoya.horiguchi@nec.com> wrote: > > > > > > On Fri, May 19, 2023 at 03:42:14PM -0700, Mike Kravetz wrote: > > > > On 05/19/23 13:54, Jiaqi Yan wrote: > > > > > On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > > > > > > > On 05/17/23 16:09, Jiaqi Yan wrote: > > > > > > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > > > > > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > > > > > > subpage is a raw HWPOISON page. > > > > > > > > > > > > > > Exports this functionality to be immediately used in the read operation > > > > > > > for hugetlbfs. > > > > > > > > > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > > > > > --- > > > > > > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > > > > > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > > > > > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > > > > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > > > index 27ce77080c79..f191a4119719 100644 > > > > > > > --- a/include/linux/mm.h > > > > > > > +++ b/include/linux/mm.h > > > > > > > > > > > > Any reason why you decided to add the following to linux/mm.h instead of > > > > > > linux/hugetlb.h? Since it is hugetlb specific I would have thought > > > > > > hugetlb.h was more appropriate. > > > > > > > > > > > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > > > > > > */ > > > > > > > extern const struct attribute_group memory_failure_attr_group; > > > > > > > > > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > > > > > +/* > > > > > > > + * Struct raw_hwp_page represents information about "raw error page", > > > > > > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > + */ > > > > > > > +struct raw_hwp_page { > > > > > > > + struct llist_node node; > > > > > > > + struct page *page; > > > > > > > +}; > > > > > > > + > > > > > > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > +{ > > > > > > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > +} > > > > > > > + > > > > > > > +/* > > > > > > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > > > > > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. > > > > > > > + */ > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > + struct page *subpage); > > > > > > > +#endif > > > > > > > + > > > > > > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > > > > > > extern void clear_huge_page(struct page *page, > > > > > > > unsigned long addr_hint, > > > > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > > > > > index 5b663eca1f29..c49e6c2d1f07 100644 > > > > > > > --- a/mm/memory-failure.c > > > > > > > +++ b/mm/memory-failure.c > > > > > > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > > > > > > #endif /* CONFIG_FS_DAX */ > > > > > > > > > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > > > > > -/* > > > > > > > - * Struct raw_hwp_page represents information about "raw error page", > > > > > > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. 
> > > > > > > - */ > > > > > > > -struct raw_hwp_page { > > > > > > > - struct llist_node node; > > > > > > > - struct page *page; > > > > > > > -}; > > > > > > > > > > > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > + struct page *subpage) > > > > > > > { > > > > > > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > + struct llist_node *t, *tnode; > > > > > > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > > > > > > + struct raw_hwp_page *hwp_page = NULL; > > > > > > > + struct raw_hwp_page *p; > > > > > > > + > > > > > > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > > > > > > > > > > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > > > > > > raw_hwp_list. This is indicated by the hugetlb page specific flag > > > > > > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > > > > > > > > > > > Looks like this routine does not consider that case. Seems like it should > > > > > > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > > > > > > is true? > > > > > > > > > > Thanks for catching this. I wonder should this routine consider > > > > > RawHwpUnreliable or should the caller do. > > > > > > > > > > find_raw_hwp_page now returns raw_hwp_page* in the llist entry to > > > > > caller (valid one at the moment), but once RawHwpUnreliable is set, > > > > > all the raw_hwp_page in the llist will be kfree(), and the returned > > > > > value becomes dangling pointer to caller (if the caller holds that > > > > > caller long enough). Maybe returning a bool would be safer to the > > > > > caller? If the routine returns bool, then checking RawHwpUnreliable > > > > > can definitely be within the routine. > > > > > > > > I think the check for RawHwpUnreliable should be within this routine. > > > > Looking closer at the code, I do not see any way to synchronize this. > > > > It looks like manipulation in the memory-failure code would be > > > > synchronized via the mf_mutex. However, I do not see how traversal and > > > > freeing of the raw_hwp_list called from __update_and_free_hugetlb_folio > > > > is synchronized against memory-failure code modifying the list. > > > > > > > > Naoya, can you provide some thoughts? > > > > > > Thanks for elaborating the issue. I think that making find_raw_hwp_page() and > > > folio_clear_hugetlb_hwpoison() do their works within mf_mutex can be one solution. > > > try_memory_failure_hugetlb(), one of the callers of folio_clear_hugetlb_hwpoison(), > > > already calls it within mf_mutex, so some wrapper might be needed to implement > > > calling path from __update_and_free_hugetlb_folio() to take mf_mutex. > > > > > > It might be a concern that mf_mutex is a big lock to cover overall hwpoison > > > subsystem, but I think that the impact is not so big if the changed code paths > > > take mf_mutex only after checking hwpoisoned hugepage. Maybe using folio_lock > > > to synchronize accesses to the raw_hwp_list could be possible, but currently > > > __get_huge_page_for_hwpoison() calls folio_set_hugetlb_hwpoison() without > > > folio_lock, so this approach needs update on locking rule and it sounds more > > > error-prone to me. 
> > > > Thanks Naoya, since memory_failure is the main user of raw_hwp_list, I > > agree mf_mutex could help to sync its two del_all operations (one from > > try_memory_failure_hugetlb and one from > > __update_and_free_hugetlb_folio). > > > > I still want to ask a perhaps stupid question, somewhat related to how > > to implement find_raw_hwp_page() correctly. It seems > > llist_for_each_safe should only be used to traverse after list entries > > already *deleted* via llist_del_all. But the llist_for_each_safe calls > > in memory_failure today is used *directly* on the raw_hwp_list. This > > is quite different from other users of llist_for_each_safe (for > > example, kernel/bpf/memalloc.c). > > Oh, I don't noticed that when writing the original code. (I just chose > llist_for_each_safe because I just wanted struct llist_node as a singly > linked list.) And maybe because you can avoid doing INIT_LIST_HEAD (which seems doable in folio_set_hugetlb_hwpoison if hugepage is hwpoison-ed for the first time)? > > > Why is it correct? I guess mostly > > because they are sync-ed under mf_mutex (except the missing coverage > > on __update_and_free_hugetlb_folio)? > > Yes, and there seems no good reason to use the macro llist_for_each_safe > here. I think it's OK to switch to simpler one like list_for_each which > should be supposed to be called directly. To do this, struct raw_hwp_page > need to have @node (typed struct list_head instead of struct llist_node). I will start to work on a separate patch to switch to list_head, and make sure operations from __update_and_free_hugetlb_folio and memory_failure are serialized (hopefully without intro new locks and just use mf_mutex). > > Thanks, > Naoya Horiguchi
On Thu, May 25, 2023 at 5:28 PM Jiaqi Yan <jiaqiyan@google.com> wrote: > > On Mon, May 22, 2023 at 7:43 PM HORIGUCHI NAOYA(堀口 直也) > <naoya.horiguchi@nec.com> wrote: > > > > On Mon, May 22, 2023 at 11:22:49AM -0700, Jiaqi Yan wrote: > > > On Sun, May 21, 2023 at 9:50 PM HORIGUCHI NAOYA(堀口 直也) > > > <naoya.horiguchi@nec.com> wrote: > > > > > > > > On Fri, May 19, 2023 at 03:42:14PM -0700, Mike Kravetz wrote: > > > > > On 05/19/23 13:54, Jiaqi Yan wrote: > > > > > > On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > > > > > > > > > On 05/17/23 16:09, Jiaqi Yan wrote: > > > > > > > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > > > > > > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > > > > > > > subpage is a raw HWPOISON page. > > > > > > > > > > > > > > > > Exports this functionality to be immediately used in the read operation > > > > > > > > for hugetlbfs. > > > > > > > > > > > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > > > > > > --- > > > > > > > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > > > > > > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > > > > > > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > > > > > > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > > > > index 27ce77080c79..f191a4119719 100644 > > > > > > > > --- a/include/linux/mm.h > > > > > > > > +++ b/include/linux/mm.h > > > > > > > > > > > > > > Any reason why you decided to add the following to linux/mm.h instead of > > > > > > > linux/hugetlb.h? Since it is hugetlb specific I would have thought > > > > > > > hugetlb.h was more appropriate. > > > > > > > > > > > > > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > > > > > > > */ > > > > > > > > extern const struct attribute_group memory_failure_attr_group; > > > > > > > > > > > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > > > > > > +/* > > > > > > > > + * Struct raw_hwp_page represents information about "raw error page", > > > > > > > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > > + */ > > > > > > > > +struct raw_hwp_page { > > > > > > > > + struct llist_node node; > > > > > > > > + struct page *page; > > > > > > > > +}; > > > > > > > > + > > > > > > > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > > +{ > > > > > > > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > > +} > > > > > > > > + > > > > > > > > +/* > > > > > > > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > > > > > > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. 
> > > > > > > > + */ > > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > > + struct page *subpage); > > > > > > > > +#endif > > > > > > > > + > > > > > > > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > > > > > > > extern void clear_huge_page(struct page *page, > > > > > > > > unsigned long addr_hint, > > > > > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > > > > > > index 5b663eca1f29..c49e6c2d1f07 100644 > > > > > > > > --- a/mm/memory-failure.c > > > > > > > > +++ b/mm/memory-failure.c > > > > > > > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > > > > > > > #endif /* CONFIG_FS_DAX */ > > > > > > > > > > > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > > > > > > -/* > > > > > > > > - * Struct raw_hwp_page represents information about "raw error page", > > > > > > > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > > - */ > > > > > > > > -struct raw_hwp_page { > > > > > > > > - struct llist_node node; > > > > > > > > - struct page *page; > > > > > > > > -}; > > > > > > > > > > > > > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > > + struct page *subpage) > > > > > > > > { > > > > > > > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > > + struct llist_node *t, *tnode; > > > > > > > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > > > > > > > + struct raw_hwp_page *hwp_page = NULL; > > > > > > > > + struct raw_hwp_page *p; > > > > > > > > + > > > > > > > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > > > > > > > > > > > > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > > > > > > > raw_hwp_list. This is indicated by the hugetlb page specific flag > > > > > > > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > > > > > > > > > > > > > Looks like this routine does not consider that case. Seems like it should > > > > > > > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > > > > > > > is true? > > > > > > > > > > > > Thanks for catching this. I wonder should this routine consider > > > > > > RawHwpUnreliable or should the caller do. > > > > > > > > > > > > find_raw_hwp_page now returns raw_hwp_page* in the llist entry to > > > > > > caller (valid one at the moment), but once RawHwpUnreliable is set, > > > > > > all the raw_hwp_page in the llist will be kfree(), and the returned > > > > > > value becomes dangling pointer to caller (if the caller holds that > > > > > > caller long enough). Maybe returning a bool would be safer to the > > > > > > caller? If the routine returns bool, then checking RawHwpUnreliable > > > > > > can definitely be within the routine. > > > > > > > > > > I think the check for RawHwpUnreliable should be within this routine. > > > > > Looking closer at the code, I do not see any way to synchronize this. > > > > > It looks like manipulation in the memory-failure code would be > > > > > synchronized via the mf_mutex. However, I do not see how traversal and > > > > > freeing of the raw_hwp_list called from __update_and_free_hugetlb_folio > > > > > is synchronized against memory-failure code modifying the list. > > > > > > > > > > Naoya, can you provide some thoughts? > > > > > > > > Thanks for elaborating the issue. 
I think that making find_raw_hwp_page() and > > > > folio_clear_hugetlb_hwpoison() do their works within mf_mutex can be one solution. > > > > try_memory_failure_hugetlb(), one of the callers of folio_clear_hugetlb_hwpoison(), > > > > already calls it within mf_mutex, so some wrapper might be needed to implement > > > > calling path from __update_and_free_hugetlb_folio() to take mf_mutex. > > > > > > > > It might be a concern that mf_mutex is a big lock to cover overall hwpoison > > > > subsystem, but I think that the impact is not so big if the changed code paths > > > > take mf_mutex only after checking hwpoisoned hugepage. Maybe using folio_lock > > > > to synchronize accesses to the raw_hwp_list could be possible, but currently > > > > __get_huge_page_for_hwpoison() calls folio_set_hugetlb_hwpoison() without > > > > folio_lock, so this approach needs update on locking rule and it sounds more > > > > error-prone to me. > > > > > > Thanks Naoya, since memory_failure is the main user of raw_hwp_list, I > > > agree mf_mutex could help to sync its two del_all operations (one from > > > try_memory_failure_hugetlb and one from > > > __update_and_free_hugetlb_folio). > > > > > > I still want to ask a perhaps stupid question, somewhat related to how > > > to implement find_raw_hwp_page() correctly. It seems > > > llist_for_each_safe should only be used to traverse after list entries > > > already *deleted* via llist_del_all. But the llist_for_each_safe calls > > > in memory_failure today is used *directly* on the raw_hwp_list. This > > > is quite different from other users of llist_for_each_safe (for > > > example, kernel/bpf/memalloc.c). > > > > Oh, I don't noticed that when writing the original code. (I just chose > > llist_for_each_safe because I just wanted struct llist_node as a singly > > linked list.) > > And maybe because you can avoid doing INIT_LIST_HEAD (which seems > doable in folio_set_hugetlb_hwpoison if hugepage is hwpoison-ed for > the first time)? > > > > > > Why is it correct? I guess mostly > > > because they are sync-ed under mf_mutex (except the missing coverage > > > on __update_and_free_hugetlb_folio)? > > > > Yes, and there seems no good reason to use the macro llist_for_each_safe > > here. I think it's OK to switch to simpler one like list_for_each which > > should be supposed to be called directly. To do this, struct raw_hwp_page > > need to have @node (typed struct list_head instead of struct llist_node). Hi Naoya, a maybe-stupid question on list vs llist: _hugetlb_hwpoison in folio is a void*. struct list_head is composed of two pointers to list_node (prev and next) so folio just can't hold a list_head in the _hugetlb_hwpoison field, right? llist_head on the other hand only contains one pointer to llist_node first. I wonder if this is one of the reason you picked llist instead of list in the first place. The reason I ask is while I was testing my refactor draft, I constantly see the refcount of the 3rd subpage in the folio got corrupted. I am not sure about the exact reason but it feels to me related to the above reason. > > I will start to work on a separate patch to switch to list_head, and > make sure operations from __update_and_free_hugetlb_folio and > memory_failure are serialized (hopefully without intro new locks and > just use mf_mutex). > > > > > Thanks, > > Naoya Horiguchi
On Fri, Jun 09, 2023 at 10:48:47PM -0700, Jiaqi Yan wrote: > On Thu, May 25, 2023 at 5:28 PM Jiaqi Yan <jiaqiyan@google.com> wrote: > > > > On Mon, May 22, 2023 at 7:43 PM HORIGUCHI NAOYA(堀口 直也) > > <naoya.horiguchi@nec.com> wrote: > > > > > > On Mon, May 22, 2023 at 11:22:49AM -0700, Jiaqi Yan wrote: > > > > On Sun, May 21, 2023 at 9:50 PM HORIGUCHI NAOYA(堀口 直也) > > > > <naoya.horiguchi@nec.com> wrote: > > > > > > > > > > On Fri, May 19, 2023 at 03:42:14PM -0700, Mike Kravetz wrote: > > > > > > On 05/19/23 13:54, Jiaqi Yan wrote: > > > > > > > On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > > > > > > > > > > > On 05/17/23 16:09, Jiaqi Yan wrote: > > > > > > > > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > > > > > > > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > > > > > > > > subpage is a raw HWPOISON page. > > > > > > > > > > > > > > > > > > Exports this functionality to be immediately used in the read operation > > > > > > > > > for hugetlbfs. > > > > > > > > > > > > > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > > > > > > > --- > > > > > > > > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > > > > > > > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > > > > > > > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > > > > > > > > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > > > > > index 27ce77080c79..f191a4119719 100644 > > > > > > > > > --- a/include/linux/mm.h > > > > > > > > > +++ b/include/linux/mm.h > > > > > > > > > > > > > > > > Any reason why you decided to add the following to linux/mm.h instead of > > > > > > > > linux/hugetlb.h? Since it is hugetlb specific I would have thought > > > > > > > > hugetlb.h was more appropriate. > > > > > > > > > > > > > > > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > > > > > > > > */ > > > > > > > > > extern const struct attribute_group memory_failure_attr_group; > > > > > > > > > > > > > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > > > > > > > +/* > > > > > > > > > + * Struct raw_hwp_page represents information about "raw error page", > > > > > > > > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > > > + */ > > > > > > > > > +struct raw_hwp_page { > > > > > > > > > + struct llist_node node; > > > > > > > > > + struct page *page; > > > > > > > > > +}; > > > > > > > > > + > > > > > > > > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > > > +{ > > > > > > > > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > > > +} > > > > > > > > > + > > > > > > > > > +/* > > > > > > > > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > > > > > > > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. 
> > > > > > > > > + */ > > > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > > > + struct page *subpage); > > > > > > > > > +#endif > > > > > > > > > + > > > > > > > > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > > > > > > > > extern void clear_huge_page(struct page *page, > > > > > > > > > unsigned long addr_hint, > > > > > > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > > > > > > > index 5b663eca1f29..c49e6c2d1f07 100644 > > > > > > > > > --- a/mm/memory-failure.c > > > > > > > > > +++ b/mm/memory-failure.c > > > > > > > > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > > > > > > > > #endif /* CONFIG_FS_DAX */ > > > > > > > > > > > > > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > > > > > > > -/* > > > > > > > > > - * Struct raw_hwp_page represents information about "raw error page", > > > > > > > > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > > > - */ > > > > > > > > > -struct raw_hwp_page { > > > > > > > > > - struct llist_node node; > > > > > > > > > - struct page *page; > > > > > > > > > -}; > > > > > > > > > > > > > > > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > > > + struct page *subpage) > > > > > > > > > { > > > > > > > > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > > > + struct llist_node *t, *tnode; > > > > > > > > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > > > > > > > > + struct raw_hwp_page *hwp_page = NULL; > > > > > > > > > + struct raw_hwp_page *p; > > > > > > > > > + > > > > > > > > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > > > > > > > > > > > > > > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > > > > > > > > raw_hwp_list. This is indicated by the hugetlb page specific flag > > > > > > > > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > > > > > > > > > > > > > > > Looks like this routine does not consider that case. Seems like it should > > > > > > > > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > > > > > > > > is true? > > > > > > > > > > > > > > Thanks for catching this. I wonder should this routine consider > > > > > > > RawHwpUnreliable or should the caller do. > > > > > > > > > > > > > > find_raw_hwp_page now returns raw_hwp_page* in the llist entry to > > > > > > > caller (valid one at the moment), but once RawHwpUnreliable is set, > > > > > > > all the raw_hwp_page in the llist will be kfree(), and the returned > > > > > > > value becomes dangling pointer to caller (if the caller holds that > > > > > > > caller long enough). Maybe returning a bool would be safer to the > > > > > > > caller? If the routine returns bool, then checking RawHwpUnreliable > > > > > > > can definitely be within the routine. > > > > > > > > > > > > I think the check for RawHwpUnreliable should be within this routine. > > > > > > Looking closer at the code, I do not see any way to synchronize this. > > > > > > It looks like manipulation in the memory-failure code would be > > > > > > synchronized via the mf_mutex. However, I do not see how traversal and > > > > > > freeing of the raw_hwp_list called from __update_and_free_hugetlb_folio > > > > > > is synchronized against memory-failure code modifying the list. 
> > > > > > > > > > > > Naoya, can you provide some thoughts? > > > > > > > > > > Thanks for elaborating the issue. I think that making find_raw_hwp_page() and > > > > > folio_clear_hugetlb_hwpoison() do their works within mf_mutex can be one solution. > > > > > try_memory_failure_hugetlb(), one of the callers of folio_clear_hugetlb_hwpoison(), > > > > > already calls it within mf_mutex, so some wrapper might be needed to implement > > > > > calling path from __update_and_free_hugetlb_folio() to take mf_mutex. > > > > > > > > > > It might be a concern that mf_mutex is a big lock to cover overall hwpoison > > > > > subsystem, but I think that the impact is not so big if the changed code paths > > > > > take mf_mutex only after checking hwpoisoned hugepage. Maybe using folio_lock > > > > > to synchronize accesses to the raw_hwp_list could be possible, but currently > > > > > __get_huge_page_for_hwpoison() calls folio_set_hugetlb_hwpoison() without > > > > > folio_lock, so this approach needs update on locking rule and it sounds more > > > > > error-prone to me. > > > > > > > > Thanks Naoya, since memory_failure is the main user of raw_hwp_list, I > > > > agree mf_mutex could help to sync its two del_all operations (one from > > > > try_memory_failure_hugetlb and one from > > > > __update_and_free_hugetlb_folio). > > > > > > > > I still want to ask a perhaps stupid question, somewhat related to how > > > > to implement find_raw_hwp_page() correctly. It seems > > > > llist_for_each_safe should only be used to traverse after list entries > > > > already *deleted* via llist_del_all. But the llist_for_each_safe calls > > > > in memory_failure today is used *directly* on the raw_hwp_list. This > > > > is quite different from other users of llist_for_each_safe (for > > > > example, kernel/bpf/memalloc.c). > > > > > > Oh, I don't noticed that when writing the original code. (I just chose > > > llist_for_each_safe because I just wanted struct llist_node as a singly > > > linked list.) > > > > And maybe because you can avoid doing INIT_LIST_HEAD (which seems > > doable in folio_set_hugetlb_hwpoison if hugepage is hwpoison-ed for > > the first time)? > > > > > > > > > Why is it correct? I guess mostly > > > > because they are sync-ed under mf_mutex (except the missing coverage > > > > on __update_and_free_hugetlb_folio)? > > > > > > Yes, and there seems no good reason to use the macro llist_for_each_safe > > > here. I think it's OK to switch to simpler one like list_for_each which > > > should be supposed to be called directly. To do this, struct raw_hwp_page > > > need to have @node (typed struct list_head instead of struct llist_node). > > Hi Naoya, a maybe-stupid question on list vs llist: _hugetlb_hwpoison > in folio is a void*. struct list_head is composed of two pointers to > list_node (prev and next) so folio just can't hold a list_head in the > _hugetlb_hwpoison field, right? llist_head on the other hand only > contains one pointer to llist_node first. I wonder if this is one of > the reason you picked llist instead of list in the first place. Yes, that's one reason to use llist_head, and another (minor) reason is that we don't need doubly-linked list here. Thanks, Naoya Horiguchi > > The reason I ask is while I was testing my refactor draft, I > constantly see the refcount of the 3rd subpage in the folio got > corrupted. I am not sure about the exact reason but it feels to me > related to the above reason. 
> > > > > I will start to work on a separate patch to switch to list_head, and > > make sure operations from __update_and_free_hugetlb_folio and > > memory_failure are serialized (hopefully without intro new locks and > > just use mf_mutex). > > > > > > > > Thanks, > > > Naoya Horiguchi > >
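As a side note on the size point discussed above, here is a minimal standalone sketch (not in-tree code) of why the single pointer-sized folio->_hugetlb_hwpoison slot can back an llist but could not hold a list_head, and why llist_add() needs no INIT_LIST_HEAD-style setup. The raw_hwp_page definition is the one from the patch under review; everything else is illustrative.

#include <linux/llist.h>
#include <linux/list.h>
#include <linux/build_bug.h>
#include <linux/mm_types.h>

/* struct llist_head is a single 'first' pointer: it fits in the void * slot. */
static_assert(sizeof(struct llist_head) == sizeof(void *));
/* struct list_head carries next and prev: twice the size, it does not fit. */
static_assert(sizeof(struct list_head) == 2 * sizeof(void *));

struct raw_hwp_page {
        struct llist_node node;
        struct page *page;
};

/* Adding an entry is a lock-less push onto the head's single pointer; an
 * empty list is just a NULL pointer, so no initialization is needed. */
static void raw_hwp_add_sketch(struct folio *folio, struct raw_hwp_page *p)
{
        llist_add(&p->node, (struct llist_head *)&folio->_hugetlb_hwpoison);
}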
On Sun, Jun 11, 2023 at 9:19 PM Naoya Horiguchi <naoya.horiguchi@linux.dev> wrote: > > On Fri, Jun 09, 2023 at 10:48:47PM -0700, Jiaqi Yan wrote: > > On Thu, May 25, 2023 at 5:28 PM Jiaqi Yan <jiaqiyan@google.com> wrote: > > > > > > On Mon, May 22, 2023 at 7:43 PM HORIGUCHI NAOYA(堀口 直也) > > > <naoya.horiguchi@nec.com> wrote: > > > > > > > > On Mon, May 22, 2023 at 11:22:49AM -0700, Jiaqi Yan wrote: > > > > > On Sun, May 21, 2023 at 9:50 PM HORIGUCHI NAOYA(堀口 直也) > > > > > <naoya.horiguchi@nec.com> wrote: > > > > > > > > > > > > On Fri, May 19, 2023 at 03:42:14PM -0700, Mike Kravetz wrote: > > > > > > > On 05/19/23 13:54, Jiaqi Yan wrote: > > > > > > > > On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > > > > > > > > > > > > > On 05/17/23 16:09, Jiaqi Yan wrote: > > > > > > > > > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > > > > > > > > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > > > > > > > > > subpage is a raw HWPOISON page. > > > > > > > > > > > > > > > > > > > > Exports this functionality to be immediately used in the read operation > > > > > > > > > > for hugetlbfs. > > > > > > > > > > > > > > > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > > > > > > > > --- > > > > > > > > > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > > > > > > > > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > > > > > > > > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > > > > > > > > > > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > > > > > > index 27ce77080c79..f191a4119719 100644 > > > > > > > > > > --- a/include/linux/mm.h > > > > > > > > > > +++ b/include/linux/mm.h > > > > > > > > > > > > > > > > > > Any reason why you decided to add the following to linux/mm.h instead of > > > > > > > > > linux/hugetlb.h? Since it is hugetlb specific I would have thought > > > > > > > > > hugetlb.h was more appropriate. > > > > > > > > > > > > > > > > > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > > > > > > > > > */ > > > > > > > > > > extern const struct attribute_group memory_failure_attr_group; > > > > > > > > > > > > > > > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > > > > > > > > +/* > > > > > > > > > > + * Struct raw_hwp_page represents information about "raw error page", > > > > > > > > > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > > > > + */ > > > > > > > > > > +struct raw_hwp_page { > > > > > > > > > > + struct llist_node node; > > > > > > > > > > + struct page *page; > > > > > > > > > > +}; > > > > > > > > > > + > > > > > > > > > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > > > > +{ > > > > > > > > > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > > > > +} > > > > > > > > > > + > > > > > > > > > > +/* > > > > > > > > > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > > > > > > > > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. 
> > > > > > > > > > + */ > > > > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > > > > + struct page *subpage); > > > > > > > > > > +#endif > > > > > > > > > > + > > > > > > > > > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > > > > > > > > > extern void clear_huge_page(struct page *page, > > > > > > > > > > unsigned long addr_hint, > > > > > > > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > > > > > > > > index 5b663eca1f29..c49e6c2d1f07 100644 > > > > > > > > > > --- a/mm/memory-failure.c > > > > > > > > > > +++ b/mm/memory-failure.c > > > > > > > > > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > > > > > > > > > #endif /* CONFIG_FS_DAX */ > > > > > > > > > > > > > > > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > > > > > > > > -/* > > > > > > > > > > - * Struct raw_hwp_page represents information about "raw error page", > > > > > > > > > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > > > > - */ > > > > > > > > > > -struct raw_hwp_page { > > > > > > > > > > - struct llist_node node; > > > > > > > > > > - struct page *page; > > > > > > > > > > -}; > > > > > > > > > > > > > > > > > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > > > > + struct page *subpage) > > > > > > > > > > { > > > > > > > > > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > > > > + struct llist_node *t, *tnode; > > > > > > > > > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > > > > > > > > > + struct raw_hwp_page *hwp_page = NULL; > > > > > > > > > > + struct raw_hwp_page *p; > > > > > > > > > > + > > > > > > > > > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > > > > > > > > > > > > > > > > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > > > > > > > > > raw_hwp_list. This is indicated by the hugetlb page specific flag > > > > > > > > > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > > > > > > > > > > > > > > > > > Looks like this routine does not consider that case. Seems like it should > > > > > > > > > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > > > > > > > > > is true? > > > > > > > > > > > > > > > > Thanks for catching this. I wonder should this routine consider > > > > > > > > RawHwpUnreliable or should the caller do. > > > > > > > > > > > > > > > > find_raw_hwp_page now returns raw_hwp_page* in the llist entry to > > > > > > > > caller (valid one at the moment), but once RawHwpUnreliable is set, > > > > > > > > all the raw_hwp_page in the llist will be kfree(), and the returned > > > > > > > > value becomes dangling pointer to caller (if the caller holds that > > > > > > > > caller long enough). Maybe returning a bool would be safer to the > > > > > > > > caller? If the routine returns bool, then checking RawHwpUnreliable > > > > > > > > can definitely be within the routine. > > > > > > > > > > > > > > I think the check for RawHwpUnreliable should be within this routine. > > > > > > > Looking closer at the code, I do not see any way to synchronize this. > > > > > > > It looks like manipulation in the memory-failure code would be > > > > > > > synchronized via the mf_mutex. 
However, I do not see how traversal and > > > > > > > freeing of the raw_hwp_list called from __update_and_free_hugetlb_folio > > > > > > > is synchronized against memory-failure code modifying the list. > > > > > > > > > > > > > > Naoya, can you provide some thoughts? Hi Mike, Now looking again this, I think concurrent adding and deleting are fine with each other and with themselves, because raw_hwp_list is lock-less llist. As for synchronizing traversal with adding and deleting, I wonder is it a good idea to make __update_and_free_hugetlb_folio hold hugetlb_lock before it folio_clear_hugetlb_hwpoison(which traverse + delete raw_hwp_list)? In hugetlb, get_huge_page_for_hwpoison already takes hugetlb_lock; it seems to me __update_and_free_hugetlb_folio is missing the lock. > > > > > > > > > > > > Thanks for elaborating the issue. I think that making find_raw_hwp_page() and > > > > > > folio_clear_hugetlb_hwpoison() do their works within mf_mutex can be one solution. > > > > > > try_memory_failure_hugetlb(), one of the callers of folio_clear_hugetlb_hwpoison(), > > > > > > already calls it within mf_mutex, so some wrapper might be needed to implement > > > > > > calling path from __update_and_free_hugetlb_folio() to take mf_mutex. > > > > > > > > > > > > It might be a concern that mf_mutex is a big lock to cover overall hwpoison > > > > > > subsystem, but I think that the impact is not so big if the changed code paths > > > > > > take mf_mutex only after checking hwpoisoned hugepage. Maybe using folio_lock > > > > > > to synchronize accesses to the raw_hwp_list could be possible, but currently > > > > > > __get_huge_page_for_hwpoison() calls folio_set_hugetlb_hwpoison() without > > > > > > folio_lock, so this approach needs update on locking rule and it sounds more > > > > > > error-prone to me. > > > > > > > > > > Thanks Naoya, since memory_failure is the main user of raw_hwp_list, I > > > > > agree mf_mutex could help to sync its two del_all operations (one from > > > > > try_memory_failure_hugetlb and one from > > > > > __update_and_free_hugetlb_folio). > > > > > > > > > > I still want to ask a perhaps stupid question, somewhat related to how > > > > > to implement find_raw_hwp_page() correctly. It seems > > > > > llist_for_each_safe should only be used to traverse after list entries > > > > > already *deleted* via llist_del_all. But the llist_for_each_safe calls > > > > > in memory_failure today is used *directly* on the raw_hwp_list. This > > > > > is quite different from other users of llist_for_each_safe (for > > > > > example, kernel/bpf/memalloc.c). > > > > > > > > Oh, I don't noticed that when writing the original code. (I just chose > > > > llist_for_each_safe because I just wanted struct llist_node as a singly > > > > linked list.) > > > > > > And maybe because you can avoid doing INIT_LIST_HEAD (which seems > > > doable in folio_set_hugetlb_hwpoison if hugepage is hwpoison-ed for > > > the first time)? > > > > > > > > > > > > Why is it correct? I guess mostly > > > > > because they are sync-ed under mf_mutex (except the missing coverage > > > > > on __update_and_free_hugetlb_folio)? > > > > > > > > Yes, and there seems no good reason to use the macro llist_for_each_safe > > > > here. I think it's OK to switch to simpler one like list_for_each which > > > > should be supposed to be called directly. To do this, struct raw_hwp_page > > > > need to have @node (typed struct list_head instead of struct llist_node). 
> > Hi Naoya, a maybe-stupid question on list vs llist: _hugetlb_hwpoison > > in folio is a void*. struct list_head is composed of two pointers to > > list_node (prev and next) so folio just can't hold a list_head in the > > _hugetlb_hwpoison field, right? llist_head on the other hand only > > contains one pointer to llist_node first. I wonder if this is one of > > the reason you picked llist instead of list in the first place. > > Yes, that's one reason to use llist_head, and another (minor) reason is that > we don't need doubly-linked list here. Hi Naoya, Even with hugetlb_lock, I think we should still fix __folio_free_raw_hwp: call llist_del_all first, then traverse the list and free raw_hwp_page entries. Then folio_clear_hugetlb_hwpoison from both memory_failure and hugetlb will be safe, given that llist_del_all is safe against itself on the same llist. In my v2, I tried both: (1) taking hugetlb_lock in __update_and_free_hugetlb_folio, and (2) calling llist_del_all first in __folio_free_raw_hwp. I also changed find_raw_hwp_page to is_raw_hwp_subpage (it only returns a bool, and takes hugetlb_lock for traversing raw_hwp_list). So far I haven't run into any problems with my selftest. > > Thanks, > Naoya Horiguchi > > > > > The reason I ask is while I was testing my refactor draft, I > constantly see the refcount of the 3rd subpage in the folio got > corrupted. I am not sure about the exact reason but it feels to me > related to the above reason. > > > > > > > > I will start to work on a separate patch to switch to list_head, and > > > make sure operations from __update_and_free_hugetlb_folio and > > > memory_failure are serialized (hopefully without intro new locks and > > > just use mf_mutex). > > > > > > > > > > > Thanks, > > > > Naoya Horiguchi > > > >
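For concreteness, a rough sketch of the two changes described above (assumed to live in mm/memory-failure.c next to raw_hwp_list_head() from the patch). The actual v2 code is not shown in this thread, so the signatures, the move_flag handling, and the exact locking are assumptions based on this message rather than the real patch.

/* Change (2): detach the whole list before walking it, so concurrent
 * callers each walk a private list and never double-free an entry
 * (llist_del_all() is atomic with respect to llist_add() and to other
 * llist_del_all() calls on the same list). */
static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
{
        struct llist_node *head;
        struct raw_hwp_page *p, *next;
        unsigned long count = 0;

        head = llist_del_all(raw_hwp_list_head(folio));
        llist_for_each_entry_safe(p, next, head, node) {
                if (move_flag)
                        SetPageHWPoison(p->page);
                kfree(p);
                count++;
        }
        return count;
}

/* Renamed lookup helper: only reports whether @subpage is a raw HWPOISON
 * page, and walks the list under hugetlb_lock so the walk cannot overlap
 * the freeing above. */
bool is_raw_hwp_subpage(struct folio *folio, struct page *subpage)
{
        struct raw_hwp_page *p;
        bool ret = false;

        spin_lock_irq(&hugetlb_lock);
        llist_for_each_entry(p, raw_hwp_list_head(folio)->first, node) {
                if (p->page == subpage) {
                        ret = true;
                        break;
                }
        }
        spin_unlock_irq(&hugetlb_lock);
        return ret;
}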
On 06/16/23 14:19, Jiaqi Yan wrote: > On Sun, Jun 11, 2023 at 9:19 PM Naoya Horiguchi > <naoya.horiguchi@linux.dev> wrote: > > > > On Fri, Jun 09, 2023 at 10:48:47PM -0700, Jiaqi Yan wrote: > > > On Thu, May 25, 2023 at 5:28 PM Jiaqi Yan <jiaqiyan@google.com> wrote: > > > > > > > > On Mon, May 22, 2023 at 7:43 PM HORIGUCHI NAOYA(堀口 直也) > > > > <naoya.horiguchi@nec.com> wrote: > > > > > > > > > > On Mon, May 22, 2023 at 11:22:49AM -0700, Jiaqi Yan wrote: > > > > > > On Sun, May 21, 2023 at 9:50 PM HORIGUCHI NAOYA(堀口 直也) > > > > > > <naoya.horiguchi@nec.com> wrote: > > > > > > > > > > > > > > On Fri, May 19, 2023 at 03:42:14PM -0700, Mike Kravetz wrote: > > > > > > > > On 05/19/23 13:54, Jiaqi Yan wrote: > > > > > > > > > On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > > > > > > > > > > > > > > > On 05/17/23 16:09, Jiaqi Yan wrote: > > > > > > > > > > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > > > > > > > > > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > > > > > > > > > > subpage is a raw HWPOISON page. > > > > > > > > > > > > > > > > > > > > > > Exports this functionality to be immediately used in the read operation > > > > > > > > > > > for hugetlbfs. > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > > > > > > > > > --- > > > > > > > > > > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > > > > > > > > > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > > > > > > > > > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > > > > > > > > > > > > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > > > > > > > index 27ce77080c79..f191a4119719 100644 > > > > > > > > > > > --- a/include/linux/mm.h > > > > > > > > > > > +++ b/include/linux/mm.h > > > > > > > > > > > > > > > > > > > > Any reason why you decided to add the following to linux/mm.h instead of > > > > > > > > > > linux/hugetlb.h? Since it is hugetlb specific I would have thought > > > > > > > > > > hugetlb.h was more appropriate. > > > > > > > > > > > > > > > > > > > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > > > > > > > > > > */ > > > > > > > > > > > extern const struct attribute_group memory_failure_attr_group; > > > > > > > > > > > > > > > > > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > > > > > > > > > +/* > > > > > > > > > > > + * Struct raw_hwp_page represents information about "raw error page", > > > > > > > > > > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > > > > > + */ > > > > > > > > > > > +struct raw_hwp_page { > > > > > > > > > > > + struct llist_node node; > > > > > > > > > > > + struct page *page; > > > > > > > > > > > +}; > > > > > > > > > > > + > > > > > > > > > > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > > > > > +{ > > > > > > > > > > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > > > > > +} > > > > > > > > > > > + > > > > > > > > > > > +/* > > > > > > > > > > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > > > > > > > > > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. 
> > > > > > > > > > > + */ > > > > > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > > > > > + struct page *subpage); > > > > > > > > > > > +#endif > > > > > > > > > > > + > > > > > > > > > > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > > > > > > > > > > extern void clear_huge_page(struct page *page, > > > > > > > > > > > unsigned long addr_hint, > > > > > > > > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > > > > > > > > > index 5b663eca1f29..c49e6c2d1f07 100644 > > > > > > > > > > > --- a/mm/memory-failure.c > > > > > > > > > > > +++ b/mm/memory-failure.c > > > > > > > > > > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > > > > > > > > > > #endif /* CONFIG_FS_DAX */ > > > > > > > > > > > > > > > > > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > > > > > > > > > -/* > > > > > > > > > > > - * Struct raw_hwp_page represents information about "raw error page", > > > > > > > > > > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > > > > > - */ > > > > > > > > > > > -struct raw_hwp_page { > > > > > > > > > > > - struct llist_node node; > > > > > > > > > > > - struct page *page; > > > > > > > > > > > -}; > > > > > > > > > > > > > > > > > > > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > > > > > + struct page *subpage) > > > > > > > > > > > { > > > > > > > > > > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > > > > > + struct llist_node *t, *tnode; > > > > > > > > > > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > > > > > > > > > > + struct raw_hwp_page *hwp_page = NULL; > > > > > > > > > > > + struct raw_hwp_page *p; > > > > > > > > > > > + > > > > > > > > > > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > > > > > > > > > > > > > > > > > > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > > > > > > > > > > raw_hwp_list. This is indicated by the hugetlb page specific flag > > > > > > > > > > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > > > > > > > > > > > > > > > > > > > Looks like this routine does not consider that case. Seems like it should > > > > > > > > > > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > > > > > > > > > > is true? > > > > > > > > > > > > > > > > > > Thanks for catching this. I wonder should this routine consider > > > > > > > > > RawHwpUnreliable or should the caller do. > > > > > > > > > > > > > > > > > > find_raw_hwp_page now returns raw_hwp_page* in the llist entry to > > > > > > > > > caller (valid one at the moment), but once RawHwpUnreliable is set, > > > > > > > > > all the raw_hwp_page in the llist will be kfree(), and the returned > > > > > > > > > value becomes dangling pointer to caller (if the caller holds that > > > > > > > > > caller long enough). Maybe returning a bool would be safer to the > > > > > > > > > caller? If the routine returns bool, then checking RawHwpUnreliable > > > > > > > > > can definitely be within the routine. > > > > > > > > > > > > > > > > I think the check for RawHwpUnreliable should be within this routine. > > > > > > > > Looking closer at the code, I do not see any way to synchronize this. > > > > > > > > It looks like manipulation in the memory-failure code would be > > > > > > > > synchronized via the mf_mutex. 
However, I do not see how traversal and > > > > > > > > freeing of the raw_hwp_list called from __update_and_free_hugetlb_folio > > > > > > > > is synchronized against memory-failure code modifying the list. > > > > > > > > > > > > > > > > Naoya, can you provide some thoughts? > > Hi Mike, > > Now looking again this, I think concurrent adding and deleting are > fine with each other and with themselves, because raw_hwp_list is > lock-less llist. Correct. > As for synchronizing traversal with adding and deleting, I wonder is > it a good idea to make __update_and_free_hugetlb_folio hold > hugetlb_lock before it folio_clear_hugetlb_hwpoison(which traverse + > delete raw_hwp_list)? In hugetlb, get_huge_page_for_hwpoison already > takes hugetlb_lock; it seems to me __update_and_free_hugetlb_folio is > missing the lock. I do not think the lock is needed. However, while looking more closely at this I think I discovered another issue. This is VERY subtle. Perhaps Naoya can help verify if my reasoning below is correct. In __update_and_free_hugetlb_folio we are not operating on a hugetlb page. Why is this? Before calling update_and_free_hugetlb_folio we call remove_hugetlb_folio. The purpose of remove_hugetlb_folio is to remove the huge page from the list AND compound page destructor indicating this is a hugetlb page is changed. This is all done while holding the hugetlb lock. So, the test for folio_test_hugetlb(folio) is false. We have technically a compound non-hugetlb page with a non-null raw_hwp_list. Important note: at this time we have not reallocated vmemmap pages if hugetlb page was vmemmap optimized. That is done later in __update_and_free_hugetlb_folio. The 'good news' is that after this point get_huge_page_for_hwpoison will not recognize this as a hugetlb page, so nothing will be added to the list. There is no need to worry about entries being added to the list during traversal. The 'bad news' is that if we get a memory error at this time we will treat it as a memory error on a regular compound page. So, TestSetPageHWPoison(p) in memory_failure() may try to write a read only struct page. :(
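To make the window concrete, the free path described above can be sketched as the following call order (function names as used in this thread; the parameters and surrounding details are illustrative, not a verbatim copy of the tree):

static void free_hugetlb_folio_sketch(struct hstate *h, struct folio *folio)
{
        spin_lock_irq(&hugetlb_lock);
        /* Clears the hugetlb compound dtor; from here on,
         * folio_test_hugetlb() is false for this folio. */
        remove_hugetlb_folio(h, folio, false);
        spin_unlock_irq(&hugetlb_lock);

        /*
         * <-- window: a memory error landing here is handled as an error
         * on a normal compound page. If the folio is still vmemmap
         * optimized, its tail struct pages are mapped read-only, so
         * TestSetPageHWPoison() on a tail page writes to a read-only
         * struct page.
         */

        /* Vmemmap pages are only reallocated in here, without locks held. */
        update_and_free_hugetlb_folio(h, folio, false);
}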
On Fri, Jun 16, 2023 at 4:35 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 06/16/23 14:19, Jiaqi Yan wrote: > > On Sun, Jun 11, 2023 at 9:19 PM Naoya Horiguchi > > <naoya.horiguchi@linux.dev> wrote: > > > > > > On Fri, Jun 09, 2023 at 10:48:47PM -0700, Jiaqi Yan wrote: > > > > On Thu, May 25, 2023 at 5:28 PM Jiaqi Yan <jiaqiyan@google.com> wrote: > > > > > > > > > > On Mon, May 22, 2023 at 7:43 PM HORIGUCHI NAOYA(堀口 直也) > > > > > <naoya.horiguchi@nec.com> wrote: > > > > > > > > > > > > On Mon, May 22, 2023 at 11:22:49AM -0700, Jiaqi Yan wrote: > > > > > > > On Sun, May 21, 2023 at 9:50 PM HORIGUCHI NAOYA(堀口 直也) > > > > > > > <naoya.horiguchi@nec.com> wrote: > > > > > > > > > > > > > > > > On Fri, May 19, 2023 at 03:42:14PM -0700, Mike Kravetz wrote: > > > > > > > > > On 05/19/23 13:54, Jiaqi Yan wrote: > > > > > > > > > > On Wed, May 17, 2023 at 4:53 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > > > > > > > > > > > > > > > > > On 05/17/23 16:09, Jiaqi Yan wrote: > > > > > > > > > > > > Adds the functionality to search a subpage's corresponding raw_hwp_page > > > > > > > > > > > > in hugetlb page's HWPOISON list. This functionality can also tell if a > > > > > > > > > > > > subpage is a raw HWPOISON page. > > > > > > > > > > > > > > > > > > > > > > > > Exports this functionality to be immediately used in the read operation > > > > > > > > > > > > for hugetlbfs. > > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > > > > > > > > > > --- > > > > > > > > > > > > include/linux/mm.h | 23 +++++++++++++++++++++++ > > > > > > > > > > > > mm/memory-failure.c | 26 ++++++++++++++++---------- > > > > > > > > > > > > 2 files changed, 39 insertions(+), 10 deletions(-) > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > > > > > > > > index 27ce77080c79..f191a4119719 100644 > > > > > > > > > > > > --- a/include/linux/mm.h > > > > > > > > > > > > +++ b/include/linux/mm.h > > > > > > > > > > > > > > > > > > > > > > Any reason why you decided to add the following to linux/mm.h instead of > > > > > > > > > > > linux/hugetlb.h? Since it is hugetlb specific I would have thought > > > > > > > > > > > hugetlb.h was more appropriate. > > > > > > > > > > > > > > > > > > > > > > > @@ -3683,6 +3683,29 @@ enum mf_action_page_type { > > > > > > > > > > > > */ > > > > > > > > > > > > extern const struct attribute_group memory_failure_attr_group; > > > > > > > > > > > > > > > > > > > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > > > > > > > > > > +/* > > > > > > > > > > > > + * Struct raw_hwp_page represents information about "raw error page", > > > > > > > > > > > > + * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > > > > > > + */ > > > > > > > > > > > > +struct raw_hwp_page { > > > > > > > > > > > > + struct llist_node node; > > > > > > > > > > > > + struct page *page; > > > > > > > > > > > > +}; > > > > > > > > > > > > + > > > > > > > > > > > > +static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > > > > > > +{ > > > > > > > > > > > > + return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > > > > > > +} > > > > > > > > > > > > + > > > > > > > > > > > > +/* > > > > > > > > > > > > + * Given @subpage, a raw page in a hugepage, find its location in @folio's > > > > > > > > > > > > + * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list. 
> > > > > > > > > > > > + */ > > > > > > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > > > > > > + struct page *subpage); > > > > > > > > > > > > +#endif > > > > > > > > > > > > + > > > > > > > > > > > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) > > > > > > > > > > > > extern void clear_huge_page(struct page *page, > > > > > > > > > > > > unsigned long addr_hint, > > > > > > > > > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > > > > > > > > > > > index 5b663eca1f29..c49e6c2d1f07 100644 > > > > > > > > > > > > --- a/mm/memory-failure.c > > > > > > > > > > > > +++ b/mm/memory-failure.c > > > > > > > > > > > > @@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs); > > > > > > > > > > > > #endif /* CONFIG_FS_DAX */ > > > > > > > > > > > > > > > > > > > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > > > > > > > > > > -/* > > > > > > > > > > > > - * Struct raw_hwp_page represents information about "raw error page", > > > > > > > > > > > > - * constructing singly linked list from ->_hugetlb_hwpoison field of folio. > > > > > > > > > > > > - */ > > > > > > > > > > > > -struct raw_hwp_page { > > > > > > > > > > > > - struct llist_node node; > > > > > > > > > > > > - struct page *page; > > > > > > > > > > > > -}; > > > > > > > > > > > > > > > > > > > > > > > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio) > > > > > > > > > > > > +struct raw_hwp_page *find_raw_hwp_page(struct folio *folio, > > > > > > > > > > > > + struct page *subpage) > > > > > > > > > > > > { > > > > > > > > > > > > - return (struct llist_head *)&folio->_hugetlb_hwpoison; > > > > > > > > > > > > + struct llist_node *t, *tnode; > > > > > > > > > > > > + struct llist_head *raw_hwp_head = raw_hwp_list_head(folio); > > > > > > > > > > > > + struct raw_hwp_page *hwp_page = NULL; > > > > > > > > > > > > + struct raw_hwp_page *p; > > > > > > > > > > > > + > > > > > > > > > > > > + llist_for_each_safe(tnode, t, raw_hwp_head->first) { > > > > > > > > > > > > > > > > > > > > > > IIUC, in rare error cases a hugetlb page can be poisoned WITHOUT a > > > > > > > > > > > raw_hwp_list. This is indicated by the hugetlb page specific flag > > > > > > > > > > > RawHwpUnreliable or folio_test_hugetlb_raw_hwp_unreliable(). > > > > > > > > > > > > > > > > > > > > > > Looks like this routine does not consider that case. Seems like it should > > > > > > > > > > > always return the passed subpage if folio_test_hugetlb_raw_hwp_unreliable() > > > > > > > > > > > is true? > > > > > > > > > > > > > > > > > > > > Thanks for catching this. I wonder should this routine consider > > > > > > > > > > RawHwpUnreliable or should the caller do. > > > > > > > > > > > > > > > > > > > > find_raw_hwp_page now returns raw_hwp_page* in the llist entry to > > > > > > > > > > caller (valid one at the moment), but once RawHwpUnreliable is set, > > > > > > > > > > all the raw_hwp_page in the llist will be kfree(), and the returned > > > > > > > > > > value becomes dangling pointer to caller (if the caller holds that > > > > > > > > > > caller long enough). Maybe returning a bool would be safer to the > > > > > > > > > > caller? If the routine returns bool, then checking RawHwpUnreliable > > > > > > > > > > can definitely be within the routine. > > > > > > > > > > > > > > > > > > I think the check for RawHwpUnreliable should be within this routine. > > > > > > > > > Looking closer at the code, I do not see any way to synchronize this. 
> > > > > > > > > It looks like manipulation in the memory-failure code would be > > > > > > > > > synchronized via the mf_mutex. However, I do not see how traversal and > > > > > > > > > freeing of the raw_hwp_list called from __update_and_free_hugetlb_folio > > > > > > > > > is synchronized against memory-failure code modifying the list. > > > > > > > > > > > > > > > > > > Naoya, can you provide some thoughts? > > > > Hi Mike, > > > > Now looking again this, I think concurrent adding and deleting are > > fine with each other and with themselves, because raw_hwp_list is > > lock-less llist. > > Correct. > > > As for synchronizing traversal with adding and deleting, I wonder is > > it a good idea to make __update_and_free_hugetlb_folio hold > > hugetlb_lock before it folio_clear_hugetlb_hwpoison(which traverse + > > delete raw_hwp_list)? In hugetlb, get_huge_page_for_hwpoison already > > takes hugetlb_lock; it seems to me __update_and_free_hugetlb_folio is > > missing the lock. > > I do not think the lock is needed. However, while looking more closely > at this I think I discovered another issue. > This is VERY subtle. > Perhaps Naoya can help verify if my reasoning below is correct. > > In __update_and_free_hugetlb_folio we are not operating on a hugetlb page. > Why is this? > Before calling update_and_free_hugetlb_folio we call remove_hugetlb_folio. > The purpose of remove_hugetlb_folio is to remove the huge page from the > list AND compound page destructor indicating this is a hugetlb page is changed. > This is all done while holding the hugetlb lock. So, the test for > folio_test_hugetlb(folio) is false. > > We have technically a compound non-hugetlb page with a non-null raw_hwp_list. > > Important note: at this time we have not reallocated vmemmap pages if > hugetlb page was vmemmap optimized. That is done later in > __update_and_free_hugetlb_folio. > > The 'good news' is that after this point get_huge_page_for_hwpoison will > not recognize this as a hugetlb page, so nothing will be added to the > list. There is no need to worry about entries being added to the list > during traversal. > > The 'bad news' is that if we get a memory error at this time we will > treat it as a memory error on a regular compound page. So, > TestSetPageHWPoison(p) in memory_failure() may try to write a read only > struct page. :( At least I think this is an issue. Would it help if dissolve_free_huge_page doesn't unlock hugetlb_lock until update_and_free_hugetlb_folio is done, or basically until dissolve_free_huge_page is done? TestSetPageHWPoison in memory_failure is called after try_memory_failure_hugetlb, and folio_test_hugetlb is tested within __get_huge_page_for_hwpoison, which is wrapped by the hugetlb_lock. So by the time dissolve_free_huge_page returns, subpages already go through hugetlb_vmemmap_restore and __destroy_compound_gigantic_folio and become non-compound raw pages (folios). Now folio_test_hugetlb(p)=false will be correct for memory_failure, and it can recover p as a dissolved non-hugetlb page. > -- > Mike Kravetz
On 06/16/23 19:18, Jiaqi Yan wrote: > On Fri, Jun 16, 2023 at 4:35 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 06/16/23 14:19, Jiaqi Yan wrote: > > > > > > Now looking again this, I think concurrent adding and deleting are > > > fine with each other and with themselves, because raw_hwp_list is > > > lock-less llist. > > > > Correct. > > > > > As for synchronizing traversal with adding and deleting, I wonder is > > > it a good idea to make __update_and_free_hugetlb_folio hold > > > hugetlb_lock before it folio_clear_hugetlb_hwpoison(which traverse + > > > delete raw_hwp_list)? In hugetlb, get_huge_page_for_hwpoison already > > > takes hugetlb_lock; it seems to me __update_and_free_hugetlb_folio is > > > missing the lock. > > > > I do not think the lock is needed. However, while looking more closely > > at this I think I discovered another issue. > > This is VERY subtle. > > Perhaps Naoya can help verify if my reasoning below is correct. > > > > In __update_and_free_hugetlb_folio we are not operating on a hugetlb page. > > Why is this? > > Before calling update_and_free_hugetlb_folio we call remove_hugetlb_folio. > > The purpose of remove_hugetlb_folio is to remove the huge page from the > > list AND compound page destructor indicating this is a hugetlb page is changed. > > This is all done while holding the hugetlb lock. So, the test for > > folio_test_hugetlb(folio) is false. > > > > We have technically a compound non-hugetlb page with a non-null raw_hwp_list. > > > > Important note: at this time we have not reallocated vmemmap pages if > > hugetlb page was vmemmap optimized. That is done later in > > __update_and_free_hugetlb_folio. > > > > > > The 'good news' is that after this point get_huge_page_for_hwpoison will > > not recognize this as a hugetlb page, so nothing will be added to the > > list. There is no need to worry about entries being added to the list > > during traversal. > > > > The 'bad news' is that if we get a memory error at this time we will > > treat it as a memory error on a regular compound page. So, > > TestSetPageHWPoison(p) in memory_failure() may try to write a read only > > struct page. :( > > At least I think this is an issue. > > Would it help if dissolve_free_huge_page doesn't unlock hugetlb_lock > until update_and_free_hugetlb_folio is done, or basically until > dissolve_free_huge_page is done? Unfortunately, update_and_free_hugetlb_folio is designed to be called without locks held. This is because we can not hold any locks while allocating vmemmap pages. I'll try to think of some way to restructure the code. IIUC, this is a potential general issue, not just isolated to memory error handling.
On Sat, Jun 17, 2023 at 03:59:27PM -0700, Mike Kravetz wrote: > On 06/16/23 19:18, Jiaqi Yan wrote: > > On Fri, Jun 16, 2023 at 4:35 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > On 06/16/23 14:19, Jiaqi Yan wrote: > > > > > > > > Now looking again this, I think concurrent adding and deleting are > > > > fine with each other and with themselves, because raw_hwp_list is > > > > lock-less llist. > > > > > > Correct. > > > > > > > As for synchronizing traversal with adding and deleting, I wonder is > > > > it a good idea to make __update_and_free_hugetlb_folio hold > > > > hugetlb_lock before it folio_clear_hugetlb_hwpoison(which traverse + > > > > delete raw_hwp_list)? In hugetlb, get_huge_page_for_hwpoison already > > > > takes hugetlb_lock; it seems to me __update_and_free_hugetlb_folio is > > > > missing the lock. > > > > > > I do not think the lock is needed. However, while looking more closely > > > at this I think I discovered another issue. > > > This is VERY subtle. > > > Perhaps Naoya can help verify if my reasoning below is correct. > > > > > > In __update_and_free_hugetlb_folio we are not operating on a hugetlb page. > > > Why is this? > > > Before calling update_and_free_hugetlb_folio we call remove_hugetlb_folio. > > > The purpose of remove_hugetlb_folio is to remove the huge page from the > > > list AND compound page destructor indicating this is a hugetlb page is changed. > > > This is all done while holding the hugetlb lock. So, the test for > > > folio_test_hugetlb(folio) is false. > > > > > > We have technically a compound non-hugetlb page with a non-null raw_hwp_list. > > > > > > Important note: at this time we have not reallocated vmemmap pages if > > > hugetlb page was vmemmap optimized. That is done later in > > > __update_and_free_hugetlb_folio. > > > > > > > > > > The 'good news' is that after this point get_huge_page_for_hwpoison will > > > not recognize this as a hugetlb page, so nothing will be added to the > > > list. There is no need to worry about entries being added to the list > > > during traversal. > > > > > > The 'bad news' is that if we get a memory error at this time we will > > > treat it as a memory error on a regular compound page. So, > > > TestSetPageHWPoison(p) in memory_failure() may try to write a read only > > > struct page. :( > > > > At least I think this is an issue. > > > > Would it help if dissolve_free_huge_page doesn't unlock hugetlb_lock > > until update_and_free_hugetlb_folio is done, or basically until > > dissolve_free_huge_page is done? > > Unfortunately, update_and_free_hugetlb_folio is designed to be called > without locks held. This is because we can not hold any locks while > allocating vmemmap pages. > > I'll try to think of some way to restructure the code. IIUC, this is a > potential general issue, not just isolated to memory error handling. Considering this issue as one specific to memory error handling, checking HPG_vmemmap_optimized in __get_huge_page_for_hwpoison() might be helpful to detect the race. Then, an idea like the below diff (not tested) can make try_memory_failure_hugetlb() retry (with retaking hugetlb_lock) to wait for complete the allocation of vmemmap pages. 
@@ -1938,8 +1938,11 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags, int ret = 2; /* fallback to normal page handling */ bool count_increased = false; - if (!folio_test_hugetlb(folio)) + if (!folio_test_hugetlb(folio)) { + if (folio_test_hugetlb_vmemmap_optimized(folio)) + ret = -EBUSY; goto out; + } if (flags & MF_COUNT_INCREASED) { ret = 1; Thanks, Naoya Horiguchi > -- > Mike Kravetz > > > > > TestSetPageHWPoison in memory_failure is called after > > try_memory_failure_hugetlb, and folio_test_hugetlb is tested within > > __get_huge_page_for_hwpoison, which is wrapped by the hugetlb_lock. So > > by the time dissolve_free_huge_page returns, subpages already go > > through hugetlb_vmemmap_restore and __destroy_compound_gigantic_folio > > and become non-compound raw pages (folios). Now > > folio_test_hugetlb(p)=false will be correct for memory_failure, and it > > can recover p as a dissolved non-hugetlb page.
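For reference, the -EBUSY from the diff above would be consumed by the caller's retry path. The sketch below is modeled on the existing handling in try_memory_failure_hugetlb() as recalled from the tree, so the MF_NO_RETRY details should be treated as approximate rather than authoritative.

static int try_memory_failure_hugetlb_sketch(unsigned long pfn, int flags,
                                             int *hugetlb)
{
        int res;
        bool migratable_cleared = false;

retry:
        res = get_huge_page_for_hwpoison(pfn, flags, &migratable_cleared);
        if (res == 2) {                 /* not (or no longer) a hugetlb page */
                *hugetlb = 0;
                return 0;
        } else if (res == -EBUSY) {
                /* e.g. raced with hugetlb freeing; retry once after the
                 * hugetlb_lock has been dropped and retaken. */
                if (!(flags & MF_NO_RETRY)) {
                        flags |= MF_NO_RETRY;
                        goto retry;
                }
                return action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
        }
        /* The -EHWPOISON, 0 and 1 cases are elided from this sketch. */
        return res;
}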
On 06/19/23 17:23, Naoya Horiguchi wrote: > On Sat, Jun 17, 2023 at 03:59:27PM -0700, Mike Kravetz wrote: > > On 06/16/23 19:18, Jiaqi Yan wrote: > > > On Fri, Jun 16, 2023 at 4:35 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > On 06/16/23 14:19, Jiaqi Yan wrote: > > > > > > > > > > Now looking again this, I think concurrent adding and deleting are > > > > > fine with each other and with themselves, because raw_hwp_list is > > > > > lock-less llist. > > > > > > > > Correct. > > > > > > > > > As for synchronizing traversal with adding and deleting, I wonder is > > > > > it a good idea to make __update_and_free_hugetlb_folio hold > > > > > hugetlb_lock before it folio_clear_hugetlb_hwpoison(which traverse + > > > > > delete raw_hwp_list)? In hugetlb, get_huge_page_for_hwpoison already > > > > > takes hugetlb_lock; it seems to me __update_and_free_hugetlb_folio is > > > > > missing the lock. > > > > > > > > I do not think the lock is needed. However, while looking more closely > > > > at this I think I discovered another issue. > > > > This is VERY subtle. > > > > Perhaps Naoya can help verify if my reasoning below is correct. > > > > > > > > In __update_and_free_hugetlb_folio we are not operating on a hugetlb page. > > > > Why is this? > > > > Before calling update_and_free_hugetlb_folio we call remove_hugetlb_folio. > > > > The purpose of remove_hugetlb_folio is to remove the huge page from the > > > > list AND compound page destructor indicating this is a hugetlb page is changed. > > > > This is all done while holding the hugetlb lock. So, the test for > > > > folio_test_hugetlb(folio) is false. > > > > > > > > We have technically a compound non-hugetlb page with a non-null raw_hwp_list. > > > > > > > > Important note: at this time we have not reallocated vmemmap pages if > > > > hugetlb page was vmemmap optimized. That is done later in > > > > __update_and_free_hugetlb_folio. > > > > > > > > > > > > > > The 'good news' is that after this point get_huge_page_for_hwpoison will > > > > not recognize this as a hugetlb page, so nothing will be added to the > > > > list. There is no need to worry about entries being added to the list > > > > during traversal. > > > > > > > > The 'bad news' is that if we get a memory error at this time we will > > > > treat it as a memory error on a regular compound page. So, > > > > TestSetPageHWPoison(p) in memory_failure() may try to write a read only > > > > struct page. :( > > > > > > At least I think this is an issue. > > > > > > Would it help if dissolve_free_huge_page doesn't unlock hugetlb_lock > > > until update_and_free_hugetlb_folio is done, or basically until > > > dissolve_free_huge_page is done? > > > > Unfortunately, update_and_free_hugetlb_folio is designed to be called > > without locks held. This is because we can not hold any locks while > > allocating vmemmap pages. > > > > I'll try to think of some way to restructure the code. IIUC, this is a > > potential general issue, not just isolated to memory error handling. > > Considering this issue as one specific to memory error handling, checking > HPG_vmemmap_optimized in __get_huge_page_for_hwpoison() might be helpful to > detect the race. Then, an idea like the below diff (not tested) can make > try_memory_failure_hugetlb() retry (with retaking hugetlb_lock) to wait > for complete the allocation of vmemmap pages. 
> > @@ -1938,8 +1938,11 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags, > int ret = 2; /* fallback to normal page handling */ > bool count_increased = false; > > - if (!folio_test_hugetlb(folio)) > + if (!folio_test_hugetlb(folio)) { > + if (folio_test_hugetlb_vmemmap_optimized(folio)) > + ret = -EBUSY; The hugetlb specific page flags (HPG_vmemmap_optimized here) reside in the folio->private field. In the case where the folio is a non-hugetlb folio, the folio->private field could be any arbitrary value. As such, the test for vmemmap_optimized may return a false positive. We could end up retrying for an arbitrarily long time. I am looking at how to restructure the code which removes and frees hugetlb pages so that folio_test_hugetlb() would remain true until vmemmap pages are allocated. The easiest way to do this would introduce another hugetlb lock/unlock cycle in the page freeing path. This would undo some of the speedups in the series: https://lore.kernel.org/all/20210409205254.242291-4-mike.kravetz@oracle.com/T/#m34321fbcbdf8bb35dfe083b05d445e90ecc1efab
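To make the false-positive concern concrete: the hugetlb-specific page flags are just bits stored in folio->private, so the flag test boils down to a test_bit() on that word. A rough paraphrase of the HPAGEFLAG()-generated helper from hugetlb.h (written from memory, so treat it as illustrative):

static inline bool hugetlb_vmemmap_optimized_sketch(struct folio *folio)
{
        void *private = &folio->private;

        /*
         * Only meaningful while the folio really is a hugetlb folio.
         * Once the hugetlb dtor is cleared, ->private may hold an
         * fs-private pointer or stale data, so this bit can read as set
         * purely by chance -- hence the unbounded-retry worry above.
         */
        return test_bit(HPG_vmemmap_optimized, private);
}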
On 06/20/23 11:05, Mike Kravetz wrote: > On 06/19/23 17:23, Naoya Horiguchi wrote: > > > > Considering this issue as one specific to memory error handling, checking > > HPG_vmemmap_optimized in __get_huge_page_for_hwpoison() might be helpful to > > detect the race. Then, an idea like the below diff (not tested) can make > > try_memory_failure_hugetlb() retry (with retaking hugetlb_lock) to wait > > for complete the allocation of vmemmap pages. > > > > @@ -1938,8 +1938,11 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags, > > int ret = 2; /* fallback to normal page handling */ > > bool count_increased = false; > > > > - if (!folio_test_hugetlb(folio)) > > + if (!folio_test_hugetlb(folio)) { > > + if (folio_test_hugetlb_vmemmap_optimized(folio)) > > + ret = -EBUSY; > > The hugetlb specific page flags (HPG_vmemmap_optimized here) reside in > the folio->private field. > > In the case where the folio is a non-hugetlb folio, the folio->private field > could be any arbitrary value. As such, the test for vmemmap_optimized may > return a false positive. We could end up retrying for an arbitrarily > long time. > > I am looking at how to restructure the code which removes and frees > hugetlb pages so that folio_test_hugetlb() would remain true until > vmemmap pages are allocated. The easiest way to do this would introduce > another hugetlb lock/unlock cycle in the page freeing path. This would > undo some of the speedups in the series: > https://lore.kernel.org/all/20210409205254.242291-4-mike.kravetz@oracle.com/T/#m34321fbcbdf8bb35dfe083b05d445e90ecc1efab > Perhaps something like this? Minimal testing. From e709fb4da0b6249973f9bf0540c9da0e4c585fe2 Mon Sep 17 00:00:00 2001 From: Mike Kravetz <mike.kravetz@oracle.com> Date: Tue, 20 Jun 2023 14:48:39 -0700 Subject: [PATCH] hugetlb: Do not clear hugetlb dtor until allocating vmemmap Freeing a hugetlb page and releasing base pages back to the underlying allocator such as buddy or cma is performed in two steps: - remove_hugetlb_folio() is called to remove the folio from hugetlb lists, get a ref on the page and remove hugetlb destructor. This all must be done under the hugetlb lock. After this call, the page can be treated as a normal compound page or a collection of base size pages. - update_and_free_hugetlb_folio() is called to allocate vmemmap if needed and the free routine of the underlying allocator is called on the resulting page. We can not hold the hugetlb lock here. One issue with this scheme is that a memory error could occur between these two steps. In this case, the memory error handling code treats the old hugetlb page as a normal compound page or collection of base pages. It will then try to SetPageHWPoison(page) on the page with an error. If the page with error is a tail page without vmemmap, a write error will occur when trying to set the flag. Address this issue by modifying remove_hugetlb_folio() and update_and_free_hugetlb_folio() such that the hugetlb destructor is not cleared until after allocating vmemmap. Since clearing the destructor required holding the hugetlb lock, the clearing is done in remove_hugetlb_folio() if the vmemmap is present. This saves a lock/unlock cycle. Otherwise, destructor is cleared in update_and_free_hugetlb_folio() after allocating vmemmap. Note that this will leave hugetlb pages in a state where they are marked free (by hugetlb specific page flag) and have a ref count. This is not a normal state. 
The only code that would notice is the memory error code, and it is set up to retry in such a case. A subsequent patch will create a routine to do bulk processing of vmemmap allocation. This will eliminate a lock/unlock cycle for each hugetlb page in the case where we are freeing a bunch of pages. Fixes: ??? Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> --- mm/hugetlb.c | 75 +++++++++++++++++++++++++++++++++++----------------- 1 file changed, 51 insertions(+), 24 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index d76574425da3..f7f64470aee0 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1579,9 +1579,37 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio, unsigned int order) { } #endif +static inline void __clear_hugetlb_destructor(struct hstate *h, + struct folio *folio) +{ + lockdep_assert_held(&hugetlb_lock); + + /* + * Very subtle + * + * For non-gigantic pages set the destructor to the normal compound + * page dtor. This is needed in case someone takes an additional + * temporary ref to the page, and freeing is delayed until they drop + * their reference. + * + * For gigantic pages set the destructor to the null dtor. This + * destructor will never be called. Before freeing the gigantic + * page destroy_compound_gigantic_folio will turn the folio into a + * simple group of pages. After this the destructor does not + * apply. + * + */ + if (hstate_is_gigantic(h)) + folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR); + else + folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); +} + /* - * Remove hugetlb folio from lists, and update dtor so that the folio appears - * as just a compound page. + * Remove hugetlb folio from lists. + * If vmemmap exists for the folio, update dtor so that the folio appears + * as just a compound page. Otherwise, wait until after allocating vmemmap + * to update dtor. * * A reference is held on the folio, except in the case of demote. * @@ -1612,31 +1640,19 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio, } /* - * Very subtle - * - * For non-gigantic pages set the destructor to the normal compound - * page dtor. This is needed in case someone takes an additional - * temporary ref to the page, and freeing is delayed until they drop - * their reference. - * - * For gigantic pages set the destructor to the null dtor. This - * destructor will never be called. Before freeing the gigantic - * page destroy_compound_gigantic_folio will turn the folio into a - * simple group of pages. After this the destructor does not - * apply. - * - * This handles the case where more than one ref is held when and - * after update_and_free_hugetlb_folio is called. - * - * In the case of demote we do not ref count the page as it will soon - * be turned into a page of smaller size. + * We can only clear the hugetlb destructor after allocating vmemmap + * pages. Otherwise, someone (memory error handling) may try to write + * to tail struct pages. + */ + if (!folio_test_hugetlb_vmemmap_optimized(folio)) + __clear_hugetlb_destructor(h, folio); + + /* + * In the case of demote we do not ref count the page as it will soon + * be turned into a page of smaller size. 
*/ if (!demote) folio_ref_unfreeze(folio, 1); - if (hstate_is_gigantic(h)) - folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR); - else - folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); h->nr_huge_pages--; h->nr_huge_pages_node[nid]--; @@ -1705,6 +1721,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h, { int i; struct page *subpage; + bool clear_dtor = folio_test_hugetlb_vmemmap_optimized(folio); if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported()) return; @@ -1735,6 +1752,16 @@ static void __update_and_free_hugetlb_folio(struct hstate *h, if (unlikely(folio_test_hwpoison(folio))) folio_clear_hugetlb_hwpoison(folio); + /* + * If vmemmap pages were allocated above, then we need to clear the + * hugetlb destructor under the hugetlb lock. + */ + if (clear_dtor) { + spin_lock_irq(&hugetlb_lock); + __clear_hugetlb_destructor(h, folio); + spin_unlock_irq(&hugetlb_lock); + } + for (i = 0; i < pages_per_huge_page(h); i++) { subpage = folio_page(folio, i); subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
On Tue, Jun 20, 2023 at 3:39 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 06/20/23 11:05, Mike Kravetz wrote: > > On 06/19/23 17:23, Naoya Horiguchi wrote: > > > > > > Considering this issue as one specific to memory error handling, checking > > > HPG_vmemmap_optimized in __get_huge_page_for_hwpoison() might be helpful to > > > detect the race. Then, an idea like the below diff (not tested) can make > > > try_memory_failure_hugetlb() retry (with retaking hugetlb_lock) to wait > > > for complete the allocation of vmemmap pages. > > > > > > @@ -1938,8 +1938,11 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags, > > > int ret = 2; /* fallback to normal page handling */ > > > bool count_increased = false; > > > > > > - if (!folio_test_hugetlb(folio)) > > > + if (!folio_test_hugetlb(folio)) { > > > + if (folio_test_hugetlb_vmemmap_optimized(folio)) > > > + ret = -EBUSY; > > > > The hugetlb specific page flags (HPG_vmemmap_optimized here) reside in > > the folio->private field. > > > > In the case where the folio is a non-hugetlb folio, the folio->private field > > could be any arbitrary value. As such, the test for vmemmap_optimized may > > return a false positive. We could end up retrying for an arbitrarily > > long time. > > > > I am looking at how to restructure the code which removes and frees > > hugetlb pages so that folio_test_hugetlb() would remain true until > > vmemmap pages are allocated. The easiest way to do this would introduce > > another hugetlb lock/unlock cycle in the page freeing path. This would > > undo some of the speedups in the series: > > https://lore.kernel.org/all/20210409205254.242291-4-mike.kravetz@oracle.com/T/#m34321fbcbdf8bb35dfe083b05d445e90ecc1efab > > > > Perhaps something like this? Minimal testing. Thanks for putting up a fix, Mike! > > From e709fb4da0b6249973f9bf0540c9da0e4c585fe2 Mon Sep 17 00:00:00 2001 > From: Mike Kravetz <mike.kravetz@oracle.com> > Date: Tue, 20 Jun 2023 14:48:39 -0700 > Subject: [PATCH] hugetlb: Do not clear hugetlb dtor until allocating vmemmap > > Freeing a hugetlb page and releasing base pages back to the underlying > allocator such as buddy or cma is performed in two steps: > - remove_hugetlb_folio() is called to remove the folio from hugetlb > lists, get a ref on the page and remove hugetlb destructor. This > all must be done under the hugetlb lock. After this call, the page > can be treated as a normal compound page or a collection of base > size pages. > - update_and_free_hugetlb_folio() is called to allocate vmemmap if > needed and the free routine of the underlying allocator is called > on the resulting page. We can not hold the hugetlb lock here. > > One issue with this scheme is that a memory error could occur between > these two steps. In this case, the memory error handling code treats > the old hugetlb page as a normal compound page or collection of base > pages. It will then try to SetPageHWPoison(page) on the page with an > error. If the page with error is a tail page without vmemmap, a write > error will occur when trying to set the flag. > > Address this issue by modifying remove_hugetlb_folio() and > update_and_free_hugetlb_folio() such that the hugetlb destructor is not > cleared until after allocating vmemmap. Since clearing the destructor > required holding the hugetlb lock, the clearing is done in > remove_hugetlb_folio() if the vmemmap is present. This saves a > lock/unlock cycle. Otherwise, destructor is cleared in > update_and_free_hugetlb_folio() after allocating vmemmap. 
> > Note that this will leave hugetlb pages in a state where they are marked > free (by hugetlb specific page flag) and have a ref count. This is not > a normal state. The only code that would notice is the memory error > code, and it is set up to retry in such a case. > > A subsequent patch will create a routine to do bulk processing of > vmemmap allocation. This will eliminate a lock/unlock cycle for each > hugetlb page in the case where we are freeing a bunch of pages. > > Fixes: ??? > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> > --- > mm/hugetlb.c | 75 +++++++++++++++++++++++++++++++++++----------------- > 1 file changed, 51 insertions(+), 24 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index d76574425da3..f7f64470aee0 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -1579,9 +1579,37 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio, > unsigned int order) { } > #endif > > +static inline void __clear_hugetlb_destructor(struct hstate *h, > + struct folio *folio) > +{ > + lockdep_assert_held(&hugetlb_lock); > + > + /* > + * Very subtle > + * > + * For non-gigantic pages set the destructor to the normal compound > + * page dtor. This is needed in case someone takes an additional > + * temporary ref to the page, and freeing is delayed until they drop > + * their reference. > + * > + * For gigantic pages set the destructor to the null dtor. This > + * destructor will never be called. Before freeing the gigantic > + * page destroy_compound_gigantic_folio will turn the folio into a > + * simple group of pages. After this the destructor does not > + * apply. > + * > + */ > + if (hstate_is_gigantic(h)) > + folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR); > + else > + folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); > +} > + > /* > - * Remove hugetlb folio from lists, and update dtor so that the folio appears > - * as just a compound page. > + * Remove hugetlb folio from lists. > + * If vmemmap exists for the folio, update dtor so that the folio appears > + * as just a compound page. Otherwise, wait until after allocating vmemmap > + * to update dtor. > * > * A reference is held on the folio, except in the case of demote. > * > @@ -1612,31 +1640,19 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio, > } > > /* > - * Very subtle > - * > - * For non-gigantic pages set the destructor to the normal compound > - * page dtor. This is needed in case someone takes an additional > - * temporary ref to the page, and freeing is delayed until they drop > - * their reference. > - * > - * For gigantic pages set the destructor to the null dtor. This > - * destructor will never be called. Before freeing the gigantic > - * page destroy_compound_gigantic_folio will turn the folio into a > - * simple group of pages. After this the destructor does not > - * apply. > - * > - * This handles the case where more than one ref is held when and > - * after update_and_free_hugetlb_folio is called. > - * > - * In the case of demote we do not ref count the page as it will soon > - * be turned into a page of smaller size. > + * We can only clear the hugetlb destructor after allocating vmemmap > + * pages. Otherwise, someone (memory error handling) may try to write > + * to tail struct pages. > + */ > + if (!folio_test_hugetlb_vmemmap_optimized(folio)) > + __clear_hugetlb_destructor(h, folio); > + > + /* > + * In the case of demote we do not ref count the page as it will soon > + * be turned into a page of smaller size. 
> */ > if (!demote) > folio_ref_unfreeze(folio, 1); > - if (hstate_is_gigantic(h)) > - folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR); > - else > - folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); > > h->nr_huge_pages--; > h->nr_huge_pages_node[nid]--; > @@ -1705,6 +1721,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h, > { > int i; > struct page *subpage; > + bool clear_dtor = folio_test_hugetlb_vmemmap_optimized(folio); Can this test on vmemmap_optimized still tell us if we should __clear_hugetlb_destructor? From my reading: 1. If a hugetlb folio is still vmemmap optimized in __remove_hugetlb_folio, __remove_hugetlb_folio won't __clear_hugetlb_destructor. 2. Then hugetlb_vmemmap_restore in dissolve_free_huge_page will clear HPG_vmemmap_optimized if it succeeds. 3. Now when dissolve_free_huge_page gets into __update_and_free_hugetlb_folio, we will see clear_dtor to be false and __clear_hugetlb_destructor won't be called. Or maybe I misunderstood, and what you really want to do is never __clear_hugetlb_destructor so that folio_test_hugetlb is always true? > > if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported()) > return; > @@ -1735,6 +1752,16 @@ static void __update_and_free_hugetlb_folio(struct hstate *h, > if (unlikely(folio_test_hwpoison(folio))) > folio_clear_hugetlb_hwpoison(folio); > > + /* > + * If vmemmap pages were allocated above, then we need to clear the > + * hugetlb destructor under the hugetlb lock. > + */ > + if (clear_dtor) { > + spin_lock_irq(&hugetlb_lock); > + __clear_hugetlb_destructor(h, folio); > + spin_unlock_irq(&hugetlb_lock); > + } > + > for (i = 0; i < pages_per_huge_page(h); i++) { > subpage = folio_page(folio, i); > subpage->flags &= ~(1 << PG_locked | 1 << PG_error | > -- > 2.41.0 >
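A minimal sketch of the three-step sequence described above (simplified, not
the actual kernel code; names and signatures are approximations, and locking
and error paths are omitted):

/*
 * Simplified sketch of the dissolve path discussed above; not actual
 * kernel code.
 */
static int dissolve_free_huge_page_sketch(struct hstate *h, struct folio *folio)
{
	/*
	 * Step 1: the folio is still HPG_vmemmap_optimized here, so the
	 * remove step skips __clear_hugetlb_destructor().
	 */
	remove_hugetlb_folio(h, folio, false);

	/*
	 * Step 2: restoring the vmemmap clears HPG_vmemmap_optimized as a
	 * side effect when it succeeds.
	 */
	if (hugetlb_vmemmap_restore(h, &folio->page))
		return -ENOMEM;

	/*
	 * Step 3: __update_and_free_hugetlb_folio() now samples the flag
	 * too late -- clear_dtor evaluates to false, so the destructor is
	 * never cleared.
	 */
	__update_and_free_hugetlb_folio(h, folio);
	return 0;
}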
On 06/22/23 17:45, Jiaqi Yan wrote: > On Tue, Jun 20, 2023 at 3:39 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > On 06/20/23 11:05, Mike Kravetz wrote: > > > On 06/19/23 17:23, Naoya Horiguchi wrote: > > > > > > > > Considering this issue as one specific to memory error handling, checking > > > > HPG_vmemmap_optimized in __get_huge_page_for_hwpoison() might be helpful to > > > > detect the race. Then, an idea like the below diff (not tested) can make > > > > try_memory_failure_hugetlb() retry (with retaking hugetlb_lock) to wait > > > > for complete the allocation of vmemmap pages. > > > > > > > > @@ -1938,8 +1938,11 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags, > > > > int ret = 2; /* fallback to normal page handling */ > > > > bool count_increased = false; > > > > > > > > - if (!folio_test_hugetlb(folio)) > > > > + if (!folio_test_hugetlb(folio)) { > > > > + if (folio_test_hugetlb_vmemmap_optimized(folio)) > > > > + ret = -EBUSY; > > > > > > The hugetlb specific page flags (HPG_vmemmap_optimized here) reside in > > > the folio->private field. > > > > > > In the case where the folio is a non-hugetlb folio, the folio->private field > > > could be any arbitrary value. As such, the test for vmemmap_optimized may > > > return a false positive. We could end up retrying for an arbitrarily > > > long time. > > > > > > I am looking at how to restructure the code which removes and frees > > > hugetlb pages so that folio_test_hugetlb() would remain true until > > > vmemmap pages are allocated. The easiest way to do this would introduce > > > another hugetlb lock/unlock cycle in the page freeing path. This would > > > undo some of the speedups in the series: > > > https://lore.kernel.org/all/20210409205254.242291-4-mike.kravetz@oracle.com/T/#m34321fbcbdf8bb35dfe083b05d445e90ecc1efab > > > > > > > Perhaps something like this? Minimal testing. > > Thanks for putting up a fix, Mike! > > > > > From e709fb4da0b6249973f9bf0540c9da0e4c585fe2 Mon Sep 17 00:00:00 2001 > > From: Mike Kravetz <mike.kravetz@oracle.com> > > Date: Tue, 20 Jun 2023 14:48:39 -0700 > > Subject: [PATCH] hugetlb: Do not clear hugetlb dtor until allocating vmemmap > > > > Freeing a hugetlb page and releasing base pages back to the underlying > > allocator such as buddy or cma is performed in two steps: > > - remove_hugetlb_folio() is called to remove the folio from hugetlb > > lists, get a ref on the page and remove hugetlb destructor. This > > all must be done under the hugetlb lock. After this call, the page > > can be treated as a normal compound page or a collection of base > > size pages. > > - update_and_free_hugetlb_folio() is called to allocate vmemmap if > > needed and the free routine of the underlying allocator is called > > on the resulting page. We can not hold the hugetlb lock here. > > > > One issue with this scheme is that a memory error could occur between > > these two steps. In this case, the memory error handling code treats > > the old hugetlb page as a normal compound page or collection of base > > pages. It will then try to SetPageHWPoison(page) on the page with an > > error. If the page with error is a tail page without vmemmap, a write > > error will occur when trying to set the flag. > > > > Address this issue by modifying remove_hugetlb_folio() and > > update_and_free_hugetlb_folio() such that the hugetlb destructor is not > > cleared until after allocating vmemmap. 
Since clearing the destructor > > required holding the hugetlb lock, the clearing is done in > > remove_hugetlb_folio() if the vmemmap is present. This saves a > > lock/unlock cycle. Otherwise, destructor is cleared in > > update_and_free_hugetlb_folio() after allocating vmemmap. > > > > Note that this will leave hugetlb pages in a state where they are marked > > free (by hugetlb specific page flag) and have a ref count. This is not > > a normal state. The only code that would notice is the memory error > > code, and it is set up to retry in such a case. > > > > A subsequent patch will create a routine to do bulk processing of > > vmemmap allocation. This will eliminate a lock/unlock cycle for each > > hugetlb page in the case where we are freeing a bunch of pages. > > > > Fixes: ??? > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> > > --- > > mm/hugetlb.c | 75 +++++++++++++++++++++++++++++++++++----------------- > > 1 file changed, 51 insertions(+), 24 deletions(-) > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index d76574425da3..f7f64470aee0 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -1579,9 +1579,37 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio, > > unsigned int order) { } > > #endif > > > > +static inline void __clear_hugetlb_destructor(struct hstate *h, > > + struct folio *folio) > > +{ > > + lockdep_assert_held(&hugetlb_lock); > > + > > + /* > > + * Very subtle > > + * > > + * For non-gigantic pages set the destructor to the normal compound > > + * page dtor. This is needed in case someone takes an additional > > + * temporary ref to the page, and freeing is delayed until they drop > > + * their reference. > > + * > > + * For gigantic pages set the destructor to the null dtor. This > > + * destructor will never be called. Before freeing the gigantic > > + * page destroy_compound_gigantic_folio will turn the folio into a > > + * simple group of pages. After this the destructor does not > > + * apply. > > + * > > + */ > > + if (hstate_is_gigantic(h)) > > + folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR); > > + else > > + folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); > > +} > > + > > /* > > - * Remove hugetlb folio from lists, and update dtor so that the folio appears > > - * as just a compound page. > > + * Remove hugetlb folio from lists. > > + * If vmemmap exists for the folio, update dtor so that the folio appears > > + * as just a compound page. Otherwise, wait until after allocating vmemmap > > + * to update dtor. > > * > > * A reference is held on the folio, except in the case of demote. > > * > > @@ -1612,31 +1640,19 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio, > > } > > > > /* > > - * Very subtle > > - * > > - * For non-gigantic pages set the destructor to the normal compound > > - * page dtor. This is needed in case someone takes an additional > > - * temporary ref to the page, and freeing is delayed until they drop > > - * their reference. > > - * > > - * For gigantic pages set the destructor to the null dtor. This > > - * destructor will never be called. Before freeing the gigantic > > - * page destroy_compound_gigantic_folio will turn the folio into a > > - * simple group of pages. After this the destructor does not > > - * apply. > > - * > > - * This handles the case where more than one ref is held when and > > - * after update_and_free_hugetlb_folio is called. 
> > -	 *
> > -	 * In the case of demote we do not ref count the page as it will soon
> > -	 * be turned into a page of smaller size.
> > +	 * We can only clear the hugetlb destructor after allocating vmemmap
> > +	 * pages. Otherwise, someone (memory error handling) may try to write
> > +	 * to tail struct pages.
> > +	 */
> > +	if (!folio_test_hugetlb_vmemmap_optimized(folio))
> > +		__clear_hugetlb_destructor(h, folio);
> > +
> > +	/*
> > +	 * In the case of demote we do not ref count the page as it will soon
> > +	 * be turned into a page of smaller size.
> >  	 */
> >  	if (!demote)
> >  		folio_ref_unfreeze(folio, 1);
> > -	if (hstate_is_gigantic(h))
> > -		folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR);
> > -	else
> > -		folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR);
> >
> >  	h->nr_huge_pages--;
> >  	h->nr_huge_pages_node[nid]--;
> > @@ -1705,6 +1721,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
> >  {
> >  	int i;
> >  	struct page *subpage;
> > +	bool clear_dtor = folio_test_hugetlb_vmemmap_optimized(folio);
> 
> Can this test on vmemmap_optimized still tell us if we should
> __clear_hugetlb_destructor? From my reading:
> 1. If a hugetlb folio is still vmemmap optimized in
> __remove_hugetlb_folio, __remove_hugetlb_folio won't
> __clear_hugetlb_destructor.
> 2. Then hugetlb_vmemmap_restore in dissolve_free_huge_page will clear
> HPG_vmemmap_optimized if it succeeds.
> 3. Now when dissolve_free_huge_page gets into
> __update_and_free_hugetlb_folio, we will see clear_dtor to be false
> and __clear_hugetlb_destructor won't be called.

Good catch! That is indeed a problem with this patch.

> 
> Or maybe I misunderstood, and what you really want to do is never
> __clear_hugetlb_destructor so that folio_test_hugetlb is always true?

No, that was a bug with this patch.

We could ALWAYS wait until __update_and_free_hugetlb_folio to clear the
hugetlb destructor. However, we have to take the hugetlb lock to clear it.
If the page does not have vmemmap optimized, then we can clear the
destructor earlier in __remove_hugetlb_folio and avoid the lock/unlock
cycle. In the past, we have had complaints about the time required to
allocate and free a large quantity of hugetlb pages. Most of that time
is spent in the low level allocators. However, I do not want to add
something like an extra lock/unlock cycle unless absolutely necessary.

I'll try to think of a cleaner and more foolproof way to address this.

IIUC, this is an existing issue. Your patch series does not depend on
this being fixed.
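For reference, the "always defer" alternative mentioned above would look
roughly like the following (a sketch only, reusing the helpers from the
patch; the point is the unconditional extra lock/unlock cycle per freed
folio):

/*
 * Sketch of the "always clear the destructor late" variant discussed
 * above (not actual kernel code).  __remove_hugetlb_folio() would never
 * touch the destructor; instead it is cleared here unconditionally,
 * after the vmemmap is writable again.
 */
static void __update_and_free_hugetlb_folio_alt(struct hstate *h,
						struct folio *folio)
{
	if (hugetlb_vmemmap_restore(h, &folio->page)) {
		/* restore failed: the folio goes back on the hugetlb lists */
		return;
	}

	/* vmemmap is present again, so tail struct pages are writable */
	spin_lock_irq(&hugetlb_lock);
	__clear_hugetlb_destructor(h, folio);
	spin_unlock_irq(&hugetlb_lock);

	/* ... poison handling and release to buddy/cma as before ... */
}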
On Thu, Jun 22, 2023 at 9:19 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 06/22/23 17:45, Jiaqi Yan wrote: > > On Tue, Jun 20, 2023 at 3:39 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > On 06/20/23 11:05, Mike Kravetz wrote: > > > > On 06/19/23 17:23, Naoya Horiguchi wrote: > > > > > > > > > > Considering this issue as one specific to memory error handling, checking > > > > > HPG_vmemmap_optimized in __get_huge_page_for_hwpoison() might be helpful to > > > > > detect the race. Then, an idea like the below diff (not tested) can make > > > > > try_memory_failure_hugetlb() retry (with retaking hugetlb_lock) to wait > > > > > for complete the allocation of vmemmap pages. > > > > > > > > > > @@ -1938,8 +1938,11 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags, > > > > > int ret = 2; /* fallback to normal page handling */ > > > > > bool count_increased = false; > > > > > > > > > > - if (!folio_test_hugetlb(folio)) > > > > > + if (!folio_test_hugetlb(folio)) { > > > > > + if (folio_test_hugetlb_vmemmap_optimized(folio)) > > > > > + ret = -EBUSY; > > > > > > > > The hugetlb specific page flags (HPG_vmemmap_optimized here) reside in > > > > the folio->private field. > > > > > > > > In the case where the folio is a non-hugetlb folio, the folio->private field > > > > could be any arbitrary value. As such, the test for vmemmap_optimized may > > > > return a false positive. We could end up retrying for an arbitrarily > > > > long time. > > > > > > > > I am looking at how to restructure the code which removes and frees > > > > hugetlb pages so that folio_test_hugetlb() would remain true until > > > > vmemmap pages are allocated. The easiest way to do this would introduce > > > > another hugetlb lock/unlock cycle in the page freeing path. This would > > > > undo some of the speedups in the series: > > > > https://lore.kernel.org/all/20210409205254.242291-4-mike.kravetz@oracle.com/T/#m34321fbcbdf8bb35dfe083b05d445e90ecc1efab > > > > > > > > > > Perhaps something like this? Minimal testing. > > > > Thanks for putting up a fix, Mike! > > > > > > > > From e709fb4da0b6249973f9bf0540c9da0e4c585fe2 Mon Sep 17 00:00:00 2001 > > > From: Mike Kravetz <mike.kravetz@oracle.com> > > > Date: Tue, 20 Jun 2023 14:48:39 -0700 > > > Subject: [PATCH] hugetlb: Do not clear hugetlb dtor until allocating vmemmap > > > > > > Freeing a hugetlb page and releasing base pages back to the underlying > > > allocator such as buddy or cma is performed in two steps: > > > - remove_hugetlb_folio() is called to remove the folio from hugetlb > > > lists, get a ref on the page and remove hugetlb destructor. This > > > all must be done under the hugetlb lock. After this call, the page > > > can be treated as a normal compound page or a collection of base > > > size pages. > > > - update_and_free_hugetlb_folio() is called to allocate vmemmap if > > > needed and the free routine of the underlying allocator is called > > > on the resulting page. We can not hold the hugetlb lock here. > > > > > > One issue with this scheme is that a memory error could occur between > > > these two steps. In this case, the memory error handling code treats > > > the old hugetlb page as a normal compound page or collection of base > > > pages. It will then try to SetPageHWPoison(page) on the page with an > > > error. If the page with error is a tail page without vmemmap, a write > > > error will occur when trying to set the flag. 
> > > > > > Address this issue by modifying remove_hugetlb_folio() and > > > update_and_free_hugetlb_folio() such that the hugetlb destructor is not > > > cleared until after allocating vmemmap. Since clearing the destructor > > > required holding the hugetlb lock, the clearing is done in > > > remove_hugetlb_folio() if the vmemmap is present. This saves a > > > lock/unlock cycle. Otherwise, destructor is cleared in > > > update_and_free_hugetlb_folio() after allocating vmemmap. > > > > > > Note that this will leave hugetlb pages in a state where they are marked > > > free (by hugetlb specific page flag) and have a ref count. This is not > > > a normal state. The only code that would notice is the memory error > > > code, and it is set up to retry in such a case. > > > > > > A subsequent patch will create a routine to do bulk processing of > > > vmemmap allocation. This will eliminate a lock/unlock cycle for each > > > hugetlb page in the case where we are freeing a bunch of pages. > > > > > > Fixes: ??? > > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> > > > --- > > > mm/hugetlb.c | 75 +++++++++++++++++++++++++++++++++++----------------- > > > 1 file changed, 51 insertions(+), 24 deletions(-) > > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > index d76574425da3..f7f64470aee0 100644 > > > --- a/mm/hugetlb.c > > > +++ b/mm/hugetlb.c > > > @@ -1579,9 +1579,37 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio, > > > unsigned int order) { } > > > #endif > > > > > > +static inline void __clear_hugetlb_destructor(struct hstate *h, > > > + struct folio *folio) > > > +{ > > > + lockdep_assert_held(&hugetlb_lock); > > > + > > > + /* > > > + * Very subtle > > > + * > > > + * For non-gigantic pages set the destructor to the normal compound > > > + * page dtor. This is needed in case someone takes an additional > > > + * temporary ref to the page, and freeing is delayed until they drop > > > + * their reference. > > > + * > > > + * For gigantic pages set the destructor to the null dtor. This > > > + * destructor will never be called. Before freeing the gigantic > > > + * page destroy_compound_gigantic_folio will turn the folio into a > > > + * simple group of pages. After this the destructor does not > > > + * apply. > > > + * > > > + */ > > > + if (hstate_is_gigantic(h)) > > > + folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR); > > > + else > > > + folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); > > > +} > > > + > > > /* > > > - * Remove hugetlb folio from lists, and update dtor so that the folio appears > > > - * as just a compound page. > > > + * Remove hugetlb folio from lists. > > > + * If vmemmap exists for the folio, update dtor so that the folio appears > > > + * as just a compound page. Otherwise, wait until after allocating vmemmap > > > + * to update dtor. > > > * > > > * A reference is held on the folio, except in the case of demote. > > > * > > > @@ -1612,31 +1640,19 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio, > > > } > > > > > > /* > > > - * Very subtle > > > - * > > > - * For non-gigantic pages set the destructor to the normal compound > > > - * page dtor. This is needed in case someone takes an additional > > > - * temporary ref to the page, and freeing is delayed until they drop > > > - * their reference. > > > - * > > > - * For gigantic pages set the destructor to the null dtor. This > > > - * destructor will never be called. 
Before freeing the gigantic > > > - * page destroy_compound_gigantic_folio will turn the folio into a > > > - * simple group of pages. After this the destructor does not > > > - * apply. > > > - * > > > - * This handles the case where more than one ref is held when and > > > - * after update_and_free_hugetlb_folio is called. > > > - * > > > - * In the case of demote we do not ref count the page as it will soon > > > - * be turned into a page of smaller size. > > > + * We can only clear the hugetlb destructor after allocating vmemmap > > > + * pages. Otherwise, someone (memory error handling) may try to write > > > + * to tail struct pages. > > > + */ > > > + if (!folio_test_hugetlb_vmemmap_optimized(folio)) > > > + __clear_hugetlb_destructor(h, folio); > > > + > > > + /* > > > + * In the case of demote we do not ref count the page as it will soon > > > + * be turned into a page of smaller size. > > > */ > > > if (!demote) > > > folio_ref_unfreeze(folio, 1); > > > - if (hstate_is_gigantic(h)) > > > - folio_set_compound_dtor(folio, NULL_COMPOUND_DTOR); > > > - else > > > - folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); > > > > > > h->nr_huge_pages--; > > > h->nr_huge_pages_node[nid]--; > > > @@ -1705,6 +1721,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h, > > > { > > > int i; > > > struct page *subpage; > > > + bool clear_dtor = folio_test_hugetlb_vmemmap_optimized(folio); > > > > Can this test on vmemmap_optimized still tell us if we should > > __clear_hugetlb_destructor? From my reading: > > 1. If a hugetlb folio is still vmemmap optimized in > > __remove_hugetlb_folio, __remove_hugetlb_folio won't > > __clear_hugetlb_destructor. > > 2. Then hugetlb_vmemmap_restore in dissolve_free_huge_page will clear > > HPG_vmemmap_optimized if it succeeds. > > 3. Now when dissolve_free_huge_page gets into > > __update_and_free_hugetlb_folio, we will see clear_dtor to be false > > and __clear_hugetlb_destructor won't be called. > > Good catch! That is indeed a problem with this patch. Glad that I could help. > > > > > Or maybe I misunderstood, and what you really want to do is never > > __clear_hugetlb_destructor so that folio_test_hugetlb is always true? > > No, that was a bug with this patch. > > We could ALWAYS wait until __update_and_free_hugetlb_folio to clear the > hugetlb destructor. However, we have to take hugetlb lock to clear it. > If the page does not have vmemmap optimized, the we can clear the > destructor earlier in __remove_hugetlb_folio and avoid the lock/unlock > cycle. In the past, we have had complaints about the time required to > allocate and free a large quantity of hugetlb pages. Most of that time > is spent in the low level allocators. However, I do not want to add > something like an extra lock/unlock cycle unless absolutely necessary. > > I'll try to think of a cleaner and more fool proof way to address this. > > IIUC, this is an existing issue. Your patch series does not depend > this being fixed. Thanks Mike, I was about to send out V2 today. > -- > Mike Kravetz > > > > > > > > > if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported()) > > > return; > > > @@ -1735,6 +1752,16 @@ static void __update_and_free_hugetlb_folio(struct hstate *h, > > > if (unlikely(folio_test_hwpoison(folio))) > > > folio_clear_hugetlb_hwpoison(folio); > > > > > > + /* > > > + * If vmemmap pages were allocated above, then we need to clear the > > > + * hugetlb destructor under the hugetlb lock. 
> > > + */ > > > + if (clear_dtor) { > > > + spin_lock_irq(&hugetlb_lock); > > > + __clear_hugetlb_destructor(h, folio); > > > + spin_unlock_irq(&hugetlb_lock); > > > + } > > > + > > > for (i = 0; i < pages_per_huge_page(h); i++) { > > > subpage = folio_page(folio, i); > > > subpage->flags &= ~(1 << PG_locked | 1 << PG_error | > > > -- > > > 2.41.0 > > >
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..f191a4119719 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3683,6 +3683,29 @@ enum mf_action_page_type {
  */
 extern const struct attribute_group memory_failure_attr_group;
 
+#ifdef CONFIG_HUGETLB_PAGE
+/*
+ * Struct raw_hwp_page represents information about "raw error page",
+ * constructing singly linked list from ->_hugetlb_hwpoison field of folio.
+ */
+struct raw_hwp_page {
+	struct llist_node node;
+	struct page *page;
+};
+
+static inline struct llist_head *raw_hwp_list_head(struct folio *folio)
+{
+	return (struct llist_head *)&folio->_hugetlb_hwpoison;
+}
+
+/*
+ * Given @subpage, a raw page in a hugepage, find its location in @folio's
+ * _hugetlb_hwpoison list. Return NULL if @subpage is not in the list.
+ */
+struct raw_hwp_page *find_raw_hwp_page(struct folio *folio,
+				       struct page *subpage);
+#endif
+
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void clear_huge_page(struct page *page,
 			    unsigned long addr_hint,
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5b663eca1f29..c49e6c2d1f07 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1818,18 +1818,24 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
 #endif /* CONFIG_FS_DAX */
 
 #ifdef CONFIG_HUGETLB_PAGE
-/*
- * Struct raw_hwp_page represents information about "raw error page",
- * constructing singly linked list from ->_hugetlb_hwpoison field of folio.
- */
-struct raw_hwp_page {
-	struct llist_node node;
-	struct page *page;
-};
 
-static inline struct llist_head *raw_hwp_list_head(struct folio *folio)
+struct raw_hwp_page *find_raw_hwp_page(struct folio *folio,
+				       struct page *subpage)
 {
-	return (struct llist_head *)&folio->_hugetlb_hwpoison;
+	struct llist_node *t, *tnode;
+	struct llist_head *raw_hwp_head = raw_hwp_list_head(folio);
+	struct raw_hwp_page *hwp_page = NULL;
+	struct raw_hwp_page *p;
+
+	llist_for_each_safe(tnode, t, raw_hwp_head->first) {
+		p = container_of(tnode, struct raw_hwp_page, node);
+		if (subpage == p->page) {
+			hwp_page = p;
+			break;
+		}
+	}
+
+	return hwp_page;
 }
 
 static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
Adds the functionality to search a subpage's corresponding raw_hwp_page
in hugetlb page's HWPOISON list. This functionality can also tell if a
subpage is a raw HWPOISON page.

Exports this functionality to be immediately used in the read operation
for hugetlbfs.

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 include/linux/mm.h  | 23 +++++++++++++++++++++++
 mm/memory-failure.c | 26 ++++++++++++++++----------
 2 files changed, 39 insertions(+), 10 deletions(-)
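As a usage illustration (a hypothetical caller, not part of this patch; only
find_raw_hwp_page() and folio_test_hwpoison() are taken from the code above),
a reader such as the hugetlbfs read path could check a subpage before copying
it roughly like this:

/*
 * Hypothetical helper sketch: decide whether a particular subpage of a
 * hugetlb folio carries a raw HWPOISON record.  Only find_raw_hwp_page()
 * comes from this patch; the wrapper itself is illustrative.
 */
static bool hugetlb_subpage_is_poisoned(struct folio *folio,
					struct page *subpage)
{
	/* Fast path: the folio has no poisoned subpages at all. */
	if (!folio_test_hwpoison(folio))
		return false;

	/* Walk the _hugetlb_hwpoison list populated by memory-failure. */
	return find_raw_hwp_page(folio, subpage) != NULL;
}

Per the commit message, the real caller is intended to live in the hugetlbfs
read path added later in this series.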