diff mbox series

[RFC] mm/page_alloc: fix counting of free pages after take off from buddy

Message ID 20210421020426.5914-1-dinghui@sangfor.com.cn (mailing list archive)
State New, archived
Headers show
Series [RFC] mm/page_alloc: fix counting of free pages after take off from buddy | expand

Commit Message

Ding Hui April 21, 2021, 2:04 a.m. UTC
Recently we found there is a lot MemFree left in /proc/meminfo after
do a lot of pages soft offline.

I think it's incorrect since NR_FREE_PAGES should not contain HWPoison pages.
After take_page_off_buddy, the page is no longer belong to buddy
allocator, and will not be used any more, but we maybe missed accounting
NR_FREE_PAGES in this situation.

Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
---
 mm/page_alloc.c | 1 +
 1 file changed, 1 insertion(+)

Comments

David Hildenbrand April 28, 2021, 2:54 p.m. UTC | #1
On 21.04.21 04:04, Ding Hui wrote:
> Recently we found there is a lot MemFree left in /proc/meminfo after
> do a lot of pages soft offline.
> 
> I think it's incorrect since NR_FREE_PAGES should not contain HWPoison pages.
> After take_page_off_buddy, the page is no longer belong to buddy
> allocator, and will not be used any more, but we maybe missed accounting
> NR_FREE_PAGES in this situation.
> 
> Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
> ---
>   mm/page_alloc.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cfc72873961d..8d65b62784d8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8947,6 +8947,7 @@ bool take_page_off_buddy(struct page *page)
>   			del_page_from_free_list(page_head, zone, page_order);
>   			break_down_buddy_pages(zone, page_head, page, 0,
>   						page_order, migratetype);
> +			__mod_zone_page_state(zone, NR_FREE_PAGES, -1);
>   			ret = true;
>   			break;
>   		}
> 

Should this use __mod_zone_freepage_state() instead?
Ding Hui April 30, 2021, 9:43 a.m. UTC | #2
On 2021/4/28 22:54, David Hildenbrand wrote:
> On 21.04.21 04:04, Ding Hui wrote:
>> Recently we found there is a lot MemFree left in /proc/meminfo after
>> do a lot of pages soft offline.
>>
>> I think it's incorrect since NR_FREE_PAGES should not contain HWPoison 
>> pages.
>> After take_page_off_buddy, the page is no longer belong to buddy
>> allocator, and will not be used any more, but we maybe missed accounting
>> NR_FREE_PAGES in this situation.
>>
>> Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
>> ---
>>   mm/page_alloc.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index cfc72873961d..8d65b62784d8 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -8947,6 +8947,7 @@ bool take_page_off_buddy(struct page *page)
>>               del_page_from_free_list(page_head, zone, page_order);
>>               break_down_buddy_pages(zone, page_head, page, 0,
>>                           page_order, migratetype);
>> +            __mod_zone_page_state(zone, NR_FREE_PAGES, -1);
>>               ret = true;
>>               break;
>>           }
>>
> 
> Should this use __mod_zone_freepage_state() instead?
> 

Yes, you're right.
I'll use it in v2.
HORIGUCHI NAOYA(堀口 直也) May 6, 2021, 2:49 a.m. UTC | #3
On Wed, Apr 28, 2021 at 04:54:59PM +0200, David Hildenbrand wrote:
> On 21.04.21 04:04, Ding Hui wrote:
> > Recently we found there is a lot MemFree left in /proc/meminfo after
> > do a lot of pages soft offline.
> > 
> > I think it's incorrect since NR_FREE_PAGES should not contain HWPoison pages.
> > After take_page_off_buddy, the page is no longer belong to buddy
> > allocator, and will not be used any more, but we maybe missed accounting
> > NR_FREE_PAGES in this situation.
> > 
> > Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
> > ---
> >   mm/page_alloc.c | 1 +
> >   1 file changed, 1 insertion(+)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index cfc72873961d..8d65b62784d8 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -8947,6 +8947,7 @@ bool take_page_off_buddy(struct page *page)
> >   			del_page_from_free_list(page_head, zone, page_order);
> >   			break_down_buddy_pages(zone, page_head, page, 0,
> >   						page_order, migratetype);
> > +			__mod_zone_page_state(zone, NR_FREE_PAGES, -1);
> >   			ret = true;
> >   			break;
> >   		}
> > 
> 
> Should this use __mod_zone_freepage_state() instead?

Yes, __mod_zone_freepage_state() looks better to me.

And I think that maybe an additional __mod_zone_freepage_state() in
unpoison_memory() is necessary to cancel the decrement.  I thought of the
following, but it doesn't build because get_pfnblock_migratetype() is
available only in mm/page_alloc.c, so you might want to add a small exported
routine in mm/page_alloc.c and let it called from unpoison_memory().

  @@ -1899,8 +1899,12 @@ int unpoison_memory(unsigned long pfn)
          }
  
          if (!get_hwpoison_page(p, flags, 0)) {
  -               if (TestClearPageHWPoison(p))
  +               if (TestClearPageHWPoison(p)) {
  +                       int migratetype = get_pfnblock_migratetype(p, pfn);
  +
                          num_poisoned_pages_dec();
  +                       __mod_zone_freepage_state(page_zone(p), 1, migratetype);
  +               }
                  unpoison_pr_info("Unpoison: Software-unpoisoned free page %#lx\n",
                                   pfn, &unpoison_rs);
                  return 0;

Thanks,
Naoya Horiguchi
Ding Hui May 6, 2021, 4:01 a.m. UTC | #4
On 2021/5/6 10:49, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Wed, Apr 28, 2021 at 04:54:59PM +0200, David Hildenbrand wrote:
>> On 21.04.21 04:04, Ding Hui wrote:
>>> Recently we found there is a lot MemFree left in /proc/meminfo after
>>> do a lot of pages soft offline.
>>>
>>> I think it's incorrect since NR_FREE_PAGES should not contain HWPoison pages.
>>> After take_page_off_buddy, the page is no longer belong to buddy
>>> allocator, and will not be used any more, but we maybe missed accounting
>>> NR_FREE_PAGES in this situation.
>>>
>>> Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
>>> ---
>>>    mm/page_alloc.c | 1 +
>>>    1 file changed, 1 insertion(+)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index cfc72873961d..8d65b62784d8 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -8947,6 +8947,7 @@ bool take_page_off_buddy(struct page *page)
>>>    			del_page_from_free_list(page_head, zone, page_order);
>>>    			break_down_buddy_pages(zone, page_head, page, 0,
>>>    						page_order, migratetype);
>>> +			__mod_zone_page_state(zone, NR_FREE_PAGES, -1);
>>>    			ret = true;
>>>    			break;
>>>    		}
>>>
>>
>> Should this use __mod_zone_freepage_state() instead?
> 
> Yes, __mod_zone_freepage_state() looks better to me.
> 
> And I think that maybe an additional __mod_zone_freepage_state() in
> unpoison_memory() is necessary to cancel the decrement.  I thought of the
> following, but it doesn't build because get_pfnblock_migratetype() is
> available only in mm/page_alloc.c, so you might want to add a small exported
> routine in mm/page_alloc.c and let it called from unpoison_memory().
> 
>    @@ -1899,8 +1899,12 @@ int unpoison_memory(unsigned long pfn)
>            }
>    
>            if (!get_hwpoison_page(p, flags, 0)) {
>    -               if (TestClearPageHWPoison(p))
>    +               if (TestClearPageHWPoison(p)) {
>    +                       int migratetype = get_pfnblock_migratetype(p, pfn);
>    +
>                            num_poisoned_pages_dec();
>    +                       __mod_zone_freepage_state(page_zone(p), 1, migratetype);
>    +               }
>                    unpoison_pr_info("Unpoison: Software-unpoisoned free page %#lx\n",
>                                     pfn, &unpoison_rs);
>                    return 0;
> 

I think there is another problem:
In normal case, we keep the last refcount of the hwpoison page, so 
get_hwpoison_page should return 1. The NR_FREE_PAGES will be adjusted 
when call put_page.
At race condition, we maybe leak the page because we does not put it 
back to buddy in unpoison_memory, however the HWPoison flag, 
num_poisoned_pages, NR_FREE_PAGES is adjusted correctly.

CPU0                        CPU1

soft_offline_page
   soft_offline_free_page
     page_handle_poison
       take_page_off_buddy
       SetPageHWPoison
                             unpoison_memory
                               if (!get_hwpoison_page(p))
                                 TestClearPageHWPoison
                                   num_poisoned_pages_dec
                                 __mod_zone_freepage_state
                                 return 0
                                 /* miss put the page back to buddy */
       page_ref_inc
       num_poisoned_pages_inc

How about do nothing and return -EBUSY (so the caller can retry) if 
unpoison a zero refcount page , or return 0 like 230ac719c500 
("mm/hwpoison: don't try to unpoison containment-failed pages") does ?

   @@ -1736,11 +1736,9 @@ int unpoison_memory(unsigned long pfn)
     }

     if (!get_hwpoison_page(p, flags, 0)) {
   -       if (TestClearPageHWPoison(p))
   -           num_poisoned_pages_dec();
   -       unpoison_pr_info("Unpoison: Software-unpoisoned free page 
%#lx\n",
   +       unpoison_pr_info("Unpoison: Software-unpoisoned zero refcount 
page %#lx\n",
   				 pfn, &unpoison_rs);
   -       return 0;
   +       return -EBUSY;
     }

     lock_page(page);
HORIGUCHI NAOYA(堀口 直也) May 6, 2021, 7:30 a.m. UTC | #5
On Thu, May 06, 2021 at 12:01:34PM +0800, Ding Hui wrote:
> On 2021/5/6 10:49, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Wed, Apr 28, 2021 at 04:54:59PM +0200, David Hildenbrand wrote:
> > > On 21.04.21 04:04, Ding Hui wrote:
> > > > Recently we found there is a lot MemFree left in /proc/meminfo after
> > > > do a lot of pages soft offline.
> > > > 
> > > > I think it's incorrect since NR_FREE_PAGES should not contain HWPoison pages.
> > > > After take_page_off_buddy, the page is no longer belong to buddy
> > > > allocator, and will not be used any more, but we maybe missed accounting
> > > > NR_FREE_PAGES in this situation.
> > > > 
> > > > Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
> > > > ---
> > > >    mm/page_alloc.c | 1 +
> > > >    1 file changed, 1 insertion(+)
> > > > 
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index cfc72873961d..8d65b62784d8 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -8947,6 +8947,7 @@ bool take_page_off_buddy(struct page *page)
> > > >    			del_page_from_free_list(page_head, zone, page_order);
> > > >    			break_down_buddy_pages(zone, page_head, page, 0,
> > > >    						page_order, migratetype);
> > > > +			__mod_zone_page_state(zone, NR_FREE_PAGES, -1);
> > > >    			ret = true;
> > > >    			break;
> > > >    		}
> > > > 
> > > 
> > > Should this use __mod_zone_freepage_state() instead?
> > 
> > Yes, __mod_zone_freepage_state() looks better to me.
> > 
> > And I think that maybe an additional __mod_zone_freepage_state() in
> > unpoison_memory() is necessary to cancel the decrement.  I thought of the
> > following, but it doesn't build because get_pfnblock_migratetype() is
> > available only in mm/page_alloc.c, so you might want to add a small exported
> > routine in mm/page_alloc.c and let it called from unpoison_memory().
> > 
> >    @@ -1899,8 +1899,12 @@ int unpoison_memory(unsigned long pfn)
> >            }
> >            if (!get_hwpoison_page(p, flags, 0)) {
> >    -               if (TestClearPageHWPoison(p))
> >    +               if (TestClearPageHWPoison(p)) {
> >    +                       int migratetype = get_pfnblock_migratetype(p, pfn);
> >    +
> >                            num_poisoned_pages_dec();
> >    +                       __mod_zone_freepage_state(page_zone(p), 1, migratetype);
> >    +               }
> >                    unpoison_pr_info("Unpoison: Software-unpoisoned free page %#lx\n",
> >                                     pfn, &unpoison_rs);
> >                    return 0;
> > 
> 
> I think there is another problem:
> In normal case, we keep the last refcount of the hwpoison page, so
> get_hwpoison_page should return 1. The NR_FREE_PAGES will be adjusted when
> call put_page.

I think that take_page_off_buddy() should not be called for this case
(the error page have remaining refcount).  So it seems that no need to
update NR_FREE_PAGES ?

> At race condition, we maybe leak the page because we does not put it back to
> buddy in unpoison_memory, however the HWPoison flag, num_poisoned_pages,
> NR_FREE_PAGES is adjusted correctly.
> 
> CPU0                        CPU1
> 
> soft_offline_page
>   soft_offline_free_page
>     page_handle_poison
>       take_page_off_buddy
>       SetPageHWPoison
>                             unpoison_memory
>                               if (!get_hwpoison_page(p))
>                                 TestClearPageHWPoison
>                                   num_poisoned_pages_dec
>                                 __mod_zone_freepage_state
>                                 return 0
>                                 /* miss put the page back to buddy */
>       page_ref_inc
>       num_poisoned_pages_inc

Thanks for checking this, unpoison_memory() is racy.  Recently we are suggesting
to introduce mf_mutex by [1].  Although this patch is not merged to mainline yet,
but it could be used to prevent the above race too.

[1] https://lore.kernel.org/linux-mm/20210427062953.2080293-2-nao.horiguchi@gmail.com/

> 
> How about do nothing and return -EBUSY (so the caller can retry) if unpoison
> a zero refcount page , or return 0 like 230ac719c500 ("mm/hwpoison: don't
> try to unpoison containment-failed pages") does ?
> 
>   @@ -1736,11 +1736,9 @@ int unpoison_memory(unsigned long pfn)
>     }
> 
>     if (!get_hwpoison_page(p, flags, 0)) {
>   -       if (TestClearPageHWPoison(p))
>   -           num_poisoned_pages_dec();
>   -       unpoison_pr_info("Unpoison: Software-unpoisoned free page %#lx\n",
>   +       unpoison_pr_info("Unpoison: Software-unpoisoned zero refcount page
> %#lx\n",
>   				 pfn, &unpoison_rs);
>   -       return 0;
>   +       return -EBUSY;

Currently unpoison_memory() does not work as reverse operation of take_page_off_buddy()
(it's simply broken), so implementing it at one time would be better.
I'll take time to fix unpoison_memory().

Thanks,
Naoya Horiguchi
Ding Hui May 7, 2021, 1:46 a.m. UTC | #6
On 2021/5/6 15:30, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Thu, May 06, 2021 at 12:01:34PM +0800, Ding Hui wrote:
>> On 2021/5/6 10:49, HORIGUCHI NAOYA(堀口 直也) wrote:
>>> On Wed, Apr 28, 2021 at 04:54:59PM +0200, David Hildenbrand wrote:
>>>> On 21.04.21 04:04, Ding Hui wrote:
>>>>> Recently we found there is a lot MemFree left in /proc/meminfo after
>>>>> do a lot of pages soft offline.
>>>>>
>>>>> I think it's incorrect since NR_FREE_PAGES should not contain HWPoison pages.
>>>>> After take_page_off_buddy, the page is no longer belong to buddy
>>>>> allocator, and will not be used any more, but we maybe missed accounting
>>>>> NR_FREE_PAGES in this situation.
>>>>>
>>>>> Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
>>>>> ---
>>>>>     mm/page_alloc.c | 1 +
>>>>>     1 file changed, 1 insertion(+)
>>>>>
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index cfc72873961d..8d65b62784d8 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -8947,6 +8947,7 @@ bool take_page_off_buddy(struct page *page)
>>>>>     			del_page_from_free_list(page_head, zone, page_order);
>>>>>     			break_down_buddy_pages(zone, page_head, page, 0,
>>>>>     						page_order, migratetype);
>>>>> +			__mod_zone_page_state(zone, NR_FREE_PAGES, -1);
>>>>>     			ret = true;
>>>>>     			break;
>>>>>     		}
>>>>>
>>>>
>>>> Should this use __mod_zone_freepage_state() instead?
>>>
>>> Yes, __mod_zone_freepage_state() looks better to me.
>>>
>>> And I think that maybe an additional __mod_zone_freepage_state() in
>>> unpoison_memory() is necessary to cancel the decrement.  I thought of the
>>> following, but it doesn't build because get_pfnblock_migratetype() is
>>> available only in mm/page_alloc.c, so you might want to add a small exported
>>> routine in mm/page_alloc.c and let it called from unpoison_memory().
>>>
>>>     @@ -1899,8 +1899,12 @@ int unpoison_memory(unsigned long pfn)
>>>             }
>>>             if (!get_hwpoison_page(p, flags, 0)) {
>>>     -               if (TestClearPageHWPoison(p))
>>>     +               if (TestClearPageHWPoison(p)) {
>>>     +                       int migratetype = get_pfnblock_migratetype(p, pfn);
>>>     +
>>>                             num_poisoned_pages_dec();
>>>     +                       __mod_zone_freepage_state(page_zone(p), 1, migratetype);
>>>     +               }
>>>                     unpoison_pr_info("Unpoison: Software-unpoisoned free page %#lx\n",
>>>                                      pfn, &unpoison_rs);
>>>                     return 0;
>>>
>>
>> I think there is another problem:
>> In normal case, we keep the last refcount of the hwpoison page, so
>> get_hwpoison_page should return 1. The NR_FREE_PAGES will be adjusted when
>> call put_page.
> 
> I think that take_page_off_buddy() should not be called for this case
> (the error page have remaining refcount).  So it seems that no need to
> update NR_FREE_PAGES ?
> 

Yes, take_page_off_buddy() only used for free pages, but we will call 
page_ref_inc() after that, on the other hand for in used pages, we 
increased the refcount by get_any_page(), so in both cases, the 
hwpoisoned pages have refcount great than zero.

I think there is no need to update NR_FREE_PAGES explicitly in 
unpoison_memory(), the put_page() will help us to update NR_FREE_PAGES 
and put the page back to buddy.

>> At race condition, we maybe leak the page because we does not put it back to
>> buddy in unpoison_memory, however the HWPoison flag, num_poisoned_pages,
>> NR_FREE_PAGES is adjusted correctly.
>>
>> CPU0                        CPU1
>>
>> soft_offline_page
>>    soft_offline_free_page
>>      page_handle_poison
>>        take_page_off_buddy
>>        SetPageHWPoison
>>                              unpoison_memory
>>                                if (!get_hwpoison_page(p))
>>                                  TestClearPageHWPoison
>>                                    num_poisoned_pages_dec
>>                                  __mod_zone_freepage_state
>>                                  return 0
>>                                  /* miss put the page back to buddy */
>>        page_ref_inc
>>        num_poisoned_pages_inc
> 
> Thanks for checking this, unpoison_memory() is racy.  Recently we are suggesting
> to introduce mf_mutex by [1].  Although this patch is not merged to mainline yet,
> but it could be used to prevent the above race too.
> 
> [1] https://lore.kernel.org/linux-mm/20210427062953.2080293-2-nao.horiguchi@gmail.com/
> 

I'll look forward to it, thanks.

>>
>> How about do nothing and return -EBUSY (so the caller can retry) if unpoison
>> a zero refcount page , or return 0 like 230ac719c500 ("mm/hwpoison: don't
>> try to unpoison containment-failed pages") does ?
>>
>>    @@ -1736,11 +1736,9 @@ int unpoison_memory(unsigned long pfn)
>>      }
>>
>>      if (!get_hwpoison_page(p, flags, 0)) {
>>    -       if (TestClearPageHWPoison(p))
>>    -           num_poisoned_pages_dec();
>>    -       unpoison_pr_info("Unpoison: Software-unpoisoned free page %#lx\n",
>>    +       unpoison_pr_info("Unpoison: Software-unpoisoned zero refcount page
>> %#lx\n",
>>    				 pfn, &unpoison_rs);
>>    -       return 0;
>>    +       return -EBUSY;
> 
> Currently unpoison_memory() does not work as reverse operation of take_page_off_buddy()
> (it's simply broken), so implementing it at one time would be better.
> I'll take time to fix unpoison_memory().
> 

Thanks for your work.
Actually, I'm not sure about the exactly meaning of "broken", it seems 
that the basic function of unpoison_memory() is ok if not considered the 
race conditions.
diff mbox series

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cfc72873961d..8d65b62784d8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8947,6 +8947,7 @@  bool take_page_off_buddy(struct page *page)
 			del_page_from_free_list(page_head, zone, page_order);
 			break_down_buddy_pages(zone, page_head, page, 0,
 						page_order, migratetype);
+			__mod_zone_page_state(zone, NR_FREE_PAGES, -1);
 			ret = true;
 			break;
 		}