
[rfc,3/4] mm, page_alloc: avoid expensive reclaim when compaction may not succeed

Message ID alpine.DEB.2.21.1909041253390.94813@chino.kir.corp.google.com (mailing list archive)
State New, archived
Series revert immediate fallback to remote hugepages

Commit Message

David Rientjes Sept. 4, 2019, 7:54 p.m. UTC
Memory compaction has a couple significant drawbacks as the allocation
order increases, specifically:

 - isolate_freepages() is responsible for finding free pages to use as
   migration targets and is implemented as a linear scan of memory
   starting at the end of a zone,

 - failing order-0 watermark checks in memory compaction does not account
   for how far below the watermarks the zone actually is: to enable
   migration, there must be *some* free memory available.  Per the above,
   watermarks are not always sufficient if isolate_freepages() cannot
   find the free memory, yet it could require hundreds of MBs of reclaim
   to even reach this threshold (read: potentially very expensive reclaim
   with no indication compaction can be successful; see the sketch after
   this list), and

 - if compaction at this order has failed recently so that it does not even
   run as a result of deferred compaction, looping through reclaim can often
   be pointless.
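
For reference, the order-0 gate the second point describes is roughly
the following (a simplified sketch modeled on __compaction_suitable() in
mm/compaction.c; the exact watermark selection and helper signatures
vary by kernel version):

	/*
	 * Compaction bails out with COMPACT_SKIPPED when the zone lacks
	 * order-0 pages to use as migration targets.  The check is
	 * pass/fail: it says nothing about *how far* below the
	 * watermark the zone is, so reclaim may have a very long way
	 * to go before compaction can even start.
	 */
	watermark = low_wmark_pages(zone) + compact_gap(order);
	if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
				 alloc_flags,
				 zone_page_state(zone, NR_FREE_PAGES)))
		return COMPACT_SKIPPED;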

For hugepage allocations, these are quite substantial drawbacks because
these are very high order allocations (order-9 on x86) and falling back to
doing reclaim can potentially be *very* expensive without any indication
that compaction would even be successful.

Reclaim itself is unlikely to free entire pageblocks and certainly no
reliance should be put on it to do so in isolation (recall lumpy reclaim).
This means we should avoid reclaim and simply fail hugepage allocation if
compaction is deferred.

It is also not helpful to thrash a zone by doing excessive reclaim if
compaction may not be able to access that memory.  If order-0 watermarks
fail and the allocation order is sufficiently large, it is likely better
to fail the allocation rather than thrashing the zone.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

Comments

Michal Hocko Sept. 5, 2019, 9 a.m. UTC | #1
[Ccing Mike for checking on the hugetlb side of this change]

On Wed 04-09-19 12:54:22, David Rientjes wrote:
> Memory compaction has a couple significant drawbacks as the allocation
> order increases, specifically:
> 
>  - isolate_freepages() is responsible for finding free pages to use as
>    migration targets and is implemented as a linear scan of memory
>    starting at the end of a zone,
> 
>  - failing order-0 watermark checks in memory compaction does not account
>    for how far below the watermarks the zone actually is: to enable
>    migration, there must be *some* free memory available.  Per the above,
>    watermarks are not always sufficient if isolate_freepages() cannot
>    find the free memory but it could require hundreds of MBs of reclaim to
>    even reach this threshold (read: potentially very expensive reclaim with
>    no indication compaction can be successful), and
> 
>  - if compaction at this order has failed recently so that it does not even
>    run as a result of deferred compaction, looping through reclaim can often
>    be pointless.
> 
> For hugepage allocations, these are quite substantial drawbacks because
> these are very high order allocations (order-9 on x86) and falling back to
> doing reclaim can potentially be *very* expensive without any indication
> that compaction would even be successful.
> 
> Reclaim itself is unlikely to free entire pageblocks and certainly no
> reliance should be put on it to do so in isolation (recall lumpy reclaim).
> This means we should avoid reclaim and simply fail hugepage allocation if
> compaction is deferred.
> 
> It is also not helpful to thrash a zone by doing excessive reclaim if
> compaction may not be able to access that memory.  If order-0 watermarks
> fail and the allocation order is sufficiently large, it is likely better
> to fail the allocation rather than thrashing the zone.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/page_alloc.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4458,6 +4458,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		if (page)
>  			goto got_pg;
>  
> +		if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
> +			/*
> +			 * If allocating entire pageblock(s) and compaction
> +			 * failed because all zones are below low watermarks
> +			 * or is prohibited because it recently failed at this
> +			 * order, fail immediately.
> +			 *
> +			 * Reclaim is
> +			 *  - potentially very expensive because zones are far
> +			 *    below their low watermarks or this is part of very
> +			 *    bursty high order allocations,
> +			 *  - not guaranteed to help because isolate_freepages()
> +			 *    may not iterate over freed pages as part of its
> +			 *    linear scan, and
> +			 *  - unlikely to make entire pageblocks free on its
> +			 *    own.
> +			 */
> +			if (compact_result == COMPACT_SKIPPED ||
> +			    compact_result == COMPACT_DEFERRED)
> +				goto nopage;
> +		}
> +
>  		/*
>  		 * Checks for costly allocations with __GFP_NORETRY, which
>  		 * includes THP page fault allocations
Vlastimil Babka Sept. 5, 2019, 11:22 a.m. UTC | #2
On 9/5/19 11:00 AM, Michal Hocko wrote:
> [Ccing Mike for checking on the hugetlb side of this change]
> 
> On Wed 04-09-19 12:54:22, David Rientjes wrote:
>> Memory compaction has a couple significant drawbacks as the allocation
>> order increases, specifically:
>>
>>  - isolate_freepages() is responsible for finding free pages to use as
>>    migration targets and is implemented as a linear scan of memory
>>    starting at the end of a zone,

Note that's no longer entirely true, see fast_isolate_freepages().

>>  - failing order-0 watermark checks in memory compaction does not account
>>    for how far below the watermarks the zone actually is: to enable
>>    migration, there must be *some* free memory available.  Per the above,
>>    watermarks are not always sufficient if isolate_freepages() cannot
>>    find the free memory but it could require hundreds of MBs of reclaim to
>>    even reach this threshold (read: potentially very expensive reclaim with
>>    no indication compaction can be successful), and

I doubt it's hundreds of MBs for a 2MB hugepage.
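
For scale: assuming compact_gap(order) is still 2UL << order, a 2MB
hugepage on x86 (order-9, 4KB base pages) needs compact_gap(9) = 1024
free pages, i.e. only ~4MB above the watermark, before compaction
considers the zone suitable.  Hundreds of MBs of reclaim would only be
needed if the zone were already that far below its low watermark.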

>>  - if compaction at this order has failed recently so that it does not even
>>    run as a result of deferred compaction, looping through reclaim can often
>>    be pointless.

Agreed.

>> For hugepage allocations, these are quite substantial drawbacks because
>> these are very high order allocations (order-9 on x86) and falling back to
>> doing reclaim can potentially be *very* expensive without any indication
>> that compaction would even be successful.

You seem to lump together hugetlbfs and THP here, by saying "hugepage",
but these are very different things - hugetlbfs reservations are
expected to be potentially expensive.

>> Reclaim itself is unlikely to free entire pageblocks and certainly no
>> reliance should be put on it to do so in isolation (recall lumpy reclaim).
>> This means we should avoid reclaim and simply fail hugepage allocation if
>> compaction is deferred.

It is however possible that reclaim frees enough to make even a
previously deferred compaction succeed.

>> It is also not helpful to thrash a zone by doing excessive reclaim if
>> compaction may not be able to access that memory.  If order-0 watermarks
>> fail and the allocation order is sufficiently large, it is likely better
>> to fail the allocation rather than thrashing the zone.
>>
>> Signed-off-by: David Rientjes <rientjes@google.com>
>> ---
>>  mm/page_alloc.c | 22 ++++++++++++++++++++++
>>  1 file changed, 22 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4458,6 +4458,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>  		if (page)
>>  			goto got_pg;
>>  
>> +		if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
>> +			/*
>> +			 * If allocating entire pageblock(s) and compaction
>> +			 * failed because all zones are below low watermarks
>> +			 * or is prohibited because it recently failed at this
>> +			 * order, fail immediately.
>> +			 *
>> +			 * Reclaim is
>> +			 *  - potentially very expensive because zones are far
>> +			 *    below their low watermarks or this is part of very
>> +			 *    bursty high order allocations,
>> +			 *  - not guaranteed to help because isolate_freepages()
>> +			 *    may not iterate over freed pages as part of its
>> +			 *    linear scan, and
>> +			 *  - unlikely to make entire pageblocks free on its
>> +			 *    own.
>> +			 */
>> +			if (compact_result == COMPACT_SKIPPED ||
>> +			    compact_result == COMPACT_DEFERRED)
>> +				goto nopage;

As I said, I expect this will make hugetlbfs reservations fail
prematurely - Mike can probably confirm or disprove that.
I think it also addresses the consequences, not the primary problem.
I believe the primary problem is that we reclaim something even if
there's enough memory for compaction. This won't change with your patch,
as compact_result won't be SKIPPED in that case. Then we continue
through to __alloc_pages_direct_reclaim() and shrink_zones(), which will
call compaction_ready(); that only returns true and skips reclaim of
the zone if there are high_watermark (!!!) + compact_gap() pages. But as
long as one zone isn't compaction_ready(), we enter shrink_node(), which
will reclaim something and call should_continue_reclaim(), where we
might finally notice that compaction_suitable() returns COMPACT_CONTINUE
and abort reclaim.
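
Roughly, the check looks like this (a trimmed sketch of
compaction_ready() in mm/vmscan.c as of this kernel; see the source for
the full comments):

	static inline bool compaction_ready(struct zone *zone,
					    struct scan_control *sc)
	{
		unsigned long watermark;
		enum compact_result suitable;

		suitable = compaction_suitable(zone, sc->order, 0,
					       sc->reclaim_idx);
		if (suitable == COMPACT_SUCCESS)
			/* allocation should already succeed, don't reclaim */
			return true;
		if (suitable == COMPACT_SKIPPED)
			/* compaction can't run yet, so reclaim */
			return false;

		/*
		 * Otherwise reclaim is skipped only above the *high*
		 * watermark plus compact_gap(), a stricter bar than
		 * compaction itself requires to run.
		 */
		watermark = high_wmark_pages(zone) + compact_gap(sc->order);
		return zone_watermark_ok_safe(zone, 0, watermark,
					      sc->reclaim_idx);
	}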

Thus I think the right solution might be to really avoid reclaim for
zones where compaction is not skipped, while your patch avoids reclaim
when compaction is skipped. The per-node reclaim vs per-zone compaction
might complicate those decisions a lot, though.

>> +		}
>> +
>>  		/*
>>  		 * Checks for costly allocations with __GFP_NORETRY, which
>>  		 * includes THP page fault allocations
>
Mike Kravetz Sept. 5, 2019, 8:53 p.m. UTC | #3
On 9/5/19 4:22 AM, Vlastimil Babka wrote:
> On 9/5/19 11:00 AM, Michal Hocko wrote:
>> [Ccing Mike for checking on the hugetlb side of this change]
>> On Wed 04-09-19 12:54:22, David Rientjes wrote:
>>> @@ -4458,6 +4458,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>>  		if (page)
>>>  			goto got_pg;
>>>  
>>> +		if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
>>> +			/*
>>> +			 * If allocating entire pageblock(s) and compaction
>>> +			 * failed because all zones are below low watermarks
>>> +			 * or is prohibited because it recently failed at this
>>> +			 * order, fail immediately.
>>> +			 *
>>> +			 * Reclaim is
>>> +			 *  - potentially very expensive because zones are far
>>> +			 *    below their low watermarks or this is part of very
>>> +			 *    bursty high order allocations,
>>> +			 *  - not guaranteed to help because isolate_freepages()
>>> +			 *    may not iterate over freed pages as part of its
>>> +			 *    linear scan, and
>>> +			 *  - unlikely to make entire pageblocks free on its
>>> +			 *    own.
>>> +			 */
>>> +			if (compact_result == COMPACT_SKIPPED ||
>>> +			    compact_result == COMPACT_DEFERRED)
>>> +				goto nopage;
> 
> As I said, I expect this will make hugetlbfs reservations fail
> prematurely - Mike can probably confirm or disprove that.

I don't have a specific test for this.  It is somewhat common for people
to want to allocate "as many hugetlb pages as possible".  Therefore, they
will try to allocate more pages than reasonable for their environment and
take what they can get.  I 'tested' by simply creating some background
activity and then seeing how many hugetlb pages could be allocated.  Of
course, many tries over time in a loop.

This patch did not cause premature allocation failures in my limited testing.
The number of pages which could be allocated with and without patch were
pretty much the same.

Do note that I tested on top of Andrew's tree which contains this series:
http://lkml.kernel.org/r/20190806014744.15446-1-mike.kravetz@oracle.com
Patch 3 in that series causes allocations to fail sooner in the case of
COMPACT_DEFERRED:
http://lkml.kernel.org/r/20190806014744.15446-4-mike.kravetz@oracle.com

hugetlb allocations have the __GFP_RETRY_MAYFAIL flag set.  They are willing
to retry and wait and callers are aware of this.  Even though my limited
testing did not show regressions caused by this patch, I would prefer if the
quick exit did not apply to __GFP_RETRY_MAYFAIL requests.
David Rientjes Sept. 6, 2019, 8:16 p.m. UTC | #4
On Thu, 5 Sep 2019, Mike Kravetz wrote:

> I don't have a specific test for this.  It is somewhat common for people
> to want to allocate "as many hugetlb pages as possible".  Therefore, they
> will try to allocate more pages than reasonable for their environment and
> take what they can get.  I 'tested' by simply creating some background
> activity and then seeing how many hugetlb pages could be allocated.  Of
> course, many tries over time in a loop.
> 
> This patch did not cause premature allocation failures in my limited testing.
> The number of pages which could be allocated with and without patch were
> pretty much the same.
> 
> Do note that I tested on top of Andrew's tree which contains this series:
> http://lkml.kernel.org/r/20190806014744.15446-1-mike.kravetz@oracle.com
> Patch 3 in that series causes allocations to fail sooner in the case of
> COMPACT_DEFERRED:
> http://lkml.kernel.org/r/20190806014744.15446-4-mike.kravetz@oracle.com
> 
> hugetlb allocations have the __GFP_RETRY_MAYFAIL flag set.  They are willing
> to retry and wait and callers are aware of this.  Even though my limited
> testing did not show regressions caused by this patch, I would prefer if the
> quick exit did not apply to __GFP_RETRY_MAYFAIL requests.

Good!  I think that is the ideal way of handling it: for this patch, we
can let hugetlb allocations specifically opt in to looping and retrying
(but still eventually failing) by testing for __GFP_RETRY_MAYFAIL, as
sketched below.
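
Concretely, the gate would become something like the following (a
hypothetical sketch of that change on top of this patch, not a posted
diff):

	/*
	 * __GFP_RETRY_MAYFAIL callers such as hugetlb have asked to
	 * loop and retry, so let them fall through to reclaim.
	 */
	if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
	    !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
		if (compact_result == COMPACT_SKIPPED ||
		    compact_result == COMPACT_DEFERRED)
			goto nopage;
	}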

I can add that to the formal proposal of patches 3 and 4 in this series 
assuming we get 5.3 settled by applying the reverts in patches 1 and 2 so 
that we don't cause various versions of Linux to have different default 
and madvise allocation policies wrt NUMA.
David Rientjes Sept. 6, 2019, 8:49 p.m. UTC | #5
On Thu, 5 Sep 2019, Vlastimil Babka wrote:

> >>  - failing order-0 watermark checks in memory compaction does not account
> >>    for how far below the watermarks the zone actually is: to enable
> >>    migration, there must be *some* free memory available.  Per the above,
> >>    watermarks are not always sufficient if isolate_freepages() cannot
> >>    find the free memory but it could require hundreds of MBs of reclaim to
> >>    even reach this threshold (read: potentially very expensive reclaim with
> >>    no indication compaction can be successful), and
> 
> I doubt it's hundreds of MBs for a 2MB hugepage.
> 

I'm not sure how you presume to know; we certainly have incidents where
compaction is skipped because free pages are 100MB+ under low
watermarks.

> >> For hugepage allocations, these are quite substantial drawbacks because
> >> these are very high order allocations (order-9 on x86) and falling back to
> >> doing reclaim can potentially be *very* expensive without any indication
> >> that compaction would even be successful.
> 
> You seem to lump together hugetlbfs and THP here, by saying "hugepage",
> but these are very different things - hugetlbfs reservations are
> expected to be potentially expensive.
> 

Mike Kravetz followed up, and I can make a simple change to this fix to
only run the new logic if the allocation is not using __GFP_RETRY_MAYFAIL,
which would exclude hugetlb allocations and still cover transparent
hugepage allocations.

> >> Reclaim itself is unlikely to free entire pageblocks and certainly no
> >> reliance should be put on it to do so in isolation (recall lumpy reclaim).
> >> This means we should avoid reclaim and simply fail hugepage allocation if
> >> compaction is deferred.
> 
> It is however possible that reclaim frees enough to make even a
> previously deferred compaction succeed.
> 

This is another way the return value we get from memory compaction can
be improved, since right now we only check compaction_deferred() at the
priorities we care about.  This discussion has revealed several areas
where we can get more reliable and actionable return values from memory
compaction to implement a sane default policy in the page allocator that
works for everybody.
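
For reference, the deferral gate in question is roughly the following
(a simplified sketch of compaction_deferred() in mm/compaction.c;
tracing and the counter overflow clamp are omitted):

	/* defer for 1 << compact_defer_shift attempts after a failure */
	bool compaction_deferred(struct zone *zone, int order)
	{
		unsigned long defer_limit = 1UL << zone->compact_defer_shift;

		/* deferral only applies at or above the failed order */
		if (order < zone->compact_order_failed)
			return false;

		if (++zone->compact_considered >= defer_limit)
			return false;

		return true;
	}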

> >> It is also not helpful to thrash a zone by doing excessive reclaim if
> >> compaction may not be able to access that memory.  If order-0 watermarks
> >> fail and the allocation order is sufficiently large, it is likely better
> >> to fail the allocation rather than thrashing the zone.
> >>
> >> Signed-off-by: David Rientjes <rientjes@google.com>
> >> ---
> >>  mm/page_alloc.c | 22 ++++++++++++++++++++++
> >>  1 file changed, 22 insertions(+)
> >>
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -4458,6 +4458,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >>  		if (page)
> >>  			goto got_pg;
> >>  
> >> +		if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
> >> +			/*
> >> +			 * If allocating entire pageblock(s) and compaction
> >> +			 * failed because all zones are below low watermarks
> >> +			 * or is prohibited because it recently failed at this
> >> +			 * order, fail immediately.
> >> +			 *
> >> +			 * Reclaim is
> >> +			 *  - potentially very expensive because zones are far
> >> +			 *    below their low watermarks or this is part of very
> >> +			 *    bursty high order allocations,
> >> +			 *  - not guaranteed to help because isolate_freepages()
> >> +			 *    may not iterate over freed pages as part of its
> >> +			 *    linear scan, and
> >> +			 *  - unlikely to make entire pageblocks free on its
> >> +			 *    own.
> >> +			 */
> >> +			if (compact_result == COMPACT_SKIPPED ||
> >> +			    compact_result == COMPACT_DEFERRED)
> >> +				goto nopage;
> 
> As I said, I expect this will make hugetlbfs reservations fail
> prematurely - Mike can probably confirm or disprove that.
> I think it also addresses consequences, not the primary problem, IMHO.
> I believe the primary problem is that we reclaim something even if
> there's enough memory for compaction. This won't change with your patch,
> as compact_result won't be SKIPPED in that case. 

I'm relying only on Andrea's one-line feedback saying that this would
address the swap storms that he is reporting; more details on why it
doesn't, if it doesn't, would definitely be helpful.

> Then we continue
> through to __alloc_pages_direct_reclaim(), shrink_zones() which will
> call compaction_ready(), which will only return true and skip reclaim of
> the zone, if there's high_watermark (!!!) + compact_gap() pages.

Interesting find; that heuristic certainly doesn't appear consistent.
It's another thing to add to the list of ways the memory compaction,
direct reclaim, and page allocator feedback loop can be improved to
provide sane default behavior for everybody.

If you'd like to send a patch to address this issue specifically, that 
would be very helpful!  I'm hoping Andrea can test it.

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4458,6 +4458,28 @@  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		if (page)
 			goto got_pg;
 
+		if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
+			/*
+			 * If allocating entire pageblock(s) and compaction
+			 * failed because all zones are below low watermarks
+			 * or is prohibited because it recently failed at this
+			 * order, fail immediately.
+			 *
+			 * Reclaim is
+			 *  - potentially very expensive because zones are far
+			 *    below their low watermarks or this is part of very
+			 *    bursty high order allocations,
+			 *  - not guaranteed to help because isolate_freepages()
+			 *    may not iterate over freed pages as part of its
+			 *    linear scan, and
+			 *  - unlikely to make entire pageblocks free on its
+			 *    own.
+			 */
+			if (compact_result == COMPACT_SKIPPED ||
+			    compact_result == COMPACT_DEFERRED)
+				goto nopage;
+		}
+
 		/*
 		 * Checks for costly allocations with __GFP_NORETRY, which
 		 * includes THP page fault allocations