
[RFC,0/3] Make deferred split shrinker memcg aware

Message ID: 1559047464-59838-1-git-send-email-yang.shi@linux.alibaba.com


Yang Shi May 28, 2019, 12:44 p.m. UTC
I got some reports from our internal application team about memcg OOM.
Even though the application has been killed by the oom killer, there are
still a lot of THPs left behind, and page reclaim doesn't reclaim them at
all.

Some investigation shows they are on the deferred split queue; memcg direct
reclaim can't shrink them since the THP deferred split shrinker is not memcg
aware, which may cause premature OOM in a memcg.  The issue can be
reproduced easily by the below test:

$ cgcreate -g memory:thp
$ echo 4G > /sys/fs/cgroup/memory/thp/memory.limit_in_bytes
$ cgexec -g memory:thp ./transhuge-stress 4000

transhuge-stress comes from the kernel selftests.

It is easy to hit OOM, but there are still a lot of THPs on the deferred
split queue; memcg direct reclaim can't touch them since the deferred split
shrinker is not memcg aware.

Make the deferred split shrinker memcg aware by introducing a per-memcg
deferred split queue.  A THP goes on the per-memcg deferred split queue if
it is charged to a memcg, otherwise on the per-node queue.  When a page is
migrated to another memcg, it is moved to the target memcg's deferred split
queue as well.
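
The gist, as a rough sketch (simplified; the struct and helper names may
differ from the actual patches): a small struct bundles the split queue
lock, list and length, it is embedded in both struct pglist_data and
struct mem_cgroup, and a helper picks the right queue for a given THP:

struct deferred_split {
        spinlock_t split_queue_lock;
        struct list_head split_queue;
        unsigned long split_queue_len;
};

static struct deferred_split *get_deferred_split_queue(struct page *page)
{
        struct mem_cgroup *memcg = compound_head(page)->mem_cgroup;
        struct pglist_data *pgdat = NODE_DATA(page_to_nid(page));

        /* charged THPs go on the memcg's queue, the rest on the node's */
        if (memcg)
                return &memcg->deferred_split_queue;
        else
                return &pgdat->deferred_split_queue;
}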

Also, in the page free path, delete the THP from the deferred split queue
before the memcg uncharge so that the page's memcg information is still
available.
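
The ordering matters because the uncharge clears page->mem_cgroup; a
simplified sketch of the helper the free path would call right before
mem_cgroup_uncharge() (the helper name is illustrative):

static void del_thp_from_deferred_split_queue(struct page *page)
{
        struct deferred_split *ds_queue = get_deferred_split_queue(page);
        unsigned long flags;

        spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
        if (!list_empty(page_deferred_list(page))) {
                ds_queue->split_queue_len--;
                list_del_init(page_deferred_list(page));
        }
        spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
}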

Reuse the second tail page's deferred_list for the per-memcg list since the
same THP can't be on multiple deferred split queues at the same time.
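
For reference, the existing page_deferred_list() helper (roughly) just
returns the list head stored in the second tail page; the same list head is
simply linked into whichever queue (per node or per memcg) the THP
currently belongs to:

static inline struct list_head *page_deferred_list(struct page *page)
{
        /* ->lru of the second tail page is reused as deferred_list */
        return &page[2].deferred_list;
}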

Remove the THP-specific destructor since it is no longer used with the
memcg-aware THP shrinker (please see the commit log of patch 2/3 for the
details).

Make the deferred split shrinker not depend on memcg kmem since THPs are
not slab objects.  It doesn't make sense to skip shrinking THPs just
because memcg kmem is disabled.
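
A sketch of what the shrinker registration could look like (the new flag
name, SHRINKER_NONSLAB, is illustrative); with such a flag the memcg-aware
shrinking path is taken even when kmem accounting is disabled:

static struct shrinker deferred_split_shrinker = {
        .count_objects = deferred_split_count,
        .scan_objects = deferred_split_scan,
        .seeks = DEFAULT_SEEKS,
        .flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE |
                 SHRINKER_NONSLAB,
};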

With the above changes, the test demonstrated above doesn't trigger OOM
anymore, even with cgroup.memory=nokmem.


Yang Shi (3):
      mm: thp: make deferred split shrinker memcg aware
      mm: thp: remove THP destructor
      mm: shrinker: make shrinker not depend on memcg kmem

 include/linux/huge_mm.h    |  24 +++++++++
 include/linux/memcontrol.h |   6 +++
 include/linux/mm.h         |   3 --
 include/linux/mm_types.h   |   7 ++-
 include/linux/shrinker.h   |   3 +-
 mm/huge_memory.c           | 181 ++++++++++++++++++++++++++++++++++++++++++++++++-------------------
 mm/memcontrol.c            |  20 ++++++++
 mm/page_alloc.c            |   3 --
 mm/swap.c                  |   4 ++
 mm/vmscan.c                |  27 +++-------
 10 files changed, 198 insertions(+), 80 deletions(-)


David Rientjes May 29, 2019, 1:22 a.m. UTC | #1
On Tue, 28 May 2019, Yang Shi wrote:

> 
> I got some reports from our internal application team about memcg OOM.
> Even though the application has been killed by the oom killer, there are
> still a lot of THPs left behind, and page reclaim doesn't reclaim them at
> all.
> 
> Some investigation shows they are on the deferred split queue; memcg direct
> reclaim can't shrink them since the THP deferred split shrinker is not memcg
> aware, which may cause premature OOM in a memcg.  The issue can be
> reproduced easily by the below test:
> 

Right, we've also encountered this.  I talked to Kirill about it a week or
so ago; the suggestion was to split all compound pages on the deferred
split queues whenever there is any memory pressure.

That breaks cgroup isolation and perhaps unfairly penalizes workloads that 
are running attached to other memcg hierarchies that are not under 
pressure because their compound pages are now split as a side effect.  
There is a benefit to keeping these compound pages around while not under 
memory pressure if all pages are subsequently mapped again.

> $ cgcreate -g memory:thp
> $ echo 4G > /sys/fs/cgroup/memory/thp/memory.limit_in_bytes
> $ cgexec -g memory:thp ./transhuge-stress 4000
> 
> transhuge-stress comes from the kernel selftests.
> 
> It is easy to hit OOM, but there are still a lot of THPs on the deferred
> split queue; memcg direct reclaim can't touch them since the deferred split
> shrinker is not memcg aware.
> 

Yes, we have seen this on at least 4.15 as well.

> Make the deferred split shrinker memcg aware by introducing a per-memcg
> deferred split queue.  A THP goes on the per-memcg deferred split queue if
> it is charged to a memcg, otherwise on the per-node queue.  When a page is
> migrated to another memcg, it is moved to the target memcg's deferred split
> queue as well.
>
> Also, in the page free path, delete the THP from the deferred split queue
> before the memcg uncharge so that the page's memcg information is still
> available.
>
> Reuse the second tail page's deferred_list for the per-memcg list since the
> same THP can't be on multiple deferred split queues at the same time.
>
> Remove the THP-specific destructor since it is no longer used with the
> memcg-aware THP shrinker (please see the commit log of patch 2/3 for the
> details).
>
> Make the deferred split shrinker not depend on memcg kmem since THPs are
> not slab objects.  It doesn't make sense to skip shrinking THPs just
> because memcg kmem is disabled.
>
> With the above changes, the test demonstrated above doesn't trigger OOM
> anymore, even with cgroup.memory=nokmem.
> 

I'm curious if your internal applications team is also asking for 
statistics on how much memory can be freed if the deferred split queues 
can be shrunk?  We have applications that monitor their own memory usage 
through memcg stats or usage and proactively try to reduce that usage when 
it is growing too large.  The deferred split queues have significantly 
increased both memcg usage and rss when they've upgraded kernels.

How are your applications monitoring how much memory from deferred split 
queues can be freed on memory pressure?  Any thoughts on providing it as a 
memcg stat?

Thanks!
Yang Shi May 29, 2019, 2:34 a.m. UTC | #2
On 5/29/19 9:22 AM, David Rientjes wrote:
> On Tue, 28 May 2019, Yang Shi wrote:
>
>> I got some reports from our internal application team about memcg OOM.
>> Even though the application has been killed by the oom killer, there are
>> still a lot of THPs left behind, and page reclaim doesn't reclaim them at
>> all.
>>
>> Some investigation shows they are on the deferred split queue; memcg direct
>> reclaim can't shrink them since the THP deferred split shrinker is not memcg
>> aware, which may cause premature OOM in a memcg.  The issue can be
>> reproduced easily by the below test:
>>
> Right, we've also encountered this.  I talked to Kirill about it a week or
> so ago; the suggestion was to split all compound pages on the deferred
> split queues whenever there is any memory pressure.
>
> That breaks cgroup isolation and perhaps unfairly penalizes workloads that
> are running attached to other memcg hierarchies that are not under
> pressure because their compound pages are now split as a side effect.
> There is a benefit to keeping these compound pages around while not under
> memory pressure if all pages are subsequently mapped again.

Yes, I do agree. I tried other approaches too; it seems making the deferred
split queue per memcg is the optimal one.

>
>> $ cgcreate -g memory:thp
>> $ echo 4G > /sys/fs/cgroup/memory/thp/memory.limit_in_bytes
>> $ cgexec -g memory:thp ./transhuge-stress 4000
>>
>> transhuge-stress comes from the kernel selftests.
>>
>> It is easy to hit OOM, but there are still a lot of THPs on the deferred
>> split queue; memcg direct reclaim can't touch them since the deferred split
>> shrinker is not memcg aware.
>>
> Yes, we have seen this on at least 4.15 as well.
>
>> Make the deferred split shrinker memcg aware by introducing a per-memcg
>> deferred split queue.  A THP goes on the per-memcg deferred split queue if
>> it is charged to a memcg, otherwise on the per-node queue.  When a page is
>> migrated to another memcg, it is moved to the target memcg's deferred split
>> queue as well.
>>
>> Also, in the page free path, delete the THP from the deferred split queue
>> before the memcg uncharge so that the page's memcg information is still
>> available.
>>
>> Reuse the second tail page's deferred_list for the per-memcg list since the
>> same THP can't be on multiple deferred split queues at the same time.
>>
>> Remove the THP-specific destructor since it is no longer used with the
>> memcg-aware THP shrinker (please see the commit log of patch 2/3 for the
>> details).
>>
>> Make the deferred split shrinker not depend on memcg kmem since THPs are
>> not slab objects.  It doesn't make sense to skip shrinking THPs just
>> because memcg kmem is disabled.
>>
>> With the above changes, the test demonstrated above doesn't trigger OOM
>> anymore, even with cgroup.memory=nokmem.
>>
> I'm curious if your internal applications team is also asking for
> statistics on how much memory can be freed if the deferred split queues
> can be shrunk?  We have applications that monitor their own memory usage

No, but this reminds me: the THPs on the deferred split queue should be
counted as available memory too.

> through memcg stats or usage and proactively try to reduce that usage when
> it is growing too large.  The deferred split queues have significantly
> increased both memcg usage and rss when they've upgraded kernels.
>
> How are your applications monitoring how much memory from deferred split
> queues can be freed on memory pressure?  Any thoughts on providing it as a
> memcg stat?

I don't think they have such monitoring. I saw rss_huge was abnormal in the
memcg stats even after the application was killed by the oom killer, so I
realized the deferred split queue may play a role here.

The memcg stats don't have counters for available memory like the global
vmstat does. It may be better to have such statistics, or to extend
reclaimable "slab" to shrinkable/reclaimable "memory".

>
> Thanks!
David Rientjes May 29, 2019, 9:07 p.m. UTC | #3
On Wed, 29 May 2019, Yang Shi wrote:

> > Right, we've also encountered this.  I talked to Kirill about it a week or
> > so ago; the suggestion was to split all compound pages on the deferred
> > split queues whenever there is any memory pressure.
> > 
> > That breaks cgroup isolation and perhaps unfairly penalizes workloads that
> > are running attached to other memcg hierarchies that are not under
> > pressure because their compound pages are now split as a side effect.
> > There is a benefit to keeping these compound pages around while not under
> > memory pressure if all pages are subsequently mapped again.
> 
> Yes, I do agree. I tried other approaches too; it seems making the deferred
> split queue per memcg is the optimal one.
> 

The approach we went with was to track the actual counts of compound
pages on the deferred split queue for each pgdat for each memcg and then
invoke the shrinker for memcg reclaim and iterate those not charged to the
hierarchy under reclaim.  That's suboptimal and was a stopgap measure
under time pressure: it's refreshing to see the optimal method being
pursued, thanks!

> > I'm curious if your internal applications team is also asking for
> > statistics on how much memory can be freed if the deferred split queues
> > can be shrunk?  We have applications that monitor their own memory usage
> 
> No, but this reminds me: the THPs on the deferred split queue should be
> counted as available memory too.
> 

Right, and we have also seen this for users of MADV_FREE that have both an 
increased rss and memcg usage that don't realize that the memory is freed 
under pressure.  I'm thinking that we need some kind of MemAvailable for 
memcg hierarchies to be the authoritative source of what can be reclaimed 
under pressure.

> > through memcg stats or usage and proactively try to reduce that usage when
> > it is growing too large.  The deferred split queues have significantly
> > increased both memcg usage and rss when they've upgraded kernels.
> > 
> > How are your applications monitoring how much memory from deferred split
> > queues can be freed on memory pressure?  Any thoughts on providing it as a
> > memcg stat?
> 
> I don't think they have such monitoring. I saw rss_huge was abnormal in the
> memcg stats even after the application was killed by the oom killer, so I
> realized the deferred split queue may play a role here.
> 

Exactly the same in my case :)  We were likely looking at the exact same 
issue at the same time.

> The memcg stats don't have counters for available memory like the global
> vmstat does. It may be better to have such statistics, or to extend
> reclaimable "slab" to shrinkable/reclaimable "memory".
> 

Have you considered following how NR_ANON_MAPPED is tracked for each pgdat
and using that as an indicator of when to modify a memcg stat to track
the amount of memory on a compound page?  I think this would be necessary
for userspace to know what their true memory usage is.
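
For illustration only, something along these lines (the stat name below is
made up) could keep a per-memcg count of memory sitting on the deferred
split queues, in the same spirit as the NR_ANON_MAPPED accounting; this is
a simplified deferred_split_huge_page() with a hypothetical stat update
added:

void deferred_split_huge_page(struct page *page)
{
        struct deferred_split *ds_queue = get_deferred_split_queue(page);
        unsigned long flags;

        spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
        if (list_empty(page_deferred_list(page))) {
                list_add_tail(page_deferred_list(page), &ds_queue->split_queue);
                ds_queue->split_queue_len++;
                /* hypothetical per-memcg counter, in pages */
                mod_memcg_page_state(page, NR_DEFERRED_SPLIT_THPS, HPAGE_PMD_NR);
        }
        spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
}
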
Yang Shi May 30, 2019, 3:22 a.m. UTC | #4
On 5/30/19 5:07 AM, David Rientjes wrote:
> On Wed, 29 May 2019, Yang Shi wrote:
>
>>> Right, we've also encountered this.  I talked to Kirill about it a week or
>>> so ago; the suggestion was to split all compound pages on the deferred
>>> split queues whenever there is any memory pressure.
>>>
>>> That breaks cgroup isolation and perhaps unfairly penalizes workloads that
>>> are running attached to other memcg hierarchies that are not under
>>> pressure because their compound pages are now split as a side effect.
>>> There is a benefit to keeping these compound pages around while not under
>>> memory pressure if all pages are subsequently mapped again.
>> Yes, I do agree. I tried other approaches too; it seems making the deferred
>> split queue per memcg is the optimal one.
>>
> The approach we went with was to track the actual counts of compound
> pages on the deferred split queue for each pgdat for each memcg and then
> invoke the shrinker for memcg reclaim and iterate those not charged to the
> hierarchy under reclaim.  That's suboptimal and was a stopgap measure
> under time pressure: it's refreshing to see the optimal method being
> pursued, thanks!

We did exactly the same thing as a temporary hotfix.

>
>>> I'm curious if your internal applications team is also asking for
>>> statistics on how much memory can be freed if the deferred split queues
>>> can be shrunk?  We have applications that monitor their own memory usage
>> No, but this reminds me: the THPs on the deferred split queue should be
>> counted as available memory too.
>>
> Right, and we have also seen this for users of MADV_FREE that have both an
> increased rss and memcg usage that don't realize that the memory is freed
> under pressure.  I'm thinking that we need some kind of MemAvailable for
> memcg hierarchies to be the authoritative source of what can be reclaimed
> under pressure.

It sounds useful. We also need to know the available memory at memcg scope
in our containers.

>
>>> through memcg stats or usage and proactively try to reduce that usage when
>>> it is growing too large.  The deferred split queues have significantly
>>> increased both memcg usage and rss when they've upgraded kernels.
>>>
>>> How are your applications monitoring how much memory from deferred split
>>> queues can be freed on memory pressure?  Any thoughts on providing it as a
>>> memcg stat?
>> I don't think they have such monitoring. I saw rss_huge was abnormal in the
>> memcg stats even after the application was killed by the oom killer, so I
>> realized the deferred split queue may play a role here.
>>
> Exactly the same in my case :)  We were likely looking at the exact same
> issue at the same time.

Yes, it seems so. :-)

>> The memcg stats don't have counters for available memory like the global
>> vmstat does. It may be better to have such statistics, or to extend
>> reclaimable "slab" to shrinkable/reclaimable "memory".
>>
> Have you considered following how NR_ANON_MAPPED is tracked for each pgdat
> and using that as an indicator of when to modify a memcg stat to track
> the amount of memory on a compound page?  I think this would be necessary
> for userspace to know what their true memory usage is.

No, I haven't. Do you mean subtracting MADV_FREE and deferred split THPs
from NR_ANON_MAPPED? It looks like they have already been subtracted from
NR_ANON_MAPPED when the rmap is removed.