diff mbox series

mm/memcg: Free percpu stats memory of dying memcg's

Message ID 20220421145845.1044652-1-longman@redhat.com (mailing list archive)
State New
Headers show
Series mm/memcg: Free percpu stats memory of dying memcg's | expand

Commit Message

Waiman Long April 21, 2022, 2:58 p.m. UTC
For systems with large number of CPUs, the majority of the memory
consumed by the mem_cgroup structure is actually the percpu stats
memory. When a large number of memory cgroups are continuously created
and destroyed (like in a container host), it is possible that more
and more mem_cgroup structures remained in the dying state holding up
increasing amount of percpu memory.

We can't free up the memory of the dying mem_cgroup structure due to
active references in some other places. However, the percpu stats memory
allocated to that mem_cgroup is a different story.

This patch adds a new percpu_stats_disabled variable to keep track of
the state of the percpu stats memory. If the variable is set, percpu
stats update will be disabled for that particular memcg. All the stats
update will be forward to its parent instead. Reading of the its percpu
stats will return 0.

The flushing and freeing of the percpu stats memory is a multi-step
process. The percpu_stats_disabled variable is set when the memcg is
being set to offline state. After a grace period with the help of RCU,
the percpu stats data are flushed and then freed.

This will greatly reduce the amount of memory held up by dying memory
cgroups.

By running a simple management tool for container 2000 times per test
run, below are the results of increases of percpu memory (as reported
in /proc/meminfo) and nr_dying_descendants in root's cgroup.stat.

  Unpatched kernel:

  Run    Percpu Memory Increase    nr_dying_descendants Increase
  ---    ----------------------    -----------------------------
   1           76032 kB                       209
   2           74112 kB                       208
   3           74112 kB                       207

  Patched kernel:

  Run    Percpu Memory Increase    nr_dying_descendants Increase
  ---    ----------------------    -----------------------------
   1            4224 kB                       207
   2            1536 kB                       209
   3               0 kB                       208

The x86-64 test system has 2 sockets and 96 threads. For the compiled
kernel, memcg_vmstats_percpu and lruvec_stats_percpu have a size of
2424 and 656 bytes respectively.  The mem_cgroup structure is just a
bit over 4k.  That translates to a percpu stats data size of 350.25
kB per memcg. That means 72852 kB of percpu data for 208 memcg's. This
matches pretty well with the unpatched kernel figures listed above.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/memcontrol.h |  13 ++++
 mm/memcontrol.c            | 122 +++++++++++++++++++++++++++++++------
 2 files changed, 115 insertions(+), 20 deletions(-)

Comments

Roman Gushchin April 21, 2022, 4:33 p.m. UTC | #1
On Thu, Apr 21, 2022 at 10:58:45AM -0400, Waiman Long wrote:
> For systems with large number of CPUs, the majority of the memory
> consumed by the mem_cgroup structure is actually the percpu stats
> memory. When a large number of memory cgroups are continuously created
> and destroyed (like in a container host), it is possible that more
> and more mem_cgroup structures remained in the dying state holding up
> increasing amount of percpu memory.
> 
> We can't free up the memory of the dying mem_cgroup structure due to
> active references in some other places. However, the percpu stats memory
> allocated to that mem_cgroup is a different story.
> 
> This patch adds a new percpu_stats_disabled variable to keep track of
> the state of the percpu stats memory. If the variable is set, percpu
> stats update will be disabled for that particular memcg. All the stats
> update will be forward to its parent instead. Reading of the its percpu
> stats will return 0.
> 
> The flushing and freeing of the percpu stats memory is a multi-step
> process. The percpu_stats_disabled variable is set when the memcg is
> being set to offline state. After a grace period with the help of RCU,
> the percpu stats data are flushed and then freed.
> 
> This will greatly reduce the amount of memory held up by dying memory
> cgroups.
> 
> By running a simple management tool for container 2000 times per test
> run, below are the results of increases of percpu memory (as reported
> in /proc/meminfo) and nr_dying_descendants in root's cgroup.stat.

Hi Waiman!

I've been proposing the same idea some time ago:
https://lore.kernel.org/all/20190312223404.28665-7-guro@fb.com/T/ .

However I dropped it with the thinking that with many other fixes
preventing the accumulation of the dying cgroups it's not worth the added
complexity and a potential cpu overhead.

I think it ultimately comes to the number of dying cgroups. If it's low,
memory savings are not worth the cpu overhead. If it's high, they are.
I hope long-term to drive it down significantly (with lru-pages reparenting
being the first major milestone), but it might take a while.

I don't have a strong opinion either way, just want to dump my thoughts
on this.

Thanks!
Waiman Long April 21, 2022, 5:28 p.m. UTC | #2
On 4/21/22 12:33, Roman Gushchin wrote:
> On Thu, Apr 21, 2022 at 10:58:45AM -0400, Waiman Long wrote:
>> For systems with large number of CPUs, the majority of the memory
>> consumed by the mem_cgroup structure is actually the percpu stats
>> memory. When a large number of memory cgroups are continuously created
>> and destroyed (like in a container host), it is possible that more
>> and more mem_cgroup structures remained in the dying state holding up
>> increasing amount of percpu memory.
>>
>> We can't free up the memory of the dying mem_cgroup structure due to
>> active references in some other places. However, the percpu stats memory
>> allocated to that mem_cgroup is a different story.
>>
>> This patch adds a new percpu_stats_disabled variable to keep track of
>> the state of the percpu stats memory. If the variable is set, percpu
>> stats update will be disabled for that particular memcg. All the stats
>> update will be forward to its parent instead. Reading of the its percpu
>> stats will return 0.
>>
>> The flushing and freeing of the percpu stats memory is a multi-step
>> process. The percpu_stats_disabled variable is set when the memcg is
>> being set to offline state. After a grace period with the help of RCU,
>> the percpu stats data are flushed and then freed.
>>
>> This will greatly reduce the amount of memory held up by dying memory
>> cgroups.
>>
>> By running a simple management tool for container 2000 times per test
>> run, below are the results of increases of percpu memory (as reported
>> in /proc/meminfo) and nr_dying_descendants in root's cgroup.stat.
> Hi Waiman!
>
> I've been proposing the same idea some time ago:
> https://lore.kernel.org/all/20190312223404.28665-7-guro@fb.com/T/ .
>
> However I dropped it with the thinking that with many other fixes
> preventing the accumulation of the dying cgroups it's not worth the added
> complexity and a potential cpu overhead.
>
> I think it ultimately comes to the number of dying cgroups. If it's low,
> memory savings are not worth the cpu overhead. If it's high, they are.
> I hope long-term to drive it down significantly (with lru-pages reparenting
> being the first major milestone), but it might take a while.
>
> I don't have a strong opinion either way, just want to dump my thoughts
> on this.

I have quite a number of customer cases complaining about increasing 
percpu memory usages. The number of dying memcg's can go to tens of 
thousands. From my own investigation, I believe that those dying memcg's 
are not freed because they are pinned down by references in the page 
structure. I am aware that we support the use of objcg in the page 
structure which will allow easy reparenting, but most pages don't do 
that and it is not easy to do this conversion and it may take quite a 
while to do that.

In term of overhead, it is mostly one more memory read from the 
mem_cgroup structure in the update path. I don't expect there will be 
that many updates when the memcg is in an offline state as updates will 
be slower in this case. Freeing the dying memcg will take a bit longer 
though, but its impact on the overall system performance should still be 
negligible.

I am also thinking about using a static_key for turning it on only for 
systems with more than, say, 20 cpus as the percpu memory overhead 
increases linearly with the number of possible cpus.

Any other suggestions and improvements are welcome.

Cheers,
Longman
Roman Gushchin April 21, 2022, 5:59 p.m. UTC | #3
On Thu, Apr 21, 2022 at 01:28:20PM -0400, Waiman Long wrote:
> 
> On 4/21/22 12:33, Roman Gushchin wrote:
> > On Thu, Apr 21, 2022 at 10:58:45AM -0400, Waiman Long wrote:
> > > For systems with large number of CPUs, the majority of the memory
> > > consumed by the mem_cgroup structure is actually the percpu stats
> > > memory. When a large number of memory cgroups are continuously created
> > > and destroyed (like in a container host), it is possible that more
> > > and more mem_cgroup structures remained in the dying state holding up
> > > increasing amount of percpu memory.
> > > 
> > > We can't free up the memory of the dying mem_cgroup structure due to
> > > active references in some other places. However, the percpu stats memory
> > > allocated to that mem_cgroup is a different story.
> > > 
> > > This patch adds a new percpu_stats_disabled variable to keep track of
> > > the state of the percpu stats memory. If the variable is set, percpu
> > > stats update will be disabled for that particular memcg. All the stats
> > > update will be forward to its parent instead. Reading of the its percpu
> > > stats will return 0.
> > > 
> > > The flushing and freeing of the percpu stats memory is a multi-step
> > > process. The percpu_stats_disabled variable is set when the memcg is
> > > being set to offline state. After a grace period with the help of RCU,
> > > the percpu stats data are flushed and then freed.
> > > 
> > > This will greatly reduce the amount of memory held up by dying memory
> > > cgroups.
> > > 
> > > By running a simple management tool for container 2000 times per test
> > > run, below are the results of increases of percpu memory (as reported
> > > in /proc/meminfo) and nr_dying_descendants in root's cgroup.stat.
> > Hi Waiman!
> > 
> > I've been proposing the same idea some time ago:
> > https://lore.kernel.org/all/20190312223404.28665-7-guro@fb.com/T/ .
> > 
> > However I dropped it with the thinking that with many other fixes
> > preventing the accumulation of the dying cgroups it's not worth the added
> > complexity and a potential cpu overhead.
> > 
> > I think it ultimately comes to the number of dying cgroups. If it's low,
> > memory savings are not worth the cpu overhead. If it's high, they are.
> > I hope long-term to drive it down significantly (with lru-pages reparenting
> > being the first major milestone), but it might take a while.
> > 
> > I don't have a strong opinion either way, just want to dump my thoughts
> > on this.
> 
> I have quite a number of customer cases complaining about increasing percpu
> memory usages. The number of dying memcg's can go to tens of thousands. From
> my own investigation, I believe that those dying memcg's are not freed
> because they are pinned down by references in the page structure. I am aware
> that we support the use of objcg in the page structure which will allow easy
> reparenting, but most pages don't do that and it is not easy to do this
> conversion and it may take quite a while to do that.

The big question is whether there is a memory pressure on those systems.
If yes, and the number of dying cgroups is growing, it's worth investigating.
It might be due to the sharing of pagecache pages and this will be ultimately
fixed with implementing of the pagecache reparenting. But it also might be due
to other bugs, which are fixable, so it would be great to understand.

So if there is a memory pressure and dying cgroups are still accumulating,
we need to investigate and fix it.

If there is (almost) no memory pressure, it's a proactive reclaim question.
There are several discussions and projects going on in this area.

Releasing percpu memory is more a workaround of the problem rather than fix.
In the end, if we're accumulating dying cgroups, we're still leaking memory,
just at a smaller pace (which is good, of course).

Thanks!
Waiman Long April 21, 2022, 6:46 p.m. UTC | #4
On 4/21/22 13:59, Roman Gushchin wrote:
> On Thu, Apr 21, 2022 at 01:28:20PM -0400, Waiman Long wrote:
>> On 4/21/22 12:33, Roman Gushchin wrote:
>>> On Thu, Apr 21, 2022 at 10:58:45AM -0400, Waiman Long wrote:
>>>> For systems with large number of CPUs, the majority of the memory
>>>> consumed by the mem_cgroup structure is actually the percpu stats
>>>> memory. When a large number of memory cgroups are continuously created
>>>> and destroyed (like in a container host), it is possible that more
>>>> and more mem_cgroup structures remained in the dying state holding up
>>>> increasing amount of percpu memory.
>>>>
>>>> We can't free up the memory of the dying mem_cgroup structure due to
>>>> active references in some other places. However, the percpu stats memory
>>>> allocated to that mem_cgroup is a different story.
>>>>
>>>> This patch adds a new percpu_stats_disabled variable to keep track of
>>>> the state of the percpu stats memory. If the variable is set, percpu
>>>> stats update will be disabled for that particular memcg. All the stats
>>>> update will be forward to its parent instead. Reading of the its percpu
>>>> stats will return 0.
>>>>
>>>> The flushing and freeing of the percpu stats memory is a multi-step
>>>> process. The percpu_stats_disabled variable is set when the memcg is
>>>> being set to offline state. After a grace period with the help of RCU,
>>>> the percpu stats data are flushed and then freed.
>>>>
>>>> This will greatly reduce the amount of memory held up by dying memory
>>>> cgroups.
>>>>
>>>> By running a simple management tool for container 2000 times per test
>>>> run, below are the results of increases of percpu memory (as reported
>>>> in /proc/meminfo) and nr_dying_descendants in root's cgroup.stat.
>>> Hi Waiman!
>>>
>>> I've been proposing the same idea some time ago:
>>> https://lore.kernel.org/all/20190312223404.28665-7-guro@fb.com/T/ .
>>>
>>> However I dropped it with the thinking that with many other fixes
>>> preventing the accumulation of the dying cgroups it's not worth the added
>>> complexity and a potential cpu overhead.
>>>
>>> I think it ultimately comes to the number of dying cgroups. If it's low,
>>> memory savings are not worth the cpu overhead. If it's high, they are.
>>> I hope long-term to drive it down significantly (with lru-pages reparenting
>>> being the first major milestone), but it might take a while.
>>>
>>> I don't have a strong opinion either way, just want to dump my thoughts
>>> on this.
>> I have quite a number of customer cases complaining about increasing percpu
>> memory usages. The number of dying memcg's can go to tens of thousands. From
>> my own investigation, I believe that those dying memcg's are not freed
>> because they are pinned down by references in the page structure. I am aware
>> that we support the use of objcg in the page structure which will allow easy
>> reparenting, but most pages don't do that and it is not easy to do this
>> conversion and it may take quite a while to do that.
> The big question is whether there is a memory pressure on those systems.
> If yes, and the number of dying cgroups is growing, it's worth investigating.
> It might be due to the sharing of pagecache pages and this will be ultimately
> fixed with implementing of the pagecache reparenting. But it also might be due
> to other bugs, which are fixable, so it would be great to understand.


Pagecache reparenting will probably fix the problem that I have seen. Is 
someone working on this?


> So if there is a memory pressure and dying cgroups are still accumulating,
> we need to investigate and fix it.
>
> If there is (almost) no memory pressure, it's a proactive reclaim question.
> There are several discussions and projects going on in this area.


As more and more memory are pinned down by dying memcg's, there is just 
less memory available for other useful works. So it is an issue from the 
user point of view. I am not sure how much memory pressure the customers 
have, but they certainly are not happy about that.


> Releasing percpu memory is more a workaround of the problem rather than fix.
> In the end, if we're accumulating dying cgroups, we're still leaking memory,
> just at a smaller pace (which is good, of course).

I agree that it is a workaround. However, without pagecache reparenting, 
userspace applications may need to be modified to minimize the chance of 
leaving behind dying memcg's. It is not easy either.

Cheers,
Longman
Muchun Song April 22, 2022, 2:29 a.m. UTC | #5
On Thu, Apr 21, 2022 at 02:46:00PM -0400, Waiman Long wrote:
> On 4/21/22 13:59, Roman Gushchin wrote:
> > On Thu, Apr 21, 2022 at 01:28:20PM -0400, Waiman Long wrote:
> > > On 4/21/22 12:33, Roman Gushchin wrote:
> > > > On Thu, Apr 21, 2022 at 10:58:45AM -0400, Waiman Long wrote:
> > > > > For systems with large number of CPUs, the majority of the memory
> > > > > consumed by the mem_cgroup structure is actually the percpu stats
> > > > > memory. When a large number of memory cgroups are continuously created
> > > > > and destroyed (like in a container host), it is possible that more
> > > > > and more mem_cgroup structures remained in the dying state holding up
> > > > > increasing amount of percpu memory.
> > > > > 
> > > > > We can't free up the memory of the dying mem_cgroup structure due to
> > > > > active references in some other places. However, the percpu stats memory
> > > > > allocated to that mem_cgroup is a different story.
> > > > > 
> > > > > This patch adds a new percpu_stats_disabled variable to keep track of
> > > > > the state of the percpu stats memory. If the variable is set, percpu
> > > > > stats update will be disabled for that particular memcg. All the stats
> > > > > update will be forward to its parent instead. Reading of the its percpu
> > > > > stats will return 0.
> > > > > 
> > > > > The flushing and freeing of the percpu stats memory is a multi-step
> > > > > process. The percpu_stats_disabled variable is set when the memcg is
> > > > > being set to offline state. After a grace period with the help of RCU,
> > > > > the percpu stats data are flushed and then freed.
> > > > > 
> > > > > This will greatly reduce the amount of memory held up by dying memory
> > > > > cgroups.
> > > > > 
> > > > > By running a simple management tool for container 2000 times per test
> > > > > run, below are the results of increases of percpu memory (as reported
> > > > > in /proc/meminfo) and nr_dying_descendants in root's cgroup.stat.
> > > > Hi Waiman!
> > > > 
> > > > I've been proposing the same idea some time ago:
> > > > https://lore.kernel.org/all/20190312223404.28665-7-guro@fb.com/T/ .
> > > > 
> > > > However I dropped it with the thinking that with many other fixes
> > > > preventing the accumulation of the dying cgroups it's not worth the added
> > > > complexity and a potential cpu overhead.
> > > > 
> > > > I think it ultimately comes to the number of dying cgroups. If it's low,
> > > > memory savings are not worth the cpu overhead. If it's high, they are.
> > > > I hope long-term to drive it down significantly (with lru-pages reparenting
> > > > being the first major milestone), but it might take a while.
> > > > 
> > > > I don't have a strong opinion either way, just want to dump my thoughts
> > > > on this.
> > > I have quite a number of customer cases complaining about increasing percpu
> > > memory usages. The number of dying memcg's can go to tens of thousands. From
> > > my own investigation, I believe that those dying memcg's are not freed
> > > because they are pinned down by references in the page structure. I am aware
> > > that we support the use of objcg in the page structure which will allow easy
> > > reparenting, but most pages don't do that and it is not easy to do this
> > > conversion and it may take quite a while to do that.
> > The big question is whether there is a memory pressure on those systems.
> > If yes, and the number of dying cgroups is growing, it's worth investigating.
> > It might be due to the sharing of pagecache pages and this will be ultimately
> > fixed with implementing of the pagecache reparenting. But it also might be due
> > to other bugs, which are fixable, so it would be great to understand.
> 
> 
> Pagecache reparenting will probably fix the problem that I have seen. Is
> someone working on this?
>

We also encountered dying cgroup issue on our servers for a long time.
I have worked on this for a while and proposed a resolution [1] based
on obj_cgroup APIs to charge the LRU pages.

[1] https://lore.kernel.org/all/20220216115132.52602-1-songmuchun@bytedance.com/

Thanks.
Roman Gushchin April 22, 2022, 2:59 a.m. UTC | #6
On Thu, Apr 21, 2022 at 02:46:00PM -0400, Waiman Long wrote:
> On 4/21/22 13:59, Roman Gushchin wrote:
> > On Thu, Apr 21, 2022 at 01:28:20PM -0400, Waiman Long wrote:
> > > On 4/21/22 12:33, Roman Gushchin wrote:
> > > > On Thu, Apr 21, 2022 at 10:58:45AM -0400, Waiman Long wrote:
> > > > > For systems with large number of CPUs, the majority of the memory
> > > > > consumed by the mem_cgroup structure is actually the percpu stats
> > > > > memory. When a large number of memory cgroups are continuously created
> > > > > and destroyed (like in a container host), it is possible that more
> > > > > and more mem_cgroup structures remained in the dying state holding up
> > > > > increasing amount of percpu memory.
> > > > > 
> > > > > We can't free up the memory of the dying mem_cgroup structure due to
> > > > > active references in some other places. However, the percpu stats memory
> > > > > allocated to that mem_cgroup is a different story.
> > > > > 
> > > > > This patch adds a new percpu_stats_disabled variable to keep track of
> > > > > the state of the percpu stats memory. If the variable is set, percpu
> > > > > stats update will be disabled for that particular memcg. All the stats
> > > > > update will be forward to its parent instead. Reading of the its percpu
> > > > > stats will return 0.
> > > > > 
> > > > > The flushing and freeing of the percpu stats memory is a multi-step
> > > > > process. The percpu_stats_disabled variable is set when the memcg is
> > > > > being set to offline state. After a grace period with the help of RCU,
> > > > > the percpu stats data are flushed and then freed.
> > > > > 
> > > > > This will greatly reduce the amount of memory held up by dying memory
> > > > > cgroups.
> > > > > 
> > > > > By running a simple management tool for container 2000 times per test
> > > > > run, below are the results of increases of percpu memory (as reported
> > > > > in /proc/meminfo) and nr_dying_descendants in root's cgroup.stat.
> > > > Hi Waiman!
> > > > 
> > > > I've been proposing the same idea some time ago:
> > > > https://lore.kernel.org/all/20190312223404.28665-7-guro@fb.com/T/ .
> > > > 
> > > > However I dropped it with the thinking that with many other fixes
> > > > preventing the accumulation of the dying cgroups it's not worth the added
> > > > complexity and a potential cpu overhead.
> > > > 
> > > > I think it ultimately comes to the number of dying cgroups. If it's low,
> > > > memory savings are not worth the cpu overhead. If it's high, they are.
> > > > I hope long-term to drive it down significantly (with lru-pages reparenting
> > > > being the first major milestone), but it might take a while.
> > > > 
> > > > I don't have a strong opinion either way, just want to dump my thoughts
> > > > on this.
> > > I have quite a number of customer cases complaining about increasing percpu
> > > memory usages. The number of dying memcg's can go to tens of thousands. From
> > > my own investigation, I believe that those dying memcg's are not freed
> > > because they are pinned down by references in the page structure. I am aware
> > > that we support the use of objcg in the page structure which will allow easy
> > > reparenting, but most pages don't do that and it is not easy to do this
> > > conversion and it may take quite a while to do that.
> > The big question is whether there is a memory pressure on those systems.
> > If yes, and the number of dying cgroups is growing, it's worth investigating.
> > It might be due to the sharing of pagecache pages and this will be ultimately
> > fixed with implementing of the pagecache reparenting. But it also might be due
> > to other bugs, which are fixable, so it would be great to understand.
> 
> 
> Pagecache reparenting will probably fix the problem that I have seen. Is
> someone working on this?

Some time ago Muchun posted patches based on the reusing of the obj_cgroup API.

I'm not strictly against this approach, but in my opinion it's not the best.
I suggested to use lru vectors as an intermediate objects. In theory, it might
allow to avoid bumping reference counters for all charged pages at all: live
cgroups will be protected by being live, dying cgroups will only need
a temporarily protection while lru vectors and associated pages are reparenting.

There are pros and cons:
+ cgroup reference counting becomes simpler and more debuggable
+ potential perf wins from fewer operations with live cgroups css refcounters
= I hope to see code simplifications (but not guaranteed)
- deleting cgroups becomes more expensive, but the cost can be spread to
  asynchronous workers

Idk if Muchun tried to implement it. If not, I might try myself.

Thanks!
Muchun Song April 22, 2022, 10:27 a.m. UTC | #7
On Thu, Apr 21, 2022 at 07:59:00PM -0700, Roman Gushchin wrote:
> On Thu, Apr 21, 2022 at 02:46:00PM -0400, Waiman Long wrote:
> > On 4/21/22 13:59, Roman Gushchin wrote:
> > > On Thu, Apr 21, 2022 at 01:28:20PM -0400, Waiman Long wrote:
> > > > On 4/21/22 12:33, Roman Gushchin wrote:
> > > > > On Thu, Apr 21, 2022 at 10:58:45AM -0400, Waiman Long wrote:
> > > > > > For systems with large number of CPUs, the majority of the memory
> > > > > > consumed by the mem_cgroup structure is actually the percpu stats
> > > > > > memory. When a large number of memory cgroups are continuously created
> > > > > > and destroyed (like in a container host), it is possible that more
> > > > > > and more mem_cgroup structures remained in the dying state holding up
> > > > > > increasing amount of percpu memory.
> > > > > > 
> > > > > > We can't free up the memory of the dying mem_cgroup structure due to
> > > > > > active references in some other places. However, the percpu stats memory
> > > > > > allocated to that mem_cgroup is a different story.
> > > > > > 
> > > > > > This patch adds a new percpu_stats_disabled variable to keep track of
> > > > > > the state of the percpu stats memory. If the variable is set, percpu
> > > > > > stats update will be disabled for that particular memcg. All the stats
> > > > > > update will be forward to its parent instead. Reading of the its percpu
> > > > > > stats will return 0.
> > > > > > 
> > > > > > The flushing and freeing of the percpu stats memory is a multi-step
> > > > > > process. The percpu_stats_disabled variable is set when the memcg is
> > > > > > being set to offline state. After a grace period with the help of RCU,
> > > > > > the percpu stats data are flushed and then freed.
> > > > > > 
> > > > > > This will greatly reduce the amount of memory held up by dying memory
> > > > > > cgroups.
> > > > > > 
> > > > > > By running a simple management tool for container 2000 times per test
> > > > > > run, below are the results of increases of percpu memory (as reported
> > > > > > in /proc/meminfo) and nr_dying_descendants in root's cgroup.stat.
> > > > > Hi Waiman!
> > > > > 
> > > > > I've been proposing the same idea some time ago:
> > > > > https://lore.kernel.org/all/20190312223404.28665-7-guro@fb.com/T/ .
> > > > > 
> > > > > However I dropped it with the thinking that with many other fixes
> > > > > preventing the accumulation of the dying cgroups it's not worth the added
> > > > > complexity and a potential cpu overhead.
> > > > > 
> > > > > I think it ultimately comes to the number of dying cgroups. If it's low,
> > > > > memory savings are not worth the cpu overhead. If it's high, they are.
> > > > > I hope long-term to drive it down significantly (with lru-pages reparenting
> > > > > being the first major milestone), but it might take a while.
> > > > > 
> > > > > I don't have a strong opinion either way, just want to dump my thoughts
> > > > > on this.
> > > > I have quite a number of customer cases complaining about increasing percpu
> > > > memory usages. The number of dying memcg's can go to tens of thousands. From
> > > > my own investigation, I believe that those dying memcg's are not freed
> > > > because they are pinned down by references in the page structure. I am aware
> > > > that we support the use of objcg in the page structure which will allow easy
> > > > reparenting, but most pages don't do that and it is not easy to do this
> > > > conversion and it may take quite a while to do that.
> > > The big question is whether there is a memory pressure on those systems.
> > > If yes, and the number of dying cgroups is growing, it's worth investigating.
> > > It might be due to the sharing of pagecache pages and this will be ultimately
> > > fixed with implementing of the pagecache reparenting. But it also might be due
> > > to other bugs, which are fixable, so it would be great to understand.
> > 
> > 
> > Pagecache reparenting will probably fix the problem that I have seen. Is
> > someone working on this?
> 
> Some time ago Muchun posted patches based on the reusing of the obj_cgroup API.
>

Yep. It is here:

https://lore.kernel.org/all/20220216115132.52602-1-songmuchun@bytedance.com/.
 
> I'm not strictly against this approach, but in my opinion it's not the best.
> I suggested to use lru vectors as an intermediate objects. In theory, it might

I remember this.

> allow to avoid bumping reference counters for all charged pages at all: live
> cgroups will be protected by being live, dying cgroups will only need
> a temporarily protection while lru vectors and associated pages are reparenting.
> 
> There are pros and cons:
> + cgroup reference counting becomes simpler and more debuggable
> + potential perf wins from fewer operations with live cgroups css refcounters
> = I hope to see code simplifications (but not guaranteed)
> - deleting cgroups becomes more expensive, but the cost can be spread to
>   asynchronous workers
> 
> Idk if Muchun tried to implement it. If not, I might try myself.
>

Yep. I have implemented a initial version recently. I'll do some stability tests
and send it out ASAP.

Thanks Roman.
Roman Gushchin April 22, 2022, 3 p.m. UTC | #8
On Fri, Apr 22, 2022 at 06:27:22PM +0800, Muchun Song wrote:
> On Thu, Apr 21, 2022 at 07:59:00PM -0700, Roman Gushchin wrote:
> > On Thu, Apr 21, 2022 at 02:46:00PM -0400, Waiman Long wrote:
> > > On 4/21/22 13:59, Roman Gushchin wrote:
> > > > On Thu, Apr 21, 2022 at 01:28:20PM -0400, Waiman Long wrote:
> > > > > On 4/21/22 12:33, Roman Gushchin wrote:
> > > > > > On Thu, Apr 21, 2022 at 10:58:45AM -0400, Waiman Long wrote:
> > > > > > > For systems with large number of CPUs, the majority of the memory
> > > > > > > consumed by the mem_cgroup structure is actually the percpu stats
> > > > > > > memory. When a large number of memory cgroups are continuously created
> > > > > > > and destroyed (like in a container host), it is possible that more
> > > > > > > and more mem_cgroup structures remained in the dying state holding up
> > > > > > > increasing amount of percpu memory.
> > > > > > > 
> > > > > > > We can't free up the memory of the dying mem_cgroup structure due to
> > > > > > > active references in some other places. However, the percpu stats memory
> > > > > > > allocated to that mem_cgroup is a different story.
> > > > > > > 
> > > > > > > This patch adds a new percpu_stats_disabled variable to keep track of
> > > > > > > the state of the percpu stats memory. If the variable is set, percpu
> > > > > > > stats update will be disabled for that particular memcg. All the stats
> > > > > > > update will be forward to its parent instead. Reading of the its percpu
> > > > > > > stats will return 0.
> > > > > > > 
> > > > > > > The flushing and freeing of the percpu stats memory is a multi-step
> > > > > > > process. The percpu_stats_disabled variable is set when the memcg is
> > > > > > > being set to offline state. After a grace period with the help of RCU,
> > > > > > > the percpu stats data are flushed and then freed.
> > > > > > > 
> > > > > > > This will greatly reduce the amount of memory held up by dying memory
> > > > > > > cgroups.
> > > > > > > 
> > > > > > > By running a simple management tool for container 2000 times per test
> > > > > > > run, below are the results of increases of percpu memory (as reported
> > > > > > > in /proc/meminfo) and nr_dying_descendants in root's cgroup.stat.
> > > > > > Hi Waiman!
> > > > > > 
> > > > > > I've been proposing the same idea some time ago:
> > > > > > https://lore.kernel.org/all/20190312223404.28665-7-guro@fb.com/T/ .
> > > > > > 
> > > > > > However I dropped it with the thinking that with many other fixes
> > > > > > preventing the accumulation of the dying cgroups it's not worth the added
> > > > > > complexity and a potential cpu overhead.
> > > > > > 
> > > > > > I think it ultimately comes to the number of dying cgroups. If it's low,
> > > > > > memory savings are not worth the cpu overhead. If it's high, they are.
> > > > > > I hope long-term to drive it down significantly (with lru-pages reparenting
> > > > > > being the first major milestone), but it might take a while.
> > > > > > 
> > > > > > I don't have a strong opinion either way, just want to dump my thoughts
> > > > > > on this.
> > > > > I have quite a number of customer cases complaining about increasing percpu
> > > > > memory usages. The number of dying memcg's can go to tens of thousands. From
> > > > > my own investigation, I believe that those dying memcg's are not freed
> > > > > because they are pinned down by references in the page structure. I am aware
> > > > > that we support the use of objcg in the page structure which will allow easy
> > > > > reparenting, but most pages don't do that and it is not easy to do this
> > > > > conversion and it may take quite a while to do that.
> > > > The big question is whether there is a memory pressure on those systems.
> > > > If yes, and the number of dying cgroups is growing, it's worth investigating.
> > > > It might be due to the sharing of pagecache pages and this will be ultimately
> > > > fixed with implementing of the pagecache reparenting. But it also might be due
> > > > to other bugs, which are fixable, so it would be great to understand.
> > > 
> > > 
> > > Pagecache reparenting will probably fix the problem that I have seen. Is
> > > someone working on this?
> > 
> > Some time ago Muchun posted patches based on the reusing of the obj_cgroup API.
> >
> 
> Yep. It is here:
> 
> https://lore.kernel.org/all/20220216115132.52602-1-songmuchun@bytedance.com/.
>  
> > I'm not strictly against this approach, but in my opinion it's not the best.
> > I suggested to use lru vectors as an intermediate objects. In theory, it might
> 
> I remember this.
> 
> > allow to avoid bumping reference counters for all charged pages at all: live
> > cgroups will be protected by being live, dying cgroups will only need
> > a temporarily protection while lru vectors and associated pages are reparenting.
> > 
> > There are pros and cons:
> > + cgroup reference counting becomes simpler and more debuggable
> > + potential perf wins from fewer operations with live cgroups css refcounters
> > = I hope to see code simplifications (but not guaranteed)
> > - deleting cgroups becomes more expensive, but the cost can be spread to
> >   asynchronous workers
> > 
> > Idk if Muchun tried to implement it. If not, I might try myself.
> >
> 
> Yep. I have implemented a initial version recently. I'll do some stability tests
> and send it out ASAP.

Great! Looking forward to see it! And thank you for working on it!
Waiman Long April 25, 2022, 1:01 a.m. UTC | #9
On 4/21/22 22:29, Muchun Song wrote:
> On Thu, Apr 21, 2022 at 02:46:00PM -0400, Waiman Long wrote:
>> On 4/21/22 13:59, Roman Gushchin wrote:
>>> On Thu, Apr 21, 2022 at 01:28:20PM -0400, Waiman Long wrote:
>>>> On 4/21/22 12:33, Roman Gushchin wrote:
>>>>> On Thu, Apr 21, 2022 at 10:58:45AM -0400, Waiman Long wrote:
>>>>>> For systems with large number of CPUs, the majority of the memory
>>>>>> consumed by the mem_cgroup structure is actually the percpu stats
>>>>>> memory. When a large number of memory cgroups are continuously created
>>>>>> and destroyed (like in a container host), it is possible that more
>>>>>> and more mem_cgroup structures remained in the dying state holding up
>>>>>> increasing amount of percpu memory.
>>>>>>
>>>>>> We can't free up the memory of the dying mem_cgroup structure due to
>>>>>> active references in some other places. However, the percpu stats memory
>>>>>> allocated to that mem_cgroup is a different story.
>>>>>>
>>>>>> This patch adds a new percpu_stats_disabled variable to keep track of
>>>>>> the state of the percpu stats memory. If the variable is set, percpu
>>>>>> stats update will be disabled for that particular memcg. All the stats
>>>>>> update will be forward to its parent instead. Reading of the its percpu
>>>>>> stats will return 0.
>>>>>>
>>>>>> The flushing and freeing of the percpu stats memory is a multi-step
>>>>>> process. The percpu_stats_disabled variable is set when the memcg is
>>>>>> being set to offline state. After a grace period with the help of RCU,
>>>>>> the percpu stats data are flushed and then freed.
>>>>>>
>>>>>> This will greatly reduce the amount of memory held up by dying memory
>>>>>> cgroups.
>>>>>>
>>>>>> By running a simple management tool for container 2000 times per test
>>>>>> run, below are the results of increases of percpu memory (as reported
>>>>>> in /proc/meminfo) and nr_dying_descendants in root's cgroup.stat.
>>>>> Hi Waiman!
>>>>>
>>>>> I've been proposing the same idea some time ago:
>>>>> https://lore.kernel.org/all/20190312223404.28665-7-guro@fb.com/T/ .
>>>>>
>>>>> However I dropped it with the thinking that with many other fixes
>>>>> preventing the accumulation of the dying cgroups it's not worth the added
>>>>> complexity and a potential cpu overhead.
>>>>>
>>>>> I think it ultimately comes to the number of dying cgroups. If it's low,
>>>>> memory savings are not worth the cpu overhead. If it's high, they are.
>>>>> I hope long-term to drive it down significantly (with lru-pages reparenting
>>>>> being the first major milestone), but it might take a while.
>>>>>
>>>>> I don't have a strong opinion either way, just want to dump my thoughts
>>>>> on this.
>>>> I have quite a number of customer cases complaining about increasing percpu
>>>> memory usages. The number of dying memcg's can go to tens of thousands. From
>>>> my own investigation, I believe that those dying memcg's are not freed
>>>> because they are pinned down by references in the page structure. I am aware
>>>> that we support the use of objcg in the page structure which will allow easy
>>>> reparenting, but most pages don't do that and it is not easy to do this
>>>> conversion and it may take quite a while to do that.
>>> The big question is whether there is a memory pressure on those systems.
>>> If yes, and the number of dying cgroups is growing, it's worth investigating.
>>> It might be due to the sharing of pagecache pages and this will be ultimately
>>> fixed with implementing of the pagecache reparenting. But it also might be due
>>> to other bugs, which are fixable, so it would be great to understand.
>>
>> Pagecache reparenting will probably fix the problem that I have seen. Is
>> someone working on this?
>>
> We also encountered dying cgroup issue on our servers for a long time.
> I have worked on this for a while and proposed a resolution [1] based
> on obj_cgroup APIs to charge the LRU pages.
>
> [1] https://lore.kernel.org/all/20220216115132.52602-1-songmuchun@bytedance.com/

Thanks for the pointer. I am interested in this patch series. Please cc 
me if you need to generate a new revision.

Cheers,
Longman
diff mbox series

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 89b14729d59f..3389b705e242 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -51,6 +51,13 @@  enum memcg_memory_event {
 	MEMCG_NR_MEMORY_EVENTS,
 };
 
+enum percpu_stats_state {
+	MEMCG_PERCPU_STATS_ACTIVE = 0,
+	MEMCG_PERCPU_STATS_DISABLED,
+	MEMCG_PERCPU_STATS_FLUSHED,
+	MEMCG_PERCPU_STATS_FREED
+};
+
 struct mem_cgroup_reclaim_cookie {
 	pg_data_t *pgdat;
 	unsigned int generation;
@@ -299,6 +306,12 @@  struct mem_cgroup {
 	unsigned long		move_lock_flags;
 
 	MEMCG_PADDING(_pad1_);
+	/*
+	 * Disable percpu stats when offline, flush and free them after one
+	 * grace period.
+	 */
+	struct rcu_work		percpu_stats_rwork;
+	enum percpu_stats_state percpu_stats_disabled;
 
 	/* memory.stat */
 	struct memcg_vmstats	vmstats;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 08cbe23a8b94..1a5ba7035249 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -665,6 +665,31 @@  static void flush_memcg_stats_dwork(struct work_struct *w)
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }
 
+/*
+ * Return the active percpu stats memcg and optionally mem_cgroup_per_node.
+ *
+ * When percpu_stats_disabled, the percpu stats update is transferred to
+ * its parent.
+ */
+static inline struct mem_cgroup *
+percpu_stats_memcg(struct mem_cgroup *memcg, struct mem_cgroup_per_node **pn)
+{
+	while (unlikely(memcg->percpu_stats_disabled)) {
+		int nid;
+
+		if (pn) {
+			for_each_node(nid) {
+				if (*pn == memcg->nodeinfo[nid])
+					break;
+			}
+		}
+		memcg = parent_mem_cgroup(memcg);
+		if (pn)
+			*pn = memcg->nodeinfo[nid];
+	}
+	return memcg;
+}
+
 /**
  * __mod_memcg_state - update cgroup memory statistics
  * @memcg: the memory cgroup
@@ -676,6 +701,7 @@  void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 	if (mem_cgroup_disabled())
 		return;
 
+	memcg = percpu_stats_memcg(memcg, NULL);
 	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
 	memcg_rstat_updated(memcg, val);
 }
@@ -686,6 +712,9 @@  static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
 	long x = 0;
 	int cpu;
 
+	if (unlikely(memcg->percpu_stats_disabled))
+		return 0;
+
 	for_each_possible_cpu(cpu)
 		x += per_cpu(memcg->vmstats_percpu->state[idx], cpu);
 #ifdef CONFIG_SMP
@@ -702,7 +731,7 @@  void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	struct mem_cgroup *memcg;
 
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-	memcg = pn->memcg;
+	memcg = percpu_stats_memcg(pn->memcg, &pn);
 
 	/*
 	 * The caller from rmap relay on disabled preemption becase they never
@@ -814,6 +843,7 @@  void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 {
 	if (mem_cgroup_disabled())
 		return;
+	memcg = percpu_stats_memcg(memcg, NULL);
 
 	memcg_stats_lock();
 	__this_cpu_add(memcg->vmstats_percpu->events[idx], count);
@@ -831,6 +861,9 @@  static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
 	long x = 0;
 	int cpu;
 
+	if (unlikely(memcg->percpu_stats_disabled))
+		return 0;
+
 	for_each_possible_cpu(cpu)
 		x += per_cpu(memcg->vmstats_percpu->events[event], cpu);
 	return x;
@@ -839,6 +872,8 @@  static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 					 int nr_pages)
 {
+	memcg = percpu_stats_memcg(memcg, NULL);
+
 	/* pagein of a big page is an event. So, ignore page size */
 	if (nr_pages > 0)
 		__count_memcg_events(memcg, PGPGIN, 1);
@@ -855,6 +890,8 @@  static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 {
 	unsigned long val, next;
 
+	memcg = percpu_stats_memcg(memcg, NULL);
+
 	val = __this_cpu_read(memcg->vmstats_percpu->nr_page_events);
 	next = __this_cpu_read(memcg->vmstats_percpu->targets[target]);
 	/* from time_after() in jiffies.h */
@@ -5051,7 +5088,6 @@  static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 	if (!pn)
 		return;
 
-	free_percpu(pn->lruvec_stats_percpu);
 	kfree(pn);
 }
 
@@ -5061,7 +5097,6 @@  static void __mem_cgroup_free(struct mem_cgroup *memcg)
 
 	for_each_node(node)
 		free_mem_cgroup_per_node_info(memcg, node);
-	free_percpu(memcg->vmstats_percpu);
 	kfree(memcg);
 }
 
@@ -5132,6 +5167,44 @@  static struct mem_cgroup *mem_cgroup_alloc(void)
 	return ERR_PTR(error);
 }
 
+/*
+ * Flush and free the percpu stats
+ */
+static void percpu_stats_free_rwork_fn(struct work_struct *work)
+{
+	struct mem_cgroup *memcg = container_of(to_rcu_work(work),
+						struct mem_cgroup,
+						percpu_stats_rwork);
+	int node;
+
+
+	cgroup_rstat_flush_hold(memcg->css.cgroup);
+	memcg->percpu_stats_disabled = MEMCG_PERCPU_STATS_FLUSHED;
+	cgroup_rstat_flush_release();
+
+	for_each_node(node) {
+		struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];
+
+		if (pn)
+			free_percpu(pn->lruvec_stats_percpu);
+	}
+	free_percpu(memcg->vmstats_percpu);
+	memcg->percpu_stats_disabled = MEMCG_PERCPU_STATS_FREED;
+	mem_cgroup_id_put(memcg);
+}
+
+static void memcg_percpu_stats_disable(struct mem_cgroup *memcg)
+{
+	memcg->percpu_stats_disabled = MEMCG_PERCPU_STATS_DISABLED;
+	INIT_RCU_WORK(&memcg->percpu_stats_rwork, percpu_stats_free_rwork_fn);
+	queue_rcu_work(system_wq, &memcg->percpu_stats_rwork);
+}
+
+static inline bool memcg_percpu_stats_gone(struct mem_cgroup *memcg)
+{
+	return memcg->percpu_stats_disabled >= MEMCG_PERCPU_STATS_FLUSHED;
+}
+
 static struct cgroup_subsys_state * __ref
 mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
@@ -5227,7 +5300,7 @@  static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 
 	drain_all_stock(memcg);
 
-	mem_cgroup_id_put(memcg);
+	memcg_percpu_stats_disable(memcg);
 }
 
 static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
@@ -5308,11 +5381,13 @@  static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 		if (delta)
 			memcg->vmstats.state_pending[i] = 0;
 
-		/* Add CPU changes on this level since the last flush */
-		v = READ_ONCE(statc->state[i]);
-		if (v != statc->state_prev[i]) {
-			delta += v - statc->state_prev[i];
-			statc->state_prev[i] = v;
+		if (!memcg_percpu_stats_gone(memcg)) {
+			/* Add CPU changes on this level since the last flush */
+			v = READ_ONCE(statc->state[i]);
+			if (v != statc->state_prev[i]) {
+				delta += v - statc->state_prev[i];
+				statc->state_prev[i] = v;
+			}
 		}
 
 		if (!delta)
@@ -5329,10 +5404,12 @@  static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 		if (delta)
 			memcg->vmstats.events_pending[i] = 0;
 
-		v = READ_ONCE(statc->events[i]);
-		if (v != statc->events_prev[i]) {
-			delta += v - statc->events_prev[i];
-			statc->events_prev[i] = v;
+		if (!memcg_percpu_stats_gone(memcg)) {
+			v = READ_ONCE(statc->events[i]);
+			if (v != statc->events_prev[i]) {
+				delta += v - statc->events_prev[i];
+				statc->events_prev[i] = v;
+			}
 		}
 
 		if (!delta)
@@ -5358,10 +5435,12 @@  static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 			if (delta)
 				pn->lruvec_stats.state_pending[i] = 0;
 
-			v = READ_ONCE(lstatc->state[i]);
-			if (v != lstatc->state_prev[i]) {
-				delta += v - lstatc->state_prev[i];
-				lstatc->state_prev[i] = v;
+			if (!memcg_percpu_stats_gone(memcg)) {
+				v = READ_ONCE(lstatc->state[i]);
+				if (v != lstatc->state_prev[i]) {
+					delta += v - lstatc->state_prev[i];
+					lstatc->state_prev[i] = v;
+				}
 			}
 
 			if (!delta)
@@ -6730,6 +6809,7 @@  static inline void uncharge_gather_clear(struct uncharge_gather *ug)
 static void uncharge_batch(const struct uncharge_gather *ug)
 {
 	unsigned long flags;
+	struct mem_cgroup *memcg;
 
 	if (ug->nr_memory) {
 		page_counter_uncharge(&ug->memcg->memory, ug->nr_memory);
@@ -6740,10 +6820,12 @@  static void uncharge_batch(const struct uncharge_gather *ug)
 		memcg_oom_recover(ug->memcg);
 	}
 
+	memcg = percpu_stats_memcg(ug->memcg, NULL);
+
 	local_irq_save(flags);
-	__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
-	__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
-	memcg_check_events(ug->memcg, ug->nid);
+	__count_memcg_events(memcg, PGPGOUT, ug->pgpgout);
+	__this_cpu_add(memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
+	memcg_check_events(memcg, ug->nid);
 	local_irq_restore(flags);
 
 	/* drop reference from uncharge_folio */