Message ID | 20220129205315.478628-4-longman@redhat.com (mailing list archive)
---|---
State | New
Series | mm/page_owner: Extend page_owner to show memcg information
On Sat, Jan 29, 2022 at 03:53:15PM -0500, Waiman Long wrote:
> It was found that a number of offlined memcgs were not freed because
> they were pinned by some charged pages that were present. Even "echo
> 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
> offlined but not freed memcgs tend to increase in number over time with
> the side effect that percpu memory consumption as shown in /proc/meminfo
> also increases over time.
>
> In order to find out more information about those pages that pin
> offlined memcgs, the page_owner feature is extended to dump memory
> cgroup information especially whether the cgroup is offlined or not.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  mm/page_owner.c | 31 +++++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
>
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index 28dac73e0542..8dc5cd0fa227 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -10,6 +10,7 @@
>  #include <linux/migrate.h>
>  #include <linux/stackdepot.h>
>  #include <linux/seq_file.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched/clock.h>
>
>  #include "internal.h"
> @@ -331,6 +332,7 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
>  			depot_stack_handle_t handle)
>  {
>  	int ret, pageblock_mt, page_mt;
> +	unsigned long __maybe_unused memcg_data;
>  	char *kbuf;
>
>  	count = min_t(size_t, count, PAGE_SIZE);
> @@ -365,6 +367,35 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
>  			migrate_reason_names[page_owner->last_migrate_reason]);
>  	}
>
> +#ifdef CONFIG_MEMCG

Can we put all this along with the declaration of memcg_data in a helper
function please?

> +	/*
> +	 * Look for memcg information and print it out
> +	 */
> +	memcg_data = READ_ONCE(page->memcg_data);
> +	if (memcg_data) {
> +		struct mem_cgroup *memcg = page_memcg_check(page);
> +		bool onlined;
> +		char name[80];
> +
> +		if (memcg_data & MEMCG_DATA_OBJCGS)
> +			ret += scnprintf(kbuf + ret, count - ret,
> +					"Slab cache page\n");
> +
> +		if (!memcg)
> +			goto copy_out;
> +
> +		onlined = (memcg->css.flags & CSS_ONLINE);
> +		cgroup_name(memcg->css.cgroup, name, sizeof(name));
> +		ret += scnprintf(kbuf + ret, count - ret,
> +				"Charged %sto %smemcg %s\n",
> +				PageMemcgKmem(page) ? "(via objcg) " : "",
> +				onlined ? "" : "offlined ",
> +				name);
> +	}
> +
> +copy_out:
> +#endif
> +
>  	ret += snprintf(kbuf + ret, count - ret, "\n");
>  	if (ret >= count)
>  		goto err;
> --
> 2.27.0
On 1/30/22 01:33, Mike Rapoport wrote:
> On Sat, Jan 29, 2022 at 03:53:15PM -0500, Waiman Long wrote:
>> It was found that a number of offlined memcgs were not freed because
>> they were pinned by some charged pages that were present. Even "echo
>> 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
>> offlined but not freed memcgs tend to increase in number over time with
>> the side effect that percpu memory consumption as shown in /proc/meminfo
>> also increases over time.
>>
>> In order to find out more information about those pages that pin
>> offlined memcgs, the page_owner feature is extended to dump memory
>> cgroup information especially whether the cgroup is offlined or not.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
[...]
>> +#ifdef CONFIG_MEMCG
> Can we put all this along with the declaration of memcg_data in a helper
> function please?
>
Sure. Will post another version with that change.

Cheers,
Longman
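Purely as an illustration of the requested change, a minimal sketch of such a helper built from the v2 code above (the name, signature, and placement are assumptions; the posted v3 may differ):

static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
					 struct page *page)
{
#ifdef CONFIG_MEMCG
	/* All of the memcg-printing logic from the v2 patch, unchanged. */
	unsigned long memcg_data = READ_ONCE(page->memcg_data);
	struct mem_cgroup *memcg;
	bool onlined;
	char name[80];

	if (!memcg_data)
		return ret;

	if (memcg_data & MEMCG_DATA_OBJCGS)
		ret += scnprintf(kbuf + ret, count - ret,
				"Slab cache page\n");

	memcg = page_memcg_check(page);
	if (!memcg)
		return ret;

	onlined = (memcg->css.flags & CSS_ONLINE);
	cgroup_name(memcg->css.cgroup, name, sizeof(name));
	ret += scnprintf(kbuf + ret, count - ret,
			"Charged %sto %smemcg %s\n",
			PageMemcgKmem(page) ? "(via objcg) " : "",
			onlined ? "" : "offlined ",
			name);
#endif /* CONFIG_MEMCG */
	return ret;
}

print_page_owner() would then call "ret = print_page_owner_memcg(kbuf, count, ret, page);" in place of the #ifdef block, and both the __maybe_unused memcg_data declaration and the copy_out label could go away.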
On Sun, 30 Jan 2022, Waiman Long wrote:

> On 1/30/22 01:33, Mike Rapoport wrote:
> > On Sat, Jan 29, 2022 at 03:53:15PM -0500, Waiman Long wrote:
> > > It was found that a number of offlined memcgs were not freed because
> > > they were pinned by some charged pages that were present.
[...]
> > > +#ifdef CONFIG_MEMCG
> > Can we put all this along with the declaration of memcg_data in a helper
> > function please?
> >
> Sure. Will post another version with that change.
>
That would certainly make it much cleaner. After that's done (and perhaps
addressing my nit comment in the first patch), feel free to add

	Acked-by: David Rientjes <rientjes@google.com>

to all three patches.

Thanks!
On Sat 29-01-22 15:53:15, Waiman Long wrote:
> It was found that a number of offlined memcgs were not freed because
> they were pinned by some charged pages that were present. Even "echo
> 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
> offlined but not freed memcgs tend to increase in number over time with
> the side effect that percpu memory consumption as shown in /proc/meminfo
> also increases over time.
>
> In order to find out more information about those pages that pin
> offlined memcgs, the page_owner feature is extended to dump memory
> cgroup information especially whether the cgroup is offlined or not.

It is not really clear to me how this is supposed to be used. Are you
really dumping all the pages in the system to find out offline memcgs?
That looks rather clumsy to me. I am not against adding memcg
information to the page owner output. That can be useful in other
contexts.

> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  mm/page_owner.c | 31 +++++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
[...]
> +#ifdef CONFIG_MEMCG

This really begs to be in a dedicated function. page_owner_print_memcg
or something like that.

> +	/*
> +	 * Look for memcg information and print it out
> +	 */
> +	memcg_data = READ_ONCE(page->memcg_data);
> +	if (memcg_data) {
> +		struct mem_cgroup *memcg = page_memcg_check(page);
> +		bool onlined;
> +		char name[80];

What prevents the memcg from going away and being reused for a
different purpose?

> +
> +		if (memcg_data & MEMCG_DATA_OBJCGS)
> +			ret += scnprintf(kbuf + ret, count - ret,
> +					"Slab cache page\n");
> +
> +		if (!memcg)
> +			goto copy_out;
> +
> +		onlined = (memcg->css.flags & CSS_ONLINE);
> +		cgroup_name(memcg->css.cgroup, name, sizeof(name));
> +		ret += scnprintf(kbuf + ret, count - ret,
> +				"Charged %sto %smemcg %s\n",
> +				PageMemcgKmem(page) ? "(via objcg) " : "",
> +				onlined ? "" : "offlined ",
> +				name);
> +	}
> +
> +copy_out:
> +#endif
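One conceivable way to address that lifetime question, shown here purely as a hedged sketch (it assumes memcg release is RCU-deferred, and it is not what this thread settled on), is to do the lookup and printing under rcu_read_lock():

	struct mem_cgroup *memcg;

	/*
	 * Sketch only: hold rcu_read_lock() so the memcg derived from
	 * page->memcg_data cannot be freed while its css flags and
	 * cgroup name are being read.
	 */
	rcu_read_lock();
	memcg = page_memcg_check(page);
	if (memcg) {
		bool online = (memcg->css.flags & CSS_ONLINE);
		char name[80];

		cgroup_name(memcg->css.cgroup, name, sizeof(name));
		ret += scnprintf(kbuf + ret, count - ret,
				"Charged %sto %smemcg %s\n",
				PageMemcgKmem(page) ? "(via objcg) " : "",
				online ? "" : "offlined ",
				name);
	}
	rcu_read_unlock();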
On Mon, Jan 31, 2022 at 11:53:19AM -0500, Johannes Weiner wrote:
> On Mon, Jan 31, 2022 at 10:38:51AM +0100, Michal Hocko wrote:
> > On Sat 29-01-22 15:53:15, Waiman Long wrote:
> > > It was found that a number of offlined memcgs were not freed because
> > > they were pinned by some charged pages that were present.
[...]
> > It is not really clear to me how this is supposed to be used. Are you
> > really dumping all the pages in the system to find out offline memcgs?
> > That looks rather clumsy to me. I am not against adding memcg
> > information to the page owner output. That can be useful in other
> > contexts.
>
> We've sometimes done exactly that in production, but with drgn
> scripts. It's not very common, so it doesn't need to be very efficient
> either. Typically, we'd encounter a host with an unusual number of
> dying cgroups, ssh in and poke around with drgn to figure out what
> kind of objects are still pinning the cgroups in question.
>
> This patch would make that process a little easier, I suppose.

Right. Over the last few years I've spent an enormous amount of time
digging into various aspects of this problem, and in my experience the
combination of drgn for inspecting the current state and bpf for
following various decisions on the reclaim path was the most useful
combination.

I really appreciate the effort to put useful tools for tracking memcg
references into the kernel tree. However, the page_owner infra has
limited usefulness as it has to be enabled at boot. But because it
doesn't add any overhead, I also don't see any reason not to add it.

Thanks!
On Mon 31-01-22 10:15:45, Roman Gushchin wrote:
> On Mon, Jan 31, 2022 at 11:53:19AM -0500, Johannes Weiner wrote:
[...]
> Right. Over the last few years I've spent an enormous amount of time
> digging into various aspects of this problem, and in my experience the
> combination of drgn for inspecting the current state and bpf for
> following various decisions on the reclaim path was the most useful
> combination.
>
> I really appreciate the effort to put useful tools for tracking memcg
> references into the kernel tree. However, the page_owner infra has
> limited usefulness as it has to be enabled at boot. But because it
> doesn't add any overhead, I also don't see any reason not to add it.

Would it be feasible to add a debugfs interface to display dead memcg
information?
On 1/31/22 13:25, Michal Hocko wrote:
> On Mon 31-01-22 10:15:45, Roman Gushchin wrote:
> [...]
>> I really appreciate the effort to put useful tools for tracking memcg
>> references into the kernel tree. However, the page_owner infra has
>> limited usefulness as it has to be enabled at boot. But because it
>> doesn't add any overhead, I also don't see any reason not to add it.
> Would it be feasible to add a debugfs interface to display dead memcg
> information?

Originally, I added some debug code to keep track of the list of memcgs
that had been offlined but not yet freed. After some more testing, I
figured out that the memcgs were not freed because they were pinned by
references in the page structs. At that point, I realized that using the
existing page_owner debugging tool would be useful for tracking this
kind of problem, since it already has all the infrastructure to list
where the pages were allocated as well as various fields in the page
structures.

Of course, it is also possible to have a debugfs interface to list that
dead memcg information, but displaying more information about the pages
that pin the memcg would be hard without using the page_owner tool.
Keeping track of the list of dead memcgs may also have some runtime
overhead.

Cheers,
Longman
On 1/31/22 04:38, Michal Hocko wrote:
> On Sat 29-01-22 15:53:15, Waiman Long wrote:
>> It was found that a number of offlined memcgs were not freed because
>> they were pinned by some charged pages that were present.
[...]
> It is not really clear to me how this is supposed to be used. Are you
> really dumping all the pages in the system to find out offline memcgs?
> That looks rather clumsy to me. I am not against adding memcg
> information to the page owner output. That can be useful in other
> contexts.

I am just piggybacking on top of the existing page_owner tool to provide
information for finding out what pages are pinning the dead memcgs.
page_owner is a debugging tool that is not turned on by default. We do
have to add a kernel parameter and reboot the system to use it, but that
is pretty easy to do once we have a reproducer for the problem.

Cheers,
Longman
On Mon 31-01-22 13:38:28, Waiman Long wrote:
[...]
> Of course, it is also possible to have a debugfs interface to list that
> dead memcg information, but displaying more information about the pages
> that pin the memcg would be hard without using the page_owner tool.

Yes, you will need page_owner or to hook into the kernel by other means
(like the already mentioned drgn). The question is whether scanning all
existing pages to get that information is the best we can offer.

> Keeping track of the list of dead memcgs may also have some runtime
> overhead.

Could you be more specific? Offlined memcgs are still part of the
hierarchy IIRC. So it shouldn't be much more than iterating the whole
cgroup tree and collecting interesting data about dead cgroups.
On 2/1/22 05:49, Michal Hocko wrote:
> On Mon 31-01-22 13:38:28, Waiman Long wrote:
> [...]
>> Of course, it is also possible to have a debugfs interface to list that
>> dead memcg information, but displaying more information about the pages
>> that pin the memcg would be hard without using the page_owner tool.
> Yes, you will need page_owner or to hook into the kernel by other means
> (like the already mentioned drgn). The question is whether scanning all
> existing pages to get that information is the best we can offer.

The page_owner tool records the page information at allocation time.
There is some slight performance overhead, but it is the memory overhead
that is the major drawback of this approach, as we need one page_owner
structure for each physical page. Page scanning is only done when users
read the page_owner debugfs file.

Yes, I agree that scanning all the pages is not the most efficient way
to get this dead memcg information, but it is what the page_owner tool
does. I would argue that it is the most efficient way, coding-wise, to
get this information.

>> Keeping track of the list of dead memcgs may also have some runtime
>> overhead.
> Could you be more specific? Offlined memcgs are still part of the
> hierarchy IIRC. So it shouldn't be much more than iterating the whole
> cgroup tree and collecting interesting data about dead cgroups.

What I mean is that without piggybacking on top of page_owner, we will
need to add a lot more code to collect and display that information,
which may have some overhead of its own.

Cheers,
Longman
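To put that memory overhead into rough, illustrative numbers (assumed figures, not from the thread): with 4 KiB pages, 64 GiB of RAM is about 16.8 million page frames. At an assumed ~48 bytes of page_ext/page_owner state per frame (the exact size depends on kernel version and config), that works out to roughly 768 MiB of always-allocated tracking data, which is why page_owner is normally reserved for debug or test boots.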
On Tue 01-02-22 11:41:19, Waiman Long wrote:
> On 2/1/22 05:49, Michal Hocko wrote:
> [...]
> > Could you be more specific? Offlined memcgs are still part of the
> > hierarchy IIRC. So it shouldn't be much more than iterating the whole
> > cgroup tree and collecting interesting data about dead cgroups.
>
> What I mean is that without piggybacking on top of page_owner, we will
> need to add a lot more code to collect and display that information,
> which may have some overhead of its own.

Yes, there is nothing like a free lunch. Page owner is certainly a tool
that can be used. My main concern is that this tool doesn't really
scale on large machines with lots of memory. It will provide very
detailed information, but I am not sure this is particularly helpful to
most admins (why should people process tons of allocation backtraces in
the first place). Wouldn't it be sufficient to have per dead memcg stats
to see where the memory sits?

Accumulated offline memcgs are something that bothers more people, and I
am really wondering whether we can do more for those people to evaluate
the current state.
On Wed, Feb 02, 2022 at 09:57:18AM +0100, Michal Hocko wrote:
> On Tue 01-02-22 11:41:19, Waiman Long wrote:
> [...]
> > What I mean is that without piggybacking on top of page_owner, we will
> > need to add a lot more code to collect and display that information,
> > which may have some overhead of its own.
>
> Yes, there is nothing like a free lunch. Page owner is certainly a tool
> that can be used. My main concern is that this tool doesn't really
> scale on large machines with lots of memory. It will provide very
> detailed information, but I am not sure this is particularly helpful to
> most admins (why should people process tons of allocation backtraces in
> the first place). Wouldn't it be sufficient to have per dead memcg stats
> to see where the memory sits?
>
> Accumulated offline memcgs are something that bothers more people, and I
> am really wondering whether we can do more for those people to evaluate
> the current state.

Cgroup v2 has had corresponding counters for years. Or do you mean
something different?
On 2/2/22 03:57, Michal Hocko wrote:
> On Tue 01-02-22 11:41:19, Waiman Long wrote:
> [...]
> Yes, there is nothing like a free lunch. Page owner is certainly a tool
> that can be used. My main concern is that this tool doesn't really
> scale on large machines with lots of memory. It will provide very
> detailed information, but I am not sure this is particularly helpful to
> most admins (why should people process tons of allocation backtraces in
> the first place). Wouldn't it be sufficient to have per dead memcg stats
> to see where the memory sits?
>
> Accumulated offline memcgs are something that bothers more people, and I
> am really wondering whether we can do more for those people to evaluate
> the current state.

You won't get the stack backtrace information without page_owner
enabled, and I believe that is a helpful piece of information. I don't
expect page_owner to be enabled by default on production systems
because of its memory overhead.

I believe you can actually see the number of memory cgroups present by
looking at the /proc/cgroups file, though you can't tell how many of
them are offline memcgs. So if one suspects that there are a large
number of offline memcgs, one can set up a test environment with
page_owner enabled for further analysis.

Cheers,
Longman
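For reference, the usual page_owner workflow documented in Documentation/vm/page_owner.rst for kernels of this era looks roughly like the following (default paths assumed):

  # Kernel built with CONFIG_PAGE_OWNER=y and booted with "page_owner=on".
  # Dump the allocation records for all tracked pages (can be large):
  cat /sys/kernel/debug/page_owner > page_owner_full.txt
  # Optionally aggregate identical stacks with the helper built from
  # tools/vm/page_owner_sort.c:
  ./page_owner_sort page_owner_full.txt sorted_page_owner.txt

With this series applied, the dump could then simply be searched for the "offlined memcg" marker.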
On Wed 02-02-22 07:54:48, Roman Gushchin wrote:
> On Wed, Feb 02, 2022 at 09:57:18AM +0100, Michal Hocko wrote:
> [...]
> > Accumulated offline memcgs are something that bothers more people, and I
> > am really wondering whether we can do more for those people to evaluate
> > the current state.
>
> Cgroup v2 has had corresponding counters for years. Or do you mean
> something different?

Do we have anything more specific than nr_dying_descendants?

I was thinking about an interface which would provide paths and stats
for dead memcgs. But I have to confess I haven't really spent much time
thinking about how much work that would be. I am by no means against
adding memcg information to the page owner. I just think there must be
a better way to present resource consumption by dead memcgs.
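For concreteness, that counter is exposed in each cgroup's cgroup.stat file on a v2 hierarchy; the path and values below are invented for illustration:

  $ cat /sys/fs/cgroup/system.slice/cgroup.stat
  nr_descendants 42
  nr_dying_descendants 317

A persistently growing nr_dying_descendants under a steady workload is the usual hint that offlined memcgs are being pinned.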
On Wed, Feb 02, 2022 at 05:38:07PM +0100, Michal Hocko wrote:
> On Wed 02-02-22 07:54:48, Roman Gushchin wrote:
> [...]
> > Cgroup v2 has had corresponding counters for years. Or do you mean
> > something different?
>
> Do we have anything more specific than nr_dying_descendants?

No, just nr_dying_descendants.

> I was thinking about an interface which would provide paths and stats
> for dead memcgs. But I have to confess I haven't really spent much time
> thinking about how much work that would be. I am by no means against
> adding memcg information to the page owner. I just think there must be
> a better way to present resource consumption by dead memcgs.

I'd go with a drgn script. I wrote a bunch of them some time ago and
can probably revive them and post them here (will take a few days).

I agree that the problem still exists, and providing some tooling
around it would be useful.

Thanks!
On Wed 02-02-22 09:51:32, Roman Gushchin wrote:
> On Wed, Feb 02, 2022 at 05:38:07PM +0100, Michal Hocko wrote:
> [...]
> > I was thinking about an interface which would provide paths and stats
> > for dead memcgs. But I have to confess I haven't really spent much time
> > thinking about how much work that would be. I am by no means against
> > adding memcg information to the page owner. I just think there must be
> > a better way to present resource consumption by dead memcgs.
>
> I'd go with a drgn script. I wrote a bunch of them some time ago and
> can probably revive them and post them here (will take a few days).

That would be really awesome! Thanks!
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 28dac73e0542..8dc5cd0fa227 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -10,6 +10,7 @@
 #include <linux/migrate.h>
 #include <linux/stackdepot.h>
 #include <linux/seq_file.h>
+#include <linux/memcontrol.h>
 #include <linux/sched/clock.h>

 #include "internal.h"
@@ -331,6 +332,7 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 			depot_stack_handle_t handle)
 {
 	int ret, pageblock_mt, page_mt;
+	unsigned long __maybe_unused memcg_data;
 	char *kbuf;

 	count = min_t(size_t, count, PAGE_SIZE);
@@ -365,6 +367,35 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 			migrate_reason_names[page_owner->last_migrate_reason]);
 	}

+#ifdef CONFIG_MEMCG
+	/*
+	 * Look for memcg information and print it out
+	 */
+	memcg_data = READ_ONCE(page->memcg_data);
+	if (memcg_data) {
+		struct mem_cgroup *memcg = page_memcg_check(page);
+		bool onlined;
+		char name[80];
+
+		if (memcg_data & MEMCG_DATA_OBJCGS)
+			ret += scnprintf(kbuf + ret, count - ret,
+					"Slab cache page\n");
+
+		if (!memcg)
+			goto copy_out;
+
+		onlined = (memcg->css.flags & CSS_ONLINE);
+		cgroup_name(memcg->css.cgroup, name, sizeof(name));
+		ret += scnprintf(kbuf + ret, count - ret,
+				"Charged %sto %smemcg %s\n",
+				PageMemcgKmem(page) ? "(via objcg) " : "",
+				onlined ? "" : "offlined ",
+				name);
+	}
+
+copy_out:
+#endif
+
 	ret += snprintf(kbuf + ret, count - ret, "\n");
 	if (ret >= count)
 		goto err;
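For illustration, with this patch a page_owner record for a page that pins a dead memcg gains one extra line, along these lines (hypothetical output; the header fields, backtrace, and cgroup name are invented):

Page allocated via order 0, mask 0x1100cca(GFP_HIGHUSER_MOVABLE), pid 1234, ts 52206585462 ns, free_ts 0 ns
 <allocation backtrace>
Charged to offlined memcg test-session.scope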
It was found that a number of offlined memcgs were not freed because
they were pinned by some charged pages that were present. Even "echo
1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
offlined but not freed memcgs tend to increase in number over time with
the side effect that percpu memory consumption as shown in /proc/meminfo
also increases over time.

In order to find out more information about those pages that pin
offlined memcgs, the page_owner feature is extended to dump memory
cgroup information especially whether the cgroup is offlined or not.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/page_owner.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)