Message ID: 20210428094949.43579-1-songmuchun@bytedance.com (mailing list archive)
Series: Shrink the list lru size on memory cgroup removal
On Wed, Apr 28, 2021 at 2:54 AM Muchun Song <songmuchun@bytedance.com> wrote:
>
> On our server, we found a suspected memory leak: the kmalloc-32 slab
> cache consumes more than 6GB of memory, while every other kmem_cache
> consumes less than 2GB.
>
> After in-depth analysis, we found that the kmalloc-32 consumption
> comes from list_lru_one allocations.
>
> crash> p memcg_nr_cache_ids
> memcg_nr_cache_ids = $2 = 24574
>
> memcg_nr_cache_ids is very large, and the memory consumption of each
> list_lru can be calculated with the following formula:
>
>   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
>
> There are 4 NUMA nodes in our system, so each list_lru consumes ~3MB.
>
> crash> list super_blocks | wc -l
> 952
>
> Every mount registers 2 list_lrus, one for inodes and another for
> dentries. There are 952 super_blocks, so the total memory is
> 952 * 2 * 3 MB (~5.6GB). But the number of memory cgroups is less
> than 500, so I guess more than 12286 containers have been deployed on
> this machine (I do not know why there are so many containers; it may
> be a user bug, or the user may really want to do that). Now there are
> fewer than 500 containers in the system, yet memcg_nr_cache_ids has
> not been reduced to a suitable value, which wastes a lot of memory.
> If we want to reduce memcg_nr_cache_ids, we have to reboot the
> server. This is not what we want.
>
> So this patchset dynamically adjusts the value of memcg_nr_cache_ids
> to keep memory consumption healthy. With it, we may be able to
> restore a healthy environment even if users have created tens of
> thousands of memory cgroups and then destroyed them. The patchset
> also contains some code simplification.

There was a recent discussion [1] on the same issue. Did you get the
chance to take a look at that? I have not gone through this patch
series yet but will do in the next couple of weeks.

[1] https://lore.kernel.org/linux-fsdevel/20210405054848.GA1077931@in.ibm.com/
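(For reference, the cover letter's numbers work out as follows; the ~3MB
per-lru figure is 32 bytes per possible memcg id, per NUMA node:

    4 nodes * 24574 ids * 32 bytes          = ~3.0 MB per list_lru
    952 super_blocks * 2 list_lrus * ~3 MB  = ~5.6 GB total)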
On Thu, Apr 29, 2021 at 7:32 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> [... cover letter snipped; see above ...]
>
> There was a recent discussion [1] on the same issue. Did you get the
> chance to take a look at that? I have not gone through this patch
> series yet but will do in the next couple of weeks.
>
> [1] https://lore.kernel.org/linux-fsdevel/20210405054848.GA1077931@in.ibm.com/

Thanks for the reminder. No, I hadn't seen it, but I have looked at it
now. The issue is very similar to mine, but Bharata's aim is to run 10k
containers and optimize the memory consumption of list_lru_one for that
case. That is not what I am doing here: I want to shrink the list_lrus
when the number of memcgs drops from tens of thousands back to
hundreds.

Thanks.
On Wed, Apr 28, 2021 at 05:49:40PM +0800, Muchun Song wrote:
> On our server, we found a suspected memory leak: the kmalloc-32 slab
> cache consumes more than 6GB of memory, while every other kmem_cache
> consumes less than 2GB.
>
> After in-depth analysis, we found that the kmalloc-32 consumption
> comes from list_lru_one allocations.
>
> crash> p memcg_nr_cache_ids
> memcg_nr_cache_ids = $2 = 24574
>
> memcg_nr_cache_ids is very large, and the memory consumption of each
> list_lru can be calculated with the following formula:
>
>   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
>
> There are 4 NUMA nodes in our system, so each list_lru consumes ~3MB.
>
> crash> list super_blocks | wc -l
> 952

The more I see people trying to work around this, the more I think that
the way memcgs have been grafted into the list_lru is back to front.

We currently allocate scope for every memcg to be tracked on every
superblock instantiated in the system, regardless of whether that
superblock is even accessible to that memcg.

These huge memcg counts come from container hosts where memcgs are
confined to just a small subset of the total number of superblocks that
are instantiated at any given point in time.

IOWs, for these systems with huge container counts, list_lru does not
need the capability of tracking every memcg on every superblock.

What it comes down to is that the list_lru is only needed for a given
memcg if that memcg is instantiating and freeing objects on a given
list_lru.

Which makes me think we should be moving more towards an "add the memcg
to the list_lru at the first insert" model rather than "instantiate all
at memcg init time just in case". The model we originally came up with
for supporting memcgs is really starting to show its limits, and we
should address those limitations rather than hack more complexity into
the system that does nothing to remove the limitations that are causing
the problems in the first place.

> Every mount registers 2 list_lrus, one for inodes and another for
> dentries. There are 952 super_blocks, so the total memory is
> 952 * 2 * 3 MB (~5.6GB). But the number of memory cgroups is less
> than 500, so I guess more than 12286 containers have been deployed on
> this machine (I do not know why there are so many containers; it may
> be a user bug, or the user may really want to do that). Now there are
> fewer than 500 containers in the system, yet memcg_nr_cache_ids has
> not been reduced to a suitable value, which wastes a lot of memory.
> If we want to reduce memcg_nr_cache_ids, we have to reboot the
> server. This is not what we want.

Exactly my point. This model is broken and doesn't scale to large
counts of either memcgs or superblocks.

We need a different model for dynamically adding, removing and mapping
memcgs to the LRU lists they are actually using, so that we can
efficiently scale to tens of thousands of memcg instances along with
tens of thousands of unique superblock instances.

That's the real problem that needs solving here.

> So this patchset dynamically adjusts the value of memcg_nr_cache_ids
> to keep memory consumption healthy.

Gigabytes of RAM for largely unused memcg list_lrus on every superblock
is not healthy. It's highly inefficient, because the infrastructure we
currently have was never designed to scale to these numbers of unique
containers and superblocks...

Cheers,

Dave.
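(For background, the structures under discussion looked roughly like
this at the time; the sketch below is paraphrased from
include/linux/list_lru.h around v5.12 and is an illustration, not an
exact copy. Each list_lru keeps, per NUMA node, an array of per-memcg
list_lru_one pointers sized by memcg_nr_cache_ids, and each 24-byte
list_lru_one is served from the kmalloc-32 slab, which is where the
cover letter's formula comes from:

    struct list_lru_one {
            struct list_head        list;
            /* may become negative during memcg reparenting */
            long                    nr_items;   /* 24 bytes -> kmalloc-32 */
    };

    struct list_lru_memcg {
            struct rcu_head         rcu;
            /* array of per-cgroup lists, indexed by memcg_cache_id() */
            struct list_lru_one     *lru[];
    };

    struct list_lru_node {
            /* protects all lists on the node, including per-cgroup ones */
            spinlock_t              lock;
            /* global list, used for the root cgroup in cgroup-aware lrus */
            struct list_lru_one     lru;
            /* per-cgroup lists for cgroup-aware lrus, otherwise NULL */
            struct list_lru_memcg   __rcu *memcg_lrus;
            long                    nr_items;
    };)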
On Fri, Apr 30, 2021 at 10:49:03AM +1000, Dave Chinner wrote:
> On Wed, Apr 28, 2021 at 05:49:40PM +0800, Muchun Song wrote:
> > [... cover letter snipped; see above ...]
>
> [... snip ...]
>
> Which makes me think we should be moving more towards an "add the memcg
> to the list_lru at the first insert" model rather than "instantiate all
> at memcg init time just in case". The model we originally came up with
> for supporting memcgs is really starting to show its limits, and we
> should address those limitations rather than hack more complexity into
> the system that does nothing to remove the limitations that are causing
> the problems in the first place.

I totally agree.

It looks like the initial implementation of the whole kernel memory
accounting and memcg-aware shrinkers was based on the idea that the
number of memory cgroups is relatively small and stable. With systemd
creating a separate cgroup for everything, including short-lived
processes, that is simply not true anymore.

Thanks!
On Thu, Apr 29, 2021 at 06:39:40PM -0700, Roman Gushchin wrote:
> On Fri, Apr 30, 2021 at 10:49:03AM +1000, Dave Chinner wrote:
> > [... snip ...]
> >
> > Which makes me think we should be moving more towards an "add the memcg
> > to the list_lru at the first insert" model rather than "instantiate all
> > at memcg init time just in case". The model we originally came up with
> > for supporting memcgs is really starting to show its limits, and we
> > should address those limitations rather than hack more complexity into
> > the system that does nothing to remove the limitations that are causing
> > the problems in the first place.
>
> I totally agree.
>
> It looks like the initial implementation of the whole kernel memory
> accounting and memcg-aware shrinkers was based on the idea that the
> number of memory cgroups is relatively small and stable.

Yes, that was one of the original assumptions: tens to maybe low
hundreds of memcgs at most. The other was that memcgs weren't NUMA
aware, and so would only need a single LRU list per memcg. Hence, even
with "lots" of memcgs and superblocks, the total overhead wasn't that
great.

Then came "memcgs need to be NUMA aware" because of the size of the
machines they were being used for resource management on, and that
greatly increased the per-memcg, per-LRU overhead. Now we're talking
about needing to support a couple of orders of magnitude more memcgs
and superblocks than were originally designed for.

So, really, we're way beyond the original design scope of this
subsystem now.

> With systemd creating a separate cgroup for everything, including
> short-lived processes, that is simply not true anymore.

Yeah, that too. Everything is much more dynamic these days...

Cheers,

Dave.
On Fri, Apr 30, 2021 at 11:27 AM Dave Chinner <david@fromorbit.com> wrote:
>
> [... snip ...]
>
> Then came "memcgs need to be NUMA aware" because of the size of the
> machines they were being used for resource management on, and that
> greatly increased the per-memcg, per-LRU overhead. Now we're talking
> about needing to support a couple of orders of magnitude more memcgs
> and superblocks than were originally designed for.
>
> So, really, we're way beyond the original design scope of this
> subsystem now.

Got it. So it is better to allocate the list_lru_node structure
dynamically: we should only allocate it when it is really demanded. But
allocating memory with GFP_ATOMIC in list_lru_add() is not a good idea,
so we should allocate the memory outside of list_lru_add(). I can
propose an approach that may work.

Before we start, we should know the following rules of list_lrus:

- Only objects allocated with __GFP_ACCOUNT need to allocate the
  struct list_lru_node.
- The caller allocating the memory must know which list_lru the object
  will be inserted into.

So we can allocate the struct list_lru_node when allocating the object
instead of when calling list_lru_add(). That is easy, because at
allocation time we already know the list_lru and the memcg the object
belongs to. So we can introduce a new helper that allocates both the
object and the list_lru_node, like below:

void *list_lru_kmem_cache_alloc(struct list_lru *lru, struct kmem_cache *s,
				gfp_t gfpflags)
{
	void *ret = kmem_cache_alloc(s, gfpflags);

	if (ret && (gfpflags & __GFP_ACCOUNT)) {
		struct mem_cgroup *memcg = mem_cgroup_from_obj(ret);

		if (mem_cgroup_is_root(memcg))
			return ret;

		/*
		 * Allocate the per-memcg list_lru_node; if it is
		 * already allocated, do nothing.
		 */
		memcg_list_lru_node_alloc(lru, memcg,
					  page_to_nid(virt_to_page(ret)),
					  gfpflags);
	}

	return ret;
}

If the user wants to insert the allocated object into its lru list in
the future, the user should use list_lru_kmem_cache_alloc() instead of
kmem_cache_alloc(). I have looked at the code closely. There are 3
different kmem_caches that need to use this new API to allocate memory:
inode_cachep, dentry_cache and radix_tree_node_cachep. I think the
migration is easy.

Hi Roman and Dave,

What do you think about this approach? If there is no problem, I can
provide a preliminary patchset within a week.

Thanks.
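(memcg_list_lru_node_alloc() is never spelled out in this thread. A
minimal sketch of what it might look like, given the list_lru layout
shown earlier, could be the following. Everything here is an assumption
for illustration, not code from any posted patch; in particular the
errno return, which Dave argues below should be allowed to fail:

    /*
     * Hypothetical sketch: make sure lru->node[nid] has a list_lru_one
     * for memcg's kmem id, allocating one if it is still missing.
     * Assumes the memcg_lrus array is already sized to
     * memcg_nr_cache_ids.
     */
    static int memcg_list_lru_node_alloc(struct list_lru *lru,
                                         struct mem_cgroup *memcg,
                                         int nid, gfp_t gfp)
    {
            struct list_lru_node *nlru = &lru->node[nid];
            int idx = memcg_cache_id(memcg);
            struct list_lru_memcg *memcg_lrus;
            struct list_lru_one *l;

            /* This is the kmalloc-32 allocation the cover letter measured. */
            l = kmalloc_node(sizeof(*l), gfp, nid);
            if (!l)
                    return -ENOMEM;
            INIT_LIST_HEAD(&l->list);
            l->nr_items = 0;

            spin_lock(&nlru->lock);
            memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus,
                                            lockdep_is_held(&nlru->lock));
            if (memcg_lrus->lru[idx])
                    kfree(l);       /* lost a race; one already exists */
            else
                    memcg_lrus->lru[idx] = l;
            spin_unlock(&nlru->lock);
            return 0;
    })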
On Fri, Apr 30, 2021 at 04:32:39PM +0800, Muchun Song wrote:
> [... snip ...]
>
> Got it. So it is better to allocate the list_lru_node structure
> dynamically: we should only allocate it when it is really demanded.
> But allocating memory with GFP_ATOMIC in list_lru_add() is not a good
> idea, so we should allocate the memory outside of list_lru_add(). I
> can propose an approach that may work.
>
> [... proposal and list_lru_kmem_cache_alloc() sketch quoted in full ...]
>
> Hi Roman and Dave,
>
> What do you think about this approach? If there is no problem, I can
> provide a preliminary patchset within a week.

At a very first glance it looks similar to what Bharata proposed, but
with some additional tricks. It would be nice to find common ground
here.

I think it's the right direction, although I believe we might
eventually need some more fundamental changes. I don't have a specific
recipe yet; I need to think more about it.

Thanks!
On Fri, Apr 30, 2021 at 04:32:39PM +0800, Muchun Song wrote:
> Before we start, we should know the following rules of list_lrus:
>
> - Only objects allocated with __GFP_ACCOUNT need to allocate the
>   struct list_lru_node.
> - The caller allocating the memory must know which list_lru the
>   object will be inserted into.
>
> So we can allocate the struct list_lru_node when allocating the
> object instead of when calling list_lru_add(). That is easy, because
> at allocation time we already know the list_lru and the memcg the
> object belongs to. So we can introduce a new helper that allocates
> both the object and the list_lru_node, like below.

I feel like there may be a simpler solution, although I'm not really
familiar with the list_lru situation. The three caches you mention:

> I have looked at the code closely. There are 3 different kmem_caches
> that need to use this new API to allocate memory: inode_cachep,
> dentry_cache and radix_tree_node_cachep. I think the migration is
> easy.

are all filesystem caches. So if there's a way of knowing which
filesystems are exposed to each container, we can allocate the list_lru
structures at "mount" time rather than at the first allocation for a
given cache/lru/memcg combination.
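(A sketch of this mount-time idea, reusing the hypothetical
memcg_list_lru_node_alloc() helper from above. The helper name, its
errno return, and the premise that we can enumerate the memcgs that can
reach a mount are all assumptions:

    /*
     * Pre-populate both per-superblock lrus for one memcg, on every
     * node, when the filesystem becomes visible to that memcg.
     */
    static int sb_populate_memcg_lrus(struct super_block *sb,
                                      struct mem_cgroup *memcg, gfp_t gfp)
    {
            int nid, err;

            for_each_node(nid) {
                    err = memcg_list_lru_node_alloc(&sb->s_dentry_lru,
                                                    memcg, nid, gfp);
                    if (err)
                            return err;
                    err = memcg_list_lru_node_alloc(&sb->s_inode_lru,
                                                    memcg, nid, gfp);
                    if (err)
                            return err;
            }
            return 0;
    })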
On Fri, Apr 30, 2021 at 04:32:39PM +0800, Muchun Song wrote:
> On Fri, Apr 30, 2021 at 11:27 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > [... snip ...]
> >
> > So, really, we're way beyond the original design scope of this
> > subsystem now.
>
> Got it. So it is better to allocate the list_lru_node structure
> dynamically: we should only allocate it when it is really demanded.
> But allocating memory with GFP_ATOMIC in list_lru_add() is not a good
> idea, so we should allocate the memory outside of list_lru_add(). I
> can propose an approach that may work.
>
> Before we start, we should know the following rules of list_lrus:
>
> - Only objects allocated with __GFP_ACCOUNT need to allocate the
>   struct list_lru_node.

This seems .... misguided. inode and dentry caches are already marked
as accounted, so individual calls to allocate from these slabs do not
need this annotation.

> - The caller allocating the memory must know which list_lru the
>   object will be inserted into.
>
> So we can allocate the struct list_lru_node when allocating the
> object instead of when calling list_lru_add(). That is easy, because
> at allocation time we already know the list_lru and the memcg the
> object belongs to. So we can introduce a new helper that allocates
> both the object and the list_lru_node, like below:
>
> void *list_lru_kmem_cache_alloc(struct list_lru *lru, struct kmem_cache *s,
> 				gfp_t gfpflags)
> {
> 	void *ret = kmem_cache_alloc(s, gfpflags);
>
> 	if (ret && (gfpflags & __GFP_ACCOUNT)) {
> 		struct mem_cgroup *memcg = mem_cgroup_from_obj(ret);
>
> 		if (mem_cgroup_is_root(memcg))
> 			return ret;
>
> 		/*
> 		 * Allocate the per-memcg list_lru_node; if it is
> 		 * already allocated, do nothing.
> 		 */
> 		memcg_list_lru_node_alloc(lru, memcg,
> 					  page_to_nid(virt_to_page(ret)),
> 					  gfpflags);

If we are allowing kmem_cache_alloc() to fail, then we can allow
memcg_list_lru_node_alloc() to fail, too.

Also, why put this outside kmem_cache_alloc()? Node id and memcg are
already known internally to kmem_cache_alloc() when allocating from a
slab, so why not associate the slab allocation with the LRU directly
when doing the memcg accounting, and so avoid doing costly duplicate
work on every allocation?

i.e. the list_lru was moved inside the mm/ dir because "it's a mm
specific construct only", so why not actually make use of that
designation to internalise this entire memcg management issue into the
slab allocation routines? i.e. an API like

	kmem_cache_alloc_lru(cache, lru, gfpflags)

allows this to be completely internalised and efficiently implemented
with minimal change to callers. It also means that memory allocation
callers don't need to know anything about memcg management, which is
always a win....

> 	}
>
> 	return ret;
> }
>
> If the user wants to insert the allocated object into its lru list in
> the future, the user should use list_lru_kmem_cache_alloc() instead
> of kmem_cache_alloc(). I have looked at the code closely. There are 3
> different kmem_caches that need to use this new API to allocate
> memory: inode_cachep, dentry_cache and radix_tree_node_cachep. I
> think the migration is easy.

It might work, but I think you may have overlooked the complexity of
inode allocation for filesystems. i.e. alloc_inode() calls out to
filesystem allocation functions more often than it allocates directly
from the inode_cachep. i.e. most filesystems provide their own
->alloc_inode superblock operation, and they allocate inodes out of
their own specific slab caches, not the inode_cachep.

And then you have filesystems like XFS, where alloc_inode() will never
be called, and which implement ->alloc_inode as:

/* Catch misguided souls that try to use this interface on XFS */
STATIC struct inode *
xfs_fs_alloc_inode(
	struct super_block	*sb)
{
	BUG();
	return NULL;
}

because all the inode caching and allocation is internal to XFS and the
VFS inode management interfaces are not used.

So I suspect that an external wrapper function is not the way to go
here: either internalising the LRU management into the slab allocation,
or adding the memcg code to alloc_inode() and the filesystem specific
routines, would make a lot more sense to me.

Cheers,

Dave.
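(The thread never fleshes out kmem_cache_alloc_lru() beyond its
signature. One possible shape for it, shown purely as a sketch: a real
implementation would live inside the slab internals, where nid and
memcg are already known, rather than rederiving them from the object as
done here:

    void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
                               gfp_t gfpflags)
    {
            void *ret = kmem_cache_alloc(s, gfpflags);
            struct mem_cgroup *memcg;

            if (!ret || !lru)
                    return ret;

            /*
             * Ensure the lru has a per-memcg list for this object's
             * memcg and node before the caller ever tries to insert
             * the object. Rederiving memcg/nid from the object is the
             * costly duplicate work an internalised version avoids.
             */
            memcg = mem_cgroup_from_obj(ret);
            if (memcg && !mem_cgroup_is_root(memcg))
                    memcg_list_lru_node_alloc(lru, memcg,
                                              page_to_nid(virt_to_page(ret)),
                                              gfpflags);
            return ret;
    })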
On Mon, May 3, 2021 at 7:58 AM Dave Chinner <david@fromorbit.com> wrote:
>
> [... snip ...]
>
> > Before we start, we should know the following rules of list_lrus:
> >
> > - Only objects allocated with __GFP_ACCOUNT need to allocate the
> >   struct list_lru_node.
>
> This seems .... misguided. inode and dentry caches are already marked
> as accounted, so individual calls to allocate from these slabs do not
> need this annotation.

Sorry for the confusion. You are right.

> If we are allowing kmem_cache_alloc() to fail, then we can allow
> memcg_list_lru_node_alloc() to fail, too.
>
> Also, why put this outside kmem_cache_alloc()? Node id and memcg are
> already known internally to kmem_cache_alloc() when allocating from a
> slab, so why not associate the slab allocation with the LRU directly
> when doing the memcg accounting, and so avoid doing costly duplicate
> work on every allocation?
>
> i.e. the list_lru was moved inside the mm/ dir because "it's a mm
> specific construct only", so why not actually make use of that
> designation to internalise this entire memcg management issue into the
> slab allocation routines? i.e. an API like

Yeah, we can.

> 	kmem_cache_alloc_lru(cache, lru, gfpflags)
>
> allows this to be completely internalised and efficiently implemented
> with minimal change to callers. It also means that memory allocation
> callers don't need to know anything about memcg management, which is
> always a win....

Great idea. It's efficient. I'll give it a try.

> It might work, but I think you may have overlooked the complexity of
> inode allocation for filesystems. i.e. alloc_inode() calls out to
> filesystem allocation functions more often than it allocates directly
> from the inode_cachep. i.e. most filesystems provide their own
> ->alloc_inode superblock operation, and they allocate inodes out of
> their own specific slab caches, not the inode_cachep.

I didn't realize this before. You are right: most filesystems have
their own kmem_cache instead of using inode_cachep, so a lot of
filesystem-specific code needs to be changed. Thanks for the reminder.

> And then you have filesystems like XFS, where alloc_inode() will never
> be called, and which implement ->alloc_inode as:
>
> /* Catch misguided souls that try to use this interface on XFS */
> STATIC struct inode *
> xfs_fs_alloc_inode(
> 	struct super_block	*sb)
> {
> 	BUG();
> 	return NULL;
> }
>
> because all the inode caching and allocation is internal to XFS and
> the VFS inode management interfaces are not used.
>
> So I suspect that an external wrapper function is not the way to go
> here: either internalising the LRU management into the slab
> allocation, or adding the memcg code to alloc_inode() and the
> filesystem specific routines, would make a lot more sense to me.

Sure. If we introduce kmem_cache_alloc_lru(), all filesystems need to
migrate to it. I cannot figure out an approach that does not need
changes to filesystem code.

Thanks.
On Mon, May 03, 2021 at 02:33:21PM +0800, Muchun Song wrote:
> On Mon, May 3, 2021 at 7:58 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > [... snip ...]
> >
> > So I suspect that an external wrapper function is not the way to go
> > here: either internalising the LRU management into the slab
> > allocation, or adding the memcg code to alloc_inode() and the
> > filesystem specific routines, would make a lot more sense to me.
>
> Sure. If we introduce kmem_cache_alloc_lru(), all filesystems need to
> migrate to it. I cannot figure out an approach that does not need
> changes to filesystem code.

Right, I don't think there's a way to avoid changing all the filesystem
code if we are touching the cache allocation routines. However, if we
hide it all inside the allocation routine, then the change to each
filesystem is effectively just a one-liner like:

-	inode = kmem_cache_alloc(inode_cache, GFP_NOFS);
+	inode = kmem_cache_alloc_lru(inode_cache, &sb->s_inode_lru, GFP_NOFS);

Or perhaps, define a generic wrapper function like:

static inline void *
alloc_inode_sb(struct super_block *sb, struct kmem_cache *cache, gfp_t gfp)
{
	return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp);
}

and then each filesystem ends up with:

-	inode = kmem_cache_alloc(inode_cache, GFP_NOFS);
+	inode = alloc_inode_sb(sb, inode_cache, GFP_NOFS);

so that all the superblock LRU stuff is also hidden from the
filesystems...

Cheers,

Dave.
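(As an illustration of how small the per-filesystem change becomes with
such a wrapper, a converted ->alloc_inode might look like the following
ext2-style sketch. This is hypothetical, not a patch from this thread,
and elides the other per-inode initialisation a real ->alloc_inode
performs; only the allocation call differs from stock code:

    static struct inode *ext2_alloc_inode(struct super_block *sb)
    {
            struct ext2_inode_info *ei;

            /*
             * Was: kmem_cache_alloc(ext2_inode_cachep, GFP_KERNEL).
             * Threading sb through lets the allocator populate
             * sb->s_inode_lru for this memcg up front.
             */
            ei = alloc_inode_sb(sb, ext2_inode_cachep, GFP_KERNEL);
            if (!ei)
                    return NULL;
            return &ei->vfs_inode;
    })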
On Wed, May 5, 2021 at 9:13 AM Dave Chinner <david@fromorbit.com> wrote:
>
> [... snip ...]
>
> Right, I don't think there's a way to avoid changing all the filesystem
> code if we are touching the cache allocation routines. However, if we
> hide it all inside the allocation routine, then the change to each
> filesystem is effectively just a one-liner like:
>
> -	inode = kmem_cache_alloc(inode_cache, GFP_NOFS);
> +	inode = kmem_cache_alloc_lru(inode_cache, &sb->s_inode_lru, GFP_NOFS);
>
> Or perhaps, define a generic wrapper function like:
>
> static inline void *
> alloc_inode_sb(struct super_block *sb, struct kmem_cache *cache, gfp_t gfp)
> {
> 	return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp);
> }

Good idea. I am working on this. A preliminary patchset is expected
next week.

Thanks.

> and then each filesystem ends up with:
>
> -	inode = kmem_cache_alloc(inode_cache, GFP_NOFS);
> +	inode = alloc_inode_sb(sb, inode_cache, GFP_NOFS);
>
> so that all the superblock LRU stuff is also hidden from the
> filesystems...
>
> Cheers,
>
> Dave.