
[v5,10/16] mm: list_lru: allocate list_lru_one only when needed

Message ID 20211220085649.8196-11-songmuchun@bytedance.com (mailing list archive)
State New
Series Optimize list lru memory consumption

Commit Message

Muchun Song Dec. 20, 2021, 8:56 a.m. UTC
On our servers, we found a suspected memory leak problem: the kmalloc-32
slab cache consumes more than 6GB of memory, while other kmem_caches
consume less than 2GB of memory.

After an in-depth analysis, it turns out that the memory consumed by the
kmalloc-32 slab cache comes from list_lru_one allocations.

  crash> p memcg_nr_cache_ids
  memcg_nr_cache_ids = $2 = 24574

memcg_nr_cache_ids is very large, and the memory consumption of each
list_lru can be calculated with the following formula.

  num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)

There are 4 NUMA nodes in our system, so each list_lru consumes ~3MB.

  crash> list super_blocks | wc -l
  952

Every mount registers 2 list_lrus: one for inodes and another for dentries.
There are 952 super_blocks, so the total memory is 952 * 2 * 3 MB (~5.6GB).
But the number of memory cgroups is less than 500, so I guess more than
12286 containers have been deployed on this machine (I do not know why
there are so many containers; it may be a user bug or the user really
wants to do that), and memcg_nr_cache_ids has not been reduced to a
suitable value. This can waste a lot of memory.
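
For clarity, the arithmetic behind these estimates, using only the numbers
reported above:

  per list_lru:  4 nodes * 24574 ids * 32 bytes          ~= 3 MB
  in total:      952 super_blocks * 2 list_lrus * 3 MB   ~= 5.6 GB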

Now that the infrastructure for dynamic list_lru_one allocation is ready,
remove the static allocation code to save memory.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/list_lru.h |   7 +--
 mm/list_lru.c            | 121 ++++++++++++++++++++++++++---------------------
 mm/memcontrol.c          |   6 ++-
 3 files changed, 77 insertions(+), 57 deletions(-)

Comments

Michal Koutný Jan. 6, 2022, 11 a.m. UTC | #1
On Mon, Dec 20, 2021 at 04:56:43PM +0800, Muchun Song <songmuchun@bytedance.com> wrote:
(Thanks for pointing me here.)

> -void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg)
> +void memcg_drain_all_list_lrus(struct mem_cgroup *src, struct mem_cgroup *dst)
>  {
> +	struct cgroup_subsys_state *css;
>  	struct list_lru *lru;
> +	int src_idx = src->kmemcg_id;
> +
> +	/*
> +	 * Change kmemcg_id of this cgroup and all its descendants to the
> +	 * parent's id, and then move all entries from this cgroup's list_lrus
> +	 * to ones of the parent.
> +	 *
> +	 * After we have finished, all list_lrus corresponding to this cgroup
> +	 * are guaranteed to remain empty. So we can safely free this cgroup's
> +	 * list lrus in memcg_list_lru_free().
> +	 *
> +	 * Changing ->kmemcg_id to the parent can prevent memcg_list_lru_alloc()
> +	 * from allocating list lrus for this cgroup after memcg_list_lru_free()
> +	 * call.
> +	 */
> +	rcu_read_lock();
> +	css_for_each_descendant_pre(css, &src->css) {
> +		struct mem_cgroup *memcg;
> +
> +		memcg = mem_cgroup_from_css(css);
> +		memcg->kmemcg_id = dst->kmemcg_id;
> +	}
> +	rcu_read_unlock();

Do you envision using this function anywhere else beside offlining?
If not, you shouldn't need traversing whole subtree because normally
parents are offlined only after children (see cgroup_subsys_state.online_cnt).

>  
>  	mutex_lock(&list_lrus_mutex);
>  	list_for_each_entry(lru, &memcg_list_lrus, list)
> -		memcg_drain_list_lru(lru, src_idx, dst_memcg);
> +		memcg_drain_list_lru(lru, src_idx, dst);
>  	mutex_unlock(&list_lrus_mutex);

If you do, then here you only drain list_lru of the subtree root but not
the descendants anymore.

So I do get that mem_cgroup.kmemcg_id references the "effective"
kmemcg_id after offlining, so that proper list_lrus are used afterwards.

I wonder -- is this necessary when objcgs are reparented too? IOW would
anyone query the offlined child's kmemcg_id?
(Maybe it's worth explaining better in the commit message; I think even the
current approach is OK (better safe than sorry).)


Thanks,
Michal
Roman Gushchin Jan. 11, 2022, 8 p.m. UTC | #2
On Mon, Dec 20, 2021 at 04:56:43PM +0800, Muchun Song wrote:
> In our server, we found a suspected memory leak problem. The kmalloc-32
> consumes more than 6GB of memory. Other kmem_caches consume less than
> 2GB memory.
> 
> After our in-depth analysis, the memory consumption of kmalloc-32 slab
> cache is the cause of list_lru_one allocation.
> 
>   crash> p memcg_nr_cache_ids
>   memcg_nr_cache_ids = $2 = 24574
> 
> memcg_nr_cache_ids is very large and memory consumption of each list_lru
> can be calculated with the following formula.
> 
>   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> 
> There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
> 
>   crash> list super_blocks | wc -l
>   952
> 
> Every mount will register 2 list lrus, one is for inode, another is for
> dentry. There are 952 super_blocks. So the total memory is 952 * 2 * 3
> MB (~5.6GB). But the number of memory cgroup is less than 500. So I
> guess more than 12286 containers have been deployed on this machine (I
> do not know why there are so many containers, it may be a user's bug or
> the user really want to do that). And memcg_nr_cache_ids has not been
> reduced to a suitable value. This can waste a lot of memory.

But on the other side, you increase the size of struct list_lru_per_memcg,
so if the number of cgroups is close to memcg_nr_cache_ids, we can actually
waste more memory. I'm not saying the change is not worth it, but it would
be nice to add some real-world numbers.

Or is it all irrelevant and done as preparation for the conversion to xarray?
If so, please make it clear.

Thanks!
Muchun Song Jan. 12, 2022, 4:48 a.m. UTC | #3
On Wed, Jan 12, 2022 at 4:00 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Mon, Dec 20, 2021 at 04:56:43PM +0800, Muchun Song wrote:
> > In our server, we found a suspected memory leak problem. The kmalloc-32
> > consumes more than 6GB of memory. Other kmem_caches consume less than
> > 2GB memory.
> >
> > After our in-depth analysis, the memory consumption of kmalloc-32 slab
> > cache is the cause of list_lru_one allocation.
> >
> >   crash> p memcg_nr_cache_ids
> >   memcg_nr_cache_ids = $2 = 24574
> >
> > memcg_nr_cache_ids is very large and memory consumption of each list_lru
> > can be calculated with the following formula.
> >
> >   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> >
> > There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
> >
> >   crash> list super_blocks | wc -l
> >   952
> >
> > Every mount will register 2 list lrus, one is for inode, another is for
> > dentry. There are 952 super_blocks. So the total memory is 952 * 2 * 3
> > MB (~5.6GB). But the number of memory cgroup is less than 500. So I
> > guess more than 12286 containers have been deployed on this machine (I
> > do not know why there are so many containers, it may be a user's bug or
> > the user really want to do that). And memcg_nr_cache_ids has not been
> > reduced to a suitable value. This can waste a lot of memory.
>
> But on the other side you increase the size of struct list_lru_per_memcg,
> so if number of cgroups is close to memcg_nr_cache_ids, we can actually
> waste more memory.

The saving comes from the fact that we currently allocate scope for every
memcg to be able to be tracked on every superblock instantiated in the system,
regardless of whether that superblock is even accessible to that memcg.

In theory, the increase in struct list_lru_per_memcg is not significant;
most of the savings come from decreasing the number of struct
list_lru_per_memcg allocations.
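
A rough back-of-the-envelope sketch using the numbers from the commit
message (4 NUMA nodes, memcg_nr_cache_ids = 24574, fewer than 500 memcgs;
struct sizes are approximate and config-dependent):

  before: a list_lru_per_memcg for every id, used or not:
          24574 ids * 4 nodes * 32 bytes       ~= 3 MB per list_lru
  after:  a list_lru_per_memcg (now including an rcu_head) only for
          memcgs that actually touch the list_lru:
          500 memcgs * (4 * 32 + 16) bytes     ~= 70 KB per list_lru

The 24574-entry mlru[] pointer array (~192 KB per list_lru) exists in both
cases, and the per-object growth from the added rcu_head only starts to
matter when the number of memcgs approaches memcg_nr_cache_ids, which is
the situation you describe.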

> I'm not saying the change is not worth it, but would be
> nice to add some real-world numbers.

OK. I will do a test.

>
> Or it's all irrelevant and is done as a preparation to the conversion to xarray?

Right. It's also a preparation for the conversion to xarray.

> If so, please, make it clear.

Will do.

Thanks.
Muchun Song Jan. 12, 2022, 1:22 p.m. UTC | #4
On Thu, Jan 6, 2022 at 7:00 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Mon, Dec 20, 2021 at 04:56:43PM +0800, Muchun Song <songmuchun@bytedance.com> wrote:
> (Thanks for pointing me here.)
>
> > -void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg)
> > +void memcg_drain_all_list_lrus(struct mem_cgroup *src, struct mem_cgroup *dst)
> >  {
> > +     struct cgroup_subsys_state *css;
> >       struct list_lru *lru;
> > +     int src_idx = src->kmemcg_id;
> > +
> > +     /*
> > +      * Change kmemcg_id of this cgroup and all its descendants to the
> > +      * parent's id, and then move all entries from this cgroup's list_lrus
> > +      * to ones of the parent.
> > +      *
> > +      * After we have finished, all list_lrus corresponding to this cgroup
> > +      * are guaranteed to remain empty. So we can safely free this cgroup's
> > +      * list lrus in memcg_list_lru_free().
> > +      *
> > +      * Changing ->kmemcg_id to the parent can prevent memcg_list_lru_alloc()
> > +      * from allocating list lrus for this cgroup after memcg_list_lru_free()
> > +      * call.
> > +      */
> > +     rcu_read_lock();
> > +     css_for_each_descendant_pre(css, &src->css) {
> > +             struct mem_cgroup *memcg;
> > +
> > +             memcg = mem_cgroup_from_css(css);
> > +             memcg->kmemcg_id = dst->kmemcg_id;
> > +     }
> > +     rcu_read_unlock();
>
> Do you envision using this function anywhere else beside offlining?
> If not, you shouldn't need traversing whole subtree because normally
> parents are offlined only after children (see cgroup_subsys_state.online_cnt).
>
> >
> >       mutex_lock(&list_lrus_mutex);
> >       list_for_each_entry(lru, &memcg_list_lrus, list)
> > -             memcg_drain_list_lru(lru, src_idx, dst_memcg);
> > +             memcg_drain_list_lru(lru, src_idx, dst);
> >       mutex_unlock(&list_lrus_mutex);
>
> If you do, then here you only drain list_lru of the subtree root but not
> the descendants anymore.
>
> So I do get that mem_cgroup.kmemcg_id refernces the "effective"
> kmemcg_id after offlining, so that proper list_lrus are used afterwards.
>
> I wonder -- is this necessary when objcgs are reparented too? IOW would
> anyone query the offlined child's kmemcg_id?
> (Maybe it's worth explaining better in the commit message, I think even
> current approach is OK (better safe than sorry).)
>

Sorry for the late reply.

Imagine a bunch of memcgs as follows. C's parent is B, B's parent is A, and
A's parent is root. The numbers in parentheses are their respective
kmemcg_ids.

  root(-1) -> A(0) -> B(1) -> C(2)

CPU0:                                   CPU1:
memcg_list_lru_alloc(C)
                                        memcg_drain_all_list_lrus(C)
                                        memcg_drain_all_list_lrus(B)
                                        // Now C and B are offline. The
                                        // kmemcg_id becomes the following if
                                        // we do not change the kmemcg_id of its
                                        // descendants in
                                        // memcg_drain_all_list_lrus().
                                        //
                                        // root(-1) -> A(0) -> B(0) -> C(1)

  for (i = 0; memcg; memcg = parent_mem_cgroup(memcg), i++) {
      // allocate struct list_lru_per_memcg for memcg C
      table[i].mlru = memcg_init_list_lru_one(gfp);
  }

  spin_lock_irqsave(&lru->lock, flags);
  while (i--) {
      // here index = 1
      int index = table[i].memcg->kmemcg_id;

      struct list_lru_per_memcg *mlru = table[i].mlru;
      if (index < 0 || rcu_dereference_protected(mlrus->mlru[index], true))
          kfree(mlru);
      else
          // mlrus->mlru[index] will be assigned a new value regardless
          // memcg C is already offline.
          rcu_assign_pointer(mlrus->mlru[index], mlru);
  }
  spin_unlock_irqrestore(&lru->lock, flags);

So changing the ->kmemcg_id of all its descendants can prevent
memcg_list_lru_alloc() from allocating list lrus for the offlined
cgroup after memcg_list_lru_free() has been called.
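
For completeness, with the descendant walk the same sequence is intended to
resolve safely (a sketch reusing the illustrative ids above):

  memcg_drain_all_list_lrus(C):  root(-1) -> A(0) -> B(1) -> C(1)
  memcg_drain_all_list_lrus(B):  root(-1) -> A(0) -> B(0) -> C(0)

  // By the time memcg_list_lru_free() has cleared slot 2 (and later slot 1),
  // C's kmemcg_id no longer points at those slots. So memcg_list_lru_alloc()
  // can only install the pre-allocated mlru into a slot of a still-online
  // ancestor, or kfree() it if that slot is already populated.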

Thanks.
Michal Koutný Jan. 13, 2022, 1:32 p.m. UTC | #5
On Wed, Jan 12, 2022 at 09:22:36PM +0800, Muchun Song <songmuchun@bytedance.com> wrote:
>   root(-1) -> A(0) -> B(1) -> C(2)
> 
> CPU0:                                   CPU1:
> memcg_list_lru_alloc(C)
>                                         memcg_drain_all_list_lrus(C)
>                                         memcg_drain_all_list_lrus(B)
>                                         // Now C and B are offline. The
>                                         // kmemcg_id becomes the following if
>                                         // we do not the kmemcg_id of its
>                                         // descendants in
>                                         // memcg_drain_all_list_lrus().
>                                         //
>                                         // root(-1) -> A(0) -> B(0) -> C(1)
> 
>   for (i = 0; memcg; memcg = parent_mem_cgroup(memcg), i++) {
>       // allocate struct list_lru_per_memcg for memcg C
>       table[i].mlru = memcg_init_list_lru_one(gfp);
>   }
> 
>   spin_lock_irqsave(&lru->lock, flags);
>   while (i--) {
>       // here index = 1
>       int index = table[i].memcg->kmemcg_id;
> 
>       struct list_lru_per_memcg *mlru = table[i].mlru;
>       if (index < 0 || rcu_dereference_protected(mlrus->mlru[index], true))
>           kfree(mlru);
>       else
>           // mlrus->mlru[index] will be assigned a new value regardless
>           // memcg C is already offline.
>           rcu_assign_pointer(mlrus->mlru[index], mlru);
>   }
>   spin_unlock_irqrestore(&lru->lock, flags);
> 

> So changing ->kmemcg_id of all its descendants can prevent
> memcg_list_lru_alloc() from allocating list lrus for the offlined
> cgroup after memcg_list_lru_free() calling.

Thanks for the illustrative example. I can see how this can be a problem
in a general call of memcg_list_lru_alloc(C).

However, the code, as I understand it, resolves the memcg with which the lru
allocation should be associated via get_mem_cgroup_from_objcg(), and
memcg_reparent_list_lrus(C) comes after memcg_reparent_objcgs(C, B),
i.e. the allocation would target B (or even A if it comes after
memcg_reparent_objcgs(B, A))?

It seems to me like "wasting" the existing objcg reparenting mechanism.
Or what do you think could be a problem relying on it?

Thanks,
Michal
Muchun Song Jan. 18, 2022, 12:05 p.m. UTC | #6
On Thu, Jan 13, 2022 at 9:32 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Wed, Jan 12, 2022 at 09:22:36PM +0800, Muchun Song <songmuchun@bytedance.com> wrote:
> >   root(-1) -> A(0) -> B(1) -> C(2)
> >
> > CPU0:                                   CPU1:
> > memcg_list_lru_alloc(C)
> >                                         memcg_drain_all_list_lrus(C)
> >                                         memcg_drain_all_list_lrus(B)
> >                                         // Now C and B are offline. The
> >                                         // kmemcg_id becomes the following if
> >                                         // we do not the kmemcg_id of its
> >                                         // descendants in
> >                                         // memcg_drain_all_list_lrus().
> >                                         //
> >                                         // root(-1) -> A(0) -> B(0) -> C(1)
> >
> >   for (i = 0; memcg; memcg = parent_mem_cgroup(memcg), i++) {
> >       // allocate struct list_lru_per_memcg for memcg C
> >       table[i].mlru = memcg_init_list_lru_one(gfp);
> >   }
> >
> >   spin_lock_irqsave(&lru->lock, flags);
> >   while (i--) {
> >       // here index = 1
> >       int index = table[i].memcg->kmemcg_id;
> >
> >       struct list_lru_per_memcg *mlru = table[i].mlru;
> >       if (index < 0 || rcu_dereference_protected(mlrus->mlru[index], true))
> >           kfree(mlru);
> >       else
> >           // mlrus->mlru[index] will be assigned a new value regardless
> >           // memcg C is already offline.
> >           rcu_assign_pointer(mlrus->mlru[index], mlru);
> >   }
> >   spin_unlock_irqrestore(&lru->lock, flags);
> >
>
> > So changing ->kmemcg_id of all its descendants can prevent
> > memcg_list_lru_alloc() from allocating list lrus for the offlined
> > cgroup after memcg_list_lru_free() calling.
>
> Thanks for the illustrative example. I can see how this can be a problem
> in a general call of memcg_list_lru_alloc(C).
>
> However, the code, as I understand it, resolves the memcg to which lru
> allocation should be associated via get_mem_cgroup_from_objcg() and
> memcg_reparent_list_lrus(C) comes after memcg_reparent_objcgs(C, B),
> i.e. the allocation would target B (or even A if after
> memcg_reparent_objcgs(B, A))?
>
> It seems to me like "wasting" the existing objcg reparenting mechanism.
> Or what do you think could be a problem relying on it?
>

I have thought about this. Relying on objcg reparenting is a little
different, since the user can get the memcg from the objcg and then
not realize that the memcg has been reparented. Although it can check
memcg->objcg to know whether the memcg has been reparented, it would
also have to prevent this memcg from being reparented throughout
memcg_list_lru_alloc(). Maybe holding css_set_lock could do that, but
I do not think this is a good choice. Do you have any thoughts about this?

Thanks.
Michal Koutný Jan. 19, 2022, 9:33 a.m. UTC | #7
On Tue, Jan 18, 2022 at 08:05:44PM +0800, Muchun Song <songmuchun@bytedance.com> wrote:
> I have thought about this. It's a little different to rely on objcg
> reparenting since the user can get memcg from objcg and
> then does not realize the memcg has reparented.

Now that you've pointed that out, I'm also wondering how
memcg_list_lru_alloc() would be synchronized against
reparenting/renumbering of kmemcg_ids. What I suspect is that a newly
allocated mlru may be stored into the xarray with a stale kmemcg_id.

> Maybe holding css_set_lock can do that. I do not think this
> is a good choice.

I agree, it doesn't sound well.

> Do you have any thoughts about this?

Thoughts / questions about what I don't understand well:
- Why do you allocate mlrus for all ancestors in memcg_list_lru_alloc()?
  - It'd be sufficient to allocate just for the current memcg.
  - Possibly allocate ancestors upon reparenting (to simplify the
    allocation from slab_pre_alloc_hook itself).
    
- What is the per-kmemcg_id lookup good for?
  - I observe most calls of list_lru_from_memcg_idx() come from callers
    that know memcg (or even objcg).
  - The non-specific use case seems to be list_lru_walk_node(), working with
    a per-node and not a per-memcg projection.
    - Consequently, that is only used over all nodes anyway
      (list_lru_walk()).
  - The idea behind this question is -- attach the list_lrus to
    obj_cgroup (and decommission the kmemcg_id completely).
    (Not necessarily part of this series but independent approach.)

Thanks,
Michal
Muchun Song Jan. 21, 2022, 5:28 a.m. UTC | #8
On Wed, Jan 19, 2022 at 5:33 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Tue, Jan 18, 2022 at 08:05:44PM +0800, Muchun Song <songmuchun@bytedance.com> wrote:
> > I have thought about this. It's a little different to rely on objcg
> > reparenting since the user can get memcg from objcg and
> > then does not realize the memcg has reparented.
>
> When you pointed that out, I'm now also wondering how
> memcg_list_lru_alloc() would be synchronized against
> reparenting/renumbering of kmemcg_ids. What I suspect is that newly
> allocated mlru may be stored into the xarray with a stale kmemcg_id.

The synchronization is based on list_lru->lock.
memcg_list_lru_free() will help us do the housekeeping.

>
> > Maybe holding css_set_lock can do that. I do not think this
> > is a good choice.
>
> I agree, it doesn't sound well.
>
> > Do you have any thoughts about this?
>
> Thoughts / questions of what I don't undestand well:
> - Why do you allocate mlrus for all ancestors in memcg_list_lru_alloc()?

It's because of reparenting: when a memcg is offlined, its entries are
moved to an ancestor's list_lru, so the ancestors' mlrus need to exist as well.

>   - It'd be sufficient to allocate just for the current memcg.
>   - Possibly allocate ancestors upon reparenting (to simplify the
>     allocation from slab_pre_alloc_hook itself).

I agree it would be nice to only allocate for the current memcg, but
reparenting cannot handle memory allocation failure.

>
> - What is the per-kmemcg_id lookup good for?
>   - I observe most calls of list_lru_from_memcg_idx() come from callers
>     that know memcg (or even objcg).
>   - The non-specific use case seems list_lru_walk_node() working with
>     per-node and not per-memcg projection.
>     - Consequently that is only used over all nodes anyway
>       (list_lru_walk().
>   - The idea behind this question is -- attach the list_lrus to
>     obj_cgroup (and decommission the kmemcg_id completely).
>     (Not necessarily part of this series but independent approach.)
>

I have some questions about this idea.
We would attach more than one list_lru to an obj_cgroup,
right? How would we arrange those list_lrus, in an array or a linked list?

Thanks.

Patch

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index ab912c49334f..c36db6dc2a65 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -32,14 +32,15 @@  struct list_lru_one {
 };
 
 struct list_lru_per_memcg {
+	struct rcu_head		rcu;
 	/* array of per cgroup per node lists, indexed by node id */
-	struct list_lru_one	node[0];
+	struct list_lru_one	node[];
 };
 
 struct list_lru_memcg {
 	struct rcu_head			rcu;
 	/* array of per cgroup lists, indexed by memcg_cache_id */
-	struct list_lru_per_memcg	*mlru[];
+	struct list_lru_per_memcg __rcu	*mlru[];
 };
 
 struct list_lru_node {
@@ -77,7 +78,7 @@  int __list_lru_init(struct list_lru *lru, bool memcg_aware,
 int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru,
 			 gfp_t gfp);
 int memcg_update_all_list_lrus(int num_memcgs);
-void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg);
+void memcg_drain_all_list_lrus(struct mem_cgroup *src, struct mem_cgroup *dst);
 
 /**
  * list_lru_add: add an element to the lru list's tail
diff --git a/mm/list_lru.c b/mm/list_lru.c
index bffa80527723..fc938d8ff48f 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -60,8 +60,12 @@  list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
 	 * from relocation (see memcg_update_list_lru).
 	 */
 	mlrus = rcu_dereference_check(lru->mlrus, lockdep_is_held(&nlru->lock));
-	if (mlrus && idx >= 0)
-		return &mlrus->mlru[idx]->node[nid];
+	if (mlrus && idx >= 0) {
+		struct list_lru_per_memcg *mlru;
+
+		mlru = rcu_dereference_check(mlrus->mlru[idx], true);
+		return mlru ? &mlru->node[nid] : NULL;
+	}
 	return &nlru->lru;
 }
 
@@ -188,7 +192,7 @@  unsigned long list_lru_count_one(struct list_lru *lru,
 
 	rcu_read_lock();
 	l = list_lru_from_memcg_idx(lru, nid, memcg_cache_id(memcg));
-	count = READ_ONCE(l->nr_items);
+	count = l ? READ_ONCE(l->nr_items) : 0;
 	rcu_read_unlock();
 
 	if (unlikely(count < 0))
@@ -217,8 +221,11 @@  __list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx,
 	struct list_head *item, *n;
 	unsigned long isolated = 0;
 
-	l = list_lru_from_memcg_idx(lru, nid, memcg_idx);
 restart:
+	l = list_lru_from_memcg_idx(lru, nid, memcg_idx);
+	if (!l)
+		goto out;
+
 	list_for_each_safe(item, n, &l->list) {
 		enum lru_status ret;
 
@@ -262,6 +269,7 @@  __list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx,
 			BUG();
 		}
 	}
+out:
 	return isolated;
 }
 
@@ -354,20 +362,25 @@  static struct list_lru_per_memcg *memcg_init_list_lru_one(gfp_t gfp)
 	return mlru;
 }
 
-static int memcg_init_list_lru_range(struct list_lru_memcg *mlrus,
-				     int begin, int end)
+static void memcg_list_lru_free(struct list_lru *lru, int src_idx)
 {
-	int i;
+	struct list_lru_memcg *mlrus;
+	struct list_lru_per_memcg *mlru;
 
-	for (i = begin; i < end; i++) {
-		mlrus->mlru[i] = memcg_init_list_lru_one(GFP_KERNEL);
-		if (!mlrus->mlru[i])
-			goto fail;
-	}
-	return 0;
-fail:
-	memcg_destroy_list_lru_range(mlrus, begin, i);
-	return -ENOMEM;
+	spin_lock_irq(&lru->lock);
+	mlrus = rcu_dereference_protected(lru->mlrus, true);
+	mlru = rcu_dereference_protected(mlrus->mlru[src_idx], true);
+	rcu_assign_pointer(mlrus->mlru[src_idx], NULL);
+	spin_unlock_irq(&lru->lock);
+
+	/*
+	 * The __list_lru_walk_one() can walk the list of this node.
+	 * We need kvfree_rcu() here. And the walking of the list
+	 * is under lru->node[nid]->lock, which can serve as a RCU
+	 * read-side critical section.
+	 */
+	if (mlru)
+		kvfree_rcu(mlru, rcu);
 }
 
 static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
@@ -381,14 +394,10 @@  static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
 
 	spin_lock_init(&lru->lock);
 
-	mlrus = kvmalloc(struct_size(mlrus, mlru, size), GFP_KERNEL);
+	mlrus = kvzalloc(struct_size(mlrus, mlru, size), GFP_KERNEL);
 	if (!mlrus)
 		return -ENOMEM;
 
-	if (memcg_init_list_lru_range(mlrus, 0, size)) {
-		kvfree(mlrus);
-		return -ENOMEM;
-	}
 	RCU_INIT_POINTER(lru->mlrus, mlrus);
 
 	return 0;
@@ -422,13 +431,9 @@  static int memcg_update_list_lru(struct list_lru *lru, int old_size, int new_siz
 	if (!new)
 		return -ENOMEM;
 
-	if (memcg_init_list_lru_range(new, old_size, new_size)) {
-		kvfree(new);
-		return -ENOMEM;
-	}
-
 	spin_lock_irq(&lru->lock);
 	memcpy(&new->mlru, &old->mlru, flex_array_size(new, mlru, old_size));
+	memset(&new->mlru[old_size], 0, flex_array_size(new, mlru, new_size - old_size));
 	rcu_assign_pointer(lru->mlrus, new);
 	spin_unlock_irq(&lru->lock);
 
@@ -436,20 +441,6 @@  static int memcg_update_list_lru(struct list_lru *lru, int old_size, int new_siz
 	return 0;
 }
 
-static void memcg_cancel_update_list_lru(struct list_lru *lru,
-					 int old_size, int new_size)
-{
-	struct list_lru_memcg *mlrus;
-
-	mlrus = rcu_dereference_protected(lru->mlrus,
-					  lockdep_is_held(&list_lrus_mutex));
-	/*
-	 * Do not bother shrinking the array back to the old size, because we
-	 * cannot handle allocation failures here.
-	 */
-	memcg_destroy_list_lru_range(mlrus, old_size, new_size);
-}
-
 int memcg_update_all_list_lrus(int new_size)
 {
 	int ret = 0;
@@ -460,15 +451,10 @@  int memcg_update_all_list_lrus(int new_size)
 	list_for_each_entry(lru, &memcg_list_lrus, list) {
 		ret = memcg_update_list_lru(lru, old_size, new_size);
 		if (ret)
-			goto fail;
+			break;
 	}
-out:
 	mutex_unlock(&list_lrus_mutex);
 	return ret;
-fail:
-	list_for_each_entry_continue_reverse(lru, &memcg_list_lrus, list)
-		memcg_cancel_update_list_lru(lru, old_size, new_size);
-	goto out;
 }
 
 static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
@@ -485,6 +471,8 @@  static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
 	spin_lock_irq(&nlru->lock);
 
 	src = list_lru_from_memcg_idx(lru, nid, src_idx);
+	if (!src)
+		goto out;
 	dst = list_lru_from_memcg_idx(lru, nid, dst_idx);
 
 	list_splice_init(&src->list, &dst->list);
@@ -494,7 +482,7 @@  static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
 		set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
 		src->nr_items = 0;
 	}
-
+out:
 	spin_unlock_irq(&nlru->lock);
 }
 
@@ -505,15 +493,41 @@  static void memcg_drain_list_lru(struct list_lru *lru,
 
 	for_each_node(i)
 		memcg_drain_list_lru_node(lru, i, src_idx, dst_memcg);
+
+	memcg_list_lru_free(lru, src_idx);
 }
 
-void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg)
+void memcg_drain_all_list_lrus(struct mem_cgroup *src, struct mem_cgroup *dst)
 {
+	struct cgroup_subsys_state *css;
 	struct list_lru *lru;
+	int src_idx = src->kmemcg_id;
+
+	/*
+	 * Change kmemcg_id of this cgroup and all its descendants to the
+	 * parent's id, and then move all entries from this cgroup's list_lrus
+	 * to ones of the parent.
+	 *
+	 * After we have finished, all list_lrus corresponding to this cgroup
+	 * are guaranteed to remain empty. So we can safely free this cgroup's
+	 * list lrus in memcg_list_lru_free().
+	 *
+	 * Changing ->kmemcg_id to the parent can prevent memcg_list_lru_alloc()
+	 * from allocating list lrus for this cgroup after memcg_list_lru_free()
+	 * call.
+	 */
+	rcu_read_lock();
+	css_for_each_descendant_pre(css, &src->css) {
+		struct mem_cgroup *memcg;
+
+		memcg = mem_cgroup_from_css(css);
+		memcg->kmemcg_id = dst->kmemcg_id;
+	}
+	rcu_read_unlock();
 
 	mutex_lock(&list_lrus_mutex);
 	list_for_each_entry(lru, &memcg_list_lrus, list)
-		memcg_drain_list_lru(lru, src_idx, dst_memcg);
+		memcg_drain_list_lru(lru, src_idx, dst);
 	mutex_unlock(&list_lrus_mutex);
 }
 
@@ -528,7 +542,7 @@  static bool memcg_list_lru_allocated(struct mem_cgroup *memcg,
 		return true;
 
 	rcu_read_lock();
-	allocated = !!rcu_dereference(lru->mlrus)->mlru[idx];
+	allocated = !!rcu_access_pointer(rcu_dereference(lru->mlrus)->mlru[idx]);
 	rcu_read_unlock();
 
 	return allocated;
@@ -576,11 +590,12 @@  int memcg_list_lru_alloc(struct mem_cgroup *memcg, struct list_lru *lru,
 	mlrus = rcu_dereference_protected(lru->mlrus, true);
 	while (i--) {
 		int index = table[i].memcg->kmemcg_id;
+		struct list_lru_per_memcg *mlru = table[i].mlru;
 
-		if (mlrus->mlru[index])
-			kfree(table[i].mlru);
+		if (index < 0 || rcu_dereference_protected(mlrus->mlru[index], true))
+			kfree(mlru);
 		else
-			mlrus->mlru[index] = table[i].mlru;
+			rcu_assign_pointer(mlrus->mlru[index], mlru);
 	}
 	spin_unlock_irqrestore(&lru->lock, flags);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ec7a62f39326..94d8f742c32e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3643,6 +3643,10 @@  static void memcg_offline_kmem(struct mem_cgroup *memcg)
 
 	memcg_reparent_objcgs(memcg, parent);
 
+	/*
+	 * memcg_drain_all_list_lrus() can change memcg->kmemcg_id.
+	 * Cache it to local @kmemcg_id.
+	 */
 	kmemcg_id = memcg->kmemcg_id;
 
 	/*
@@ -3651,7 +3655,7 @@  static void memcg_offline_kmem(struct mem_cgroup *memcg)
 	 * The ordering is imposed by list_lru_node->lock taken by
 	 * memcg_drain_all_list_lrus().
 	 */
-	memcg_drain_all_list_lrus(kmemcg_id, parent);
+	memcg_drain_all_list_lrus(memcg, parent);
 
 	memcg_free_cache_id(kmemcg_id);
 }