diff mbox series

[v5,01/16] mm: list_lru: optimize memory consumption of arrays of per cgroup lists

Message ID 20211220085649.8196-2-songmuchun@bytedance.com (mailing list archive)
State New
Headers show
Series Optimize list lru memory consumption | expand

Commit Message

Muchun Song Dec. 20, 2021, 8:56 a.m. UTC
The list_lru uses an array (list_lru_memcg->lru) to store pointers
which point to the list_lru_one. And the array is per memcg per node.
Therefore, the size of the arrays will be 10K * number_of_node * 8 (
a pointer size on 64 bits system) when we run 10k containers in the
system. The memory consumption of the arrays becomes significant. The
more numa node, the more memory it consumes.

I have done a simple test, which creates 10K memcg and mount point
each in a two-node system. The memory consumption of the list_lru
will be 24464MB. After converting the array from per memcg per node
to per memcg, the memory consumption is going to be 21957MB. It is
reduces by 2.5GB. In our AMD servers with 8 numa nodes in those
sysuem, the memory consumption could be more significant. The savings
come from the list_lru_one heads, that it also simplifies the
alloc/dealloc path.

The new scheme looks like the following.

  +----------+   mlrus   +----------------+   mlru   +----------------------+
  | list_lru +---------->| list_lru_memcg +--------->|  list_lru_per_memcg  |
  +----------+           +----------------+          +----------------------+
                                                     |  list_lru_per_memcg  |
                                                     +----------------------+
                                                     |          ...         |
                          +--------------+   node    +----------------------+
                          | list_lru_one |<----------+  list_lru_per_memcg  |
                          +--------------+           +----------------------+
                          | list_lru_one |
                          +--------------+
                          |      ...     |
                          +--------------+
                          | list_lru_one |
                          +--------------+

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/list_lru.h |  17 ++--
 mm/list_lru.c            | 206 +++++++++++++++++------------------------------
 2 files changed, 86 insertions(+), 137 deletions(-)

Comments

Roman Gushchin Jan. 7, 2022, 12:05 a.m. UTC | #1
On Mon, Dec 20, 2021 at 04:56:34PM +0800, Muchun Song wrote:
> The list_lru uses an array (list_lru_memcg->lru) to store pointers
> which point to the list_lru_one. And the array is per memcg per node.
> Therefore, the size of the arrays will be 10K * number_of_node * 8 (
> a pointer size on 64 bits system) when we run 10k containers in the
> system. The memory consumption of the arrays becomes significant. The
> more numa node, the more memory it consumes.
> 
> I have done a simple test, which creates 10K memcg and mount point
> each in a two-node system. The memory consumption of the list_lru
> will be 24464MB. After converting the array from per memcg per node
> to per memcg, the memory consumption is going to be 21957MB. It is
> reduces by 2.5GB. In our AMD servers with 8 numa nodes in those
> sysuem, the memory consumption could be more significant. The savings
> come from the list_lru_one heads, that it also simplifies the
> alloc/dealloc path.
> 
> The new scheme looks like the following.
> 
>   +----------+   mlrus   +----------------+   mlru   +----------------------+
>   | list_lru +---------->| list_lru_memcg +--------->|  list_lru_per_memcg  |
>   +----------+           +----------------+          +----------------------+
>                                                      |  list_lru_per_memcg  |
>                                                      +----------------------+
>                                                      |          ...         |
>                           +--------------+   node    +----------------------+
>                           | list_lru_one |<----------+  list_lru_per_memcg  |
>                           +--------------+           +----------------------+
>                           | list_lru_one |
>                           +--------------+
>                           |      ...     |
>                           +--------------+
>                           | list_lru_one |
>                           +--------------+
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

As much as I like the code changes (there is indeed a significant simplification!),
I don't like the commit message and title, because I wasn't able to understand
what the patch is doing and some parts look simply questionable. Overall it
sounds like you reduce the number of list_lru_one structures, which is not true.

How about something like this?

--
mm: list_lru: transpose the array of per-node per-memcg lru lists

The current scheme of maintaining per-node per-memcg lru lists looks like:
  struct list_lru {
    struct list_lru_node *node;           (for each node)
      struct list_lru_memcg *memcg_lrus;
        struct list_lru_one *lru[];       (for each memcg)
  }

By effectively transposing the two-dimension array of list_lru_one's structures
(per-node per-memcg => per-memcg per-node) it's possible to save some memory
and simplify alloc/dealloc paths. The new scheme looks like:
  struct list_lru {
    struct list_lru_memcg *mlrus;
      struct list_lru_per_memcg *mlru[];  (for each memcg)
        struct list_lru_one node[0];      (for each node)
  }

Memory savings are coming from having fewer list_lru_memcg structures, which
contain an extra struct rcu_head to handle the destruction process.
--

But what worries me is that memory savings numbers you posted don't do up.
In theory we can save
16 (size of struct rcu_head) * 10000 (number of cgroups) * 2 (number of numa nodes) = 320k
per slab cache. Did you have a ton of mount points? Otherwise I don't understand
where these 2.5Gb are coming from.

Thanks!
Muchun Song Jan. 9, 2022, 4:49 a.m. UTC | #2
On Fri, Jan 7, 2022 at 8:05 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Mon, Dec 20, 2021 at 04:56:34PM +0800, Muchun Song wrote:
> > The list_lru uses an array (list_lru_memcg->lru) to store pointers
> > which point to the list_lru_one. And the array is per memcg per node.
> > Therefore, the size of the arrays will be 10K * number_of_node * 8 (
> > a pointer size on 64 bits system) when we run 10k containers in the
> > system. The memory consumption of the arrays becomes significant. The
> > more numa node, the more memory it consumes.
> >
> > I have done a simple test, which creates 10K memcg and mount point
> > each in a two-node system. The memory consumption of the list_lru
> > will be 24464MB. After converting the array from per memcg per node
> > to per memcg, the memory consumption is going to be 21957MB. It is
> > reduces by 2.5GB. In our AMD servers with 8 numa nodes in those
> > sysuem, the memory consumption could be more significant. The savings
> > come from the list_lru_one heads, that it also simplifies the
> > alloc/dealloc path.
> >
> > The new scheme looks like the following.
> >
> >   +----------+   mlrus   +----------------+   mlru   +----------------------+
> >   | list_lru +---------->| list_lru_memcg +--------->|  list_lru_per_memcg  |
> >   +----------+           +----------------+          +----------------------+
> >                                                      |  list_lru_per_memcg  |
> >                                                      +----------------------+
> >                                                      |          ...         |
> >                           +--------------+   node    +----------------------+
> >                           | list_lru_one |<----------+  list_lru_per_memcg  |
> >                           +--------------+           +----------------------+
> >                           | list_lru_one |
> >                           +--------------+
> >                           |      ...     |
> >                           +--------------+
> >                           | list_lru_one |
> >                           +--------------+
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>
> As much as I like the code changes (there is indeed a significant simplification!),
> I don't like the commit message and title, because I wasn't able to understand
> what the patch is doing and some parts look simply questionable. Overall it
> sounds like you reduce the number of list_lru_one structures, which is not true.
>
> How about something like this?
>
> --
> mm: list_lru: transpose the array of per-node per-memcg lru lists
>
> The current scheme of maintaining per-node per-memcg lru lists looks like:
>   struct list_lru {
>     struct list_lru_node *node;           (for each node)
>       struct list_lru_memcg *memcg_lrus;
>         struct list_lru_one *lru[];       (for each memcg)
>   }
>
> By effectively transposing the two-dimension array of list_lru_one's structures
> (per-node per-memcg => per-memcg per-node) it's possible to save some memory
> and simplify alloc/dealloc paths. The new scheme looks like:
>   struct list_lru {
>     struct list_lru_memcg *mlrus;
>       struct list_lru_per_memcg *mlru[];  (for each memcg)
>         struct list_lru_one node[0];      (for each node)
>   }
>
> Memory savings are coming from having fewer list_lru_memcg structures, which
> contain an extra struct rcu_head to handle the destruction process.

My bad English. Actually, the saving is coming from not only 'struct rcu_head'
but also some pointer arrays used to store the pointer to 'struct list_lru_one'.
The array is per node and its size is 8 (a pointer) * num_memcgs. So the total
size of the arrays is  8 * num_nodes * memcg_nr_cache_ids. After this patch,
the size becomes 8 * memcg_nr_cache_ids. So the saving is

   8 * (num_nodes - 1) * memcg_nr_cache_ids.

> --
>
> But what worries me is that memory savings numbers you posted don't do up.
> In theory we can save
> 16 (size of struct rcu_head) * 10000 (number of cgroups) * 2 (number of numa nodes) = 320k
> per slab cache. Did you have a ton of mount points? Otherwise I don't understand
> where these 2.5Gb are coming from.

memcg_nr_cache_ids is 12286 when creating 10k memcgs. So the saving
of arrays of one list_lru is 8 * 1 (number of numa nodes - 1) * 12286 = 96k.
There will be 2 * 10k list_lru when mounting 10k points. So the total
saving is 96k * 2 * 10k = 1920 M.

Thanks Roman.
Roman Gushchin Jan. 10, 2022, 6:42 p.m. UTC | #3
On Sun, Jan 09, 2022 at 12:49:56PM +0800, Muchun Song wrote:
> On Fri, Jan 7, 2022 at 8:05 AM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Mon, Dec 20, 2021 at 04:56:34PM +0800, Muchun Song wrote:
> > > The list_lru uses an array (list_lru_memcg->lru) to store pointers
> > > which point to the list_lru_one. And the array is per memcg per node.
> > > Therefore, the size of the arrays will be 10K * number_of_node * 8 (
> > > a pointer size on 64 bits system) when we run 10k containers in the
> > > system. The memory consumption of the arrays becomes significant. The
> > > more numa node, the more memory it consumes.
> > >
> > > I have done a simple test, which creates 10K memcg and mount point
> > > each in a two-node system. The memory consumption of the list_lru
> > > will be 24464MB. After converting the array from per memcg per node
> > > to per memcg, the memory consumption is going to be 21957MB. It is
> > > reduces by 2.5GB. In our AMD servers with 8 numa nodes in those
> > > sysuem, the memory consumption could be more significant. The savings
> > > come from the list_lru_one heads, that it also simplifies the
> > > alloc/dealloc path.
> > >
> > > The new scheme looks like the following.
> > >
> > >   +----------+   mlrus   +----------------+   mlru   +----------------------+
> > >   | list_lru +---------->| list_lru_memcg +--------->|  list_lru_per_memcg  |
> > >   +----------+           +----------------+          +----------------------+
> > >                                                      |  list_lru_per_memcg  |
> > >                                                      +----------------------+
> > >                                                      |          ...         |
> > >                           +--------------+   node    +----------------------+
> > >                           | list_lru_one |<----------+  list_lru_per_memcg  |
> > >                           +--------------+           +----------------------+
> > >                           | list_lru_one |
> > >                           +--------------+
> > >                           |      ...     |
> > >                           +--------------+
> > >                           | list_lru_one |
> > >                           +--------------+
> > >
> > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> >
> > As much as I like the code changes (there is indeed a significant simplification!),
> > I don't like the commit message and title, because I wasn't able to understand
> > what the patch is doing and some parts look simply questionable. Overall it
> > sounds like you reduce the number of list_lru_one structures, which is not true.
> >
> > How about something like this?
> >
> > --
> > mm: list_lru: transpose the array of per-node per-memcg lru lists
> >
> > The current scheme of maintaining per-node per-memcg lru lists looks like:
> >   struct list_lru {
> >     struct list_lru_node *node;           (for each node)
> >       struct list_lru_memcg *memcg_lrus;
> >         struct list_lru_one *lru[];       (for each memcg)
> >   }
> >
> > By effectively transposing the two-dimension array of list_lru_one's structures
> > (per-node per-memcg => per-memcg per-node) it's possible to save some memory
> > and simplify alloc/dealloc paths. The new scheme looks like:
> >   struct list_lru {
> >     struct list_lru_memcg *mlrus;
> >       struct list_lru_per_memcg *mlru[];  (for each memcg)
> >         struct list_lru_one node[0];      (for each node)
> >   }
> >
> > Memory savings are coming from having fewer list_lru_memcg structures, which
> > contain an extra struct rcu_head to handle the destruction process.
> 
> My bad English. Actually, the saving is coming from not only 'struct rcu_head'
> but also some pointer arrays used to store the pointer to 'struct list_lru_one'.
> The array is per node and its size is 8 (a pointer) * num_memcgs.

Nice! Please, add this to the commit log.

> So the total
> size of the arrays is  8 * num_nodes * memcg_nr_cache_ids. After this patch,
> the size becomes 8 * memcg_nr_cache_ids. So the saving is
> 
>    8 * (num_nodes - 1) * memcg_nr_cache_ids.
> 
> > --
> >
> > But what worries me is that memory savings numbers you posted don't do up.
> > In theory we can save
> > 16 (size of struct rcu_head) * 10000 (number of cgroups) * 2 (number of numa nodes) = 320k
> > per slab cache. Did you have a ton of mount points? Otherwise I don't understand
> > where these 2.5Gb are coming from.
> 
> memcg_nr_cache_ids is 12286 when creating 10k memcgs. So the saving
> of arrays of one list_lru is 8 * 1 (number of numa nodes - 1) * 12286 = 96k.
> There will be 2 * 10k list_lru when mounting 10k points. So the total
> saving is 96k * 2 * 10k = 1920 M.

So, there are 10k cgroups _and_ 10k mountpoints. Please, make it obvious from
the commit log. Most users don't have that many mount points (and likely cgroups),
so they shouldn't expect Gb's in savings.

Thanks!

PS I hope to review the rest of the patchset till the end of this week.
Muchun Song Jan. 11, 2022, 3:19 a.m. UTC | #4
On Tue, Jan 11, 2022 at 2:42 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Sun, Jan 09, 2022 at 12:49:56PM +0800, Muchun Song wrote:
> > On Fri, Jan 7, 2022 at 8:05 AM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Mon, Dec 20, 2021 at 04:56:34PM +0800, Muchun Song wrote:
> > > > The list_lru uses an array (list_lru_memcg->lru) to store pointers
> > > > which point to the list_lru_one. And the array is per memcg per node.
> > > > Therefore, the size of the arrays will be 10K * number_of_node * 8 (
> > > > a pointer size on 64 bits system) when we run 10k containers in the
> > > > system. The memory consumption of the arrays becomes significant. The
> > > > more numa node, the more memory it consumes.
> > > >
> > > > I have done a simple test, which creates 10K memcg and mount point
> > > > each in a two-node system. The memory consumption of the list_lru
> > > > will be 24464MB. After converting the array from per memcg per node
> > > > to per memcg, the memory consumption is going to be 21957MB. It is
> > > > reduces by 2.5GB. In our AMD servers with 8 numa nodes in those
> > > > sysuem, the memory consumption could be more significant. The savings
> > > > come from the list_lru_one heads, that it also simplifies the
> > > > alloc/dealloc path.
> > > >
> > > > The new scheme looks like the following.
> > > >
> > > >   +----------+   mlrus   +----------------+   mlru   +----------------------+
> > > >   | list_lru +---------->| list_lru_memcg +--------->|  list_lru_per_memcg  |
> > > >   +----------+           +----------------+          +----------------------+
> > > >                                                      |  list_lru_per_memcg  |
> > > >                                                      +----------------------+
> > > >                                                      |          ...         |
> > > >                           +--------------+   node    +----------------------+
> > > >                           | list_lru_one |<----------+  list_lru_per_memcg  |
> > > >                           +--------------+           +----------------------+
> > > >                           | list_lru_one |
> > > >                           +--------------+
> > > >                           |      ...     |
> > > >                           +--------------+
> > > >                           | list_lru_one |
> > > >                           +--------------+
> > > >
> > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > >
> > > As much as I like the code changes (there is indeed a significant simplification!),
> > > I don't like the commit message and title, because I wasn't able to understand
> > > what the patch is doing and some parts look simply questionable. Overall it
> > > sounds like you reduce the number of list_lru_one structures, which is not true.
> > >
> > > How about something like this?
> > >
> > > --
> > > mm: list_lru: transpose the array of per-node per-memcg lru lists
> > >
> > > The current scheme of maintaining per-node per-memcg lru lists looks like:
> > >   struct list_lru {
> > >     struct list_lru_node *node;           (for each node)
> > >       struct list_lru_memcg *memcg_lrus;
> > >         struct list_lru_one *lru[];       (for each memcg)
> > >   }
> > >
> > > By effectively transposing the two-dimension array of list_lru_one's structures
> > > (per-node per-memcg => per-memcg per-node) it's possible to save some memory
> > > and simplify alloc/dealloc paths. The new scheme looks like:
> > >   struct list_lru {
> > >     struct list_lru_memcg *mlrus;
> > >       struct list_lru_per_memcg *mlru[];  (for each memcg)
> > >         struct list_lru_one node[0];      (for each node)
> > >   }
> > >
> > > Memory savings are coming from having fewer list_lru_memcg structures, which
> > > contain an extra struct rcu_head to handle the destruction process.
> >
> > My bad English. Actually, the saving is coming from not only 'struct rcu_head'
> > but also some pointer arrays used to store the pointer to 'struct list_lru_one'.
> > The array is per node and its size is 8 (a pointer) * num_memcgs.
>
> Nice! Please, add this to the commit log.

Will do.

>
> > So the total
> > size of the arrays is  8 * num_nodes * memcg_nr_cache_ids. After this patch,
> > the size becomes 8 * memcg_nr_cache_ids. So the saving is
> >
> >    8 * (num_nodes - 1) * memcg_nr_cache_ids.
> >
> > > --
> > >
> > > But what worries me is that memory savings numbers you posted don't do up.
> > > In theory we can save
> > > 16 (size of struct rcu_head) * 10000 (number of cgroups) * 2 (number of numa nodes) = 320k
> > > per slab cache. Did you have a ton of mount points? Otherwise I don't understand
> > > where these 2.5Gb are coming from.
> >
> > memcg_nr_cache_ids is 12286 when creating 10k memcgs. So the saving
> > of arrays of one list_lru is 8 * 1 (number of numa nodes - 1) * 12286 = 96k.
> > There will be 2 * 10k list_lru when mounting 10k points. So the total
> > saving is 96k * 2 * 10k = 1920 M.
>
> So, there are 10k cgroups _and_ 10k mountpoints. Please, make it obvious from
> the commit log. Most users don't have that many mount points (and likely cgroups),
> so they shouldn't expect Gb's in savings.

I'll add those infos into the commit log.

>
> Thanks!
>
> PS I hope to review the rest of the patchset till the end of this week.

Thanks Roman.
diff mbox series

Patch

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 1b5fceb565df..729a27b6ff53 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -31,10 +31,15 @@  struct list_lru_one {
 	long			nr_items;
 };
 
+struct list_lru_per_memcg {
+	/* array of per cgroup per node lists, indexed by node id */
+	struct list_lru_one	node[0];
+};
+
 struct list_lru_memcg {
-	struct rcu_head		rcu;
+	struct rcu_head			rcu;
 	/* array of per cgroup lists, indexed by memcg_cache_id */
-	struct list_lru_one	*lru[];
+	struct list_lru_per_memcg	*mlru[];
 };
 
 struct list_lru_node {
@@ -42,11 +47,7 @@  struct list_lru_node {
 	spinlock_t		lock;
 	/* global list, used for the root cgroup in cgroup aware lrus */
 	struct list_lru_one	lru;
-#ifdef CONFIG_MEMCG_KMEM
-	/* for cgroup aware lrus points to per cgroup lists, otherwise NULL */
-	struct list_lru_memcg	__rcu *memcg_lrus;
-#endif
-	long nr_items;
+	long			nr_items;
 } ____cacheline_aligned_in_smp;
 
 struct list_lru {
@@ -55,6 +56,8 @@  struct list_lru {
 	struct list_head	list;
 	int			shrinker_id;
 	bool			memcg_aware;
+	/* for cgroup aware lrus points to per cgroup lists, otherwise NULL */
+	struct list_lru_memcg	__rcu *mlrus;
 #endif
 };
 
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 0cd5e89ca063..7d1356241aa8 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -49,35 +49,37 @@  static int lru_shrinker_id(struct list_lru *lru)
 }
 
 static inline struct list_lru_one *
-list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
+list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
 {
-	struct list_lru_memcg *memcg_lrus;
+	struct list_lru_memcg *mlrus;
+	struct list_lru_node *nlru = &lru->node[nid];
+
 	/*
 	 * Either lock or RCU protects the array of per cgroup lists
-	 * from relocation (see memcg_update_list_lru_node).
+	 * from relocation (see memcg_update_list_lru).
 	 */
-	memcg_lrus = rcu_dereference_check(nlru->memcg_lrus,
-					   lockdep_is_held(&nlru->lock));
-	if (memcg_lrus && idx >= 0)
-		return memcg_lrus->lru[idx];
+	mlrus = rcu_dereference_check(lru->mlrus, lockdep_is_held(&nlru->lock));
+	if (mlrus && idx >= 0)
+		return &mlrus->mlru[idx]->node[nid];
 	return &nlru->lru;
 }
 
 static inline struct list_lru_one *
-list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
+list_lru_from_kmem(struct list_lru *lru, int nid, void *ptr,
 		   struct mem_cgroup **memcg_ptr)
 {
+	struct list_lru_node *nlru = &lru->node[nid];
 	struct list_lru_one *l = &nlru->lru;
 	struct mem_cgroup *memcg = NULL;
 
-	if (!nlru->memcg_lrus)
+	if (!lru->mlrus)
 		goto out;
 
 	memcg = mem_cgroup_from_obj(ptr);
 	if (!memcg)
 		goto out;
 
-	l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
+	l = list_lru_from_memcg_idx(lru, nid, memcg_cache_id(memcg));
 out:
 	if (memcg_ptr)
 		*memcg_ptr = memcg;
@@ -103,18 +105,18 @@  static inline bool list_lru_memcg_aware(struct list_lru *lru)
 }
 
 static inline struct list_lru_one *
-list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
+list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
 {
-	return &nlru->lru;
+	return &lru->node[nid].lru;
 }
 
 static inline struct list_lru_one *
-list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
+list_lru_from_kmem(struct list_lru *lru, int nid, void *ptr,
 		   struct mem_cgroup **memcg_ptr)
 {
 	if (memcg_ptr)
 		*memcg_ptr = NULL;
-	return &nlru->lru;
+	return &lru->node[nid].lru;
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
@@ -127,7 +129,7 @@  bool list_lru_add(struct list_lru *lru, struct list_head *item)
 
 	spin_lock(&nlru->lock);
 	if (list_empty(item)) {
-		l = list_lru_from_kmem(nlru, item, &memcg);
+		l = list_lru_from_kmem(lru, nid, item, &memcg);
 		list_add_tail(item, &l->list);
 		/* Set shrinker bit if the first element was added */
 		if (!l->nr_items++)
@@ -150,7 +152,7 @@  bool list_lru_del(struct list_lru *lru, struct list_head *item)
 
 	spin_lock(&nlru->lock);
 	if (!list_empty(item)) {
-		l = list_lru_from_kmem(nlru, item, NULL);
+		l = list_lru_from_kmem(lru, nid, item, NULL);
 		list_del_init(item);
 		l->nr_items--;
 		nlru->nr_items--;
@@ -180,12 +182,11 @@  EXPORT_SYMBOL_GPL(list_lru_isolate_move);
 unsigned long list_lru_count_one(struct list_lru *lru,
 				 int nid, struct mem_cgroup *memcg)
 {
-	struct list_lru_node *nlru = &lru->node[nid];
 	struct list_lru_one *l;
 	long count;
 
 	rcu_read_lock();
-	l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
+	l = list_lru_from_memcg_idx(lru, nid, memcg_cache_id(memcg));
 	count = READ_ONCE(l->nr_items);
 	rcu_read_unlock();
 
@@ -206,16 +207,16 @@  unsigned long list_lru_count_node(struct list_lru *lru, int nid)
 EXPORT_SYMBOL_GPL(list_lru_count_node);
 
 static unsigned long
-__list_lru_walk_one(struct list_lru_node *nlru, int memcg_idx,
+__list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx,
 		    list_lru_walk_cb isolate, void *cb_arg,
 		    unsigned long *nr_to_walk)
 {
-
+	struct list_lru_node *nlru = &lru->node[nid];
 	struct list_lru_one *l;
 	struct list_head *item, *n;
 	unsigned long isolated = 0;
 
-	l = list_lru_from_memcg_idx(nlru, memcg_idx);
+	l = list_lru_from_memcg_idx(lru, nid, memcg_idx);
 restart:
 	list_for_each_safe(item, n, &l->list) {
 		enum lru_status ret;
@@ -272,8 +273,8 @@  list_lru_walk_one(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
 	unsigned long ret;
 
 	spin_lock(&nlru->lock);
-	ret = __list_lru_walk_one(nlru, memcg_cache_id(memcg), isolate, cb_arg,
-				  nr_to_walk);
+	ret = __list_lru_walk_one(lru, nid, memcg_cache_id(memcg), isolate,
+				  cb_arg, nr_to_walk);
 	spin_unlock(&nlru->lock);
 	return ret;
 }
@@ -288,8 +289,8 @@  list_lru_walk_one_irq(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
 	unsigned long ret;
 
 	spin_lock_irq(&nlru->lock);
-	ret = __list_lru_walk_one(nlru, memcg_cache_id(memcg), isolate, cb_arg,
-				  nr_to_walk);
+	ret = __list_lru_walk_one(lru, nid, memcg_cache_id(memcg), isolate,
+				  cb_arg, nr_to_walk);
 	spin_unlock_irq(&nlru->lock);
 	return ret;
 }
@@ -308,7 +309,7 @@  unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
 			struct list_lru_node *nlru = &lru->node[nid];
 
 			spin_lock(&nlru->lock);
-			isolated += __list_lru_walk_one(nlru, memcg_idx,
+			isolated += __list_lru_walk_one(lru, nid, memcg_idx,
 							isolate, cb_arg,
 							nr_to_walk);
 			spin_unlock(&nlru->lock);
@@ -328,166 +329,111 @@  static void init_one_lru(struct list_lru_one *l)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-static void __memcg_destroy_list_lru_node(struct list_lru_memcg *memcg_lrus,
-					  int begin, int end)
+static void memcg_destroy_list_lru_range(struct list_lru_memcg *mlrus,
+					 int begin, int end)
 {
 	int i;
 
 	for (i = begin; i < end; i++)
-		kfree(memcg_lrus->lru[i]);
+		kfree(mlrus->mlru[i]);
 }
 
-static int __memcg_init_list_lru_node(struct list_lru_memcg *memcg_lrus,
-				      int begin, int end)
+static int memcg_init_list_lru_range(struct list_lru_memcg *mlrus,
+				     int begin, int end)
 {
 	int i;
 
 	for (i = begin; i < end; i++) {
-		struct list_lru_one *l;
+		int nid;
+		struct list_lru_per_memcg *mlru;
 
-		l = kmalloc(sizeof(struct list_lru_one), GFP_KERNEL);
-		if (!l)
+		mlru = kmalloc(struct_size(mlru, node, nr_node_ids), GFP_KERNEL);
+		if (!mlru)
 			goto fail;
 
-		init_one_lru(l);
-		memcg_lrus->lru[i] = l;
+		for_each_node(nid)
+			init_one_lru(&mlru->node[nid]);
+		mlrus->mlru[i] = mlru;
 	}
 	return 0;
 fail:
-	__memcg_destroy_list_lru_node(memcg_lrus, begin, i);
+	memcg_destroy_list_lru_range(mlrus, begin, i);
 	return -ENOMEM;
 }
 
-static int memcg_init_list_lru_node(struct list_lru_node *nlru)
+static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
 {
-	struct list_lru_memcg *memcg_lrus;
+	struct list_lru_memcg *mlrus;
 	int size = memcg_nr_cache_ids;
 
-	memcg_lrus = kvmalloc(struct_size(memcg_lrus, lru, size), GFP_KERNEL);
-	if (!memcg_lrus)
+	lru->memcg_aware = memcg_aware;
+	if (!memcg_aware)
+		return 0;
+
+	mlrus = kvmalloc(struct_size(mlrus, mlru, size), GFP_KERNEL);
+	if (!mlrus)
 		return -ENOMEM;
 
-	if (__memcg_init_list_lru_node(memcg_lrus, 0, size)) {
-		kvfree(memcg_lrus);
+	if (memcg_init_list_lru_range(mlrus, 0, size)) {
+		kvfree(mlrus);
 		return -ENOMEM;
 	}
-	RCU_INIT_POINTER(nlru->memcg_lrus, memcg_lrus);
+	RCU_INIT_POINTER(lru->mlrus, mlrus);
 
 	return 0;
 }
 
-static void memcg_destroy_list_lru_node(struct list_lru_node *nlru)
+static void memcg_destroy_list_lru(struct list_lru *lru)
 {
-	struct list_lru_memcg *memcg_lrus;
+	struct list_lru_memcg *mlrus;
+
+	if (!list_lru_memcg_aware(lru))
+		return;
+
 	/*
 	 * This is called when shrinker has already been unregistered,
 	 * and nobody can use it. So, there is no need to use kvfree_rcu().
 	 */
-	memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, true);
-	__memcg_destroy_list_lru_node(memcg_lrus, 0, memcg_nr_cache_ids);
-	kvfree(memcg_lrus);
+	mlrus = rcu_dereference_protected(lru->mlrus, true);
+	memcg_destroy_list_lru_range(mlrus, 0, memcg_nr_cache_ids);
+	kvfree(mlrus);
 }
 
-static int memcg_update_list_lru_node(struct list_lru_node *nlru,
-				      int old_size, int new_size)
+static int memcg_update_list_lru(struct list_lru *lru, int old_size, int new_size)
 {
 	struct list_lru_memcg *old, *new;
 
 	BUG_ON(old_size > new_size);
 
-	old = rcu_dereference_protected(nlru->memcg_lrus,
+	old = rcu_dereference_protected(lru->mlrus,
 					lockdep_is_held(&list_lrus_mutex));
-	new = kvmalloc(struct_size(new, lru, new_size), GFP_KERNEL);
+	new = kvmalloc(struct_size(new, mlru, new_size), GFP_KERNEL);
 	if (!new)
 		return -ENOMEM;
 
-	if (__memcg_init_list_lru_node(new, old_size, new_size)) {
+	if (memcg_init_list_lru_range(new, old_size, new_size)) {
 		kvfree(new);
 		return -ENOMEM;
 	}
 
-	memcpy(&new->lru, &old->lru, flex_array_size(new, lru, old_size));
-	rcu_assign_pointer(nlru->memcg_lrus, new);
+	memcpy(&new->mlru, &old->mlru, flex_array_size(new, mlru, old_size));
+	rcu_assign_pointer(lru->mlrus, new);
 	kvfree_rcu(old, rcu);
 	return 0;
 }
 
-static void memcg_cancel_update_list_lru_node(struct list_lru_node *nlru,
-					      int old_size, int new_size)
-{
-	struct list_lru_memcg *memcg_lrus;
-
-	memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus,
-					       lockdep_is_held(&list_lrus_mutex));
-	/* do not bother shrinking the array back to the old size, because we
-	 * cannot handle allocation failures here */
-	__memcg_destroy_list_lru_node(memcg_lrus, old_size, new_size);
-}
-
-static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
-{
-	int i;
-
-	lru->memcg_aware = memcg_aware;
-
-	if (!memcg_aware)
-		return 0;
-
-	for_each_node(i) {
-		if (memcg_init_list_lru_node(&lru->node[i]))
-			goto fail;
-	}
-	return 0;
-fail:
-	for (i = i - 1; i >= 0; i--) {
-		if (!lru->node[i].memcg_lrus)
-			continue;
-		memcg_destroy_list_lru_node(&lru->node[i]);
-	}
-	return -ENOMEM;
-}
-
-static void memcg_destroy_list_lru(struct list_lru *lru)
-{
-	int i;
-
-	if (!list_lru_memcg_aware(lru))
-		return;
-
-	for_each_node(i)
-		memcg_destroy_list_lru_node(&lru->node[i]);
-}
-
-static int memcg_update_list_lru(struct list_lru *lru,
-				 int old_size, int new_size)
-{
-	int i;
-
-	for_each_node(i) {
-		if (memcg_update_list_lru_node(&lru->node[i],
-					       old_size, new_size))
-			goto fail;
-	}
-	return 0;
-fail:
-	for (i = i - 1; i >= 0; i--) {
-		if (!lru->node[i].memcg_lrus)
-			continue;
-
-		memcg_cancel_update_list_lru_node(&lru->node[i],
-						  old_size, new_size);
-	}
-	return -ENOMEM;
-}
-
 static void memcg_cancel_update_list_lru(struct list_lru *lru,
 					 int old_size, int new_size)
 {
-	int i;
+	struct list_lru_memcg *mlrus;
 
-	for_each_node(i)
-		memcg_cancel_update_list_lru_node(&lru->node[i],
-						  old_size, new_size);
+	mlrus = rcu_dereference_protected(lru->mlrus,
+					  lockdep_is_held(&list_lrus_mutex));
+	/*
+	 * Do not bother shrinking the array back to the old size, because we
+	 * cannot handle allocation failures here.
+	 */
+	memcg_destroy_list_lru_range(mlrus, old_size, new_size);
 }
 
 int memcg_update_all_list_lrus(int new_size)
@@ -524,8 +470,8 @@  static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
 	 */
 	spin_lock_irq(&nlru->lock);
 
-	src = list_lru_from_memcg_idx(nlru, src_idx);
-	dst = list_lru_from_memcg_idx(nlru, dst_idx);
+	src = list_lru_from_memcg_idx(lru, nid, src_idx);
+	dst = list_lru_from_memcg_idx(lru, nid, dst_idx);
 
 	list_splice_init(&src->list, &dst->list);