
[2/2] mm/vmscan: calculate reclaimed slab caches in all reclaim paths

Message ID 1561112086-6169-3-git-send-email-laoar.shao@gmail.com (mailing list archive)
State New, archived
Series mm/vmscan: calculate reclaimed slab in all reclaim paths

Commit Message

Yafang Shao June 21, 2019, 10:14 a.m. UTC
There are currently six reclaim paths:
- kswapd reclaim path
- node reclaim path
- hibernate preallocate memory reclaim path
- direct reclaim path
- memcg reclaim path
- memcg softlimit reclaim path

Reclaimed slab caches are only accounted for in the first three of these
paths.

Not accounting for the reclaimed slab caches has some drawbacks:
- sc->nr_reclaimed is incorrect if any slab caches are reclaimed in the
  path.
- The slab caches may be reclaimed almost completely if there are lots of
  reclaimable slab caches and few page caches.
  Take a simple example. Suppose a memcg with a 512M limit is full of
  slab caches, in other words it holds approximately 512M of slab caches.
  When the memcg limit is reached, memcg reclaim begins, and this path
  keeps reclaiming slab caches until sc->priority drops to 0.
  When reclaim stops, few slab caches are left: less than 20M in my
  test case.
  With this patch applied, more than 300M is left and sc->priority only
  drops to 3.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 mm/vmscan.c | 7 +++++++
 1 file changed, 7 insertions(+)
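
The effect described in the commit message can be illustrated with a toy
userspace model. This is only a sketch under stated assumptions, not kernel
code: toy_memcg_reclaim(), struct toy_result and the TOY_* constants are
hypothetical names, the memcg is assumed to hold slab pages only, and each
pass is assumed to free (slab pages >> priority) pages. It does not
reproduce the 20M/300M figures measured above; it only shows why the
priority keeps dropping when freed slab pages never reach nr_reclaimed.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed toy constants, chosen to resemble a reclaim loop's start
 * priority and per-call reclaim target. */
#define TOY_DEF_PRIORITY   12
#define TOY_NR_TO_RECLAIM  32UL

struct toy_result {
	int final_priority;          /* lowest priority the loop reached */
	unsigned long slab_left;     /* slab pages remaining afterwards */
	unsigned long nr_reclaimed;  /* what the loop believes it freed */
};

/*
 * Toy model: every pass frees (slab_left >> priority) slab pages.
 * If count_slab is false, the freed pages are never added to
 * nr_reclaimed, so the loop cannot see its own progress.
 */
struct toy_result toy_memcg_reclaim(unsigned long slab_pages, bool count_slab)
{
	struct toy_result r = { TOY_DEF_PRIORITY, slab_pages, 0 };
	int priority;

	for (priority = TOY_DEF_PRIORITY; priority >= 0; priority--) {
		unsigned long freed = r.slab_left >> priority;

		assert(r.slab_left >= freed);
		r.slab_left -= freed;
		r.final_priority = priority;

		if (count_slab)
			r.nr_reclaimed += freed;

		/* Stop as soon as the target appears to be met. */
		if (r.nr_reclaimed >= TOY_NR_TO_RECLAIM)
			break;
	}
	return r;
}
```

Calling toy_memcg_reclaim(131072, false) models a 512M memcg (4K pages)
whose slab reclaim is not counted: the loop runs all the way down to
priority 0. With count_slab set to true the loop stops on the first pass.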

Comments

Andrew Morton June 22, 2019, 3:30 a.m. UTC | #1
On Fri, 21 Jun 2019 18:14:46 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:

> There're six different reclaim paths by now,
> - kswapd reclaim path
> - node reclaim path
> - hibernate preallocate memory reclaim path
> - direct reclaim path
> - memcg reclaim path
> - memcg softlimit reclaim path
> 
> The slab caches reclaimed in these paths are only calculated in the above
> three paths.
> 
> There're some drawbacks if we don't calculate the reclaimed slab caches.
> - The sc->nr_reclaimed isn't correct if there're some slab caches
>   reclaimed in this path.
> - The slab caches may be reclaimed thoroughly if there're lots of
>   reclaimable slab caches and few page caches.
>   Let's take an easy example for this case.
>   If one memcg is full of slab caches and the limit of it is 512M, in
>   other words there're approximately 512M slab caches in this memcg.
>   Then the limit of the memcg is reached and the memcg reclaim begins,
>   and then in this memcg reclaim path it will continuously reclaim the
>   slab caches until the sc->priority drops to 0.
>   After this reclaim stops, you will find there're few slab caches left,
>   which is less than 20M in my test case.
>   While after this patch applied the number is greater than 300M and
>   the sc->priority only drops to 3.

I got a bit exhausted checking that none of these six callsites can
scribble on some caller's value of current->reclaim_state.

How about we do it at runtime?

From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm/vmscan.c: add checks for incorrect handling of current->reclaim_state

Six sites are presently altering current->reclaim_state.  There is a risk
that one function stomps on a caller's value.  Use a helper function to
catch such errors.

Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |   37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

--- a/mm/vmscan.c~mm-vmscanc-add-checks-for-incorrect-handling-of-current-reclaim_state
+++ a/mm/vmscan.c
@@ -177,6 +177,18 @@ unsigned long vm_total_pages;
 static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
 
+static void set_task_reclaim_state(struct task_struct *task,
+				   struct reclaim_state *rs)
+{
+	/* Check for an overwrite */
+	WARN_ON_ONCE(rs && task->reclaim_state);
+
+	/* Check for the nulling of an already-nulled member */
+	WARN_ON_ONCE(!rs && !task->reclaim_state);
+
+	task->reclaim_state = rs;
+}
+
 #ifdef CONFIG_MEMCG_KMEM
 
 /*
@@ -3194,13 +3206,13 @@ unsigned long try_to_free_pages(struct z
 	if (throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask))
 		return 1;
 
-	current->reclaim_state = &sc.reclaim_state;
+	set_task_reclaim_state(current, &sc.reclaim_state);
 	trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);
 
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
 	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
-	current->reclaim_state = NULL;
+	set_task_reclaim_state(current, NULL);
 
 	return nr_reclaimed;
 }
@@ -3223,7 +3235,7 @@ unsigned long mem_cgroup_shrink_node(str
 	};
 	unsigned long lru_pages;
 
-	current->reclaim_state = &sc.reclaim_state;
+	set_task_reclaim_state(current, &sc.reclaim_state);
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
 
@@ -3245,7 +3257,7 @@ unsigned long mem_cgroup_shrink_node(str
 					cgroup_ino(memcg->css.cgroup),
 					sc.nr_reclaimed);
 
-	current->reclaim_state = NULL;
+	set_task_reclaim_state(current, NULL);
 	*nr_scanned = sc.nr_scanned;
 
 	return sc.nr_reclaimed;
@@ -3274,7 +3286,7 @@ unsigned long try_to_free_mem_cgroup_pag
 		.may_shrinkslab = 1,
 	};
 
-	current->reclaim_state = &sc.reclaim_state;
+	set_task_reclaim_state(current, &sc.reclaim_state);
 	/*
 	 * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
 	 * take care of from where we get pages. So the node where we start the
@@ -3299,7 +3311,7 @@ unsigned long try_to_free_mem_cgroup_pag
 	trace_mm_vmscan_memcg_reclaim_end(
 				cgroup_ino(memcg->css.cgroup),
 				nr_reclaimed);
-	current->reclaim_state = NULL;
+	set_task_reclaim_state(current, NULL);
 
 	return nr_reclaimed;
 }
@@ -3501,7 +3513,7 @@ static int balance_pgdat(pg_data_t *pgda
 		.may_unmap = 1,
 	};
 
-	current->reclaim_state = &sc.reclaim_state;
+	set_task_reclaim_state(current, &sc.reclaim_state);
 	psi_memstall_enter(&pflags);
 	__fs_reclaim_acquire();
 
@@ -3683,7 +3695,7 @@ out:
 	snapshot_refaults(NULL, pgdat);
 	__fs_reclaim_release();
 	psi_memstall_leave(&pflags);
-	current->reclaim_state = NULL;
+	set_task_reclaim_state(current, NULL);
 
 	/*
 	 * Return the order kswapd stopped reclaiming at as
@@ -3945,17 +3957,16 @@ unsigned long shrink_all_memory(unsigned
 		.hibernation_mode = 1,
 	};
 	struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
-	struct task_struct *p = current;
 	unsigned long nr_reclaimed;
 	unsigned int noreclaim_flag;
 
 	fs_reclaim_acquire(sc.gfp_mask);
 	noreclaim_flag = memalloc_noreclaim_save();
-	p->reclaim_state = &sc.reclaim_state;
+	set_task_reclaim_state(current, &sc.reclaim_state);
 
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
-	p->reclaim_state = NULL;
+	set_task_reclaim_state(current, NULL);
 	memalloc_noreclaim_restore(noreclaim_flag);
 	fs_reclaim_release(sc.gfp_mask);
 
@@ -4144,7 +4155,7 @@ static int __node_reclaim(struct pglist_
 	 */
 	noreclaim_flag = memalloc_noreclaim_save();
 	p->flags |= PF_SWAPWRITE;
-	p->reclaim_state = &sc.reclaim_state;
+	set_task_reclaim_state(p, &sc.reclaim_state);
 
 	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
 		/*
@@ -4156,7 +4167,7 @@ static int __node_reclaim(struct pglist_
 		} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
 	}
 
-	p->reclaim_state = NULL;
+	set_task_reclaim_state(p, NULL);
 	current->flags &= ~PF_SWAPWRITE;
 	memalloc_noreclaim_restore(noreclaim_flag);
 	fs_reclaim_release(sc.gfp_mask);
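
What the two checks in set_task_reclaim_state() are meant to catch can be
shown with a small userspace model. This is a sketch under assumptions, not
the kernel implementation: struct toy_task, toy_warn_on() and the other
toy_* names are hypothetical, and unlike WARN_ON_ONCE() the toy counter
increments on every hit rather than once per call site.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct toy_reclaim_state { unsigned long reclaimed_slab; };
struct toy_task { struct toy_reclaim_state *reclaim_state; };

static int toy_warnings;  /* number of checks that fired */

/* Stand-in for a warning macro: count and report, never abort. */
static void toy_warn_on(int cond, const char *what)
{
	if (cond) {
		toy_warnings++;
		fprintf(stderr, "toy warning: %s\n", what);
	}
}

/* Mirrors the two checks of the helper in the patch above. */
void toy_set_task_reclaim_state(struct toy_task *task,
				struct toy_reclaim_state *rs)
{
	assert(task != NULL);

	/* Setting a state while one is already installed: overwrite. */
	toy_warn_on(rs && task->reclaim_state, "overwrite");

	/* Clearing a state that is already NULL. */
	toy_warn_on(!rs && !task->reclaim_state, "already NULL");

	task->reclaim_state = rs;
}

/* Properly paired set/clear: expected to trigger no warnings. */
int toy_paired_use(void)
{
	struct toy_task t = { NULL };
	struct toy_reclaim_state rs = { 0 };

	toy_warnings = 0;
	toy_set_task_reclaim_state(&t, &rs);
	toy_set_task_reclaim_state(&t, NULL);
	return toy_warnings;
}

/* An inner "reclaim path" stomping on its caller's state. */
int toy_nested_misuse(void)
{
	struct toy_task t = { NULL };
	struct toy_reclaim_state outer = { 0 }, inner = { 0 };

	toy_warnings = 0;
	toy_set_task_reclaim_state(&t, &outer);  /* caller enters */
	toy_set_task_reclaim_state(&t, &inner);  /* callee overwrites */
	toy_set_task_reclaim_state(&t, NULL);    /* callee leaves */
	toy_set_task_reclaim_state(&t, NULL);    /* caller leaves: already NULL */
	return toy_warnings;
}
```

In this model the nested case trips both checks: once when the callee
overwrites the caller's pointer, and once when the caller clears a pointer
the callee has already cleared.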
Yafang Shao June 22, 2019, 6:31 a.m. UTC | #2
On Sat, Jun 22, 2019 at 11:30 AM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Fri, 21 Jun 2019 18:14:46 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > There're six different reclaim paths by now,
> > - kswapd reclaim path
> > - node reclaim path
> > - hibernate preallocate memory reclaim path
> > - direct reclaim path
> > - memcg reclaim path
> > - memcg softlimit reclaim path
> >
> > The slab caches reclaimed in these paths are only calculated in the above
> > three paths.
> >
> > There're some drawbacks if we don't calculate the reclaimed slab caches.
> > - The sc->nr_reclaimed isn't correct if there're some slab caches
> >   reclaimed in this path.
> > - The slab caches may be reclaimed thoroughly if there're lots of
> >   reclaimable slab caches and few page caches.
> >   Let's take an easy example for this case.
> >   If one memcg is full of slab caches and the limit of it is 512M, in
> >   other words there're approximately 512M slab caches in this memcg.
> >   Then the limit of the memcg is reached and the memcg reclaim begins,
> >   and then in this memcg reclaim path it will continuously reclaim the
> >   slab caches until the sc->priority drops to 0.
> >   After this reclaim stops, you will find there're few slab caches left,
> >   which is less than 20M in my test case.
> >   While after this patch applied the number is greater than 300M and
> >   the sc->priority only drops to 3.
>
> I got a bit exhausted checking that none of these six callsites can
> scribble on some caller's value of current->reclaim_state.
>
> How about we do it at runtime?
>

That's good.
Thanks for your improvement.

> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: mm/vmscan.c: add checks for incorrect handling of current->reclaim_state
>
> Six sites are presently altering current->reclaim_state.  There is a risk
> that one function stomps on a caller's value.  Use a helper function to
> catch such errors.
>
> Cc: Yafang Shao <laoar.shao@gmail.com>
> Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
>  mm/vmscan.c |   37 ++++++++++++++++++++++++-------------
>  1 file changed, 24 insertions(+), 13 deletions(-)
>
> --- a/mm/vmscan.c~mm-vmscanc-add-checks-for-incorrect-handling-of-current-reclaim_state
> +++ a/mm/vmscan.c
> @@ -177,6 +177,18 @@ unsigned long vm_total_pages;
>  static LIST_HEAD(shrinker_list);
>  static DECLARE_RWSEM(shrinker_rwsem);
>
> +static void set_task_reclaim_state(struct task_struct *task,
> +                                  struct reclaim_state *rs)
> +{
> +       /* Check for an overwrite */
> +       WARN_ON_ONCE(rs && task->reclaim_state);
> +
> +       /* Check for the nulling of an already-nulled member */
> +       WARN_ON_ONCE(!rs && !task->reclaim_state);
> +
> +       task->reclaim_state = rs;
> +}
> +
>  #ifdef CONFIG_MEMCG_KMEM
>
>  /*
> @@ -3194,13 +3206,13 @@ unsigned long try_to_free_pages(struct z
>         if (throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask))
>                 return 1;
>
> -       current->reclaim_state = &sc.reclaim_state;
> +       set_task_reclaim_state(current, &sc.reclaim_state);
>         trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);
>
>         nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
>
>         trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
> -       current->reclaim_state = NULL;
> +       set_task_reclaim_state(current, NULL);
>
>         return nr_reclaimed;
>  }
> @@ -3223,7 +3235,7 @@ unsigned long mem_cgroup_shrink_node(str
>         };
>         unsigned long lru_pages;
>
> -       current->reclaim_state = &sc.reclaim_state;
> +       set_task_reclaim_state(current, &sc.reclaim_state);
>         sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
>                         (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
>
> @@ -3245,7 +3257,7 @@ unsigned long mem_cgroup_shrink_node(str
>                                         cgroup_ino(memcg->css.cgroup),
>                                         sc.nr_reclaimed);
>
> -       current->reclaim_state = NULL;
> +       set_task_reclaim_state(current, NULL);
>         *nr_scanned = sc.nr_scanned;
>
>         return sc.nr_reclaimed;
> @@ -3274,7 +3286,7 @@ unsigned long try_to_free_mem_cgroup_pag
>                 .may_shrinkslab = 1,
>         };
>
> -       current->reclaim_state = &sc.reclaim_state;
> +       set_task_reclaim_state(current, &sc.reclaim_state);
>         /*
>          * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
>          * take care of from where we get pages. So the node where we start the
> @@ -3299,7 +3311,7 @@ unsigned long try_to_free_mem_cgroup_pag
>         trace_mm_vmscan_memcg_reclaim_end(
>                                 cgroup_ino(memcg->css.cgroup),
>                                 nr_reclaimed);
> -       current->reclaim_state = NULL;
> +       set_task_reclaim_state(current, NULL);
>
>         return nr_reclaimed;
>  }
> @@ -3501,7 +3513,7 @@ static int balance_pgdat(pg_data_t *pgda
>                 .may_unmap = 1,
>         };
>
> -       current->reclaim_state = &sc.reclaim_state;
> +       set_task_reclaim_state(current, &sc.reclaim_state);
>         psi_memstall_enter(&pflags);
>         __fs_reclaim_acquire();
>
> @@ -3683,7 +3695,7 @@ out:
>         snapshot_refaults(NULL, pgdat);
>         __fs_reclaim_release();
>         psi_memstall_leave(&pflags);
> -       current->reclaim_state = NULL;
> +       set_task_reclaim_state(current, NULL);
>
>         /*
>          * Return the order kswapd stopped reclaiming at as
> @@ -3945,17 +3957,16 @@ unsigned long shrink_all_memory(unsigned
>                 .hibernation_mode = 1,
>         };
>         struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
> -       struct task_struct *p = current;
>         unsigned long nr_reclaimed;
>         unsigned int noreclaim_flag;
>
>         fs_reclaim_acquire(sc.gfp_mask);
>         noreclaim_flag = memalloc_noreclaim_save();
> -       p->reclaim_state = &sc.reclaim_state;
> +       set_task_reclaim_state(current, &sc.reclaim_state);
>
>         nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
>
> -       p->reclaim_state = NULL;
> +       set_task_reclaim_state(current, NULL);
>         memalloc_noreclaim_restore(noreclaim_flag);
>         fs_reclaim_release(sc.gfp_mask);
>
> @@ -4144,7 +4155,7 @@ static int __node_reclaim(struct pglist_
>          */
>         noreclaim_flag = memalloc_noreclaim_save();
>         p->flags |= PF_SWAPWRITE;
> -       p->reclaim_state = &sc.reclaim_state;
> +       set_task_reclaim_state(p, &sc.reclaim_state);
>
>         if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
>                 /*
> @@ -4156,7 +4167,7 @@ static int __node_reclaim(struct pglist_
>                 } while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
>         }
>
> -       p->reclaim_state = NULL;
> +       set_task_reclaim_state(p, NULL);
>         current->flags &= ~PF_SWAPWRITE;
>         memalloc_noreclaim_restore(noreclaim_flag);
>         fs_reclaim_release(sc.gfp_mask);
> _
>
Kirill Tkhai June 24, 2019, 8:53 a.m. UTC | #3
On 21.06.2019 13:14, Yafang Shao wrote:
> There're six different reclaim paths by now,
> - kswapd reclaim path
> - node reclaim path
> - hibernate preallocate memory reclaim path
> - direct reclaim path
> - memcg reclaim path
> - memcg softlimit reclaim path
> 
> The slab caches reclaimed in these paths are only calculated in the above
> three paths.
> 
> There're some drawbacks if we don't calculate the reclaimed slab caches.
> - The sc->nr_reclaimed isn't correct if there're some slab caches
>   reclaimed in this path.
> - The slab caches may be reclaimed thoroughly if there're lots of
>   reclaimable slab caches and few page caches.
>   Let's take an easy example for this case.
>   If one memcg is full of slab caches and the limit of it is 512M, in
>   other words there're approximately 512M slab caches in this memcg.
>   Then the limit of the memcg is reached and the memcg reclaim begins,
>   and then in this memcg reclaim path it will continuously reclaim the
>   slab caches until the sc->priority drops to 0.
>   After this reclaim stops, you will find there're few slab caches left,
>   which is less than 20M in my test case.
>   While after this patch applied the number is greater than 300M and
>   the sc->priority only drops to 3.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  mm/vmscan.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 18a66e5..d6c3fc8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3164,11 +3164,13 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  	if (throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask))
>  		return 1;
>  
> +	current->reclaim_state = &sc.reclaim_state;
>  	trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);
>  
>  	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
>  
>  	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
> +	current->reclaim_state = NULL;

Shouldn't we remove reclaim_state assignment from __perform_reclaim() after this?
  
>  	return nr_reclaimed;
>  }
> @@ -3191,6 +3193,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
>  	};
>  	unsigned long lru_pages;
>  
> +	current->reclaim_state = &sc.reclaim_state;
>  	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
>  			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
>  
> @@ -3212,7 +3215,9 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
>  					cgroup_ino(memcg->css.cgroup),
>  					sc.nr_reclaimed);
>  
> +	current->reclaim_state = NULL;
>  	*nr_scanned = sc.nr_scanned;
> +
>  	return sc.nr_reclaimed;
>  }
>  
> @@ -3239,6 +3244,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  		.may_shrinkslab = 1,
>  	};
>  
> +	current->reclaim_state = &sc.reclaim_state;
>  	/*
>  	 * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
>  	 * take care of from where we get pages. So the node where we start the
> @@ -3263,6 +3269,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  	trace_mm_vmscan_memcg_reclaim_end(
>  				cgroup_ino(memcg->css.cgroup),
>  				nr_reclaimed);
> +	current->reclaim_state = NULL;
>  
>  	return nr_reclaimed;
>  }
>
Yafang Shao June 24, 2019, 12:30 p.m. UTC | #4
On Mon, Jun 24, 2019 at 4:53 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 21.06.2019 13:14, Yafang Shao wrote:
> > There're six different reclaim paths by now,
> > - kswapd reclaim path
> > - node reclaim path
> > - hibernate preallocate memory reclaim path
> > - direct reclaim path
> > - memcg reclaim path
> > - memcg softlimit reclaim path
> >
> > The slab caches reclaimed in these paths are only calculated in the above
> > three paths.
> >
> > There're some drawbacks if we don't calculate the reclaimed slab caches.
> > - The sc->nr_reclaimed isn't correct if there're some slab caches
> >   reclaimed in this path.
> > - The slab caches may be reclaimed thoroughly if there're lots of
> >   reclaimable slab caches and few page caches.
> >   Let's take an easy example for this case.
> >   If one memcg is full of slab caches and the limit of it is 512M, in
> >   other words there're approximately 512M slab caches in this memcg.
> >   Then the limit of the memcg is reached and the memcg reclaim begins,
> >   and then in this memcg reclaim path it will continuously reclaim the
> >   slab caches until the sc->priority drops to 0.
> >   After this reclaim stops, you will find there're few slab caches left,
> >   which is less than 20M in my test case.
> >   While after this patch applied the number is greater than 300M and
> >   the sc->priority only drops to 3.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  mm/vmscan.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 18a66e5..d6c3fc8 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -3164,11 +3164,13 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >       if (throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask))
> >               return 1;
> >
> > +     current->reclaim_state = &sc.reclaim_state;
> >       trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);
> >
> >       nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
> >
> >       trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
> > +     current->reclaim_state = NULL;
>
> Shouldn't we remove reclaim_state assignment from __perform_reclaim() after this?
>

Oh yes. We should remove it. Thanks for pointing that out.
I will post a fix soon.

Thanks
Yafang

> >       return nr_reclaimed;
> >  }
> > @@ -3191,6 +3193,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
> >       };
> >       unsigned long lru_pages;
> >
> > +     current->reclaim_state = &sc.reclaim_state;
> >       sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> >                       (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
> >
> > @@ -3212,7 +3215,9 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
> >                                       cgroup_ino(memcg->css.cgroup),
> >                                       sc.nr_reclaimed);
> >
> > +     current->reclaim_state = NULL;
> >       *nr_scanned = sc.nr_scanned;
> > +
> >       return sc.nr_reclaimed;
> >  }
> >
> > @@ -3239,6 +3244,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> >               .may_shrinkslab = 1,
> >       };
> >
> > +     current->reclaim_state = &sc.reclaim_state;
> >       /*
> >        * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
> >        * take care of from where we get pages. So the node where we start the
> > @@ -3263,6 +3269,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> >       trace_mm_vmscan_memcg_reclaim_end(
> >                               cgroup_ino(memcg->css.cgroup),
> >                               nr_reclaimed);
> > +     current->reclaim_state = NULL;
> >
> >       return nr_reclaimed;
> >  }
> >
>
Kirill Tkhai June 24, 2019, 12:33 p.m. UTC | #5
On 24.06.2019 15:30, Yafang Shao wrote:
> On Mon, Jun 24, 2019 at 4:53 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> On 21.06.2019 13:14, Yafang Shao wrote:
>>> There're six different reclaim paths by now,
>>> - kswapd reclaim path
>>> - node reclaim path
>>> - hibernate preallocate memory reclaim path
>>> - direct reclaim path
>>> - memcg reclaim path
>>> - memcg softlimit reclaim path
>>>
>>> The slab caches reclaimed in these paths are only calculated in the above
>>> three paths.
>>>
>>> There're some drawbacks if we don't calculate the reclaimed slab caches.
>>> - The sc->nr_reclaimed isn't correct if there're some slab caches
>>>   reclaimed in this path.
>>> - The slab caches may be reclaimed thoroughly if there're lots of
>>>   reclaimable slab caches and few page caches.
>>>   Let's take an easy example for this case.
>>>   If one memcg is full of slab caches and the limit of it is 512M, in
>>>   other words there're approximately 512M slab caches in this memcg.
>>>   Then the limit of the memcg is reached and the memcg reclaim begins,
>>>   and then in this memcg reclaim path it will continuously reclaim the
>>>   slab caches until the sc->priority drops to 0.
>>>   After this reclaim stops, you will find there're few slab caches left,
>>>   which is less than 20M in my test case.
>>>   While after this patch applied the number is greater than 300M and
>>>   the sc->priority only drops to 3.
>>>
>>> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>>> ---
>>>  mm/vmscan.c | 7 +++++++
>>>  1 file changed, 7 insertions(+)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 18a66e5..d6c3fc8 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -3164,11 +3164,13 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>>>       if (throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask))
>>>               return 1;
>>>
>>> +     current->reclaim_state = &sc.reclaim_state;
>>>       trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);
>>>
>>>       nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
>>>
>>>       trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
>>> +     current->reclaim_state = NULL;
>>
>> Shouldn't we remove reclaim_state assignment from __perform_reclaim() after this?
>>
> 
> Oh yes. We should remove it. Thanks for pointing out.
> I will post a fix soon.

With the change above, feel free to add my Reviewed-by: to all of the series.

Kirill
Yafang Shao June 24, 2019, 12:40 p.m. UTC | #6
On Mon, Jun 24, 2019 at 8:33 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 24.06.2019 15:30, Yafang Shao wrote:
> > On Mon, Jun 24, 2019 at 4:53 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>
> >> On 21.06.2019 13:14, Yafang Shao wrote:
> >>> There're six different reclaim paths by now,
> >>> - kswapd reclaim path
> >>> - node reclaim path
> >>> - hibernate preallocate memory reclaim path
> >>> - direct reclaim path
> >>> - memcg reclaim path
> >>> - memcg softlimit reclaim path
> >>>
> >>> The slab caches reclaimed in these paths are only calculated in the above
> >>> three paths.
> >>>
> >>> There're some drawbacks if we don't calculate the reclaimed slab caches.
> >>> - The sc->nr_reclaimed isn't correct if there're some slab caches
> >>>   reclaimed in this path.
> >>> - The slab caches may be reclaimed thoroughly if there're lots of
> >>>   reclaimable slab caches and few page caches.
> >>>   Let's take an easy example for this case.
> >>>   If one memcg is full of slab caches and the limit of it is 512M, in
> >>>   other words there're approximately 512M slab caches in this memcg.
> >>>   Then the limit of the memcg is reached and the memcg reclaim begins,
> >>>   and then in this memcg reclaim path it will continuously reclaim the
> >>>   slab caches until the sc->priority drops to 0.
> >>>   After this reclaim stops, you will find there're few slab caches left,
> >>>   which is less than 20M in my test case.
> >>>   While after this patch applied the number is greater than 300M and
> >>>   the sc->priority only drops to 3.
> >>>
> >>> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> >>> ---
> >>>  mm/vmscan.c | 7 +++++++
> >>>  1 file changed, 7 insertions(+)
> >>>
> >>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>> index 18a66e5..d6c3fc8 100644
> >>> --- a/mm/vmscan.c
> >>> +++ b/mm/vmscan.c
> >>> @@ -3164,11 +3164,13 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >>>       if (throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask))
> >>>               return 1;
> >>>
> >>> +     current->reclaim_state = &sc.reclaim_state;
> >>>       trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);
> >>>
> >>>       nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
> >>>
> >>>       trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
> >>> +     current->reclaim_state = NULL;
> >>
> >> Shouldn't we remove reclaim_state assignment from __perform_reclaim() after this?
> >>
> >
> > Oh yes. We should remove it. Thanks for pointing out.
> > I will post a fix soon.
>
> With the change above, feel free to add my Reviewed-by: to all of the series.
>

Sure, thanks for your review.

Thanks
Yafang
Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 18a66e5..d6c3fc8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3164,11 +3164,13 @@  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	if (throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask))
 		return 1;
 
+	current->reclaim_state = &sc.reclaim_state;
 	trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);
 
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
 	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
+	current->reclaim_state = NULL;
 
 	return nr_reclaimed;
 }
@@ -3191,6 +3193,7 @@  unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 	};
 	unsigned long lru_pages;
 
+	current->reclaim_state = &sc.reclaim_state;
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
 
@@ -3212,7 +3215,9 @@  unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 					cgroup_ino(memcg->css.cgroup),
 					sc.nr_reclaimed);
 
+	current->reclaim_state = NULL;
 	*nr_scanned = sc.nr_scanned;
+
 	return sc.nr_reclaimed;
 }
 
@@ -3239,6 +3244,7 @@  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_shrinkslab = 1,
 	};
 
+	current->reclaim_state = &sc.reclaim_state;
 	/*
 	 * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
 	 * take care of from where we get pages. So the node where we start the
@@ -3263,6 +3269,7 @@  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	trace_mm_vmscan_memcg_reclaim_end(
 				cgroup_ino(memcg->css.cgroup),
 				nr_reclaimed);
+	current->reclaim_state = NULL;
 
 	return nr_reclaimed;
 }