| Message ID | 20190103015638.205424-1-shakeelb@google.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | memcg: schedule high reclaim for remote memcgs on high_work |
On Wed 02-01-19 17:56:38, Shakeel Butt wrote:
> If a memcg is over high limit, memory reclaim is scheduled to run on
> return-to-userland. However it is assumed that the memcg is the current
> process's memcg. With remote memcg charging for kmem or swapping in a
> page charged to remote memcg, current process can trigger reclaim on
> remote memcg. So, scheduling reclaim on return-to-userland for remote
> memcgs will ignore the high reclaim altogether. So, punt the high
> reclaim of remote memcgs to high_work.

Have you seen this happening in real life workloads? And is this
offloading what we really want to do? I mean it is clearly the current
task that has triggered the remote charge so why should we offload that
work to the system? Is there any reason we cannot reclaim on the remote
memcg from the return-to-userland path?

> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  mm/memcontrol.c | 20 ++++++++++++--------
>  1 file changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e9db1160ccbc..47439c84667a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2302,19 +2302,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * reclaim on returning to userland. We can perform reclaim here
>  	 * if __GFP_RECLAIM but let's always punt for simplicity and so that
>  	 * GFP_KERNEL can consistently be used during reclaim. @memcg is
> -	 * not recorded as it most likely matches current's and won't
> -	 * change in the meantime. As high limit is checked again before
> -	 * reclaim, the cost of mismatch is negligible.
> +	 * not recorded as the return-to-userland high reclaim will only reclaim
> +	 * from current's memcg (or its ancestor). For other memcgs we punt them
> +	 * to work queue.
>  	 */
>  	do {
>  		if (page_counter_read(&memcg->memory) > memcg->high) {
> -			/* Don't bother a random interrupted task */
> -			if (in_interrupt()) {
> +			/*
> +			 * Don't bother a random interrupted task or if the
> +			 * memcg is not current's memcg's ancestor.
> +			 */
> +			if (in_interrupt() ||
> +			    !mm_match_cgroup(current->mm, memcg)) {
>  				schedule_work(&memcg->high_work);
> -				break;
> +			} else {
> +				current->memcg_nr_pages_over_high += batch;
> +				set_notify_resume(current);
>  			}
> -			current->memcg_nr_pages_over_high += batch;
> -			set_notify_resume(current);
>  			break;
>  		}
>  	} while ((memcg = parent_mem_cgroup(memcg)));
> --
> 2.20.1.415.g653613c723-goog
On Tue, Jan 8, 2019 at 6:59 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 02-01-19 17:56:38, Shakeel Butt wrote:
> > If a memcg is over high limit, memory reclaim is scheduled to run on
> > return-to-userland. However it is assumed that the memcg is the current
> > process's memcg. With remote memcg charging for kmem or swapping in a
> > page charged to remote memcg, current process can trigger reclaim on
> > remote memcg. So, scheduling reclaim on return-to-userland for remote
> > memcgs will ignore the high reclaim altogether. So, punt the high
> > reclaim of remote memcgs to high_work.
>
> Have you seen this happening in real life workloads?

No, just during code review.

> And is this offloading what we really want to do?

That's the question I am brainstorming nowadays. More generally, how
memcg-oom-kill should work in the remote memcg charging case.

> I mean it is clearly the current
> task that has triggered the remote charge so why should we offload that
> work to the system? Is there any reason we cannot reclaim on the remote
> memcg from the return-to-userland path?
>

The only reason I did this was that the code was much simpler, but I see
that the current is charging the given memcg and maybe even reclaiming,
so why not do the high reclaim as well. I will update the patch.

thanks,
Shakeel
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e9db1160ccbc..47439c84667a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2302,19 +2302,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * reclaim on returning to userland. We can perform reclaim here
 	 * if __GFP_RECLAIM but let's always punt for simplicity and so that
 	 * GFP_KERNEL can consistently be used during reclaim. @memcg is
-	 * not recorded as it most likely matches current's and won't
-	 * change in the meantime. As high limit is checked again before
-	 * reclaim, the cost of mismatch is negligible.
+	 * not recorded as the return-to-userland high reclaim will only reclaim
+	 * from current's memcg (or its ancestor). For other memcgs we punt them
+	 * to work queue.
 	 */
 	do {
 		if (page_counter_read(&memcg->memory) > memcg->high) {
-			/* Don't bother a random interrupted task */
-			if (in_interrupt()) {
+			/*
+			 * Don't bother a random interrupted task or if the
+			 * memcg is not current's memcg's ancestor.
+			 */
+			if (in_interrupt() ||
+			    !mm_match_cgroup(current->mm, memcg)) {
 				schedule_work(&memcg->high_work);
-				break;
+			} else {
+				current->memcg_nr_pages_over_high += batch;
+				set_notify_resume(current);
 			}
-			current->memcg_nr_pages_over_high += batch;
-			set_notify_resume(current);
 			break;
 		}
 	} while ((memcg = parent_mem_cgroup(memcg)));
If a memcg is over high limit, memory reclaim is scheduled to run on
return-to-userland. However it is assumed that the memcg is the current
process's memcg. With remote memcg charging for kmem or swapping in a
page charged to remote memcg, current process can trigger reclaim on
remote memcg. So, scheduling reclaim on return-to-userland for remote
memcgs will ignore the high reclaim altogether. So, punt the high
reclaim of remote memcgs to high_work.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 mm/memcontrol.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)