| Message ID | 20190103015638.205424-1-shakeelb@google.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | memcg: schedule high reclaim for remote memcgs on high_work |
On Wed 02-01-19 17:56:38, Shakeel Butt wrote:
> If a memcg is over high limit, memory reclaim is scheduled to run on
> return-to-userland. However it is assumed that the memcg is the current
> process's memcg. With remote memcg charging for kmem or swapping in a
> page charged to remote memcg, current process can trigger reclaim on
> remote memcg. So, scheduling reclaim on return-to-userland for remote
> memcgs will ignore the high reclaim altogether. So, punt the high
> reclaim of remote memcgs to high_work.

Have you seen this happening in real life workloads? And is this
offloading what we really want to do? I mean it is clearly the current
task that has triggered the remote charge so why should we offload that
work to the system? Is there any reason we cannot reclaim on the remote
memcg from the return-to-userland path?

> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  mm/memcontrol.c | 20 ++++++++++++--------
>  1 file changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e9db1160ccbc..47439c84667a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2302,19 +2302,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * reclaim on returning to userland. We can perform reclaim here
>  	 * if __GFP_RECLAIM but let's always punt for simplicity and so that
>  	 * GFP_KERNEL can consistently be used during reclaim. @memcg is
> -	 * not recorded as it most likely matches current's and won't
> -	 * change in the meantime. As high limit is checked again before
> -	 * reclaim, the cost of mismatch is negligible.
> +	 * not recorded as the return-to-userland high reclaim will only reclaim
> +	 * from current's memcg (or its ancestor). For other memcgs we punt them
> +	 * to work queue.
>  	 */
>  	do {
>  		if (page_counter_read(&memcg->memory) > memcg->high) {
> -			/* Don't bother a random interrupted task */
> -			if (in_interrupt()) {
> +			/*
> +			 * Don't bother a random interrupted task or if the
> +			 * memcg is not current's memcg's ancestor.
> +			 */
> +			if (in_interrupt() ||
> +			    !mm_match_cgroup(current->mm, memcg)) {
>  				schedule_work(&memcg->high_work);
> -				break;
> +			} else {
> +				current->memcg_nr_pages_over_high += batch;
> +				set_notify_resume(current);
>  			}
> -			current->memcg_nr_pages_over_high += batch;
> -			set_notify_resume(current);
>  			break;
>  		}
>  	} while ((memcg = parent_mem_cgroup(memcg)));
> --
> 2.20.1.415.g653613c723-goog
On Tue, Jan 8, 2019 at 6:59 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 02-01-19 17:56:38, Shakeel Butt wrote:
> > If a memcg is over high limit, memory reclaim is scheduled to run on
> > return-to-userland. However it is assumed that the memcg is the current
> > process's memcg. With remote memcg charging for kmem or swapping in a
> > page charged to remote memcg, current process can trigger reclaim on
> > remote memcg. So, scheduling reclaim on return-to-userland for remote
> > memcgs will ignore the high reclaim altogether. So, punt the high
> > reclaim of remote memcgs to high_work.
>
> Have you seen this happening in real life workloads?

No, just during code review.

> And is this offloading what we really want to do?

That's the question I am brainstorming nowadays. More generally, how
memcg-oom-kill should work in the remote memcg charging case.

> I mean it is clearly the current
> task that has triggered the remote charge so why should we offload that
> work to the system? Is there any reason we cannot reclaim on the remote
> memcg from the return-to-userland path?
>

The only reason I did this was that the code was much simpler, but I see
that the current is charging the given memcg and maybe even reclaiming,
so why not do the high reclaim as well. I will update the patch.

thanks,
Shakeel
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e9db1160ccbc..47439c84667a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2302,19 +2302,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * reclaim on returning to userland. We can perform reclaim here
 	 * if __GFP_RECLAIM but let's always punt for simplicity and so that
 	 * GFP_KERNEL can consistently be used during reclaim. @memcg is
-	 * not recorded as it most likely matches current's and won't
-	 * change in the meantime. As high limit is checked again before
-	 * reclaim, the cost of mismatch is negligible.
+	 * not recorded as the return-to-userland high reclaim will only reclaim
+	 * from current's memcg (or its ancestor). For other memcgs we punt them
+	 * to work queue.
 	 */
 	do {
 		if (page_counter_read(&memcg->memory) > memcg->high) {
-			/* Don't bother a random interrupted task */
-			if (in_interrupt()) {
+			/*
+			 * Don't bother a random interrupted task or if the
+			 * memcg is not current's memcg's ancestor.
+			 */
+			if (in_interrupt() ||
+			    !mm_match_cgroup(current->mm, memcg)) {
 				schedule_work(&memcg->high_work);
-				break;
+			} else {
+				current->memcg_nr_pages_over_high += batch;
+				set_notify_resume(current);
 			}
-			current->memcg_nr_pages_over_high += batch;
-			set_notify_resume(current);
 			break;
 		}
 	} while ((memcg = parent_mem_cgroup(memcg)));
If a memcg is over high limit, memory reclaim is scheduled to run on
return-to-userland. However it is assumed that the memcg is the current
process's memcg. With remote memcg charging for kmem or swapping in a
page charged to remote memcg, current process can trigger reclaim on
remote memcg. So, scheduling reclaim on return-to-userland for remote
memcgs will ignore the high reclaim altogether. So, punt the high
reclaim of remote memcgs to high_work.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 mm/memcontrol.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)