Message ID: 20190110174432.82064-1-shakeelb@google.com (mailing list archive)
State: New, archived
Series: [v3] memcg: schedule high reclaim for remote memcgs on high_work
Hi Shakeel,

On Thu, Jan 10, 2019 at 09:44:32AM -0800, Shakeel Butt wrote:
> If a memcg is over high limit, memory reclaim is scheduled to run on
> return-to-userland. However it is assumed that the memcg is the current
> process's memcg. With remote memcg charging for kmem or swapping in a
> page charged to remote memcg, current process can trigger reclaim on
> remote memcg. So, schduling reclaim on return-to-userland for remote
> memcgs will ignore the high reclaim altogether. So, record the memcg
> needing high reclaim and trigger high reclaim for that memcg on
> return-to-userland. However if the memcg is already recorded for high
> reclaim and the recorded memcg is not the descendant of the the memcg
> needing high reclaim, punt the high reclaim to the work queue.

The idea behind remote charging is that the thread allocating the
memory is not responsible for that memory, but a different cgroup
is. Why would the same thread then have to work off any high excess
this could produce in that unrelated group?

Say you have an inotify/dnotify listener that is restricted in its
memory use - now everybody sending notification events from outside
that listener's group would get throttled on a cgroup over which it
has no control. That sounds like a recipe for priority inversions.

It seems to me we should only do reclaim-on-return when current is in
the ill-behaved cgroup, and punt everything else - interrupts and
remote charges - to the workqueue.
Hi Johannes,

On Fri, Jan 11, 2019 at 12:59 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Hi Shakeel,
>
> On Thu, Jan 10, 2019 at 09:44:32AM -0800, Shakeel Butt wrote:
> > If a memcg is over high limit, memory reclaim is scheduled to run on
[...]
> > needing high reclaim, punt the high reclaim to the work queue.
>
> The idea behind remote charging is that the thread allocating the
> memory is not responsible for that memory, but a different cgroup
> is. Why would the same thread then have to work off any high excess
> this could produce in that unrelated group?
>
> Say you have a inotify/dnotify listener that is restricted in its
> memory use - now everybody sending notification events from outside
> that listener's group would get throttled on a cgroup over which it
> has no control. That sounds like a recipe for priority inversions.
>
> It seems to me we should only do reclaim-on-return when current is in
> the ill-behaved cgroup, and punt everything else - interrupts and
> remote charges - to the workqueue.

This is what v1 of this patch was doing but Michal suggested to do
what this version is doing. Michal's argument was that the current is
already charging and maybe reclaiming a remote memcg then why not do
the high excess reclaim as well.

Personally I don't have any strong opinion either way. What I actually
wanted was to punt this high reclaim to some process in that remote
memcg. However I didn't explore much on that direction thinking if
that complexity is worth it. Maybe I should at least explore it, so,
we can compare the solutions. What do you think?

Shakeel
On Fri 11-01-19 14:54:32, Shakeel Butt wrote:
> Hi Johannes,
>
> On Fri, Jan 11, 2019 at 12:59 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
[...]
> > It seems to me we should only do reclaim-on-return when current is in
> > the ill-behaved cgroup, and punt everything else - interrupts and
> > remote charges - to the workqueue.
>
> This is what v1 of this patch was doing but Michal suggested to do
> what this version is doing. Michal's argument was that the current is
> already charging and maybe reclaiming a remote memcg then why not do
> the high excess reclaim as well.

Johannes has a good point about the priority inversion problems which I
haven't thought about.

> Personally I don't have any strong opinion either way. What I actually
> wanted was to punt this high reclaim to some process in that remote
> memcg. However I didn't explore much on that direction thinking if
> that complexity is worth it. Maybe I should at least explore it, so,
> we can compare the solutions. What do you think?

My question would be whether we really care all that much. Do we know of
workloads which would generate a large high limit excess?
On Sun, Jan 13, 2019 at 10:34 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 11-01-19 14:54:32, Shakeel Butt wrote:
[...]
> > This is what v1 of this patch was doing but Michal suggested to do
> > what this version is doing. Michal's argument was that the current is
> > already charging and maybe reclaiming a remote memcg then why not do
> > the high excess reclaim as well.
>
> Johannes has a good point about the priority inversion problems which I
> haven't thought about.
>
> > Personally I don't have any strong opinion either way. What I actually
> > wanted was to punt this high reclaim to some process in that remote
> > memcg. However I didn't explore much on that direction thinking if
> > that complexity is worth it. Maybe I should at least explore it, so,
> > we can compare the solutions. What do you think?
>
> My question would be whether we really care all that much. Do we know of
> workloads which would generate a large high limit excess?

The current semantics of memory.high is that it can be breached under
extreme conditions. However any workload where memory.high is used and
a lot of remote memcg charging happens (inotify/dnotify example given
by Johannes or swapping in tmpfs file or shared memory region) the
memory.high breach will become common.

Shakeel
On Mon 14-01-19 12:18:07, Shakeel Butt wrote:
> On Sun, Jan 13, 2019 at 10:34 AM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > My question would be whether we really care all that much. Do we know of
> > workloads which would generate a large high limit excess?
>
> The current semantics of memory.high is that it can be breached under
> extreme conditions. However any workload where memory.high is used and
> a lot of remote memcg charging happens (inotify/dnotify example given
> by Johannes or swapping in tmpfs file or shared memory region) the
> memory.high breach will become common.

This is exactly what I am asking about. Is this something that can
happen easily? Remote charges on themselves should be rare, no?
On Mon, Jan 14, 2019 at 11:25 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 14-01-19 12:18:07, Shakeel Butt wrote:
[...]
> > The current semantics of memory.high is that it can be breached under
> > extreme conditions. However any workload where memory.high is used and
> > a lot of remote memcg charging happens (inotify/dnotify example given
> > by Johannes or swapping in tmpfs file or shared memory region) the
> > memory.high breach will become common.
>
> This is exactly what I am asking about. Is this something that can
> happen easily? Remote charges on themselves should be rare, no?

At the moment, for kmem we can do remote charging for fanotify,
inotify and buffer_head and for anon pages we can do remote charging
on swap in. Now based on the workload's cgroup setup the remote
charging can be very frequent or rare.

At Google, remote charging is very frequent but since we are still on
cgroup-v1 and do not use memory.high, the issue this patch is fixing
is not observed. However for the adoption of cgroup-v2, this fix is
needed.

Shakeel
On Tue 15-01-19 11:38:23, Shakeel Butt wrote:
> On Mon, Jan 14, 2019 at 11:25 PM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > This is exactly what I am asking about. Is this something that can
> > happen easily? Remote charges on themselves should be rare, no?
>
> At the moment, for kmem we can do remote charging for fanotify,
> inotify and buffer_head and for anon pages we can do remote charging
> on swap in. Now based on the workload's cgroup setup the remote
> charging can be very frequent or rare.
>
> At Google, remote charging is very frequent but since we are still on
> cgroup-v1 and do not use memory.high, the issue this patch is fixing
> is not observed. However for the adoption of cgroup-v2, this fix is
> needed.

Adding some numbers into the changelog would be really valuable to judge
the urgency and the scale of the problem.

If we are going via kworker then it is also important to evaluate what
kind of effect on the system this has. How big of the excess can we get?
Why don't those memcgs resolve the excess by themselves on the first
direct charge? Is it possible that kworkers simply swamp the system with
many parallel memcgs with remote charges? In other words we need deeper
analysis of the problem and the solution.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7d08562eeec7..5e6690042497 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1172,6 +1172,9 @@ struct task_struct {

 	/* Used by memcontrol for targeted memcg charge: */
 	struct mem_cgroup *active_memcg;
+
+	/* Used by memcontrol for high reclaim: */
+	struct mem_cgroup *memcg_high_reclaim;
 #endif

 #ifdef CONFIG_BLK_CGROUP
diff --git a/kernel/fork.c b/kernel/fork.c
index 1b0fde63d831..85da44137847 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -918,6 +918,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)

 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
+	tsk->memcg_high_reclaim = NULL;
 #endif
 	return tsk;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 953d4ba8a595..18f4aefbe0bf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2168,14 +2168,17 @@ static void high_work_func(struct work_struct *work)
 void mem_cgroup_handle_over_high(void)
 {
 	unsigned int nr_pages = current->memcg_nr_pages_over_high;
-	struct mem_cgroup *memcg;
+	struct mem_cgroup *memcg = current->memcg_high_reclaim;

 	if (likely(!nr_pages))
 		return;

-	memcg = get_mem_cgroup_from_mm(current->mm);
+	if (!memcg)
+		memcg = get_mem_cgroup_from_mm(current->mm);
+
 	reclaim_high(memcg, nr_pages, GFP_KERNEL);
 	css_put(&memcg->css);
+	current->memcg_high_reclaim = NULL;
 	current->memcg_nr_pages_over_high = 0;
 }

@@ -2329,10 +2332,10 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * If the hierarchy is above the normal consumption range, schedule
 	 * reclaim on returning to userland. We can perform reclaim here
 	 * if __GFP_RECLAIM but let's always punt for simplicity and so that
-	 * GFP_KERNEL can consistently be used during reclaim. @memcg is
-	 * not recorded as it most likely matches current's and won't
-	 * change in the meantime. As high limit is checked again before
-	 * reclaim, the cost of mismatch is negligible.
+	 * GFP_KERNEL can consistently be used during reclaim. Record the memcg
+	 * for the return-to-userland high reclaim. If the memcg is already
+	 * recorded and the recorded memcg is not the descendant of the memcg
+	 * needing high reclaim, punt the high reclaim to the work queue.
 	 */
 	do {
 		if (page_counter_read(&memcg->memory) > memcg->high) {
@@ -2340,6 +2343,13 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			if (in_interrupt()) {
 				schedule_work(&memcg->high_work);
 				break;
+			} else if (!current->memcg_high_reclaim) {
+				css_get(&memcg->css);
+				current->memcg_high_reclaim = memcg;
+			} else if (!mem_cgroup_is_descendant(
+					current->memcg_high_reclaim, memcg)) {
+				schedule_work(&memcg->high_work);
+				break;
 			}
 			current->memcg_nr_pages_over_high += batch;
 			set_notify_resume(current);
If a memcg is over its high limit, memory reclaim is scheduled to run on
return-to-userland. However it is assumed that the memcg is the current
process's memcg. With remote memcg charging for kmem, or swapping in a
page charged to a remote memcg, the current process can trigger reclaim
on a remote memcg. So, scheduling reclaim on return-to-userland for
remote memcgs will miss the high reclaim altogether. Instead, record the
memcg needing high reclaim and trigger high reclaim for that memcg on
return-to-userland. However, if a memcg is already recorded for high
reclaim and the recorded memcg is not a descendant of the memcg needing
high reclaim, punt the high reclaim to the work queue.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
Changelog since v2:
- TIF_NOTIFY_RESUME can be set from places other than try_charge() in
  which case current->memcg_high_reclaim will be null. Correctly handle
  such scenarios.

Changelog since v1:
- Punt high reclaim of a memcg to the work queue only if the recorded
  memcg is not its descendant.

 include/linux/sched.h |  3 +++
 kernel/fork.c         |  1 +
 mm/memcontrol.c       | 22 ++++++++++++++++------
 3 files changed, 20 insertions(+), 6 deletions(-)