Message ID | 20180525185501.82098-1-shakeelb@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, May 25, 2018 at 11:55:01AM -0700, Shakeel Butt wrote: > Based on several conditions the kernel can decide to force charge an > allocation for a memcg i.e. overcharge memcg->memory and memcg->memsw > counters. Do the same for memcg->kmem counter too. In cgroup-v1, this > bug can cause a __GFP_NOFAIL kmem allocation fail if an explicit limit > on kmem counter is set and reached. memory.kmem.limit is broken and unlikely to ever be fixed as this knob was deprecated in cgroup-v2. The fact that hitting the limit doesn't trigger reclaim can result in unexpected behavior from user's pov, like getting ENOMEM while listing a directory. Bypassing the limit for NOFAIL allocations isn't going to fix those problem. So I'd suggest to avoid setting memory.kmem.limit instead of trying to fix it or, even better, switch to cgroup-v2. > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > --- > mm/memcontrol.c | 21 +++++++++++++++++++-- > 1 file changed, 19 insertions(+), 2 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index ab5673dbfc4e..0a88f824c550 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1893,6 +1893,18 @@ void mem_cgroup_handle_over_high(void) > current->memcg_nr_pages_over_high = 0; > } > > +/* > + * Based on try_charge() force charge conditions. > + */ > +static inline bool should_force_charge(gfp_t gfp_mask) > +{ > + return (unlikely(tsk_is_oom_victim(current) || > + fatal_signal_pending(current) || > + current->flags & PF_EXITING || > + current->flags & PF_MEMALLOC || > + gfp_mask & __GFP_NOFAIL)); > +} > + > static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, > unsigned int nr_pages) > { > @@ -2008,6 +2020,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, > * The allocation either can't fail or will lead to more memory > * being freed very soon. Allow memory usage go over the limit > * temporarily by force charging it. > + * > + * NOTE: Please keep the should_force_charge() conditions in sync. > */ > page_counter_charge(&memcg->memory, nr_pages); > if (do_memsw_account()) > @@ -2331,8 +2345,11 @@ int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order, > > if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && > !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) { > - cancel_charge(memcg, nr_pages); > - return -ENOMEM; > + if (!should_force_charge(gfp)) { > + cancel_charge(memcg, nr_pages); > + return -ENOMEM; > + } > + page_counter_charge(&memcg->kmem, nr_pages); > } > > page->mem_cgroup = memcg;
On Sat, May 26, 2018 at 11:51 AM, Vladimir Davydov <vdavydov.dev@gmail.com> wrote: > On Fri, May 25, 2018 at 11:55:01AM -0700, Shakeel Butt wrote: >> Based on several conditions the kernel can decide to force charge an >> allocation for a memcg i.e. overcharge memcg->memory and memcg->memsw >> counters. Do the same for memcg->kmem counter too. In cgroup-v1, this >> bug can cause a __GFP_NOFAIL kmem allocation fail if an explicit limit >> on kmem counter is set and reached. > > memory.kmem.limit is broken and unlikely to ever be fixed as this knob > was deprecated in cgroup-v2. The fact that hitting the limit doesn't > trigger reclaim can result in unexpected behavior from user's pov, like > getting ENOMEM while listing a directory. Bypassing the limit for NOFAIL > allocations isn't going to fix those problem. I understand that fixing NOFAIL will not fix all other issues but it still is better than current situation. IMHO we should keep fixing kmem bit by bit. One crazy idea is to just break it completely by force charging all the time.
On Sat 26-05-18 15:37:05, Shakeel Butt wrote: > On Sat, May 26, 2018 at 11:51 AM, Vladimir Davydov > <vdavydov.dev@gmail.com> wrote: > > On Fri, May 25, 2018 at 11:55:01AM -0700, Shakeel Butt wrote: > >> Based on several conditions the kernel can decide to force charge an > >> allocation for a memcg i.e. overcharge memcg->memory and memcg->memsw > >> counters. Do the same for memcg->kmem counter too. In cgroup-v1, this > >> bug can cause a __GFP_NOFAIL kmem allocation fail if an explicit limit > >> on kmem counter is set and reached. > > > > memory.kmem.limit is broken and unlikely to ever be fixed as this knob > > was deprecated in cgroup-v2. The fact that hitting the limit doesn't > > trigger reclaim can result in unexpected behavior from user's pov, like > > getting ENOMEM while listing a directory. Bypassing the limit for NOFAIL > > allocations isn't going to fix those problem. > > I understand that fixing NOFAIL will not fix all other issues but it > still is better than current situation. IMHO we should keep fixing > kmem bit by bit. > > One crazy idea is to just break it completely by force charging all the time. What is the limit good for then? Accounting?
On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <mhocko@kernel.org> wrote: > On Sat 26-05-18 15:37:05, Shakeel Butt wrote: >> On Sat, May 26, 2018 at 11:51 AM, Vladimir Davydov >> <vdavydov.dev@gmail.com> wrote: >> > On Fri, May 25, 2018 at 11:55:01AM -0700, Shakeel Butt wrote: >> >> Based on several conditions the kernel can decide to force charge an >> >> allocation for a memcg i.e. overcharge memcg->memory and memcg->memsw >> >> counters. Do the same for memcg->kmem counter too. In cgroup-v1, this >> >> bug can cause a __GFP_NOFAIL kmem allocation fail if an explicit limit >> >> on kmem counter is set and reached. >> > >> > memory.kmem.limit is broken and unlikely to ever be fixed as this knob >> > was deprecated in cgroup-v2. The fact that hitting the limit doesn't >> > trigger reclaim can result in unexpected behavior from user's pov, like >> > getting ENOMEM while listing a directory. Bypassing the limit for NOFAIL >> > allocations isn't going to fix those problem. >> >> I understand that fixing NOFAIL will not fix all other issues but it >> still is better than current situation. IMHO we should keep fixing >> kmem bit by bit. >> >> One crazy idea is to just break it completely by force charging all the time. > > What is the limit good for then? Accounting? > Unlike tcpmem, the kmem accounting is enabled by default. No need to set the limit to enable accounting. I think my crazy idea was just wrong and without much thought. Though is there a precedence where the broken feature is not fixed because an alternative is available?
On Mon 28-05-18 10:23:07, Shakeel Butt wrote: > On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <mhocko@kernel.org> wrote: > Though is there a precedence where the broken feature is not fixed > because an alternative is available? Well, I can see how breaking GFP_NOFAIL semantic is problematic, on the other hand we keep saying that kmem accounting in v1 is hard usable and strongly discourage people from using it. Sure we can add the code which handles _this_ particular case but that wouldn't make the whole thing more usable I strongly suspect. Maybe I am wrong and you can provide some specific examples. Is GFP_NOFAIL that common to matter? In any case we should balance between the code maintainability here. Adding more cruft into the allocator path is not free.
On Tue, May 29, 2018 at 1:31 AM, Michal Hocko <mhocko@kernel.org> wrote: > On Mon 28-05-18 10:23:07, Shakeel Butt wrote: >> On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <mhocko@kernel.org> wrote: >> Though is there a precedence where the broken feature is not fixed >> because an alternative is available? > > Well, I can see how breaking GFP_NOFAIL semantic is problematic, on the > other hand we keep saying that kmem accounting in v1 is hard usable and > strongly discourage people from using it. Sure we can add the code which > handles _this_ particular case but that wouldn't make the whole thing > more usable I strongly suspect. Maybe I am wrong and you can provide > some specific examples. Is GFP_NOFAIL that common to matter? > > In any case we should balance between the code maintainability here. > Adding more cruft into the allocator path is not free. > We do not use kmem limits internally and this is something I found through code inspection. If this patch is increasing the cost of code maintainability I am fine with dropping it but at least there should a comment saying that kmem limits are broken and no need fix. Shakeel
On Wed, May 30, 2018 at 11:14:33AM -0700, Shakeel Butt wrote: > On Tue, May 29, 2018 at 1:31 AM, Michal Hocko <mhocko@kernel.org> wrote: > > On Mon 28-05-18 10:23:07, Shakeel Butt wrote: > >> On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <mhocko@kernel.org> wrote: > >> Though is there a precedence where the broken feature is not fixed > >> because an alternative is available? > > > > Well, I can see how breaking GFP_NOFAIL semantic is problematic, on the > > other hand we keep saying that kmem accounting in v1 is hard usable and > > strongly discourage people from using it. Sure we can add the code which > > handles _this_ particular case but that wouldn't make the whole thing > > more usable I strongly suspect. Maybe I am wrong and you can provide > > some specific examples. Is GFP_NOFAIL that common to matter? > > > > In any case we should balance between the code maintainability here. > > Adding more cruft into the allocator path is not free. > > > > We do not use kmem limits internally and this is something I found > through code inspection. If this patch is increasing the cost of code > maintainability I am fine with dropping it but at least there should a > comment saying that kmem limits are broken and no need fix. I agree. Even, I didn't know kmem is strongly discouraged until now. Then, why is it enabled by default on cgroup v1? Let's turn if off with comment "It's broken so do not use/fix. Instead, please move to cgroup v2".
On Thu 31-05-18 15:01:33, Minchan Kim wrote: > On Wed, May 30, 2018 at 11:14:33AM -0700, Shakeel Butt wrote: > > On Tue, May 29, 2018 at 1:31 AM, Michal Hocko <mhocko@kernel.org> wrote: > > > On Mon 28-05-18 10:23:07, Shakeel Butt wrote: > > >> On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <mhocko@kernel.org> wrote: > > >> Though is there a precedence where the broken feature is not fixed > > >> because an alternative is available? > > > > > > Well, I can see how breaking GFP_NOFAIL semantic is problematic, on the > > > other hand we keep saying that kmem accounting in v1 is hard usable and > > > strongly discourage people from using it. Sure we can add the code which > > > handles _this_ particular case but that wouldn't make the whole thing > > > more usable I strongly suspect. Maybe I am wrong and you can provide > > > some specific examples. Is GFP_NOFAIL that common to matter? > > > > > > In any case we should balance between the code maintainability here. > > > Adding more cruft into the allocator path is not free. > > > > > > > We do not use kmem limits internally and this is something I found > > through code inspection. If this patch is increasing the cost of code > > maintainability I am fine with dropping it but at least there should a > > comment saying that kmem limits are broken and no need fix. > > I agree. > > Even, I didn't know kmem is strongly discouraged until now. Then, > why is it enabled by default on cgroup v1? You have to set a non-zero limit to make it active IIRC. The code is compiled in because v2 enables it by default. > Let's turn if off with comment "It's broken so do not use/fix. Instead, > please move to cgroup v2".
On Thu, May 31, 2018 at 08:56:42AM +0200, Michal Hocko wrote: > On Thu 31-05-18 15:01:33, Minchan Kim wrote: > > On Wed, May 30, 2018 at 11:14:33AM -0700, Shakeel Butt wrote: > > > On Tue, May 29, 2018 at 1:31 AM, Michal Hocko <mhocko@kernel.org> wrote: > > > > On Mon 28-05-18 10:23:07, Shakeel Butt wrote: > > > >> On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <mhocko@kernel.org> wrote: > > > >> Though is there a precedence where the broken feature is not fixed > > > >> because an alternative is available? > > > > > > > > Well, I can see how breaking GFP_NOFAIL semantic is problematic, on the > > > > other hand we keep saying that kmem accounting in v1 is hard usable and > > > > strongly discourage people from using it. Sure we can add the code which > > > > handles _this_ particular case but that wouldn't make the whole thing > > > > more usable I strongly suspect. Maybe I am wrong and you can provide > > > > some specific examples. Is GFP_NOFAIL that common to matter? > > > > > > > > In any case we should balance between the code maintainability here. > > > > Adding more cruft into the allocator path is not free. > > > > > > > > > > We do not use kmem limits internally and this is something I found > > > through code inspection. If this patch is increasing the cost of code > > > maintainability I am fine with dropping it but at least there should a > > > comment saying that kmem limits are broken and no need fix. > > > > I agree. > > > > Even, I didn't know kmem is strongly discouraged until now. Then, > > why is it enabled by default on cgroup v1? > > You have to set a non-zero limit to make it active IIRC. The code is Maybe, no. I didn't set anything. IOW, it was a default without any setting and I hit this as you can remember. http://lkml.kernel.org/r/<20180418022912.248417-1-minchan@kernel.org> We don't need to allocate memory for stuff even maintainers discourage. > compiled in because v2 enables it by default. > > > Let's turn if off with comment "It's broken so do not use/fix. Instead, > > please move to cgroup v2". > > -- > Michal Hocko > SUSE Labs
On Thu 31-05-18 17:23:17, Minchan Kim wrote: > On Thu, May 31, 2018 at 08:56:42AM +0200, Michal Hocko wrote: > > On Thu 31-05-18 15:01:33, Minchan Kim wrote: > > > On Wed, May 30, 2018 at 11:14:33AM -0700, Shakeel Butt wrote: > > > > On Tue, May 29, 2018 at 1:31 AM, Michal Hocko <mhocko@kernel.org> wrote: > > > > > On Mon 28-05-18 10:23:07, Shakeel Butt wrote: > > > > >> On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <mhocko@kernel.org> wrote: > > > > >> Though is there a precedence where the broken feature is not fixed > > > > >> because an alternative is available? > > > > > > > > > > Well, I can see how breaking GFP_NOFAIL semantic is problematic, on the > > > > > other hand we keep saying that kmem accounting in v1 is hard usable and > > > > > strongly discourage people from using it. Sure we can add the code which > > > > > handles _this_ particular case but that wouldn't make the whole thing > > > > > more usable I strongly suspect. Maybe I am wrong and you can provide > > > > > some specific examples. Is GFP_NOFAIL that common to matter? > > > > > > > > > > In any case we should balance between the code maintainability here. > > > > > Adding more cruft into the allocator path is not free. > > > > > > > > > > > > > We do not use kmem limits internally and this is something I found > > > > through code inspection. If this patch is increasing the cost of code > > > > maintainability I am fine with dropping it but at least there should a > > > > comment saying that kmem limits are broken and no need fix. > > > > > > I agree. > > > > > > Even, I didn't know kmem is strongly discouraged until now. Then, > > > why is it enabled by default on cgroup v1? > > > > You have to set a non-zero limit to make it active IIRC. The code is > > Maybe, no. I didn't set anything. IOW, it was a default without any setting > and I hit this as you can remember. > http://lkml.kernel.org/r/<20180418022912.248417-1-minchan@kernel.org> > We don't need to allocate memory for stuff even maintainers discourage. Well, you can disable the whole kmem accounting by cgroup.memory=nokmem kernel parameter. I wasn't very happy when we removed the config option which would have put all the tracking aside even when no limit is set but the argument was that the ifdefery was growing into an unmaintainable mess and I have to agree with that. So here we are...
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ab5673dbfc4e..0a88f824c550 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1893,6 +1893,18 @@ void mem_cgroup_handle_over_high(void) current->memcg_nr_pages_over_high = 0; } +/* + * Based on try_charge() force charge conditions. + */ +static inline bool should_force_charge(gfp_t gfp_mask) +{ + return (unlikely(tsk_is_oom_victim(current) || + fatal_signal_pending(current) || + current->flags & PF_EXITING || + current->flags & PF_MEMALLOC || + gfp_mask & __GFP_NOFAIL)); +} + static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned int nr_pages) { @@ -2008,6 +2020,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, * The allocation either can't fail or will lead to more memory * being freed very soon. Allow memory usage go over the limit * temporarily by force charging it. + * + * NOTE: Please keep the should_force_charge() conditions in sync. */ page_counter_charge(&memcg->memory, nr_pages); if (do_memsw_account()) @@ -2331,8 +2345,11 @@ int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order, if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) { - cancel_charge(memcg, nr_pages); - return -ENOMEM; + if (!should_force_charge(gfp)) { + cancel_charge(memcg, nr_pages); + return -ENOMEM; + } + page_counter_charge(&memcg->kmem, nr_pages); } page->mem_cgroup = memcg;
Based on several conditions the kernel can decide to force charge an allocation for a memcg i.e. overcharge memcg->memory and memcg->memsw counters. Do the same for memcg->kmem counter too. In cgroup-v1, this bug can cause a __GFP_NOFAIL kmem allocation fail if an explicit limit on kmem counter is set and reached. Signed-off-by: Shakeel Butt <shakeelb@google.com> --- mm/memcontrol.c | 21 +++++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-)