[v2] mm, memcg: avoid oom if cgroup is not populated

Message ID	1574818117-2885-1-git-send-email-laoar.shao@gmail.com (mailing list archive)
State	New, archived
Headers	Return-Path: <SRS0=4KWc=ZT=kvack.org=owner-linux-mm@kernel.org>
	DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 84FDE2071E
	From: Yafang Shao <laoar.shao@gmail.com>
	To: mhocko@kernel.org, hannes@cmpxchg.org, vdavydov.dev@gmail.com, akpm@linux-foundation.org
	Cc: linux-mm@kvack.org, Yafang Shao <laoar.shao@gmail.com>, Michal Hocko <mhocko@suse.com>
	Subject: [PATCH v2] mm, memcg: avoid oom if cgroup is not populated
	Date: Tue, 26 Nov 2019 20:28:37 -0500
	Message-Id: <1574818117-2885-1-git-send-email-laoar.shao@gmail.com>
	Sender: owner-linux-mm@kvack.org
	Precedence: bulk
Series	[v2] mm, memcg: avoid oom if cgroup is not populated

Yafang Shao Nov. 27, 2019, 1:28 a.m. UTC

There is a case where all the processes in a memcg have exited (due to OOM
group or for some other reason), but its file page cache still exists.
This page cache may be protected by memory.min and so can't be
reclaimed. If we can't successfully restart the processes in this memcg, or
don't want to take this memcg offline, then we want to drop the file page
cache.
The advantage of dropping this file cache is that it keeps the reclaimer
(either kswapd or direct reclaim) from scanning and reclaiming pages from
all the memcgs in the system, because currently the reclaimer reclaims
pages fairly from all memcgs when the system is under memory pressure.
One possible way to drop this file page cache is to set the
hard limit of this memcg to 0. Unfortunately this may invoke the OOM killer
and generate lots of output, which should not happen.
The OOM output is not expected by an admin who wants to drop
the caches and knows there are no processes in this memcg.

If the memcg is not populated, we should not invoke the OOM killer because
there's nothing to kill. The next time you start a new process, if
max is still below usage, the OOM killer will be invoked and your new
process will be killed, so we can consider it a lazy OOM, which is what we
have always been doing in the kernel.

Fixes: b6e6edcf ("mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
---
 mm/memcontrol.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

Michal Hocko Nov. 27, 2019, 8:54 a.m. UTC | #1

On Tue 26-11-19 20:28:37, Yafang Shao wrote:
> There's one case that the processes in a memcg are all exit (due to OOM
> group or some other reasons), but the file page caches are still exist.
> These file page caches may be protected by memory.min so can't be
> reclaimed. If we can't success to restart the processes in this memcg or
> don't want to make this memcg offline, then we want to drop the file page
> caches.
> The advantage of droping this file caches is it can avoid the reclaimer
> (either kswapd or direct) scanning and reclaiming pages from all memcgs
> exist in this system, because currently the reclaimer will fairly reclaim
> pages from all memcgs if the system is under memory pressure.
> The possible method to drop these file page caches is setting the
> hard limit of this memcg to 0. Unfortunately this may invoke the OOM killer
> and generates lots of outputs, that should not happen.
> The OOM output is not expected by the admin if he or she wants to drop
> the cahes and knows there're no processes in this memcg.
> 
> If memcg is not populated, we should not invoke the OOM killer because
> there's nothing to kill. Next time when you start a new process and if the
> max is still bellow usage, the OOM killer will be invoked and your new
> process is killed, so we can cosider it as lazy OOM, that is we have been
> always doing in the kernel.
> 
> Fixes: b6e6edcf ("mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage")
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.com>

due to reasons explained repeatedly
Nacked-by: Michal Hocko <mhocko@suse.com>

And I really find it highly annoying that you keep ignoring the review
feedback.

Yafang Shao Nov. 27, 2019, 9:17 a.m. UTC | #2

On Wed, Nov 27, 2019 at 4:54 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 26-11-19 20:28:37, Yafang Shao wrote:
> > There's one case that the processes in a memcg are all exit (due to OOM
> > group or some other reasons), but the file page caches are still exist.
> > These file page caches may be protected by memory.min so can't be
> > reclaimed. If we can't success to restart the processes in this memcg or
> > don't want to make this memcg offline, then we want to drop the file page
> > caches.
> > The advantage of droping this file caches is it can avoid the reclaimer
> > (either kswapd or direct) scanning and reclaiming pages from all memcgs
> > exist in this system, because currently the reclaimer will fairly reclaim
> > pages from all memcgs if the system is under memory pressure.
> > The possible method to drop these file page caches is setting the
> > hard limit of this memcg to 0. Unfortunately this may invoke the OOM killer
> > and generates lots of outputs, that should not happen.
> > The OOM output is not expected by the admin if he or she wants to drop
> > the cahes and knows there're no processes in this memcg.
> >
> > If memcg is not populated, we should not invoke the OOM killer because
> > there's nothing to kill. Next time when you start a new process and if the
> > max is still bellow usage, the OOM killer will be invoked and your new
> > process is killed, so we can cosider it as lazy OOM, that is we have been
> > always doing in the kernel.
> >
> > Fixes: b6e6edcf ("mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage")
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Michal Hocko <mhocko@suse.com>
>
> due to reasons explained repeatedly
> Nacked-by: Michal Hocko <mhocko@suse.com>
>
> And I really find it highly annoying that you keep ignoring the review
> feedback.

I didn't ignore your feedback; please read my reply and the commit log
carefully. The reason I didn't accept your feedback is that it is
based on incorrect knowledge.

Thanks

Yafang

Yafang Shao Nov. 27, 2019, 9:33 a.m. UTC | #3

On Wed, Nov 27, 2019 at 5:17 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Wed, Nov 27, 2019 at 4:54 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Tue 26-11-19 20:28:37, Yafang Shao wrote:
> > > There's one case that the processes in a memcg are all exit (due to OOM
> > > group or some other reasons), but the file page caches are still exist.
> > > These file page caches may be protected by memory.min so can't be
> > > reclaimed. If we can't success to restart the processes in this memcg or
> > > don't want to make this memcg offline, then we want to drop the file page
> > > caches.
> > > The advantage of droping this file caches is it can avoid the reclaimer
> > > (either kswapd or direct) scanning and reclaiming pages from all memcgs
> > > exist in this system, because currently the reclaimer will fairly reclaim
> > > pages from all memcgs if the system is under memory pressure.
> > > The possible method to drop these file page caches is setting the
> > > hard limit of this memcg to 0. Unfortunately this may invoke the OOM killer
> > > and generates lots of outputs, that should not happen.
> > > The OOM output is not expected by the admin if he or she wants to drop
> > > the cahes and knows there're no processes in this memcg.
> > >
> > > If memcg is not populated, we should not invoke the OOM killer because
> > > there's nothing to kill. Next time when you start a new process and if the
> > > max is still bellow usage, the OOM killer will be invoked and your new
> > > process is killed, so we can cosider it as lazy OOM, that is we have been
> > > always doing in the kernel.
> > >
> > > Fixes: b6e6edcf ("mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage")
> > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > Cc: Michal Hocko <mhocko@suse.com>
> >
> > due to reasons explained repeatedly
> > Nacked-by: Michal Hocko <mhocko@suse.com>
> >
> > And I really find it highly annoying that you keep ignoring the review
> > feedback.
>
> I didn't ignore your feedback, pls. read my reply and commit log seriously.
> The reason I didn't accept your freeback is that your freeback is
> based on your wrong knowladge.
>

Johannes, on the other hand, really gave me some useful feedback. Thanks, Johannes!

Thanks
Yafang

David Hildenbrand Nov. 27, 2019, 11:11 a.m. UTC | #4

On 27.11.19 02:28, Yafang Shao wrote:

Let me give this patch description an overhaul:

> There's one case that the processes in a memcg are all exit (due to OOM
> group or some other reasons), but the file page caches are still exist.

"When there are no more processes in a memcg (e.g., due to OOM
group), we can still have file pages in the page cache."

> These file page caches may be protected by memory.min so can't be
> reclaimed. If we can't success to restart the processes in this memcg or
> don't want to make this memcg offline, then we want to drop the file page
> caches.

"If these pages are protected by memory.min, they can't be reclaimed.
Especially if there won't be another process in this memcg and the memcg
is kept online, we do want to drop these pages from the page cache."

> The advantage of droping this file caches is it can avoid the reclaimer
> (either kswapd or direct) scanning and reclaiming pages from all memcgs
> exist in this system, because currently the reclaimer will fairly reclaim
> pages from all memcgs if the system is under memory pressure.

"By dropping these page caches we can avoid reclaimers (e.g., kswapd or
direct) to scan and reclaim pages from all memcgs in the system -
because the reclaimers will try to fairly reclaim pages from all memcgs
in the system when under memory pressure."

> The possible method to drop these file page caches is setting the
> hard limit of this memcg to 0. Unfortunately this may invoke the OOM killer
> and generates lots of outputs, that should not happen.
> The OOM output is not expected by the admin if he or she wants to drop
> the cahes and knows there're no processes in this memcg.

"By setting the hard limit of such a memcg to 0, we allow to drop the
page cache of such memcgs. Unfortunately, this may invoke the OOM killer
and generate a lot of output. The OOM output is not expected by an admin
who wants to drop these caches and knows that there are no processes in
this memcg anymore."

> 
> If memcg is not populated, we should not invoke the OOM killer because
> there's nothing to kill. Next time when you start a new process and if the
> max is still bellow usage, the OOM killer will be invoked and your new
> process is killed, so we can cosider it as lazy OOM, that is we have been
> always doing in the kernel.

"Therefore, if a memcg is not populated, we should not invoke the OOM
killer - there is nothing to kill. The next time a new process is
started in the memcg and the "max" is still below usage, the OOM killer
will be invoked and the new process will be killed."

1. I don't think the "lazy OOM" part is relevant.

2. Where is the part that modifies the limits? Or did you drop that? Is
it part of another patch?

3. I think I agree with Michal that modifying the limits smells more
like a configuration thingy to be handled by an admin (especially, adapt
min/max properly). But again, not sure where that change is located :)

4. This patch on its own (if there are no processes, there is nothing to
kill) does not sound too wrong to me. Instead of an endless loop
(besides signals) where we can't make any progress, we exit right away.

(I am not yet too familiar with memcg, Michal is clearly the expert :) )

> 
> Fixes: b6e6edcf ("mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage")
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.com>
> ---
>  mm/memcontrol.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1c4c08b..e936f1b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6139,9 +6139,20 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>  			continue;
>  		}
>  
> -		memcg_memory_event(memcg, MEMCG_OOM);
> -		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
> +		/* If there's no procesess, we don't need to invoke the OOM
> +		 * killer. Then next time when you try to start a process
> +		 * in this memcg, the max may still bellow usage, and then
> +		 * this OOM killer will be invoked. This can be considered
> +		 * as lazy OOM, that is we have been always doing in the
> +		 * kernel. Pls. Michal, that is really consistency.
> +		 */
> +		if (cgroup_is_populated(memcg->css.cgroup)) {
> +			memcg_memory_event(memcg, MEMCG_OOM);
> +			if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
> +				break;
> +		} else  {
>  			break;
> +		}
>  	}
>  
>  	memcg_wb_domain_size_changed(memcg);
>

Yafang Shao Nov. 27, 2019, 11:35 a.m. UTC | #5

On Wed, Nov 27, 2019 at 7:11 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 27.11.19 02:28, Yafang Shao wrote:
>
> Let me give this patch description an overhaul:
>

Well done!
Thanks for your work.

> > There's one case that the processes in a memcg are all exit (due to OOM
> > group or some other reasons), but the file page caches are still exist.
>
> "When there are no more processes in a memcg (e.g., due to OOM
> group), we can still have file pages in the page cache."
>
> > These file page caches may be protected by memory.min so can't be
> > reclaimed. If we can't success to restart the processes in this memcg or
> > don't want to make this memcg offline, then we want to drop the file page
> > caches.
>
> "If these pages are protected by memory.min, they can't be reclaimed.
> Especially if there won't be another process in this memcg and the memcg
> is kept online, we do want to drop these pages from the page cache."
>
> > The advantage of droping this file caches is it can avoid the reclaimer
> > (either kswapd or direct) scanning and reclaiming pages from all memcgs
> > exist in this system, because currently the reclaimer will fairly reclaim
> > pages from all memcgs if the system is under memory pressure.
>
> "By dropping these page caches we can avoid reclaimers (e.g., kswapd or
> direct) to scan and reclaim pages from all memcgs in the system -
> because the reclaimers will try to fairly reclaim pages from all memcgs
> in the system when under memory pressure."
>
> > The possible method to drop these file page caches is setting the
> > hard limit of this memcg to 0. Unfortunately this may invoke the OOM killer
> > and generates lots of outputs, that should not happen.
> > The OOM output is not expected by the admin if he or she wants to drop
> > the cahes and knows there're no processes in this memcg.
>
> "By setting the hard limit of such a memcg to 0, we allow to drop the
> page cache of such memcgs. Unfortunately, this may invoke the OOM killer
> and generate a lot of output. The OOM output is not expected by an admin
> who wants to drop these caches and knows that there are no processes in
> this memcg anymore."
>
> >
> > If memcg is not populated, we should not invoke the OOM killer because
> > there's nothing to kill. Next time when you start a new process and if the
> > max is still bellow usage, the OOM killer will be invoked and your new
> > process is killed, so we can cosider it as lazy OOM, that is we have been
> > always doing in the kernel.
>
> "Therefore, if a memcg is not populated, we should not invoke the OOM
> killer - there is nothing to kill. The next time a new process is
> started in the memcg and the "max" is still below usage, the OOM killer
> will be invoked and the new process will be killed."
>
> 1. I don't think the "lazy OOM" part is relevant.
>

That isn't important.

> 2. Where is the part that modifies the limits? or did you drop that? is
> it part of another patch?
>

No, it is not part of another patch.
Modifying the limits is a workaround that Michal [1] suggested
to fix my problem,
but it doesn't actually work, which is why I submitted this patch.

1. https://lore.kernel.org/linux-mm/20191126073129.GA20912@dhcp22.suse.cz/


> 3. I think I agree with Michal that modifying the limits smells more
> like a configuration thingy to be handled by an admin (especially, adapt
> min/max properly). But again, not sure where that change is located :)
>

I agree with you all, but that is what Michal told me to do. See above and
the discussion in this thread.

> 4. This patch on its own (if there are no processes, there is nothing to
> kill) does not sound too wrong to me. Instead of an endless loop
> (besides signals) where we can't make any progress, we exit right away.
>

Thanks for your feedback.

> (I am not yet too familiar with memgc, Michal is clearly the expert :) )
>

I agree with you that Michal is an expert, but clearly Michal is
not an expert on this issue.

> >
> > Fixes: b6e6edcf ("mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage")
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/memcontrol.c | 15 +++++++++++++--
> >  1 file changed, 13 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 1c4c08b..e936f1b 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6139,9 +6139,20 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> >                       continue;
> >               }
> >
> > -             memcg_memory_event(memcg, MEMCG_OOM);
> > -             if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
> > +             /* If there's no procesess, we don't need to invoke the OOM
> > +              * killer. Then next time when you try to start a process
> > +              * in this memcg, the max may still bellow usage, and then
> > +              * this OOM killer will be invoked. This can be considered
> > +              * as lazy OOM, that is we have been always doing in the
> > +              * kernel. Pls. Michal, that is really consistency.
> > +              */
> > +             if (cgroup_is_populated(memcg->css.cgroup)) {
> > +                     memcg_memory_event(memcg, MEMCG_OOM);
> > +                     if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
> > +                             break;
> > +             } else  {
> >                       break;
> > +             }
> >       }
> >
> >       memcg_wb_domain_size_changed(memcg);
> >
>
>
> --
> Thanks,
>
> David / dhildenb
>


Thanks
Yafang

Michal Hocko Nov. 27, 2019, 11:41 a.m. UTC | #6

On Wed 27-11-19 12:11:24, David Hildenbrand wrote:
[...]
> 4. This patch on its own (if there are no processes, there is nothing to
> kill) does not sound too wrong to me. Instead of an endless loop
> (besides signals) where we can't make any progress, we exit right away.

mem_cgroup_out_of_memory returns false when there is no oom victim
selected and then we break out.

My main objection to the patch is that it adds a subtle inconsistency.
Admins are simply not going to see that the memcg was OOM due to the
limit change and OOM killer cannot do anything about that. No tasks vs.
no killable task doesn't make any real difference. There is simply no
way to get out of that situation.

Yafang Shao Nov. 27, 2019, 11:55 a.m. UTC | #7

On Wed, Nov 27, 2019 at 7:41 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 27-11-19 12:11:24, David Hildenbrand wrote:
> [...]
> > 4. This patch on its own (if there are no processes, there is nothing to
> > kill) does not sound too wrong to me. Instead of an endless loop
> > (besides signals) where we can't make any progress, we exit right away.
>
> mem_cgroup_out_of_memory returns false when there is no oom victim
> selected and then we break out.
>
> My main objection to the patch is that it adds a subtle inconsitency.

I don't want to argue about inconsistency or consistency with you.

> Admins are simply not going to see that the memcg was OOM due to the
> limit change and OOM killer cannot do anything about that.

Printing something like "OOM and no tasks" can easily fix this issue,
if you insist that we should print something.
You can ignore my feedback if you would like to.

> No tasks vs.
> no killable task doesn't make any real difference. There is simply no
> way to get out of that situation.

Well, I don't want to argue with you again.

Thanks
Yafang

David Hildenbrand Nov. 27, 2019, 11:57 a.m. UTC | #8

On 27.11.19 12:41, Michal Hocko wrote:
> On Wed 27-11-19 12:11:24, David Hildenbrand wrote:
> [...]
>> 4. This patch on its own (if there are no processes, there is nothing to
>> kill) does not sound too wrong to me. Instead of an endless loop
>> (besides signals) where we can't make any progress, we exit right away.
> 
> mem_cgroup_out_of_memory returns false when there is no oom victim
> selected and then we break out.

I see. So it really is one iteration of OOM messages and then we break.

> 
> My main objection to the patch is that it adds a subtle inconsitency.
> Admins are simply not going to see that the memcg was OOM due to the
> limit change and OOM killer cannot do anything about that. No tasks vs.
> no killable task doesn't make any real difference. There is simply no
> way to get out of that situation.

Yeah, I was asking myself if we could handle that differently in the
shrinker then. E.g., print a different message ("OOM but no killable
tasks") or something like that. Then the admin is aware that there is an
OOM event and that e.g., starting the next process will definitely result
in surprises.

But again, no expert :)

Michal Hocko Nov. 27, 2019, 11:58 a.m. UTC | #9

On Wed 27-11-19 19:35:03, Yafang Shao wrote:
[...]
> > 3. I think I agree with Michal that modifying the limits smells more
> > like a configuration thingy to be handled by an admin (especially, adapt
> > min/max properly). But again, not sure where that change is located :)
> >
> 
> I agree with you all, but that is Michal told me to do. See above and
> the disccussion in this thread.

Look, I have tried to help you here. I have explained why force_empty is
not a part of cgroup v2. I have suggested to use hard limit to achieve
a similar outcome. The OOM killer is a natural part of the hard limit -
I guess I could have been more explicit about that. As Johannes noted
high limit can be used as well (you need to have a task in the memcg
context for that to be effective).

Since then you have tried to tweak the code here and there with a very
weak justification and now you are complaining and questioning my
expertise. Please think about your attitude.

Thanks!

Yafang Shao Nov. 27, 2019, 12:01 p.m. UTC | #10

On Wed, Nov 27, 2019 at 7:58 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 27-11-19 19:35:03, Yafang Shao wrote:
> [...]
> > > 3. I think I agree with Michal that modifying the limits smells more
> > > like a configuration thingy to be handled by an admin (especially, adapt
> > > min/max properly). But again, not sure where that change is located :)
> > >
> >
> > I agree with you all, but that is Michal told me to do. See above and
> > the disccussion in this thread.
>
> Look, I have tried to help you here.

Thanks for your help and patience.

> I have explained why force_empty is
> not a part of cgroup v2. I have suggested to use hard limit to achieve
> a similar outcome. The OOM killer is a natural part of the hard limit -
> I guess I could have been more explicit about that. As Johannes noted
> high limit can be used as well (you need to have a task in the memcg
> context for that to be effective).
>

I trust you so I tried your solution.

> Since then you have tried to tweak the code here and there with a very
> weak justification and now you are complaining and questioning my
> expertise. Please think about your attitude.
>

I'm sorry if my poor wording offended you; that is not what I meant to do.

Thanks
Yafang
