Message ID | 1566177486-2649-1-git-send-email-laoar.shao@gmail.com (mailing list archive)
---|---
State | New, archived
Series | [v2] mm, memcg: skip killing processes under memcg protection at first scan
On Sun, Aug 18, 2019 at 09:18:06PM -0400, Yafang Shao wrote:
> In the current memory.min design, the system is going to do OOM instead
> of reclaiming the reclaimable pages protected by memory.min if the
> system is lack of free memory. While under this condition, the OOM
> killer may kill the processes in the memcg protected by memory.min.
> This behavior is very weird.
> In order to make it more reasonable, I make some changes in the OOM
> killer. In this patch, the OOM killer will do two-round scan. It will
> skip the processes under memcg protection at the first scan, and if it
> can't kill any processes it will rescan all the processes.
>
> Regarding the overhead this change may takes, I don't think it will be a
> problem because this only happens under system memory pressure and
> the OOM killer can't find any proper victims which are not under memcg
> protection.

Hi Yafang!

The idea makes sense at first glance, but I'm worried about mixing
per-memcg and per-process characteristics. It raises many questions:
1) if we do respect memory.min, why not memory.low too?
2) if the task is 200Gb large, does 10Mb memory protection make any
   difference? if so, why would we respect it?
3) if it works for global OOMs, why not memcg-level OOMs?
4) if the task is prioritized to be killed by OOM (via oom_score_adj),
   why does even a small memory protection prevent it completely?
5) if there are two tasks similar in size and both protected,
   should we prefer the one with the smaller protection?
etc.

Actually, I think it makes more sense to build a completely
cgroup-aware OOM killer, which will select the OOM victim by scanning
the memcg tree, not individual tasks. Then it can easily respect
memory.low/min in a reasonable way. But I failed to reach upstream
consensus on how it should work. You can search for "memcg-aware OOM
killer" in the lkml archive; there was a ton of discussion and many
patchset versions.
The code itself can be simplified a bit too, but I think it's not that important now. Thanks!
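[Editor's note: the two-round scan described in the patch can be sketched in plain userspace C. This is purely illustrative; `select_victim`, the `task` struct, and the fields are hypothetical stand-ins, not the kernel's actual types or code.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an OOM candidate: its badness score and
 * whether its memcg is protected by memory.min. */
struct task {
    int points;             /* badness score; higher = better victim */
    bool protected_by_min;  /* memcg protection flag */
};

/* Two-round selection as described in the patch: the first scan skips
 * tasks under memcg protection; if that yields no victim, rescan all
 * tasks.  Returns the index of the chosen victim, or -1 if none. */
static int select_victim(const struct task *tasks, size_t n)
{
    for (int round = 0; round < 2; round++) {
        int best = -1;

        for (size_t i = 0; i < n; i++) {
            if (round == 0 && tasks[i].protected_by_min)
                continue;  /* first scan: honor memory.min */
            if (best < 0 || tasks[i].points > tasks[best].points)
                best = (int)i;
        }
        if (best >= 0)
            return best;   /* found a victim; skip the second round */
    }
    return -1;
}
```

Note the cost concern from the patch description: the second pass only runs when the first pass finds no unprotected victim at all.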
On Tue, Aug 20, 2019 at 5:12 AM Roman Gushchin <guro@fb.com> wrote:
> On Sun, Aug 18, 2019 at 09:18:06PM -0400, Yafang Shao wrote:
> > In the current memory.min design, the system is going to do OOM instead
> > of reclaiming the reclaimable pages protected by memory.min if the
> > system is lack of free memory. While under this condition, the OOM
> > killer may kill the processes in the memcg protected by memory.min.
> > This behavior is very weird.
> > In order to make it more reasonable, I make some changes in the OOM
> > killer. In this patch, the OOM killer will do two-round scan. It will
> > skip the processes under memcg protection at the first scan, and if it
> > can't kill any processes it will rescan all the processes.
>
> Hi Yafang!
>
> The idea makes sense at first glance, but I'm worried about mixing
> per-memcg and per-process characteristics. It raises many questions:
> 1) if we do respect memory.min, why not memory.low too?

memory.low is different from memory.min, as the OOM killer will not be
invoked when it is reached. If memory.low should be considered as well,
we can use mem_cgroup_protected() here to replace
task_under_memcg_protection().

> 2) if the task is 200Gb large, does 10Mb memory protection make any
>    difference? if so, why would we respect it?

Same as above: only consider it when the protection is enabled.

> 3) if it works for global OOMs, why not memcg-level OOMs?

A memcg OOM happens when the memory limit is reached, nothing can be
reclaimed in the memcg, and processes in this memcg have to be killed.
That is different from a global OOM, because the global OOM can choose
processes outside the memcg while the memcg OOM can't.

> 4) if the task is prioritized to be killed by OOM (via oom_score_adj),
>    why does even a small memory protection prevent it completely?

Could you please show me some examples of when we would set both
memory.min (meaning the processes in this memcg are very important) and
a higher oom_score_adj (meaning the processes in this memcg are not
important at all)?
Note that memory.min doesn't know which process is important; it only
knows whether a process is in this memcg.

> 5) if there are two tasks similar in size and both protected,
>    should we prefer the one with the smaller protection?
> etc.

Same as the answer to 1).

> Actually, I think it makes more sense to build a completely
> cgroup-aware OOM killer, which will select the OOM victim by scanning
> the memcg tree, not individual tasks. Then it can easily respect
> memory.low/min in a reasonable way.

I haven't taken a close look at the memcg-aware OOM killer, but even
with a cgroup-aware OOM killer I think it still can't answer your
question 4).

> But I failed to reach upstream consensus on how it should work. You
> can search for "memcg-aware OOM killer" in the lkml archive; there was
> a ton of discussion and many patchset versions.

I will take a close look at it. Thanks for the reference.

> The code itself can be simplified a bit too, but I think it's not that
> important now. Thanks!

Thanks
Yafang
On Tue, Aug 20, 2019 at 09:16:01AM +0800, Yafang Shao wrote:
> On Tue, Aug 20, 2019 at 5:12 AM Roman Gushchin <guro@fb.com> wrote:
> > [...]
> > 2) if the task is 200Gb large, does 10Mb memory protection make any
> >    difference? if so, why would we respect it?
>
> Same as above: only consider it when the protection is enabled.

Right, but memory.min is a number, not a boolean flag. It defines how
much memory is protected. You're using it as an on-off knob, which is
sub-optimal from my point of view.

> > 3) if it works for global OOMs, why not memcg-level OOMs?
>
> A memcg OOM happens when the memory limit is reached, nothing can be
> reclaimed in the memcg, and processes in this memcg have to be killed.
> That is different from a global OOM, because the global OOM can choose
> processes outside the memcg while the memcg OOM can't.

Imagine the following hierarchy:

        /
        |
        A      memory.max = 10G, memory.min = 2G
       / \
      B   C    B: memory.min = 1G, C: memory.min = 0

Say you have a memcg OOM in A: why is B's memory.min not respected?
How is that different from the system-wide OOM?

> > 4) if the task is prioritized to be killed by OOM (via oom_score_adj),
> >    why does even a small memory protection prevent it completely?
>
> Could you please show me some examples of when we would set both
> memory.min and a higher oom_score_adj?

For instance, to prefer a specific process to be killed in case of
memcg OOM. Also, memory.min can be used mostly to preserve the
pagecache, and an OOM kill means nothing but some anon memory leak.
In this case, it makes no sense to protect the leaking task.

> > 5) if there are two tasks similar in size and both protected,
> >    should we prefer the one with the smaller protection?
> > etc.
>
> Same as the answer to 1).

So the problem is not that your patch is incorrect (or that the idea
is bad), but that you're defining a new policy, which will be
impossible or very hard to change later (as any other policy). So it's
important to define it very well. Using the memory.min number as a
binary flag for selecting tasks seems a bit limited.

Thanks!
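[Editor's note: Roman's hierarchy question can be made concrete with a toy model of hierarchical memory.min. This is a simplified, hypothetical sketch; the real kernel computes an effective protection from usage and parental distribution, so `under_min_protection` and `struct memcg` here are illustrative only. Sizes are in MB.]

```c
#include <assert.h>

/* Toy model of a memcg with a hierarchical memory.min. */
struct memcg {
    const struct memcg *parent;
    long min;    /* memory.min, MB */
    long usage;  /* current usage, MB */
};

/* Simplification: treat a memcg as protected only if its usage fits
 * within memory.min at every level up the tree, so a child cannot
 * claim more protection than its ancestors grant. */
static int under_min_protection(const struct memcg *cg)
{
    for (; cg; cg = cg->parent)
        if (cg->usage > cg->min)
            return 0;
    return 1;
}
```

With Roman's example (A: min = 2G; B: min = 1G; C: min = 0), B can be protected during an OOM in A while C never is, which is exactly the asymmetry the question is probing.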
On Tue, Aug 20, 2019 at 9:40 AM Roman Gushchin <guro@fb.com> wrote:
> [...]
> Right, but memory.min is a number, not a boolean flag. It defines how
> much memory is protected. You're using it as an on-off knob, which is
> sub-optimal from my point of view.

I mean using mem_cgroup_protected(), the same way memory.min is
implemented in the global reclaim path.

> Imagine the following hierarchy:
> [...]
> Say you have a memcg OOM in A: why is B's memory.min not respected?
> How is that different from the system-wide OOM?

Ah, this should be considered as well. Thanks for pointing it out.

> For instance, to prefer a specific process to be killed in case of
> memcg OOM. Also, memory.min can be used mostly to preserve the
> pagecache, and an OOM kill means nothing but some anon memory leak.
> In this case, it makes no sense to protect the leaking task.

But actually what memory.min protects is the memory usage, not the
pagecache: e.g. if the anon memory is higher than memory.min, then
memory.min can't protect file memory when swap is off.

Even if there is no anon memory leak, the OOM killer can also be
invoked due to excessive use of memory. Plus, memory.min can also
protect the leaked anon memory in the current implementation.

> So the problem is not that your patch is incorrect (or that the idea
> is bad), but that you're defining a new policy, which will be
> impossible or very hard to change later (as any other policy). So it's
> important to define it very well. Using the memory.min number as a
> binary flag for selecting tasks seems a bit limited.
>
> Thanks!
On Tue, Aug 20, 2019 at 10:01 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> [...]
> But actually what memory.min protects is the memory usage, not the
> pagecache: e.g. if the anon memory is higher than memory.min, then
> memory.min can't protect file memory when swap is off.
>
> Even if there is no anon memory leak, the OOM killer can also be
> invoked due to excessive use of memory. Plus, memory.min can also
> protect the leaked anon memory in the current implementation.

BTW, if two different memcgs open the same file, the memcg protection
will not work if one memcg is protected while the other is not. But
that may be a rare case.
On Tue 20-08-19 09:16:01, Yafang Shao wrote:
> On Tue, Aug 20, 2019 at 5:12 AM Roman Gushchin <guro@fb.com> wrote:
> > [...]
> > 1) if we do respect memory.min, why not memory.low too?
>
> memory.low is different from memory.min, as the OOM killer will not be
> invoked when it is reached.

Responded in the other email thread (please do not post two versions of
the patch on the same day, because it makes the conversation too
scattered and confusing).

Think of min limit protection as some sort of a more intelligent mlock.
It protects from the regular memory reclaim and it can lead to the OOM
situation (be it global or memcg), but it by no means prevents the
system from killing the workload if there is a need. Those two
decisions are simply orthogonal IMHO. The latter is an emergency action
while the former is to help guarantee a runtime behavior of the
workload.

To be completely fair, the OOM killer is a sort of memory reclaim as
well, so strictly speaking both mlock and memcg min protection could be
considered, but from any practical aspect I can think of I simply do
not see a strong usecase that would justify a more complex OOM
behavior. People will simply be confused that the selection is less
deterministic and therefore more confusing.
On Tue, Aug 20, 2019 at 2:40 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> Responded in the other email thread (please do not post two versions of
> the patch on the same day, because it makes the conversation too
> scattered and confusing).

(This is an issue about time zones :-) )

> Think of min limit protection as some sort of a more intelligent mlock.

From my perspective, it is a less intelligent mlock, because what it
protects may be garbage memory. As I said before, what it protects is
the memory usage, rather than a specific file memory or anon memory or
something else. The advantage of it is that it is easy to use.

> It protects from the regular memory reclaim and it can lead to the OOM
> situation (be it global or memcg), but it by no means prevents the
> system from killing the workload if there is a need. Those two
> decisions are simply orthogonal IMHO.

If it could handle the OOM kind of memory reclaim too, it would be more
intelligent.

> To be completely fair, the OOM killer is a sort of memory reclaim as
> well, so strictly speaking both mlock and memcg min protection could
> be considered, but from any practical aspect I can think of I simply
> do not see a strong usecase that would justify a more complex OOM
> behavior. People will simply be confused that the selection is less
> deterministic and therefore more confusing.

So what about adjusting the oom_score_adj automatically when we set
memory.min or mlock?

Thanks
Yafang
On Tue 20-08-19 15:15:54, Yafang Shao wrote:
> On Tue, Aug 20, 2019 at 2:40 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > Responded in the other email thread (please do not post two versions
> > of the patch on the same day, because it makes the conversation too
> > scattered and confusing).
>
> (This is an issue about time zones :-) )

Normally we wait a few days until feedback on the particular patch is
settled before a new version is posted.

> > Think of min limit protection as some sort of a more intelligent mlock.
>
> From my perspective, it is a less intelligent mlock, because what it
> protects may be garbage memory. As I said before, what it protects is
> the memory usage, rather than a specific file memory or anon memory or
> something else. The advantage of it is that it is easy to use.
>
> If it could handle the OOM kind of memory reclaim too, it would be
> more intelligent.

Can we get back to an actual usecase please?

> So what about adjusting the oom_score_adj automatically when we set
> memory.min or mlock?

oom_score_adj is a _user_ tuning. The kernel has no business in
auto-tuning it. It should just consume the value.
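[Editor's note: how the kernel "consumes" oom_score_adj is roughly the scaling done in oom_badness(): the adjustment is applied relative to the total pages the task could use. The sketch below is a simplified userspace model with illustrative names and numbers, not the exact kernel code; in particular, the real oom_badness() marks OOM_SCORE_ADJ_MIN tasks unkillable differently than returning 0.]

```c
#include <assert.h>

#define OOM_SCORE_ADJ_MIN (-1000)
#define OOM_SCORE_ADJ_MAX 1000

/* Simplified badness calculation: a task's memory footprint plus
 * oom_score_adj scaled against totalpages, clamped at zero.  Higher
 * result = more attractive OOM victim. */
static long badness(long mem_pages, int oom_score_adj, long totalpages)
{
    long points;

    if (oom_score_adj == OOM_SCORE_ADJ_MIN)
        return 0;                    /* modeled as: never kill */

    points = mem_pages;              /* stand-in for rss+swap+pgtables */
    points += (long)oom_score_adj * totalpages / 1000;
    return points > 0 ? points : 0;
}
```

This illustrates Michal's point: oom_score_adj is a per-process value the user sets and the kernel merely folds into the score; it is not something the kernel derives from memcg settings like memory.min.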
On Tue, Aug 20, 2019 at 3:27 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > If it could handle the OOM kind of memory reclaim too, it would be
> > more intelligent.
>
> Can we get back to an actual usecase please?

No real usecase. What concerns us is that if it can lead to more OOMs
but can't protect itself in OOM, that behavior seems a little weird.

Setting oom_score_adj is another choice, but there's no memcg-level
oom_score_adj. memory.min is memcg-level, while oom_score_adj is
process-level; that is weird as well.

> > So what about adjusting the oom_score_adj automatically when we set
> > memory.min or mlock?
>
> oom_score_adj is a _user_ tuning. The kernel has no business in
> auto-tuning it. It should just consume the value.
On Tue 20-08-19 15:49:20, Yafang Shao wrote:
> On Tue, Aug 20, 2019 at 3:27 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > Can we get back to an actual usecase please?
>
> No real usecase. What concerns us is that if it can lead to more OOMs
> but can't protect itself in OOM, that behavior seems a little weird.

This is a natural side effect of protecting memory from the reclaim,
i.e. an mlock kind of protection. Weird? I dunno. Unexpected? No!

> Setting oom_score_adj is another choice, but there's no memcg-level
> oom_score_adj. memory.min is memcg-level, while oom_score_adj is
> process-level; that is weird as well.

OOM is a per-process operation. Sure, we have the group-kill option,
but even then the selection is per-process. Without any clear usecase
in sight I do not think it makes sense to pursue this further.

Thanks!
On Tue, Aug 20, 2019 at 4:34 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 20-08-19 15:49:20, Yafang Shao wrote:
[...]
> > Setting oom_score_adj is another choice, but there's no memcg-level
> > oom_score_adj.
> > memory.min is memcg-level, while oom_score_adj is process-level; that
> > is weird as well.
>
> OOM is a per-process operation. Sure we have that group kill option but
> then still the selection is per-process.
>
> Without any clear usecase in sight I do not think it makes sense to
> pursue this further.

As there's a memory.oom.group option to select killing all processes
in a memcg, why not introduce a memcg-level memory.oom.score_adj?
Then we can set different scores for different memcgs.
Because we always deploy lots of containers on a single host, when OOM
occurs it would be better to prefer killing the low-priority containers
(those with a higher memory.oom.score_adj) first.

Thanks
Yafang
On Tue 20-08-19 16:55:12, Yafang Shao wrote:
> On Tue, Aug 20, 2019 at 4:34 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Without any clear usecase in sight I do not think it makes sense to
> > pursue this further.
>
> As there's a memory.oom.group option to select killing all processes
> in a memcg, why not introduce a memcg-level memory.oom.score_adj?

Because the oom selection is process based as already mentioned. There
was a long discussion about memcg based oom victim selection last year
but no consensus was reached.

> Then we can set different scores for different memcgs.
> Because we always deploy lots of containers on a single host, when OOM
> occurs it would be better to prefer killing the low-priority containers
> (those with a higher memory.oom.score_adj) first.

How would you define a low-priority container with score_adj?
On Tue, Aug 20, 2019 at 5:17 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 20-08-19 16:55:12, Yafang Shao wrote:
[...]
> > As there's a memory.oom.group option to select killing all processes
> > in a memcg, why not introduce a memcg-level memory.oom.score_adj?
>
> Because the oom selection is process based as already mentioned. There
> was a long discussion about memcg based oom victim selection last year
> but no consensus was reached.
>
> > Then we can set different scores for different memcgs.
> > Because we always deploy lots of containers on a single host, when OOM
> > occurs it would be better to prefer killing the low-priority containers
> > (those with a higher memory.oom.score_adj) first.
>
> How would you define a low-priority container with score_adj?

For example, Container-A is high priority and Container-B is low
priority. When the OOM killer is invoked we prefer to kill all
processes in Container-B and prevent Container-A from being killed.
So we set memory.oom.score_adj to -1000 for Container-A and to +1000
for Container-B, with memory.oom.group set for both containers.
When we set memory.oom.score_adj on a container, all processes
belonging to this container will have this value applied to their own
oom_score_adj. If a new process is created, its oom_score_adj will be
initialized from this memory.oom.score_adj as well.

Thanks
Yafang
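The propagation semantics described in this exchange (one per-container knob mirrored into each member task's oom_score_adj) can be sketched from userspace. Note that memory.oom.score_adj does not exist in the kernel; the knob, the container names, and the helper below are all hypothetical, illustrating the proposal rather than any real interface.

```python
# Sketch of the *hypothetical* memory.oom.score_adj propagation:
# one per-container value applied to every member process.

def propagate_oom_score_adj(container_scores, container_pids):
    """Compute the per-process oom_score_adj map a container manager
    would apply for the proposed per-container scores.

    container_scores: {container: score}
    container_pids:   {container: [pid, ...]}
    """
    per_process = {}
    for container, score in container_scores.items():
        # oom_score_adj is only valid in [-1000, 1000].
        score = max(-1000, min(1000, score))
        for pid in container_pids.get(container, []):
            per_process[pid] = score
    return per_process

scores = propagate_oom_score_adj(
    {"container-a": -1000, "container-b": 1000},  # A protected, B preferred victim
    {"container-a": [101, 102], "container-b": [201]},
)
print(scores)  # {101: -1000, 102: -1000, 201: 1000}
```

A real manager would write each value to /proc/<pid>/oom_score_adj and would also have to chase processes forked afterwards; that race is part of what a memcg-level knob would close.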
On Tue 20-08-19 17:26:49, Yafang Shao wrote:
> On Tue, Aug 20, 2019 at 5:17 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > > As there's a memory.oom.group option to select killing all processes
> > > in a memcg, why not introduce a memcg-level memory.oom.score_adj?
> >
> > Because the oom selection is process based as already mentioned. There
> > was a long discussion about memcg based oom victim selection last year
> > but no consensus was reached.
> >
> > > Then we can set different scores for different memcgs.
> > > Because we always deploy lots of containers on a single host, when OOM
> > > occurs it would be better to prefer killing the low-priority containers
> > > (those with a higher memory.oom.score_adj) first.
> >
> > How would you define a low-priority container with score_adj?
>
> For example, Container-A is high priority and Container-B is low
> priority. When the OOM killer is invoked we prefer to kill all
> processes in Container-B and prevent Container-A from being killed.
> So we set memory.oom.score_adj to -1000 for Container-A and to +1000
> for Container-B, with memory.oom.group set for both containers.
> When we set memory.oom.score_adj on a container, all processes
> belonging to this container will have this value applied to their own
> oom_score_adj.

I hope you can see that this on/off mechanism doesn't scale and thus it
is a dubious interface. Just think of multiple containers.
On Sun, Aug 18, 2019 at 09:18:06PM -0400, Yafang Shao wrote:
> In the current memory.min design, the system is going to do OOM instead
> of reclaiming the reclaimable pages protected by memory.min if the
> system is lack of free memory. While under this condition, the OOM
> killer may kill the processes in the memcg protected by memory.min.
[...]

Also, on second thought, what your patch really does is basically
guarantee that no process from a memory cgroup with memory.min set
will ever be killed (unless there are no other processes). In most
cases (at least on our setups) it basically makes such processes
immune to the OOM killer (similar to oom_score_adj set to -1000).

This is by far too strong a side effect of setting memory.min, so I
don't think the idea is acceptable at all.

Thanks!
On Wed, Aug 21, 2019 at 5:39 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Sun, Aug 18, 2019 at 09:18:06PM -0400, Yafang Shao wrote:
[...]
> Also, on second thought, what your patch really does is basically
> guarantee that no process from a memory cgroup with memory.min set
> will ever be killed (unless there are no other processes). In most
> cases (at least on our setups) it basically makes such processes
> immune to the OOM killer (similar to oom_score_adj set to -1000).

Actually it is between -999 and -1000.

> This is by far too strong a side effect of setting memory.min, so I
> don't think the idea is acceptable at all.

More possible OOMs is also a strong side effect (and it prevents us
from using it).
Leaving all the other work to userspace is not proper. We should
improve it.

Thanks
Yafang
On Wed 21-08-19 09:00:39, Yafang Shao wrote:
[...]
> More possible OOMs is also a strong side effect (and it prevents us
> from using it).

So why don't you use the low limit if the guarantee side of the min
limit is too strong for you?
On Wed, Aug 21, 2019 at 2:44 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 21-08-19 09:00:39, Yafang Shao wrote:
> [...]
> > More possible OOMs is also a strong side effect (and it prevents us
> > from using it).
>
> So why don't you use the low limit if the guarantee side of the min
> limit is too strong for you?

Well, I don't know what the best practice for memory.min is.
In our plan, we want to use it to protect the top-priority containers
(e.g. set memory.min to the same value as the memory limit), which may
be latency sensitive. Using memory.min may sometimes decrease refaults.
If we set it too low, it may be useless, because what memory.min
protects is not specified. And if there are bursts of anon memory
allocations in this memcg, memory.min may not protect any file memory.
I would appreciate it if you could show me some best practices for it.

Thanks
Yafang
On Wed 21-08-19 15:26:56, Yafang Shao wrote:
> On Wed, Aug 21, 2019 at 2:44 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> Well, I don't know what the best practice for memory.min is.

It is really a workload reclaim protection. Say you have a memory
consumer whose performance characteristics would be noticeably
disrupted by any memory reclaim, which would then lead to SLA
disruption. This is a strong requirement/QoS feature and as such comes
with its demands on configuration.

> In our plan, we want to use it to protect the top-priority containers
> (e.g. set memory.min to the same value as the memory limit), which may
> be latency sensitive. Using memory.min may sometimes decrease refaults.
> If we set it too low, it may be useless, because what memory.min
> protects is not specified. And if there are bursts of anon memory
> allocations in this memcg, memory.min may not protect any file memory.

I am still not seeing why you are considering the guarantee
(memory.min) rather than the best effort (memory.low) here?
On Wed, Aug 21, 2019 at 4:05 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 21-08-19 15:26:56, Yafang Shao wrote:
[...]
> I am still not seeing why you are considering the guarantee
> (memory.min) rather than the best effort (memory.low) here?

Let me show some examples.
Suppose we have three containers with different priorities: high
priority, medium priority and low priority.
Then we set memory.low on these containers as below:
  high priority:   memory.low is the same as memory.max
  medium priority: memory.low is 50% of memory.max
  low priority:    memory.low is 0

When all reclaimable pages without protection are reclaimed, the
reclaimer begins to reclaim the protected pages, but unfortunately it
doesn't know which pages belong to the high-priority container and
which pages belong to the medium-priority container. So the reclaimer
may reclaim from the high-priority container first, without reclaiming
from the medium-priority container at all.

Thanks
Yafang
On Wed 21-08-19 16:15:54, Yafang Shao wrote:
> On Wed, Aug 21, 2019 at 4:05 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> Let me show some examples.
> Suppose we have three containers with different priorities: high
> priority, medium priority and low priority.
> Then we set memory.low on these containers as below:
>   high priority:   memory.low is the same as memory.max
>   medium priority: memory.low is 50% of memory.max
>   low priority:    memory.low is 0
>
> When all reclaimable pages without protection are reclaimed, the
> reclaimer begins to reclaim the protected pages, but unfortunately it
> doesn't know which pages belong to the high-priority container and
> which pages belong to the medium-priority container. So the reclaimer
> may reclaim from the high-priority container first, without reclaiming
> from the medium-priority container at all.

Hmm, it is hard to comment on this configuration without knowing what
the overall consumption of all three is. In any case, reclaiming all
of the reclaimable memory means that you have actually reclaimed all
of the low and half of the medium container to even start hitting the
high-priority one. When there are only low-priority protected
containers, they will get reclaimed proportionally to their size.
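The proportional behavior described in this exchange can be modeled with a little arithmetic. The sketch below is a deliberate simplification of memory.low-style protection (effective protection is min(low, usage), and reclaim is spread proportionally over the unprotected excess), not the kernel's actual scan accounting:

```python
# Toy model of memory.low protection shaping reclaim pressure.

def reclaimable_excess(usage, low):
    """Bytes reclaim may take from a cgroup before touching memory
    protected by memory.low (effective protection = min(low, usage))."""
    return max(0, usage - min(low, usage))

def distribute_reclaim(cgroups, demand):
    """Spread `demand` bytes of reclaim across cgroups proportionally
    to their unprotected excess (rounding down; a real implementation
    would also deal with the remainder).

    cgroups: {name: (usage, low)}
    """
    excess = {name: reclaimable_excess(u, l) for name, (u, l) in cgroups.items()}
    total = sum(excess.values())
    if total == 0:
        return {name: 0 for name in cgroups}   # everything is protected
    demand = min(demand, total)
    return {name: demand * e // total for name, e in excess.items()}

# The three-container setup from the example above, usage = 100 each:
cgroups = {
    "high":   (100, 100),  # memory.low == memory.max
    "medium": (100, 50),   # memory.low == 50% of memory.max
    "low":    (100, 0),    # unprotected
}
print(distribute_reclaim(cgroups, 75))  # {'high': 0, 'medium': 25, 'low': 50}
```

This matches Michal's point: the unprotected and half-protected containers absorb all of the reclaim pressure before the fully protected one is touched at all.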
On Wed, Aug 21, 2019 at 4:34 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 21-08-19 16:15:54, Yafang Shao wrote:
[...]
> Hmm, it is hard to comment on this configuration without knowing what
> the overall consumption of all three is. In any case, reclaiming all
> of the reclaimable memory means that you have actually reclaimed all
> of the low and half of the medium container to even start hitting the
> high-priority one. When there are only low-priority protected
> containers, they will get reclaimed proportionally to their size.

Right. I think a priority-based reclaimer (where different priorities
get different proportional scan counts) would work better, while
memory.low is not easy to put into practice in this situation.

Thanks
Yafang
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 44c4146..534fe92 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -337,6 +337,7 @@ static inline bool mem_cgroup_disabled(void)
 
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 						struct mem_cgroup *memcg);
+bool task_under_memcg_protection(struct task_struct *p);
 
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 			  gfp_t gfp_mask, struct mem_cgroup **memcgp,
@@ -813,6 +814,11 @@ static inline enum mem_cgroup_protection mem_cgroup_protected(
 	return MEMCG_PROT_NONE;
 }
 
+static inline bool task_under_memcg_protection(struct task_struct *p)
+{
+	return false;
+}
+
 static inline int mem_cgroup_try_charge(struct page *page,
 					struct mm_struct *mm,
 					gfp_t gfp_mask,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cdbb7a8..8df3d88 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6030,6 +6030,22 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 	return MEMCG_PROT_NONE;
 }
 
+bool task_under_memcg_protection(struct task_struct *p)
+{
+	struct mem_cgroup *memcg;
+	bool protected;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(p);
+	if (memcg != root_mem_cgroup && memcg->memory.min)
+		protected = true;
+	else
+		protected = false;
+	rcu_read_unlock();
+
+	return protected;
+}
+
 /**
  * mem_cgroup_try_charge - try charging a page
  * @page: page to charge
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eda2e2a..e9bdad3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -367,12 +367,31 @@ static void select_bad_process(struct oom_control *oc)
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
 	else {
+		bool memcg_check = false;
+		bool memcg_skip = false;
 		struct task_struct *p;
+		int selected = 0;
 
 		rcu_read_lock();
-		for_each_process(p)
-			if (oom_evaluate_task(p, oc))
+retry:
+		for_each_process(p) {
+			if (!memcg_check && task_under_memcg_protection(p)) {
+				memcg_skip = true;
+				continue;
+			}
+			selected = oom_evaluate_task(p, oc);
+			if (selected)
 				break;
+		}
+
+		if (!selected) {
+			if (memcg_skip) {
+				if (!oc->chosen || oc->chosen == (void *)-1UL) {
+					memcg_check = true;
+					goto retry;
+				}
+			}
+		}
 		rcu_read_unlock();
 	}
 }
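As a rough userspace model of the control flow in the oom_kill.c hunk, the sketch below reduces oom_evaluate_task() and its oom_control bookkeeping to a simple max search over invented (name, points, protected) tuples; only the two-round skip-then-retry structure is taken from the patch.

```python
# Toy model of the patch's two-round victim selection: the first pass
# skips tasks in min-protected memcgs, and only if no victim was found
# does a second pass consider everyone.

def select_victim(tasks):
    """tasks: list of (name, badness_points, protected_by_min).
    Returns the name of the chosen victim, or None."""
    for check_protected in (False, True):   # two rounds, like the retry: loop
        chosen, chosen_points = None, -1
        for name, points, protected in tasks:
            if not check_protected and protected:
                continue                    # first round: skip protected tasks
            if points > chosen_points:
                chosen, chosen_points = name, points
        if chosen is not None:
            return chosen                   # victim found, no second round
    return None

tasks = [("web", 500, True), ("batch", 300, False)]
print(select_victim(tasks))                 # batch: the bigger task is protected
print(select_victim([("web", 500, True)]))  # web: second round kills it anyway
```

The model also illustrates Roman's observation above: as long as any unprotected candidate exists, tasks in min-protected memcgs are effectively immune, no matter how large they are.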
In the current memory.min design, the system is going to do OOM instead
of reclaiming the reclaimable pages protected by memory.min if the
system is lack of free memory. While under this condition, the OOM
killer may kill the processes in the memcg protected by memory.min.
This behavior is very weird.
In order to make it more reasonable, I make some changes in the OOM
killer. In this patch, the OOM killer will do a two-round scan. It will
skip the processes under memcg protection at the first scan, and if it
can't kill any processes it will rescan all the processes.

Regarding the overhead this change may take, I don't think it will be
a problem, because this only happens when the system is under memory
pressure and the OOM killer can't find any proper victims which are
not under memcg protection.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Yafang Shao <shaoyafang@didiglobal.com>
---
 include/linux/memcontrol.h |  6 ++++++
 mm/memcontrol.c            | 16 ++++++++++++++++
 mm/oom_kill.c              | 23 +++++++++++++++++++++--
 3 files changed, 43 insertions(+), 2 deletions(-)