Message ID | 20210826220149.058089c6@imladris.surriel.com
---|---
State | New
Series | mm,vmscan: fix divide by zero in get_scan_count
On Thu, Aug 26, 2021 at 10:01:49PM -0400, Rik van Riel wrote:
> Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
> proportional memory.low reclaim") introduced a divide by zero corner
> case when oomd is being used in combination with cgroup memory.low
> protection.
>
> When oomd decides to kill a cgroup, it will force the cgroup memory
> to be reclaimed after killing the tasks, by writing to the memory.max
> file for that cgroup, forcing the remaining page cache and reclaimable
> slab to be reclaimed down to zero.
>
> Previously, on cgroups with some memory.low protection that would result
> in the memory being reclaimed down to the memory.low limit, or likely not
> at all, having the page cache reclaimed asynchronously later.
>
> With f56ce412a59d the oomd write to memory.max tries to reclaim all the
> way down to zero, which may race with another reclaimer, to the point of
> ending up with the divide by zero below.
>
> This patch implements the obvious fix.
>
> Fixes: f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim")
> Signed-off-by: Rik van Riel <riel@surriel.com>

It looks like we discover a new divide-by-zero corner case after every
functional change to the memory protection code :)

Acked-by: Roman Gushchin <guro@fb.com>

Thanks, Rik!

On Thu 26-08-21 22:01:49, Rik van Riel wrote:
> Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
> proportional memory.low reclaim") introduced a divide by zero corner
> case when oomd is being used in combination with cgroup memory.low
> protection.
>
[...]
>
> With f56ce412a59d the oomd write to memory.max tries to reclaim all the
> way down to zero, which may race with another reclaimer, to the point of
> ending up with the divide by zero below.
>
> This patch implements the obvious fix.

I must be missing something but how can cgroup_size be ever 0 when it is
max(cgroup_size, protection) and protection != 0?

On Mon, 2021-08-30 at 13:33 +0200, Michal Hocko wrote:
> On Thu 26-08-21 22:01:49, Rik van Riel wrote:
> > Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
> > proportional memory.low reclaim") introduced a divide by zero corner
> > case when oomd is being used in combination with cgroup memory.low
> > protection.
> >
[...]
> >
> > This patch implements the obvious fix.
>
> I must be missing something but how can cgroup_size be ever 0 when it
> is max(cgroup_size, protection) and protection != 0?

Going into the condition we use if (low || min), where
it is possible for low > 0 && min == 0.

Inside the conditional, we can end up testing against
min.

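To make this corner case concrete, here is a minimal userspace sketch of the arithmetic being discussed. The names are simplified stand-ins for get_scan_count()'s locals (this is not the kernel code), and it assumes the cgroup's usage has already raced down to zero while memory.low is set, memory.min is not, and low-reclaim mode is active:

```c
#include <stdio.h>

#define max(a, b) ((a) > (b) ? (a) : (b))

/* Simplified stand-in for the proportional branch in get_scan_count() */
static unsigned long scan_target(unsigned long lruvec_size,
				 unsigned long usage,	/* raced down to 0 */
				 unsigned long min, unsigned long low,
				 int memcg_low_reclaim)
{
	if (min || low) {	/* entered because low > 0 ... */
		/* ... but in low-reclaim mode we test against min == 0 */
		unsigned long protection = memcg_low_reclaim ? min : low;
		unsigned long cgroup_size = max(usage, protection);

		/* usage == protection == 0, so cgroup_size == 0 here:
		 * integer division by zero (SIGFPE on x86) */
		return lruvec_size - lruvec_size * protection / cgroup_size;
	}
	return lruvec_size;
}

int main(void)
{
	/* low is set, min is not, usage already reclaimed to zero */
	printf("%lu\n", scan_target(1024, 0, 0, 512, 1));
	return 0;
}
```
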
On Mon 30-08-21 09:24:22, Rik van Riel wrote:
> On Mon, 2021-08-30 at 13:33 +0200, Michal Hocko wrote:
[...]
> > I must be missing something but how can cgroup_size be ever 0 when it
> > is max(cgroup_size, protection) and protection != 0?
>
> Going into the condition we use if (low || min), where
> it is possible for low > 0 && min == 0.
>
> Inside the conditional, we can end up testing against
> min.

Dang, I was looking at the tree without f56ce412a59d7 applied. My bad!

Personally I would consider the following slightly easier to follow

	scan = lruvec_size - lruvec_size * protection / max(cgroup_size, 1);

The code is quite tricky already, and if you asked me what kind of effect
cgroup_size + 1 has there I would just shrug my shoulders...

Anyway, your fix will prevent the reported problem and I cannot see any
obvious problem with it either, so

Acked-by: Michal Hocko <mhocko@suse.com>

On Thu, Aug 26, 2021 at 10:01:49PM -0400, Rik van Riel wrote:
> Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
> proportional memory.low reclaim") introduced a divide by zero corner
> case when oomd is being used in combination with cgroup memory.low
> protection.
>
[...]
>
> With f56ce412a59d the oomd write to memory.max tries to reclaim all the
> way down to zero, which may race with another reclaimer, to the point of
> ending up with the divide by zero below.
>
> This patch implements the obvious fix.
>
> Fixes: f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim")
> Signed-off-by: Rik van Riel <riel@surriel.com>

That took me a second. Before the patch, that sc->memcg_low_reclaim test
was outside of that whole proportional reclaim branch. So if we were in
low reclaim mode we wouldn't even check if a low setting is in place; if
min is zero, we don't enter the proportional branch.

Now we enter if low is set but ignored, and then end up with
cgroup_size == min == 0 == divide by black hole.

Good catch.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeae2f6bc532..f1782b816c98 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2592,7 +2592,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  		cgroup_size = max(cgroup_size, protection);
>  
>  		scan = lruvec_size - lruvec_size * protection /
> -					cgroup_size;
> +					(cgroup_size + 1);

I have no overly strong preferences, but if Michal prefers max(), how
about:

	cgroup_size = max3(cgroup_size, protection, 1);

Or go back to not taking the branch in the first place when there is
no protection in effect...

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6247f6f4469a..9c200bb3ae51 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2547,7 +2547,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	mem_cgroup_protection(sc->target_mem_cgroup, memcg,
 			      &min, &low);
 
-	if (min || low) {
+	if (min || (!sc->memcg_low_reclaim && low)) {
 		/*
 		 * Scale a cgroup's reclaim pressure by proportioning
 		 * its current usage to its memory.low or memory.min

On Mon 30-08-21 16:48:03, Johannes Weiner wrote:
> On Thu, Aug 26, 2021 at 10:01:49PM -0400, Rik van Riel wrote:
[...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index eeae2f6bc532..f1782b816c98 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2592,7 +2592,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> >  		cgroup_size = max(cgroup_size, protection);
> >  
> >  		scan = lruvec_size - lruvec_size * protection /
> > -					cgroup_size;
> > +					(cgroup_size + 1);
>
> I have no overly strong preferences, but if Michal prefers max(), how
> about:
>
> 	cgroup_size = max3(cgroup_size, protection, 1);

Yes, this is better.

> Or go back to not taking the branch in the first place when there is
> no protection in effect...
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6247f6f4469a..9c200bb3ae51 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2547,7 +2547,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  	mem_cgroup_protection(sc->target_mem_cgroup, memcg,
>  			      &min, &low);
>  
> -	if (min || low) {
> +	if (min || (!sc->memcg_low_reclaim && low)) {
>  		/*
>  		 * Scale a cgroup's reclaim pressure by proportioning
>  		 * its current usage to its memory.low or memory.min

This is slightly more complex to read, but it is also better than the +1
trick. Either of the two works for me.

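For comparison, here is a quick userspace sketch (illustrative values only, not kernel code) of how the +1 and max3() variants behave, both in the degenerate zero case and under ordinary proportional pressure. It also makes visible what the +1 actually does to the result: it merely shifts the integer rounding.

```c
#include <stdio.h>

#define max(a, b)     ((a) > (b) ? (a) : (b))
#define max3(a, b, c) max(max((a), (b)), (c))

int main(void)
{
	unsigned long lruvec_size = 1000;
	struct { unsigned long usage, protection; } cases[] = {
		{ 0,    0   },	/* the divide-by-zero corner case */
		{ 2048, 512 },	/* ordinary proportional pressure */
	};

	for (int i = 0; i < 2; i++) {
		unsigned long usage = cases[i].usage;
		unsigned long prot  = cases[i].protection;
		unsigned long size  = max(usage, prot);

		/* Rik's fix: +1 in the divisor */
		unsigned long s1 = lruvec_size -
				   lruvec_size * prot / (size + 1);
		/* Johannes's suggestion: clamp the divisor to >= 1 */
		unsigned long s2 = lruvec_size -
				   lruvec_size * prot / max3(usage, prot, 1UL);

		printf("usage=%4lu prot=%3lu: +1 -> %lu, max3 -> %lu\n",
		       usage, prot, s1, s2);
	}
	/* Prints 1000 / 1000 for the zero case (full scan, since no
	 * protection is in effect) and 751 vs 750 for the second case:
	 * the +1 shifts the rounding by at most one page. */
	return 0;
}
```
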
Rik van Riel writes:
>Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
>proportional memory.low reclaim") introduced a divide by zero corner
>case when oomd is being used in combination with cgroup memory.low
>protection.
>
[...]
>
>This patch implements the obvious fix.
>
>Fixes: f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim")
>Signed-off-by: Rik van Riel <riel@surriel.com>

Thanks, good catch. No strong preference on this vs. max3(), so feel free
to keep my ack either way.

Acked-by: Chris Down <chris@chrisdown.name>

>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index eeae2f6bc532..f1782b816c98 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -2592,7 +2592,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> 		cgroup_size = max(cgroup_size, protection);
>
> 		scan = lruvec_size - lruvec_size * protection /
>-					cgroup_size;
>+					(cgroup_size + 1);
>
> 		/*
> 		 * Minimally target SWAP_CLUSTER_MAX pages to keep

On Tue, 2021-08-31 at 11:59 +0200, Michal Hocko wrote:
> On Mon 30-08-21 16:48:03, Johannes Weiner wrote:
> >
> > Or go back to not taking the branch in the first place when there is
> > no protection in effect...
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 6247f6f4469a..9c200bb3ae51 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2547,7 +2547,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> >  	mem_cgroup_protection(sc->target_mem_cgroup, memcg,
> >  			      &min, &low);
> >  
> > -	if (min || low) {
> > +	if (min || (!sc->memcg_low_reclaim && low)) {
> >  		/*
> >  		 * Scale a cgroup's reclaim pressure by proportioning
> >  		 * its current usage to its memory.low or memory.min
>
> This is slightly more complex to read, but it is also better than the
> +1 trick.

We could also fold it into the helper function, with
mem_cgroup_protection deciding whether to use low or
min for the protection limit, and then we key the rest
of our decisions off that.

Wait a minute, that's pretty much what mem_cgroup_protection
looked like before f56ce412a59d ("mm: memcontrol: fix occasional
OOMs due to proportional memory.low reclaim").

Now I'm confused how that changeset works.

Before f56ce412a59d, mem_cgroup_protection would return
memcg->memory.emin if sc->memcg_low_reclaim is true, and
memcg->memory.elow when not.

After f56ce412a59d, we still do the same thing. We just
also set sc->memcg_low_skipped so we know to come back
if we could not hit our target without skipping groups
with memory.low protection...

On Tue, Aug 31, 2021 at 11:48:28AM -0400, Rik van Riel wrote:
> On Tue, 2021-08-31 at 11:59 +0200, Michal Hocko wrote:
[...]
> We could also fold it into the helper function, with
> mem_cgroup_protection deciding whether to use low or
> min for the protection limit, and then we key the rest
> of our decisions off that.
>
> Wait a minute, that's pretty much what mem_cgroup_protection
> looked like before f56ce412a59d ("mm: memcontrol: fix occasional
> OOMs due to proportional memory.low reclaim").
>
> Now I'm confused how that changeset works.
>
> Before f56ce412a59d, mem_cgroup_protection would return
> memcg->memory.emin if sc->memcg_low_reclaim is true, and
> memcg->memory.elow when not.
>
> After f56ce412a59d, we still do the same thing. We just
> also set sc->memcg_low_skipped so we know to come back
> if we could not hit our target without skipping groups
> with memory.low protection...

Yeah, I just bubbled the sc->memcg_low_reclaim test up into vmscan.c so
we can modify sc->memcg_low_skipped based on it. scan_control is
vmscan.c-scoped and I tried to retain that, and to avoid doing things
like passing &sc->memcg_low_skipped into memcg code.

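The two helper shapes being compared look roughly like this; a simplified sketch of the semantics described in the exchange above, not the actual kernel functions (which also handle effective-protection calculation and the root cgroup):

```c
#include <stdio.h>

struct memcg_protection { unsigned long emin, elow; };

/* Before f56ce412a59d: the helper picked the limit by itself. */
static unsigned long protection_old(const struct memcg_protection *m,
				    int in_low_reclaim)
{
	return in_low_reclaim ? m->emin : m->elow;
}

/*
 * After f56ce412a59d: both limits are handed back, and the caller in
 * vmscan.c chooses -- which lets it also record that a memory.low
 * group was skipped (sc->memcg_low_skipped) so reclaim can retry.
 */
static void protection_new(const struct memcg_protection *m,
			   unsigned long *min, unsigned long *low)
{
	*min = m->emin;
	*low = m->elow;
}

int main(void)
{
	struct memcg_protection m = { .emin = 0, .elow = 512 };
	unsigned long min, low;

	protection_new(&m, &min, &low);
	printf("old: %lu, new: min=%lu low=%lu\n",
	       protection_old(&m, 1), min, low);
	return 0;
}
```
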
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeae2f6bc532..f1782b816c98 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2592,7 +2592,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		cgroup_size = max(cgroup_size, protection);
 
 		scan = lruvec_size - lruvec_size * protection /
-					cgroup_size;
+					(cgroup_size + 1);
 
 		/*
 		 * Minimally target SWAP_CLUSTER_MAX pages to keep

Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
proportional memory.low reclaim") introduced a divide by zero corner
case when oomd is being used in combination with cgroup memory.low
protection.

When oomd decides to kill a cgroup, it will force the cgroup memory
to be reclaimed after killing the tasks, by writing to the memory.max
file for that cgroup, forcing the remaining page cache and reclaimable
slab to be reclaimed down to zero.

Previously, on cgroups with some memory.low protection that would result
in the memory being reclaimed down to the memory.low limit, or likely not
at all, having the page cache reclaimed asynchronously later.

With f56ce412a59d the oomd write to memory.max tries to reclaim all the
way down to zero, which may race with another reclaimer, to the point of
ending up with the divide by zero below.

This patch implements the obvious fix.

Fixes: f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim")
Signed-off-by: Rik van Riel <riel@surriel.com>