Message ID: 8d35206601ccf0e1fe021d24405b2a0c2f4e052f.1613584277.git.tim.c.chen@linux.intel.com (mailing list archive)
State: New, archived
Series: Soft limit memory management bug fixes
On Wed 17-02-21 12:41:34, Tim Chen wrote:
> During soft limit memory reclaim, we will temporarily remove the target
> mem cgroup from the cgroup soft limit tree. We then perform memory
> reclaim, update the memory usage excess count and re-insert the mem
> cgroup back into the mem cgroup soft limit tree according to the new
> memory usage excess count.
>
> However, when memory reclaim failed for a maximum number of attempts
> and we bail out of the reclaim loop, we forgot to put the target mem
> cgroup chosen for next reclaim back to the soft limit tree. This prevented
> pages in the mem cgroup from being reclaimed in the future even though
> the mem cgroup exceeded its soft limit. Fix the logic and put the mem
> cgroup back on the tree when page reclaim failed for the mem cgroup.
>
> Reviewed-by: Ying Huang <ying.huang@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>

I have already acked this patch in the previous version along with Fixes
tag. It seems that my review feedback has been completely ignored also
for other patches in this series.

> ---
>  mm/memcontrol.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ed5cc78a8dbf..a51bf90732cb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3505,8 +3505,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>  			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
>  			break;
>  	} while (!nr_reclaimed);
> -	if (next_mz)
> +	if (next_mz) {
> +		spin_lock_irq(&mctz->lock);
> +		__mem_cgroup_insert_exceeded(next_mz, mctz, excess);
> +		spin_unlock_irq(&mctz->lock);
>  		css_put(&next_mz->memcg->css);
> +	}
>  	return nr_reclaimed;
>  }
>
> --
> 2.20.1
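For context, the loop this patch touches looks roughly like the condensed
sketch below (paraphrased from mm/memcontrol.c of that era; declarations
and bookkeeping trimmed, exact details vary by kernel version). The key
point is that __mem_cgroup_largest_soft_limit_node() removes the node it
returns from the tree and pins its css, so a next_mz left over when the
loop bails out is off the tree but, before this fix, was only ever
css_put():

unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
					    gfp_t gfp_mask,
					    unsigned long *total_scanned)
{
	...
	do {
		if (next_mz)
			mz = next_mz;
		else
			mz = mem_cgroup_largest_soft_limit_node(mctz);
		if (!mz)
			break;

		nr_scanned = 0;
		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat,
						    gfp_mask, &nr_scanned);
		nr_reclaimed += reclaimed;
		*total_scanned += nr_scanned;

		spin_lock_irq(&mctz->lock);
		__mem_cgroup_remove_exceeded(mz, mctz);

		/*
		 * If nothing was reclaimed, prefetch the next-largest
		 * node for the following iteration. This also removes
		 * it from the tree and takes a css reference.
		 */
		next_mz = NULL;
		if (!reclaimed)
			next_mz = __mem_cgroup_largest_soft_limit_node(mctz);

		/* mz itself is put back with its updated excess ... */
		excess = soft_limit_excess(mz->memcg);
		__mem_cgroup_insert_exceeded(mz, mctz, excess);
		spin_unlock_irq(&mctz->lock);
		css_put(&mz->memcg->css);

		loop++;
		if (!nr_reclaimed &&
		    (next_mz == NULL ||
		     loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
			break;
	} while (!nr_reclaimed);

	/* ... but next_mz, also off the tree, was only released: */
	if (next_mz)
		css_put(&next_mz->memcg->css);
	return nr_reclaimed;
}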
On 2/18/21 12:24 AM, Michal Hocko wrote:
>
> I have already acked this patch in the previous version along with Fixes
> tag. It seems that my review feedback has been completely ignored also
> for other patches in this series.

Michal,

My apology. Our mail system screwed up and some mail went missing, so I
completely missed your mail. I only saw it now after checking
lore.kernel.org.

Responding to your comment:

> Have you observed this happening in the real life? I do agree that the
> threshold based updates of the tree is not ideal but the whole soft
> reclaim code is far from optimal. So why do we care only now? The
> feature is essentially dead and fine tuning it sounds like a step back
> to me.

Yes, I did see the issue mentioned in patch 2 breaking soft limit
reclaim for cgroup v1. There are still some of our customers using
cgroup v1 so we would like to fix this if possible.

For patch 3 regarding uncharge_batch, it is more of an observation that
we should uncharge in batches of pages from the same node, not something
prompted by an actual workload. Thinking more about this, the worst that
could happen is that some entries in the soft limit tree overestimate
the memory used, which at most triggers an unnecessary soft page reclaim
on that cgroup. The overhead from the extra memcg event updates could
exceed that of a soft page reclaim pass. So let's drop patch 3 for now.

Let me know if you would like me to resend patch 1 with the Fixes tag
for commit 4e41695356fb ("memory controller: soft limit reclaim on
contention") and if there are any changes I should make for patch 2.

Thanks.

Tim
On Thu 18-02-21 10:30:20, Tim Chen wrote:
> On 2/18/21 12:24 AM, Michal Hocko wrote:
> >
> > I have already acked this patch in the previous version along with Fixes
> > tag. It seems that my review feedback has been completely ignored also
> > for other patches in this series.
>
> Michal,
>
> My apology. Our mail system screwed up and some mail went missing, so I
> completely missed your mail. I only saw it now after checking
> lore.kernel.org.

I see. My apology for suspecting you of ignoring my review.

> Responding to your comment:
>
> > Have you observed this happening in the real life? I do agree that the
> > threshold based updates of the tree is not ideal but the whole soft
> > reclaim code is far from optimal. So why do we care only now? The
> > feature is essentially dead and fine tuning it sounds like a step back
> > to me.
>
> Yes, I did see the issue mentioned in patch 2 breaking soft limit
> reclaim for cgroup v1. There are still some of our customers using
> cgroup v1 so we would like to fix this if possible.

It would be great to see more details.

> For patch 3 regarding uncharge_batch, it is more of an observation that
> we should uncharge in batches of pages from the same node, not something
> prompted by an actual workload. Thinking more about this, the worst that
> could happen is that some entries in the soft limit tree overestimate
> the memory used, which at most triggers an unnecessary soft page reclaim
> on that cgroup. The overhead from the extra memcg event updates could
> exceed that of a soft page reclaim pass. So let's drop patch 3 for now.

I would still prefer to handle that in the soft limit reclaim path and
check each memcg for soft limit excess before the reclaim.

> Let me know if you would like me to resend patch 1 with the Fixes tag
> for commit 4e41695356fb ("memory controller: soft limit reclaim on
> contention") and if there are any changes I should make for patch 2.

I will ack and suggest Fixes.
On Wed 17-02-21 12:41:34, Tim Chen wrote:
> During soft limit memory reclaim, we will temporarily remove the target
> mem cgroup from the cgroup soft limit tree. We then perform memory
> reclaim, update the memory usage excess count and re-insert the mem
> cgroup back into the mem cgroup soft limit tree according to the new
> memory usage excess count.
>
> However, when memory reclaim failed for a maximum number of attempts
> and we bail out of the reclaim loop, we forgot to put the target mem
> cgroup chosen for next reclaim back to the soft limit tree. This prevented
> pages in the mem cgroup from being reclaimed in the future even though
> the mem cgroup exceeded its soft limit. Fix the logic and put the mem
> cgroup back on the tree when page reclaim failed for the mem cgroup.
>
> Reviewed-by: Ying Huang <ying.huang@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>

Fixes: 4e41695356fb ("memory controller: soft limit reclaim on contention")
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/memcontrol.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ed5cc78a8dbf..a51bf90732cb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3505,8 +3505,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>  			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
>  			break;
>  	} while (!nr_reclaimed);
> -	if (next_mz)
> +	if (next_mz) {
> +		spin_lock_irq(&mctz->lock);
> +		__mem_cgroup_insert_exceeded(next_mz, mctz, excess);
> +		spin_unlock_irq(&mctz->lock);
>  		css_put(&next_mz->memcg->css);
> +	}
>  	return nr_reclaimed;
>  }
>
> --
> 2.20.1
On 2/18/21 11:13 AM, Michal Hocko wrote:
> On Thu 18-02-21 10:30:20, Tim Chen wrote:
>> On 2/18/21 12:24 AM, Michal Hocko wrote:
>>>
>>> I have already acked this patch in the previous version along with Fixes
>>> tag. It seems that my review feedback has been completely ignored also
>>> for other patches in this series.
>>
>> Michal,
>>
>> My apology. Our mail system screwed up and some mail went missing, so I
>> completely missed your mail. I only saw it now after checking
>> lore.kernel.org.
>
> I see. My apology for suspecting you of ignoring my review.
>
>> Responding to your comment:
>>
>>> Have you observed this happening in the real life? I do agree that the
>>> threshold based updates of the tree is not ideal but the whole soft
>>> reclaim code is far from optimal. So why do we care only now? The
>>> feature is essentially dead and fine tuning it sounds like a step back
>>> to me.
>>
>> Yes, I did see the issue mentioned in patch 2 breaking soft limit
>> reclaim for cgroup v1. There are still some of our customers using
>> cgroup v1 so we would like to fix this if possible.
>
> It would be great to see more details.

The scenario I saw was this: we have multiple cgroups running pmbench.
One cgroup exceeded the soft limit and soft reclaim was active on that
cgroup, so there were plenty of memcg events associated with it. Then
another cgroup started to exceed its soft limit, but its memory was
accessed at a much lower frequency. The memcg event update was never
triggered for the second cgroup because its updates didn't land on a
1024th sample, so it was not placed on the soft limit tree and we never
tried to reclaim its excess pages.

As time went on, the first cgroup was kept close to its soft limit by
reclaim activity, while the second cgroup's memory usage slowly crept
up: it kept missing the soft limit tree update because the update never
fell on a modulo-1024 sample. As a result, the second cgroup's memory
usage stayed above its soft limit for a long time, simply because its
updates occur relatively rarely.

>> For patch 3 regarding uncharge_batch, it is more of an observation that
>> we should uncharge in batches of pages from the same node, not something
>> prompted by an actual workload. Thinking more about this, the worst that
>> could happen is that some entries in the soft limit tree overestimate
>> the memory used, which at most triggers an unnecessary soft page reclaim
>> on that cgroup. The overhead from the extra memcg event updates could
>> exceed that of a soft page reclaim pass. So let's drop patch 3 for now.
>
> I would still prefer to handle that in the soft limit reclaim path and
> check each memcg for soft limit excess before the reclaim.

Something like this?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8bddee75f5cb..b50cae3b2a1a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3472,6 +3472,14 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 		if (!mz)
 			break;
 
+		/*
+		 * Soft limit tree is updated based on memcg events sampling.
+		 * We could have missed some updates on page uncharge and
+		 * the cgroup is below soft limit. Skip useless soft reclaim.
+		 */
+		if (!soft_limit_excess(mz->memcg))
+			continue;
+
 		nr_scanned = 0;
 		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat,

Tim
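The sampling Tim refers to is the event ratelimit in memcg_check_events():
the soft limit tree is only updated when a per-cpu page event counter
crosses a SOFTLIMIT_EVENTS_TARGET (1024) boundary. A condensed sketch,
paraphrased from mm/memcontrol.c of that era with only the soft limit leg
shown (the real code gates on a finer-grained threshold target first, and
field names vary across kernel versions):

#define SOFTLIMIT_EVENTS_TARGET 1024

static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
				       enum mem_cgroup_events_target target)
{
	unsigned long val, next;

	val = __this_cpu_read(memcg->vmstats_percpu->nr_page_events);
	next = __this_cpu_read(memcg->vmstats_percpu->targets[target]);
	/* Fires only once every SOFTLIMIT_EVENTS_TARGET page events. */
	if ((long)(next - val) < 0) {
		next = val + SOFTLIMIT_EVENTS_TARGET;
		__this_cpu_write(memcg->vmstats_percpu->targets[target], next);
		return true;
	}
	return false;
}

static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
{
	/*
	 * A memcg whose pages are touched rarely can sit above its soft
	 * limit for a long time before the ratelimit fires and the memcg
	 * is (re)inserted into the soft limit tree.
	 */
	if (mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_SOFTLIMIT))
		mem_cgroup_update_tree(memcg, page);
}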
On 2/18/21 11:13 AM, Michal Hocko wrote:
>
> Fixes: 4e41695356fb ("memory controller: soft limit reclaim on contention")
> Acked-by: Michal Hocko <mhocko@suse.com>
>
> Thanks!
>> ---
>>  mm/memcontrol.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index ed5cc78a8dbf..a51bf90732cb 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3505,8 +3505,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>>  			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
>>  			break;
>>  	} while (!nr_reclaimed);
>> -	if (next_mz)
>> +	if (next_mz) {
>> +		spin_lock_irq(&mctz->lock);
>> +		__mem_cgroup_insert_exceeded(next_mz, mctz, excess);
>> +		spin_unlock_irq(&mctz->lock);
>>  		css_put(&next_mz->memcg->css);
>> +	}
>>  	return nr_reclaimed;
>>  }
>>
>> --
>> 2.20.1

Michal,

Reviewing this patch a bit more, I realize that there is a chance that
the removed next_mz could be inserted back into the tree by a
memcg_check_events() that happens in between. So we need to make sure
that next_mz is indeed off the tree and update its excess value before
adding it back. I have updated the patch as below.

Thanks.

Tim

---
From 412764d1fad219b04c77bcb1cc8161067c8424f2 Mon Sep 17 00:00:00 2001
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Tue, 2 Feb 2021 15:53:21 -0800
Subject: [PATCH v3] mm: Fix dropped memcg from mem cgroup soft limit tree
To: Andrew Morton <akpm@linux-foundation.org>, Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@suse.cz>, Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>, Ying Huang <ying.huang@intel.com>, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

During soft limit memory reclaim, we will temporarily remove the target
mem cgroup from the cgroup soft limit tree. We then perform memory
reclaim, update the memory usage excess count and re-insert the mem
cgroup back into the mem cgroup soft limit tree according to the new
memory usage excess count.

However, when memory reclaim failed for a maximum number of attempts
and we bail out of the reclaim loop, we forgot to put the target mem
cgroup chosen for next reclaim back to the soft limit tree. This prevented
pages in the mem cgroup from being reclaimed in the future even though
the mem cgroup exceeded its soft limit. Fix the logic and put the mem
cgroup back on the tree when page reclaim failed for the mem cgroup.

Fixes: 4e41695356fb ("memory controller: soft limit reclaim on contention")
---
 mm/memcontrol.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ed5cc78a8dbf..bc9cc73ff66b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3505,8 +3505,18 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
 			break;
 	} while (!nr_reclaimed);
-	if (next_mz)
+	if (next_mz) {
+		/*
+		 * next_mz was removed in __mem_cgroup_largest_soft_limit_node.
+		 * Put it back in tree with latest excess value.
+		 */
+		spin_lock_irq(&mctz->lock);
+		__mem_cgroup_remove_exceeded(next_mz, mctz);
+		excess = soft_limit_excess(next_mz->memcg);
+		__mem_cgroup_insert_exceeded(next_mz, mctz, excess);
+		spin_unlock_irq(&mctz->lock);
 		css_put(&next_mz->memcg->css);
+	}
 	return nr_reclaimed;
 }
On Thu 04-03-21 09:35:08, Tim Chen wrote:
> On 2/18/21 11:13 AM, Michal Hocko wrote:
> >
> > Fixes: 4e41695356fb ("memory controller: soft limit reclaim on contention")
> > Acked-by: Michal Hocko <mhocko@suse.com>
> >
> > Thanks!
> >> ---
> >>  mm/memcontrol.c | 6 +++++-
> >>  1 file changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index ed5cc78a8dbf..a51bf90732cb 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -3505,8 +3505,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> >>  			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
> >>  			break;
> >>  	} while (!nr_reclaimed);
> >> -	if (next_mz)
> >> +	if (next_mz) {
> >> +		spin_lock_irq(&mctz->lock);
> >> +		__mem_cgroup_insert_exceeded(next_mz, mctz, excess);
> >> +		spin_unlock_irq(&mctz->lock);
> >>  		css_put(&next_mz->memcg->css);
> >> +	}
> >>  	return nr_reclaimed;
> >>  }
> >>
> >> --
> >> 2.20.1
>
> Michal,
>
> Reviewing this patch a bit more, I realize that there is a chance that
> the removed next_mz could be inserted back into the tree by a
> memcg_check_events() that happens in between. So we need to make sure
> that next_mz is indeed off the tree and update its excess value before
> adding it back. I have updated the patch as below.

This scenario is certainly possible but it shouldn't really matter much
as __mem_cgroup_insert_exceeded bails out when the node is on the tree
already.
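The bail-out Michal mentions sits at the top of
__mem_cgroup_insert_exceeded(). A condensed sketch (paraphrased from
mm/memcontrol.c; the rb-tree walk is elided), showing why a concurrent
re-insertion from memcg_check_events() makes the final insert a no-op:

static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
					 struct mem_cgroup_tree_per_node *mctz,
					 unsigned long new_usage_in_excess)
{
	/* Already (re)inserted by someone else, e.g. memcg_check_events(). */
	if (mz->on_tree)
		return;

	mz->usage_in_excess = new_usage_in_excess;
	if (!mz->usage_in_excess)
		return;

	/* ... insert into the rb-tree keyed on usage_in_excess ... */
}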
On 3/5/21 1:11 AM, Michal Hocko wrote:
> On Thu 04-03-21 09:35:08, Tim Chen wrote:
>> On 2/18/21 11:13 AM, Michal Hocko wrote:
>>>
>>> Fixes: 4e41695356fb ("memory controller: soft limit reclaim on contention")
>>> Acked-by: Michal Hocko <mhocko@suse.com>
>>>
>>> Thanks!
>>>> ---
>>>>  mm/memcontrol.c | 6 +++++-
>>>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>> index ed5cc78a8dbf..a51bf90732cb 100644
>>>> --- a/mm/memcontrol.c
>>>> +++ b/mm/memcontrol.c
>>>> @@ -3505,8 +3505,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>>>>  			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
>>>>  			break;
>>>>  	} while (!nr_reclaimed);
>>>> -	if (next_mz)
>>>> +	if (next_mz) {
>>>> +		spin_lock_irq(&mctz->lock);
>>>> +		__mem_cgroup_insert_exceeded(next_mz, mctz, excess);
>>>> +		spin_unlock_irq(&mctz->lock);
>>>>  		css_put(&next_mz->memcg->css);
>>>> +	}
>>>>  	return nr_reclaimed;
>>>>  }
>>>>
>>>> --
>>>> 2.20.1
>>
>> Michal,
>>
>> Reviewing this patch a bit more, I realize that there is a chance that
>> the removed next_mz could be inserted back into the tree by a
>> memcg_check_events() that happens in between. So we need to make sure
>> that next_mz is indeed off the tree and update its excess value before
>> adding it back. I have updated the patch as below.
>
> This scenario is certainly possible but it shouldn't really matter much
> as __mem_cgroup_insert_exceeded bails out when the node is on the tree
> already.

Makes sense. We should still update the excess value with

+	excess = soft_limit_excess(next_mz->memcg);
+	__mem_cgroup_insert_exceeded(next_mz, mctz, excess);

before doing the insertion. The excess value used previously was
recorded from the previous mz in the loop and needs to be updated to
that of next_mz.

Tim
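To make the staleness concrete: inside the reclaim loop, excess is
computed for the mz that was just reclaimed, so when the loop exits the
variable still holds that memcg's excess, not next_mz's. A sketch of the
fixed tail, consistent with the discussion above (same assumptions as the
earlier loop sketch):

	/* Inside the loop: excess belongs to the just-reclaimed mz. */
	excess = soft_limit_excess(mz->memcg);
	__mem_cgroup_insert_exceeded(mz, mctz, excess);

	/* After the loop: recompute for next_mz before re-inserting. */
	if (next_mz) {
		spin_lock_irq(&mctz->lock);
		excess = soft_limit_excess(next_mz->memcg);
		__mem_cgroup_insert_exceeded(next_mz, mctz, excess);
		spin_unlock_irq(&mctz->lock);
		css_put(&next_mz->memcg->css);
	}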
On Fri 05-03-21 11:07:59, Tim Chen wrote:
> On 3/5/21 1:11 AM, Michal Hocko wrote:
> > On Thu 04-03-21 09:35:08, Tim Chen wrote:
> >> On 2/18/21 11:13 AM, Michal Hocko wrote:
> >>>
> >>> Fixes: 4e41695356fb ("memory controller: soft limit reclaim on contention")
> >>> Acked-by: Michal Hocko <mhocko@suse.com>
> >>>
> >>> Thanks!
> >>>> ---
> >>>>  mm/memcontrol.c | 6 +++++-
> >>>>  1 file changed, 5 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>>> index ed5cc78a8dbf..a51bf90732cb 100644
> >>>> --- a/mm/memcontrol.c
> >>>> +++ b/mm/memcontrol.c
> >>>> @@ -3505,8 +3505,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> >>>>  			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
> >>>>  			break;
> >>>>  	} while (!nr_reclaimed);
> >>>> -	if (next_mz)
> >>>> +	if (next_mz) {
> >>>> +		spin_lock_irq(&mctz->lock);
> >>>> +		__mem_cgroup_insert_exceeded(next_mz, mctz, excess);
> >>>> +		spin_unlock_irq(&mctz->lock);
> >>>>  		css_put(&next_mz->memcg->css);
> >>>> +	}
> >>>>  	return nr_reclaimed;
> >>>>  }
> >>>>
> >>>> --
> >>>> 2.20.1
> >>
> >> Michal,
> >>
> >> Reviewing this patch a bit more, I realize that there is a chance that
> >> the removed next_mz could be inserted back into the tree by a
> >> memcg_check_events() that happens in between. So we need to make sure
> >> that next_mz is indeed off the tree and update its excess value before
> >> adding it back. I have updated the patch as below.
> >
> > This scenario is certainly possible but it shouldn't really matter much
> > as __mem_cgroup_insert_exceeded bails out when the node is on the tree
> > already.
>
> Makes sense. We should still update the excess value with
>
> +	excess = soft_limit_excess(next_mz->memcg);
> +	__mem_cgroup_insert_exceeded(next_mz, mctz, excess);
>
> before doing the insertion. The excess value used previously was
> recorded from the previous mz in the loop and needs to be updated to
> that of next_mz.

Yes. Sorry, I have missed that part previously.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ed5cc78a8dbf..a51bf90732cb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3505,8 +3505,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
 			break;
 	} while (!nr_reclaimed);
-	if (next_mz)
+	if (next_mz) {
+		spin_lock_irq(&mctz->lock);
+		__mem_cgroup_insert_exceeded(next_mz, mctz, excess);
+		spin_unlock_irq(&mctz->lock);
 		css_put(&next_mz->memcg->css);
+	}
 	return nr_reclaimed;
 }