diff mbox series

mm: list_lru: hold nlru lock to avoid reading transient negative nr_items

Message ID 20201130184514.551950-1-shy828301@gmail.com (mailing list archive)
State New, archived
Headers show
Series mm: list_lru: hold nlru lock to avoid reading transient negative nr_items | expand

Commit Message

Yang Shi Nov. 30, 2020, 6:45 p.m. UTC
When investigating a slab cache bloat problem, significant amount of
negative dentry cache was seen, but confusingly they neither got shrunk
by reclaimer (the host has very tight memory) nor be shrunk by dropping
cache.  The vmcore shows there are over 14M negative dentry objects on lru,
but tracing result shows they were even not scanned at all.  The further
investigation shows the memcg's vfs shrinker_map bit is not set.  So the
reclaimer or dropping cache just skip calling vfs shrinker.  So we have
to reboot the hosts to get the memory back.

I didn't manage to come up with a reproducer in test environment, and the
problem can't be reproduced after rebooting.  But it seems there is race
between shrinker map bit clear and reparenting by code inspection.  The
hypothesis is elaborated as below.

The memcg hierarchy on our production environment looks like:
                root
               /    \
          system   user

The main workloads are running under user slice's children, and it creates
and removes memcg frequently.  So reparenting happens very often under user
slice, but no task is under user slice directly.

So with the frequent reparenting and tight memory pressure, the below
hypothetical race condition may happen:

    CPU A                            CPU B                         CPU C
reparent
    dst->nr_items == 0
                                 shrinker:
                                     total_objects == 0
    add src->nr_items to dst
    set_bit
                                     retrun SHRINK_EMPTY
                                     clear_bit
                                                                  list_lru_del()
reparent again
    dst->nr_items may go negative
    due to current list_lru_del()
    on CPU C
                                 The second run of shrinker:
                                     read nr_items without any
                                     synchronization, so it may
                                     see intermediate negative
                                     nr_items then total_objects
                                     may return 0 conincidently

                                     keep the bit cleared
    dst->nr_items != 0
    skip set_bit
    add scr->nr_item to dst

After this point dst->nr_item may never go zero, so reparenting will not
set shrinker_map bit anymore.  And since there is no task under user
slice directly, so no new object will be added to its lru to set the
shrinker map bit either.  That bit is kept cleared forever.

How does list_lru_del() race with reparenting?  It is because
reparenting replaces childen's kmemcg_id to parent's without protecting
from nlru->lock, so list_lru_del() may see parent's kmemcg_id but
actually deleting items from child's lru, but dec'ing parent's nr_items,
so the parent's nr_items may go negative as commit
2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent list_lrus and
free kmemcg_id on css offline") says.

Can we move kmemcg_id replacement after reparenting?  No, because the
race with list_lru_del() may result in negative src->nr_items, but it
will never be fixed.  So the shrinker may never return SHRINK_EMPTY then
keep the shrinker map bit set always.  The shrinker will be always
called for nonsense.

Can we synchronize list_lru_del() and reparenting?  Yes, it could be
done.  But it seems we need introduce a new lock or use nlru->lock.  But
it sounds complicated to move kmemcg_id replacement code under nlru->lock.
And list_lru_del() may be called quite often to exacerbate some hot
path, i.e. dentry kill.

So, it sounds acceptable to synchronize reading nr_items to avoid seeing
intermediate negative nr_items given the simplicity and it is typically
just called by shrinkers when counting the freeable objects.

The patch is tested with some shrinker intensive workloads, no
noticeable regression is soptted.

Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/list_lru.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

Comments

Roman Gushchin Nov. 30, 2020, 8:09 p.m. UTC | #1
On Mon, Nov 30, 2020 at 10:45:14AM -0800, Yang Shi wrote:
> When investigating a slab cache bloat problem, significant amount of
> negative dentry cache was seen, but confusingly they neither got shrunk
> by reclaimer (the host has very tight memory) nor be shrunk by dropping
> cache.  The vmcore shows there are over 14M negative dentry objects on lru,
> but tracing result shows they were even not scanned at all.  The further
> investigation shows the memcg's vfs shrinker_map bit is not set.  So the
> reclaimer or dropping cache just skip calling vfs shrinker.  So we have
> to reboot the hosts to get the memory back.
> 
> I didn't manage to come up with a reproducer in test environment, and the
> problem can't be reproduced after rebooting.  But it seems there is race
> between shrinker map bit clear and reparenting by code inspection.  The
> hypothesis is elaborated as below.
> 
> The memcg hierarchy on our production environment looks like:
>                 root
>                /    \
>           system   user
> 
> The main workloads are running under user slice's children, and it creates
> and removes memcg frequently.  So reparenting happens very often under user
> slice, but no task is under user slice directly.
> 
> So with the frequent reparenting and tight memory pressure, the below
> hypothetical race condition may happen:
> 
>     CPU A                            CPU B                         CPU C
> reparent
>     dst->nr_items == 0
>                                  shrinker:
>                                      total_objects == 0
>     add src->nr_items to dst
>     set_bit
>                                      retrun SHRINK_EMPTY
>                                      clear_bit
>                                                                   list_lru_del()
> reparent again
>     dst->nr_items may go negative
>     due to current list_lru_del()
>     on CPU C
>                                  The second run of shrinker:
>                                      read nr_items without any
>                                      synchronization, so it may
>                                      see intermediate negative
>                                      nr_items then total_objects
>                                      may return 0 conincidently
> 
>                                      keep the bit cleared
>     dst->nr_items != 0
>     skip set_bit
>     add scr->nr_item to dst
> 
> After this point dst->nr_item may never go zero, so reparenting will not
> set shrinker_map bit anymore.  And since there is no task under user
> slice directly, so no new object will be added to its lru to set the
> shrinker map bit either.  That bit is kept cleared forever.
> 
> How does list_lru_del() race with reparenting?  It is because
> reparenting replaces childen's kmemcg_id to parent's without protecting
> from nlru->lock, so list_lru_del() may see parent's kmemcg_id but
> actually deleting items from child's lru, but dec'ing parent's nr_items,
> so the parent's nr_items may go negative as commit
> 2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent list_lrus and
> free kmemcg_id on css offline") says.
> 
> Can we move kmemcg_id replacement after reparenting?  No, because the
> race with list_lru_del() may result in negative src->nr_items, but it
> will never be fixed.  So the shrinker may never return SHRINK_EMPTY then
> keep the shrinker map bit set always.  The shrinker will be always
> called for nonsense.
> 
> Can we synchronize list_lru_del() and reparenting?  Yes, it could be
> done.  But it seems we need introduce a new lock or use nlru->lock.  But
> it sounds complicated to move kmemcg_id replacement code under nlru->lock.
> And list_lru_del() may be called quite often to exacerbate some hot
> path, i.e. dentry kill.
> 
> So, it sounds acceptable to synchronize reading nr_items to avoid seeing
> intermediate negative nr_items given the simplicity and it is typically
> just called by shrinkers when counting the freeable objects.
> 
> The patch is tested with some shrinker intensive workloads, no
> noticeable regression is soptted.

Hi Yang!

It's really tricky, thank you for digging in! It's a perfect analysis!

I wonder though, if it's better to just always set the shrinker bit on reparenting
if we do reparent some items? Then we'll avoid adding new synchronization
to the hot path. What do you think?

--

@@ -534,7 +534,6 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
 	struct list_lru_node *nlru = &lru->node[nid];
 	int dst_idx = dst_memcg->kmemcg_id;
 	struct list_lru_one *src, *dst;
-	bool set;
 
 	/*
 	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
@@ -546,9 +545,8 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
 	dst = list_lru_from_memcg_idx(nlru, dst_idx);
 
 	list_splice_init(&src->list, &dst->list);
-	set = (!dst->nr_items && src->nr_items);
 	dst->nr_items += src->nr_items;
-	if (set)
+	if (src->nr_items)
 		memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
 	src->nr_items = 0;
 

--

Btw, it seems that the bug is quite old. I wonder why we haven't seen it before?
Any ideas?


Thanks!


> 
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/list_lru.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index 5aa6e44bc2ae..5c128a7710ff 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -178,10 +178,17 @@ unsigned long list_lru_count_one(struct list_lru *lru,
>  	struct list_lru_one *l;
>  	unsigned long count;
>  
> -	rcu_read_lock();
> +	/*
> +	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
> +	 * we have to use IRQ-safe primitives here to avoid deadlock.
> +	 *
> +	 * Hold the lock to prevent from seeing transient negative
> +	 * nr_items value.
> +	 */
> +	spin_lock_irq(&nlru->lock);
>  	l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
>  	count = READ_ONCE(l->nr_items);
> -	rcu_read_unlock();
> +	spin_unlock_irq(&nlru->lock);
>  
>  	return count;
>  }
> -- 
> 2.26.2
>
Yang Shi Nov. 30, 2020, 8:57 p.m. UTC | #2
On Mon, Nov 30, 2020 at 12:09 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Mon, Nov 30, 2020 at 10:45:14AM -0800, Yang Shi wrote:
> > When investigating a slab cache bloat problem, significant amount of
> > negative dentry cache was seen, but confusingly they neither got shrunk
> > by reclaimer (the host has very tight memory) nor be shrunk by dropping
> > cache.  The vmcore shows there are over 14M negative dentry objects on lru,
> > but tracing result shows they were even not scanned at all.  The further
> > investigation shows the memcg's vfs shrinker_map bit is not set.  So the
> > reclaimer or dropping cache just skip calling vfs shrinker.  So we have
> > to reboot the hosts to get the memory back.
> >
> > I didn't manage to come up with a reproducer in test environment, and the
> > problem can't be reproduced after rebooting.  But it seems there is race
> > between shrinker map bit clear and reparenting by code inspection.  The
> > hypothesis is elaborated as below.
> >
> > The memcg hierarchy on our production environment looks like:
> >                 root
> >                /    \
> >           system   user
> >
> > The main workloads are running under user slice's children, and it creates
> > and removes memcg frequently.  So reparenting happens very often under user
> > slice, but no task is under user slice directly.
> >
> > So with the frequent reparenting and tight memory pressure, the below
> > hypothetical race condition may happen:
> >
> >     CPU A                            CPU B                         CPU C
> > reparent
> >     dst->nr_items == 0
> >                                  shrinker:
> >                                      total_objects == 0
> >     add src->nr_items to dst
> >     set_bit
> >                                      retrun SHRINK_EMPTY
> >                                      clear_bit
> >                                                                   list_lru_del()
> > reparent again
> >     dst->nr_items may go negative
> >     due to current list_lru_del()
> >     on CPU C
> >                                  The second run of shrinker:
> >                                      read nr_items without any
> >                                      synchronization, so it may
> >                                      see intermediate negative
> >                                      nr_items then total_objects
> >                                      may return 0 conincidently
> >
> >                                      keep the bit cleared
> >     dst->nr_items != 0
> >     skip set_bit
> >     add scr->nr_item to dst
> >
> > After this point dst->nr_item may never go zero, so reparenting will not
> > set shrinker_map bit anymore.  And since there is no task under user
> > slice directly, so no new object will be added to its lru to set the
> > shrinker map bit either.  That bit is kept cleared forever.
> >
> > How does list_lru_del() race with reparenting?  It is because
> > reparenting replaces childen's kmemcg_id to parent's without protecting
> > from nlru->lock, so list_lru_del() may see parent's kmemcg_id but
> > actually deleting items from child's lru, but dec'ing parent's nr_items,
> > so the parent's nr_items may go negative as commit
> > 2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent list_lrus and
> > free kmemcg_id on css offline") says.
> >
> > Can we move kmemcg_id replacement after reparenting?  No, because the
> > race with list_lru_del() may result in negative src->nr_items, but it
> > will never be fixed.  So the shrinker may never return SHRINK_EMPTY then
> > keep the shrinker map bit set always.  The shrinker will be always
> > called for nonsense.
> >
> > Can we synchronize list_lru_del() and reparenting?  Yes, it could be
> > done.  But it seems we need introduce a new lock or use nlru->lock.  But
> > it sounds complicated to move kmemcg_id replacement code under nlru->lock.
> > And list_lru_del() may be called quite often to exacerbate some hot
> > path, i.e. dentry kill.
> >
> > So, it sounds acceptable to synchronize reading nr_items to avoid seeing
> > intermediate negative nr_items given the simplicity and it is typically
> > just called by shrinkers when counting the freeable objects.
> >
> > The patch is tested with some shrinker intensive workloads, no
> > noticeable regression is soptted.
>
> Hi Yang!
>
> It's really tricky, thank you for digging in! It's a perfect analysis!
>
> I wonder though, if it's better to just always set the shrinker bit on reparenting
> if we do reparent some items? Then we'll avoid adding new synchronization
> to the hot path. What do you think?

Thanks a lot for the suggestion. I was thinking about the same
approach too, but I thought src->nr_items may go zero due to
concurrent list_lru_del() at the first place. But I just rethought the
whole thing, it seems impossible that dst->nr_items goes negative and
src->nr_items goes zero at the same time. list_lru_del() should just
see either dst or src, it can't manipulate both lists simultaneously.
So I think your suggestion should work. I will incarnate your
suggestion in v2.

>
> --
>
> @@ -534,7 +534,6 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
>         struct list_lru_node *nlru = &lru->node[nid];
>         int dst_idx = dst_memcg->kmemcg_id;
>         struct list_lru_one *src, *dst;
> -       bool set;
>
>         /*
>          * Since list_lru_{add,del} may be called under an IRQ-safe lock,
> @@ -546,9 +545,8 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
>         dst = list_lru_from_memcg_idx(nlru, dst_idx);
>
>         list_splice_init(&src->list, &dst->list);
> -       set = (!dst->nr_items && src->nr_items);
>         dst->nr_items += src->nr_items;
> -       if (set)
> +       if (src->nr_items)
>                 memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
>         src->nr_items = 0;
>
>
> --
>
> Btw, it seems that the bug is quite old. I wonder why we haven't seen it before?
> Any ideas?

It is not new, but not that old from my point of view. The
shrinker_map thing was introduced since v4.19, I bet pre-v4.19 kernel
may still dominate in production environment. And, it needs some
conditions (i.e. nr_inode + nr_dentry == 0 coincidently, and there is
not task under dst memcg directly, etc) to trigger, so it seems
unlikely to hit.

And the consequence may be not noticeable to the most people at all.
We happened to see frequent OOMs on a couple of small machines (32G
memory w/o swap, but most memory was consumed by anonymous pages)
recently and they were already up for long time (almost 300 days),
then the investigation leads to this race condition.

>
>
> Thanks!
>
>
> >
> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> > Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
> > Cc: Roman Gushchin <guro@fb.com>
> > Cc: Shakeel Butt <shakeelb@google.com>
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/list_lru.c | 11 +++++++++--
> >  1 file changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > index 5aa6e44bc2ae..5c128a7710ff 100644
> > --- a/mm/list_lru.c
> > +++ b/mm/list_lru.c
> > @@ -178,10 +178,17 @@ unsigned long list_lru_count_one(struct list_lru *lru,
> >       struct list_lru_one *l;
> >       unsigned long count;
> >
> > -     rcu_read_lock();
> > +     /*
> > +      * Since list_lru_{add,del} may be called under an IRQ-safe lock,
> > +      * we have to use IRQ-safe primitives here to avoid deadlock.
> > +      *
> > +      * Hold the lock to prevent from seeing transient negative
> > +      * nr_items value.
> > +      */
> > +     spin_lock_irq(&nlru->lock);
> >       l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
> >       count = READ_ONCE(l->nr_items);
> > -     rcu_read_unlock();
> > +     spin_unlock_irq(&nlru->lock);
> >
> >       return count;
> >  }
> > --
> > 2.26.2
> >
Roman Gushchin Nov. 30, 2020, 10:33 p.m. UTC | #3
On Mon, Nov 30, 2020 at 12:57:47PM -0800, Yang Shi wrote:
> On Mon, Nov 30, 2020 at 12:09 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Mon, Nov 30, 2020 at 10:45:14AM -0800, Yang Shi wrote:
> > > When investigating a slab cache bloat problem, significant amount of
> > > negative dentry cache was seen, but confusingly they neither got shrunk
> > > by reclaimer (the host has very tight memory) nor be shrunk by dropping
> > > cache.  The vmcore shows there are over 14M negative dentry objects on lru,
> > > but tracing result shows they were even not scanned at all.  The further
> > > investigation shows the memcg's vfs shrinker_map bit is not set.  So the
> > > reclaimer or dropping cache just skip calling vfs shrinker.  So we have
> > > to reboot the hosts to get the memory back.
> > >
> > > I didn't manage to come up with a reproducer in test environment, and the
> > > problem can't be reproduced after rebooting.  But it seems there is race
> > > between shrinker map bit clear and reparenting by code inspection.  The
> > > hypothesis is elaborated as below.
> > >
> > > The memcg hierarchy on our production environment looks like:
> > >                 root
> > >                /    \
> > >           system   user
> > >
> > > The main workloads are running under user slice's children, and it creates
> > > and removes memcg frequently.  So reparenting happens very often under user
> > > slice, but no task is under user slice directly.
> > >
> > > So with the frequent reparenting and tight memory pressure, the below
> > > hypothetical race condition may happen:
> > >
> > >     CPU A                            CPU B                         CPU C
> > > reparent
> > >     dst->nr_items == 0
> > >                                  shrinker:
> > >                                      total_objects == 0
> > >     add src->nr_items to dst
> > >     set_bit
> > >                                      retrun SHRINK_EMPTY
> > >                                      clear_bit
> > >                                                                   list_lru_del()
> > > reparent again
> > >     dst->nr_items may go negative
> > >     due to current list_lru_del()
> > >     on CPU C
> > >                                  The second run of shrinker:
> > >                                      read nr_items without any
> > >                                      synchronization, so it may
> > >                                      see intermediate negative
> > >                                      nr_items then total_objects
> > >                                      may return 0 conincidently
> > >
> > >                                      keep the bit cleared
> > >     dst->nr_items != 0
> > >     skip set_bit
> > >     add scr->nr_item to dst
> > >
> > > After this point dst->nr_item may never go zero, so reparenting will not
> > > set shrinker_map bit anymore.  And since there is no task under user
> > > slice directly, so no new object will be added to its lru to set the
> > > shrinker map bit either.  That bit is kept cleared forever.
> > >
> > > How does list_lru_del() race with reparenting?  It is because
> > > reparenting replaces childen's kmemcg_id to parent's without protecting
> > > from nlru->lock, so list_lru_del() may see parent's kmemcg_id but
> > > actually deleting items from child's lru, but dec'ing parent's nr_items,
> > > so the parent's nr_items may go negative as commit
> > > 2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent list_lrus and
> > > free kmemcg_id on css offline") says.

Also note that since the introduction of the slab reparenting, list_lru_from_kmem()
can return the parent lru.

> > >
> > > Can we move kmemcg_id replacement after reparenting?  No, because the
> > > race with list_lru_del() may result in negative src->nr_items, but it
> > > will never be fixed.  So the shrinker may never return SHRINK_EMPTY then
> > > keep the shrinker map bit set always.  The shrinker will be always
> > > called for nonsense.
> > >
> > > Can we synchronize list_lru_del() and reparenting?  Yes, it could be
> > > done.  But it seems we need introduce a new lock or use nlru->lock.  But
> > > it sounds complicated to move kmemcg_id replacement code under nlru->lock.
> > > And list_lru_del() may be called quite often to exacerbate some hot
> > > path, i.e. dentry kill.
> > >
> > > So, it sounds acceptable to synchronize reading nr_items to avoid seeing
> > > intermediate negative nr_items given the simplicity and it is typically
> > > just called by shrinkers when counting the freeable objects.
> > >
> > > The patch is tested with some shrinker intensive workloads, no
> > > noticeable regression is soptted.
> >
> > Hi Yang!
> >
> > It's really tricky, thank you for digging in! It's a perfect analysis!
> >
> > I wonder though, if it's better to just always set the shrinker bit on reparenting
> > if we do reparent some items? Then we'll avoid adding new synchronization
> > to the hot path. What do you think?
> 
> Thanks a lot for the suggestion. I was thinking about the same
> approach too, but I thought src->nr_items may go zero due to
> concurrent list_lru_del() at the first place. But I just rethought the
> whole thing, it seems impossible that dst->nr_items goes negative and
> src->nr_items goes zero at the same time.

Even if it would be possible, it seems less scary: the next reparenting
will likely set the bit. So we'll not get into the permanently bad state.

> list_lru_del() should just
> see either dst or src, it can't manipulate both lists simultaneously.
> So I think your suggestion should work. I will incarnate your
> suggestion in v2.
> 
> >
> > --
> >
> > @@ -534,7 +534,6 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
> >         struct list_lru_node *nlru = &lru->node[nid];
> >         int dst_idx = dst_memcg->kmemcg_id;
> >         struct list_lru_one *src, *dst;
> > -       bool set;
> >
> >         /*
> >          * Since list_lru_{add,del} may be called under an IRQ-safe lock,
> > @@ -546,9 +545,8 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
> >         dst = list_lru_from_memcg_idx(nlru, dst_idx);
> >
> >         list_splice_init(&src->list, &dst->list);
> > -       set = (!dst->nr_items && src->nr_items);
> >         dst->nr_items += src->nr_items;
> > -       if (set)
> > +       if (src->nr_items)
> >                 memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
> >         src->nr_items = 0;
> >
> >
> > --
> >
> > Btw, it seems that the bug is quite old. I wonder why we haven't seen it before?
> > Any ideas?
> 
> It is not new, but not that old from my point of view. The
> shrinker_map thing was introduced since v4.19, I bet pre-v4.19 kernel
> may still dominate in production environment. And, it needs some
> conditions (i.e. nr_inode + nr_dentry == 0 coincidently, and there is
> not task under dst memcg directly, etc) to trigger, so it seems
> unlikely to hit.
> 
> And the consequence may be not noticeable to the most people at all.
> We happened to see frequent OOMs on a couple of small machines (32G
> memory w/o swap, but most memory was consumed by anonymous pages)
> recently and they were already up for long time (almost 300 days),
> then the investigation leads to this race condition.

I agree that most users will unlikely notice it.

But https://www.spinics.net/lists/cgroups/msg27295.html looks very similar
and can be caused by the same problem. Once you'll have v2, let's ask
them to test it too.

Thanks!
Roman Gushchin Nov. 30, 2020, 10:52 p.m. UTC | #4
On Mon, Nov 30, 2020 at 12:57:47PM -0800, Yang Shi wrote:
> On Mon, Nov 30, 2020 at 12:09 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Mon, Nov 30, 2020 at 10:45:14AM -0800, Yang Shi wrote:
> > > When investigating a slab cache bloat problem, significant amount of
> > > negative dentry cache was seen, but confusingly they neither got shrunk
> > > by reclaimer (the host has very tight memory) nor be shrunk by dropping
> > > cache.  The vmcore shows there are over 14M negative dentry objects on lru,
> > > but tracing result shows they were even not scanned at all.  The further
> > > investigation shows the memcg's vfs shrinker_map bit is not set.  So the
> > > reclaimer or dropping cache just skip calling vfs shrinker.  So we have
> > > to reboot the hosts to get the memory back.
> > >
> > > I didn't manage to come up with a reproducer in test environment, and the
> > > problem can't be reproduced after rebooting.  But it seems there is race
> > > between shrinker map bit clear and reparenting by code inspection.  The
> > > hypothesis is elaborated as below.
> > >
> > > The memcg hierarchy on our production environment looks like:
> > >                 root
> > >                /    \
> > >           system   user
> > >
> > > The main workloads are running under user slice's children, and it creates
> > > and removes memcg frequently.  So reparenting happens very often under user
> > > slice, but no task is under user slice directly.
> > >
> > > So with the frequent reparenting and tight memory pressure, the below
> > > hypothetical race condition may happen:
> > >
> > >     CPU A                            CPU B                         CPU C
> > > reparent
> > >     dst->nr_items == 0
> > >                                  shrinker:
> > >                                      total_objects == 0
> > >     add src->nr_items to dst
> > >     set_bit
> > >                                      retrun SHRINK_EMPTY
> > >                                      clear_bit
> > >                                                                   list_lru_del()
> > > reparent again
> > >     dst->nr_items may go negative
> > >     due to current list_lru_del()
> > >     on CPU C
> > >                                  The second run of shrinker:
> > >                                      read nr_items without any
> > >                                      synchronization, so it may
> > >                                      see intermediate negative
> > >                                      nr_items then total_objects
> > >                                      may return 0 conincidently
> > >
> > >                                      keep the bit cleared
> > >     dst->nr_items != 0
> > >     skip set_bit
> > >     add scr->nr_item to dst

Btw, I think I have a simpler explanation:

A (0 objects)
|
B (N objects)

Let's say the reparenting races with the deletion of a single slab object.
list_lru_del() can see parent's lru list and substract 1 from nr_items == 0,
setting A's nr_items to -1 (the item is actually still in B's list).

memcg_drain_list_lru_node() will check !dst->nr_items && src->nr_items
!-1 && N => 0 and not set the bit. But now we have (N-1) objects in A's list
and the shrinker bit not set.

My proposed fix should resolve it. Alternatively, we maybe can check if
dst->nr_items <= 0 and only then set the bit, but it seems to be an unnecessary
optimization.

Thanks!
Yang Shi Nov. 30, 2020, 10:54 p.m. UTC | #5
On Mon, Nov 30, 2020 at 2:33 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Mon, Nov 30, 2020 at 12:57:47PM -0800, Yang Shi wrote:
> > On Mon, Nov 30, 2020 at 12:09 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Mon, Nov 30, 2020 at 10:45:14AM -0800, Yang Shi wrote:
> > > > When investigating a slab cache bloat problem, significant amount of
> > > > negative dentry cache was seen, but confusingly they neither got shrunk
> > > > by reclaimer (the host has very tight memory) nor be shrunk by dropping
> > > > cache.  The vmcore shows there are over 14M negative dentry objects on lru,
> > > > but tracing result shows they were even not scanned at all.  The further
> > > > investigation shows the memcg's vfs shrinker_map bit is not set.  So the
> > > > reclaimer or dropping cache just skip calling vfs shrinker.  So we have
> > > > to reboot the hosts to get the memory back.
> > > >
> > > > I didn't manage to come up with a reproducer in test environment, and the
> > > > problem can't be reproduced after rebooting.  But it seems there is race
> > > > between shrinker map bit clear and reparenting by code inspection.  The
> > > > hypothesis is elaborated as below.
> > > >
> > > > The memcg hierarchy on our production environment looks like:
> > > >                 root
> > > >                /    \
> > > >           system   user
> > > >
> > > > The main workloads are running under user slice's children, and it creates
> > > > and removes memcg frequently.  So reparenting happens very often under user
> > > > slice, but no task is under user slice directly.
> > > >
> > > > So with the frequent reparenting and tight memory pressure, the below
> > > > hypothetical race condition may happen:
> > > >
> > > >     CPU A                            CPU B                         CPU C
> > > > reparent
> > > >     dst->nr_items == 0
> > > >                                  shrinker:
> > > >                                      total_objects == 0
> > > >     add src->nr_items to dst
> > > >     set_bit
> > > >                                      retrun SHRINK_EMPTY
> > > >                                      clear_bit
> > > >                                                                   list_lru_del()
> > > > reparent again
> > > >     dst->nr_items may go negative
> > > >     due to current list_lru_del()
> > > >     on CPU C
> > > >                                  The second run of shrinker:
> > > >                                      read nr_items without any
> > > >                                      synchronization, so it may
> > > >                                      see intermediate negative
> > > >                                      nr_items then total_objects
> > > >                                      may return 0 conincidently
> > > >
> > > >                                      keep the bit cleared
> > > >     dst->nr_items != 0
> > > >     skip set_bit
> > > >     add scr->nr_item to dst
> > > >
> > > > After this point dst->nr_item may never go zero, so reparenting will not
> > > > set shrinker_map bit anymore.  And since there is no task under user
> > > > slice directly, so no new object will be added to its lru to set the
> > > > shrinker map bit either.  That bit is kept cleared forever.
> > > >
> > > > How does list_lru_del() race with reparenting?  It is because
> > > > reparenting replaces childen's kmemcg_id to parent's without protecting
> > > > from nlru->lock, so list_lru_del() may see parent's kmemcg_id but
> > > > actually deleting items from child's lru, but dec'ing parent's nr_items,
> > > > so the parent's nr_items may go negative as commit
> > > > 2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent list_lrus and
> > > > free kmemcg_id on css offline") says.
>
> Also note that since the introduction of the slab reparenting, list_lru_from_kmem()
> can return the parent lru.

Do you mean slab charge reparenting or lru reparenting? I think
list_lru_from_kmem() can return the parent lru since lru reparenting.

>
> > > >
> > > > Can we move kmemcg_id replacement after reparenting?  No, because the
> > > > race with list_lru_del() may result in negative src->nr_items, but it
> > > > will never be fixed.  So the shrinker may never return SHRINK_EMPTY then
> > > > keep the shrinker map bit set always.  The shrinker will be always
> > > > called for nonsense.
> > > >
> > > > Can we synchronize list_lru_del() and reparenting?  Yes, it could be
> > > > done.  But it seems we need introduce a new lock or use nlru->lock.  But
> > > > it sounds complicated to move kmemcg_id replacement code under nlru->lock.
> > > > And list_lru_del() may be called quite often to exacerbate some hot
> > > > path, i.e. dentry kill.
> > > >
> > > > So, it sounds acceptable to synchronize reading nr_items to avoid seeing
> > > > intermediate negative nr_items given the simplicity and it is typically
> > > > just called by shrinkers when counting the freeable objects.
> > > >
> > > > The patch is tested with some shrinker intensive workloads, no
> > > > noticeable regression is soptted.
> > >
> > > Hi Yang!
> > >
> > > It's really tricky, thank you for digging in! It's a perfect analysis!
> > >
> > > I wonder though, if it's better to just always set the shrinker bit on reparenting
> > > if we do reparent some items? Then we'll avoid adding new synchronization
> > > to the hot path. What do you think?
> >
> > Thanks a lot for the suggestion. I was thinking about the same
> > approach too, but I thought src->nr_items may go zero due to
> > concurrent list_lru_del() at the first place. But I just rethought the
> > whole thing, it seems impossible that dst->nr_items goes negative and
> > src->nr_items goes zero at the same time.
>
> Even if it would be possible, it seems less scary: the next reparenting
> will likely set the bit. So we'll not get into the permanently bad state.

Unfortunately, no. Once the race happens, reparenting won't set the
bit anymore since dst->nr_items won't go zero because the shrinker
will not be called.

>
> > list_lru_del() should just
> > see either dst or src, it can't manipulate both lists simultaneously.
> > So I think your suggestion should work. I will incarnate your
> > suggestion in v2.
> >
> > >
> > > --
> > >
> > > @@ -534,7 +534,6 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
> > >         struct list_lru_node *nlru = &lru->node[nid];
> > >         int dst_idx = dst_memcg->kmemcg_id;
> > >         struct list_lru_one *src, *dst;
> > > -       bool set;
> > >
> > >         /*
> > >          * Since list_lru_{add,del} may be called under an IRQ-safe lock,
> > > @@ -546,9 +545,8 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
> > >         dst = list_lru_from_memcg_idx(nlru, dst_idx);
> > >
> > >         list_splice_init(&src->list, &dst->list);
> > > -       set = (!dst->nr_items && src->nr_items);
> > >         dst->nr_items += src->nr_items;
> > > -       if (set)
> > > +       if (src->nr_items)
> > >                 memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
> > >         src->nr_items = 0;
> > >
> > >
> > > --
> > >
> > > Btw, it seems that the bug is quite old. I wonder why we haven't seen it before?
> > > Any ideas?
> >
> > It is not new, but not that old from my point of view. The
> > shrinker_map thing was introduced since v4.19, I bet pre-v4.19 kernel
> > may still dominate in production environment. And, it needs some
> > conditions (i.e. nr_inode + nr_dentry == 0 coincidently, and there is
> > not task under dst memcg directly, etc) to trigger, so it seems
> > unlikely to hit.
> >
> > And the consequence may be not noticeable to the most people at all.
> > We happened to see frequent OOMs on a couple of small machines (32G
> > memory w/o swap, but most memory was consumed by anonymous pages)
> > recently and they were already up for long time (almost 300 days),
> > then the investigation leads to this race condition.
>
> I agree that most users will unlikely notice it.
>
> But https://www.spinics.net/lists/cgroups/msg27295.html looks very similar
> and can be caused by the same problem. Once you'll have v2, let's ask
> them to test it too.

Yeah, may be the same root cause.

>
> Thanks!
Yang Shi Nov. 30, 2020, 10:57 p.m. UTC | #6
On Mon, Nov 30, 2020 at 2:53 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Mon, Nov 30, 2020 at 12:57:47PM -0800, Yang Shi wrote:
> > On Mon, Nov 30, 2020 at 12:09 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Mon, Nov 30, 2020 at 10:45:14AM -0800, Yang Shi wrote:
> > > > When investigating a slab cache bloat problem, significant amount of
> > > > negative dentry cache was seen, but confusingly they neither got shrunk
> > > > by reclaimer (the host has very tight memory) nor be shrunk by dropping
> > > > cache.  The vmcore shows there are over 14M negative dentry objects on lru,
> > > > but tracing result shows they were even not scanned at all.  The further
> > > > investigation shows the memcg's vfs shrinker_map bit is not set.  So the
> > > > reclaimer or dropping cache just skip calling vfs shrinker.  So we have
> > > > to reboot the hosts to get the memory back.
> > > >
> > > > I didn't manage to come up with a reproducer in test environment, and the
> > > > problem can't be reproduced after rebooting.  But it seems there is race
> > > > between shrinker map bit clear and reparenting by code inspection.  The
> > > > hypothesis is elaborated as below.
> > > >
> > > > The memcg hierarchy on our production environment looks like:
> > > >                 root
> > > >                /    \
> > > >           system   user
> > > >
> > > > The main workloads are running under user slice's children, and it creates
> > > > and removes memcg frequently.  So reparenting happens very often under user
> > > > slice, but no task is under user slice directly.
> > > >
> > > > So with the frequent reparenting and tight memory pressure, the below
> > > > hypothetical race condition may happen:
> > > >
> > > >     CPU A                            CPU B                         CPU C
> > > > reparent
> > > >     dst->nr_items == 0
> > > >                                  shrinker:
> > > >                                      total_objects == 0
> > > >     add src->nr_items to dst
> > > >     set_bit
> > > >                                      retrun SHRINK_EMPTY
> > > >                                      clear_bit
> > > >                                                                   list_lru_del()
> > > > reparent again
> > > >     dst->nr_items may go negative
> > > >     due to current list_lru_del()
> > > >     on CPU C
> > > >                                  The second run of shrinker:
> > > >                                      read nr_items without any
> > > >                                      synchronization, so it may
> > > >                                      see intermediate negative
> > > >                                      nr_items then total_objects
> > > >                                      may return 0 conincidently
> > > >
> > > >                                      keep the bit cleared
> > > >     dst->nr_items != 0
> > > >     skip set_bit
> > > >     add scr->nr_item to dst
>
> Btw, I think I have a simpler explanation:
>
> A (0 objects)
> |
> B (N objects)
>
> Let's say the reparenting races with the deletion of a single slab object.
> list_lru_del() can see parent's lru list and substract 1 from nr_items == 0,
> setting A's nr_items to -1 (the item is actually still in B's list).
>
> memcg_drain_list_lru_node() will check !dst->nr_items && src->nr_items
> !-1 && N => 0 and not set the bit. But now we have (N-1) objects in A's list
> and the shrinker bit not set.

Yes, this is the exact race I elaborated in the commit log.

>
> My proposed fix should resolve it. Alternatively, we maybe can check if
> dst->nr_items <= 0 and only then set the bit, but it seems to be an unnecessary
> optimization.

Yes, I think "src->nr_items != 0" is good enough.

>
> Thanks!
Roman Gushchin Nov. 30, 2020, 11:07 p.m. UTC | #7
On Mon, Nov 30, 2020 at 02:54:02PM -0800, Yang Shi wrote:
> On Mon, Nov 30, 2020 at 2:33 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Mon, Nov 30, 2020 at 12:57:47PM -0800, Yang Shi wrote:
> > > On Mon, Nov 30, 2020 at 12:09 PM Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > > On Mon, Nov 30, 2020 at 10:45:14AM -0800, Yang Shi wrote:
> > > > > When investigating a slab cache bloat problem, significant amount of
> > > > > negative dentry cache was seen, but confusingly they neither got shrunk
> > > > > by reclaimer (the host has very tight memory) nor be shrunk by dropping
> > > > > cache.  The vmcore shows there are over 14M negative dentry objects on lru,
> > > > > but tracing result shows they were even not scanned at all.  The further
> > > > > investigation shows the memcg's vfs shrinker_map bit is not set.  So the
> > > > > reclaimer or dropping cache just skip calling vfs shrinker.  So we have
> > > > > to reboot the hosts to get the memory back.
> > > > >
> > > > > I didn't manage to come up with a reproducer in test environment, and the
> > > > > problem can't be reproduced after rebooting.  But it seems there is race
> > > > > between shrinker map bit clear and reparenting by code inspection.  The
> > > > > hypothesis is elaborated as below.
> > > > >
> > > > > The memcg hierarchy on our production environment looks like:
> > > > >                 root
> > > > >                /    \
> > > > >           system   user
> > > > >
> > > > > The main workloads are running under user slice's children, and it creates
> > > > > and removes memcg frequently.  So reparenting happens very often under user
> > > > > slice, but no task is under user slice directly.
> > > > >
> > > > > So with the frequent reparenting and tight memory pressure, the below
> > > > > hypothetical race condition may happen:
> > > > >
> > > > >     CPU A                            CPU B                         CPU C
> > > > > reparent
> > > > >     dst->nr_items == 0
> > > > >                                  shrinker:
> > > > >                                      total_objects == 0
> > > > >     add src->nr_items to dst
> > > > >     set_bit
> > > > >                                      retrun SHRINK_EMPTY
> > > > >                                      clear_bit
> > > > >                                                                   list_lru_del()
> > > > > reparent again
> > > > >     dst->nr_items may go negative
> > > > >     due to current list_lru_del()
> > > > >     on CPU C
> > > > >                                  The second run of shrinker:
> > > > >                                      read nr_items without any
> > > > >                                      synchronization, so it may
> > > > >                                      see intermediate negative
> > > > >                                      nr_items then total_objects
> > > > >                                      may return 0 conincidently
> > > > >
> > > > >                                      keep the bit cleared
> > > > >     dst->nr_items != 0
> > > > >     skip set_bit
> > > > >     add scr->nr_item to dst
> > > > >
> > > > > After this point dst->nr_item may never go zero, so reparenting will not
> > > > > set shrinker_map bit anymore.  And since there is no task under user
> > > > > slice directly, so no new object will be added to its lru to set the
> > > > > shrinker map bit either.  That bit is kept cleared forever.
> > > > >
> > > > > How does list_lru_del() race with reparenting?  It is because
> > > > > reparenting replaces childen's kmemcg_id to parent's without protecting
> > > > > from nlru->lock, so list_lru_del() may see parent's kmemcg_id but
> > > > > actually deleting items from child's lru, but dec'ing parent's nr_items,
> > > > > so the parent's nr_items may go negative as commit
> > > > > 2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent list_lrus and
> > > > > free kmemcg_id on css offline") says.
> >
> > Also note that since the introduction of the slab reparenting, list_lru_from_kmem()
> > can return the parent lru.
> 
> Do you mean slab charge reparenting or lru reparenting? I think
> list_lru_from_kmem() can return the parent lru since lru reparenting.

objcg reparenting to be precise. It's actually kinda weird now, because
there are two slightly different reparenting mechanisms. We might to
wanna merge them in the future.

> 
> >
> > > > >
> > > > > Can we move kmemcg_id replacement after reparenting?  No, because the
> > > > > race with list_lru_del() may result in negative src->nr_items, but it
> > > > > will never be fixed.  So the shrinker may never return SHRINK_EMPTY then
> > > > > keep the shrinker map bit set always.  The shrinker will be always
> > > > > called for nonsense.
> > > > >
> > > > > Can we synchronize list_lru_del() and reparenting?  Yes, it could be
> > > > > done.  But it seems we need introduce a new lock or use nlru->lock.  But
> > > > > it sounds complicated to move kmemcg_id replacement code under nlru->lock.
> > > > > And list_lru_del() may be called quite often to exacerbate some hot
> > > > > path, i.e. dentry kill.
> > > > >
> > > > > So, it sounds acceptable to synchronize reading nr_items to avoid seeing
> > > > > intermediate negative nr_items given the simplicity and it is typically
> > > > > just called by shrinkers when counting the freeable objects.
> > > > >
> > > > > The patch is tested with some shrinker intensive workloads, no
> > > > > noticeable regression is soptted.
> > > >
> > > > Hi Yang!
> > > >
> > > > It's really tricky, thank you for digging in! It's a perfect analysis!
> > > >
> > > > I wonder though, if it's better to just always set the shrinker bit on reparenting
> > > > if we do reparent some items? Then we'll avoid adding new synchronization
> > > > to the hot path. What do you think?
> > >
> > > Thanks a lot for the suggestion. I was thinking about the same
> > > approach too, but I thought src->nr_items may go zero due to
> > > concurrent list_lru_del() at the first place. But I just rethought the
> > > whole thing, it seems impossible that dst->nr_items goes negative and
> > > src->nr_items goes zero at the same time.
> >
> > Even if it would be possible, it seems less scary: the next reparenting
> > will likely set the bit. So we'll not get into the permanently bad state.
> 
> Unfortunately, no. Once the race happens, reparenting won't set the
> bit anymore since dst->nr_items won't go zero because the shrinker
> will not be called.

I mean if we don't check dst->nr_items. Anyway, because it's an impossible case,
no matter to discuss it :)
Roman Gushchin Nov. 30, 2020, 11:09 p.m. UTC | #8
On Mon, Nov 30, 2020 at 02:57:23PM -0800, Yang Shi wrote:
> On Mon, Nov 30, 2020 at 2:53 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Mon, Nov 30, 2020 at 12:57:47PM -0800, Yang Shi wrote:
> > > On Mon, Nov 30, 2020 at 12:09 PM Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > > On Mon, Nov 30, 2020 at 10:45:14AM -0800, Yang Shi wrote:
> > > > > When investigating a slab cache bloat problem, significant amount of
> > > > > negative dentry cache was seen, but confusingly they neither got shrunk
> > > > > by reclaimer (the host has very tight memory) nor be shrunk by dropping
> > > > > cache.  The vmcore shows there are over 14M negative dentry objects on lru,
> > > > > but tracing result shows they were even not scanned at all.  The further
> > > > > investigation shows the memcg's vfs shrinker_map bit is not set.  So the
> > > > > reclaimer or dropping cache just skip calling vfs shrinker.  So we have
> > > > > to reboot the hosts to get the memory back.
> > > > >
> > > > > I didn't manage to come up with a reproducer in test environment, and the
> > > > > problem can't be reproduced after rebooting.  But it seems there is race
> > > > > between shrinker map bit clear and reparenting by code inspection.  The
> > > > > hypothesis is elaborated as below.
> > > > >
> > > > > The memcg hierarchy on our production environment looks like:
> > > > >                 root
> > > > >                /    \
> > > > >           system   user
> > > > >
> > > > > The main workloads are running under user slice's children, and it creates
> > > > > and removes memcg frequently.  So reparenting happens very often under user
> > > > > slice, but no task is under user slice directly.
> > > > >
> > > > > So with the frequent reparenting and tight memory pressure, the below
> > > > > hypothetical race condition may happen:
> > > > >
> > > > >     CPU A                            CPU B                         CPU C
> > > > > reparent
> > > > >     dst->nr_items == 0
> > > > >                                  shrinker:
> > > > >                                      total_objects == 0
> > > > >     add src->nr_items to dst
> > > > >     set_bit
> > > > >                                      retrun SHRINK_EMPTY
> > > > >                                      clear_bit
> > > > >                                                                   list_lru_del()
> > > > > reparent again
> > > > >     dst->nr_items may go negative
> > > > >     due to current list_lru_del()
> > > > >     on CPU C
> > > > >                                  The second run of shrinker:
> > > > >                                      read nr_items without any
> > > > >                                      synchronization, so it may
> > > > >                                      see intermediate negative
> > > > >                                      nr_items then total_objects
> > > > >                                      may return 0 conincidently
> > > > >
> > > > >                                      keep the bit cleared
> > > > >     dst->nr_items != 0
> > > > >     skip set_bit
> > > > >     add scr->nr_item to dst
> >
> > Btw, I think I have a simpler explanation:
> >
> > A (0 objects)
> > |
> > B (N objects)
> >
> > Let's say the reparenting races with the deletion of a single slab object.
> > list_lru_del() can see parent's lru list and substract 1 from nr_items == 0,
> > setting A's nr_items to -1 (the item is actually still in B's list).
> >
> > memcg_drain_list_lru_node() will check !dst->nr_items && src->nr_items
> > !-1 && N => 0 and not set the bit. But now we have (N-1) objects in A's list
> > and the shrinker bit not set.
> 
> Yes, this is the exact race I elaborated in the commit log.

Yes, the same problem for sure, I just think if we don't need to actually run
into the shrinker code to mentally reproduce it, it's a bit easier model to follow.

> 
> >
> > My proposed fix should resolve it. Alternatively, we maybe can check if
> > dst->nr_items <= 0 and only then set the bit, but it seems to be an unnecessary
> > optimization.
> 
> Yes, I think "src->nr_items != 0" is good enough.

I agree.
Kirill Tkhai Dec. 1, 2020, 10:09 a.m. UTC | #9
On 30.11.2020 21:45, Yang Shi wrote:
> When investigating a slab cache bloat problem, significant amount of
> negative dentry cache was seen, but confusingly they neither got shrunk
> by reclaimer (the host has very tight memory) nor be shrunk by dropping
> cache.  The vmcore shows there are over 14M negative dentry objects on lru,
> but tracing result shows they were even not scanned at all.  The further
> investigation shows the memcg's vfs shrinker_map bit is not set.  So the
> reclaimer or dropping cache just skip calling vfs shrinker.  So we have
> to reboot the hosts to get the memory back.
> 
> I didn't manage to come up with a reproducer in test environment, and the
> problem can't be reproduced after rebooting.  But it seems there is race
> between shrinker map bit clear and reparenting by code inspection.  The
> hypothesis is elaborated as below.
> 
> The memcg hierarchy on our production environment looks like:
>                 root
>                /    \
>           system   user
> 
> The main workloads are running under user slice's children, and it creates
> and removes memcg frequently.  So reparenting happens very often under user
> slice, but no task is under user slice directly.
> 
> So with the frequent reparenting and tight memory pressure, the below
> hypothetical race condition may happen:
> 
>     CPU A                            CPU B                         CPU C
> reparent
>     dst->nr_items == 0
>                                  shrinker:
>                                      total_objects == 0
>     add src->nr_items to dst
>     set_bit
>                                      retrun SHRINK_EMPTY
>                                      clear_bit
>                                                                   list_lru_del()
> reparent again
>     dst->nr_items may go negative
>     due to current list_lru_del()
>     on CPU C
>                                  The second run of shrinker:
>                                      read nr_items without any
>                                      synchronization, so it may
>                                      see intermediate negative
>                                      nr_items then total_objects
>                                      may return 0 conincidently
> 
>                                      keep the bit cleared
>     dst->nr_items != 0
>     skip set_bit
>     add scr->nr_item to dst

Good catch, Yang. Thanks for investigating this.

But I agree with Roman it's better to fix that in rare-called place
(memcg_drain_list_lru_node()), than in hot place (list_lru_count_one()).

Also, I'd added to description of new patch a reference to memcg_offline_kmem(),
because this is the place, where child->kmemcg_id is rewritten, and
this is the reason of lru's nr_items may become negative.
 
> After this point dst->nr_item may never go zero, so reparenting will not
> set shrinker_map bit anymore.  And since there is no task under user
> slice directly, so no new object will be added to its lru to set the
> shrinker map bit either.  That bit is kept cleared forever.
> 
> How does list_lru_del() race with reparenting?  It is because
> reparenting replaces childen's kmemcg_id to parent's without protecting
> from nlru->lock, so list_lru_del() may see parent's kmemcg_id but
> actually deleting items from child's lru, but dec'ing parent's nr_items,
> so the parent's nr_items may go negative as commit
> 2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent list_lrus and
> free kmemcg_id on css offline") says.
> 
> Can we move kmemcg_id replacement after reparenting?  No, because the
> race with list_lru_del() may result in negative src->nr_items, but it
> will never be fixed.  So the shrinker may never return SHRINK_EMPTY then
> keep the shrinker map bit set always.  The shrinker will be always
> called for nonsense.
> 
> Can we synchronize list_lru_del() and reparenting?  Yes, it could be
> done.  But it seems we need introduce a new lock or use nlru->lock.  But
> it sounds complicated to move kmemcg_id replacement code under nlru->lock.
> And list_lru_del() may be called quite often to exacerbate some hot
> path, i.e. dentry kill.
> 
> So, it sounds acceptable to synchronize reading nr_items to avoid seeing
> intermediate negative nr_items given the simplicity and it is typically
> just called by shrinkers when counting the freeable objects.
> 
> The patch is tested with some shrinker intensive workloads, no
> noticeable regression is soptted.
> 
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/list_lru.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index 5aa6e44bc2ae..5c128a7710ff 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -178,10 +178,17 @@ unsigned long list_lru_count_one(struct list_lru *lru,
>  	struct list_lru_one *l;
>  	unsigned long count;
>  
> -	rcu_read_lock();
> +	/*
> +	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
> +	 * we have to use IRQ-safe primitives here to avoid deadlock.
> +	 *
> +	 * Hold the lock to prevent from seeing transient negative
> +	 * nr_items value.
> +	 */
> +	spin_lock_irq(&nlru->lock);
>  	l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
>  	count = READ_ONCE(l->nr_items);
> -	rcu_read_unlock();
> +	spin_unlock_irq(&nlru->lock);
>  
>  	return count;
>  }
>
Kirill Tkhai Dec. 1, 2020, 10:25 a.m. UTC | #10
On 30.11.2020 23:09, Roman Gushchin wrote:
> On Mon, Nov 30, 2020 at 10:45:14AM -0800, Yang Shi wrote:
>> When investigating a slab cache bloat problem, significant amount of
>> negative dentry cache was seen, but confusingly they neither got shrunk
>> by reclaimer (the host has very tight memory) nor be shrunk by dropping
>> cache.  The vmcore shows there are over 14M negative dentry objects on lru,
>> but tracing result shows they were even not scanned at all.  The further
>> investigation shows the memcg's vfs shrinker_map bit is not set.  So the
>> reclaimer or dropping cache just skip calling vfs shrinker.  So we have
>> to reboot the hosts to get the memory back.
>>
>> I didn't manage to come up with a reproducer in test environment, and the
>> problem can't be reproduced after rebooting.  But it seems there is race
>> between shrinker map bit clear and reparenting by code inspection.  The
>> hypothesis is elaborated as below.
>>
>> The memcg hierarchy on our production environment looks like:
>>                 root
>>                /    \
>>           system   user
>>
>> The main workloads are running under user slice's children, and it creates
>> and removes memcg frequently.  So reparenting happens very often under user
>> slice, but no task is under user slice directly.
>>
>> So with the frequent reparenting and tight memory pressure, the below
>> hypothetical race condition may happen:
>>
>>     CPU A                            CPU B                         CPU C
>> reparent
>>     dst->nr_items == 0
>>                                  shrinker:
>>                                      total_objects == 0
>>     add src->nr_items to dst
>>     set_bit
>>                                      retrun SHRINK_EMPTY
>>                                      clear_bit
>>                                                                   list_lru_del()
>> reparent again
>>     dst->nr_items may go negative
>>     due to current list_lru_del()
>>     on CPU C
>>                                  The second run of shrinker:
>>                                      read nr_items without any
>>                                      synchronization, so it may
>>                                      see intermediate negative
>>                                      nr_items then total_objects
>>                                      may return 0 conincidently
>>
>>                                      keep the bit cleared
>>     dst->nr_items != 0
>>     skip set_bit
>>     add scr->nr_item to dst
>>
>> After this point dst->nr_item may never go zero, so reparenting will not
>> set shrinker_map bit anymore.  And since there is no task under user
>> slice directly, so no new object will be added to its lru to set the
>> shrinker map bit either.  That bit is kept cleared forever.
>>
>> How does list_lru_del() race with reparenting?  It is because
>> reparenting replaces childen's kmemcg_id to parent's without protecting
>> from nlru->lock, so list_lru_del() may see parent's kmemcg_id but
>> actually deleting items from child's lru, but dec'ing parent's nr_items,
>> so the parent's nr_items may go negative as commit
>> 2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent list_lrus and
>> free kmemcg_id on css offline") says.
>>
>> Can we move kmemcg_id replacement after reparenting?  No, because the
>> race with list_lru_del() may result in negative src->nr_items, but it
>> will never be fixed.  So the shrinker may never return SHRINK_EMPTY then
>> keep the shrinker map bit set always.  The shrinker will be always
>> called for nonsense.
>>
>> Can we synchronize list_lru_del() and reparenting?  Yes, it could be
>> done.  But it seems we need introduce a new lock or use nlru->lock.  But
>> it sounds complicated to move kmemcg_id replacement code under nlru->lock.
>> And list_lru_del() may be called quite often to exacerbate some hot
>> path, i.e. dentry kill.
>>
>> So, it sounds acceptable to synchronize reading nr_items to avoid seeing
>> intermediate negative nr_items given the simplicity and it is typically
>> just called by shrinkers when counting the freeable objects.
>>
>> The patch is tested with some shrinker intensive workloads, no
>> noticeable regression is soptted.
> 
> Hi Yang!
> 
> It's really tricky, thank you for digging in! It's a perfect analysis!
> 
> I wonder though, if it's better to just always set the shrinker bit on reparenting
> if we do reparent some items? Then we'll avoid adding new synchronization
> to the hot path. What do you think?
> 
> --
> 
> @@ -534,7 +534,6 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
>  	struct list_lru_node *nlru = &lru->node[nid];
>  	int dst_idx = dst_memcg->kmemcg_id;
>  	struct list_lru_one *src, *dst;
> -	bool set;
>  
>  	/*
>  	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
> @@ -546,9 +545,8 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
>  	dst = list_lru_from_memcg_idx(nlru, dst_idx);
>  
>  	list_splice_init(&src->list, &dst->list);
> -	set = (!dst->nr_items && src->nr_items);
>  	dst->nr_items += src->nr_items;
> -	if (set)
> +	if (src->nr_items)
>  		memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
>  	src->nr_items = 0;

This looks like a good fix.

To make a code more clear, we may also want to group neighbouring lines
under the same "if" branch in Yang's v2 resend.

Thanks,
Kirill
Yang Shi Dec. 1, 2020, 5:15 p.m. UTC | #11
On Tue, Dec 1, 2020 at 2:25 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 30.11.2020 23:09, Roman Gushchin wrote:
> > On Mon, Nov 30, 2020 at 10:45:14AM -0800, Yang Shi wrote:
> >> When investigating a slab cache bloat problem, significant amount of
> >> negative dentry cache was seen, but confusingly they neither got shrunk
> >> by reclaimer (the host has very tight memory) nor be shrunk by dropping
> >> cache.  The vmcore shows there are over 14M negative dentry objects on lru,
> >> but tracing result shows they were even not scanned at all.  The further
> >> investigation shows the memcg's vfs shrinker_map bit is not set.  So the
> >> reclaimer or dropping cache just skip calling vfs shrinker.  So we have
> >> to reboot the hosts to get the memory back.
> >>
> >> I didn't manage to come up with a reproducer in test environment, and the
> >> problem can't be reproduced after rebooting.  But it seems there is race
> >> between shrinker map bit clear and reparenting by code inspection.  The
> >> hypothesis is elaborated as below.
> >>
> >> The memcg hierarchy on our production environment looks like:
> >>                 root
> >>                /    \
> >>           system   user
> >>
> >> The main workloads are running under user slice's children, and it creates
> >> and removes memcg frequently.  So reparenting happens very often under user
> >> slice, but no task is under user slice directly.
> >>
> >> So with the frequent reparenting and tight memory pressure, the below
> >> hypothetical race condition may happen:
> >>
> >>     CPU A                            CPU B                         CPU C
> >> reparent
> >>     dst->nr_items == 0
> >>                                  shrinker:
> >>                                      total_objects == 0
> >>     add src->nr_items to dst
> >>     set_bit
> >>                                      retrun SHRINK_EMPTY
> >>                                      clear_bit
> >>                                                                   list_lru_del()
> >> reparent again
> >>     dst->nr_items may go negative
> >>     due to current list_lru_del()
> >>     on CPU C
> >>                                  The second run of shrinker:
> >>                                      read nr_items without any
> >>                                      synchronization, so it may
> >>                                      see intermediate negative
> >>                                      nr_items then total_objects
> >>                                      may return 0 conincidently
> >>
> >>                                      keep the bit cleared
> >>     dst->nr_items != 0
> >>     skip set_bit
> >>     add scr->nr_item to dst
> >>
> >> After this point dst->nr_item may never go zero, so reparenting will not
> >> set shrinker_map bit anymore.  And since there is no task under user
> >> slice directly, so no new object will be added to its lru to set the
> >> shrinker map bit either.  That bit is kept cleared forever.
> >>
> >> How does list_lru_del() race with reparenting?  It is because
> >> reparenting replaces childen's kmemcg_id to parent's without protecting
> >> from nlru->lock, so list_lru_del() may see parent's kmemcg_id but
> >> actually deleting items from child's lru, but dec'ing parent's nr_items,
> >> so the parent's nr_items may go negative as commit
> >> 2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent list_lrus and
> >> free kmemcg_id on css offline") says.
> >>
> >> Can we move kmemcg_id replacement after reparenting?  No, because the
> >> race with list_lru_del() may result in negative src->nr_items, but it
> >> will never be fixed.  So the shrinker may never return SHRINK_EMPTY then
> >> keep the shrinker map bit set always.  The shrinker will be always
> >> called for nonsense.
> >>
> >> Can we synchronize list_lru_del() and reparenting?  Yes, it could be
> >> done.  But it seems we need introduce a new lock or use nlru->lock.  But
> >> it sounds complicated to move kmemcg_id replacement code under nlru->lock.
> >> And list_lru_del() may be called quite often to exacerbate some hot
> >> path, i.e. dentry kill.
> >>
> >> So, it sounds acceptable to synchronize reading nr_items to avoid seeing
> >> intermediate negative nr_items given the simplicity and it is typically
> >> just called by shrinkers when counting the freeable objects.
> >>
> >> The patch is tested with some shrinker intensive workloads, no
> >> noticeable regression is soptted.
> >
> > Hi Yang!
> >
> > It's really tricky, thank you for digging in! It's a perfect analysis!
> >
> > I wonder though, if it's better to just always set the shrinker bit on reparenting
> > if we do reparent some items? Then we'll avoid adding new synchronization
> > to the hot path. What do you think?
> >
> > --
> >
> > @@ -534,7 +534,6 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
> >       struct list_lru_node *nlru = &lru->node[nid];
> >       int dst_idx = dst_memcg->kmemcg_id;
> >       struct list_lru_one *src, *dst;
> > -     bool set;
> >
> >       /*
> >        * Since list_lru_{add,del} may be called under an IRQ-safe lock,
> > @@ -546,9 +545,8 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
> >       dst = list_lru_from_memcg_idx(nlru, dst_idx);
> >
> >       list_splice_init(&src->list, &dst->list);
> > -     set = (!dst->nr_items && src->nr_items);
> >       dst->nr_items += src->nr_items;
> > -     if (set)
> > +     if (src->nr_items)
> >               memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
> >       src->nr_items = 0;
>
> This looks like a good fix.
>
> To make a code more clear, we may also want to group neighbouring lines
> under the same "if" branch in Yang's v2 resend.

You mean something like the below (diff based on Roman's proposal)?

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 127c2cf9f831..fe230081690b 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -545,10 +545,12 @@ static void memcg_drain_list_lru_node(struct
list_lru *lru, int nid,
        dst = list_lru_from_memcg_idx(nlru, dst_idx);

        list_splice_init(&src->list, &dst->list);
-       dst->nr_items += src->nr_items;
-       if (src->nr_items)
+
+       if (src->nr_items) {
+               dst->nr_items += src->nr_items;
                memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
-       src->nr_items = 0;
+               src->nr_items = 0;
+       }

        spin_unlock_irq(&nlru->lock);

>
> Thanks,
> Kirill
Kirill Tkhai Dec. 1, 2020, 5:17 p.m. UTC | #12
On 01.12.2020 20:15, Yang Shi wrote:
> On Tue, Dec 1, 2020 at 2:25 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> On 30.11.2020 23:09, Roman Gushchin wrote:
>>> On Mon, Nov 30, 2020 at 10:45:14AM -0800, Yang Shi wrote:
>>>> When investigating a slab cache bloat problem, significant amount of
>>>> negative dentry cache was seen, but confusingly they neither got shrunk
>>>> by reclaimer (the host has very tight memory) nor be shrunk by dropping
>>>> cache.  The vmcore shows there are over 14M negative dentry objects on lru,
>>>> but tracing result shows they were even not scanned at all.  The further
>>>> investigation shows the memcg's vfs shrinker_map bit is not set.  So the
>>>> reclaimer or dropping cache just skip calling vfs shrinker.  So we have
>>>> to reboot the hosts to get the memory back.
>>>>
>>>> I didn't manage to come up with a reproducer in test environment, and the
>>>> problem can't be reproduced after rebooting.  But it seems there is race
>>>> between shrinker map bit clear and reparenting by code inspection.  The
>>>> hypothesis is elaborated as below.
>>>>
>>>> The memcg hierarchy on our production environment looks like:
>>>>                 root
>>>>                /    \
>>>>           system   user
>>>>
>>>> The main workloads are running under user slice's children, and it creates
>>>> and removes memcg frequently.  So reparenting happens very often under user
>>>> slice, but no task is under user slice directly.
>>>>
>>>> So with the frequent reparenting and tight memory pressure, the below
>>>> hypothetical race condition may happen:
>>>>
>>>>     CPU A                            CPU B                         CPU C
>>>> reparent
>>>>     dst->nr_items == 0
>>>>                                  shrinker:
>>>>                                      total_objects == 0
>>>>     add src->nr_items to dst
>>>>     set_bit
>>>>                                      retrun SHRINK_EMPTY
>>>>                                      clear_bit
>>>>                                                                   list_lru_del()
>>>> reparent again
>>>>     dst->nr_items may go negative
>>>>     due to current list_lru_del()
>>>>     on CPU C
>>>>                                  The second run of shrinker:
>>>>                                      read nr_items without any
>>>>                                      synchronization, so it may
>>>>                                      see intermediate negative
>>>>                                      nr_items then total_objects
>>>>                                      may return 0 conincidently
>>>>
>>>>                                      keep the bit cleared
>>>>     dst->nr_items != 0
>>>>     skip set_bit
>>>>     add scr->nr_item to dst
>>>>
>>>> After this point dst->nr_item may never go zero, so reparenting will not
>>>> set shrinker_map bit anymore.  And since there is no task under user
>>>> slice directly, so no new object will be added to its lru to set the
>>>> shrinker map bit either.  That bit is kept cleared forever.
>>>>
>>>> How does list_lru_del() race with reparenting?  It is because
>>>> reparenting replaces childen's kmemcg_id to parent's without protecting
>>>> from nlru->lock, so list_lru_del() may see parent's kmemcg_id but
>>>> actually deleting items from child's lru, but dec'ing parent's nr_items,
>>>> so the parent's nr_items may go negative as commit
>>>> 2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent list_lrus and
>>>> free kmemcg_id on css offline") says.
>>>>
>>>> Can we move kmemcg_id replacement after reparenting?  No, because the
>>>> race with list_lru_del() may result in negative src->nr_items, but it
>>>> will never be fixed.  So the shrinker may never return SHRINK_EMPTY then
>>>> keep the shrinker map bit set always.  The shrinker will be always
>>>> called for nonsense.
>>>>
>>>> Can we synchronize list_lru_del() and reparenting?  Yes, it could be
>>>> done.  But it seems we need introduce a new lock or use nlru->lock.  But
>>>> it sounds complicated to move kmemcg_id replacement code under nlru->lock.
>>>> And list_lru_del() may be called quite often to exacerbate some hot
>>>> path, i.e. dentry kill.
>>>>
>>>> So, it sounds acceptable to synchronize reading nr_items to avoid seeing
>>>> intermediate negative nr_items given the simplicity and it is typically
>>>> just called by shrinkers when counting the freeable objects.
>>>>
>>>> The patch is tested with some shrinker intensive workloads, no
>>>> noticeable regression is soptted.
>>>
>>> Hi Yang!
>>>
>>> It's really tricky, thank you for digging in! It's a perfect analysis!
>>>
>>> I wonder though, if it's better to just always set the shrinker bit on reparenting
>>> if we do reparent some items? Then we'll avoid adding new synchronization
>>> to the hot path. What do you think?
>>>
>>> --
>>>
>>> @@ -534,7 +534,6 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
>>>       struct list_lru_node *nlru = &lru->node[nid];
>>>       int dst_idx = dst_memcg->kmemcg_id;
>>>       struct list_lru_one *src, *dst;
>>> -     bool set;
>>>
>>>       /*
>>>        * Since list_lru_{add,del} may be called under an IRQ-safe lock,
>>> @@ -546,9 +545,8 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
>>>       dst = list_lru_from_memcg_idx(nlru, dst_idx);
>>>
>>>       list_splice_init(&src->list, &dst->list);
>>> -     set = (!dst->nr_items && src->nr_items);
>>>       dst->nr_items += src->nr_items;
>>> -     if (set)
>>> +     if (src->nr_items)
>>>               memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
>>>       src->nr_items = 0;
>>
>> This looks like a good fix.
>>
>> To make a code more clear, we may also want to group neighbouring lines
>> under the same "if" branch in Yang's v2 resend.
> 
> You mean something like the below (diff based on Roman's proposal)?
> 
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index 127c2cf9f831..fe230081690b 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -545,10 +545,12 @@ static void memcg_drain_list_lru_node(struct
> list_lru *lru, int nid,
>         dst = list_lru_from_memcg_idx(nlru, dst_idx);
> 
>         list_splice_init(&src->list, &dst->list);
> -       dst->nr_items += src->nr_items;
> -       if (src->nr_items)
> +
> +       if (src->nr_items) {
> +               dst->nr_items += src->nr_items;
>                 memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
> -       src->nr_items = 0;
> +               src->nr_items = 0;
> +       }
> 
>         spin_unlock_irq(&nlru->lock);

Yes.
Yang Shi Dec. 1, 2020, 5:19 p.m. UTC | #13
On Tue, Dec 1, 2020 at 2:09 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 30.11.2020 21:45, Yang Shi wrote:
> > When investigating a slab cache bloat problem, significant amount of
> > negative dentry cache was seen, but confusingly they neither got shrunk
> > by reclaimer (the host has very tight memory) nor be shrunk by dropping
> > cache.  The vmcore shows there are over 14M negative dentry objects on lru,
> > but tracing result shows they were even not scanned at all.  The further
> > investigation shows the memcg's vfs shrinker_map bit is not set.  So the
> > reclaimer or dropping cache just skip calling vfs shrinker.  So we have
> > to reboot the hosts to get the memory back.
> >
> > I didn't manage to come up with a reproducer in test environment, and the
> > problem can't be reproduced after rebooting.  But it seems there is race
> > between shrinker map bit clear and reparenting by code inspection.  The
> > hypothesis is elaborated as below.
> >
> > The memcg hierarchy on our production environment looks like:
> >                 root
> >                /    \
> >           system   user
> >
> > The main workloads are running under user slice's children, and it creates
> > and removes memcg frequently.  So reparenting happens very often under user
> > slice, but no task is under user slice directly.
> >
> > So with the frequent reparenting and tight memory pressure, the below
> > hypothetical race condition may happen:
> >
> >     CPU A                            CPU B                         CPU C
> > reparent
> >     dst->nr_items == 0
> >                                  shrinker:
> >                                      total_objects == 0
> >     add src->nr_items to dst
> >     set_bit
> >                                      retrun SHRINK_EMPTY
> >                                      clear_bit
> >                                                                   list_lru_del()
> > reparent again
> >     dst->nr_items may go negative
> >     due to current list_lru_del()
> >     on CPU C
> >                                  The second run of shrinker:
> >                                      read nr_items without any
> >                                      synchronization, so it may
> >                                      see intermediate negative
> >                                      nr_items then total_objects
> >                                      may return 0 conincidently
> >
> >                                      keep the bit cleared
> >     dst->nr_items != 0
> >     skip set_bit
> >     add scr->nr_item to dst
>
> Good catch, Yang. Thanks for investigating this.
>
> But I agree with Roman it's better to fix that in rare-called place
> (memcg_drain_list_lru_node()), than in hot place (list_lru_count_one()).

Yes, agreed. Will incarnate Roman's proposal in v2.

>
> Also, I'd added to description of new patch a reference to memcg_offline_kmem(),
> because this is the place, where child->kmemcg_id is rewritten, and
> this is the reason of lru's nr_items may become negative.

Sure.

>
> > After this point dst->nr_item may never go zero, so reparenting will not
> > set shrinker_map bit anymore.  And since there is no task under user
> > slice directly, so no new object will be added to its lru to set the
> > shrinker map bit either.  That bit is kept cleared forever.
> >
> > How does list_lru_del() race with reparenting?  It is because
> > reparenting replaces childen's kmemcg_id to parent's without protecting
> > from nlru->lock, so list_lru_del() may see parent's kmemcg_id but
> > actually deleting items from child's lru, but dec'ing parent's nr_items,
> > so the parent's nr_items may go negative as commit
> > 2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent list_lrus and
> > free kmemcg_id on css offline") says.
> >
> > Can we move kmemcg_id replacement after reparenting?  No, because the
> > race with list_lru_del() may result in negative src->nr_items, but it
> > will never be fixed.  So the shrinker may never return SHRINK_EMPTY then
> > keep the shrinker map bit set always.  The shrinker will be always
> > called for nonsense.
> >
> > Can we synchronize list_lru_del() and reparenting?  Yes, it could be
> > done.  But it seems we need introduce a new lock or use nlru->lock.  But
> > it sounds complicated to move kmemcg_id replacement code under nlru->lock.
> > And list_lru_del() may be called quite often to exacerbate some hot
> > path, i.e. dentry kill.
> >
> > So, it sounds acceptable to synchronize reading nr_items to avoid seeing
> > intermediate negative nr_items given the simplicity and it is typically
> > just called by shrinkers when counting the freeable objects.
> >
> > The patch is tested with some shrinker intensive workloads, no
> > noticeable regression is soptted.
> >
> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> > Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
> > Cc: Roman Gushchin <guro@fb.com>
> > Cc: Shakeel Butt <shakeelb@google.com>
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/list_lru.c | 11 +++++++++--
> >  1 file changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > index 5aa6e44bc2ae..5c128a7710ff 100644
> > --- a/mm/list_lru.c
> > +++ b/mm/list_lru.c
> > @@ -178,10 +178,17 @@ unsigned long list_lru_count_one(struct list_lru *lru,
> >       struct list_lru_one *l;
> >       unsigned long count;
> >
> > -     rcu_read_lock();
> > +     /*
> > +      * Since list_lru_{add,del} may be called under an IRQ-safe lock,
> > +      * we have to use IRQ-safe primitives here to avoid deadlock.
> > +      *
> > +      * Hold the lock to prevent from seeing transient negative
> > +      * nr_items value.
> > +      */
> > +     spin_lock_irq(&nlru->lock);
> >       l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
> >       count = READ_ONCE(l->nr_items);
> > -     rcu_read_unlock();
> > +     spin_unlock_irq(&nlru->lock);
> >
> >       return count;
> >  }
> >
>
>
Yang Shi Dec. 1, 2020, 5:20 p.m. UTC | #14
On Tue, Dec 1, 2020 at 9:17 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 01.12.2020 20:15, Yang Shi wrote:
> > On Tue, Dec 1, 2020 at 2:25 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>
> >> On 30.11.2020 23:09, Roman Gushchin wrote:
> >>> On Mon, Nov 30, 2020 at 10:45:14AM -0800, Yang Shi wrote:
> >>>> When investigating a slab cache bloat problem, significant amount of
> >>>> negative dentry cache was seen, but confusingly they neither got shrunk
> >>>> by reclaimer (the host has very tight memory) nor be shrunk by dropping
> >>>> cache.  The vmcore shows there are over 14M negative dentry objects on lru,
> >>>> but tracing result shows they were even not scanned at all.  The further
> >>>> investigation shows the memcg's vfs shrinker_map bit is not set.  So the
> >>>> reclaimer or dropping cache just skip calling vfs shrinker.  So we have
> >>>> to reboot the hosts to get the memory back.
> >>>>
> >>>> I didn't manage to come up with a reproducer in test environment, and the
> >>>> problem can't be reproduced after rebooting.  But it seems there is race
> >>>> between shrinker map bit clear and reparenting by code inspection.  The
> >>>> hypothesis is elaborated as below.
> >>>>
> >>>> The memcg hierarchy on our production environment looks like:
> >>>>                 root
> >>>>                /    \
> >>>>           system   user
> >>>>
> >>>> The main workloads are running under user slice's children, and it creates
> >>>> and removes memcg frequently.  So reparenting happens very often under user
> >>>> slice, but no task is under user slice directly.
> >>>>
> >>>> So with the frequent reparenting and tight memory pressure, the below
> >>>> hypothetical race condition may happen:
> >>>>
> >>>>     CPU A                            CPU B                         CPU C
> >>>> reparent
> >>>>     dst->nr_items == 0
> >>>>                                  shrinker:
> >>>>                                      total_objects == 0
> >>>>     add src->nr_items to dst
> >>>>     set_bit
> >>>>                                      retrun SHRINK_EMPTY
> >>>>                                      clear_bit
> >>>>                                                                   list_lru_del()
> >>>> reparent again
> >>>>     dst->nr_items may go negative
> >>>>     due to current list_lru_del()
> >>>>     on CPU C
> >>>>                                  The second run of shrinker:
> >>>>                                      read nr_items without any
> >>>>                                      synchronization, so it may
> >>>>                                      see intermediate negative
> >>>>                                      nr_items then total_objects
> >>>>                                      may return 0 conincidently
> >>>>
> >>>>                                      keep the bit cleared
> >>>>     dst->nr_items != 0
> >>>>     skip set_bit
> >>>>     add scr->nr_item to dst
> >>>>
> >>>> After this point dst->nr_item may never go zero, so reparenting will not
> >>>> set shrinker_map bit anymore.  And since there is no task under user
> >>>> slice directly, so no new object will be added to its lru to set the
> >>>> shrinker map bit either.  That bit is kept cleared forever.
> >>>>
> >>>> How does list_lru_del() race with reparenting?  It is because
> >>>> reparenting replaces childen's kmemcg_id to parent's without protecting
> >>>> from nlru->lock, so list_lru_del() may see parent's kmemcg_id but
> >>>> actually deleting items from child's lru, but dec'ing parent's nr_items,
> >>>> so the parent's nr_items may go negative as commit
> >>>> 2788cf0c401c268b4819c5407493a8769b7007aa ("memcg: reparent list_lrus and
> >>>> free kmemcg_id on css offline") says.
> >>>>
> >>>> Can we move kmemcg_id replacement after reparenting?  No, because the
> >>>> race with list_lru_del() may result in negative src->nr_items, but it
> >>>> will never be fixed.  So the shrinker may never return SHRINK_EMPTY then
> >>>> keep the shrinker map bit set always.  The shrinker will be always
> >>>> called for nonsense.
> >>>>
> >>>> Can we synchronize list_lru_del() and reparenting?  Yes, it could be
> >>>> done.  But it seems we need introduce a new lock or use nlru->lock.  But
> >>>> it sounds complicated to move kmemcg_id replacement code under nlru->lock.
> >>>> And list_lru_del() may be called quite often to exacerbate some hot
> >>>> path, i.e. dentry kill.
> >>>>
> >>>> So, it sounds acceptable to synchronize reading nr_items to avoid seeing
> >>>> intermediate negative nr_items given the simplicity and it is typically
> >>>> just called by shrinkers when counting the freeable objects.
> >>>>
> >>>> The patch is tested with some shrinker intensive workloads, no
> >>>> noticeable regression is soptted.
> >>>
> >>> Hi Yang!
> >>>
> >>> It's really tricky, thank you for digging in! It's a perfect analysis!
> >>>
> >>> I wonder though, if it's better to just always set the shrinker bit on reparenting
> >>> if we do reparent some items? Then we'll avoid adding new synchronization
> >>> to the hot path. What do you think?
> >>>
> >>> --
> >>>
> >>> @@ -534,7 +534,6 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
> >>>       struct list_lru_node *nlru = &lru->node[nid];
> >>>       int dst_idx = dst_memcg->kmemcg_id;
> >>>       struct list_lru_one *src, *dst;
> >>> -     bool set;
> >>>
> >>>       /*
> >>>        * Since list_lru_{add,del} may be called under an IRQ-safe lock,
> >>> @@ -546,9 +545,8 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
> >>>       dst = list_lru_from_memcg_idx(nlru, dst_idx);
> >>>
> >>>       list_splice_init(&src->list, &dst->list);
> >>> -     set = (!dst->nr_items && src->nr_items);
> >>>       dst->nr_items += src->nr_items;
> >>> -     if (set)
> >>> +     if (src->nr_items)
> >>>               memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
> >>>       src->nr_items = 0;
> >>
> >> This looks like a good fix.
> >>
> >> To make a code more clear, we may also want to group neighbouring lines
> >> under the same "if" branch in Yang's v2 resend.
> >
> > You mean something like the below (diff based on Roman's proposal)?
> >
> > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > index 127c2cf9f831..fe230081690b 100644
> > --- a/mm/list_lru.c
> > +++ b/mm/list_lru.c
> > @@ -545,10 +545,12 @@ static void memcg_drain_list_lru_node(struct
> > list_lru *lru, int nid,
> >         dst = list_lru_from_memcg_idx(nlru, dst_idx);
> >
> >         list_splice_init(&src->list, &dst->list);
> > -       dst->nr_items += src->nr_items;
> > -       if (src->nr_items)
> > +
> > +       if (src->nr_items) {
> > +               dst->nr_items += src->nr_items;
> >                 memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
> > -       src->nr_items = 0;
> > +               src->nr_items = 0;
> > +       }
> >
> >         spin_unlock_irq(&nlru->lock);
>
> Yes.

Thanks for confirming. Will solve all the comments in v2.
diff mbox series

Patch

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 5aa6e44bc2ae..5c128a7710ff 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -178,10 +178,17 @@  unsigned long list_lru_count_one(struct list_lru *lru,
 	struct list_lru_one *l;
 	unsigned long count;
 
-	rcu_read_lock();
+	/*
+	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
+	 * we have to use IRQ-safe primitives here to avoid deadlock.
+	 *
+	 * Hold the lock to prevent from seeing transient negative
+	 * nr_items value.
+	 */
+	spin_lock_irq(&nlru->lock);
 	l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
 	count = READ_ONCE(l->nr_items);
-	rcu_read_unlock();
+	spin_unlock_irq(&nlru->lock);
 
 	return count;
 }