
mm: memcg/slab: wait for !root kmem_cache refcnt killing on root kmem_cache destruction

Message ID 20191125185453.278468-1-guro@fb.com (mailing list archive)
State New, archived
Series: mm: memcg/slab: wait for !root kmem_cache refcnt killing on root kmem_cache destruction

Commit Message

Roman Gushchin Nov. 25, 2019, 6:54 p.m. UTC
Christian reported a warning like the following, observed while running
KVM-related tests on s390:

WARNING: CPU: 8 PID: 208 at lib/percpu-refcount.c:108 percpu_ref_exit+0x50/0x58
Modules linked in: kvm(-) xt_CHECKSUM xt_MASQUERADE bonding xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_na>
CPU: 8 PID: 208 Comm: kworker/8:1 Not tainted 5.2.0+ #66
Hardware name: IBM 2964 NC9 712 (LPAR)
Workqueue: events sysfs_slab_remove_workfn
Krnl PSW : 0704e00180000000 0000001529746850 (percpu_ref_exit+0x50/0x58)
           R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
Krnl GPRS: 00000000ffff8808 0000001529746740 000003f4e30e8e18 0036008100000000
           0000001f00000000 0035008100000000 0000001fb3573ab8 0000000000000000
           0000001fbdb6de00 0000000000000000 0000001529f01328 0000001fb3573b00
           0000001fbb27e000 0000001fbdb69300 000003e009263d00 000003e009263cd0
Krnl Code: 0000001529746842: f0a0000407fe        srp        4(11,%r0),2046,0
           0000001529746848: 47000700            bc         0,1792
          #000000152974684c: a7f40001            brc        15,152974684e
          >0000001529746850: a7f4fff2            brc        15,1529746834
           0000001529746854: 0707                bcr        0,%r7
           0000001529746856: 0707                bcr        0,%r7
           0000001529746858: eb8ff0580024        stmg       %r8,%r15,88(%r15)
           000000152974685e: a738ffff            lhi        %r3,-1
Call Trace:
([<000003e009263d00>] 0x3e009263d00)
 [<00000015293252ea>] slab_kmem_cache_release+0x3a/0x70
 [<0000001529b04882>] kobject_put+0xaa/0xe8
 [<000000152918cf28>] process_one_work+0x1e8/0x428
 [<000000152918d1b0>] worker_thread+0x48/0x460
 [<00000015291942c6>] kthread+0x126/0x160
 [<0000001529b22344>] ret_from_fork+0x28/0x30
 [<0000001529b2234c>] kernel_thread_starter+0x0/0x10
Last Breaking-Event-Address:
 [<000000152974684c>] percpu_ref_exit+0x4c/0x58
---[ end trace b035e7da5788eb09 ]---

The problem occurs because kmem_cache_destroy() is called immediately
after the deletion of a memcg, so it races with the memcg kmem_cache
deactivation.

flush_memcg_workqueue() at the beginning of kmem_cache_destroy()
is supposed to guarantee that all deactivation processes have finished,
but it fails to do so. It waits for an rcu grace period, after which all
children kmem_caches should be deactivated. During the deactivation,
percpu_ref_kill() is called for the non-root kmem_cache refcounters,
but it requires yet another rcu grace period to finish the transition
to the atomic (dead) state.

So in the rare case when not all children kmem_caches have been destroyed
at the moment when the root kmem_cache is about to be gone, we need
to wait for another rcu grace period before destroying the root
kmem_cache.

Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Fixes: f0a3a24b532d ("mm: memcg/slab: rework non-root kmem_cache lifecycle management")
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: stable@vger.kernel.org
---
 mm/slab_common.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
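
To make the ordering problem described above concrete, here is a minimal, hypothetical sketch of the percpu_ref lifecycle involved; it is not the slab code itself and all names are made up. As I read the percpu_ref implementation of that era, percpu_ref_kill() only starts the switch to atomic mode and the switch completes from an RCU callback, so calling percpu_ref_exit() before that callback has run trips the warning in percpu_ref_exit() seen in the trace above. An rcu_barrier() in between closes the window, which is essentially what the patch adds for the racing children caches:

#include <linux/gfp.h>
#include <linux/percpu-refcount.h>
#include <linux/rcupdate.h>

/* Release callback: the last reference is gone.  Real code would free
 * the object embedding the ref here; this demo has nothing to free. */
static void demo_release(struct percpu_ref *ref)
{
}

static int demo_ref_lifecycle(void)
{
	struct percpu_ref ref;
	int err;

	err = percpu_ref_init(&ref, demo_release, 0, GFP_KERNEL);
	if (err)
		return err;

	/* Starts the percpu -> atomic switch; it completes asynchronously
	 * from an RCU callback, not before percpu_ref_kill() returns. */
	percpu_ref_kill(&ref);

	/* Without this, percpu_ref_exit() can run while the switch is
	 * still in flight -- the situation the warning above reports. */
	rcu_barrier();

	percpu_ref_exit(&ref);
	return 0;
}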

Comments

Shakeel Butt Nov. 25, 2019, 7:20 p.m. UTC | #1
On Mon, Nov 25, 2019 at 10:55 AM Roman Gushchin <guro@fb.com> wrote:
>
> [...]
> Signed-off-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

> Cc: stable@vger.kernel.org
> [...]
Michal Hocko Nov. 26, 2019, 9:29 a.m. UTC | #2
On Mon 25-11-19 10:54:53, Roman Gushchin wrote:
> [...]
> 
> The problem occurs because kmem_cache_destroy() is called immediately
> after deleting of a memcg, so it races with the memcg kmem_cache
> deactivation.
> 
> flush_memcg_workqueue() at the beginning of kmem_cache_destroy()
> is supposed to guarantee that all deactivation processes are finished,
> but failed to do so. It waits for an rcu grace period, after which all
> children kmem_caches should be deactivated. During the deactivation
> percpu_ref_kill() is called for non root kmem_cache refcounters,
> but it requires yet another rcu grace period to finish the transition
> to the atomic (dead) state.
> 
> So in a rare case when not all children kmem_caches are destroyed
> at the moment when the root kmem_cache is about to be gone, we need
> to wait another rcu grace period before destroying the root
> kmem_cache.

Could you explain how rare this really is please? I still have to wrap
my head around the overall logic here. It looks quite fragile to me TBH.
I am worried that it relies on implementation details of the PCP ref
counters too much.

> [...]
Christian Borntraeger Nov. 26, 2019, 9:33 a.m. UTC | #3
On 26.11.19 10:29, Michal Hocko wrote:
> On Mon 25-11-19 10:54:53, Roman Gushchin wrote:
>> [...]
>> So in a rare case when not all children kmem_caches are destroyed
>> at the moment when the root kmem_cache is about to be gone, we need
>> to wait another rcu grace period before destroying the root
>> kmem_cache.
> 
> Could you explain how rare this really is please? I still have to wrap
> my head around the overall logic here. It looks quite fragile to me TBH.
> I am worried that it relies on implementation details of the PCP ref
> counters too much.

I can actually reproduce this very reliably by doing:

# virsh destroy <lastguest>; rmmod kvm
Roman Gushchin Nov. 26, 2019, 6:41 p.m. UTC | #4
On Tue, Nov 26, 2019 at 10:29:18AM +0100, Michal Hocko wrote:
> On Mon 25-11-19 10:54:53, Roman Gushchin wrote:
> > [...]
> > So in a rare case when not all children kmem_caches are destroyed
> > at the moment when the root kmem_cache is about to be gone, we need
> > to wait another rcu grace period before destroying the root
> > kmem_cache.
> 
> Could you explain how rare this really is please?

It seems that we don't destroy root kmem_caches with enabled memcg
accounting that often, but maybe I'm biased here.

> I still have to wrap
> my head around the overall logic here. It looks quite fragile to me TBH.
> I am worried that it relies on implementation details of the PCP ref
> counters too much.

It is definitely very complicated and fragile, but I hope it won't remain
in this state for long. The new slab controller, which I'm working on,
eliminates all this logic altogether and generally simplifies things a lot,
simply because there will be no need to create and destroy per-memcg
kmem_caches.

Thanks!
Michal Hocko Nov. 27, 2019, 12:32 p.m. UTC | #5
On Tue 26-11-19 18:41:41, Roman Gushchin wrote:
> On Tue, Nov 26, 2019 at 10:29:18AM +0100, Michal Hocko wrote:
> > On Mon 25-11-19 10:54:53, Roman Gushchin wrote:
[...]
> > > So in a rare case when not all children kmem_caches are destroyed
> > > at the moment when the root kmem_cache is about to be gone, we need
> > > to wait another rcu grace period before destroying the root
> > > kmem_cache.
> > 
> > Could you explain how rare this really is please?
> 
> It seems that we don't destroy root kmem_caches with enabled memcg
> accounting that often, but maybe I'm biased here.

So this happens each time a root kmem_cache is destroyed? Which would
imply that only dynamically created ones?
Roman Gushchin Nov. 27, 2019, 5:27 p.m. UTC | #6
On Wed, Nov 27, 2019 at 01:32:25PM +0100, Michal Hocko wrote:
> On Tue 26-11-19 18:41:41, Roman Gushchin wrote:
> > On Tue, Nov 26, 2019 at 10:29:18AM +0100, Michal Hocko wrote:
> > > On Mon 25-11-19 10:54:53, Roman Gushchin wrote:
> [...]
> > > > So in a rare case when not all children kmem_caches are destroyed
> > > > at the moment when the root kmem_cache is about to be gone, we need
> > > > to wait another rcu grace period before destroying the root
> > > > kmem_cache.
> > > 
> > > Could you explain how rare this really is please?
> > 
> > It seems that we don't destroy root kmem_caches with enabled memcg
> > accounting that often, but maybe I'm biased here.
> 
> So this happens each time a root kmem_cache is destroyed? Which would
> imply that only dynamically created ones?

Yes, only dynamically created and only in those cases when destruction
of the root cache happens immediately after the deactivation of the
non-root cache. Tbh I can't imagine any other case except rmmod after
removing the cgroup.

Thanks!
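
To illustrate the scenario Roman describes (a dynamically created, memcg-accounted cache that is destroyed at module unload right after its last cgroup user was removed), here is a hypothetical minimal module sketch; the cache name, object size and flags are invented for the example, with kvm's caches being the real-world instance:

#include <linux/module.h>
#include <linux/slab.h>

static struct kmem_cache *demo_cachep;

static int __init demo_init(void)
{
	/* SLAB_ACCOUNT makes allocations memcg-accounted, so per-memcg
	 * child caches are created under this root cache as cgroups use it. */
	demo_cachep = kmem_cache_create("demo_cache", 256, 0,
					SLAB_ACCOUNT, NULL);
	return demo_cachep ? 0 : -ENOMEM;
}

static void __exit demo_exit(void)
{
	/* If this runs right after the last cgroup using the cache was
	 * removed, kmem_cache_destroy() races with the deactivation of
	 * the child caches: the window the patch above closes. */
	kmem_cache_destroy(demo_cachep);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");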
Michal Hocko Nov. 28, 2019, 9:43 a.m. UTC | #7
On Wed 27-11-19 17:27:29, Roman Gushchin wrote:
> On Wed, Nov 27, 2019 at 01:32:25PM +0100, Michal Hocko wrote:
> > On Tue 26-11-19 18:41:41, Roman Gushchin wrote:
> > > On Tue, Nov 26, 2019 at 10:29:18AM +0100, Michal Hocko wrote:
> > > > On Mon 25-11-19 10:54:53, Roman Gushchin wrote:
> > [...]
> > > > > So in a rare case when not all children kmem_caches are destroyed
> > > > > at the moment when the root kmem_cache is about to be gone, we need
> > > > > to wait another rcu grace period before destroying the root
> > > > > kmem_cache.
> > > > 
> > > > Could you explain how rare this really is please?
> > > 
> > > It seems that we don't destroy root kmem_caches with enabled memcg
> > > accounting that often, but maybe I'm biased here.
> > 
> > So this happens each time a root kmem_cache is destroyed? Which would
> > imply that only dynamically created ones?
> 
> Yes, only dynamically created and only in those cases when destruction
> of the root cache happens immediately after the deactivation of the
> non-root cache.
> Tbh I can't imagine any other case except rmmod after
> removing the cgroup.

Thanks for the confirmation! Could you please make this explicit in the
changelog? Maybe it is obvious to you but it took me quite some
time to grasp what the hell is going on here. Both memcg and kmem_cache
destruction are quite complex and convoluted.

Thanks a lot.
Roman Gushchin Nov. 29, 2019, 2:28 a.m. UTC | #8
On Thu, Nov 28, 2019 at 10:43:12AM +0100, Michal Hocko wrote:
> On Wed 27-11-19 17:27:29, Roman Gushchin wrote:
> > On Wed, Nov 27, 2019 at 01:32:25PM +0100, Michal Hocko wrote:
> > > On Tue 26-11-19 18:41:41, Roman Gushchin wrote:
> > > > On Tue, Nov 26, 2019 at 10:29:18AM +0100, Michal Hocko wrote:
> > > > > On Mon 25-11-19 10:54:53, Roman Gushchin wrote:
> > > [...]
> > > > > > So in a rare case when not all children kmem_caches are destroyed
> > > > > > at the moment when the root kmem_cache is about to be gone, we need
> > > > > > to wait another rcu grace period before destroying the root
> > > > > > kmem_cache.
> > > > > 
> > > > > Could you explain how rare this really is please?
> > > > 
> > > > It seems that we don't destroy root kmem_caches with enabled memcg
> > > > accounting that often, but maybe I'm biased here.
> > > 
> > > So this happens each time a root kmem_cache is destroyed? Which would
> > > imply that only dynamically created ones?
> > 
> > Yes, only dynamically created and only in those cases when destruction
> > of the root cache happens immediately after the deactivation of the
> > non-root cache.
> > Tbh I can't imagine any other case except rmmod after
> > removing the cgroup.
> 
> Thanks for the confirmation! Could you please make this explicit in the
> changelog? Maybe it is obvious to you but it took me quite some
> time to grasp what the hell is going on here. Both memcg and kmem_cache
> destruction are quite complex and convoluted.

Sure, will send v2 shortly.

Thanks!

Patch

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 8afa188f6e20..f0ab6d4ceb4c 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -904,6 +904,18 @@  static void flush_memcg_workqueue(struct kmem_cache *s)
 	 * previous workitems on workqueue are processed.
 	 */
 	flush_workqueue(memcg_kmem_cache_wq);
+
+	/*
+	 * If we're racing with children kmem_cache deactivation, it might
+	 * take another rcu grace period to complete their destruction.
+	 * At this moment the corresponding percpu_ref_kill() call should be
+	 * done, but it might take another rcu grace period to complete
+	 * switching to the atomic mode.
+	 * Please, note that we check without grabbing the slab_mutex. It's safe
+	 * because at this moment the children list can't grow.
+	 */
+	if (!list_empty(&s->memcg_params.children))
+		rcu_barrier();
 }
 #else
 static inline int shutdown_memcg_caches(struct kmem_cache *s)
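
A note on the rcu_barrier() call in the hunk above: assuming the usual RCU semantics, synchronize_rcu() only waits for a grace period to elapse, while rcu_barrier() additionally waits for already-queued call_rcu() callbacks to finish running, and, as I read the percpu_ref code of that era, it is such a callback that completes the percpu-to-atomic switch started by percpu_ref_kill(). Below is a small, hypothetical sketch of that difference; the struct and function names are made up:

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct demo_obj {
	struct rcu_head rcu;
	int payload;
};

static void demo_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct demo_obj, rcu));
}

static void demo_retire(struct demo_obj *obj)
{
	/* Defer freeing until all pre-existing RCU readers are done. */
	call_rcu(&obj->rcu, demo_free_rcu);

	/*
	 * synchronize_rcu() here would only guarantee that a grace period
	 * has elapsed; demo_free_rcu() could still be pending afterwards.
	 * rcu_barrier() guarantees the callback has actually run.
	 */
	rcu_barrier();
}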