Message ID | 20200715165023.29919-1-songmuchun@bytedance.com (mailing list archive)
---|---
State | New, archived
Series | [v2] mm: memcg/slab: fix memory leak at non-root kmem_cache destroy
On Thu, Jul 16, 2020 at 12:50:22AM +0800, Muchun Song wrote:
> If the kmem_cache refcount is greater than one, we should not
> mark the root kmem_cache as dying. If we mark the root kmem_cache
> dying incorrectly, the non-root kmem_cache can never be destroyed.
> It resulted in memory leak when memcg was destroyed. We can use the
> following steps to reproduce.
>
> 1) Use kmem_cache_create() to create a new kmem_cache named A.
> 2) Coincidentally, the kmem_cache A is an alias for kmem_cache B,
>    so the refcount of B is just increased.
> 3) Use kmem_cache_destroy() to destroy the kmem_cache A, just
>    decrease the B's refcount but mark the B as dying.
> 4) Create a new memory cgroup and alloc memory from the kmem_cache
>    B. It leads to create a non-root kmem_cache for allocating memory.
> 5) When destroy the memory cgroup created in the step 4), the
>    non-root kmem_cache can never be destroyed.
>
> If we repeat steps 4) and 5), this will cause a lot of memory leak.
> So only when refcount reach zero, we mark the root kmem_cache as dying.
>
> Fixes: 92ee383f6daa ("mm: fix race between kmem_cache destroy, create and deactivate")
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Reviewed-by: Shakeel Butt <shakeelb@google.com>
> ---
>
> changelog in v2:
>  1) Fix a confusing typo in the commit log.

Ok, now I see the problem. Thank you for fixing the commit log!

>  2) Remove flush_memcg_workqueue() for !CONFIG_MEMCG_KMEM.
>  3) Introduce a new helper memcg_set_kmem_cache_dying() to fix a race
>     condition between flush_memcg_workqueue() and slab_unmergeable().
>
>  mm/slab_common.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 47 insertions(+), 7 deletions(-)
>
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 8c1ffbf7de45..c4958116e3fd 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -258,6 +258,11 @@ static void memcg_unlink_cache(struct kmem_cache *s)
>                  list_del(&s->memcg_params.kmem_caches_node);
>          }
>  }
> +
> +static inline bool memcg_kmem_cache_dying(struct kmem_cache *s)
> +{
> +        return is_root_cache(s) && s->memcg_params.dying;
> +}
>  #else
>  static inline int init_memcg_params(struct kmem_cache *s,
>                                      struct kmem_cache *root_cache)
> @@ -272,6 +277,11 @@ static inline void destroy_memcg_params(struct kmem_cache *s)
>  static inline void memcg_unlink_cache(struct kmem_cache *s)
>  {
>  }
> +
> +static inline bool memcg_kmem_cache_dying(struct kmem_cache *s)
> +{
> +        return false;
> +}
>  #endif /* CONFIG_MEMCG_KMEM */
>
>  /*
> @@ -326,6 +336,13 @@ int slab_unmergeable(struct kmem_cache *s)
>          if (s->refcount < 0)
>                  return 1;
>
> +        /*
> +         * If the kmem_cache is dying. We should also skip this
> +         * kmem_cache.
> +         */
> +        if (memcg_kmem_cache_dying(s))
> +                return 1;
> +
>          return 0;
>  }
>
> @@ -886,12 +903,15 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
>          return 0;
>  }
>
> -static void flush_memcg_workqueue(struct kmem_cache *s)
> +static void memcg_set_kmem_cache_dying(struct kmem_cache *s)
>  {
>          spin_lock_irq(&memcg_kmem_wq_lock);
>          s->memcg_params.dying = true;
>          spin_unlock_irq(&memcg_kmem_wq_lock);
> +}
>
> +static void flush_memcg_workqueue(struct kmem_cache *s)
> +{
>          /*
>           * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make
>           * sure all registered rcu callbacks have been invoked.
> @@ -923,10 +943,6 @@ static inline int shutdown_memcg_caches(struct kmem_cache *s)
>  {
>          return 0;
>  }
> -
> -static inline void flush_memcg_workqueue(struct kmem_cache *s)
> -{
> -}
>  #endif /* CONFIG_MEMCG_KMEM */
>
>  void slab_kmem_cache_release(struct kmem_cache *s)
> @@ -944,8 +960,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
>          if (unlikely(!s))
>                  return;
>
> -        flush_memcg_workqueue(s);
> -
>          get_online_cpus();
>          get_online_mems();
>
> @@ -955,6 +969,32 @@ void kmem_cache_destroy(struct kmem_cache *s)
>          if (s->refcount)
>                  goto out_unlock;
>
> +#ifdef CONFIG_MEMCG_KMEM
> +        memcg_set_kmem_cache_dying(s);
> +
> +        mutex_unlock(&slab_mutex);

Hm, but in theory s->refcount can be increased here?
So it doesn't solve the problem completely, but makes it less probable, right?

I wonder if it's possible to (additionally) protect s->refcount with a
memcg_kmem_wq_lock, so that we can check it in the context of flush_memcg_workqueue()?

> +
> +        put_online_mems();
> +        put_online_cpus();
> +
> +        flush_memcg_workqueue(s);
> +
> +        get_online_cpus();
> +        get_online_mems();
> +
> +        mutex_lock(&slab_mutex);
> +
> +        if (WARN(s->refcount,
> +                 "kmem_cache_destroy %s: Slab cache is still referenced\n",
> +                 s->name)) {
> +                /*
> +                 * Reset the dying flag setted by memcg_set_kmem_cache_dying().
> +                 */
> +                s->memcg_params.dying = false;
> +                goto out_unlock;
> +        }
> +#endif
> +
>          err = shutdown_memcg_caches(s);
>          if (!err)
>                  err = shutdown_cache(s);
> --
> 2.11.0
>

Other than the problem above your patch looks really good to me, however we should
be really careful here, as it should in theory be back-ported to a big number
of old stable kernels. And because it's (hopefully) fixed in 5.9, it's a backport-only
patch.

So I wonder if we can mitigate the problem by disabling cache sharing for some
specific kmem_caches instead? Like for all with SLAB_ACCOUNT and maybe for all except
a hard-coded list (if kmem accounting is enabled). Do you mind sharing any details
on how this problem reveals itself in the real life?

Thanks!

PS I'm off the keyboard for the rest of today, will think more and hopefully
come back with some ideas tomorrow.
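For readers who want to see the reproduction steps above in code, the fragment below is a minimal test-module sketch, not something taken from this series. The module and cache names, the object size and the SLAB_ACCOUNT flag are illustrative assumptions, and whether the two creations really merge depends on the allocator's merge rules (and on slab_nomerge not being set). It only covers steps 1)-3); steps 4) and 5) additionally need a task inside a freshly created memory cgroup to allocate from the surviving cache, so that a per-memcg child cache is instantiated, and then the cgroup to be removed.

#include <linux/module.h>
#include <linux/slab.h>

static struct kmem_cache *cache_a;
static struct kmem_cache *cache_b;

static int __init slab_leak_repro_init(void)
{
        /* Step 1: create cache A. */
        cache_a = kmem_cache_create("leak_repro_a", 256, 0, SLAB_ACCOUNT, NULL);
        if (!cache_a)
                return -ENOMEM;

        /*
         * Step 2: identical geometry and flags, so the allocator will usually
         * merge B into A; kmem_cache_create() then returns the existing cache
         * and only bumps its refcount.
         */
        cache_b = kmem_cache_create("leak_repro_b", 256, 0, SLAB_ACCOUNT, NULL);
        if (!cache_b) {
                kmem_cache_destroy(cache_a);
                return -ENOMEM;
        }

        /*
         * Step 3: destroying the alias only drops the shared refcount, but an
         * unfixed kernel also marks the surviving root cache as dying, so the
         * per-memcg children created later can never be reclaimed.
         */
        kmem_cache_destroy(cache_b);
        return 0;
}

static void __exit slab_leak_repro_exit(void)
{
        kmem_cache_destroy(cache_a);
}

module_init(slab_leak_repro_init);
module_exit(slab_leak_repro_exit);
MODULE_LICENSE("GPL");

On an unfixed kernel, the child caches created in step 4) then remain visible as empty but never-destroyed caches long after the cgroup is gone, which matches the observation reported later in this thread.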
On Thu, Jul 16, 2020 at 1:54 AM Roman Gushchin <guro@fb.com> wrote: > > On Thu, Jul 16, 2020 at 12:50:22AM +0800, Muchun Song wrote: > > If the kmem_cache refcount is greater than one, we should not > > mark the root kmem_cache as dying. If we mark the root kmem_cache > > dying incorrectly, the non-root kmem_cache can never be destroyed. > > It resulted in memory leak when memcg was destroyed. We can use the > > following steps to reproduce. > > > > 1) Use kmem_cache_create() to create a new kmem_cache named A. > > 2) Coincidentally, the kmem_cache A is an alias for kmem_cache B, > > so the refcount of B is just increased. > > 3) Use kmem_cache_destroy() to destroy the kmem_cache A, just > > decrease the B's refcount but mark the B as dying. > > 4) Create a new memory cgroup and alloc memory from the kmem_cache > > B. It leads to create a non-root kmem_cache for allocating memory. > > 5) When destroy the memory cgroup created in the step 4), the > > non-root kmem_cache can never be destroyed. > > > > If we repeat steps 4) and 5), this will cause a lot of memory leak. > > So only when refcount reach zero, we mark the root kmem_cache as dying. > > > > Fixes: 92ee383f6daa ("mm: fix race between kmem_cache destroy, create and deactivate") > > Signed-off-by: Muchun Song <songmuchun@bytedance.com> > > Reviewed-by: Shakeel Butt <shakeelb@google.com> > > --- > > > > changelog in v2: > > 1) Fix a confusing typo in the commit log. > > Ok, now I see the problem. Thank you for fixing the commit log! > > > 2) Remove flush_memcg_workqueue() for !CONFIG_MEMCG_KMEM. > > 3) Introduce a new helper memcg_set_kmem_cache_dying() to fix a race > > condition between flush_memcg_workqueue() and slab_unmergeable(). > > > > mm/slab_common.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++------- > > 1 file changed, 47 insertions(+), 7 deletions(-) > > > > diff --git a/mm/slab_common.c b/mm/slab_common.c > > index 8c1ffbf7de45..c4958116e3fd 100644 > > --- a/mm/slab_common.c > > +++ b/mm/slab_common.c > > @@ -258,6 +258,11 @@ static void memcg_unlink_cache(struct kmem_cache *s) > > list_del(&s->memcg_params.kmem_caches_node); > > } > > } > > + > > +static inline bool memcg_kmem_cache_dying(struct kmem_cache *s) > > +{ > > + return is_root_cache(s) && s->memcg_params.dying; > > +} > > #else > > static inline int init_memcg_params(struct kmem_cache *s, > > struct kmem_cache *root_cache) > > @@ -272,6 +277,11 @@ static inline void destroy_memcg_params(struct kmem_cache *s) > > static inline void memcg_unlink_cache(struct kmem_cache *s) > > { > > } > > + > > +static inline bool memcg_kmem_cache_dying(struct kmem_cache *s) > > +{ > > + return false; > > +} > > #endif /* CONFIG_MEMCG_KMEM */ > > > > /* > > @@ -326,6 +336,13 @@ int slab_unmergeable(struct kmem_cache *s) > > if (s->refcount < 0) > > return 1; > > > > + /* > > + * If the kmem_cache is dying. We should also skip this > > + * kmem_cache. > > + */ > > + if (memcg_kmem_cache_dying(s)) > > + return 1; > > + > > return 0; > > } > > > > @@ -886,12 +903,15 @@ static int shutdown_memcg_caches(struct kmem_cache *s) > > return 0; > > } > > > > -static void flush_memcg_workqueue(struct kmem_cache *s) > > +static void memcg_set_kmem_cache_dying(struct kmem_cache *s) > > { > > spin_lock_irq(&memcg_kmem_wq_lock); > > s->memcg_params.dying = true; > > spin_unlock_irq(&memcg_kmem_wq_lock); > > +} > > > > +static void flush_memcg_workqueue(struct kmem_cache *s) > > +{ > > /* > > * SLAB and SLUB deactivate the kmem_caches through call_rcu. 
Make > > * sure all registered rcu callbacks have been invoked. > > @@ -923,10 +943,6 @@ static inline int shutdown_memcg_caches(struct kmem_cache *s) > > { > > return 0; > > } > > - > > -static inline void flush_memcg_workqueue(struct kmem_cache *s) > > -{ > > -} > > #endif /* CONFIG_MEMCG_KMEM */ > > > > void slab_kmem_cache_release(struct kmem_cache *s) > > @@ -944,8 +960,6 @@ void kmem_cache_destroy(struct kmem_cache *s) > > if (unlikely(!s)) > > return; > > > > - flush_memcg_workqueue(s); > > - > > get_online_cpus(); > > get_online_mems(); > > > > @@ -955,6 +969,32 @@ void kmem_cache_destroy(struct kmem_cache *s) > > if (s->refcount) > > goto out_unlock; > > > > +#ifdef CONFIG_MEMCG_KMEM > > + memcg_set_kmem_cache_dying(s); > > + > > + mutex_unlock(&slab_mutex); > > Hm, but in theory s->refcount can be increased here? I have tried my best to read all the codes that operate on s->refcount. There is only one place which increases the s->refcount, it is the __kmem_cache_alias(). If the kmem cache is dying, the slab_unmergeable() can never return true for the dying kmem cache because it is the same slab_mutex protection, so I think that there is not a problem, right? > So it doesn't solve the problem completely, but makes it less probable, right? > > I wonder if it's possible to (additionally) protect s->refcount with a > memcg_kmem_wq_lock, so that we can check it in the context of flush_memcg_workqueue()? > > > + > > + put_online_mems(); > > + put_online_cpus(); > > + > > + flush_memcg_workqueue(s); > > + > > + get_online_cpus(); > > + get_online_mems(); > > + > > + mutex_lock(&slab_mutex); > > + > > + if (WARN(s->refcount, > > + "kmem_cache_destroy %s: Slab cache is still referenced\n", > > + s->name)) { > > + /* > > + * Reset the dying flag setted by memcg_set_kmem_cache_dying(). > > + */ > > + s->memcg_params.dying = false; > > + goto out_unlock; > > + } > > +#endif > > + > > err = shutdown_memcg_caches(s); > > if (!err) > > err = shutdown_cache(s); > > -- > > 2.11.0 > > > > Other than the problem above your patch looks really good to me, however we should > be really careful here, as it should in theory be back-ported to a big number > of old stable kernels. And because it's (hopefully) fixed in 5.9, it's a backport-only > patch. > > So I wonder if we can mitigate the problem by disabling cache sharing for some > specific kmem_caches instead? Like for all with SLAB_ACCOUNT and maybe for all except > a hard-coded list (if kmem accounting is enabled). Do you mind sharing any details > on how this problem reveals itself in the real life? One day I debug another problem, coincidentally I executed the following command: cat /sys/kernel/debug/memcg_slabinfo | grep deact And I found an interesting thing, a lot of non-root kmem_cache's active_objs is zero. But why are they not destroyed? So I launched a detailed investigation. Finally I found out that the reason is that we mark the root kmem_cache as dying incorrectly. I have applied this patch on our server for several weeks. And this problem has disappeared. > > Thanks! > > PS I'm off the keyboard for the rest of today, will think more and hopefully > come back with some ideas tomorrow. > >
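Muchun's argument can be made concrete: the only place s->refcount is incremented is __kmem_cache_alias(), reached from kmem_cache_create() via find_mergeable(), and that path runs entirely under slab_mutex, the same mutex kmem_cache_destroy() holds while it checks the refcount and sets the dying flag. A heavily condensed sketch of the creator side follows; it is illustrative only and not compilable on its own (the real code lives in mm/slab_common.c and mm/slub.c).

        /*
         * Condensed creator-side path: kmem_cache_create() ->
         * __kmem_cache_alias() -> find_mergeable().  Everything below runs
         * with slab_mutex held, so it is serialized against the refcount
         * check and the dying-flag update in kmem_cache_destroy().
         */
        mutex_lock(&slab_mutex);

        s = find_mergeable(size, align, flags, name, ctor);
        /*
         * find_mergeable() walks the root caches and skips every candidate
         * for which slab_unmergeable() returns non-zero; with the patch
         * applied that includes any cache whose memcg_params.dying is set.
         */
        if (s)
                s->refcount++;  /* the only place the refcount grows */

        mutex_unlock(&slab_mutex);

Because the destroy path only sets the dying flag after it has seen the refcount drop to zero while holding the same mutex, a dying cache can no longer be handed out as an alias and its refcount stays at zero.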
On Thu, Jul 16, 2020 at 01:07:02PM +0800, Muchun Song wrote: > On Thu, Jul 16, 2020 at 1:54 AM Roman Gushchin <guro@fb.com> wrote: > > > > On Thu, Jul 16, 2020 at 12:50:22AM +0800, Muchun Song wrote: > > > If the kmem_cache refcount is greater than one, we should not > > > mark the root kmem_cache as dying. If we mark the root kmem_cache > > > dying incorrectly, the non-root kmem_cache can never be destroyed. > > > It resulted in memory leak when memcg was destroyed. We can use the > > > following steps to reproduce. > > > > > > 1) Use kmem_cache_create() to create a new kmem_cache named A. > > > 2) Coincidentally, the kmem_cache A is an alias for kmem_cache B, > > > so the refcount of B is just increased. > > > 3) Use kmem_cache_destroy() to destroy the kmem_cache A, just > > > decrease the B's refcount but mark the B as dying. > > > 4) Create a new memory cgroup and alloc memory from the kmem_cache > > > B. It leads to create a non-root kmem_cache for allocating memory. > > > 5) When destroy the memory cgroup created in the step 4), the > > > non-root kmem_cache can never be destroyed. > > > > > > If we repeat steps 4) and 5), this will cause a lot of memory leak. > > > So only when refcount reach zero, we mark the root kmem_cache as dying. > > > > > > Fixes: 92ee383f6daa ("mm: fix race between kmem_cache destroy, create and deactivate") > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com> > > > Reviewed-by: Shakeel Butt <shakeelb@google.com> > > > --- > > > > > > changelog in v2: > > > 1) Fix a confusing typo in the commit log. > > > > Ok, now I see the problem. Thank you for fixing the commit log! > > > > > 2) Remove flush_memcg_workqueue() for !CONFIG_MEMCG_KMEM. > > > 3) Introduce a new helper memcg_set_kmem_cache_dying() to fix a race > > > condition between flush_memcg_workqueue() and slab_unmergeable(). > > > > > > mm/slab_common.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++------- > > > 1 file changed, 47 insertions(+), 7 deletions(-) > > > > > > diff --git a/mm/slab_common.c b/mm/slab_common.c > > > index 8c1ffbf7de45..c4958116e3fd 100644 > > > --- a/mm/slab_common.c > > > +++ b/mm/slab_common.c > > > @@ -258,6 +258,11 @@ static void memcg_unlink_cache(struct kmem_cache *s) > > > list_del(&s->memcg_params.kmem_caches_node); > > > } > > > } > > > + > > > +static inline bool memcg_kmem_cache_dying(struct kmem_cache *s) > > > +{ > > > + return is_root_cache(s) && s->memcg_params.dying; > > > +} > > > #else > > > static inline int init_memcg_params(struct kmem_cache *s, > > > struct kmem_cache *root_cache) > > > @@ -272,6 +277,11 @@ static inline void destroy_memcg_params(struct kmem_cache *s) > > > static inline void memcg_unlink_cache(struct kmem_cache *s) > > > { > > > } > > > + > > > +static inline bool memcg_kmem_cache_dying(struct kmem_cache *s) > > > +{ > > > + return false; > > > +} > > > #endif /* CONFIG_MEMCG_KMEM */ > > > > > > /* > > > @@ -326,6 +336,13 @@ int slab_unmergeable(struct kmem_cache *s) > > > if (s->refcount < 0) > > > return 1; > > > > > > + /* > > > + * If the kmem_cache is dying. We should also skip this > > > + * kmem_cache. 
> > > + */ > > > + if (memcg_kmem_cache_dying(s)) > > > + return 1; > > > + > > > return 0; > > > } > > > > > > @@ -886,12 +903,15 @@ static int shutdown_memcg_caches(struct kmem_cache *s) > > > return 0; > > > } > > > > > > -static void flush_memcg_workqueue(struct kmem_cache *s) > > > +static void memcg_set_kmem_cache_dying(struct kmem_cache *s) > > > { > > > spin_lock_irq(&memcg_kmem_wq_lock); > > > s->memcg_params.dying = true; > > > spin_unlock_irq(&memcg_kmem_wq_lock); > > > +} > > > > > > +static void flush_memcg_workqueue(struct kmem_cache *s) > > > +{ > > > /* > > > * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make > > > * sure all registered rcu callbacks have been invoked. > > > @@ -923,10 +943,6 @@ static inline int shutdown_memcg_caches(struct kmem_cache *s) > > > { > > > return 0; > > > } > > > - > > > -static inline void flush_memcg_workqueue(struct kmem_cache *s) > > > -{ > > > -} > > > #endif /* CONFIG_MEMCG_KMEM */ > > > > > > void slab_kmem_cache_release(struct kmem_cache *s) > > > @@ -944,8 +960,6 @@ void kmem_cache_destroy(struct kmem_cache *s) > > > if (unlikely(!s)) > > > return; > > > > > > - flush_memcg_workqueue(s); > > > - > > > get_online_cpus(); > > > get_online_mems(); > > > > > > @@ -955,6 +969,32 @@ void kmem_cache_destroy(struct kmem_cache *s) > > > if (s->refcount) > > > goto out_unlock; > > > > > > +#ifdef CONFIG_MEMCG_KMEM > > > + memcg_set_kmem_cache_dying(s); > > > + > > > + mutex_unlock(&slab_mutex); > > > > Hm, but in theory s->refcount can be increased here? > > I have tried my best to read all the codes that operate on s->refcount. > There is only one place which increases the s->refcount, it is the > __kmem_cache_alias(). If the kmem cache is dying, the slab_unmergeable() > can never return true for the dying kmem cache because it is the same slab_mutex > protection, so I think that there is not a problem, right? So the problem is that you're checking s->refcount under slab_mutex, then you're releasing the mutex and then set the dying bit. But nothing prevents someone else to clone the kmem_cache on a different CPU as soon as you've released the slab_mutex. However, I think there is a simple solution for this problem. Can you, please, check the following patch? -- diff --git a/mm/slab_common.c b/mm/slab_common.c index 37d48a56431d..d6661e7af01e 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -326,6 +326,11 @@ int slab_unmergeable(struct kmem_cache *s) if (s->refcount < 0) return 1; +#ifdef CONFIG_MEMCG_KMEM + if (s->memcg_params.dying) + return 1; +#endif + return 0; } @@ -888,9 +893,20 @@ static int shutdown_memcg_caches(struct kmem_cache *s) static void flush_memcg_workqueue(struct kmem_cache *s) { - spin_lock_irq(&memcg_kmem_wq_lock); - s->memcg_params.dying = true; - spin_unlock_irq(&memcg_kmem_wq_lock); + bool skip = false; + + mutex_lock(&slab_mutex); + if (s->refcount == 1) { + spin_lock_irq(&memcg_kmem_wq_lock); + s->memcg_params.dying = true; + spin_unlock_irq(&memcg_kmem_wq_lock); + } else { + skip = true; + } + mutex_unlock(&slab_mutex); + + if (skip) + return; /* * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make
On Thu, Jul 16, 2020 at 11:46 PM Roman Gushchin <guro@fb.com> wrote: > > On Thu, Jul 16, 2020 at 01:07:02PM +0800, Muchun Song wrote: > > On Thu, Jul 16, 2020 at 1:54 AM Roman Gushchin <guro@fb.com> wrote: > > > > > > On Thu, Jul 16, 2020 at 12:50:22AM +0800, Muchun Song wrote: > > > > If the kmem_cache refcount is greater than one, we should not > > > > mark the root kmem_cache as dying. If we mark the root kmem_cache > > > > dying incorrectly, the non-root kmem_cache can never be destroyed. > > > > It resulted in memory leak when memcg was destroyed. We can use the > > > > following steps to reproduce. > > > > > > > > 1) Use kmem_cache_create() to create a new kmem_cache named A. > > > > 2) Coincidentally, the kmem_cache A is an alias for kmem_cache B, > > > > so the refcount of B is just increased. > > > > 3) Use kmem_cache_destroy() to destroy the kmem_cache A, just > > > > decrease the B's refcount but mark the B as dying. > > > > 4) Create a new memory cgroup and alloc memory from the kmem_cache > > > > B. It leads to create a non-root kmem_cache for allocating memory. > > > > 5) When destroy the memory cgroup created in the step 4), the > > > > non-root kmem_cache can never be destroyed. > > > > > > > > If we repeat steps 4) and 5), this will cause a lot of memory leak. > > > > So only when refcount reach zero, we mark the root kmem_cache as dying. > > > > > > > > Fixes: 92ee383f6daa ("mm: fix race between kmem_cache destroy, create and deactivate") > > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com> > > > > Reviewed-by: Shakeel Butt <shakeelb@google.com> > > > > --- > > > > > > > > changelog in v2: > > > > 1) Fix a confusing typo in the commit log. > > > > > > Ok, now I see the problem. Thank you for fixing the commit log! > > > > > > > 2) Remove flush_memcg_workqueue() for !CONFIG_MEMCG_KMEM. > > > > 3) Introduce a new helper memcg_set_kmem_cache_dying() to fix a race > > > > condition between flush_memcg_workqueue() and slab_unmergeable(). > > > > > > > > mm/slab_common.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++------- > > > > 1 file changed, 47 insertions(+), 7 deletions(-) > > > > > > > > diff --git a/mm/slab_common.c b/mm/slab_common.c > > > > index 8c1ffbf7de45..c4958116e3fd 100644 > > > > --- a/mm/slab_common.c > > > > +++ b/mm/slab_common.c > > > > @@ -258,6 +258,11 @@ static void memcg_unlink_cache(struct kmem_cache *s) > > > > list_del(&s->memcg_params.kmem_caches_node); > > > > } > > > > } > > > > + > > > > +static inline bool memcg_kmem_cache_dying(struct kmem_cache *s) > > > > +{ > > > > + return is_root_cache(s) && s->memcg_params.dying; > > > > +} > > > > #else > > > > static inline int init_memcg_params(struct kmem_cache *s, > > > > struct kmem_cache *root_cache) > > > > @@ -272,6 +277,11 @@ static inline void destroy_memcg_params(struct kmem_cache *s) > > > > static inline void memcg_unlink_cache(struct kmem_cache *s) > > > > { > > > > } > > > > + > > > > +static inline bool memcg_kmem_cache_dying(struct kmem_cache *s) > > > > +{ > > > > + return false; > > > > +} > > > > #endif /* CONFIG_MEMCG_KMEM */ > > > > > > > > /* > > > > @@ -326,6 +336,13 @@ int slab_unmergeable(struct kmem_cache *s) > > > > if (s->refcount < 0) > > > > return 1; > > > > > > > > + /* > > > > + * If the kmem_cache is dying. We should also skip this > > > > + * kmem_cache. 
> > > > + */ > > > > + if (memcg_kmem_cache_dying(s)) > > > > + return 1; > > > > + > > > > return 0; > > > > } > > > > > > > > @@ -886,12 +903,15 @@ static int shutdown_memcg_caches(struct kmem_cache *s) > > > > return 0; > > > > } > > > > > > > > -static void flush_memcg_workqueue(struct kmem_cache *s) > > > > +static void memcg_set_kmem_cache_dying(struct kmem_cache *s) > > > > { > > > > spin_lock_irq(&memcg_kmem_wq_lock); > > > > s->memcg_params.dying = true; > > > > spin_unlock_irq(&memcg_kmem_wq_lock); > > > > +} > > > > > > > > +static void flush_memcg_workqueue(struct kmem_cache *s) > > > > +{ > > > > /* > > > > * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make > > > > * sure all registered rcu callbacks have been invoked. > > > > @@ -923,10 +943,6 @@ static inline int shutdown_memcg_caches(struct kmem_cache *s) > > > > { > > > > return 0; > > > > } > > > > - > > > > -static inline void flush_memcg_workqueue(struct kmem_cache *s) > > > > -{ > > > > -} > > > > #endif /* CONFIG_MEMCG_KMEM */ > > > > > > > > void slab_kmem_cache_release(struct kmem_cache *s) > > > > @@ -944,8 +960,6 @@ void kmem_cache_destroy(struct kmem_cache *s) > > > > if (unlikely(!s)) > > > > return; > > > > > > > > - flush_memcg_workqueue(s); > > > > - > > > > get_online_cpus(); > > > > get_online_mems(); > > > > > > > > @@ -955,6 +969,32 @@ void kmem_cache_destroy(struct kmem_cache *s) > > > > if (s->refcount) > > > > goto out_unlock; > > > > > > > > +#ifdef CONFIG_MEMCG_KMEM > > > > + memcg_set_kmem_cache_dying(s); > > > > + > > > > + mutex_unlock(&slab_mutex); > > > > > > Hm, but in theory s->refcount can be increased here? > > > > I have tried my best to read all the codes that operate on s->refcount. > > There is only one place which increases the s->refcount, it is the > > __kmem_cache_alias(). If the kmem cache is dying, the slab_unmergeable() > > can never return true for the dying kmem cache because it is the same slab_mutex > > protection, so I think that there is not a problem, right? > > So the problem is that you're checking s->refcount under slab_mutex, > then you're releasing the mutex and then set the dying bit. But nothing prevents Maybe you miss something. The dying bit is set in the memcg_set_kmem_cache_dying() which is under the slab_mutex protection. So I think there is no problem. > someone else to clone the kmem_cache on a different CPU as soon as you've released > the slab_mutex. > > However, I think there is a simple solution for this problem. > Can you, please, check the following patch? > > -- > diff --git a/mm/slab_common.c b/mm/slab_common.c > index 37d48a56431d..d6661e7af01e 100644 > --- a/mm/slab_common.c > +++ b/mm/slab_common.c > @@ -326,6 +326,11 @@ int slab_unmergeable(struct kmem_cache *s) > if (s->refcount < 0) > return 1; > > +#ifdef CONFIG_MEMCG_KMEM > + if (s->memcg_params.dying) > + return 1; > +#endif > + > return 0; > } > > @@ -888,9 +893,20 @@ static int shutdown_memcg_caches(struct kmem_cache *s) > > static void flush_memcg_workqueue(struct kmem_cache *s) > { > - spin_lock_irq(&memcg_kmem_wq_lock); > - s->memcg_params.dying = true; > - spin_unlock_irq(&memcg_kmem_wq_lock); > + bool skip = false; > + > + mutex_lock(&slab_mutex); > + if (s->refcount == 1) { > + spin_lock_irq(&memcg_kmem_wq_lock); > + s->memcg_params.dying = true; > + spin_unlock_irq(&memcg_kmem_wq_lock); > + } else { > + skip = true; > + } > + mutex_unlock(&slab_mutex); > + > + if (skip) > + return; > > /* > * SLAB and SLUB deactivate the kmem_caches through call_rcu. 
Make > -- > 2.26.2 > > > -- > > > > > So it doesn't solve the problem completely, but makes it less probable, right? > > > > > > I wonder if it's possible to (additionally) protect s->refcount with a > > > memcg_kmem_wq_lock, so that we can check it in the context of flush_memcg_workqueue()? > > > > > > > + > > > > + put_online_mems(); > > > > + put_online_cpus(); > > > > + > > > > + flush_memcg_workqueue(s); > > > > + > > > > + get_online_cpus(); > > > > + get_online_mems(); > > > > + > > > > + mutex_lock(&slab_mutex); > > > > + > > > > + if (WARN(s->refcount, > > > > + "kmem_cache_destroy %s: Slab cache is still referenced\n", > > > > + s->name)) { > > > > + /* > > > > + * Reset the dying flag setted by memcg_set_kmem_cache_dying(). > > > > + */ > > > > + s->memcg_params.dying = false; > > > > + goto out_unlock; > > > > + } > > > > +#endif > > > > + > > > > err = shutdown_memcg_caches(s); > > > > if (!err) > > > > err = shutdown_cache(s); > > > > -- > > > > 2.11.0 > > > > > > > > > > Other than the problem above your patch looks really good to me, however we should > > > be really careful here, as it should in theory be back-ported to a big number > > > of old stable kernels. And because it's (hopefully) fixed in 5.9, it's a backport-only > > > patch. > > > > > > So I wonder if we can mitigate the problem by disabling cache sharing for some > > > specific kmem_caches instead? Like for all with SLAB_ACCOUNT and maybe for all except > > > a hard-coded list (if kmem accounting is enabled). Do you mind sharing any details > > > on how this problem reveals itself in the real life? > > > > One day I debug another problem, coincidentally I executed the > > following command: > > > > cat /sys/kernel/debug/memcg_slabinfo | grep deact > > > > And I found an interesting thing, a lot of non-root kmem_cache's > > active_objs is zero. > > But why are they not destroyed? So I launched a detailed > > investigation. Finally I found > > out that the reason is that we mark the root kmem_cache as dying > > incorrectly. I have > > applied this patch on our server for several weeks. And this problem > > has disappeared. > > Got it, it's a great catch! > > Thank you! > > Roman
On Thu, Jul 16, 2020 at 11:54:37PM +0800, Muchun Song wrote: > On Thu, Jul 16, 2020 at 11:46 PM Roman Gushchin <guro@fb.com> wrote: > > > > On Thu, Jul 16, 2020 at 01:07:02PM +0800, Muchun Song wrote: > > > On Thu, Jul 16, 2020 at 1:54 AM Roman Gushchin <guro@fb.com> wrote: > > > > > > > > On Thu, Jul 16, 2020 at 12:50:22AM +0800, Muchun Song wrote: > > > > > If the kmem_cache refcount is greater than one, we should not > > > > > mark the root kmem_cache as dying. If we mark the root kmem_cache > > > > > dying incorrectly, the non-root kmem_cache can never be destroyed. > > > > > It resulted in memory leak when memcg was destroyed. We can use the > > > > > following steps to reproduce. > > > > > > > > > > 1) Use kmem_cache_create() to create a new kmem_cache named A. > > > > > 2) Coincidentally, the kmem_cache A is an alias for kmem_cache B, > > > > > so the refcount of B is just increased. > > > > > 3) Use kmem_cache_destroy() to destroy the kmem_cache A, just > > > > > decrease the B's refcount but mark the B as dying. > > > > > 4) Create a new memory cgroup and alloc memory from the kmem_cache > > > > > B. It leads to create a non-root kmem_cache for allocating memory. > > > > > 5) When destroy the memory cgroup created in the step 4), the > > > > > non-root kmem_cache can never be destroyed. > > > > > > > > > > If we repeat steps 4) and 5), this will cause a lot of memory leak. > > > > > So only when refcount reach zero, we mark the root kmem_cache as dying. > > > > > > > > > > Fixes: 92ee383f6daa ("mm: fix race between kmem_cache destroy, create and deactivate") > > > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com> > > > > > Reviewed-by: Shakeel Butt <shakeelb@google.com> > > > > > --- > > > > > > > > > > changelog in v2: > > > > > 1) Fix a confusing typo in the commit log. > > > > > > > > Ok, now I see the problem. Thank you for fixing the commit log! > > > > > > > > > 2) Remove flush_memcg_workqueue() for !CONFIG_MEMCG_KMEM. > > > > > 3) Introduce a new helper memcg_set_kmem_cache_dying() to fix a race > > > > > condition between flush_memcg_workqueue() and slab_unmergeable(). > > > > > > > > > > mm/slab_common.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++------- > > > > > 1 file changed, 47 insertions(+), 7 deletions(-) > > > > > > > > > > diff --git a/mm/slab_common.c b/mm/slab_common.c > > > > > index 8c1ffbf7de45..c4958116e3fd 100644 > > > > > --- a/mm/slab_common.c > > > > > +++ b/mm/slab_common.c > > > > > @@ -258,6 +258,11 @@ static void memcg_unlink_cache(struct kmem_cache *s) > > > > > list_del(&s->memcg_params.kmem_caches_node); > > > > > } > > > > > } > > > > > + > > > > > +static inline bool memcg_kmem_cache_dying(struct kmem_cache *s) > > > > > +{ > > > > > + return is_root_cache(s) && s->memcg_params.dying; > > > > > +} > > > > > #else > > > > > static inline int init_memcg_params(struct kmem_cache *s, > > > > > struct kmem_cache *root_cache) > > > > > @@ -272,6 +277,11 @@ static inline void destroy_memcg_params(struct kmem_cache *s) > > > > > static inline void memcg_unlink_cache(struct kmem_cache *s) > > > > > { > > > > > } > > > > > + > > > > > +static inline bool memcg_kmem_cache_dying(struct kmem_cache *s) > > > > > +{ > > > > > + return false; > > > > > +} > > > > > #endif /* CONFIG_MEMCG_KMEM */ > > > > > > > > > > /* > > > > > @@ -326,6 +336,13 @@ int slab_unmergeable(struct kmem_cache *s) > > > > > if (s->refcount < 0) > > > > > return 1; > > > > > > > > > > + /* > > > > > + * If the kmem_cache is dying. 
We should also skip this > > > > > + * kmem_cache. > > > > > + */ > > > > > + if (memcg_kmem_cache_dying(s)) > > > > > + return 1; > > > > > + > > > > > return 0; > > > > > } > > > > > > > > > > @@ -886,12 +903,15 @@ static int shutdown_memcg_caches(struct kmem_cache *s) > > > > > return 0; > > > > > } > > > > > > > > > > -static void flush_memcg_workqueue(struct kmem_cache *s) > > > > > +static void memcg_set_kmem_cache_dying(struct kmem_cache *s) > > > > > { > > > > > spin_lock_irq(&memcg_kmem_wq_lock); > > > > > s->memcg_params.dying = true; > > > > > spin_unlock_irq(&memcg_kmem_wq_lock); > > > > > +} > > > > > > > > > > +static void flush_memcg_workqueue(struct kmem_cache *s) > > > > > +{ > > > > > /* > > > > > * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make > > > > > * sure all registered rcu callbacks have been invoked. > > > > > @@ -923,10 +943,6 @@ static inline int shutdown_memcg_caches(struct kmem_cache *s) > > > > > { > > > > > return 0; > > > > > } > > > > > - > > > > > -static inline void flush_memcg_workqueue(struct kmem_cache *s) > > > > > -{ > > > > > -} > > > > > #endif /* CONFIG_MEMCG_KMEM */ > > > > > > > > > > void slab_kmem_cache_release(struct kmem_cache *s) > > > > > @@ -944,8 +960,6 @@ void kmem_cache_destroy(struct kmem_cache *s) > > > > > if (unlikely(!s)) > > > > > return; > > > > > > > > > > - flush_memcg_workqueue(s); > > > > > - > > > > > get_online_cpus(); > > > > > get_online_mems(); > > > > > > > > > > @@ -955,6 +969,32 @@ void kmem_cache_destroy(struct kmem_cache *s) > > > > > if (s->refcount) > > > > > goto out_unlock; > > > > > > > > > > +#ifdef CONFIG_MEMCG_KMEM > > > > > + memcg_set_kmem_cache_dying(s); > > > > > + > > > > > + mutex_unlock(&slab_mutex); > > > > > > > > Hm, but in theory s->refcount can be increased here? > > > > > > I have tried my best to read all the codes that operate on s->refcount. > > > There is only one place which increases the s->refcount, it is the > > > __kmem_cache_alias(). If the kmem cache is dying, the slab_unmergeable() > > > can never return true for the dying kmem cache because it is the same slab_mutex > > > protection, so I think that there is not a problem, right? > > > > So the problem is that you're checking s->refcount under slab_mutex, > > then you're releasing the mutex and then set the dying bit. But nothing prevents > > Maybe you miss something. The dying bit is set in the > memcg_set_kmem_cache_dying() > which is under the slab_mutex protection. So I think there is no problem. I'm sorry, I haven't noticed that you've fixed the race in v2. But then we can probably drop the WARN() and resetting the dying flag part, right? And because it's a backport-only patch I'd try to make it smaller (e.g. avoid introducing new helpers, etc) to simplify the back-porting to old kernels. But it's up to you (and Andrew). Please, feel free to add Acked-by: Roman Gushchin <guro@fb.com> after dropping the dying flag reset part. Thank you!
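To make the suggested simplification concrete: with the WARN()/dying-reset hunk dropped, the new block in kmem_cache_destroy() could shrink to roughly the following. This is only a sketch of the reviewer's suggestion, derived from the v2 diff in this thread; it is not necessarily what was merged in the end.

#ifdef CONFIG_MEMCG_KMEM
        /*
         * The refcount has already been seen to be zero above and, because
         * the dying flag is set under slab_mutex and honoured by
         * slab_unmergeable(), no new alias can revive this cache, so no
         * re-check is needed after the flush.
         */
        memcg_set_kmem_cache_dying(s);

        mutex_unlock(&slab_mutex);
        put_online_mems();
        put_online_cpus();

        flush_memcg_workqueue(s);

        get_online_cpus();
        get_online_mems();
        mutex_lock(&slab_mutex);
#endif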
On Fri, Jul 17, 2020 at 12:16 AM Roman Gushchin <guro@fb.com> wrote: > > On Thu, Jul 16, 2020 at 11:54:37PM +0800, Muchun Song wrote: > > On Thu, Jul 16, 2020 at 11:46 PM Roman Gushchin <guro@fb.com> wrote: > > > > > > On Thu, Jul 16, 2020 at 01:07:02PM +0800, Muchun Song wrote: > > > > On Thu, Jul 16, 2020 at 1:54 AM Roman Gushchin <guro@fb.com> wrote: > > > > > > > > > > On Thu, Jul 16, 2020 at 12:50:22AM +0800, Muchun Song wrote: > > > > > > If the kmem_cache refcount is greater than one, we should not > > > > > > mark the root kmem_cache as dying. If we mark the root kmem_cache > > > > > > dying incorrectly, the non-root kmem_cache can never be destroyed. > > > > > > It resulted in memory leak when memcg was destroyed. We can use the > > > > > > following steps to reproduce. > > > > > > > > > > > > 1) Use kmem_cache_create() to create a new kmem_cache named A. > > > > > > 2) Coincidentally, the kmem_cache A is an alias for kmem_cache B, > > > > > > so the refcount of B is just increased. > > > > > > 3) Use kmem_cache_destroy() to destroy the kmem_cache A, just > > > > > > decrease the B's refcount but mark the B as dying. > > > > > > 4) Create a new memory cgroup and alloc memory from the kmem_cache > > > > > > B. It leads to create a non-root kmem_cache for allocating memory. > > > > > > 5) When destroy the memory cgroup created in the step 4), the > > > > > > non-root kmem_cache can never be destroyed. > > > > > > > > > > > > If we repeat steps 4) and 5), this will cause a lot of memory leak. > > > > > > So only when refcount reach zero, we mark the root kmem_cache as dying. > > > > > > > > > > > > Fixes: 92ee383f6daa ("mm: fix race between kmem_cache destroy, create and deactivate") > > > > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com> > > > > > > Reviewed-by: Shakeel Butt <shakeelb@google.com> > > > > > > --- > > > > > > > > > > > > changelog in v2: > > > > > > 1) Fix a confusing typo in the commit log. > > > > > > > > > > Ok, now I see the problem. Thank you for fixing the commit log! > > > > > > > > > > > 2) Remove flush_memcg_workqueue() for !CONFIG_MEMCG_KMEM. > > > > > > 3) Introduce a new helper memcg_set_kmem_cache_dying() to fix a race > > > > > > condition between flush_memcg_workqueue() and slab_unmergeable(). 
> > > > > > > > > > > > mm/slab_common.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++------- > > > > > > 1 file changed, 47 insertions(+), 7 deletions(-) > > > > > > > > > > > > diff --git a/mm/slab_common.c b/mm/slab_common.c > > > > > > index 8c1ffbf7de45..c4958116e3fd 100644 > > > > > > --- a/mm/slab_common.c > > > > > > +++ b/mm/slab_common.c > > > > > > @@ -258,6 +258,11 @@ static void memcg_unlink_cache(struct kmem_cache *s) > > > > > > list_del(&s->memcg_params.kmem_caches_node); > > > > > > } > > > > > > } > > > > > > + > > > > > > +static inline bool memcg_kmem_cache_dying(struct kmem_cache *s) > > > > > > +{ > > > > > > + return is_root_cache(s) && s->memcg_params.dying; > > > > > > +} > > > > > > #else > > > > > > static inline int init_memcg_params(struct kmem_cache *s, > > > > > > struct kmem_cache *root_cache) > > > > > > @@ -272,6 +277,11 @@ static inline void destroy_memcg_params(struct kmem_cache *s) > > > > > > static inline void memcg_unlink_cache(struct kmem_cache *s) > > > > > > { > > > > > > } > > > > > > + > > > > > > +static inline bool memcg_kmem_cache_dying(struct kmem_cache *s) > > > > > > +{ > > > > > > + return false; > > > > > > +} > > > > > > #endif /* CONFIG_MEMCG_KMEM */ > > > > > > > > > > > > /* > > > > > > @@ -326,6 +336,13 @@ int slab_unmergeable(struct kmem_cache *s) > > > > > > if (s->refcount < 0) > > > > > > return 1; > > > > > > > > > > > > + /* > > > > > > + * If the kmem_cache is dying. We should also skip this > > > > > > + * kmem_cache. > > > > > > + */ > > > > > > + if (memcg_kmem_cache_dying(s)) > > > > > > + return 1; > > > > > > + > > > > > > return 0; > > > > > > } > > > > > > > > > > > > @@ -886,12 +903,15 @@ static int shutdown_memcg_caches(struct kmem_cache *s) > > > > > > return 0; > > > > > > } > > > > > > > > > > > > -static void flush_memcg_workqueue(struct kmem_cache *s) > > > > > > +static void memcg_set_kmem_cache_dying(struct kmem_cache *s) > > > > > > { > > > > > > spin_lock_irq(&memcg_kmem_wq_lock); > > > > > > s->memcg_params.dying = true; > > > > > > spin_unlock_irq(&memcg_kmem_wq_lock); > > > > > > +} > > > > > > > > > > > > +static void flush_memcg_workqueue(struct kmem_cache *s) > > > > > > +{ > > > > > > /* > > > > > > * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make > > > > > > * sure all registered rcu callbacks have been invoked. > > > > > > @@ -923,10 +943,6 @@ static inline int shutdown_memcg_caches(struct kmem_cache *s) > > > > > > { > > > > > > return 0; > > > > > > } > > > > > > - > > > > > > -static inline void flush_memcg_workqueue(struct kmem_cache *s) > > > > > > -{ > > > > > > -} > > > > > > #endif /* CONFIG_MEMCG_KMEM */ > > > > > > > > > > > > void slab_kmem_cache_release(struct kmem_cache *s) > > > > > > @@ -944,8 +960,6 @@ void kmem_cache_destroy(struct kmem_cache *s) > > > > > > if (unlikely(!s)) > > > > > > return; > > > > > > > > > > > > - flush_memcg_workqueue(s); > > > > > > - > > > > > > get_online_cpus(); > > > > > > get_online_mems(); > > > > > > > > > > > > @@ -955,6 +969,32 @@ void kmem_cache_destroy(struct kmem_cache *s) > > > > > > if (s->refcount) > > > > > > goto out_unlock; > > > > > > > > > > > > +#ifdef CONFIG_MEMCG_KMEM > > > > > > + memcg_set_kmem_cache_dying(s); > > > > > > + > > > > > > + mutex_unlock(&slab_mutex); > > > > > > > > > > Hm, but in theory s->refcount can be increased here? > > > > > > > > I have tried my best to read all the codes that operate on s->refcount. 
> > > > There is only one place which increases the s->refcount, it is the > > > > __kmem_cache_alias(). If the kmem cache is dying, the slab_unmergeable() > > > > can never return true for the dying kmem cache because it is the same slab_mutex > > > > protection, so I think that there is not a problem, right? > > > > > > So the problem is that you're checking s->refcount under slab_mutex, > > > then you're releasing the mutex and then set the dying bit. But nothing prevents > > > > Maybe you miss something. The dying bit is set in the > > memcg_set_kmem_cache_dying() > > which is under the slab_mutex protection. So I think there is no problem. > > I'm sorry, I haven't noticed that you've fixed the race in v2. > But then we can probably drop the WARN() and resetting the dying flag part, right? > > And because it's a backport-only patch I'd try to make it smaller > (e.g. avoid introducing new helpers, etc) to simplify the back-porting to old kernels. I agree with you. I will simplify the code. > But it's up to you (and Andrew). > > Please, feel free to add > Acked-by: Roman Gushchin <guro@fb.com> > after dropping the dying flag reset part. Ok, I will do that. Thanks. > > Thank you!
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 8c1ffbf7de45..c4958116e3fd 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -258,6 +258,11 @@ static void memcg_unlink_cache(struct kmem_cache *s)
                 list_del(&s->memcg_params.kmem_caches_node);
         }
 }
+
+static inline bool memcg_kmem_cache_dying(struct kmem_cache *s)
+{
+        return is_root_cache(s) && s->memcg_params.dying;
+}
 #else
 static inline int init_memcg_params(struct kmem_cache *s,
                                     struct kmem_cache *root_cache)
@@ -272,6 +277,11 @@ static inline void destroy_memcg_params(struct kmem_cache *s)
 static inline void memcg_unlink_cache(struct kmem_cache *s)
 {
 }
+
+static inline bool memcg_kmem_cache_dying(struct kmem_cache *s)
+{
+        return false;
+}
 #endif /* CONFIG_MEMCG_KMEM */

 /*
@@ -326,6 +336,13 @@ int slab_unmergeable(struct kmem_cache *s)
         if (s->refcount < 0)
                 return 1;

+        /*
+         * If the kmem_cache is dying. We should also skip this
+         * kmem_cache.
+         */
+        if (memcg_kmem_cache_dying(s))
+                return 1;
+
         return 0;
 }

@@ -886,12 +903,15 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
         return 0;
 }

-static void flush_memcg_workqueue(struct kmem_cache *s)
+static void memcg_set_kmem_cache_dying(struct kmem_cache *s)
 {
         spin_lock_irq(&memcg_kmem_wq_lock);
         s->memcg_params.dying = true;
         spin_unlock_irq(&memcg_kmem_wq_lock);
+}

+static void flush_memcg_workqueue(struct kmem_cache *s)
+{
         /*
          * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make
          * sure all registered rcu callbacks have been invoked.
@@ -923,10 +943,6 @@ static inline int shutdown_memcg_caches(struct kmem_cache *s)
 {
         return 0;
 }
-
-static inline void flush_memcg_workqueue(struct kmem_cache *s)
-{
-}
 #endif /* CONFIG_MEMCG_KMEM */

 void slab_kmem_cache_release(struct kmem_cache *s)
@@ -944,8 +960,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
         if (unlikely(!s))
                 return;

-        flush_memcg_workqueue(s);
-
         get_online_cpus();
         get_online_mems();

@@ -955,6 +969,32 @@ void kmem_cache_destroy(struct kmem_cache *s)
         if (s->refcount)
                 goto out_unlock;

+#ifdef CONFIG_MEMCG_KMEM
+        memcg_set_kmem_cache_dying(s);
+
+        mutex_unlock(&slab_mutex);
+
+        put_online_mems();
+        put_online_cpus();
+
+        flush_memcg_workqueue(s);
+
+        get_online_cpus();
+        get_online_mems();
+
+        mutex_lock(&slab_mutex);
+
+        if (WARN(s->refcount,
+                 "kmem_cache_destroy %s: Slab cache is still referenced\n",
+                 s->name)) {
+                /*
+                 * Reset the dying flag setted by memcg_set_kmem_cache_dying().
+                 */
+                s->memcg_params.dying = false;
+                goto out_unlock;
+        }
+#endif
+
         err = shutdown_memcg_caches(s);
         if (!err)
                 err = shutdown_cache(s);