Message ID | 20230911094444.68966-43-zhengqi.arch@bytedance.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | use refcount+RCU method to implement lockless slab shrink | expand |
On Tue, Sep 12, 2023 at 9:57 PM Qi Zheng <zhengqi.arch@bytedance.com> wrote: > - if (!down_read_trylock(&shrinker_rwsem)) > - goto out; > - > - list_for_each_entry(shrinker, &shrinker_list, list) { > + /* > + * lockless algorithm of global shrink. > + * > + * In the unregistration setp, the shrinker will be freed asynchronously > + * via RCU after its refcount reaches 0. So both rcu_read_lock() and > + * shrinker_try_get() can be used to ensure the existence of the shrinker. > + * > + * So in the global shrink: > + * step 1: use rcu_read_lock() to guarantee existence of the shrinker > + * and the validity of the shrinker_list walk. > + * step 2: use shrinker_try_get() to try get the refcount, if successful, > + * then the existence of the shrinker can also be guaranteed, > + * so we can release the RCU lock to do do_shrink_slab() that > + * may sleep. > + * step 3: *MUST* to reacquire the RCU lock before calling shrinker_put(), > + * which ensures that neither this shrinker nor the next shrinker > + * will be freed in the next traversal operation. Hello, Qi, Andrew, Paul, I wonder know how RCU can ensure the lifespan of the next shrinker. it seems it is diverged from the common pattern usage of RCU+reference. cpu1: rcu_read_lock(); shrinker_try_get(this_shrinker); rcu_read_unlock(); cpu2: shrinker_free(this_shrinker); cpu2: shrinker_free(next_shrinker); and free the memory of next_shrinker cpu2: when shrinker_free(next_shrinker), no one updates this_shrinker's next cpu2: since this_shrinker has been removed first. rcu_read_lock(); shrinker_put(this_shrinker); travel to the freed next_shrinker. a quick simple fix: // called with other references other than RCU (i.e. refcount) static inline rcu_list_deleted(struct list_head *entry) { // something like this: return entry->prev == LIST_POISON2; } // in the loop if (rcu_list_deleted(&shrinker->list)) { shrinker_put(shrinker); goto restart; } rcu_read_lock(); shrinker_put(shrinker); Thanks Lai > + * step 4: do shrinker_put() paired with step 2 to put the refcount, > + * if the refcount reaches 0, then wake up the waiter in > + * shrinker_free() by calling complete(). > + */ > + rcu_read_lock(); > + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { > struct shrink_control sc = { > .gfp_mask = gfp_mask, > .nid = nid, > .memcg = memcg, > }; > > + if (!shrinker_try_get(shrinker)) > + continue; > + > + rcu_read_unlock(); > + > ret = do_shrink_slab(&sc, shrinker, priority); > if (ret == SHRINK_EMPTY) > ret = 0; > freed += ret; > - /* > - * Bail out if someone want to register a new shrinker to > - * prevent the registration from being stalled for long periods > - * by parallel ongoing shrinking. > - */ > - if (rwsem_is_contended(&shrinker_rwsem)) { > - freed = freed ? : 1; > - break; > - } > + > + rcu_read_lock(); > + shrinker_put(shrinker); > } >
Hi, On 2023/12/6 15:47, Lai Jiangshan wrote: > On Tue, Sep 12, 2023 at 9:57 PM Qi Zheng <zhengqi.arch@bytedance.com> wrote: > >> - if (!down_read_trylock(&shrinker_rwsem)) >> - goto out; >> - >> - list_for_each_entry(shrinker, &shrinker_list, list) { >> + /* >> + * lockless algorithm of global shrink. >> + * >> + * In the unregistration setp, the shrinker will be freed asynchronously >> + * via RCU after its refcount reaches 0. So both rcu_read_lock() and >> + * shrinker_try_get() can be used to ensure the existence of the shrinker. >> + * >> + * So in the global shrink: >> + * step 1: use rcu_read_lock() to guarantee existence of the shrinker >> + * and the validity of the shrinker_list walk. >> + * step 2: use shrinker_try_get() to try get the refcount, if successful, >> + * then the existence of the shrinker can also be guaranteed, >> + * so we can release the RCU lock to do do_shrink_slab() that >> + * may sleep. >> + * step 3: *MUST* to reacquire the RCU lock before calling shrinker_put(), >> + * which ensures that neither this shrinker nor the next shrinker >> + * will be freed in the next traversal operation. > > Hello, Qi, Andrew, Paul, > > I wonder know how RCU can ensure the lifespan of the next shrinker. > it seems it is diverged from the common pattern usage of RCU+reference. > > cpu1: > rcu_read_lock(); > shrinker_try_get(this_shrinker); > rcu_read_unlock(); > cpu2: shrinker_free(this_shrinker); > cpu2: shrinker_free(next_shrinker); and free the memory of next_shrinker > cpu2: when shrinker_free(next_shrinker), no one updates this_shrinker's next > cpu2: since this_shrinker has been removed first. No, this_shrinker will not be removed from the shrinker_list until the last refcount is released. See below: > rcu_read_lock(); > shrinker_put(this_shrinker); CPU 1 CPU 2 --> if (refcount_dec_and_test(&shrinker->refcount)) complete(&shrinker->done); wait_for_completion(&shrinker->done); list_del_rcu(&shrinker->list); > travel to the freed next_shrinker. > > a quick simple fix: > > // called with other references other than RCU (i.e. refcount) > static inline rcu_list_deleted(struct list_head *entry) > { > // something like this: > return entry->prev == LIST_POISON2; > } > > // in the loop > if (rcu_list_deleted(&shrinker->list)) { > shrinker_put(shrinker); > goto restart; > } > rcu_read_lock(); > shrinker_put(shrinker); > > Thanks > Lai > >> + * step 4: do shrinker_put() paired with step 2 to put the refcount, >> + * if the refcount reaches 0, then wake up the waiter in >> + * shrinker_free() by calling complete(). >> + */ >> + rcu_read_lock(); >> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { >> struct shrink_control sc = { >> .gfp_mask = gfp_mask, >> .nid = nid, >> .memcg = memcg, >> }; >> >> + if (!shrinker_try_get(shrinker)) >> + continue; >> + >> + rcu_read_unlock(); >> + >> ret = do_shrink_slab(&sc, shrinker, priority); >> if (ret == SHRINK_EMPTY) >> ret = 0; >> freed += ret; >> - /* >> - * Bail out if someone want to register a new shrinker to >> - * prevent the registration from being stalled for long periods >> - * by parallel ongoing shrinking. >> - */ >> - if (rwsem_is_contended(&shrinker_rwsem)) { >> - freed = freed ? : 1; >> - break; >> - } >> + >> + rcu_read_lock(); >> + shrinker_put(shrinker); >> } >>
On Wed, Dec 6, 2023 at 3:55 PM Qi Zheng <zhengqi.arch@bytedance.com> wrote: > > Hi, > > On 2023/12/6 15:47, Lai Jiangshan wrote: > > On Tue, Sep 12, 2023 at 9:57 PM Qi Zheng <zhengqi.arch@bytedance.com> wrote: > > > >> - if (!down_read_trylock(&shrinker_rwsem)) > >> - goto out; > >> - > >> - list_for_each_entry(shrinker, &shrinker_list, list) { > >> + /* > >> + * lockless algorithm of global shrink. > >> + * > >> + * In the unregistration setp, the shrinker will be freed asynchronously > >> + * via RCU after its refcount reaches 0. So both rcu_read_lock() and > >> + * shrinker_try_get() can be used to ensure the existence of the shrinker. > >> + * > >> + * So in the global shrink: > >> + * step 1: use rcu_read_lock() to guarantee existence of the shrinker > >> + * and the validity of the shrinker_list walk. > >> + * step 2: use shrinker_try_get() to try get the refcount, if successful, > >> + * then the existence of the shrinker can also be guaranteed, > >> + * so we can release the RCU lock to do do_shrink_slab() that > >> + * may sleep. > >> + * step 3: *MUST* to reacquire the RCU lock before calling shrinker_put(), > >> + * which ensures that neither this shrinker nor the next shrinker > >> + * will be freed in the next traversal operation. > > > > Hello, Qi, Andrew, Paul, > > > > I wonder know how RCU can ensure the lifespan of the next shrinker. > > it seems it is diverged from the common pattern usage of RCU+reference. > > > > cpu1: > > rcu_read_lock(); > > shrinker_try_get(this_shrinker); > > rcu_read_unlock(); > > cpu2: shrinker_free(this_shrinker); > > cpu2: shrinker_free(next_shrinker); and free the memory of next_shrinker > > cpu2: when shrinker_free(next_shrinker), no one updates this_shrinker's next > > cpu2: since this_shrinker has been removed first. > > No, this_shrinker will not be removed from the shrinker_list until the > last refcount is released. See below: Oh, I miss the code here. Thanks Lai > > > rcu_read_lock(); > > shrinker_put(this_shrinker); > > CPU 1 CPU 2 > > --> if (refcount_dec_and_test(&shrinker->refcount)) > complete(&shrinker->done); > > wait_for_completion(&shrinker->done); > list_del_rcu(&shrinker->list); > > > travel to the freed next_shrinker. > > > > a quick simple fix: > > > > // called with other references other than RCU (i.e. refcount) > > static inline rcu_list_deleted(struct list_head *entry) > > { > > // something like this: > > return entry->prev == LIST_POISON2; > > } > > > > // in the loop > > if (rcu_list_deleted(&shrinker->list)) { > > shrinker_put(shrinker); > > goto restart; > > } > > rcu_read_lock(); > > shrinker_put(shrinker); > > > > Thanks > > Lai > > > >> + * step 4: do shrinker_put() paired with step 2 to put the refcount, > >> + * if the refcount reaches 0, then wake up the waiter in > >> + * shrinker_free() by calling complete(). > >> + */ > >> + rcu_read_lock(); > >> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { > >> struct shrink_control sc = { > >> .gfp_mask = gfp_mask, > >> .nid = nid, > >> .memcg = memcg, > >> }; > >> > >> + if (!shrinker_try_get(shrinker)) > >> + continue; > >> + > >> + rcu_read_unlock(); > >> + > >> ret = do_shrink_slab(&sc, shrinker, priority); > >> if (ret == SHRINK_EMPTY) > >> ret = 0; > >> freed += ret; > >> - /* > >> - * Bail out if someone want to register a new shrinker to > >> - * prevent the registration from being stalled for long periods > >> - * by parallel ongoing shrinking. > >> - */ > >> - if (rwsem_is_contended(&shrinker_rwsem)) { > >> - freed = freed ? : 1; > >> - break; > >> - } > >> + > >> + rcu_read_lock(); > >> + shrinker_put(shrinker); > >> } > >>
On Wed, Dec 6, 2023 at 3:55 PM Qi Zheng <zhengqi.arch@bytedance.com> wrote: > > Hi, > > On 2023/12/6 15:47, Lai Jiangshan wrote: > > On Tue, Sep 12, 2023 at 9:57 PM Qi Zheng <zhengqi.arch@bytedance.com> wrote: > > > >> - if (!down_read_trylock(&shrinker_rwsem)) > >> - goto out; > >> - > >> - list_for_each_entry(shrinker, &shrinker_list, list) { > >> + /* > >> + * lockless algorithm of global shrink. > >> + * > >> + * In the unregistration setp, the shrinker will be freed asynchronously > >> + * via RCU after its refcount reaches 0. So both rcu_read_lock() and > >> + * shrinker_try_get() can be used to ensure the existence of the shrinker. > >> + * > >> + * So in the global shrink: > >> + * step 1: use rcu_read_lock() to guarantee existence of the shrinker > >> + * and the validity of the shrinker_list walk. > >> + * step 2: use shrinker_try_get() to try get the refcount, if successful, > >> + * then the existence of the shrinker can also be guaranteed, > >> + * so we can release the RCU lock to do do_shrink_slab() that > >> + * may sleep. > >> + * step 3: *MUST* to reacquire the RCU lock before calling shrinker_put(), > >> + * which ensures that neither this shrinker nor the next shrinker > >> + * will be freed in the next traversal operation. > > > > Hello, Qi, Andrew, Paul, > > > > I wonder know how RCU can ensure the lifespan of the next shrinker. > > it seems it is diverged from the common pattern usage of RCU+reference. > > > > cpu1: > > rcu_read_lock(); > > shrinker_try_get(this_shrinker); > > rcu_read_unlock(); > > cpu2: shrinker_free(this_shrinker); > > cpu2: shrinker_free(next_shrinker); and free the memory of next_shrinker > > cpu2: when shrinker_free(next_shrinker), no one updates this_shrinker's next > > cpu2: since this_shrinker has been removed first. > > No, this_shrinker will not be removed from the shrinker_list until the > last refcount is released. See below: > > > rcu_read_lock(); > > shrinker_put(this_shrinker); > > CPU 1 CPU 2 > > --> if (refcount_dec_and_test(&shrinker->refcount)) > complete(&shrinker->done); > > wait_for_completion(&shrinker->done); > list_del_rcu(&shrinker->list); since shrinker will not be removed from the shrinker_list until the last refcount is released. Is it possible that shrinker_free() can be starved by continuous scanners getting and putting the refcount? Thanks Lai > > > travel to the freed next_shrinker. > > > > a quick simple fix: > > > > // called with other references other than RCU (i.e. refcount) > > static inline rcu_list_deleted(struct list_head *entry) > > { > > // something like this: > > return entry->prev == LIST_POISON2; > > } > > > > // in the loop > > if (rcu_list_deleted(&shrinker->list)) { > > shrinker_put(shrinker); > > goto restart; > > } > > rcu_read_lock(); > > shrinker_put(shrinker); > > > > Thanks > > Lai > > > >> + * step 4: do shrinker_put() paired with step 2 to put the refcount, > >> + * if the refcount reaches 0, then wake up the waiter in > >> + * shrinker_free() by calling complete(). > >> + */ > >> + rcu_read_lock(); > >> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { > >> struct shrink_control sc = { > >> .gfp_mask = gfp_mask, > >> .nid = nid, > >> .memcg = memcg, > >> }; > >> > >> + if (!shrinker_try_get(shrinker)) > >> + continue; > >> + > >> + rcu_read_unlock(); > >> + > >> ret = do_shrink_slab(&sc, shrinker, priority); > >> if (ret == SHRINK_EMPTY) > >> ret = 0; > >> freed += ret; > >> - /* > >> - * Bail out if someone want to register a new shrinker to > >> - * prevent the registration from being stalled for long periods > >> - * by parallel ongoing shrinking. > >> - */ > >> - if (rwsem_is_contended(&shrinker_rwsem)) { > >> - freed = freed ? : 1; > >> - break; > >> - } > >> + > >> + rcu_read_lock(); > >> + shrinker_put(shrinker); > >> } > >>
Hi, On 2023/12/6 16:23, Lai Jiangshan wrote: > On Wed, Dec 6, 2023 at 3:55 PM Qi Zheng <zhengqi.arch@bytedance.com> wrote: >> >> Hi, >> >> On 2023/12/6 15:47, Lai Jiangshan wrote: >>> On Tue, Sep 12, 2023 at 9:57 PM Qi Zheng <zhengqi.arch@bytedance.com> wrote: >>> >>>> - if (!down_read_trylock(&shrinker_rwsem)) >>>> - goto out; >>>> - >>>> - list_for_each_entry(shrinker, &shrinker_list, list) { >>>> + /* >>>> + * lockless algorithm of global shrink. >>>> + * >>>> + * In the unregistration setp, the shrinker will be freed asynchronously >>>> + * via RCU after its refcount reaches 0. So both rcu_read_lock() and >>>> + * shrinker_try_get() can be used to ensure the existence of the shrinker. >>>> + * >>>> + * So in the global shrink: >>>> + * step 1: use rcu_read_lock() to guarantee existence of the shrinker >>>> + * and the validity of the shrinker_list walk. >>>> + * step 2: use shrinker_try_get() to try get the refcount, if successful, >>>> + * then the existence of the shrinker can also be guaranteed, >>>> + * so we can release the RCU lock to do do_shrink_slab() that >>>> + * may sleep. >>>> + * step 3: *MUST* to reacquire the RCU lock before calling shrinker_put(), >>>> + * which ensures that neither this shrinker nor the next shrinker >>>> + * will be freed in the next traversal operation. >>> >>> Hello, Qi, Andrew, Paul, >>> >>> I wonder know how RCU can ensure the lifespan of the next shrinker. >>> it seems it is diverged from the common pattern usage of RCU+reference. >>> >>> cpu1: >>> rcu_read_lock(); >>> shrinker_try_get(this_shrinker); >>> rcu_read_unlock(); >>> cpu2: shrinker_free(this_shrinker); >>> cpu2: shrinker_free(next_shrinker); and free the memory of next_shrinker >>> cpu2: when shrinker_free(next_shrinker), no one updates this_shrinker's next >>> cpu2: since this_shrinker has been removed first. >> >> No, this_shrinker will not be removed from the shrinker_list until the >> last refcount is released. See below: >> >>> rcu_read_lock(); >>> shrinker_put(this_shrinker); >> >> CPU 1 CPU 2 >> >> --> if (refcount_dec_and_test(&shrinker->refcount)) >> complete(&shrinker->done); >> >> wait_for_completion(&shrinker->done); >> list_del_rcu(&shrinker->list); > > since shrinker will not be removed from the shrinker_list until the > last refcount is released. > > Is it possible that shrinker_free() can be starved by continuous > scanners getting and putting the refcount? I actually considered this case, but the probability of this happening was low, so I discarded the relevant code (v2 --> v3). If this problem really occurs in a production environment, we can fix it, like below: diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h index 1a00be90d93a..e5ebbbf1414f 100644 --- a/include/linux/shrinker.h +++ b/include/linux/shrinker.h @@ -88,6 +88,7 @@ struct shrinker { long batch; /* reclaim batch size, 0 = default */ int seeks; /* seeks to recreate an obj */ unsigned flags; + bool registered; /* * The reference count of this shrinker. Registered shrinker have an @@ -138,7 +139,8 @@ void shrinker_free(struct shrinker *shrinker); static inline bool shrinker_try_get(struct shrinker *shrinker) { - return refcount_inc_not_zero(&shrinker->refcount); + return READ_ONCE(shrinker->registered) && + refcount_inc_not_zero(&shrinker->refcount); } static inline void shrinker_put(struct shrinker *shrinker) diff --git a/mm/shrinker.c b/mm/shrinker.c index dd91eab43ed3..9b8881d178c6 100644 --- a/mm/shrinker.c +++ b/mm/shrinker.c @@ -753,6 +753,7 @@ void shrinker_register(struct shrinker *shrinker) * shrinker_try_get(). */ refcount_set(&shrinker->refcount, 1); + WRITE_ONCE(shrinker->registered, true); } EXPORT_SYMBOL_GPL(shrinker_register); @@ -773,6 +774,7 @@ void shrinker_free(struct shrinker *shrinker) return; if (shrinker->flags & SHRINKER_REGISTERED) { + WRITE_ONCE(shrinker->registered, false); /* drop the initial refcount */ shrinker_put(shrinker); /* Thanks, Qi > > Thanks > Lai > > >> >>> travel to the freed next_shrinker. >>> >>> a quick simple fix: >>> >>> // called with other references other than RCU (i.e. refcount) >>> static inline rcu_list_deleted(struct list_head *entry) >>> { >>> // something like this: >>> return entry->prev == LIST_POISON2; >>> } >>> >>> // in the loop >>> if (rcu_list_deleted(&shrinker->list)) { >>> shrinker_put(shrinker); >>> goto restart; >>> } >>> rcu_read_lock(); >>> shrinker_put(shrinker); >>> >>> Thanks >>> Lai >>> >>>> + * step 4: do shrinker_put() paired with step 2 to put the refcount, >>>> + * if the refcount reaches 0, then wake up the waiter in >>>> + * shrinker_free() by calling complete(). >>>> + */ >>>> + rcu_read_lock(); >>>> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { >>>> struct shrink_control sc = { >>>> .gfp_mask = gfp_mask, >>>> .nid = nid, >>>> .memcg = memcg, >>>> }; >>>> >>>> + if (!shrinker_try_get(shrinker)) >>>> + continue; >>>> + >>>> + rcu_read_unlock(); >>>> + >>>> ret = do_shrink_slab(&sc, shrinker, priority); >>>> if (ret == SHRINK_EMPTY) >>>> ret = 0; >>>> freed += ret; >>>> - /* >>>> - * Bail out if someone want to register a new shrinker to >>>> - * prevent the registration from being stalled for long periods >>>> - * by parallel ongoing shrinking. >>>> - */ >>>> - if (rwsem_is_contended(&shrinker_rwsem)) { >>>> - freed = freed ? : 1; >>>> - break; >>>> - } >>>> + >>>> + rcu_read_lock(); >>>> + shrinker_put(shrinker); >>>> } >>>>
On Wed, Dec 06, 2023 at 04:23:24PM +0800, Lai Jiangshan wrote: > On Wed, Dec 6, 2023 at 3:55 PM Qi Zheng <zhengqi.arch@bytedance.com> wrote: > > On 2023/12/6 15:47, Lai Jiangshan wrote: > > > On Tue, Sep 12, 2023 at 9:57 PM Qi Zheng <zhengqi.arch@bytedance.com> wrote: > > > > > No, this_shrinker will not be removed from the shrinker_list until the > > last refcount is released. See below: > > > > > rcu_read_lock(); > > > shrinker_put(this_shrinker); > > > > CPU 1 CPU 2 > > > > --> if (refcount_dec_and_test(&shrinker->refcount)) > > complete(&shrinker->done); > > > > wait_for_completion(&shrinker->done); > > list_del_rcu(&shrinker->list); > > since shrinker will not be removed from the shrinker_list until the > last refcount is released. > > Is it possible that shrinker_free() can be starved by continuous > scanners getting and putting the refcount? In theory, yes. In practice, highly improbable. i.e. I don't think this will ever be an issue because the timing conditions for memory reclaim to keep a permanently elevated reference count on a shrinker for a subsystem that is being torn down are -highly- improbable. And even if you could pull it off, who cares if shrinker_free() is delayed? It's a teardown operation meaning the subsystem using the shrinker will no longer be in use so the latency of the teardown operation is largely irrelevant to whatever is still running on the system. -Dave.
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h index eb342994675a..4908109d924f 100644 --- a/include/linux/shrinker.h +++ b/include/linux/shrinker.h @@ -4,6 +4,8 @@ #include <linux/atomic.h> #include <linux/types.h> +#include <linux/refcount.h> +#include <linux/completion.h> #define SHRINKER_UNIT_BITS BITS_PER_LONG @@ -87,6 +89,17 @@ struct shrinker { int seeks; /* seeks to recreate an obj */ unsigned flags; + /* + * The reference count of this shrinker. Registered shrinker have an + * initial refcount of 1, then the lookup operations are now allowed + * to use it via shrinker_try_get(). Later in the unregistration step, + * the initial refcount will be discarded, and will free the shrinker + * asynchronously via RCU after its refcount reaches 0. + */ + refcount_t refcount; + struct completion done; /* use to wait for refcount to reach 0 */ + struct rcu_head rcu; + void *private_data; /* These are for internal use */ @@ -120,6 +133,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, const char *fmt, ...); void shrinker_register(struct shrinker *shrinker); void shrinker_free(struct shrinker *shrinker); +static inline bool shrinker_try_get(struct shrinker *shrinker) +{ + return refcount_inc_not_zero(&shrinker->refcount); +} + +static inline void shrinker_put(struct shrinker *shrinker) +{ + if (refcount_dec_and_test(&shrinker->refcount)) + complete(&shrinker->done); +} + #ifdef CONFIG_SHRINKER_DEBUG extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...); diff --git a/mm/shrinker.c b/mm/shrinker.c index 0b9a43ce2d6f..82dc61133c5b 100644 --- a/mm/shrinker.c +++ b/mm/shrinker.c @@ -2,6 +2,7 @@ #include <linux/memcontrol.h> #include <linux/rwsem.h> #include <linux/shrinker.h> +#include <linux/rculist.h> #include <trace/events/vmscan.h> #include "internal.h" @@ -576,33 +577,50 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg, if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg)) return shrink_slab_memcg(gfp_mask, nid, memcg, priority); - if (!down_read_trylock(&shrinker_rwsem)) - goto out; - - list_for_each_entry(shrinker, &shrinker_list, list) { + /* + * lockless algorithm of global shrink. + * + * In the unregistration setp, the shrinker will be freed asynchronously + * via RCU after its refcount reaches 0. So both rcu_read_lock() and + * shrinker_try_get() can be used to ensure the existence of the shrinker. + * + * So in the global shrink: + * step 1: use rcu_read_lock() to guarantee existence of the shrinker + * and the validity of the shrinker_list walk. + * step 2: use shrinker_try_get() to try get the refcount, if successful, + * then the existence of the shrinker can also be guaranteed, + * so we can release the RCU lock to do do_shrink_slab() that + * may sleep. + * step 3: *MUST* to reacquire the RCU lock before calling shrinker_put(), + * which ensures that neither this shrinker nor the next shrinker + * will be freed in the next traversal operation. + * step 4: do shrinker_put() paired with step 2 to put the refcount, + * if the refcount reaches 0, then wake up the waiter in + * shrinker_free() by calling complete(). + */ + rcu_read_lock(); + list_for_each_entry_rcu(shrinker, &shrinker_list, list) { struct shrink_control sc = { .gfp_mask = gfp_mask, .nid = nid, .memcg = memcg, }; + if (!shrinker_try_get(shrinker)) + continue; + + rcu_read_unlock(); + ret = do_shrink_slab(&sc, shrinker, priority); if (ret == SHRINK_EMPTY) ret = 0; freed += ret; - /* - * Bail out if someone want to register a new shrinker to - * prevent the registration from being stalled for long periods - * by parallel ongoing shrinking. - */ - if (rwsem_is_contended(&shrinker_rwsem)) { - freed = freed ? : 1; - break; - } + + rcu_read_lock(); + shrinker_put(shrinker); } - up_read(&shrinker_rwsem); -out: + rcu_read_unlock(); cond_resched(); return freed; } @@ -671,13 +689,29 @@ void shrinker_register(struct shrinker *shrinker) } down_write(&shrinker_rwsem); - list_add_tail(&shrinker->list, &shrinker_list); + list_add_tail_rcu(&shrinker->list, &shrinker_list); shrinker->flags |= SHRINKER_REGISTERED; shrinker_debugfs_add(shrinker); up_write(&shrinker_rwsem); + + init_completion(&shrinker->done); + /* + * Now the shrinker is fully set up, take the first reference to it to + * indicate that lookup operations are now allowed to use it via + * shrinker_try_get(). + */ + refcount_set(&shrinker->refcount, 1); } EXPORT_SYMBOL_GPL(shrinker_register); +static void shrinker_free_rcu_cb(struct rcu_head *head) +{ + struct shrinker *shrinker = container_of(head, struct shrinker, rcu); + + kfree(shrinker->nr_deferred); + kfree(shrinker); +} + void shrinker_free(struct shrinker *shrinker) { struct dentry *debugfs_entry = NULL; @@ -686,9 +720,25 @@ void shrinker_free(struct shrinker *shrinker) if (!shrinker) return; + if (shrinker->flags & SHRINKER_REGISTERED) { + /* drop the initial refcount */ + shrinker_put(shrinker); + /* + * Wait for all lookups of the shrinker to complete, after that, + * no shrinker is running or will run again, then we can safely + * free it asynchronously via RCU and safely free the structure + * where the shrinker is located, such as super_block etc. + */ + wait_for_completion(&shrinker->done); + } + down_write(&shrinker_rwsem); if (shrinker->flags & SHRINKER_REGISTERED) { - list_del(&shrinker->list); + /* + * Now we can safely remove it from the shrinker_list and then + * free it. + */ + list_del_rcu(&shrinker->list); debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id); shrinker->flags &= ~SHRINKER_REGISTERED; } else { @@ -702,9 +752,6 @@ void shrinker_free(struct shrinker *shrinker) if (debugfs_entry) shrinker_debugfs_remove(debugfs_entry, debugfs_id); - kfree(shrinker->nr_deferred); - shrinker->nr_deferred = NULL; - - kfree(shrinker); + call_rcu(&shrinker->rcu, shrinker_free_rcu_cb); } EXPORT_SYMBOL_GPL(shrinker_free);
The shrinker_rwsem is a global read-write lock in shrinkers subsystem, which protects most operations such as slab shrink, registration and unregistration of shrinkers, etc. This can easily cause problems in the following cases. 1) When the memory pressure is high and there are many filesystems mounted or unmounted at the same time, slab shrink will be affected (down_read_trylock() failed). Such as the real workload mentioned by Kirill Tkhai: ``` One of the real workloads from my experience is start of an overcommitted node containing many starting containers after node crash (or many resuming containers after reboot for kernel update). In these cases memory pressure is huge, and the node goes round in long reclaim. ``` 2) If a shrinker is blocked (such as the case mentioned in [1]) and a writer comes in (such as mount a fs), then this writer will be blocked and cause all subsequent shrinker-related operations to be blocked. Even if there is no competitor when shrinking slab, there may still be a problem. The down_read_trylock() may become a perf hotspot with frequent calls to shrink_slab(). Because of the poor multicore scalability of atomic operations, this can lead to a significant drop in IPC (instructions per cycle). We used to implement the lockless slab shrink with SRCU [2], but then kernel test robot reported -88.8% regression in stress-ng.ramfs.ops_per_sec test case [3], so we reverted it [4]. This commit uses the refcount+RCU method [5] proposed by Dave Chinner to re-implement the lockless global slab shrink. The memcg slab shrink is handled in the subsequent patch. For now, all shrinker instances are converted to dynamically allocated and will be freed by call_rcu(). So we can use rcu_read_{lock,unlock}() to ensure that the shrinker instance is valid. And the shrinker instance will not be run again after unregistration. So the structure that records the pointer of shrinker instance can be safely freed without waiting for the RCU read-side critical section. In this way, while we implement the lockless slab shrink, we don't need to be blocked in unregister_shrinker(). The following are the test results: stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 & 1) Before applying this patchset: setting to a 60 second run per stressor dispatching hogs: 9 ramfs stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s (secs) (secs) (secs) (real time) (usr+sys time) ramfs 473062 60.00 8.00 279.13 7884.12 1647.59 for a 60.01s run time: 1440.34s available CPU time 7.99s user time ( 0.55%) 279.13s system time ( 19.38%) 287.12s total time ( 19.93%) load average: 7.12 2.99 1.15 successful run completed in 60.01s (1 min, 0.01 secs) 2) After applying this patchset: setting to a 60 second run per stressor dispatching hogs: 9 ramfs stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s (secs) (secs) (secs) (real time) (usr+sys time) ramfs 477165 60.00 8.13 281.34 7952.55 1648.40 for a 60.01s run time: 1440.33s available CPU time 8.12s user time ( 0.56%) 281.34s system time ( 19.53%) 289.46s total time ( 20.10%) load average: 6.98 3.03 1.19 successful run completed in 60.01s (1 min, 0.01 secs) We can see that the ops/s has hardly changed. [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/ [2]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/ [3]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/ [4]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/ [5]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/ Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> --- include/linux/shrinker.h | 24 +++++++++++ mm/shrinker.c | 89 ++++++++++++++++++++++++++++++---------- 2 files changed, 92 insertions(+), 21 deletions(-)