Message ID | 20220601211824.89626-3-longman@redhat.com |
---|---|
State | New, archived |
Series | blk-cgroup: Optimize blkcg_rstat_flush() |
On Wed, Jun 01, 2022 at 05:18:24PM -0400, Waiman Long wrote:
> @@ -59,6 +59,57 @@ static struct workqueue_struct *blkcg_punt_bio_wq;
>
>  #define BLKG_DESTROY_BATCH_SIZE 64
>
> +/*
> + * lnode.next of the last entry in a lockless list is NULL. To make it
> + * always non-NULL for lnode's, a sentinel node has to be put at the
> + * end of the lockless list. So all the percpu lhead's are initialized
> + * to point to that sentinel node.
> + */

Can you please add why we want all entries to have non-NULL next?

> +static inline bool blkcg_llist_empty(struct llist_head *lhead)
> +{
> +	return lhead->first == &llist_last;
> +}
> +
> +static inline void init_blkcg_llists(struct blkcg *blkcg)
> +{
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu)
> +		per_cpu_ptr(blkcg->lhead, cpu)->first = &llist_last;
> +}
> +
> +static inline struct llist_node *
> +fetch_delete_blkcg_llist(struct llist_head *lhead)
> +{
> +	return xchg(&lhead->first, &llist_last);
> +}
> +
> +static inline struct llist_node *
> +fetch_delete_lnode_next(struct llist_node *lnode)
> +{
> +	struct llist_node *next = READ_ONCE(lnode->next);
> +	struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
> +					    lnode)->blkg;
> +
> +	WRITE_ONCE(lnode->next, NULL);
> +	percpu_ref_put(&blkg->refcnt);
> +	return next;
> +}

It's not a strong opinion but I'm not too fond of using inlines to mark
trivial functions. The compiler should be able to make these decisions,
right?

Other than the above two bikesheddings,

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.
On 6/1/22 17:26, Tejun Heo wrote:
> On Wed, Jun 01, 2022 at 05:18:24PM -0400, Waiman Long wrote:
>> @@ -59,6 +59,57 @@ static struct workqueue_struct *blkcg_punt_bio_wq;
>>
>>  #define BLKG_DESTROY_BATCH_SIZE 64
>>
>> +/*
>> + * lnode.next of the last entry in a lockless list is NULL. To make it
>> + * always non-NULL for lnode's, a sentinel node has to be put at the
>> + * end of the lockless list. So all the percpu lhead's are initialized
>> + * to point to that sentinel node.
>> + */
> Can you please add why we want all entries to have non-NULL next?

As said elsewhere, lnode->next is used as a flag to indicate its presence
in a lockless list. Sorry for not being explicit here.

>
>> +static inline bool blkcg_llist_empty(struct llist_head *lhead)
>> +{
>> +	return lhead->first == &llist_last;
>> +}
>> +
>> +static inline void init_blkcg_llists(struct blkcg *blkcg)
>> +{
>> +	int cpu;
>> +
>> +	for_each_possible_cpu(cpu)
>> +		per_cpu_ptr(blkcg->lhead, cpu)->first = &llist_last;
>> +}
>> +
>> +static inline struct llist_node *
>> +fetch_delete_blkcg_llist(struct llist_head *lhead)
>> +{
>> +	return xchg(&lhead->first, &llist_last);
>> +}
>> +
>> +static inline struct llist_node *
>> +fetch_delete_lnode_next(struct llist_node *lnode)
>> +{
>> +	struct llist_node *next = READ_ONCE(lnode->next);
>> +	struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
>> +					    lnode)->blkg;
>> +
>> +	WRITE_ONCE(lnode->next, NULL);
>> +	percpu_ref_put(&blkg->refcnt);
>> +	return next;
>> +}
> It's not a strong opinion but I'm not too fond of using inlines to mark
> trivial functions. The compiler should be able to make these decisions,
> right?
>
> Other than the above two bikesheddings,

Sure, I can remove the inline keywords. I think I do it out of habit :-)

Regards,
Longman

>
> Acked-by: Tejun Heo <tj@kernel.org>
>
> Thanks.
>
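To make the point above concrete, here is a minimal userspace sketch of the
sentinel trick (plain C11 with <stdatomic.h>, not the kernel code; the names
node, push_if_absent, pop_all and last_node are made up for illustration).
Because the list is terminated by a sentinel node rather than by NULL, a
node's next pointer is NULL exactly when the node is not queued, so a plain
next check can serve as an "already on the list" flag:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct node {
	struct node *next;	/* NULL means "not on any list" */
	int data;
};

static struct node last_node;				/* sentinel terminator */
static _Atomic(struct node *) list_head = &last_node;

/* Add a node only if it is not already queued; returns true if added. */
static bool push_if_absent(struct node *n)
{
	struct node *first;

	if (n->next)				/* already queued */
		return false;

	first = atomic_load(&list_head);
	do {
		n->next = first;
	} while (!atomic_compare_exchange_weak(&list_head, &first, n));
	return true;
}

/* Detach the whole list, leaving the sentinel behind as the new head. */
static struct node *pop_all(void)
{
	return atomic_exchange(&list_head, &last_node);
}

int main(void)
{
	struct node a = { .data = 1 }, b = { .data = 2 };

	push_if_absent(&a);
	push_if_absent(&b);
	push_if_absent(&a);		/* no-op: a.next is already non-NULL */

	for (struct node *n = pop_all(); n != &last_node; ) {
		struct node *next = n->next;

		n->next = NULL;		/* clear the "queued" flag */
		printf("node %d flushed\n", n->data);
		n = next;
	}
	return 0;
}

The kernel version does the same thing with llist_add()/xchg() plus
READ_ONCE()/WRITE_ONCE(), and additionally pins the blkg with a percpu
reference for as long as the entry stays queued.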
Hi Waiman,

I love your patch! Perhaps something to improve:

[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on linus/master next-20220601]
[cannot apply to v5.18]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Waiman-Long/blk-cgroup-Optimize-blkcg_rstat_flush/20220602-052441
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: x86_64-randconfig-a012 (https://download.01.org/0day-ci/archive/20220602/202206021418.wpJNbe3g-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project c825abd6b0198fb088d9752f556a70705bc99dfd)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/3f979cef411e5d5512b725753034b02f3b7baf44
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Waiman-Long/blk-cgroup-Optimize-blkcg_rstat_flush/20220602-052441
        git checkout 3f979cef411e5d5512b725753034b02f3b7baf44
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> block/blk-cgroup.c:1255:6: warning: variable 'ret' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
           if (!blkcg->lhead)
               ^~~~~~~~~~~~~
   block/blk-cgroup.c:1306:9: note: uninitialized use occurs here
                   return ret;
                          ^~~
   block/blk-cgroup.c:1255:2: note: remove the 'if' if its condition is always false
           if (!blkcg->lhead)
           ^~~~~~~~~~~~~~~~~~
   block/blk-cgroup.c:1239:33: note: initialize the variable 'ret' to silence this warning
           struct cgroup_subsys_state *ret;
                                           ^
                                            = NULL
   1 warning generated.


vim +1255 block/blk-cgroup.c

  1234	
  1235	static struct cgroup_subsys_state *
  1236	blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
  1237	{
  1238		struct blkcg *blkcg;
  1239		struct cgroup_subsys_state *ret;
  1240		int i;
  1241	
  1242		mutex_lock(&blkcg_pol_mutex);
  1243	
  1244		if (!parent_css) {
  1245			blkcg = &blkcg_root;
  1246		} else {
  1247			blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
  1248			if (!blkcg) {
  1249				ret = ERR_PTR(-ENOMEM);
  1250				goto unlock;
  1251			}
  1252		}
  1253	
  1254		blkcg->lhead = alloc_percpu_gfp(struct llist_head, GFP_KERNEL);
> 1255		if (!blkcg->lhead)
  1256			goto free_blkcg;
  1257		init_blkcg_llists(blkcg);
  1258	
  1259		for (i = 0; i < BLKCG_MAX_POLS ; i++) {
  1260			struct blkcg_policy *pol = blkcg_policy[i];
  1261			struct blkcg_policy_data *cpd;
  1262	
  1263			/*
  1264			 * If the policy hasn't been attached yet, wait for it
  1265			 * to be attached before doing anything else. Otherwise,
  1266			 * check if the policy requires any specific per-cgroup
  1267			 * data: if it does, allocate and initialize it.
  1268			 */
  1269			if (!pol || !pol->cpd_alloc_fn)
  1270				continue;
  1271	
  1272			cpd = pol->cpd_alloc_fn(GFP_KERNEL);
  1273			if (!cpd) {
  1274				ret = ERR_PTR(-ENOMEM);
  1275				goto free_pd_blkcg;
  1276			}
  1277			blkcg->cpd[i] = cpd;
  1278			cpd->blkcg = blkcg;
  1279			cpd->plid = i;
  1280			if (pol->cpd_init_fn)
  1281				pol->cpd_init_fn(cpd);
  1282		}
  1283	
  1284		spin_lock_init(&blkcg->lock);
  1285		refcount_set(&blkcg->online_pin, 1);
  1286		INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_NOWAIT | __GFP_NOWARN);
  1287		INIT_HLIST_HEAD(&blkcg->blkg_list);
  1288	#ifdef CONFIG_CGROUP_WRITEBACK
  1289		INIT_LIST_HEAD(&blkcg->cgwb_list);
  1290	#endif
  1291		list_add_tail(&blkcg->all_blkcgs_node, &all_blkcgs);
  1292	
  1293		mutex_unlock(&blkcg_pol_mutex);
  1294		return &blkcg->css;
  1295	
  1296	free_pd_blkcg:
  1297		for (i--; i >= 0; i--)
  1298			if (blkcg->cpd[i])
  1299				blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
  1300		free_percpu(blkcg->lhead);
  1301	free_blkcg:
  1302		if (blkcg != &blkcg_root)
  1303			kfree(blkcg);
  1304	unlock:
  1305		mutex_unlock(&blkcg_pol_mutex);
  1306		return ret;
  1307	}
  1308	
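The warning above concerns the new !blkcg->lhead error path in
blkcg_css_alloc(): it jumps to free_blkcg without assigning ret, so the
function may return an uninitialized pointer. A plausible fix, assuming
-ENOMEM is the intended error code as in the other allocation-failure paths
of this function, would be:

	blkcg->lhead = alloc_percpu_gfp(struct llist_head, GFP_KERNEL);
	if (!blkcg->lhead) {
		/* report the allocation failure instead of returning garbage */
		ret = ERR_PTR(-ENOMEM);
		goto free_blkcg;
	}
	init_blkcg_llists(blkcg);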
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index acd9b0aa8dc8..0143dda589bd 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -59,6 +59,57 @@ static struct workqueue_struct *blkcg_punt_bio_wq;
 
 #define BLKG_DESTROY_BATCH_SIZE 64
 
+/*
+ * lnode.next of the last entry in a lockless list is NULL. To make it
+ * always non-NULL for lnode's, a sentinel node has to be put at the
+ * end of the lockless list. So all the percpu lhead's are initialized
+ * to point to that sentinel node.
+ */
+static struct llist_node llist_last;
+
+static inline bool blkcg_llist_empty(struct llist_head *lhead)
+{
+	return lhead->first == &llist_last;
+}
+
+static inline void init_blkcg_llists(struct blkcg *blkcg)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		per_cpu_ptr(blkcg->lhead, cpu)->first = &llist_last;
+}
+
+static inline struct llist_node *
+fetch_delete_blkcg_llist(struct llist_head *lhead)
+{
+	return xchg(&lhead->first, &llist_last);
+}
+
+static inline struct llist_node *
+fetch_delete_lnode_next(struct llist_node *lnode)
+{
+	struct llist_node *next = READ_ONCE(lnode->next);
+	struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
+					    lnode)->blkg;
+
+	WRITE_ONCE(lnode->next, NULL);
+	percpu_ref_put(&blkg->refcnt);
+	return next;
+}
+
+/*
+ * The retrieved blkg_iostat_set is immediately marked as not in the
+ * lockless list by clearing its node->next pointer. It could be put
+ * back into the list by a parallel update before the iostat's are
+ * finally flushed including probably the new update.
+ */
+#define blkcg_llist_for_each_entry_safe(pos, node, nxt)			\
+	for (; (node != &llist_last) &&					\
+		(pos = llist_entry(node, struct blkg_iostat_set, lnode),\
+		 nxt = fetch_delete_lnode_next(node), true);		\
+		node = nxt)
+
 /**
  * blkcg_css - find the current css
  *
@@ -236,8 +287,10 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	blkg->blkcg = blkcg;
 
 	u64_stats_init(&blkg->iostat.sync);
-	for_each_possible_cpu(cpu)
+	for_each_possible_cpu(cpu) {
 		u64_stats_init(&per_cpu_ptr(blkg->iostat_cpu, cpu)->sync);
+		per_cpu_ptr(blkg->iostat_cpu, cpu)->blkg = blkg;
+	}
 
 	for (i = 0; i < BLKCG_MAX_POLS; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
@@ -852,17 +905,23 @@ static void blkg_iostat_sub(struct blkg_iostat *dst, struct blkg_iostat *src)
 static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 {
 	struct blkcg *blkcg = css_to_blkcg(css);
-	struct blkcg_gq *blkg;
+	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+	struct llist_node *lnode, *lnext;
+	struct blkg_iostat_set *bisc;
 
 	/* Root-level stats are sourced from system-wide IO stats */
 	if (!cgroup_parent(css->cgroup))
 		return;
 
+	if (blkcg_llist_empty(lhead))
+		return;
+
 	rcu_read_lock();
 
-	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
+	lnode = fetch_delete_blkcg_llist(lhead);
+	blkcg_llist_for_each_entry_safe(bisc, lnode, lnext) {
+		struct blkcg_gq *blkg = bisc->blkg;
 		struct blkcg_gq *parent = blkg->parent;
-		struct blkg_iostat_set *bisc = per_cpu_ptr(blkg->iostat_cpu, cpu);
 		struct blkg_iostat cur, delta;
 		unsigned long flags;
 		unsigned int seq;
@@ -1192,6 +1251,11 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 		}
 	}
 
+	blkcg->lhead = alloc_percpu_gfp(struct llist_head, GFP_KERNEL);
+	if (!blkcg->lhead)
+		goto free_blkcg;
+	init_blkcg_llists(blkcg);
+
 	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
 		struct blkcg_policy_data *cpd;
@@ -1233,7 +1297,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 	for (i--; i >= 0; i--)
 		if (blkcg->cpd[i])
 			blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
-
+	free_percpu(blkcg->lhead);
+free_blkcg:
 	if (blkcg != &blkcg_root)
 		kfree(blkcg);
 unlock:
@@ -1997,6 +2062,7 @@ static int blk_cgroup_io_type(struct bio *bio)
 
 void blk_cgroup_bio_start(struct bio *bio)
 {
+	struct blkcg *blkcg = bio->bi_blkg->blkcg;
 	int rwd = blk_cgroup_io_type(bio), cpu;
 	struct blkg_iostat_set *bis;
 	unsigned long flags;
@@ -2015,9 +2081,16 @@ void blk_cgroup_bio_start(struct bio *bio)
 	}
 	bis->cur.ios[rwd]++;
 
+	if (!READ_ONCE(bis->lnode.next)) {
+		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+
+		llist_add(&bis->lnode, lhead);
+		percpu_ref_get(&bis->blkg->refcnt);
+	}
+
 	u64_stats_update_end_irqrestore(&bis->sync, flags);
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
-		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
+		cgroup_rstat_updated(blkcg->css.cgroup, cpu);
 	put_cpu();
 }
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index d4de0a35e066..2c36362a332e 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -18,6 +18,7 @@
 #include <linux/cgroup.h>
 #include <linux/kthread.h>
 #include <linux/blk-mq.h>
+#include <linux/llist.h>
 
 struct blkcg_gq;
 struct blkg_policy_data;
@@ -43,6 +44,8 @@ struct blkg_iostat {
 
 struct blkg_iostat_set {
 	struct u64_stats_sync		sync;
+	struct llist_node		lnode;
+	struct blkcg_gq		       *blkg;
 	struct blkg_iostat		cur;
 	struct blkg_iostat		last;
 };
@@ -97,6 +100,12 @@ struct blkcg {
 	struct blkcg_policy_data	*cpd[BLKCG_MAX_POLS];
 
 	struct list_head		all_blkcgs_node;
+
+	/*
+	 * List of updated percpu blkg_iostat_set's since the last flush.
+	 */
+	struct llist_head __percpu    *lhead;
+
 #ifdef CONFIG_BLK_CGROUP_FC_APPID
 	char                    fc_app_id[FC_APPID_LEN];
 #endif
For a system with many CPUs and block devices, the time to do
blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It can be
especially problematic as interrupts are disabled during the flush. It was
reported that the flush might take seconds to complete in some extreme
cases, leading to hard lockup messages.

As it is likely that not all the percpu blkg_iostat_set's have been updated
since the last flush, the stale blkg_iostat_set's don't need to be flushed
in that case. This patch optimizes blkcg_rstat_flush() by keeping a lockless
list of recently updated blkg_iostat_set's in a newly added percpu
blkcg->lhead pointer.

The blkg_iostat_set is added to the lockless list on the update side in
blk_cgroup_bio_start() and removed from the lockless list when flushed in
blkcg_rstat_flush(). Due to racing, it is possible that blkg_iostat_set's in
the lockless list may have no new IO stats to be flushed.

To protect against destruction of the blkg, a percpu reference is taken when
the blkg_iostat_set is put onto the lockless list and released when it is
removed. A blkg_iostat_set can determine whether it is in a lockless list by
checking its lnode.next pointer, which is non-NULL while it is on a list.
This requires a special llist_last sentinel node to be put at the end of the
lockless list.

When booting up an instrumented test kernel with this patch on a 2-socket
96-thread system with cgroup v2, 1788 of the 2051 calls to
cgroup_rstat_flush() after bootup returned immediately because the lockless
list was empty. After an all-CPU kernel build, the ratio became
6295424/6340513, i.e. more than 99%.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 block/blk-cgroup.c | 85 ++++++++++++++++++++++++++++++++++++++++++----
 block/blk-cgroup.h |  9 +++++
 2 files changed, 88 insertions(+), 6 deletions(-)
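As a closing illustration, the pairing described above, lnode.next doubling
as an "is it queued?" flag and a reference held on the blkg for as long as
the entry sits on a lockless list, can be modeled in a few lines of plain C.
This is a single-threaded sketch with made-up names (owner, stat_entry,
record_update, flush); the real code uses llist, percpu_ref and runs
concurrently per CPU:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for a refcounted blkcg_gq. */
struct owner {
	int refcnt;
	char name[16];
};

static void owner_put(struct owner *o)
{
	if (--o->refcnt == 0) {
		printf("last reference dropped, freeing %s\n", o->name);
		free(o);
	}
}

/* Stand-in for a percpu blkg_iostat_set. */
struct stat_entry {
	struct stat_entry *next;	/* NULL <=> not queued */
	struct owner *owner;
	long updates;
};

static struct stat_entry sentinel;		/* list terminator, as in the patch */
static struct stat_entry *pending = &sentinel;	/* one "percpu" list head */

/* Update side: count the event; queue the entry and pin its owner only once. */
static void record_update(struct stat_entry *e)
{
	e->updates++;
	if (e->next)			/* already queued, owner already pinned */
		return;
	e->owner->refcnt++;		/* keep the owner alive while queued */
	e->next = pending;
	pending = e;
}

/* Flush side: detach the list, consume each entry, unpin its owner. */
static void flush(void)
{
	struct stat_entry *e = pending;

	pending = &sentinel;
	while (e != &sentinel) {
		struct stat_entry *next = e->next;

		e->next = NULL;		/* entry may be re-queued later */
		printf("%s: %ld updates flushed\n", e->owner->name, e->updates);
		owner_put(e->owner);
		e = next;
	}
}

int main(void)
{
	struct owner *o = calloc(1, sizeof(*o));
	struct stat_entry e = { .owner = o };

	o->refcnt = 1;			/* base reference */
	strcpy(o->name, "blkg0");

	record_update(&e);
	record_update(&e);		/* second update does not re-queue */
	flush();			/* reports 2 updates, drops the pin */

	owner_put(o);			/* drop the base reference */
	return 0;
}

The invariant to take away is that the owner's reference count never drops
to zero while one of its entries is still queued, which is what lets
blkcg_rstat_flush() dereference bisc->blkg safely.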