Message ID | 20240327155751.3536-3-shikemeng@huaweicloud.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Improve visibility of writeback | expand |
On Wed, Mar 27, 2024 at 11:57:47PM +0800, Kemeng Shi wrote: > /sys/kernel/debug/bdi/xxx/stats is supposed to show writeback information > of whole bdi, but only writeback information of bdi in root cgroup is > collected. So writeback information in non-root cgroup are missing now. > To be more specific, considering following case: > > /* create writeback cgroup */ > cd /sys/fs/cgroup > echo "+memory +io" > cgroup.subtree_control > mkdir group1 > cd group1 > echo $$ > cgroup.procs > /* do writeback in cgroup */ > fio -name test -filename=/dev/vdb ... > /* get writeback info of bdi */ > cat /sys/kernel/debug/bdi/xxx/stats > The cat result unexpectedly implies that there is no writeback on target > bdi. > > Fix this by collecting stats of all wb in bdi instead of only wb in > root cgroup. > > Following domain hierarchy is tested: > global domain (320G) > / \ > cgroup domain1(10G) cgroup domain2(10G) > | | > bdi wb1 wb2 > > /* all writeback info of bdi is successfully collected */ > cat stats > BdiWriteback: 2912 kB > BdiReclaimable: 1598464 kB > BdiDirtyThresh: 167479028 kB > DirtyThresh: 195038532 kB > BackgroundThresh: 32466728 kB > BdiDirtied: 19141696 kB > BdiWritten: 17543456 kB > BdiWriteBandwidth: 1136172 kBps > b_dirty: 2 > b_io: 0 > b_more_io: 1 > b_dirty_time: 0 > bdi_list: 1 > state: 1 > > Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> > --- > mm/backing-dev.c | 100 +++++++++++++++++++++++++++++++++-------------- > 1 file changed, 71 insertions(+), 29 deletions(-) > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > index 70f02959f3bd..8daf950e6855 100644 > --- a/mm/backing-dev.c > +++ b/mm/backing-dev.c ... > @@ -65,16 +78,54 @@ static struct backing_dev_info *lookup_bdi(struct seq_file *m) > return NULL; > } > > +static void collect_wb_stats(struct wb_stats *stats, > + struct bdi_writeback *wb) > +{ > + struct inode *inode; > + > + spin_lock(&wb->list_lock); > + list_for_each_entry(inode, &wb->b_dirty, i_io_list) > + stats->nr_dirty++; > + list_for_each_entry(inode, &wb->b_io, i_io_list) > + stats->nr_io++; > + list_for_each_entry(inode, &wb->b_more_io, i_io_list) > + stats->nr_more_io++; > + list_for_each_entry(inode, &wb->b_dirty_time, i_io_list) > + if (inode->i_state & I_DIRTY_TIME) > + stats->nr_dirty_time++; > + spin_unlock(&wb->list_lock); > + > + stats->nr_writeback += wb_stat(wb, WB_WRITEBACK); > + stats->nr_reclaimable += wb_stat(wb, WB_RECLAIMABLE); > + stats->nr_dirtied += wb_stat(wb, WB_DIRTIED); > + stats->nr_written += wb_stat(wb, WB_WRITTEN); > + stats->wb_thresh += wb_calc_thresh(wb, stats->dirty_thresh); Kinda nitty question, but is this a sum of per-wb writeback thresholds? If so, do you consider that useful information vs. the per-wb threshold data presumably exposed in the next patch? I'm not really that worried about what debug data we expose, it just seems a little odd. How would you document this value in a sentence or two, for example? > +} > + > +#ifdef CONFIG_CGROUP_WRITEBACK > +static void bdi_collect_stats(struct backing_dev_info *bdi, > + struct wb_stats *stats) > +{ > + struct bdi_writeback *wb; > + > + list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) > + collect_wb_stats(stats, wb); Depending on discussion on the previous patch and whether the higher level rcu protection in bdi_debug_stats_show() is really necessary, it might make more sense to move it here. I'm also wondering if you'd want to check the state of the individual wb (i.e. WB_registered?) before reading it..? > +} > +#else > +static void bdi_collect_stats(struct backing_dev_info *bdi, > + struct wb_stats *stats) > +{ > + collect_wb_stats(stats, &bdi->wb); > +} > +#endif ... > @@ -115,18 +157,18 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) > "b_dirty_time: %10lu\n" > "bdi_list: %10u\n" > "state: %10lx\n", > - (unsigned long) K(wb_stat(wb, WB_WRITEBACK)), > - (unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)), > - K(wb_thresh), > + K(stats.nr_writeback), > + K(stats.nr_reclaimable), > + K(stats.wb_thresh), > K(dirty_thresh), > K(background_thresh), > - (unsigned long) K(wb_stat(wb, WB_DIRTIED)), > - (unsigned long) K(wb_stat(wb, WB_WRITTEN)), > - (unsigned long) K(wb->write_bandwidth), > - nr_dirty, > - nr_io, > - nr_more_io, > - nr_dirty_time, > + K(stats.nr_dirtied), > + K(stats.nr_written), > + K(tot_bw), > + stats.nr_dirty, > + stats.nr_io, > + stats.nr_more_io, > + stats.nr_dirty_time, > !list_empty(&bdi->bdi_list), bdi->wb.state); Is it worth showing a list count here rather than list_empty() state? Brian > > rcu_read_unlock(); > -- > 2.30.0 >
on 3/29/2024 9:04 PM, Brian Foster wrote: > On Wed, Mar 27, 2024 at 11:57:47PM +0800, Kemeng Shi wrote: >> /sys/kernel/debug/bdi/xxx/stats is supposed to show writeback information >> of whole bdi, but only writeback information of bdi in root cgroup is >> collected. So writeback information in non-root cgroup are missing now. >> To be more specific, considering following case: >> >> /* create writeback cgroup */ >> cd /sys/fs/cgroup >> echo "+memory +io" > cgroup.subtree_control >> mkdir group1 >> cd group1 >> echo $$ > cgroup.procs >> /* do writeback in cgroup */ >> fio -name test -filename=/dev/vdb ... >> /* get writeback info of bdi */ >> cat /sys/kernel/debug/bdi/xxx/stats >> The cat result unexpectedly implies that there is no writeback on target >> bdi. >> >> Fix this by collecting stats of all wb in bdi instead of only wb in >> root cgroup. >> >> Following domain hierarchy is tested: >> global domain (320G) >> / \ >> cgroup domain1(10G) cgroup domain2(10G) >> | | >> bdi wb1 wb2 >> >> /* all writeback info of bdi is successfully collected */ >> cat stats >> BdiWriteback: 2912 kB >> BdiReclaimable: 1598464 kB >> BdiDirtyThresh: 167479028 kB >> DirtyThresh: 195038532 kB >> BackgroundThresh: 32466728 kB >> BdiDirtied: 19141696 kB >> BdiWritten: 17543456 kB >> BdiWriteBandwidth: 1136172 kBps >> b_dirty: 2 >> b_io: 0 >> b_more_io: 1 >> b_dirty_time: 0 >> bdi_list: 1 >> state: 1 >> >> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> >> --- >> mm/backing-dev.c | 100 +++++++++++++++++++++++++++++++++-------------- >> 1 file changed, 71 insertions(+), 29 deletions(-) >> >> diff --git a/mm/backing-dev.c b/mm/backing-dev.c >> index 70f02959f3bd..8daf950e6855 100644 >> --- a/mm/backing-dev.c >> +++ b/mm/backing-dev.c > ... >> @@ -65,16 +78,54 @@ static struct backing_dev_info *lookup_bdi(struct seq_file *m) >> return NULL; >> } >> >> +static void collect_wb_stats(struct wb_stats *stats, >> + struct bdi_writeback *wb) >> +{ >> + struct inode *inode; >> + >> + spin_lock(&wb->list_lock); >> + list_for_each_entry(inode, &wb->b_dirty, i_io_list) >> + stats->nr_dirty++; >> + list_for_each_entry(inode, &wb->b_io, i_io_list) >> + stats->nr_io++; >> + list_for_each_entry(inode, &wb->b_more_io, i_io_list) >> + stats->nr_more_io++; >> + list_for_each_entry(inode, &wb->b_dirty_time, i_io_list) >> + if (inode->i_state & I_DIRTY_TIME) >> + stats->nr_dirty_time++; >> + spin_unlock(&wb->list_lock); >> + >> + stats->nr_writeback += wb_stat(wb, WB_WRITEBACK); >> + stats->nr_reclaimable += wb_stat(wb, WB_RECLAIMABLE); >> + stats->nr_dirtied += wb_stat(wb, WB_DIRTIED); >> + stats->nr_written += wb_stat(wb, WB_WRITTEN); >> + stats->wb_thresh += wb_calc_thresh(wb, stats->dirty_thresh); > > Kinda nitty question, but is this a sum of per-wb writeback thresholds? > If so, do you consider that useful information vs. the per-wb threshold > data presumably exposed in the next patch? It's sum of per-wb wirteback thresholds in global domain. For each wb, it's threshold is min of threshold in global domain and threshold in cgroup domain (if any). As the debug data of bdi existed before writeback cgroup was introduced, so it would be better to show bdi threshold in global domain which is more compatible. > > I'm not really that worried about what debug data we expose, it just > seems a little odd. How would you document this value in a sentence or > two, for example? I think it could simply be "threshold of bdi in global domain". > >> +} >> + >> +#ifdef CONFIG_CGROUP_WRITEBACK >> +static void bdi_collect_stats(struct backing_dev_info *bdi, >> + struct wb_stats *stats) >> +{ >> + struct bdi_writeback *wb; >> + >> + list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) >> + collect_wb_stats(stats, wb); > > Depending on discussion on the previous patch and whether the higher > level rcu protection in bdi_debug_stats_show() is really necessary, it > might make more sense to move it here. Sure, will do it in next version. > > I'm also wondering if you'd want to check the state of the individual wb > (i.e. WB_registered?) before reading it..? I think it't better to keep full debug info. The user could filter it out with state in debug info anyway. > >> +} >> +#else >> +static void bdi_collect_stats(struct backing_dev_info *bdi, >> + struct wb_stats *stats) >> +{ >> + collect_wb_stats(stats, &bdi->wb); >> +} >> +#endif > ... >> @@ -115,18 +157,18 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) >> "b_dirty_time: %10lu\n" >> "bdi_list: %10u\n" >> "state: %10lx\n", >> - (unsigned long) K(wb_stat(wb, WB_WRITEBACK)), >> - (unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)), >> - K(wb_thresh), >> + K(stats.nr_writeback), >> + K(stats.nr_reclaimable), >> + K(stats.wb_thresh), >> K(dirty_thresh), >> K(background_thresh), >> - (unsigned long) K(wb_stat(wb, WB_DIRTIED)), >> - (unsigned long) K(wb_stat(wb, WB_WRITTEN)), >> - (unsigned long) K(wb->write_bandwidth), >> - nr_dirty, >> - nr_io, >> - nr_more_io, >> - nr_dirty_time, >> + K(stats.nr_dirtied), >> + K(stats.nr_written), >> + K(tot_bw), >> + stats.nr_dirty, >> + stats.nr_io, >> + stats.nr_more_io, >> + stats.nr_dirty_time, >> !list_empty(&bdi->bdi_list), bdi->wb.state); > > Is it worth showing a list count here rather than list_empty() state? Actually, I don't know how this info was supposed to be used. So I keep it in old way for compatibility... As for bdi count, it would be easy to retrieve by counting the bdi number under /sys/kernel/debug/bdi/. As for wb count, it would be easy to count with wb_stats. So I still prefer to keep it in old way for compatibility or just simply remove it if the list_empty() state is not needed. Thansk for the review and all advise. Look forward to your reply. Kemeng > > Brian > >> >> rcu_read_unlock(); >> -- >> 2.30.0 >> >
diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 70f02959f3bd..8daf950e6855 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -39,6 +39,19 @@ struct workqueue_struct *bdi_wq; #include <linux/debugfs.h> #include <linux/seq_file.h> +struct wb_stats { + unsigned long nr_dirty; + unsigned long nr_io; + unsigned long nr_more_io; + unsigned long nr_dirty_time; + unsigned long nr_writeback; + unsigned long nr_reclaimable; + unsigned long nr_dirtied; + unsigned long nr_written; + unsigned long dirty_thresh; + unsigned long wb_thresh; +}; + static struct dentry *bdi_debug_root; static void bdi_debug_init(void) @@ -65,16 +78,54 @@ static struct backing_dev_info *lookup_bdi(struct seq_file *m) return NULL; } +static void collect_wb_stats(struct wb_stats *stats, + struct bdi_writeback *wb) +{ + struct inode *inode; + + spin_lock(&wb->list_lock); + list_for_each_entry(inode, &wb->b_dirty, i_io_list) + stats->nr_dirty++; + list_for_each_entry(inode, &wb->b_io, i_io_list) + stats->nr_io++; + list_for_each_entry(inode, &wb->b_more_io, i_io_list) + stats->nr_more_io++; + list_for_each_entry(inode, &wb->b_dirty_time, i_io_list) + if (inode->i_state & I_DIRTY_TIME) + stats->nr_dirty_time++; + spin_unlock(&wb->list_lock); + + stats->nr_writeback += wb_stat(wb, WB_WRITEBACK); + stats->nr_reclaimable += wb_stat(wb, WB_RECLAIMABLE); + stats->nr_dirtied += wb_stat(wb, WB_DIRTIED); + stats->nr_written += wb_stat(wb, WB_WRITTEN); + stats->wb_thresh += wb_calc_thresh(wb, stats->dirty_thresh); +} + +#ifdef CONFIG_CGROUP_WRITEBACK +static void bdi_collect_stats(struct backing_dev_info *bdi, + struct wb_stats *stats) +{ + struct bdi_writeback *wb; + + list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) + collect_wb_stats(stats, wb); +} +#else +static void bdi_collect_stats(struct backing_dev_info *bdi, + struct wb_stats *stats) +{ + collect_wb_stats(stats, &bdi->wb); +} +#endif static int bdi_debug_stats_show(struct seq_file *m, void *v) { struct backing_dev_info *bdi; - struct bdi_writeback *wb; unsigned long background_thresh; unsigned long dirty_thresh; - unsigned long wb_thresh; - unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time; - struct inode *inode; + struct wb_stats stats; + unsigned long tot_bw; rcu_read_lock(); bdi = lookup_bdi(m); @@ -83,22 +134,13 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) return -EEXIST; } - wb = &bdi->wb; - nr_dirty = nr_io = nr_more_io = nr_dirty_time = 0; - spin_lock(&wb->list_lock); - list_for_each_entry(inode, &wb->b_dirty, i_io_list) - nr_dirty++; - list_for_each_entry(inode, &wb->b_io, i_io_list) - nr_io++; - list_for_each_entry(inode, &wb->b_more_io, i_io_list) - nr_more_io++; - list_for_each_entry(inode, &wb->b_dirty_time, i_io_list) - if (inode->i_state & I_DIRTY_TIME) - nr_dirty_time++; - spin_unlock(&wb->list_lock); - global_dirty_limits(&background_thresh, &dirty_thresh); - wb_thresh = wb_calc_thresh(wb, dirty_thresh); + + memset(&stats, 0, sizeof(stats)); + stats.dirty_thresh = dirty_thresh; + bdi_collect_stats(bdi, &stats); + + tot_bw = atomic_long_read(&bdi->tot_write_bandwidth); seq_printf(m, "BdiWriteback: %10lu kB\n" @@ -115,18 +157,18 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) "b_dirty_time: %10lu\n" "bdi_list: %10u\n" "state: %10lx\n", - (unsigned long) K(wb_stat(wb, WB_WRITEBACK)), - (unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)), - K(wb_thresh), + K(stats.nr_writeback), + K(stats.nr_reclaimable), + K(stats.wb_thresh), K(dirty_thresh), K(background_thresh), - (unsigned long) K(wb_stat(wb, WB_DIRTIED)), - (unsigned long) K(wb_stat(wb, WB_WRITTEN)), - (unsigned long) K(wb->write_bandwidth), - nr_dirty, - nr_io, - nr_more_io, - nr_dirty_time, + K(stats.nr_dirtied), + K(stats.nr_written), + K(tot_bw), + stats.nr_dirty, + stats.nr_io, + stats.nr_more_io, + stats.nr_dirty_time, !list_empty(&bdi->bdi_list), bdi->wb.state); rcu_read_unlock();
/sys/kernel/debug/bdi/xxx/stats is supposed to show writeback information of whole bdi, but only writeback information of bdi in root cgroup is collected. So writeback information in non-root cgroup are missing now. To be more specific, considering following case: /* create writeback cgroup */ cd /sys/fs/cgroup echo "+memory +io" > cgroup.subtree_control mkdir group1 cd group1 echo $$ > cgroup.procs /* do writeback in cgroup */ fio -name test -filename=/dev/vdb ... /* get writeback info of bdi */ cat /sys/kernel/debug/bdi/xxx/stats The cat result unexpectedly implies that there is no writeback on target bdi. Fix this by collecting stats of all wb in bdi instead of only wb in root cgroup. Following domain hierarchy is tested: global domain (320G) / \ cgroup domain1(10G) cgroup domain2(10G) | | bdi wb1 wb2 /* all writeback info of bdi is successfully collected */ cat stats BdiWriteback: 2912 kB BdiReclaimable: 1598464 kB BdiDirtyThresh: 167479028 kB DirtyThresh: 195038532 kB BackgroundThresh: 32466728 kB BdiDirtied: 19141696 kB BdiWritten: 17543456 kB BdiWriteBandwidth: 1136172 kBps b_dirty: 2 b_io: 0 b_more_io: 1 b_dirty_time: 0 bdi_list: 1 state: 1 Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> --- mm/backing-dev.c | 100 +++++++++++++++++++++++++++++++++-------------- 1 file changed, 71 insertions(+), 29 deletions(-)