Message ID | 1470769707-26079-3-git-send-email-jbacik@fb.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Tue 09-08-16 15:08:27, Josef Bacik wrote: > Provide a mechanism for file systems to indicate how much dirty metadata they > are holding. This introduces a few things > > 1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY. > 2) WB stat for dirty metadata. This way we know if we need to try and call into > the file system to write out metadata. This could potentially be used in the > future to make balancing of dirty pages smarter. > 3) A super callback to handle writing back dirty metadata. > > A future patch will take advantage of this work in btrfs. Thanks, Hum, I once had a patch to allow filesystems to hook more into writeback where a filesystem was just asked to do writeback and it could decide what to do with it (it could use generic helpers to essentially replicate what current writeback code does) but it could also choose some smarter strategy of picking inodes to write. This scheme could easily accommodate your metadata writeback as well and there are also other uses for it. But that patch got broken by Tejun's cgroup aware writeback so one would have to start from scratch. We certainly have to think how to integrate this with cgroup aware writeback. I guess your ->writeback_metadata() just does not bother and would write anything in the root cgroup, right? After all you don't even pass the information for which memcg the metadata writeback should be performed down to the fs callback (that is encoded in the bdi_writeback structure). And for now I think we could get away with that although it would need to be handled properly in future I think. If we created a generic filesystem writeback callback as I suggest, proper integration with memcg writeback in unavoidable. But I have to think how to do that best. Honza
On 08/10/2016 06:09 AM, Jan Kara wrote: > On Tue 09-08-16 15:08:27, Josef Bacik wrote: >> Provide a mechanism for file systems to indicate how much dirty metadata they >> are holding. This introduces a few things >> >> 1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY. >> 2) WB stat for dirty metadata. This way we know if we need to try and call into >> the file system to write out metadata. This could potentially be used in the >> future to make balancing of dirty pages smarter. >> 3) A super callback to handle writing back dirty metadata. >> >> A future patch will take advantage of this work in btrfs. Thanks, > > Hum, I once had a patch to allow filesystems to hook more into writeback > where a filesystem was just asked to do writeback and it could decide what > to do with it (it could use generic helpers to essentially replicate what > current writeback code does) but it could also choose some smarter strategy > of picking inodes to write. This scheme could easily accommodate your > metadata writeback as well and there are also other uses for it. But that > patch got broken by Tejun's cgroup aware writeback so one would have to > start from scratch. > > We certainly have to think how to integrate this with cgroup aware > writeback. I guess your ->writeback_metadata() just does not bother and would > write anything in the root cgroup, right? After all you don't even pass the > information for which memcg the metadata writeback should be performed down > to the fs callback (that is encoded in the bdi_writeback structure). And > for now I think we could get away with that although it would need to be > handled properly in future I think. > I thought about this some but I'm not sure how to work it out so it's sane. Currently no other file system's metadata is covered by the writeback cgroup. Btrfs is simply by accident, we have an inode where all of our metadata is attached. This doesn't make a whole lot of sense as the inode is tied to whichever task dirited it last, so you are going to end up with weird writeback behavior on btrfs metadata if you are using writeback cgroups. I think removing this capability for now is actually better overall so we can come up with a different solution. > If we created a generic filesystem writeback callback as I suggest, proper > integration with memcg writeback in unavoidable. But I have to think how to > do that best. So the reason I'm doing this is because the last time I tried to kill our btree inode I got bogged down trying to reproduce our own special writeback logic for metadata. I basically constantly oom'ed the box because we'd fill up memory with dirty metadata, and then I started just wholesale copying mm/page-writeback.c and mm/fs-writeback.c to try and stop the madness and gave up because that was just as crazy. I think that having writeback a little more modularized so file systems can be smarter about picking inodes if they want is a good long term goal, but for now I'd like to get this work in so I can go about killing our fs wide inode. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello, Josef. On Tue, Aug 09, 2016 at 03:08:27PM -0400, Josef Bacik wrote: > Provide a mechanism for file systems to indicate how much dirty metadata they > are holding. This introduces a few things > > 1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY. > 2) WB stat for dirty metadata. This way we know if we need to try and call into > the file system to write out metadata. This could potentially be used in the > future to make balancing of dirty pages smarter. > 3) A super callback to handle writing back dirty metadata. > > A future patch will take advantage of this work in btrfs. Thanks, > > Signed-off-by: Josef Bacik <jbacik@fb.com> ... > + if (!done && wb_stat(wb, WB_METADATA_DIRTY)) { > + struct list_head list; > + > + spin_unlock(&wb->list_lock); > + spin_lock(&wb->bdi->sb_list_lock); > + list_splice_init(&wb->bdi->sb_list, &list); > + while (!list_empty(&list)) { > + struct super_block *sb; > + > + sb = list_first_entry(&list, struct super_block, > + s_bdi_list); > + list_move_tail(&wb->bdi->sb_list, &sb->s_bdi_list); It bothers me a bit that sb's can actually be off bdi->sb_list while sb_list_lock is released. Can we make this explicit? e.g. keep separate bdi sb list for sb's pending metadata writeout (like b_dirty) or by just walking the list in a smart way? > + if (!sb->s_op->write_metadata) > + continue; > + if (!trylock_super(sb)) > + continue; > + spin_unlock(&wb->bdi->sb_list_lock); > + wrote += writeback_sb_metadata(sb, wb, work); > + spin_lock(&wb->bdi->sb_list_lock); > + up_read(&sb->s_umount); > } > + spin_unlock(&wb->bdi->sb_list_lock); > + spin_lock(&wb->list_lock); > } > /* Leave any unwritten inodes on b_io */ > return wrote; > diff --git a/fs/super.c b/fs/super.c > index c2ff475..940696d 100644 > --- a/fs/super.c > +++ b/fs/super.c > @@ -215,6 +215,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags, > spin_lock_init(&s->s_inode_list_lock); > INIT_LIST_HEAD(&s->s_inodes_wb); > spin_lock_init(&s->s_inode_wblist_lock); > + INIT_LIST_HEAD(&s->s_bdi_list); I can't find where sb's are actually added to the list. Is that supposed to happen from specific filesystems? Also, it might make sense to split up additions of sb_list and stat into separate patches. Thanks.
On 08/10/2016 04:12 PM, Tejun Heo wrote: > Hello, Josef. > > On Tue, Aug 09, 2016 at 03:08:27PM -0400, Josef Bacik wrote: >> Provide a mechanism for file systems to indicate how much dirty metadata they >> are holding. This introduces a few things >> >> 1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY. >> 2) WB stat for dirty metadata. This way we know if we need to try and call into >> the file system to write out metadata. This could potentially be used in the >> future to make balancing of dirty pages smarter. >> 3) A super callback to handle writing back dirty metadata. >> >> A future patch will take advantage of this work in btrfs. Thanks, >> >> Signed-off-by: Josef Bacik <jbacik@fb.com> > ... >> + if (!done && wb_stat(wb, WB_METADATA_DIRTY)) { >> + struct list_head list; >> + >> + spin_unlock(&wb->list_lock); >> + spin_lock(&wb->bdi->sb_list_lock); >> + list_splice_init(&wb->bdi->sb_list, &list); >> + while (!list_empty(&list)) { >> + struct super_block *sb; >> + >> + sb = list_first_entry(&list, struct super_block, >> + s_bdi_list); >> + list_move_tail(&wb->bdi->sb_list, &sb->s_bdi_list); > > It bothers me a bit that sb's can actually be off bdi->sb_list while > sb_list_lock is released. Can we make this explicit? e.g. keep > separate bdi sb list for sb's pending metadata writeout (like b_dirty) > or by just walking the list in a smart way? > Yeah I wasn't super thrilled with this either, but since I'm only using it for writeback I figured it was ok for now. I suppose I could make it less sb_list and just make it dirty_sb_list, and only add if the super says it has dirty metadata. Does that sound better? Then us being off of the list during writeback isn't that big of a deal. >> + if (!sb->s_op->write_metadata) >> + continue; >> + if (!trylock_super(sb)) >> + continue; >> + spin_unlock(&wb->bdi->sb_list_lock); >> + wrote += writeback_sb_metadata(sb, wb, work); >> + spin_lock(&wb->bdi->sb_list_lock); >> + up_read(&sb->s_umount); >> } >> + spin_unlock(&wb->bdi->sb_list_lock); >> + spin_lock(&wb->list_lock); >> } >> /* Leave any unwritten inodes on b_io */ >> return wrote; >> diff --git a/fs/super.c b/fs/super.c >> index c2ff475..940696d 100644 >> --- a/fs/super.c >> +++ b/fs/super.c >> @@ -215,6 +215,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags, >> spin_lock_init(&s->s_inode_list_lock); >> INIT_LIST_HEAD(&s->s_inodes_wb); >> spin_lock_init(&s->s_inode_wblist_lock); >> + INIT_LIST_HEAD(&s->s_bdi_list); > > I can't find where sb's are actually added to the list. Is that > supposed to happen from specific filesystems? Also, it might make > sense to split up additions of sb_list and stat into separate patches. > I was going to be lazy and only add it if we cared about writing out metadata, and then we could expand it to everybody if somebody else had a usecase. But I think maybe the b_sb_dirty list idea is a better fit overall so I can just do that so it makes more sense. I can split up these patches as well, thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello, Josef. On Wed, Aug 10, 2016 at 05:16:03PM -0400, Josef Bacik wrote: > > It bothers me a bit that sb's can actually be off bdi->sb_list while > > sb_list_lock is released. Can we make this explicit? e.g. keep > > separate bdi sb list for sb's pending metadata writeout (like b_dirty) > > or by just walking the list in a smart way? > > > > Yeah I wasn't super thrilled with this either, but since I'm only using it > for writeback I figured it was ok for now. I suppose I could make it less > sb_list and just make it dirty_sb_list, and only add if the super says it > has dirty metadata. Does that sound better? Then us being off of the list > during writeback isn't that big of a deal. Yeah, that feels more logical to me. > > I can't find where sb's are actually added to the list. Is that > > supposed to happen from specific filesystems? Also, it might make > > sense to split up additions of sb_list and stat into separate patches. > > > > I was going to be lazy and only add it if we cared about writing out > metadata, and then we could expand it to everybody if somebody else had a > usecase. But I think maybe the b_sb_dirty list idea is a better fit overall > so I can just do that so it makes more sense. I can split up these patches > as well, thanks, b_sb_dirty list sounds good and we wouldn't have to worry too much about lifetime issues either as the list would be linked iff the sb is dirty. It might still make sense to ensure that the list is unlinked when sb is putting the bdi tho. Thanks.
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 56c8fda..c670d45 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1615,11 +1615,36 @@ static long writeback_sb_inodes(struct super_block *sb, return wrote; } +static long writeback_sb_metadata(struct super_block *sb, + struct bdi_writeback *wb, + struct wb_writeback_work *work) +{ + struct writeback_control wbc = { + .sync_mode = work->sync_mode, + .tagged_writepages = work->tagged_writepages, + .for_kupdate = work->for_kupdate, + .for_background = work->for_background, + .for_sync = work->for_sync, + .range_cyclic = work->range_cyclic, + .range_start = 0, + .range_end = LLONG_MAX, + }; + long write_chunk; + + write_chunk = writeback_chunk_size(wb, work); + wbc.nr_to_write = write_chunk; + sb->s_op->write_metadata(sb, &wbc); + work->nr_pages -= write_chunk - wbc.nr_to_write; + + return write_chunk - wbc.nr_to_write; +} + static long __writeback_inodes_wb(struct bdi_writeback *wb, struct wb_writeback_work *work) { unsigned long start_time = jiffies; long wrote = 0; + bool done = false; while (!list_empty(&wb->b_io)) { struct inode *inode = wb_inode(wb->b_io.prev); @@ -1639,11 +1664,37 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb, /* refer to the same tests at the end of writeback_sb_inodes */ if (wrote) { - if (time_is_before_jiffies(start_time + HZ / 10UL)) - break; - if (work->nr_pages <= 0) + if (time_is_before_jiffies(start_time + HZ / 10UL) || + work->nr_pages <= 0) { + done = true; break; + } + } + } + + if (!done && wb_stat(wb, WB_METADATA_DIRTY)) { + struct list_head list; + + spin_unlock(&wb->list_lock); + spin_lock(&wb->bdi->sb_list_lock); + list_splice_init(&wb->bdi->sb_list, &list); + while (!list_empty(&list)) { + struct super_block *sb; + + sb = list_first_entry(&list, struct super_block, + s_bdi_list); + list_move_tail(&wb->bdi->sb_list, &sb->s_bdi_list); + if (!sb->s_op->write_metadata) + continue; + if (!trylock_super(sb)) + continue; + spin_unlock(&wb->bdi->sb_list_lock); + wrote += writeback_sb_metadata(sb, wb, work); + spin_lock(&wb->bdi->sb_list_lock); + up_read(&sb->s_umount); } + spin_unlock(&wb->bdi->sb_list_lock); + spin_lock(&wb->list_lock); } /* Leave any unwritten inodes on b_io */ return wrote; diff --git a/fs/super.c b/fs/super.c index c2ff475..940696d 100644 --- a/fs/super.c +++ b/fs/super.c @@ -215,6 +215,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags, spin_lock_init(&s->s_inode_list_lock); INIT_LIST_HEAD(&s->s_inodes_wb); spin_lock_init(&s->s_inode_wblist_lock); + INIT_LIST_HEAD(&s->s_bdi_list); if (list_lru_init_memcg(&s->s_dentry_lru)) goto fail; diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 3f10307..0e731a3 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -34,6 +34,7 @@ typedef int (congested_fn)(void *, int); enum wb_stat_item { WB_RECLAIMABLE, WB_WRITEBACK, + WB_METADATA_DIRTY, WB_DIRTIED, WB_WRITTEN, NR_WB_STAT_ITEMS @@ -166,6 +167,8 @@ struct backing_dev_info { struct timer_list laptop_mode_wb_timer; + spinlock_t sb_list_lock; + struct list_head sb_list; #ifdef CONFIG_DEBUG_FS struct dentry *debug_dir; struct dentry *debug_stats; diff --git a/include/linux/fs.h b/include/linux/fs.h index f3f0b4c8..015b06d 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1430,6 +1430,8 @@ struct super_block { spinlock_t s_inode_wblist_lock; struct list_head s_inodes_wb; /* writeback inodes */ + + struct list_head s_bdi_list; }; /* Helper functions so that in most cases filesystems will @@ -1805,6 +1807,8 @@ struct super_operations { struct shrink_control *); long (*free_cached_objects)(struct super_block *, struct shrink_control *); + long (*write_metadata)(struct super_block *sb, + struct writeback_control *wbc); }; /* diff --git a/include/linux/mm.h b/include/linux/mm.h index 08ed53e..0e90fde 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -31,6 +31,7 @@ struct file_ra_state; struct user_struct; struct writeback_control; struct bdi_writeback; +struct backing_dev_info; #ifndef CONFIG_NEED_MULTIPLE_NODES /* Don't use mapnrs, do it properly */ extern unsigned long max_mapnr; @@ -1363,6 +1364,9 @@ int redirty_page_for_writepage(struct writeback_control *wbc, void account_page_dirtied(struct page *page, struct address_space *mapping); void account_page_cleaned(struct page *page, struct address_space *mapping, struct bdi_writeback *wb); +void account_metadata_dirtied(struct page *page, struct backing_dev_info *bdi); +void account_metadata_cleaned(struct page *page, struct backing_dev_info *bdi); +void account_metadata_writeback(struct page *page, struct backing_dev_info *bdi); int set_page_dirty(struct page *page); int set_page_dirty_lock(struct page *page); void cancel_dirty_page(struct page *page); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index f2e4e90..c4177ef 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -167,6 +167,7 @@ enum node_stat_item { NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */ NR_DIRTIED, /* page dirtyings since bootup */ NR_WRITTEN, /* page writings since bootup */ + NR_METADATA_DIRTY, /* Metadata dirty pages */ NR_VM_NODE_STAT_ITEMS }; diff --git a/mm/backing-dev.c b/mm/backing-dev.c index efe2377..0f1692c 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -78,6 +78,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) "BackgroundThresh: %10lu kB\n" "BdiDirtied: %10lu kB\n" "BdiWritten: %10lu kB\n" + "BdiMetadataDirty: %10lu kB\n" "BdiWriteBandwidth: %10lu kBps\n" "b_dirty: %10lu\n" "b_io: %10lu\n" @@ -92,6 +93,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) K(background_thresh), (unsigned long) K(wb_stat(wb, WB_DIRTIED)), (unsigned long) K(wb_stat(wb, WB_WRITTEN)), + (unsigned long) K(wb_stat(wb, WB_METADATA_DIRTY)), (unsigned long) K(wb->write_bandwidth), nr_dirty, nr_io, @@ -780,6 +782,7 @@ int bdi_init(struct backing_dev_info *bdi) bdi->max_prop_frac = FPROP_FRAC_BASE; INIT_LIST_HEAD(&bdi->bdi_list); INIT_LIST_HEAD(&bdi->wb_list); + INIT_LIST_HEAD(&bdi->sb_list); init_waitqueue_head(&bdi->wb_waitq); ret = cgwb_bdi_init(bdi); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 121a6e3..c6492de1a 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -506,6 +506,7 @@ bool node_dirty_ok(struct pglist_data *pgdat) nr_pages += node_page_state(pgdat, NR_FILE_DIRTY); nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS); nr_pages += node_page_state(pgdat, NR_WRITEBACK); + nr_pages += node_page_state(pgdat, NR_METADATA_DIRTY); return nr_pages <= limit; } @@ -1595,7 +1596,8 @@ static void balance_dirty_pages(struct bdi_writeback *wb, * been flushed to permanent storage. */ nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) + - global_node_page_state(NR_UNSTABLE_NFS); + global_node_page_state(NR_UNSTABLE_NFS) + + global_node_page_state(NR_METADATA_DIRTY); gdtc->avail = global_dirtyable_memory(); gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK); @@ -1935,7 +1937,8 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb) */ gdtc->avail = global_dirtyable_memory(); gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) + - global_node_page_state(NR_UNSTABLE_NFS); + global_node_page_state(NR_UNSTABLE_NFS) + + global_node_page_state(NR_METADATA_DIRTY); domain_dirty_limits(gdtc); if (gdtc->dirty > gdtc->bg_thresh) @@ -2009,7 +2012,8 @@ void laptop_mode_timer_fn(unsigned long data) { struct request_queue *q = (struct request_queue *)data; int nr_pages = global_node_page_state(NR_FILE_DIRTY) + - global_node_page_state(NR_UNSTABLE_NFS); + global_node_page_state(NR_UNSTABLE_NFS) + + global_node_page_state(NR_METADATA_DIRTY); struct bdi_writeback *wb; /* @@ -2473,6 +2477,74 @@ void account_page_dirtied(struct page *page, struct address_space *mapping) EXPORT_SYMBOL(account_page_dirtied); /* + * account_metadata_dirtied + * @page - the page being dirited + * @bdi - the bdi that owns this page + * + * Do the dirty page accounting for metadata pages that aren't backed by an + * address_space. + */ +void account_metadata_dirtied(struct page *page, struct backing_dev_info *bdi) +{ + unsigned long flags; + + local_irq_save(flags); + __inc_node_page_state(page, NR_METADATA_DIRTY); + __inc_node_page_state(page, NR_FILE_DIRTY); + __inc_zone_page_state(page, NR_ZONE_WRITE_PENDING); + __inc_node_page_state(page, NR_DIRTIED); + __inc_wb_stat(&bdi->wb, WB_RECLAIMABLE); + __inc_wb_stat(&bdi->wb, WB_DIRTIED); + __inc_wb_stat(&bdi->wb, WB_METADATA_DIRTY); + current->nr_dirtied++; + task_io_account_write(PAGE_SIZE); + this_cpu_inc(bdp_ratelimits); + local_irq_restore(flags); +} +EXPORT_SYMBOL(account_metadata_dirtied); + +/* + * account_metadata_cleaned + * @page - the page being cleaned + * @bdi - the bdi that owns this page + * + * Called on a no longer dirty metadata page. + */ +void account_metadata_cleaned(struct page *page, struct backing_dev_info *bdi) +{ + unsigned long flags; + + local_irq_save(flags); + __dec_node_page_state(page, NR_METADATA_DIRTY); + __dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); + __dec_wb_stat(&bdi->wb, WB_RECLAIMABLE); + __dec_wb_stat(&bdi->wb, WB_METADATA_DIRTY); + task_io_account_cancelled_write(PAGE_SIZE); + local_irq_restore(flags); +} +EXPORT_SYMBOL(account_metadata_cleaned); + +/* + * account_metadata_writeback + * @page - the page being marked as writeback + * @bdi - the bdi that owns this page + * + * Called on a metadata page that has been marked writeback. + */ +void account_metadata_writeback(struct page *page, struct backing_dev_info *bdi) +{ + unsigned long flags; + + local_irq_save(flags); + __inc_wb_stat(&bdi->wb, WB_WRITEBACK); + __dec_wb_stat(&bdi->wb, WB_METADATA_DIRTY); + __inc_node_page_state(page, NR_WRITEBACK); + __dec_node_page_state(page, NR_METADATA_DIRTY); + local_irq_restore(flags); +} +EXPORT_SYMBOL(account_metadata_writeback); + +/* * Helper function for deaccounting dirty page without writeback. * * Caller must hold lock_page_memcg().
Provide a mechanism for file systems to indicate how much dirty metadata they are holding. This introduces a few things 1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY. 2) WB stat for dirty metadata. This way we know if we need to try and call into the file system to write out metadata. This could potentially be used in the future to make balancing of dirty pages smarter. 3) A super callback to handle writing back dirty metadata. A future patch will take advantage of this work in btrfs. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> --- fs/fs-writeback.c | 57 +++++++++++++++++++++++++++-- fs/super.c | 1 + include/linux/backing-dev-defs.h | 3 ++ include/linux/fs.h | 4 +++ include/linux/mm.h | 4 +++ include/linux/mmzone.h | 1 + mm/backing-dev.c | 3 ++ mm/page-writeback.c | 78 ++++++++++++++++++++++++++++++++++++++-- 8 files changed, 145 insertions(+), 6 deletions(-)