Message ID: 20210526222557.3118114-3-guro@fb.com
State:      New, archived
Series:     cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups
On Wed, 26 May 2021 15:25:57 -0700 Roman Gushchin wrote:
>+
>+        if (!list_empty(&offline_cgwbs))
>+                schedule_work(&cleanup_offline_cgwbs_work);
>+

Good work overall.

Nit: given the cond_resched_lock() in cleanup_offline_wb(), what you need
instead is

        queue_work(system_unbound_wq, &cleanup_offline_cgwbs_work);
On Wed 26-05-21 15:25:57, Roman Gushchin wrote:
> Asynchronously try to release dying cgwbs by switching clean attached
> inodes to the bdi's wb. It helps to get rid of per-cgroup writeback
> structures themselves and of pinned memory and block cgroups, which
> are way larger structures (mostly due to large per-cpu statistics
> data). It helps to prevent memory waste and different scalability
> problems caused by large piles of dying cgroups.
>
> A cgwb cleanup operation can fail due to different reasons (e.g. the
> cgwb has in-flight/pending io, an attached inode is locked or isn't
> clean, etc). In this case the next scheduled cleanup will make a new
> attempt. An attempt is made each time a new cgwb is offlined (in other
> words a memcg and/or a blkcg is deleted by a user). In the future an
> additional attempt scheduled by a timer can be implemented.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  fs/fs-writeback.c                | 35 ++++++++++++++++++
>  include/linux/backing-dev-defs.h |  1 +
>  include/linux/writeback.h        |  1 +
>  mm/backing-dev.c                 | 61 ++++++++++++++++++++++++++++++--
>  4 files changed, 96 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 631ef6366293..8fbcd50844f0 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -577,6 +577,41 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
>          kfree(isw);
>  }
>
> +/**
> + * cleanup_offline_wb - detach associated clean inodes
> + * @wb: target wb
> + *
> + * Switch the inode->i_wb pointer of the attached inodes to the bdi's wb and
> + * drop the corresponding per-cgroup wb's reference. Skip inodes which are
> + * dirty, freeing, in the active writeback process or are in any way busy.

I think the comment doesn't match the function anymore.

> + */
> +void cleanup_offline_wb(struct bdi_writeback *wb)
> +{
> +        struct inode *inode, *tmp;
> +
> +        spin_lock(&wb->list_lock);
> +restart:
> +        list_for_each_entry_safe(inode, tmp, &wb->b_attached, i_io_list) {
> +                if (!spin_trylock(&inode->i_lock))
> +                        continue;
> +                xa_lock_irq(&inode->i_mapping->i_pages);
> +                if ((inode->i_state & I_REFERENCED) != I_REFERENCED) {

Why the I_REFERENCED check here? That's just an inode aging bit and I have
a hard time seeing how it would relate to whether the inode should switch
wbs...

> +                        struct bdi_writeback *bdi_wb = &inode_to_bdi(inode)->wb;
> +
> +                        WARN_ON_ONCE(inode->i_wb != wb);
> +
> +                        inode->i_wb = bdi_wb;
> +                        list_del_init(&inode->i_io_list);
> +                        wb_put(wb);

I was kind of hoping you'll use some variant of inode_switch_wbs() here.
That way we have a single function handling all the subtleties of switching
inode->i_wb of an active inode. Maybe it isn't strictly needed here because
you detach only from the b_attached list and move to bdi_wb so things are
indeed simpler here. But you definitely miss transferring the WB_WRITEBACK
stat and I'd also like to have a comment here explaining why this cannot
race with other writeback handling or wb switching in a harmful way.

> +                }
> +                xa_unlock_irq(&inode->i_mapping->i_pages);
> +                spin_unlock(&inode->i_lock);
> +                if (cond_resched_lock(&wb->list_lock))
> +                        goto restart;
> +        }
> +        spin_unlock(&wb->list_lock);
> +}
> +
>  /**
>   * wbc_attach_and_unlock_inode - associate wbc with target inode and unlock it
>   * @wbc: writeback_control of interest
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index e5dc238ebe4f..07d6b6d6dbdf 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -155,6 +155,7 @@ struct bdi_writeback {
>          struct list_head memcg_node;    /* anchored at memcg->cgwb_list */
>          struct list_head blkcg_node;    /* anchored at blkcg->cgwb_list */
>          struct list_head b_attached;    /* attached inodes, protected by list_lock */
> +        struct list_head offline_node;  /* anchored at offline_cgwbs */
>
>          union {
>                  struct work_struct release_work;
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 572a13c40c90..922f15fe6ad4 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -222,6 +222,7 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
>  int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr_pages,
>                             enum wb_reason reason, struct wb_completion *done);
>  void cgroup_writeback_umount(void);
> +void cleanup_offline_wb(struct bdi_writeback *wb);
>
>  /**
>   * inode_attach_wb - associate an inode with its wb
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 54c5dc4b8c24..92a00bcaa504 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -371,12 +371,16 @@ static void wb_exit(struct bdi_writeback *wb)
>  #include <linux/memcontrol.h>
>
>  /*
> - * cgwb_lock protects bdi->cgwb_tree, blkcg->cgwb_list, and memcg->cgwb_list.
> - * bdi->cgwb_tree is also RCU protected.
> + * cgwb_lock protects bdi->cgwb_tree, blkcg->cgwb_list, offline_cgwbs and
> + * memcg->cgwb_list. bdi->cgwb_tree is also RCU protected.
>  */
>  static DEFINE_SPINLOCK(cgwb_lock);
>  static struct workqueue_struct *cgwb_release_wq;
>
> +static LIST_HEAD(offline_cgwbs);
> +static void cleanup_offline_cgwbs_workfn(struct work_struct *work);
> +static DECLARE_WORK(cleanup_offline_cgwbs_work, cleanup_offline_cgwbs_workfn);
> +
>  static void cgwb_release_workfn(struct work_struct *work)
>  {
>          struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
> @@ -395,6 +399,7 @@ static void cgwb_release_workfn(struct work_struct *work)
>
>          fprop_local_destroy_percpu(&wb->memcg_completions);
>          percpu_ref_exit(&wb->refcnt);
> +        WARN_ON(!list_empty(&wb->offline_node));

Hum, cannot this happen when the wb originally had some attached inodes, we
added it to offline_cgwbs, but then normal inode reclaim cleaned all the
inodes (and thus all wb refs were dropped) before
cleanup_offline_cgwbs_workfn() was executed? So either the offline_cgwbs
list has to hold its own wb ref or we have to remove the cgwb from the list
in cgwb_release_workfn().

>          wb_exit(wb);
>          WARN_ON_ONCE(!list_empty(&wb->b_attached));
>          kfree_rcu(wb, rcu);
> @@ -414,6 +419,10 @@ static void cgwb_kill(struct bdi_writeback *wb)
>          WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id));
>          list_del(&wb->memcg_node);
>          list_del(&wb->blkcg_node);
> +        if (!list_empty(&wb->b_attached))
> +                list_add(&wb->offline_node, &offline_cgwbs);
> +        else
> +                INIT_LIST_HEAD(&wb->offline_node);
>          percpu_ref_kill(&wb->refcnt);
>  }
>
> @@ -635,6 +644,50 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
>          mutex_unlock(&bdi->cgwb_release_mutex);
>  }
>
> +/**
> + * cleanup_offline_cgwbs - try to release dying cgwbs
> + *
> + * Try to release dying cgwbs by switching attached inodes to the wb
> + * belonging to the root memory cgroup. Processed wbs are placed at the
> + * end of the list to guarantee the forward progress.
> + *
> + * Should be called with the acquired cgwb_lock lock, which might
> + * be released and re-acquired in the process.
> + */
> +static void cleanup_offline_cgwbs_workfn(struct work_struct *work)
> +{
> +        struct bdi_writeback *wb;
> +        LIST_HEAD(processed);
> +
> +        spin_lock_irq(&cgwb_lock);
> +
> +        while (!list_empty(&offline_cgwbs)) {
> +                wb = list_first_entry(&offline_cgwbs, struct bdi_writeback,
> +                                      offline_node);
> +                list_move(&wb->offline_node, &processed);
> +
> +                if (wb_has_dirty_io(wb))
> +                        continue;
> +
> +                if (!percpu_ref_tryget(&wb->refcnt))
> +                        continue;
> +
> +                spin_unlock_irq(&cgwb_lock);
> +                cleanup_offline_wb(wb);
> +                spin_lock_irq(&cgwb_lock);
> +
> +                if (list_empty(&wb->b_attached))
> +                        list_del_init(&wb->offline_node);

But the cgwb can still have inodes in its dirty lists which will eventually
move to b_attached. So you can delete the cgwb here prematurely, cannot you?

> +
> +                wb_put(wb);
> +        }
> +
> +        if (!list_empty(&processed))
> +                list_splice_tail(&processed, &offline_cgwbs);
> +
> +        spin_unlock_irq(&cgwb_lock);
> +}
> +
>  /**
>   * wb_memcg_offline - kill all wb's associated with a memcg being offlined
>   * @memcg: memcg being offlined
> @@ -650,6 +703,10 @@ void wb_memcg_offline(struct mem_cgroup *memcg)
>          list_for_each_entry_safe(wb, next, memcg_cgwb_list, memcg_node)
>                  cgwb_kill(wb);
>          memcg_cgwb_list->next = NULL;   /* prevent new wb's */
> +
> +        if (!list_empty(&offline_cgwbs))
> +                schedule_work(&cleanup_offline_cgwbs_work);
> +
>          spin_unlock_irq(&cgwb_lock);

                                                                Honza
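Of the two options named above, removing the cgwb from offline_cgwbs in the
release path could look roughly like this (an illustrative sketch against
the posted patch, not code from the series; the WARN_ON then becomes
unnecessary):

        --- a/mm/backing-dev.c
        +++ b/mm/backing-dev.c
        @@ static void cgwb_release_workfn(struct work_struct *work)
                 fprop_local_destroy_percpu(&wb->memcg_completions);
        +
        +       /*
        +        * Drop the wb from offline_cgwbs if the cleanup work never
        +        * got to it: inode reclaim may have released all attached
        +        * inodes (and thus all wb refs) first.
        +        */
        +       spin_lock_irq(&cgwb_lock);
        +       list_del_init(&wb->offline_node);
        +       spin_unlock_irq(&cgwb_lock);
        +
                 percpu_ref_exit(&wb->refcnt);
        -       WARN_ON(!list_empty(&wb->offline_node));
                 wb_exit(wb);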
On Thu, May 27, 2021 at 11:24:59AM +0800, Hillf Danton wrote:
> On Wed, 26 May 2021 15:25:57 -0700 Roman Gushchin wrote:
> >+
> >+        if (!list_empty(&offline_cgwbs))
> >+                schedule_work(&cleanup_offline_cgwbs_work);
> >+
>
> Good work overall.

Thanks!

> Nit: given the cond_resched_lock() in cleanup_offline_wb(), what you need
> instead is
>
>         queue_work(system_unbound_wq, &cleanup_offline_cgwbs_work);

Neat, will do in the next version. Thanks!
On Thu, May 27, 2021 at 01:24:03PM +0200, Jan Kara wrote:
> On Wed 26-05-21 15:25:57, Roman Gushchin wrote:
> > Asynchronously try to release dying cgwbs by switching clean attached
> > inodes to the bdi's wb. It helps to get rid of per-cgroup writeback
> > structures themselves and of pinned memory and block cgroups, which
> > are way larger structures (mostly due to large per-cpu statistics
> > data). It helps to prevent memory waste and different scalability
> > problems caused by large piles of dying cgroups.
> >
> > A cgwb cleanup operation can fail due to different reasons (e.g. the
> > cgwb has in-flight/pending io, an attached inode is locked or isn't
> > clean, etc). In this case the next scheduled cleanup will make a new
> > attempt. An attempt is made each time a new cgwb is offlined (in other
> > words a memcg and/or a blkcg is deleted by a user). In the future an
> > additional attempt scheduled by a timer can be implemented.
> >
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > ---
> >  fs/fs-writeback.c                | 35 ++++++++++++++++++
> >  include/linux/backing-dev-defs.h |  1 +
> >  include/linux/writeback.h        |  1 +
> >  mm/backing-dev.c                 | 61 ++++++++++++++++++++++++++++++--
> >  4 files changed, 96 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 631ef6366293..8fbcd50844f0 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -577,6 +577,41 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
> >          kfree(isw);
> >  }
> >
> > +/**
> > + * cleanup_offline_wb - detach associated clean inodes
> > + * @wb: target wb
> > + *
> > + * Switch the inode->i_wb pointer of the attached inodes to the bdi's wb and
> > + * drop the corresponding per-cgroup wb's reference. Skip inodes which are
> > + * dirty, freeing, in the active writeback process or are in any way busy.
>
> I think the comment doesn't match the function anymore.
>
> > + */
> > +void cleanup_offline_wb(struct bdi_writeback *wb)
> > +{
> > +        struct inode *inode, *tmp;
> > +
> > +        spin_lock(&wb->list_lock);
> > +restart:
> > +        list_for_each_entry_safe(inode, tmp, &wb->b_attached, i_io_list) {
> > +                if (!spin_trylock(&inode->i_lock))
> > +                        continue;
> > +                xa_lock_irq(&inode->i_mapping->i_pages);
> > +                if ((inode->i_state & I_REFERENCED) != I_REFERENCED) {
>
> Why the I_REFERENCED check here? That's just an inode aging bit and I have
> a hard time seeing how it would relate to whether the inode should switch
> wbs...

What I tried to say (and failed :) ) was that I_REFERENCED is the only
accepted flag here. So there must be

        if ((inode->i_state | I_REFERENCED) != I_REFERENCED)

Does this look good or am I wrong and there are other flags acceptable here?

> > +                        struct bdi_writeback *bdi_wb = &inode_to_bdi(inode)->wb;
> > +
> > +                        WARN_ON_ONCE(inode->i_wb != wb);
> > +
> > +                        inode->i_wb = bdi_wb;
> > +                        list_del_init(&inode->i_io_list);
> > +                        wb_put(wb);
>
> I was kind of hoping you'll use some variant of inode_switch_wbs() here.

My reasoning was that by definition inode_switch_wbs() handles dirty inodes,
while in the cleanup case we can deal only with clean inodes and clean wb's.
Hopefully this can make the whole procedure simpler/cheaper. Also, the number
of simultaneous switches is limited and I don't think cleanups should share
this limit. However I agree that it would be nice to share at least some
code.

> That way we have a single function handling all the subtleties of switching
> inode->i_wb of an active inode. Maybe it isn't strictly needed here because
> you detach only from the b_attached list and move to bdi_wb so things are
> indeed simpler here. But you definitely miss transferring the WB_WRITEBACK
> stat and I'd also like to have a comment here explaining why this cannot
> race with other writeback handling or wb switching in a harmful way.

If we check under wb->list_lock that the wb has no inodes on any writeback
lists (excluding b_attached), doesn't it mean that WB_WRITEBACK must be 0?

Re racing: my logic here was that we're taking all possible locks before
doing anything and then we check that the inode is entirely clean, so this
must be safe:

        spin_lock(&wb->list_lock);
        spin_trylock(&inode->i_lock);
        xa_lock_irq(&inode->i_mapping->i_pages);
        ...

But now I see that the unlocked inode's wb access mechanism
(unlocked_inode_to_wb_begin()/end()) probably requires additional care.
Repeating the mechanism with scheduling the switching of each inode
separately after an rcu grace period looks too slow. Maybe we can mark all
inodes at once and then switch them all at once, all in two steps. I need to
think more. Do you have any ideas/suggestions here?

> > +                }
> > +                xa_unlock_irq(&inode->i_mapping->i_pages);
> > +                spin_unlock(&inode->i_lock);
> > +                if (cond_resched_lock(&wb->list_lock))
> > +                        goto restart;
> > +        }
> > +        spin_unlock(&wb->list_lock);
> > +}
> > +
> >  /**
> >   * wbc_attach_and_unlock_inode - associate wbc with target inode and unlock it
> >   * @wbc: writeback_control of interest
> > diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> > index e5dc238ebe4f..07d6b6d6dbdf 100644
> > --- a/include/linux/backing-dev-defs.h
> > +++ b/include/linux/backing-dev-defs.h
> > @@ -155,6 +155,7 @@ struct bdi_writeback {
> >          struct list_head memcg_node;    /* anchored at memcg->cgwb_list */
> >          struct list_head blkcg_node;    /* anchored at blkcg->cgwb_list */
> >          struct list_head b_attached;    /* attached inodes, protected by list_lock */
> > +        struct list_head offline_node;  /* anchored at offline_cgwbs */
> >
> >          union {
> >                  struct work_struct release_work;
> > diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> > index 572a13c40c90..922f15fe6ad4 100644
> > --- a/include/linux/writeback.h
> > +++ b/include/linux/writeback.h
> > @@ -222,6 +222,7 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
> >  int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr_pages,
> >                             enum wb_reason reason, struct wb_completion *done);
> >  void cgroup_writeback_umount(void);
> > +void cleanup_offline_wb(struct bdi_writeback *wb);
> >
> >  /**
> >   * inode_attach_wb - associate an inode with its wb
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index 54c5dc4b8c24..92a00bcaa504 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -371,12 +371,16 @@ static void wb_exit(struct bdi_writeback *wb)
> >  #include <linux/memcontrol.h>
> >
> >  /*
> > - * cgwb_lock protects bdi->cgwb_tree, blkcg->cgwb_list, and memcg->cgwb_list.
> > - * bdi->cgwb_tree is also RCU protected.
> > + * cgwb_lock protects bdi->cgwb_tree, blkcg->cgwb_list, offline_cgwbs and
> > + * memcg->cgwb_list. bdi->cgwb_tree is also RCU protected.
> >  */
> >  static DEFINE_SPINLOCK(cgwb_lock);
> >  static struct workqueue_struct *cgwb_release_wq;
> >
> > +static LIST_HEAD(offline_cgwbs);
> > +static void cleanup_offline_cgwbs_workfn(struct work_struct *work);
> > +static DECLARE_WORK(cleanup_offline_cgwbs_work, cleanup_offline_cgwbs_workfn);
> > +
> >  static void cgwb_release_workfn(struct work_struct *work)
> >  {
> >          struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
> > @@ -395,6 +399,7 @@ static void cgwb_release_workfn(struct work_struct *work)
> >
> >          fprop_local_destroy_percpu(&wb->memcg_completions);
> >          percpu_ref_exit(&wb->refcnt);
> > +        WARN_ON(!list_empty(&wb->offline_node));
>
> Hum, cannot this happen when the wb originally had some attached inodes, we
> added it to offline_cgwbs, but then normal inode reclaim cleaned all the
> inodes (and thus all wb refs were dropped) before
> cleanup_offline_cgwbs_workfn() was executed? So either the offline_cgwbs
> list has to hold its own wb ref or we have to remove the cgwb from the list
> in cgwb_release_workfn().

Yes, clearly a bug, thanks for catching!

> >          wb_exit(wb);
> >          WARN_ON_ONCE(!list_empty(&wb->b_attached));
> >          kfree_rcu(wb, rcu);
> > @@ -414,6 +419,10 @@ static void cgwb_kill(struct bdi_writeback *wb)
> >          WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id));
> >          list_del(&wb->memcg_node);
> >          list_del(&wb->blkcg_node);
> > +        if (!list_empty(&wb->b_attached))
> > +                list_add(&wb->offline_node, &offline_cgwbs);
> > +        else
> > +                INIT_LIST_HEAD(&wb->offline_node);
> >          percpu_ref_kill(&wb->refcnt);
> >  }
> >
> > @@ -635,6 +644,50 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
> >          mutex_unlock(&bdi->cgwb_release_mutex);
> >  }
> >
> > +/**
> > + * cleanup_offline_cgwbs - try to release dying cgwbs
> > + *
> > + * Try to release dying cgwbs by switching attached inodes to the wb
> > + * belonging to the root memory cgroup. Processed wbs are placed at the
> > + * end of the list to guarantee the forward progress.
> > + *
> > + * Should be called with the acquired cgwb_lock lock, which might
> > + * be released and re-acquired in the process.
> > + */
> > +static void cleanup_offline_cgwbs_workfn(struct work_struct *work)
> > +{
> > +        struct bdi_writeback *wb;
> > +        LIST_HEAD(processed);
> > +
> > +        spin_lock_irq(&cgwb_lock);
> > +
> > +        while (!list_empty(&offline_cgwbs)) {
> > +                wb = list_first_entry(&offline_cgwbs, struct bdi_writeback,
> > +                                      offline_node);
> > +                list_move(&wb->offline_node, &processed);
> > +
> > +                if (wb_has_dirty_io(wb))
> > +                        continue;
> > +
> > +                if (!percpu_ref_tryget(&wb->refcnt))
> > +                        continue;
> > +
> > +                spin_unlock_irq(&cgwb_lock);
> > +                cleanup_offline_wb(wb);
> > +                spin_lock_irq(&cgwb_lock);
> > +
> > +                if (list_empty(&wb->b_attached))
> > +                        list_del_init(&wb->offline_node);
>
> But the cgwb can still have inodes in its dirty lists which will eventually
> move to b_attached. So you can delete the cgwb here prematurely, cannot you?

Hm, I thought that in this case the wb_has_dirty_io() check above would
fail. But you're right, nothing really protects the wb from being re-dirtied
after the check. At least we need to hold wb->list_lock for the
wb_has_dirty_io(wb) and list_empty(&wb->b_attached) checks... Will fix it.

Thank you, really appreciate your feedback!
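The re-check promised above could take roughly this shape inside
cleanup_offline_cgwbs_workfn() (an illustrative sketch, not the posted
code; whether wb->list_lock may nest inside cgwb_lock here would need
verifying against the rest of the locking):

        spin_unlock_irq(&cgwb_lock);
        cleanup_offline_wb(wb);
        spin_lock_irq(&cgwb_lock);

        /*
         * Re-check emptiness with wb->list_lock held, so a wb that was
         * re-dirtied after the earlier check cannot be deleted from
         * offline_cgwbs prematurely.
         */
        spin_lock(&wb->list_lock);
        if (!wb_has_dirty_io(wb) && list_empty(&wb->b_attached))
                list_del_init(&wb->offline_node);
        spin_unlock(&wb->list_lock);

        wb_put(wb);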
On Thu, May 27, 2021 at 10:48:59AM -0700, Roman Gushchin wrote:
> On Thu, May 27, 2021 at 01:24:03PM +0200, Jan Kara wrote:
> > On Wed 26-05-21 15:25:57, Roman Gushchin wrote:
> > > Asynchronously try to release dying cgwbs by switching clean attached
> > > inodes to the bdi's wb. It helps to get rid of per-cgroup writeback
> > > structures themselves and of pinned memory and block cgroups, which
> > > are way larger structures (mostly due to large per-cpu statistics
> > > data). It helps to prevent memory waste and different scalability
> > > problems caused by large piles of dying cgroups.
> > >
> > > A cgwb cleanup operation can fail due to different reasons (e.g. the
> > > cgwb has in-flight/pending io, an attached inode is locked or isn't
> > > clean, etc). In this case the next scheduled cleanup will make a new
> > > attempt. An attempt is made each time a new cgwb is offlined (in other
> > > words a memcg and/or a blkcg is deleted by a user). In the future an
> > > additional attempt scheduled by a timer can be implemented.
> > >
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > ---
> > >  fs/fs-writeback.c                | 35 ++++++++++++++++++
> > >  include/linux/backing-dev-defs.h |  1 +
> > >  include/linux/writeback.h        |  1 +
> > >  mm/backing-dev.c                 | 61 ++++++++++++++++++++++++++++++--
> > >  4 files changed, 96 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 631ef6366293..8fbcd50844f0 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -577,6 +577,41 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
> > >          kfree(isw);
> > >  }
> > >
> > > +/**
> > > + * cleanup_offline_wb - detach associated clean inodes
> > > + * @wb: target wb
> > > + *
> > > + * Switch the inode->i_wb pointer of the attached inodes to the bdi's wb and
> > > + * drop the corresponding per-cgroup wb's reference. Skip inodes which are
> > > + * dirty, freeing, in the active writeback process or are in any way busy.
> >
> > I think the comment doesn't match the function anymore.
> >
> > > + */
> > > +void cleanup_offline_wb(struct bdi_writeback *wb)
> > > +{
> > > +        struct inode *inode, *tmp;
> > > +
> > > +        spin_lock(&wb->list_lock);
> > > +restart:
> > > +        list_for_each_entry_safe(inode, tmp, &wb->b_attached, i_io_list) {
> > > +                if (!spin_trylock(&inode->i_lock))
> > > +                        continue;
> > > +                xa_lock_irq(&inode->i_mapping->i_pages);
> > > +                if ((inode->i_state & I_REFERENCED) != I_REFERENCED) {
> >
> > Why the I_REFERENCED check here? That's just an inode aging bit and I have
> > a hard time seeing how it would relate to whether the inode should switch
> > wbs...
>
> What I tried to say (and failed :) ) was that I_REFERENCED is the only
> accepted flag here. So there must be
>
>         if ((inode->i_state | I_REFERENCED) != I_REFERENCED)

Sorry, I'm wrong. Must be:

        if ((inode->i_state | I_REFERENCED) == I_REFERENCED) {
                ...
        }

or even simpler:

        if (!(inode->i_state & ~I_REFERENCED)) {
                ...
        }
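To see why the posted check misbehaves and the corrected one does what was
intended, here is a small standalone demonstration (userspace sketch; the
flag values mirror include/linux/fs.h of the v5.13 era and are illustrative
only):

        #include <stdio.h>

        #define I_DIRTY_SYNC    (1 << 0)
        #define I_REFERENCED    (1 << 8)

        /* the posted check: also true for a dirty, unreferenced inode */
        static int posted_check(unsigned int state)
        {
                return (state & I_REFERENCED) != I_REFERENCED;
        }

        /* the corrected check: true only if no flag besides I_REFERENCED is set */
        static int fixed_check(unsigned int state)
        {
                return !(state & ~I_REFERENCED);
        }

        int main(void)
        {
                printf("clean:            posted=%d fixed=%d\n",
                       posted_check(0), fixed_check(0));                       /* 1 1 */
                printf("clean+referenced: posted=%d fixed=%d\n",
                       posted_check(I_REFERENCED), fixed_check(I_REFERENCED)); /* 0 1 */
                printf("dirty:            posted=%d fixed=%d\n",
                       posted_check(I_DIRTY_SYNC), fixed_check(I_DIRTY_SYNC)); /* 1 0 */
                return 0;
        }

The posted predicate would switch a dirty inode and skip a clean but
recently referenced one; the corrected predicate switches exactly the
inodes whose only set bit, if any, is the aging bit.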
On Wed, May 26, 2021 at 03:25:57PM -0700, Roman Gushchin wrote:
> Asynchronously try to release dying cgwbs by switching clean attached
> inodes to the bdi's wb. It helps to get rid of per-cgroup writeback
> structures themselves and of pinned memory and block cgroups, which
> are way larger structures (mostly due to large per-cpu statistics
> data). It helps to prevent memory waste and different scalability
> problems caused by large piles of dying cgroups.
>
> A cgwb cleanup operation can fail due to different reasons (e.g. the
> cgwb has in-flight/pending io, an attached inode is locked or isn't
> clean, etc). In this case the next scheduled cleanup will make a new
> attempt. An attempt is made each time a new cgwb is offlined (in other
> words a memcg and/or a blkcg is deleted by a user). In the future an
> additional attempt scheduled by a timer can be implemented.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  fs/fs-writeback.c                | 35 ++++++++++++++++++
>  include/linux/backing-dev-defs.h |  1 +
>  include/linux/writeback.h        |  1 +
>  mm/backing-dev.c                 | 61 ++++++++++++++++++++++++++++++--
>  4 files changed, 96 insertions(+), 2 deletions(-)
>

Hello Roman,

The following kernel panic is triggered by this patch:

[root@ktest-01 xfstests-dev]# ./check generic/563
[ 47.186811] SGI XFS with ACLs, security attributes, realtime, verbose warnings, quota, no debug enabled
[ 47.190152] XFS (sdb): Mounting V5 Filesystem
[ 47.201551] XFS (sdb): Ending clean mount
[ 47.205501] xfs filesystem being mounted at /mnt/test supports timestamps until 2038 (0x7fffffff)
FSTYP         -- xfs (non-debug)
PLATFORM      -- Linux/x86_64 ktest-01 5.13.0-rc3+ #294 SMP Fri May 28 10:51:02 CST 2021
MKFS_OPTIONS  -- -f -bsize=4096 /dev/sda
MOUNT_OPTIONS -- /dev/sda /mnt/scratch
[ 47.431775] XFS (sda): Mounting V5 Filesystem
[ 47.441731] XFS (sda): Ending clean mount
[ 47.445080] xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff)
[ 47.449189] XFS (sda): Unmounting Filesystem
[ 47.473863] XFS (sdb): Unmounting Filesystem
[ 47.614561] XFS (sdb): Mounting V5 Filesystem
[ 47.628670] XFS (sdb): Ending clean mount
[ 47.631904] xfs filesystem being mounted at /mnt/test supports timestamps until 2038 (0x7fffffff)
generic/563 1s ...
[ 47.661393] run fstests generic/563 at 2021-05-28 02:54:59
[ 47.947414] loop0: detected capacity change from 0 to 16777216
[ 48.034564] XFS (loop0): Mounting V5 Filesystem
[ 48.069959] XFS (loop0): Ending clean mount
[ 48.070726] xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff)
[ 48.132314] XFS (loop0): Unmounting Filesystem
[ 48.204548] XFS (loop0): Mounting V5 Filesystem
[ 48.215500] XFS (loop0): Ending clean mount
[ 48.219223] xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff)
[ 48.534420] XFS (loop0): Unmounting Filesystem
[ 48.535142] ------------[ cut here ]------------
[ 48.535921] WARNING: CPU: 3 PID: 114 at mm/backing-dev.c:402 cgwb_release_workfn+0xa4/0xd8
[ 48.537461] Modules linked in: xfs libcrc32c iTCO_wdt i2c_i801 iTCO_vendor_support nvme i2c_smbus lpc_ich usb_storage i2c_s
[ 48.540613] CPU: 3 PID: 114 Comm: kworker/3:1 Not tainted 5.13.0-rc3+ #294
[ 48.541927] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-1.fc33 04/01/2014
[ 48.543439] Workqueue: cgwb_release cgwb_release_workfn
[ 48.544365] RIP: 0010:cgwb_release_workfn+0xa4/0xd8
[ 48.545185] Code: 00 00 00 48 85 db 75 d5 48 8d 7d 80 e8 98 71 20 00 48 8d bd 70 ff ff ff e8 36 7b 1d 00 48 8b 55 f0 48 8d4
[ 48.548935] RSP: 0018:ffffc90001f47e88 EFLAGS: 00010202
[ 48.549844] RAX: ffff88810321d280 RBX: ffffffff82f69ac0 RCX: 0000000080400011
[ 48.552645] RDX: ffffffff82669f00 RSI: 0000000000210d00 RDI: ffff888100042500
[ 48.553935] RBP: ffff88810321d290 R08: 0000000000000001 R09: ffffffff811c8754
[ 48.555054] R10: 000000000000005e R11: 0000000000000046 R12: ffff88810321d000
[ 48.556183] R13: ffff88815c4f2300 R14: 0000000000000000 R15: 0000000000000000
[ 48.557116] FS:  0000000000000000(0000) GS:ffff88815c4c0000(0000) knlGS:0000000000000000
[ 48.558131] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 48.558818] CR2: 00007f3aba17f9c0 CR3: 0000000108e86004 CR4: 0000000000370ee0
[ 48.559950] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 48.561057] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 48.562075] Call Trace:
[ 48.562421]  elfcorehdr_read+0xf/0xf
[ 48.562920]  ? worker_thread+0x117/0x1b9
[ 48.563443]  ? rescuer_thread+0x291/0x291
[ 48.564001]  ? kthread+0xec/0xf4
[ 48.564411]  ? kthread_create_worker_on_cpu+0x65/0x65
[ 48.565086]  ? ret_from_fork+0x1f/0x30
[ 48.565594] ---[ end trace bdeef00aa75cca5c ]---
[ 48.601694] XFS (loop0): Mounting V5 Filesystem
[ 48.605863] XFS (loop0): Ending clean mount
[ 48.607129] xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff)
[ 48.830734] general protection fault, probably for non-canonical address 0xffff11033f71f000: 0000 [#1] SMP NOPTI
[ 48.832720] CPU: 10 PID: 234 Comm: kworker/10:1 Tainted: G W 5.13.0-rc3+ #294
[ 48.833932] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-1.fc33 04/01/2014
[ 48.835146] Workqueue: events cleanup_offline_cgwbs_workfn
[ 48.835952] RIP: 0010:percpu_ref_tryget_many.constprop.0+0x12/0x43
[ 48.836849] Code: 48 8b 47 08 48 8b 40 08 ff d0 0f 1f 00 eb 04 65 48 ff 08 e9 25 fd ff ff 41 54 48 8b 07 a8 03 74 09 48 8b4
[ 48.839494] RSP: 0018:ffffc90001383e58 EFLAGS: 00010046
[ 48.840246] RAX: ffff88810321f000 RBX: ffff88810321d280 RCX: ffff8881016f9770
[ 48.841165] RDX: ffffffff82669f00 RSI: 0000000000000280 RDI: ffff88810321d200
[ 48.842050] RBP: ffffc90001383e68 R08: ffff88810006c8b0 R09: 000073746e657665
[ 48.843224] R10: 8080808080808080 R11: fefefefefefefeff R12: ffff88810321d200
[ 48.844133] R13: ffff88810321d000 R14: 0000000000000000 R15: 0000000000000000
[ 48.845022] FS:  0000000000000000(0000) GS:ffff88823c500000(0000) knlGS:0000000000000000
[ 48.845903] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 48.846495] CR2: 00007fb630166198 CR3: 0000000179a1e006 CR4: 0000000000370ee0
[ 48.847414] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 48.848405] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 48.849126] Call Trace:
[ 48.849388]  cleanup_offline_cgwbs_workfn+0x8a/0x14c
[ 48.849906]  process_one_work+0x15c/0x234
[ 48.850346]  worker_thread+0x117/0x1b9
[ 48.850706]  ? rescuer_thread+0x291/0x291
[ 48.851065]  kthread+0xec/0xf4
[ 48.851346]  ? kthread_create_worker_on_cpu+0x65/0x65
[ 48.851815]  ret_from_fork+0x1f/0x30
[ 48.852151] Modules linked in: xfs libcrc32c iTCO_wdt i2c_i801 iTCO_vendor_support nvme i2c_smbus lpc_ich usb_storage i2c_s
[ 48.854166] Dumping ftrace buffer:
[ 48.854546]    (ftrace buffer empty)
[ 48.854909] ---[ end trace bdeef00aa75cca5d ]---

Thanks,
Ming
On Thu 27-05-21 10:48:59, Roman Gushchin wrote:
> On Thu, May 27, 2021 at 01:24:03PM +0200, Jan Kara wrote:
> > On Wed 26-05-21 15:25:57, Roman Gushchin wrote:
> > > Asynchronously try to release dying cgwbs by switching clean attached
> > > inodes to the bdi's wb. It helps to get rid of per-cgroup writeback
> > > structures themselves and of pinned memory and block cgroups, which
> > > are way larger structures (mostly due to large per-cpu statistics
> > > data). It helps to prevent memory waste and different scalability
> > > problems caused by large piles of dying cgroups.
> > >
> > > A cgwb cleanup operation can fail due to different reasons (e.g. the
> > > cgwb has in-flight/pending io, an attached inode is locked or isn't
> > > clean, etc). In this case the next scheduled cleanup will make a new
> > > attempt. An attempt is made each time a new cgwb is offlined (in other
> > > words a memcg and/or a blkcg is deleted by a user). In the future an
> > > additional attempt scheduled by a timer can be implemented.
> > >
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > ---
> > >  fs/fs-writeback.c                | 35 ++++++++++++++++++
> > >  include/linux/backing-dev-defs.h |  1 +
> > >  include/linux/writeback.h        |  1 +
> > >  mm/backing-dev.c                 | 61 ++++++++++++++++++++++++++++++--
> > >  4 files changed, 96 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 631ef6366293..8fbcd50844f0 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -577,6 +577,41 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
> > >          kfree(isw);
> > >  }
> > >
> > > +/**
> > > + * cleanup_offline_wb - detach associated clean inodes
> > > + * @wb: target wb
> > > + *
> > > + * Switch the inode->i_wb pointer of the attached inodes to the bdi's wb and
> > > + * drop the corresponding per-cgroup wb's reference. Skip inodes which are
> > > + * dirty, freeing, in the active writeback process or are in any way busy.
> >
> > I think the comment doesn't match the function anymore.
> >
> > > + */
> > > +void cleanup_offline_wb(struct bdi_writeback *wb)
> > > +{
> > > +        struct inode *inode, *tmp;
> > > +
> > > +        spin_lock(&wb->list_lock);
> > > +restart:
> > > +        list_for_each_entry_safe(inode, tmp, &wb->b_attached, i_io_list) {
> > > +                if (!spin_trylock(&inode->i_lock))
> > > +                        continue;
> > > +                xa_lock_irq(&inode->i_mapping->i_pages);
> > > +                if ((inode->i_state & I_REFERENCED) != I_REFERENCED) {
> >
> > Why the I_REFERENCED check here? That's just an inode aging bit and I have
> > a hard time seeing how it would relate to whether the inode should switch
> > wbs...
>
> What I tried to say (and failed :) ) was that I_REFERENCED is the only
> accepted flag here. So there must be
>
>         if ((inode->i_state | I_REFERENCED) != I_REFERENCED)
>
> Does this look good or am I wrong and there are other flags acceptable here?

Ah, I see. That makes more sense. I guess you could also exclude I_DONTCACHE
and I_OVL_INUSE but that's not that important.

> > > +                        struct bdi_writeback *bdi_wb = &inode_to_bdi(inode)->wb;
> > > +
> > > +                        WARN_ON_ONCE(inode->i_wb != wb);
> > > +
> > > +                        inode->i_wb = bdi_wb;
> > > +                        list_del_init(&inode->i_io_list);
> > > +                        wb_put(wb);
> >
> > I was kind of hoping you'll use some variant of inode_switch_wbs() here.
>
> My reasoning was that by definition inode_switch_wbs() handles dirty inodes,
> while in the cleanup case we can deal only with clean inodes and clean wb's.
> Hopefully this can make the whole procedure simpler/cheaper. Also, the number
> of simultaneous switches is limited and I don't think cleanups should share
> this limit.
> However I agree that it would be nice to share at least some code.

I agree limits on parallel switches should not apply. Otherwise I agree
some bits of inode_switch_wbs_work_fn() should not be strictly necessary
but they should be pretty cheap anyway.

> > That way we have a single function handling all the subtleties of switching
> > inode->i_wb of an active inode. Maybe it isn't strictly needed here because
> > you detach only from the b_attached list and move to bdi_wb so things are
> > indeed simpler here. But you definitely miss transferring the WB_WRITEBACK
> > stat and I'd also like to have a comment here explaining why this cannot
> > race with other writeback handling or wb switching in a harmful way.
>
> If we check under wb->list_lock that the wb has no inodes on any writeback
> lists (excluding b_attached), doesn't it mean that WB_WRITEBACK must be 0?

No, pages under writeback are not reflected in inode->i_state in any way.
You would need to check mapping_tagged(inode->i_mapping,
PAGECACHE_TAG_WRITEBACK) to find that out. But if you'd use
inode_switch_wbs_work_fn() you wouldn't even have to be that careful when
switching inodes as it can handle alive inodes just fine...

> Re racing: my logic here was that we're taking all possible locks before
> doing anything and then we check that the inode is entirely clean, so this
> must be safe:
>
>         spin_lock(&wb->list_lock);
>         spin_trylock(&inode->i_lock);
>         xa_lock_irq(&inode->i_mapping->i_pages);
>         ...
>
> But now I see that the unlocked inode's wb access mechanism
> (unlocked_inode_to_wb_begin()/end()) probably requires additional care.

Yeah, exactly corner cases like this were not quite clear to me whether you
have them correct or not.

> Repeating the mechanism with scheduling the switching of each inode
> separately after an rcu grace period looks too slow. Maybe we can mark all
> inodes at once and then switch them all at once, all in two steps. I need to
> think more. Do you have any ideas/suggestions here?

Nothing really bright. As you say I'd do this in batches - i.e., tag all
inodes for switching with I_WB_SWITCH, then synchronize_rcu(), then call
inode_switch_wbs_work_fn() for each inode (or probably some helper function
that has the guts of inode_switch_wbs_work_fn(), as we probably don't want to
acquire wb->list_lock's and wb_switch_rwsem repeatedly unnecessarily).

                                                                Honza
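For concreteness, the check Jan describes could be dropped into the
b_attached walk of the posted cleanup_offline_wb() like this (a fragment
only, placed after the locks are taken; a sketch rather than reviewed code):

        if (mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK)) {
                /* pages still under writeback: WB_WRITEBACK is non-zero,
                 * so this inode cannot be switched without transferring
                 * the stat; leave it for a later pass */
                xa_unlock_irq(&inode->i_mapping->i_pages);
                spin_unlock(&inode->i_lock);
                continue;
        }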
On Fri, May 28, 2021 at 10:58:04AM +0800, Ming Lei wrote:
> On Wed, May 26, 2021 at 03:25:57PM -0700, Roman Gushchin wrote:
> > Asynchronously try to release dying cgwbs by switching clean attached
> > inodes to the bdi's wb. It helps to get rid of per-cgroup writeback
> > structures themselves and of pinned memory and block cgroups, which
> > are way larger structures (mostly due to large per-cpu statistics
> > data). It helps to prevent memory waste and different scalability
> > problems caused by large piles of dying cgroups.
> >
> > A cgwb cleanup operation can fail due to different reasons (e.g. the
> > cgwb has in-flight/pending io, an attached inode is locked or isn't
> > clean, etc). In this case the next scheduled cleanup will make a new
> > attempt. An attempt is made each time a new cgwb is offlined (in other
> > words a memcg and/or a blkcg is deleted by a user). In the future an
> > additional attempt scheduled by a timer can be implemented.
> >
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > ---
> >  fs/fs-writeback.c                | 35 ++++++++++++++++++
> >  include/linux/backing-dev-defs.h |  1 +
> >  include/linux/writeback.h        |  1 +
> >  mm/backing-dev.c                 | 61 ++++++++++++++++++++++++++++++--
> >  4 files changed, 96 insertions(+), 2 deletions(-)
> >
>
> Hello Roman,
>
> The following kernel panic is triggered by this patch:

Hello Ming!

Thank you for the report and for trying my patches! I think I know what it
is and will fix it in the next version.

Thanks!
On Fri, May 28, 2021 at 03:05:43PM +0200, Jan Kara wrote:
> On Thu 27-05-21 10:48:59, Roman Gushchin wrote:
> > On Thu, May 27, 2021 at 01:24:03PM +0200, Jan Kara wrote:
> > > On Wed 26-05-21 15:25:57, Roman Gushchin wrote:
> > > > Asynchronously try to release dying cgwbs by switching clean attached
> > > > inodes to the bdi's wb. It helps to get rid of per-cgroup writeback
> > > > structures themselves and of pinned memory and block cgroups, which
> > > > are way larger structures (mostly due to large per-cpu statistics
> > > > data). It helps to prevent memory waste and different scalability
> > > > problems caused by large piles of dying cgroups.
> > > >
> > > > A cgwb cleanup operation can fail due to different reasons (e.g. the
> > > > cgwb has in-flight/pending io, an attached inode is locked or isn't
> > > > clean, etc). In this case the next scheduled cleanup will make a new
> > > > attempt. An attempt is made each time a new cgwb is offlined (in other
> > > > words a memcg and/or a blkcg is deleted by a user). In the future an
> > > > additional attempt scheduled by a timer can be implemented.
> > > >
> > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > ---
> > > >  fs/fs-writeback.c                | 35 ++++++++++++++++++
> > > >  include/linux/backing-dev-defs.h |  1 +
> > > >  include/linux/writeback.h        |  1 +
> > > >  mm/backing-dev.c                 | 61 ++++++++++++++++++++++++++++++--
> > > >  4 files changed, 96 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > index 631ef6366293..8fbcd50844f0 100644
> > > > --- a/fs/fs-writeback.c
> > > > +++ b/fs/fs-writeback.c
> > > > @@ -577,6 +577,41 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
> > > >          kfree(isw);
> > > >  }
> > > >
> > > > +/**
> > > > + * cleanup_offline_wb - detach associated clean inodes
> > > > + * @wb: target wb
> > > > + *
> > > > + * Switch the inode->i_wb pointer of the attached inodes to the bdi's wb and
> > > > + * drop the corresponding per-cgroup wb's reference. Skip inodes which are
> > > > + * dirty, freeing, in the active writeback process or are in any way busy.
> > >
> > > I think the comment doesn't match the function anymore.
> > >
> > > > + */
> > > > +void cleanup_offline_wb(struct bdi_writeback *wb)
> > > > +{
> > > > +        struct inode *inode, *tmp;
> > > > +
> > > > +        spin_lock(&wb->list_lock);
> > > > +restart:
> > > > +        list_for_each_entry_safe(inode, tmp, &wb->b_attached, i_io_list) {
> > > > +                if (!spin_trylock(&inode->i_lock))
> > > > +                        continue;
> > > > +                xa_lock_irq(&inode->i_mapping->i_pages);
> > > > +                if ((inode->i_state & I_REFERENCED) != I_REFERENCED) {
> > >
> > > Why the I_REFERENCED check here? That's just an inode aging bit and I have
> > > a hard time seeing how it would relate to whether the inode should switch
> > > wbs...
> >
> > What I tried to say (and failed :) ) was that I_REFERENCED is the only
> > accepted flag here. So there must be
> >
> >         if ((inode->i_state | I_REFERENCED) != I_REFERENCED)
> >
> > Does this look good or am I wrong and there are other flags acceptable here?
>
> Ah, I see. That makes more sense. I guess you could also exclude I_DONTCACHE
> and I_OVL_INUSE but that's not that important.
>
> > > > +                        struct bdi_writeback *bdi_wb = &inode_to_bdi(inode)->wb;
> > > > +
> > > > +                        WARN_ON_ONCE(inode->i_wb != wb);
> > > > +
> > > > +                        inode->i_wb = bdi_wb;
> > > > +                        list_del_init(&inode->i_io_list);
> > > > +                        wb_put(wb);
> > >
> > > I was kind of hoping you'll use some variant of inode_switch_wbs() here.
> >
> > My reasoning was that by definition inode_switch_wbs() handles dirty inodes,
> > while in the cleanup case we can deal only with clean inodes and clean wb's.
> > Hopefully this can make the whole procedure simpler/cheaper. Also, the number
> > of simultaneous switches is limited and I don't think cleanups should share
> > this limit.
> > However I agree that it would be nice to share at least some code.
>
> I agree limits on parallel switches should not apply. Otherwise I agree
> some bits of inode_switch_wbs_work_fn() should not be strictly necessary
> but they should be pretty cheap anyway.
>
> > > That way we have a single function handling all the subtleties of switching
> > > inode->i_wb of an active inode. Maybe it isn't strictly needed here because
> > > you detach only from the b_attached list and move to bdi_wb so things are
> > > indeed simpler here. But you definitely miss transferring the WB_WRITEBACK
> > > stat and I'd also like to have a comment here explaining why this cannot
> > > race with other writeback handling or wb switching in a harmful way.
> >
> > If we check under wb->list_lock that the wb has no inodes on any writeback
> > lists (excluding b_attached), doesn't it mean that WB_WRITEBACK must be 0?
>
> No, pages under writeback are not reflected in inode->i_state in any way.
> You would need to check mapping_tagged(inode->i_mapping,
> PAGECACHE_TAG_WRITEBACK) to find that out. But if you'd use
> inode_switch_wbs_work_fn() you wouldn't even have to be that careful when
> switching inodes as it can handle alive inodes just fine...

I see...

> > Re racing: my logic here was that we're taking all possible locks before
> > doing anything and then we check that the inode is entirely clean, so this
> > must be safe:
> >
> >         spin_lock(&wb->list_lock);
> >         spin_trylock(&inode->i_lock);
> >         xa_lock_irq(&inode->i_mapping->i_pages);
> >         ...
> >
> > But now I see that the unlocked inode's wb access mechanism
> > (unlocked_inode_to_wb_begin()/end()) probably requires additional care.
>
> Yeah, exactly corner cases like this were not quite clear to me whether you
> have them correct or not.
>
> > Repeating the mechanism with scheduling the switching of each inode
> > separately after an rcu grace period looks too slow. Maybe we can mark all
> > inodes at once and then switch them all at once, all in two steps. I need
> > to think more. Do you have any ideas/suggestions here?
>
> Nothing really bright. As you say I'd do this in batches - i.e., tag all
> inodes for switching with I_WB_SWITCH, then synchronize_rcu(), then call
> inode_switch_wbs_work_fn() for each inode (or probably some helper function
> that has the guts of inode_switch_wbs_work_fn(), as we probably don't want to
> acquire wb->list_lock's and wb_switch_rwsem repeatedly unnecessarily).

Ok, sounds good to me. I'm a bit worried about the possible CPU overhead,
but hopefully we can switch inodes slowly enough so that the impact on the
system will be acceptable.

Thanks!
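The batched scheme agreed on above could take roughly the following shape
(a pseudocode-level sketch only: switch_attached_inodes() is a hypothetical
helper wrapping the guts of inode_switch_wbs_work_fn(), and the interaction
with wb_switch_rwsem and concurrent dirtying is elided):

        void cleanup_offline_wb(struct bdi_writeback *wb)
        {
                struct inode *inode;

                /* 1) tag every clean candidate with I_WB_SWITCH */
                spin_lock(&wb->list_lock);
                list_for_each_entry(inode, &wb->b_attached, i_io_list) {
                        if (!spin_trylock(&inode->i_lock))
                                continue;
                        if (!(inode->i_state & ~I_REFERENCED))
                                inode->i_state |= I_WB_SWITCH;
                        spin_unlock(&inode->i_lock);
                }
                spin_unlock(&wb->list_lock);

                /* 2) wait out unlocked_inode_to_wb_begin() RCU readers */
                synchronize_rcu();

                /*
                 * 3) switch the whole batch at once, taking wb->list_lock
                 *    and wb_switch_rwsem a single time:
                 *    switch_attached_inodes(wb, &wb->bdi->wb);
                 */
        }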
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 631ef6366293..8fbcd50844f0 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -577,6 +577,41 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
         kfree(isw);
 }

+/**
+ * cleanup_offline_wb - detach associated clean inodes
+ * @wb: target wb
+ *
+ * Switch the inode->i_wb pointer of the attached inodes to the bdi's wb and
+ * drop the corresponding per-cgroup wb's reference. Skip inodes which are
+ * dirty, freeing, in the active writeback process or are in any way busy.
+ */
+void cleanup_offline_wb(struct bdi_writeback *wb)
+{
+        struct inode *inode, *tmp;
+
+        spin_lock(&wb->list_lock);
+restart:
+        list_for_each_entry_safe(inode, tmp, &wb->b_attached, i_io_list) {
+                if (!spin_trylock(&inode->i_lock))
+                        continue;
+                xa_lock_irq(&inode->i_mapping->i_pages);
+                if ((inode->i_state & I_REFERENCED) != I_REFERENCED) {
+                        struct bdi_writeback *bdi_wb = &inode_to_bdi(inode)->wb;
+
+                        WARN_ON_ONCE(inode->i_wb != wb);
+
+                        inode->i_wb = bdi_wb;
+                        list_del_init(&inode->i_io_list);
+                        wb_put(wb);
+                }
+                xa_unlock_irq(&inode->i_mapping->i_pages);
+                spin_unlock(&inode->i_lock);
+                if (cond_resched_lock(&wb->list_lock))
+                        goto restart;
+        }
+        spin_unlock(&wb->list_lock);
+}
+
 /**
  * wbc_attach_and_unlock_inode - associate wbc with target inode and unlock it
  * @wbc: writeback_control of interest
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index e5dc238ebe4f..07d6b6d6dbdf 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -155,6 +155,7 @@ struct bdi_writeback {
         struct list_head memcg_node;    /* anchored at memcg->cgwb_list */
         struct list_head blkcg_node;    /* anchored at blkcg->cgwb_list */
         struct list_head b_attached;    /* attached inodes, protected by list_lock */
+        struct list_head offline_node;  /* anchored at offline_cgwbs */

         union {
                 struct work_struct release_work;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 572a13c40c90..922f15fe6ad4 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -222,6 +222,7 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
 int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr_pages,
                            enum wb_reason reason, struct wb_completion *done);
 void cgroup_writeback_umount(void);
+void cleanup_offline_wb(struct bdi_writeback *wb);

 /**
  * inode_attach_wb - associate an inode with its wb
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 54c5dc4b8c24..92a00bcaa504 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -371,12 +371,16 @@ static void wb_exit(struct bdi_writeback *wb)
 #include <linux/memcontrol.h>

 /*
- * cgwb_lock protects bdi->cgwb_tree, blkcg->cgwb_list, and memcg->cgwb_list.
- * bdi->cgwb_tree is also RCU protected.
+ * cgwb_lock protects bdi->cgwb_tree, blkcg->cgwb_list, offline_cgwbs and
+ * memcg->cgwb_list. bdi->cgwb_tree is also RCU protected.
 */
 static DEFINE_SPINLOCK(cgwb_lock);
 static struct workqueue_struct *cgwb_release_wq;

+static LIST_HEAD(offline_cgwbs);
+static void cleanup_offline_cgwbs_workfn(struct work_struct *work);
+static DECLARE_WORK(cleanup_offline_cgwbs_work, cleanup_offline_cgwbs_workfn);
+
 static void cgwb_release_workfn(struct work_struct *work)
 {
         struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
@@ -395,6 +399,7 @@ static void cgwb_release_workfn(struct work_struct *work)

         fprop_local_destroy_percpu(&wb->memcg_completions);
         percpu_ref_exit(&wb->refcnt);
+        WARN_ON(!list_empty(&wb->offline_node));
         wb_exit(wb);
         WARN_ON_ONCE(!list_empty(&wb->b_attached));
         kfree_rcu(wb, rcu);
@@ -414,6 +419,10 @@ static void cgwb_kill(struct bdi_writeback *wb)
         WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id));
         list_del(&wb->memcg_node);
         list_del(&wb->blkcg_node);
+        if (!list_empty(&wb->b_attached))
+                list_add(&wb->offline_node, &offline_cgwbs);
+        else
+                INIT_LIST_HEAD(&wb->offline_node);
         percpu_ref_kill(&wb->refcnt);
 }

@@ -635,6 +644,50 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
         mutex_unlock(&bdi->cgwb_release_mutex);
 }

+/**
+ * cleanup_offline_cgwbs - try to release dying cgwbs
+ *
+ * Try to release dying cgwbs by switching attached inodes to the wb
+ * belonging to the root memory cgroup. Processed wbs are placed at the
+ * end of the list to guarantee the forward progress.
+ *
+ * Should be called with the acquired cgwb_lock lock, which might
+ * be released and re-acquired in the process.
+ */
+static void cleanup_offline_cgwbs_workfn(struct work_struct *work)
+{
+        struct bdi_writeback *wb;
+        LIST_HEAD(processed);
+
+        spin_lock_irq(&cgwb_lock);
+
+        while (!list_empty(&offline_cgwbs)) {
+                wb = list_first_entry(&offline_cgwbs, struct bdi_writeback,
+                                      offline_node);
+                list_move(&wb->offline_node, &processed);
+
+                if (wb_has_dirty_io(wb))
+                        continue;
+
+                if (!percpu_ref_tryget(&wb->refcnt))
+                        continue;
+
+                spin_unlock_irq(&cgwb_lock);
+                cleanup_offline_wb(wb);
+                spin_lock_irq(&cgwb_lock);
+
+                if (list_empty(&wb->b_attached))
+                        list_del_init(&wb->offline_node);
+
+                wb_put(wb);
+        }
+
+        if (!list_empty(&processed))
+                list_splice_tail(&processed, &offline_cgwbs);
+
+        spin_unlock_irq(&cgwb_lock);
+}
+
 /**
  * wb_memcg_offline - kill all wb's associated with a memcg being offlined
  * @memcg: memcg being offlined
@@ -650,6 +703,10 @@ void wb_memcg_offline(struct mem_cgroup *memcg)
         list_for_each_entry_safe(wb, next, memcg_cgwb_list, memcg_node)
                 cgwb_kill(wb);
         memcg_cgwb_list->next = NULL;   /* prevent new wb's */
+
+        if (!list_empty(&offline_cgwbs))
+                schedule_work(&cleanup_offline_cgwbs_work);
+
         spin_unlock_irq(&cgwb_lock);
 }
Asynchronously try to release dying cgwbs by switching clean attached
inodes to the bdi's wb. It helps to get rid of per-cgroup writeback
structures themselves and of pinned memory and block cgroups, which
are way larger structures (mostly due to large per-cpu statistics
data). It helps to prevent memory waste and different scalability
problems caused by large piles of dying cgroups.

A cgwb cleanup operation can fail due to different reasons (e.g. the
cgwb has in-flight/pending io, an attached inode is locked or isn't
clean, etc). In this case the next scheduled cleanup will make a new
attempt. An attempt is made each time a new cgwb is offlined (in other
words a memcg and/or a blkcg is deleted by a user). In the future an
additional attempt scheduled by a timer can be implemented.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 fs/fs-writeback.c                | 35 ++++++++++++++++++
 include/linux/backing-dev-defs.h |  1 +
 include/linux/writeback.h        |  1 +
 mm/backing-dev.c                 | 61 ++++++++++++++++++++++++++++++--
 4 files changed, 96 insertions(+), 2 deletions(-)