Message ID | 20190803140155.181190-3-tj@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/4] writeback: Generalize and expose wb_completion | expand |
On Sat, Aug 03, 2019 at 07:01:53AM -0700, Tejun Heo wrote: > There currently is no way to universally identify and lookup a bdi > without holding a reference and pointer to it. This patch adds an > non-recycling bdi->id and implements bdi_get_by_id() which looks up > bdis by their ids. This will be used by memcg foreign inode flushing. > > I left bdi_list alone for simplicity and because while rb_tree does > support rcu assignment it doesn't seem to guarantee lossless walk when > walk is racing aginst tree rebalance operations. This would seem like the perfect use for an allocating xarray. That does guarantee lossless walk under the RCU lock. You could get rid of the bdi_list too.
Hey, Matthew. On Sat, Aug 03, 2019 at 08:39:08AM -0700, Matthew Wilcox wrote: > On Sat, Aug 03, 2019 at 07:01:53AM -0700, Tejun Heo wrote: > > There currently is no way to universally identify and lookup a bdi > > without holding a reference and pointer to it. This patch adds an > > non-recycling bdi->id and implements bdi_get_by_id() which looks up > > bdis by their ids. This will be used by memcg foreign inode flushing. > > > > I left bdi_list alone for simplicity and because while rb_tree does > > support rcu assignment it doesn't seem to guarantee lossless walk when > > walk is racing aginst tree rebalance operations. > > This would seem like the perfect use for an allocating xarray. That > does guarantee lossless walk under the RCU lock. You could get rid of the > bdi_list too. It definitely came to mind but there's a bunch of downsides to recycling IDs or using radix tree for non-compacting allocations. Thanks.
On Sat, Aug 03, 2019 at 08:53:49AM -0700, Tejun Heo wrote: > Hey, Matthew. > > On Sat, Aug 03, 2019 at 08:39:08AM -0700, Matthew Wilcox wrote: > > On Sat, Aug 03, 2019 at 07:01:53AM -0700, Tejun Heo wrote: > > > There currently is no way to universally identify and lookup a bdi > > > without holding a reference and pointer to it. This patch adds an > > > non-recycling bdi->id and implements bdi_get_by_id() which looks up > > > bdis by their ids. This will be used by memcg foreign inode flushing. > > > > > > I left bdi_list alone for simplicity and because while rb_tree does > > > support rcu assignment it doesn't seem to guarantee lossless walk when > > > walk is racing aginst tree rebalance operations. > > > > This would seem like the perfect use for an allocating xarray. That > > does guarantee lossless walk under the RCU lock. You could get rid of the > > bdi_list too. > > It definitely came to mind but there's a bunch of downsides to > recycling IDs or using radix tree for non-compacting allocations. Ah, I wasn't sure what would happen if you recycled an ID. I agree, the radix tree is pretty horrid for monotonically increasing IDs. I'm still working on the maple tree to replace it, but that's going slower than I would like, so I can't in good conscience ask you to wait for it to be ready.
On Sat, 3 Aug 2019 07:01:53 -0700 Tejun Heo <tj@kernel.org> wrote: > There currently is no way to universally identify and lookup a bdi > without holding a reference and pointer to it. This patch adds an > non-recycling bdi->id and implements bdi_get_by_id() which looks up > bdis by their ids. This will be used by memcg foreign inode flushing. Why is the id non-recycling? Presumably to address some lifetime/lookup issues, but what are they? Why was the IDR code not used? > I left bdi_list alone for simplicity and because while rb_tree does > support rcu assignment it doesn't seem to guarantee lossless walk when > walk is racing aginst tree rebalance operations. > > ... > > +/** > + * bdi_get_by_id - lookup and get bdi from its id > + * @id: bdi id to lookup > + * > + * Find bdi matching @id and get it. Returns NULL if the matching bdi > + * doesn't exist or is already unregistered. > + */ > +struct backing_dev_info *bdi_get_by_id(u64 id) > +{ > + struct backing_dev_info *bdi = NULL; > + struct rb_node **p; > + > + spin_lock_irq(&bdi_lock); Why irq-safe? Everywhere else uses spin_lock_bh(&bdi_lock). > + p = bdi_lookup_rb_node(id, NULL); > + if (*p) { > + bdi = rb_entry(*p, struct backing_dev_info, rb_node); > + bdi_get(bdi); > + } > + spin_unlock_irq(&bdi_lock); > + > + return bdi; > +} > + > > ... >
Hello, On Tue, Aug 06, 2019 at 04:01:02PM -0700, Andrew Morton wrote: > On Sat, 3 Aug 2019 07:01:53 -0700 Tejun Heo <tj@kernel.org> wrote: > > There currently is no way to universally identify and lookup a bdi > > without holding a reference and pointer to it. This patch adds an > > non-recycling bdi->id and implements bdi_get_by_id() which looks up > > bdis by their ids. This will be used by memcg foreign inode flushing. > > Why is the id non-recycling? Presumably to address some > lifetime/lookup issues, but what are they? The ID by itself is used to point to the bdi from cgroup and idr recycles really aggressively. Combined with, for example, loop device based containers, stale pointing can become pretty common. We're having similar issues with cgroup IDs. > Why was the IDR code not used? Because of the rapid recycling. In the longer term, I think we want IDR to be able to support non-recycling IDs, or at least less agressive recycling. > > +struct backing_dev_info *bdi_get_by_id(u64 id) > > +{ > > + struct backing_dev_info *bdi = NULL; > > + struct rb_node **p; > > + > > + spin_lock_irq(&bdi_lock); > > Why irq-safe? Everywhere else uses spin_lock_bh(&bdi_lock). By mistake, I'll change them to bh. Thanks.
On Wed, 7 Aug 2019 11:31:51 -0700 Tejun Heo <tj@kernel.org> wrote: > Hello, > > On Tue, Aug 06, 2019 at 04:01:02PM -0700, Andrew Morton wrote: > > On Sat, 3 Aug 2019 07:01:53 -0700 Tejun Heo <tj@kernel.org> wrote: > > > There currently is no way to universally identify and lookup a bdi > > > without holding a reference and pointer to it. This patch adds an > > > non-recycling bdi->id and implements bdi_get_by_id() which looks up > > > bdis by their ids. This will be used by memcg foreign inode flushing. > > > > Why is the id non-recycling? Presumably to address some > > lifetime/lookup issues, but what are they? > > The ID by itself is used to point to the bdi from cgroup and idr > recycles really aggressively. Combined with, for example, loop device > based containers, stale pointing can become pretty common. We're > having similar issues with cgroup IDs. OK, but why is recycling a problem? For example, file descriptors recycle as aggressively as is possible, and that doesn't cause any trouble. Presumably recycling is a problem with cgroups because of some sort of stale reference problem?
Hello, On Wed, Aug 07, 2019 at 12:00:37PM -0700, Andrew Morton wrote: > OK, but why is recycling a problem? For example, file descriptors > recycle as aggressively as is possible, and that doesn't cause any > trouble. Presumably recycling is a problem with cgroups because of > some sort of stale reference problem? Oh, because there are use cases where the consumers are detached from the lifetime synchronization. In this example, the benefit of using IDs is that memcgs don't need to pin foreign bdi_wb's and just look up and verify when it wants to flush them. If we use pointers, we have to pin the objects which then requires either shooting down those references with timers or somehow doing reverse lookup to shoot them down when bdi_wb is taken offline. If we use IDs which can be recycling aggressively, there can be pathological cases where remote flushes are issued on the wrong target possibly repeatedly, which may or may not be a real problem. For cgroup ID, the problem is more immediate. We give out the IDs to userspace and there is no way to shoot those references down when the cgroup goes away and the ID gets recycled, so when the user comes back and asks "I want to attach this bpf program to the cgroup identified with this ID", we can end up attaching it to the wrong cgroup if we recycle IDs. cgroup ended up with two separate IDs, which is kinda dumb. tl;dr is that it's either cumbersome or impossible to synchronize the users of these IDs, so if they get recycled, they end up identifying the wrong thing. Thanks.
On Wed, 2019-08-07 at 12:00 -0700, Andrew Morton wrote: > On Wed, 7 Aug 2019 11:31:51 -0700 Tejun Heo <tj@kernel.org> wrote: > > > Hello, > > > > On Tue, Aug 06, 2019 at 04:01:02PM -0700, Andrew Morton wrote: > > > On Sat, 3 Aug 2019 07:01:53 -0700 Tejun Heo <tj@kernel.org> > > > wrote: > > > > There currently is no way to universally identify and lookup a > > > > bdi > > > > without holding a reference and pointer to it. This patch adds > > > > an > > > > non-recycling bdi->id and implements bdi_get_by_id() which > > > > looks up > > > > bdis by their ids. This will be used by memcg foreign inode > > > > flushing. > > > > > > Why is the id non-recycling? Presumably to address some > > > lifetime/lookup issues, but what are they? > > > > The ID by itself is used to point to the bdi from cgroup and idr > > recycles really aggressively. Combined with, for example, loop > > device > > based containers, stale pointing can become pretty common. We're > > having similar issues with cgroup IDs. > > OK, but why is recycling a problem? For example, file descriptors > recycle as aggressively as is possible, and that doesn't cause any > trouble. Presumably recycling is a problem with cgroups because of > some sort of stale reference problem? PIDs, on the other hand, we recycle as slowly as possible...
On Sat 03-08-19 07:01:53, Tejun Heo wrote: > There currently is no way to universally identify and lookup a bdi > without holding a reference and pointer to it. This patch adds an > non-recycling bdi->id and implements bdi_get_by_id() which looks up > bdis by their ids. This will be used by memcg foreign inode flushing. > > I left bdi_list alone for simplicity and because while rb_tree does > support rcu assignment it doesn't seem to guarantee lossless walk when > walk is racing aginst tree rebalance operations. > > Signed-off-by: Tejun Heo <tj@kernel.org> The patch looks good to me. You can add: Reviewed-by: Jan Kara <jack@suse.cz> Although I would note that here you take effort not to recycle bdi->id so that you don't flush wrong devices while in patch 4 you take pretty lax approach to feeding garbage into the writeback system. So these two don't quite match to me... Honza > --- > include/linux/backing-dev-defs.h | 2 + > include/linux/backing-dev.h | 1 + > mm/backing-dev.c | 65 +++++++++++++++++++++++++++++++- > 3 files changed, 66 insertions(+), 2 deletions(-) > > diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h > index 8fb740178d5d..1075f2552cfc 100644 > --- a/include/linux/backing-dev-defs.h > +++ b/include/linux/backing-dev-defs.h > @@ -185,6 +185,8 @@ struct bdi_writeback { > }; > > struct backing_dev_info { > + u64 id; > + struct rb_node rb_node; /* keyed by ->id */ > struct list_head bdi_list; > unsigned long ra_pages; /* max readahead in PAGE_SIZE units */ > unsigned long io_pages; /* max allowed IO size */ > diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h > index 02650b1253a2..84cdcfbc763f 100644 > --- a/include/linux/backing-dev.h > +++ b/include/linux/backing-dev.h > @@ -24,6 +24,7 @@ static inline struct backing_dev_info *bdi_get(struct backing_dev_info *bdi) > return bdi; > } > > +struct backing_dev_info *bdi_get_by_id(u64 id); > void bdi_put(struct backing_dev_info *bdi); > > __printf(2, 3) > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > index e8e89158adec..4a8816e0b8d4 100644 > --- a/mm/backing-dev.c > +++ b/mm/backing-dev.c > @@ -1,6 +1,7 @@ > // SPDX-License-Identifier: GPL-2.0-only > > #include <linux/wait.h> > +#include <linux/rbtree.h> > #include <linux/backing-dev.h> > #include <linux/kthread.h> > #include <linux/freezer.h> > @@ -22,10 +23,12 @@ EXPORT_SYMBOL_GPL(noop_backing_dev_info); > static struct class *bdi_class; > > /* > - * bdi_lock protects updates to bdi_list. bdi_list has RCU reader side > - * locking. > + * bdi_lock protects bdi_tree and updates to bdi_list. bdi_list has RCU > + * reader side locking. > */ > DEFINE_SPINLOCK(bdi_lock); > +static u64 bdi_id_cursor; > +static struct rb_root bdi_tree = RB_ROOT; > LIST_HEAD(bdi_list); > > /* bdi_wq serves all asynchronous writeback tasks */ > @@ -859,9 +862,58 @@ struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, int node_id) > } > EXPORT_SYMBOL(bdi_alloc_node); > > +struct rb_node **bdi_lookup_rb_node(u64 id, struct rb_node **parentp) > +{ > + struct rb_node **p = &bdi_tree.rb_node; > + struct rb_node *parent = NULL; > + struct backing_dev_info *bdi; > + > + lockdep_assert_held(&bdi_lock); > + > + while (*p) { > + parent = *p; > + bdi = rb_entry(parent, struct backing_dev_info, rb_node); > + > + if (bdi->id > id) > + p = &(*p)->rb_left; > + else if (bdi->id < id) > + p = &(*p)->rb_right; > + else > + break; > + } > + > + if (parentp) > + *parentp = parent; > + return p; > +} > + > +/** > + * bdi_get_by_id - lookup and get bdi from its id > + * @id: bdi id to lookup > + * > + * Find bdi matching @id and get it. Returns NULL if the matching bdi > + * doesn't exist or is already unregistered. > + */ > +struct backing_dev_info *bdi_get_by_id(u64 id) > +{ > + struct backing_dev_info *bdi = NULL; > + struct rb_node **p; > + > + spin_lock_irq(&bdi_lock); > + p = bdi_lookup_rb_node(id, NULL); > + if (*p) { > + bdi = rb_entry(*p, struct backing_dev_info, rb_node); > + bdi_get(bdi); > + } > + spin_unlock_irq(&bdi_lock); > + > + return bdi; > +} > + > int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args) > { > struct device *dev; > + struct rb_node *parent, **p; > > if (bdi->dev) /* The driver needs to use separate queues per device */ > return 0; > @@ -877,7 +929,15 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args) > set_bit(WB_registered, &bdi->wb.state); > > spin_lock_bh(&bdi_lock); > + > + bdi->id = ++bdi_id_cursor; > + > + p = bdi_lookup_rb_node(bdi->id, &parent); > + rb_link_node(&bdi->rb_node, parent, p); > + rb_insert_color(&bdi->rb_node, &bdi_tree); > + > list_add_tail_rcu(&bdi->bdi_list, &bdi_list); > + > spin_unlock_bh(&bdi_lock); > > trace_writeback_bdi_register(bdi); > @@ -918,6 +978,7 @@ EXPORT_SYMBOL(bdi_register_owner); > static void bdi_remove_from_list(struct backing_dev_info *bdi) > { > spin_lock_bh(&bdi_lock); > + rb_erase(&bdi->rb_node, &bdi_tree); > list_del_rcu(&bdi->bdi_list); > spin_unlock_bh(&bdi_lock); > > -- > 2.17.1 >
Hello, On Thu, Aug 15, 2019 at 04:46:23PM +0200, Jan Kara wrote: > Although I would note that here you take effort not to recycle bdi->id so > that you don't flush wrong devices while in patch 4 you take pretty lax > approach to feeding garbage into the writeback system. So these two don't > quite match to me... So, I was trying to avoid systemic errors where the wrong thing can be triggered repeatedly. A wrong flush once in a blue moon shouldn't be a big problem but if they can be triggered consistently by some pathological behavior, it's an a lot bigger problem. Thanks.
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 8fb740178d5d..1075f2552cfc 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -185,6 +185,8 @@ struct bdi_writeback { }; struct backing_dev_info { + u64 id; + struct rb_node rb_node; /* keyed by ->id */ struct list_head bdi_list; unsigned long ra_pages; /* max readahead in PAGE_SIZE units */ unsigned long io_pages; /* max allowed IO size */ diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 02650b1253a2..84cdcfbc763f 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -24,6 +24,7 @@ static inline struct backing_dev_info *bdi_get(struct backing_dev_info *bdi) return bdi; } +struct backing_dev_info *bdi_get_by_id(u64 id); void bdi_put(struct backing_dev_info *bdi); __printf(2, 3) diff --git a/mm/backing-dev.c b/mm/backing-dev.c index e8e89158adec..4a8816e0b8d4 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0-only #include <linux/wait.h> +#include <linux/rbtree.h> #include <linux/backing-dev.h> #include <linux/kthread.h> #include <linux/freezer.h> @@ -22,10 +23,12 @@ EXPORT_SYMBOL_GPL(noop_backing_dev_info); static struct class *bdi_class; /* - * bdi_lock protects updates to bdi_list. bdi_list has RCU reader side - * locking. + * bdi_lock protects bdi_tree and updates to bdi_list. bdi_list has RCU + * reader side locking. */ DEFINE_SPINLOCK(bdi_lock); +static u64 bdi_id_cursor; +static struct rb_root bdi_tree = RB_ROOT; LIST_HEAD(bdi_list); /* bdi_wq serves all asynchronous writeback tasks */ @@ -859,9 +862,58 @@ struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, int node_id) } EXPORT_SYMBOL(bdi_alloc_node); +struct rb_node **bdi_lookup_rb_node(u64 id, struct rb_node **parentp) +{ + struct rb_node **p = &bdi_tree.rb_node; + struct rb_node *parent = NULL; + struct backing_dev_info *bdi; + + lockdep_assert_held(&bdi_lock); + + while (*p) { + parent = *p; + bdi = rb_entry(parent, struct backing_dev_info, rb_node); + + if (bdi->id > id) + p = &(*p)->rb_left; + else if (bdi->id < id) + p = &(*p)->rb_right; + else + break; + } + + if (parentp) + *parentp = parent; + return p; +} + +/** + * bdi_get_by_id - lookup and get bdi from its id + * @id: bdi id to lookup + * + * Find bdi matching @id and get it. Returns NULL if the matching bdi + * doesn't exist or is already unregistered. + */ +struct backing_dev_info *bdi_get_by_id(u64 id) +{ + struct backing_dev_info *bdi = NULL; + struct rb_node **p; + + spin_lock_irq(&bdi_lock); + p = bdi_lookup_rb_node(id, NULL); + if (*p) { + bdi = rb_entry(*p, struct backing_dev_info, rb_node); + bdi_get(bdi); + } + spin_unlock_irq(&bdi_lock); + + return bdi; +} + int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args) { struct device *dev; + struct rb_node *parent, **p; if (bdi->dev) /* The driver needs to use separate queues per device */ return 0; @@ -877,7 +929,15 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args) set_bit(WB_registered, &bdi->wb.state); spin_lock_bh(&bdi_lock); + + bdi->id = ++bdi_id_cursor; + + p = bdi_lookup_rb_node(bdi->id, &parent); + rb_link_node(&bdi->rb_node, parent, p); + rb_insert_color(&bdi->rb_node, &bdi_tree); + list_add_tail_rcu(&bdi->bdi_list, &bdi_list); + spin_unlock_bh(&bdi_lock); trace_writeback_bdi_register(bdi); @@ -918,6 +978,7 @@ EXPORT_SYMBOL(bdi_register_owner); static void bdi_remove_from_list(struct backing_dev_info *bdi) { spin_lock_bh(&bdi_lock); + rb_erase(&bdi->rb_node, &bdi_tree); list_del_rcu(&bdi->bdi_list); spin_unlock_bh(&bdi_lock);
There currently is no way to universally identify and lookup a bdi without holding a reference and pointer to it. This patch adds an non-recycling bdi->id and implements bdi_get_by_id() which looks up bdis by their ids. This will be used by memcg foreign inode flushing. I left bdi_list alone for simplicity and because while rb_tree does support rcu assignment it doesn't seem to guarantee lossless walk when walk is racing aginst tree rebalance operations. Signed-off-by: Tejun Heo <tj@kernel.org> --- include/linux/backing-dev-defs.h | 2 + include/linux/backing-dev.h | 1 + mm/backing-dev.c | 65 +++++++++++++++++++++++++++++++- 3 files changed, 66 insertions(+), 2 deletions(-)