
[2/4] bdi: Add bdi->id

Message ID 20190803140155.181190-3-tj@kernel.org (mailing list archive)
State New, archived
Series [1/4] writeback: Generalize and expose wb_completion

Commit Message

Tejun Heo Aug. 3, 2019, 2:01 p.m. UTC
There currently is no way to universally identify and look up a bdi
without holding a reference and a pointer to it.  This patch adds a
non-recycling bdi->id and implements bdi_get_by_id(), which looks up
bdis by their ids.  This will be used by memcg foreign inode flushing.

I left bdi_list alone for simplicity, and because while rb_tree does
support RCU assignment it doesn't seem to guarantee a lossless walk
when the walk is racing against tree rebalance operations.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/backing-dev-defs.h |  2 +
 include/linux/backing-dev.h      |  1 +
 mm/backing-dev.c                 | 65 +++++++++++++++++++++++++++++++-
 3 files changed, 66 insertions(+), 2 deletions(-)

Comments

Matthew Wilcox (Oracle) Aug. 3, 2019, 3:39 p.m. UTC | #1
On Sat, Aug 03, 2019 at 07:01:53AM -0700, Tejun Heo wrote:
> There currently is no way to universally identify and lookup a bdi
> without holding a reference and pointer to it.  This patch adds an
> non-recycling bdi->id and implements bdi_get_by_id() which looks up
> bdis by their ids.  This will be used by memcg foreign inode flushing.
> 
> I left bdi_list alone for simplicity and because while rb_tree does
> support rcu assignment it doesn't seem to guarantee lossless walk when
> walk is racing aginst tree rebalance operations.

This would seem like the perfect use for an allocating xarray.  That
does guarantee lossless walk under the RCU lock.  You could get rid of the
bdi_list too.

Tejun Heo Aug. 3, 2019, 3:53 p.m. UTC | #2
Hey, Matthew.

On Sat, Aug 03, 2019 at 08:39:08AM -0700, Matthew Wilcox wrote:
> On Sat, Aug 03, 2019 at 07:01:53AM -0700, Tejun Heo wrote:
> > There currently is no way to universally identify and lookup a bdi
> > without holding a reference and pointer to it.  This patch adds an
> > non-recycling bdi->id and implements bdi_get_by_id() which looks up
> > bdis by their ids.  This will be used by memcg foreign inode flushing.
> > 
> > I left bdi_list alone for simplicity and because while rb_tree does
> > support rcu assignment it doesn't seem to guarantee lossless walk when
> > walk is racing aginst tree rebalance operations.
> 
> This would seem like the perfect use for an allocating xarray.  That
> does guarantee lossless walk under the RCU lock.  You could get rid of the
> bdi_list too.

It definitely came to mind, but there are a bunch of downsides to
recycling IDs or using a radix tree for non-compacting allocations.

Thanks.

Matthew Wilcox (Oracle) Aug. 3, 2019, 4:17 p.m. UTC | #3
On Sat, Aug 03, 2019 at 08:53:49AM -0700, Tejun Heo wrote:
> Hey, Matthew.
> 
> On Sat, Aug 03, 2019 at 08:39:08AM -0700, Matthew Wilcox wrote:
> > On Sat, Aug 03, 2019 at 07:01:53AM -0700, Tejun Heo wrote:
> > > There currently is no way to universally identify and lookup a bdi
> > > without holding a reference and pointer to it.  This patch adds an
> > > non-recycling bdi->id and implements bdi_get_by_id() which looks up
> > > bdis by their ids.  This will be used by memcg foreign inode flushing.
> > > 
> > > I left bdi_list alone for simplicity and because while rb_tree does
> > > support rcu assignment it doesn't seem to guarantee lossless walk when
> > > walk is racing aginst tree rebalance operations.
> > 
> > This would seem like the perfect use for an allocating xarray.  That
> > does guarantee lossless walk under the RCU lock.  You could get rid of the
> > bdi_list too.
> 
> It definitely came to mind but there's a bunch of downsides to
> recycling IDs or using radix tree for non-compacting allocations.

Ah, I wasn't sure what would happen if you recycled an ID.  I agree, the
radix tree is pretty horrid for monotonically increasing IDs.  I'm still
working on the maple tree to replace it, but that's going slower than
I would like, so I can't in good conscience ask you to wait for it to
be ready.

Andrew Morton Aug. 6, 2019, 11:01 p.m. UTC | #4
On Sat,  3 Aug 2019 07:01:53 -0700 Tejun Heo <tj@kernel.org> wrote:

> There currently is no way to universally identify and lookup a bdi
> without holding a reference and pointer to it.  This patch adds an
> non-recycling bdi->id and implements bdi_get_by_id() which looks up
> bdis by their ids.  This will be used by memcg foreign inode flushing.

Why is the id non-recycling?  Presumably to address some
lifetime/lookup issues, but what are they?

Why was the IDR code not used?

> I left bdi_list alone for simplicity and because while rb_tree does
> support rcu assignment it doesn't seem to guarantee lossless walk when
> walk is racing aginst tree rebalance operations.
> 
> ...
>
> +/**
> + * bdi_get_by_id - lookup and get bdi from its id
> + * @id: bdi id to lookup
> + *
> + * Find bdi matching @id and get it.  Returns NULL if the matching bdi
> + * doesn't exist or is already unregistered.
> + */
> +struct backing_dev_info *bdi_get_by_id(u64 id)
> +{
> +	struct backing_dev_info *bdi = NULL;
> +	struct rb_node **p;
> +
> +	spin_lock_irq(&bdi_lock);

Why irq-safe?  Everywhere else uses spin_lock_bh(&bdi_lock).

> +	p = bdi_lookup_rb_node(id, NULL);
> +	if (*p) {
> +		bdi = rb_entry(*p, struct backing_dev_info, rb_node);
> +		bdi_get(bdi);
> +	}
> +	spin_unlock_irq(&bdi_lock);
> +
> +	return bdi;
> +}
> +
>
> ...
>

Tejun Heo Aug. 7, 2019, 6:31 p.m. UTC | #5
Hello,

On Tue, Aug 06, 2019 at 04:01:02PM -0700, Andrew Morton wrote:
> On Sat,  3 Aug 2019 07:01:53 -0700 Tejun Heo <tj@kernel.org> wrote:
> > There currently is no way to universally identify and lookup a bdi
> > without holding a reference and pointer to it.  This patch adds an
> > non-recycling bdi->id and implements bdi_get_by_id() which looks up
> > bdis by their ids.  This will be used by memcg foreign inode flushing.
> 
> Why is the id non-recycling?  Presumably to address some
> lifetime/lookup issues, but what are they?

The ID by itself is used to point to the bdi from the cgroup side, and
IDR recycles IDs really aggressively.  Combined with, for example,
loop-device-based containers, stale references can become pretty
common.  We're having similar issues with cgroup IDs.

> Why was the IDR code not used?

Because of the rapid recycling.  In the longer term, I think we want
IDR to be able to support non-recycling IDs, or at least less
aggressive recycling.

> > +struct backing_dev_info *bdi_get_by_id(u64 id)
> > +{
> > +	struct backing_dev_info *bdi = NULL;
> > +	struct rb_node **p;
> > +
> > +	spin_lock_irq(&bdi_lock);
> 
> Why irq-safe?  Everywhere else uses spin_lock_bh(&bdi_lock).

By mistake, I'll change them to bh.

Thanks.

Andrew Morton Aug. 7, 2019, 7 p.m. UTC | #6
On Wed, 7 Aug 2019 11:31:51 -0700 Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On Tue, Aug 06, 2019 at 04:01:02PM -0700, Andrew Morton wrote:
> > On Sat,  3 Aug 2019 07:01:53 -0700 Tejun Heo <tj@kernel.org> wrote:
> > > There currently is no way to universally identify and lookup a bdi
> > > without holding a reference and pointer to it.  This patch adds an
> > > non-recycling bdi->id and implements bdi_get_by_id() which looks up
> > > bdis by their ids.  This will be used by memcg foreign inode flushing.
> > 
> > Why is the id non-recycling?  Presumably to address some
> > lifetime/lookup issues, but what are they?
> 
> The ID by itself is used to point to the bdi from cgroup and idr
> recycles really aggressively.  Combined with, for example, loop device
> based containers, stale pointing can become pretty common.  We're
> having similar issues with cgroup IDs.

OK, but why is recycling a problem?  For example, file descriptors
recycle as aggressively as is possible, and that doesn't cause any
trouble.  Presumably recycling is a problem with cgroups because of
some sort of stale reference problem?

Tejun Heo Aug. 7, 2019, 8:34 p.m. UTC | #7
Hello,

On Wed, Aug 07, 2019 at 12:00:37PM -0700, Andrew Morton wrote:
> OK, but why is recycling a problem?  For example, file descriptors
> recycle as aggressively as is possible, and that doesn't cause any
> trouble.  Presumably recycling is a problem with cgroups because of
> some sort of stale reference problem?

Oh, because there are use cases where the consumers are detached from
the lifetime synchronization.  In this example, the benefit of using
IDs is that memcgs don't need to pin foreign bdi_wb's; they can just
look up and verify when they want to flush them.  If we use pointers,
we have to pin the objects, which then requires either shooting down
those references with timers or somehow doing a reverse lookup to
shoot them down when the bdi_wb is taken offline.  If we use IDs that
are recycled aggressively, there can be pathological cases where
remote flushes are issued on the wrong target, possibly repeatedly,
which may or may not be a real problem.

For cgroup IDs, the problem is more immediate.  We give out the IDs to
userspace, and there is no way to shoot those references down when the
cgroup goes away and the ID gets recycled, so when the user comes back
and asks "I want to attach this bpf program to the cgroup identified
by this ID", we can end up attaching it to the wrong cgroup if we
recycle IDs.  cgroup ended up with two separate IDs because of this,
which is kinda dumb.

tl;dr is that it's either cumbersome or impossible to synchronize the
users of these IDs, so if they get recycled, they end up identifying
the wrong thing.

Thanks.

Rik van Riel Aug. 9, 2019, 12:57 a.m. UTC | #8
On Wed, 2019-08-07 at 12:00 -0700, Andrew Morton wrote:
> On Wed, 7 Aug 2019 11:31:51 -0700 Tejun Heo <tj@kernel.org> wrote:
> 
> > Hello,
> > 
> > On Tue, Aug 06, 2019 at 04:01:02PM -0700, Andrew Morton wrote:
> > > On Sat,  3 Aug 2019 07:01:53 -0700 Tejun Heo <tj@kernel.org> wrote:
> > > > There currently is no way to universally identify and lookup a bdi
> > > > without holding a reference and pointer to it.  This patch adds an
> > > > non-recycling bdi->id and implements bdi_get_by_id() which looks up
> > > > bdis by their ids.  This will be used by memcg foreign inode flushing.
> > > 
> > > Why is the id non-recycling?  Presumably to address some
> > > lifetime/lookup issues, but what are they?
> > 
> > The ID by itself is used to point to the bdi from cgroup and idr
> > recycles really aggressively.  Combined with, for example, loop device
> > based containers, stale pointing can become pretty common.  We're
> > having similar issues with cgroup IDs.
> 
> OK, but why is recycling a problem?  For example, file descriptors
> recycle as aggressively as is possible, and that doesn't cause any
> trouble.  Presumably recycling is a problem with cgroups because of
> some sort of stale reference problem?

PIDs, on the other hand, we recycle as slowly as
possible...

Jan Kara Aug. 15, 2019, 2:46 p.m. UTC | #9
On Sat 03-08-19 07:01:53, Tejun Heo wrote:
> There currently is no way to universally identify and lookup a bdi
> without holding a reference and pointer to it.  This patch adds an
> non-recycling bdi->id and implements bdi_get_by_id() which looks up
> bdis by their ids.  This will be used by memcg foreign inode flushing.
> 
> I left bdi_list alone for simplicity and because while rb_tree does
> support rcu assignment it doesn't seem to guarantee lossless walk when
> walk is racing aginst tree rebalance operations.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

Although I would note that here you take care not to recycle bdi->id so
that you don't flush the wrong devices, while in patch 4 you take a
pretty lax approach to feeding garbage into the writeback system.  So
these two don't quite match to me...

								Honza

> ---
>  include/linux/backing-dev-defs.h |  2 +
>  include/linux/backing-dev.h      |  1 +
>  mm/backing-dev.c                 | 65 +++++++++++++++++++++++++++++++-
>  3 files changed, 66 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index 8fb740178d5d..1075f2552cfc 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -185,6 +185,8 @@ struct bdi_writeback {
>  };
>  
>  struct backing_dev_info {
> +	u64 id;
> +	struct rb_node rb_node; /* keyed by ->id */
>  	struct list_head bdi_list;
>  	unsigned long ra_pages;	/* max readahead in PAGE_SIZE units */
>  	unsigned long io_pages;	/* max allowed IO size */
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 02650b1253a2..84cdcfbc763f 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -24,6 +24,7 @@ static inline struct backing_dev_info *bdi_get(struct backing_dev_info *bdi)
>  	return bdi;
>  }
>  
> +struct backing_dev_info *bdi_get_by_id(u64 id);
>  void bdi_put(struct backing_dev_info *bdi);
>  
>  __printf(2, 3)
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index e8e89158adec..4a8816e0b8d4 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -1,6 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  
>  #include <linux/wait.h>
> +#include <linux/rbtree.h>
>  #include <linux/backing-dev.h>
>  #include <linux/kthread.h>
>  #include <linux/freezer.h>
> @@ -22,10 +23,12 @@ EXPORT_SYMBOL_GPL(noop_backing_dev_info);
>  static struct class *bdi_class;
>  
>  /*
> - * bdi_lock protects updates to bdi_list. bdi_list has RCU reader side
> - * locking.
> + * bdi_lock protects bdi_tree and updates to bdi_list. bdi_list has RCU
> + * reader side locking.
>   */
>  DEFINE_SPINLOCK(bdi_lock);
> +static u64 bdi_id_cursor;
> +static struct rb_root bdi_tree = RB_ROOT;
>  LIST_HEAD(bdi_list);
>  
>  /* bdi_wq serves all asynchronous writeback tasks */
> @@ -859,9 +862,58 @@ struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, int node_id)
>  }
>  EXPORT_SYMBOL(bdi_alloc_node);
>  
> +struct rb_node **bdi_lookup_rb_node(u64 id, struct rb_node **parentp)
> +{
> +	struct rb_node **p = &bdi_tree.rb_node;
> +	struct rb_node *parent = NULL;
> +	struct backing_dev_info *bdi;
> +
> +	lockdep_assert_held(&bdi_lock);
> +
> +	while (*p) {
> +		parent = *p;
> +		bdi = rb_entry(parent, struct backing_dev_info, rb_node);
> +
> +		if (bdi->id > id)
> +			p = &(*p)->rb_left;
> +		else if (bdi->id < id)
> +			p = &(*p)->rb_right;
> +		else
> +			break;
> +	}
> +
> +	if (parentp)
> +		*parentp = parent;
> +	return p;
> +}
> +
> +/**
> + * bdi_get_by_id - lookup and get bdi from its id
> + * @id: bdi id to lookup
> + *
> + * Find bdi matching @id and get it.  Returns NULL if the matching bdi
> + * doesn't exist or is already unregistered.
> + */
> +struct backing_dev_info *bdi_get_by_id(u64 id)
> +{
> +	struct backing_dev_info *bdi = NULL;
> +	struct rb_node **p;
> +
> +	spin_lock_irq(&bdi_lock);
> +	p = bdi_lookup_rb_node(id, NULL);
> +	if (*p) {
> +		bdi = rb_entry(*p, struct backing_dev_info, rb_node);
> +		bdi_get(bdi);
> +	}
> +	spin_unlock_irq(&bdi_lock);
> +
> +	return bdi;
> +}
> +
>  int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
>  {
>  	struct device *dev;
> +	struct rb_node *parent, **p;
>  
>  	if (bdi->dev)	/* The driver needs to use separate queues per device */
>  		return 0;
> @@ -877,7 +929,15 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
>  	set_bit(WB_registered, &bdi->wb.state);
>  
>  	spin_lock_bh(&bdi_lock);
> +
> +	bdi->id = ++bdi_id_cursor;
> +
> +	p = bdi_lookup_rb_node(bdi->id, &parent);
> +	rb_link_node(&bdi->rb_node, parent, p);
> +	rb_insert_color(&bdi->rb_node, &bdi_tree);
> +
>  	list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
> +
>  	spin_unlock_bh(&bdi_lock);
>  
>  	trace_writeback_bdi_register(bdi);
> @@ -918,6 +978,7 @@ EXPORT_SYMBOL(bdi_register_owner);
>  static void bdi_remove_from_list(struct backing_dev_info *bdi)
>  {
>  	spin_lock_bh(&bdi_lock);
> +	rb_erase(&bdi->rb_node, &bdi_tree);
>  	list_del_rcu(&bdi->bdi_list);
>  	spin_unlock_bh(&bdi_lock);
>  
> -- 
> 2.17.1
>

Tejun Heo Aug. 15, 2019, 5:34 p.m. UTC | #10
Hello,

On Thu, Aug 15, 2019 at 04:46:23PM +0200, Jan Kara wrote:
> Although I would note that here you take effort not to recycle bdi->id so
> that you don't flush wrong devices while in patch 4 you take pretty lax
> approach to feeding garbage into the writeback system. So these two don't
> quite match to me...

So, I was trying to avoid systemic errors where the wrong thing can be
triggered repeatedly.  A wrong flush once in a blue moon shouldn't be
a big problem, but if it can be triggered consistently by some
pathological behavior, it's a much bigger problem.

Thanks.

Patch

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8fb740178d5d..1075f2552cfc 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -185,6 +185,8 @@  struct bdi_writeback {
 };
 
 struct backing_dev_info {
+	u64 id;
+	struct rb_node rb_node; /* keyed by ->id */
 	struct list_head bdi_list;
 	unsigned long ra_pages;	/* max readahead in PAGE_SIZE units */
 	unsigned long io_pages;	/* max allowed IO size */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 02650b1253a2..84cdcfbc763f 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -24,6 +24,7 @@  static inline struct backing_dev_info *bdi_get(struct backing_dev_info *bdi)
 	return bdi;
 }
 
+struct backing_dev_info *bdi_get_by_id(u64 id);
 void bdi_put(struct backing_dev_info *bdi);
 
 __printf(2, 3)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e8e89158adec..4a8816e0b8d4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1,6 +1,7 @@ 
 // SPDX-License-Identifier: GPL-2.0-only
 
 #include <linux/wait.h>
+#include <linux/rbtree.h>
 #include <linux/backing-dev.h>
 #include <linux/kthread.h>
 #include <linux/freezer.h>
@@ -22,10 +23,12 @@  EXPORT_SYMBOL_GPL(noop_backing_dev_info);
 static struct class *bdi_class;
 
 /*
- * bdi_lock protects updates to bdi_list. bdi_list has RCU reader side
- * locking.
+ * bdi_lock protects bdi_tree and updates to bdi_list. bdi_list has RCU
+ * reader side locking.
  */
 DEFINE_SPINLOCK(bdi_lock);
+static u64 bdi_id_cursor;
+static struct rb_root bdi_tree = RB_ROOT;
 LIST_HEAD(bdi_list);
 
 /* bdi_wq serves all asynchronous writeback tasks */
@@ -859,9 +862,58 @@  struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, int node_id)
 }
 EXPORT_SYMBOL(bdi_alloc_node);
 
+struct rb_node **bdi_lookup_rb_node(u64 id, struct rb_node **parentp)
+{
+	struct rb_node **p = &bdi_tree.rb_node;
+	struct rb_node *parent = NULL;
+	struct backing_dev_info *bdi;
+
+	lockdep_assert_held(&bdi_lock);
+
+	while (*p) {
+		parent = *p;
+		bdi = rb_entry(parent, struct backing_dev_info, rb_node);
+
+		if (bdi->id > id)
+			p = &(*p)->rb_left;
+		else if (bdi->id < id)
+			p = &(*p)->rb_right;
+		else
+			break;
+	}
+
+	if (parentp)
+		*parentp = parent;
+	return p;
+}
+
+/**
+ * bdi_get_by_id - lookup and get bdi from its id
+ * @id: bdi id to lookup
+ *
+ * Find bdi matching @id and get it.  Returns NULL if the matching bdi
+ * doesn't exist or is already unregistered.
+ */
+struct backing_dev_info *bdi_get_by_id(u64 id)
+{
+	struct backing_dev_info *bdi = NULL;
+	struct rb_node **p;
+
+	spin_lock_irq(&bdi_lock);
+	p = bdi_lookup_rb_node(id, NULL);
+	if (*p) {
+		bdi = rb_entry(*p, struct backing_dev_info, rb_node);
+		bdi_get(bdi);
+	}
+	spin_unlock_irq(&bdi_lock);
+
+	return bdi;
+}
+
 int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
 {
 	struct device *dev;
+	struct rb_node *parent, **p;
 
 	if (bdi->dev)	/* The driver needs to use separate queues per device */
 		return 0;
@@ -877,7 +929,15 @@  int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
 	set_bit(WB_registered, &bdi->wb.state);
 
 	spin_lock_bh(&bdi_lock);
+
+	bdi->id = ++bdi_id_cursor;
+
+	p = bdi_lookup_rb_node(bdi->id, &parent);
+	rb_link_node(&bdi->rb_node, parent, p);
+	rb_insert_color(&bdi->rb_node, &bdi_tree);
+
 	list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
+
 	spin_unlock_bh(&bdi_lock);
 
 	trace_writeback_bdi_register(bdi);
@@ -918,6 +978,7 @@  EXPORT_SYMBOL(bdi_register_owner);
 static void bdi_remove_from_list(struct backing_dev_info *bdi)
 {
 	spin_lock_bh(&bdi_lock);
+	rb_erase(&bdi->rb_node, &bdi_tree);
 	list_del_rcu(&bdi->bdi_list);
 	spin_unlock_bh(&bdi_lock);