
[v2] blk-cgroup: remove entries in blkg_tree before queue release

Message ID 20180407102148.GA9729@gmail.com (mailing list archive)
State New, archived

Commit Message

Alexandru Moise April 7, 2018, 10:21 a.m. UTC
The q->id is used as an index within the blkg_tree radix tree.

If the entry is not released before the blk_queue_ida id is reclaimed,
blkcg_init_queue() can fail in a different driver that is reassigned
the same id, because an entry already exists at that index in the
radix tree.

Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
---
v2: Added no-op for !CONFIG_BLK_CGROUP

 block/blk-cgroup.c         | 2 +-
 block/blk-sysfs.c          | 4 ++++
 include/linux/blk-cgroup.h | 3 +++
 3 files changed, 8 insertions(+), 1 deletion(-)
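
For context on the failure mode: in the 4.16-era code, blkg_create() inserts
the new blkcg_gq into the per-blkcg radix tree keyed by q->id, and queue ids
are recycled through blk_queue_ida. A rough sketch of the collision
(paraphrased, not a verbatim excerpt):

	/* block/blk-cgroup.c, blkg_create(), roughly: blkgs are keyed by queue id */
	ret = radix_tree_insert(&blkcg->blkg_tree, q->id, blkg);
	/* fails with -EEXIST if a stale blkg from a released queue
	 * still occupies the slot at q->id */

	/* block/blk-core.c, blk_alloc_queue_node(): ids are recycled */
	q->id = ida_simple_get(&blk_queue_ida, 0, 0, gfp_mask);
	/* a new queue can therefore be handed an id whose blkg_tree
	 * slot was never cleared, and blkcg_init_queue() then fails */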

Comments

Tejun Heo April 9, 2018, 10:09 p.m. UTC | #1
(cc'ing Joseph as he worked on the area recently, hi!)

Hello,

On Sat, Apr 07, 2018 at 12:21:48PM +0200, Alexandru Moise wrote:
> The q->id is used as an index within the blkg_tree radix tree.
> 
> If the entry is not released before the blk_queue_ida id is reclaimed,
> blkcg_init_queue() can fail in a different driver that is reassigned
> the same id, because an entry already exists at that index in the
> radix tree.
> 
> Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
> ---
> v2: Added no-op for !CONFIG_BLK_CGROUP
> 
>  block/blk-cgroup.c         | 2 +-
>  block/blk-sysfs.c          | 4 ++++
>  include/linux/blk-cgroup.h | 3 +++
>  3 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 1c16694ae145..224e937dbb59 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -369,7 +369,7 @@ static void blkg_destroy(struct blkcg_gq *blkg)
>   *
>   * Destroy all blkgs associated with @q.
>   */
> -static void blkg_destroy_all(struct request_queue *q)
> +void blkg_destroy_all(struct request_queue *q)
>  {
>  	struct blkcg_gq *blkg, *n;
>  
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index d00d1b0ec109..a72866458f22 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -816,6 +816,10 @@ static void __blk_release_queue(struct work_struct *work)
>  	if (q->bio_split)
>  		bioset_free(q->bio_split);
>  
> +	spin_lock_irq(q->queue_lock);
> +	blkg_destroy_all(q);
> +	spin_unlock_irq(q->queue_lock);
> +
>  	ida_simple_remove(&blk_queue_ida, q->id);
>  	call_rcu(&q->rcu_head, blk_free_queue_rcu);

But we already do this through calling blkcg_exit_queue() from
__blk_release_queue().  What's missing?

Thanks.
Alexandru Moise April 11, 2018, 10:12 a.m. UTC | #2
On Mon, Apr 09, 2018 at 03:09:38PM -0700, Tejun Heo wrote:
> (cc'ing Joseph as he worked on the area recently, hi!)
> 
> Hello,
> 
> On Sat, Apr 07, 2018 at 12:21:48PM +0200, Alexandru Moise wrote:
> > The q->id is used as an index within the blkg_tree radix tree.
> > 
> > If the entry is not released before the blk_queue_ida id is reclaimed,
> > blkcg_init_queue() can fail in a different driver that is reassigned
> > the same id, because an entry already exists at that index in the
> > radix tree.
> > 
> > Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
> > ---
> > v2: Added no-op for !CONFIG_BLK_CGROUP
> > 
> >  block/blk-cgroup.c         | 2 +-
> >  block/blk-sysfs.c          | 4 ++++
> >  include/linux/blk-cgroup.h | 3 +++
> >  3 files changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > index 1c16694ae145..224e937dbb59 100644
> > --- a/block/blk-cgroup.c
> > +++ b/block/blk-cgroup.c
> > @@ -369,7 +369,7 @@ static void blkg_destroy(struct blkcg_gq *blkg)
> >   *
> >   * Destroy all blkgs associated with @q.
> >   */
> > -static void blkg_destroy_all(struct request_queue *q)
> > +void blkg_destroy_all(struct request_queue *q)
> >  {
> >  	struct blkcg_gq *blkg, *n;
> >  
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index d00d1b0ec109..a72866458f22 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -816,6 +816,10 @@ static void __blk_release_queue(struct work_struct *work)
> >  	if (q->bio_split)
> >  		bioset_free(q->bio_split);
> >  
> > +	spin_lock_irq(q->queue_lock);
> > +	blkg_destroy_all(q);
> > +	spin_unlock_irq(q->queue_lock);
> > +
> >  	ida_simple_remove(&blk_queue_ida, q->id);
> >  	call_rcu(&q->rcu_head, blk_free_queue_rcu);
> 
> But we already do this through calling blkcg_exit_queue() from
> __blk_release_queue().  What's missing?

Hi,

It might be the jetlag but I can't see how you end up calling
blkcg_exit_queue() from __blk_release_queue().

As I see it the only way to reach blkcg_exit_queue() is from
blk_cleanup_queue(), which I don't see anywhere in __blk_release_queue().

I suspect that I'm just fixing a corner case though and
the general case is what you describe or similar.

../Alex
> 
> Thanks.
> 
> -- 
> tejun
Tejun Heo April 11, 2018, 2:20 p.m. UTC | #3
Hello,

On Wed, Apr 11, 2018 at 12:12:56PM +0200, Alexandru Moise wrote:
> > But we already do this through calling blkcg_exit_queue() from
> > __blk_release_queue().  What's missing?
> 
> Hi,
> 
> It might be the jetlag but I can't see how you end up calling
> blkcg_exit_queue() from __blk_release_queue().
> 
> As I see it the only way to reach blkcg_exit_queue() is from
> blk_cleanup_queue(), which I don't see anywhere in __blk_release_queue().
> 
> I suspect that I'm just fixing a corner case though and
> the general case is what you describe or similar.

Ah, that changed recently.  Can you please check out the current
upstream git master?

Thanks.
Alexandru Moise April 11, 2018, 2:28 p.m. UTC | #4
On Wed, Apr 11, 2018 at 07:20:19AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Wed, Apr 11, 2018 at 12:12:56PM +0200, Alexandru Moise wrote:
> > > But we already do this through calling blkcg_exit_queue() from
> > > __blk_release_queue().  What's missing?
> > 
> > Hi,
> > 
> > It might be the jetlag but I can't see how you end up calling
> > blkcg_exit_queue() from __blk_release_queue().
> > 
> > As I see it the only way to reach blkcg_exit_queue() is from
> > blk_cleanup_queue(), which I don't see anywhere in __blk_release_queue().
> > 
> > I suspect that I'm just fixing a corner case though and
> > the general case is what you describe or similar.
> 
> Ah, that changed recently.  Can you please check out the current
> upstream git master?
> 
> Thanks.
> 
Just did, without my patch I see this crash:

[    0.759999] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 4.16.0-ARCH+ #81
[    0.759999] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/204
[    0.759999] RIP: 0010:pi_init+0x23f/0x2f0
[    0.759999] RSP: 0000:ffffc90000197d90 EFLAGS: 00010246
[    0.759999] RAX: 0000000000000000 RBX: 0000000000000020 RCX: 0000000000000038
[    0.759999] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000000
[    0.759999] RBP: ffffc90000197e18 R08: 00000000ffffffff R09: 00000000ffffffff
[    0.759999] R10: ffffea0000eda600 R11: ffff88003b69f164 R12: ffffffff82e2d740
[    0.759999] R13: 00000000ffffffff R14: 0000000000000000 R15: 0000000000000000
[    0.759999] FS:  0000000000000000(0000) GS:ffff88003e500000(0000) knlGS:0000000000000000
[    0.759999] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.759999] CR2: 0000000000000008 CR3: 0000000002209001 CR4: 00000000000606e0
[    0.759999] Call Trace:
[    0.759999]  pf_init+0x1db/0x3be
[    0.759999]  ? pcd_init+0x4e8/0x4e8
[    0.759999]  do_one_initcall+0x9e/0x1b0
[    0.759999]  ? do_early_param+0x97/0x97
[    0.759999]  kernel_init_freeable+0x259/0x2fd
[    0.759999]  ? rest_init+0xd0/0xd0
[    0.759999]  ? syscall_slow_exit_work+0x1c/0x160
[    0.759999]  kernel_init+0xe/0x100
[    0.759999]  ret_from_fork+0x3a/0x50
[    0.759999] Code: 75 6a 49 8b 06 48 8b 40 78 48 85 c0 74 08 4c 89 f7 e8 46 76 51 00 83 c3 01 3b 5d a8 7d 0d 49
[    0.759999] RIP: pi_init+0x23f/0x2f0 RSP: ffffc90000197d90
[    0.759999] CR2: 0000000000000008
[    0.759999] ---[ end trace 12004f267bb8bf7d ]---
[    0.766666] BUG: unable to handle kernel NULL pointer dereference at 00000000000001b4
[    0.763350] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[    0.763350]
[    0.766666] PGD 0 P4D 0
[    0.766666] Oops: 0000 [#2] PREEMPT SMP
[    0.766666] CPU: 0 PID: 6 Comm: kworker/u12:0 Tainted: G      D          4.16.0-ARCH+ #81
[    0.766666] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/20$
[    0.766666] Workqueue: nvme-reset-wq nvme_reset_work
[    0.766666] RIP: 0010:blk_queue_flag_set+0xf/0x40
[    0.766666] RSP: 0000:ffffc900001bfcb0 EFLAGS: 00010246
[    0.766666] RAX: ffff88003b698000 RBX: 0000000000000000 RCX: 0000000000000000
[    0.766666] RDX: ffff88003b698000 RSI: fffffffffffffff4 RDI: 000000000000001c
[    0.766666] RBP: ffffc900001bfcc0 R08: 0000000000000000 R09: 0000000000000000
[    0.766666] R10: ffffea0000eaa980 R11: ffffffff814e0970 R12: 000000000000001c
[    0.766666] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88003aad8010
[    0.766666] FS:  0000000000000000(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
[    0.766666] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.766666] CR2: 00000000000001b4 CR3: 0000000002209001 CR4: 00000000000606f0
[    0.766666] Call Trace:
[    0.766666]  blk_mq_quiesce_queue+0x23/0x80
[    0.766666]  nvme_dev_disable+0x34f/0x480
[    0.766666]  ? nvme_irq+0x50/0x50
[    0.766666]  ? dev_warn+0x64/0x80
[    0.766666]  nvme_reset_work+0x13de/0x1570
[    0.766666]  ? __switch_to_asm+0x34/0x70
[    0.766666]  ? __switch_to_asm+0x40/0x70
[    0.766666]  ? _raw_spin_unlock_irq+0x15/0x30
[    0.766666]  ? finish_task_switch+0x156/0x210
[    0.766666]  process_one_work+0x20c/0x3d0
[    0.766666]  worker_thread+0x216/0x400
[    0.766666]  kthread+0x125/0x130
[    0.766666]  ? process_one_work+0x3d0/0x3d0
[    0.766666]  ? __kthread_bind_mask+0x60/0x60
[    0.766666]  ret_from_fork+0x3a/0x50


With the patch the crash goes away,

Thanks,
../Alex

> -- 
> tejun
Tejun Heo April 11, 2018, 2:46 p.m. UTC | #5
Hello,

On Wed, Apr 11, 2018 at 04:28:59PM +0200, Alexandru Moise wrote:
> > Ah, that changed recently.  Can you please check out the current
> > upstream git master?
> > 
> Just did, without my patch I see this crash:

lol I was looking at the old tree, so this is the fix for the new
breakage introduced by the recent change.  Sorry about the confusion.
Joseph, can you please take a look?

Thanks.
Tejun Heo April 11, 2018, 2:51 p.m. UTC | #6
Hello,

(cc'ing Bart)

On Wed, Apr 11, 2018 at 07:46:16AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Wed, Apr 11, 2018 at 04:28:59PM +0200, Alexandru Moise wrote:
> > > Ah, that changed recently.  Can you please check out the current
> > > upstream git master?
> > > 
> > Just did, without my patch I see this crash:
> 
> lol I was looking at the old tree, so this is the fix for the new
> breakage introduced by the recent change.  Sorry about the confusion.
> Joseph, can you please take a look?

Oh, it wasn't Joseph's change.  It was Bart's fix for a problem
reported by Joseph.  Bart, a063057d7c73 ("block: Fix a race between
request queue removal and the block cgroup controller") created a
regression where a request_queue can be destroyed with blkgs still
attached.  The original report is..

 http://lkml.kernel.org/r/20180407102148.GA9729@gmail.com

Thanks.
Tejun Heo April 11, 2018, 2:56 p.m. UTC | #7
Hello, again.

On Wed, Apr 11, 2018 at 07:51:23AM -0700, Tejun Heo wrote:
> Oh, it wasn't Joseph's change.  It was Bart's fix for a problem
> reported by Joseph.  Bart, a063057d7c73 ("block: Fix a race between
> request queue removal and the block cgroup controller") created a
> regression where a request_queue can be destroyed with blkgs still
> attached.  The original report is..
> 
>  http://lkml.kernel.org/r/20180407102148.GA9729@gmail.com

And looking at the change, it looks like the right thing we should
have done is caching @lock on the print_blkg side and when switching
locks make sure both locks are held.  IOW, do the following in
blk_cleanup_queue()

	spin_lock_irq(lock);
	if (q->queue_lock != &q->__queue_lock) {
		spin_lock(&q->__queue_lock);
		q->queue_lock = &q->__queue_lock;
		spin_unlock(&q->__queue_lock);
	}
	spin_unlock_irq(lock);

Otherwise, there can be two lock holders thinking they have exclusive
access to the request_queue.

Thanks.
Bart Van Assche April 11, 2018, 3:54 p.m. UTC | #8
On Wed, 2018-04-11 at 16:28 +0200, Alexandru Moise wrote:
> [    0.766666] BUG: unable to handle kernel NULL pointer dereference at 00000000000001b4
> [    0.763350] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
> [    0.763350]
> [    0.766666] PGD 0 P4D 0
> [    0.766666] Oops: 0000 [#2] PREEMPT SMP
> [    0.766666] CPU: 0 PID: 6 Comm: kworker/u12:0 Tainted: G      D          4.16.0-ARCH+ #81
> [    0.766666] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/20$
> [    0.766666] Workqueue: nvme-reset-wq nvme_reset_work
> [    0.766666] RIP: 0010:blk_queue_flag_set+0xf/0x40
> [    0.766666] RSP: 0000:ffffc900001bfcb0 EFLAGS: 00010246
> [    0.766666] RAX: ffff88003b698000 RBX: 0000000000000000 RCX: 0000000000000000
> [    0.766666] RDX: ffff88003b698000 RSI: fffffffffffffff4 RDI: 000000000000001c
> [    0.766666] RBP: ffffc900001bfcc0 R08: 0000000000000000 R09: 0000000000000000
> [    0.766666] R10: ffffea0000eaa980 R11: ffffffff814e0970 R12: 000000000000001c
> [    0.766666] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88003aad8010
> [    0.766666] FS:  0000000000000000(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
> [    0.766666] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.766666] CR2: 00000000000001b4 CR3: 0000000002209001 CR4: 00000000000606f0
> [    0.766666] Call Trace:
> [    0.766666]  blk_mq_quiesce_queue+0x23/0x80
> [    0.766666]  nvme_dev_disable+0x34f/0x480
> [    0.766666]  ? nvme_irq+0x50/0x50
> [    0.766666]  ? dev_warn+0x64/0x80
> [    0.766666]  nvme_reset_work+0x13de/0x1570
> [    0.766666]  ? __switch_to_asm+0x34/0x70
> [    0.766666]  ? __switch_to_asm+0x40/0x70
> [    0.766666]  ? _raw_spin_unlock_irq+0x15/0x30
> [    0.766666]  ? finish_task_switch+0x156/0x210
> [    0.766666]  process_one_work+0x20c/0x3d0
> [    0.766666]  worker_thread+0x216/0x400
> [    0.766666]  kthread+0x125/0x130
> [    0.766666]  ? process_one_work+0x3d0/0x3d0
> [    0.766666]  ? __kthread_bind_mask+0x60/0x60
> [    0.766666]  ret_from_fork+0x3a/0x50

Hello Alexandru,

What made you look at cgroups? In the above register dump I see that %rbx == NULL.
I think that means that the queue pointer argument of blk_queue_flag_set() is NULL.
The NVMe initiator driver should never pass a NULL pointer to blk_mq_quiesce_queue().
Please ask the NVMe driver maintainers for their opinion on the linux-nvme mailing
list.

Thanks,

Bart.
Bart Van Assche April 11, 2018, 4:42 p.m. UTC | #9
On Wed, 2018-04-11 at 07:56 -0700, Tejun Heo wrote:
> And looking at the change, it looks like the right thing we should
> have done is caching @lock on the print_blkg side and when switching
> locks make sure both locks are held.  IOW, do the following in
> blk_cleanup_queue()
> 
> 	spin_lock_irq(lock);
> 	if (q->queue_lock != &q->__queue_lock) {
> 		spin_lock(&q->__queue_lock);
> 		q->queue_lock = &q->__queue_lock;
> 		spin_unlock(&q->__queue_lock);
> 	}
> 	spin_unlock_irq(lock);
> 
> Otherwise, there can be two lock holders thinking they have exclusive
> access to the request_queue.

I think that's a bad idea. A block driver is allowed to destroy the
spinlock it associated with the request queue as soon as blk_cleanup_queue()
has finished. If the block cgroup controller would cache a pointer to the
block driver spinlock then that could cause the cgroup code to attempt to
lock a spinlock after it has been destroyed. I don't think we need that kind
of race condition.

Bart.
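
To make Bart's lifetime concern concrete, consider a driver that embeds the
spinlock it registers as q->queue_lock in its own private structure (a
hypothetical layout for illustration, not taken from any real driver):

	struct my_dev {				/* hypothetical driver-private data */
		spinlock_t lock;		/* passed to blk_init_queue() as queue_lock */
		struct request_queue *q;
	};

	/* driver teardown */
	blk_cleanup_queue(dev->q);
	kfree(dev);		/* the spinlock dies here; any q->queue_lock
				 * pointer cached past this point would lock
				 * freed memory */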
Tejun Heo April 11, 2018, 5 p.m. UTC | #10
Hello,

On Wed, Apr 11, 2018 at 04:42:55PM +0000, Bart Van Assche wrote:
> On Wed, 2018-04-11 at 07:56 -0700, Tejun Heo wrote:
> > And looking at the change, it looks like the right thing we should
> > have done is caching @lock on the print_blkg side and when switching
> > locks make sure both locks are held.  IOW, do the following in
> > blk_cleanup_queue()
> > 
> > 	spin_lock_irq(lock);
> > 	if (q->queue_lock != &q->__queue_lock) {
> > 		spin_lock(&q->__queue_lock);
> > 		q->queue_lock = &q->__queue_lock;
> > 		spin_unlock(&q->__queue_lock);
> > 	}
> > 	spin_unlock_irq(lock);
> > 
> > Otherwise, there can be two lock holders thinking they have exclusive
> > access to the request_queue.
> 
> I think that's a bad idea. A block driver is allowed to destroy the
> spinlock it associated with the request queue as soon as blk_cleanup_queue()
> has finished. If the block cgroup controller would cache a pointer to the
> block driver spinlock then that could cause the cgroup code to attempt to
> lock a spinlock after it has been destroyed. I don't think we need that kind
> of race conditions.

I see, but that problem is there with or without caching as long as we
have queue_lock usage which reaches beyond cleanup_queue, right?  Whether
that user caches the lock for matching unlocking or not doesn't really
change the situation.

Short of adding protection around queue_lock switching, I can't think
of a solution tho.  Probably the right thing to do is adding queue
lock/unlock helpers which are safe to use beyond cleanup_queue.

Thanks.
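
One possible shape for the helpers Tejun mentions, sketched only to show the
idea; the name and the reliance on blk_queue_dead() are assumptions, and this
alone does not close the switch-over race discussed above:

	/* hypothetical: pick a lock that is still valid even after
	 * blk_cleanup_queue() has switched the queue back to its
	 * internal lock */
	static inline spinlock_t *blk_safe_queue_lock(struct request_queue *q)
	{
		return blk_queue_dead(q) ? &q->__queue_lock : q->queue_lock;
	}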
Bart Van Assche April 11, 2018, 5:06 p.m. UTC | #11
On Wed, 2018-04-11 at 10:00 -0700, tj@kernel.org wrote:
> On Wed, Apr 11, 2018 at 04:42:55PM +0000, Bart Van Assche wrote:
> > On Wed, 2018-04-11 at 07:56 -0700, Tejun Heo wrote:
> > > And looking at the change, it looks like the right thing we should
> > > have done is caching @lock on the print_blkg side and when switching
> > > locks make sure both locks are held.  IOW, do the following in
> > > blk_cleanup_queue()
> > > 
> > > 	spin_lock_irq(lock);
> > > 	if (q->queue_lock != &q->__queue_lock) {
> > > 		spin_lock(&q->__queue_lock);
> > > 		q->queue_lock = &q->__queue_lock;
> > > 		spin_unlock(&q->__queue_lock);
> > > 	}
> > > 	spin_unlock_irq(lock);
> > > 
> > > Otherwise, there can be two lock holders thinking they have exclusive
> > > access to the request_queue.
> > 
> > I think that's a bad idea. A block driver is allowed to destroy the
> > spinlock it associated with the request queue as soon as blk_cleanup_queue()
> > has finished. If the block cgroup controller would cache a pointer to the
> > block driver spinlock then that could cause the cgroup code to attempt to
> > lock a spinlock after it has been destroyed. I don't think we need that kind
> > of race condition.
> 
> I see, but that problem is there with or without caching as long as we
> have queue_lock usage which reaches beyond cleanup_queue, right?  Whether
> that user caches the lock for matching unlocking or not doesn't really
> change the situation.
> 
> Short of adding protection around queue_lock switching, I can't think
> of a solution tho.  Probably the right thing to do is adding queue
> lock/unlock helpers which are safe to use beyond cleanup_queue.

Hello Tejun,

A simple and effective solution is to dissociate a request queue from the
block cgroup controller before blk_cleanup_queue() returns. This is why commit
a063057d7c73 ("block: Fix a race between request queue removal and the block
cgroup controller") moved the blkcg_exit_queue() call from __blk_release_queue()
into blk_cleanup_queue().

Thanks,

Bart.
Tejun Heo April 11, 2018, 5:15 p.m. UTC | #12
Hello,

On Wed, Apr 11, 2018 at 05:06:41PM +0000, Bart Van Assche wrote:
> A simple and effective solution is to dissociate a request queue from the
> block cgroup controller before blk_cleanup_queue() returns. This is why commit
> a063057d7c73 ("block: Fix a race between request queue removal and the block
> cgroup controller") moved the blkcg_exit_queue() call from __blk_release_queue()
> into blk_cleanup_queue().

which is broken.  We can try to switch the lifetime model to revoking
all live objects but that likely is a lot more involved than blindly
moving blkg shootdown from release to cleanup.  Implementing sever
semantics is usually a lot more involved / fragile because it requires
explicit participation from all users (exactly the same way revoking
->queue_lock is difficult).

I'm not necessarily against switching to sever model, but what the
patch did isn't that.  It just moved some code without actually
understanding or auditing what the implications are.

Thanks.
Bart Van Assche April 11, 2018, 5:26 p.m. UTC | #13
On Wed, 2018-04-11 at 10:15 -0700, tj@kernel.org wrote:
> On Wed, Apr 11, 2018 at 05:06:41PM +0000, Bart Van Assche wrote:
> > A simple and effective solution is to dissociate a request queue from the
> > block cgroup controller before blk_cleanup_queue() returns. This is why commit
> > a063057d7c73 ("block: Fix a race between request queue removal and the block
> > cgroup controller") moved the blkcg_exit_queue() call from __blk_release_queue()
> > into blk_cleanup_queue().
> 
> which is broken.  We can try to switch the lifetime model to revoking
> all live objects but that likely is a lot more involved than blindly
> moving blkg shootdown from release to cleanup.  Implementing sever
> semantics is usually a lot more involved / fragile because it requires
> explicit participation from all users (exactly the same way revoking
> ->queue_lock is difficult).
> 
> I'm not necessarily against switching to sever model, but what the
> patch did isn't that.  It just moved some code without actually
> understanding or auditing what the implications are.

Hello Tejun,

Please explain further what you wrote. It's not clear to me why you think
that anything is broken, or what a "sever model" is.

I think we really need the blkcg_exit_queue() call in blk_cleanup_queue()
to avoid race conditions between request queue cleanup and the block cgroup
controller. Additionally, since it is guaranteed that no new requests will
be submitted to a queue after it has been marked dead I don't see why it
would make sense to keep the association between a request queue and the
blkcg controller until the last reference on a queue is dropped.

Bart.
Tejun Heo April 11, 2018, 5:30 p.m. UTC | #14
Hello,

On Wed, Apr 11, 2018 at 05:26:12PM +0000, Bart Van Assche wrote:
> Please explain what you wrote further. It's not clear to me why you think
> that anything is broken nor what a "sever model" is.

So, the sever (or revoke) model is where you actively disconnect an object
and ensure that there'll be no further references from others.  In
contrast, what we usually do is refcnting, where we don't actively
sever the dying object from its users but let the users drain
themselves over time and destroy the object when we know all the users
are gone (refcnt reaching zero).
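
In code terms, the refcnt model is the usual kref pattern; a generic
illustration, not blkcg-specific:

	struct obj {
		struct kref ref;		/* users pin the object */
	};

	static void obj_release(struct kref *ref)
	{
		kfree(container_of(ref, struct obj, ref));
	}

	/* destruction happens only when the last user drops its reference */
	kref_put(&obj->ref, obj_release);

	/* the sever model instead disconnects every user up front, and each
	 * access path must then check for and tolerate the disconnection */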

> I think we really need the blkcg_exit_queue() call in blk_cleanup_queue()
> to avoid race conditions between request queue cleanup and the block cgroup
> controller. Additionally, since it is guaranteed that no new requests will
> be submitted to a queue after it has been marked dead, I don't see why it
> would make sense to keep the association between a request queue and the
> blkcg controller until the last reference on a queue is dropped.

Sure, new requests aren't the only source tho.  e.g. there can be
derefs through sysfs / cgroupfs or writeback attempts on dead devices.
If you want to implement sever, you gotta hunt down all those and make
sure they can't make any further derefs.

Thanks.
Alexandru Moise April 11, 2018, 7 p.m. UTC | #15
On Wed, Apr 11, 2018 at 03:54:53PM +0000, Bart Van Assche wrote:
> On Wed, 2018-04-11 at 16:28 +0200, Alexandru Moise wrote:
> > [    0.766666] BUG: unable to handle kernel NULL pointer dereference at 00000000000001b4
> > [    0.763350] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
> > [    0.763350]
> > [    0.766666] PGD 0 P4D 0
> > [    0.766666] Oops: 0000 [#2] PREEMPT SMP
> > [    0.766666] CPU: 0 PID: 6 Comm: kworker/u12:0 Tainted: G      D          4.16.0-ARCH+ #81
> > [    0.766666] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/20$
> > [    0.766666] Workqueue: nvme-reset-wq nvme_reset_work
> > [    0.766666] RIP: 0010:blk_queue_flag_set+0xf/0x40
> > [    0.766666] RSP: 0000:ffffc900001bfcb0 EFLAGS: 00010246
> > [    0.766666] RAX: ffff88003b698000 RBX: 0000000000000000 RCX: 0000000000000000
> > [    0.766666] RDX: ffff88003b698000 RSI: fffffffffffffff4 RDI: 000000000000001c
> > [    0.766666] RBP: ffffc900001bfcc0 R08: 0000000000000000 R09: 0000000000000000
> > [    0.766666] R10: ffffea0000eaa980 R11: ffffffff814e0970 R12: 000000000000001c
> > [    0.766666] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88003aad8010
> > [    0.766666] FS:  0000000000000000(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
> > [    0.766666] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    0.766666] CR2: 00000000000001b4 CR3: 0000000002209001 CR4: 00000000000606f0
> > [    0.766666] Call Trace:
> > [    0.766666]  blk_mq_quiesce_queue+0x23/0x80
> > [    0.766666]  nvme_dev_disable+0x34f/0x480
> > [    0.766666]  ? nvme_irq+0x50/0x50
> > [    0.766666]  ? dev_warn+0x64/0x80
> > [    0.766666]  nvme_reset_work+0x13de/0x1570
> > [    0.766666]  ? __switch_to_asm+0x34/0x70
> > [    0.766666]  ? __switch_to_asm+0x40/0x70
> > [    0.766666]  ? _raw_spin_unlock_irq+0x15/0x30
> > [    0.766666]  ? finish_task_switch+0x156/0x210
> > [    0.766666]  process_one_work+0x20c/0x3d0
> > [    0.766666]  worker_thread+0x216/0x400
> > [    0.766666]  kthread+0x125/0x130
> > [    0.766666]  ? process_one_work+0x3d0/0x3d0
> > [    0.766666]  ? __kthread_bind_mask+0x60/0x60
> > [    0.766666]  ret_from_fork+0x3a/0x50
> 
> Hello Alexandru,
> 
> What made you look at cgroups? In the above register dump I see that %rbx == NULL.
> I think that means that the queue pointer argument of blk_queue_flag_set() is NULL.
> The NVMe initiator driver should never pass a NULL pointer to blk_mq_quiesce_queue().
> Please ask the NVMe driver maintainers for their opinion on the linux-nvme mailing
> list.
> 
> Thanks,
> 
> Bart.

The %rbx == NULL is only a symptom of the cgroup mishandling. Perhaps we could
improve the error handling in the NVMe driver, but I can say that about a lot of
block drivers; perhaps I will write some patches in the future to improve it.

But the root cause is in blkcg_init_queue(), where blkg_create() returns
an ERR pointer because it tries to insert at an already populated index in
blkcg->blkg_tree: the entry that we fail to remove at __blk_release_queue().

Thanks,
../Alex



Patch

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 1c16694ae145..224e937dbb59 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -369,7 +369,7 @@ static void blkg_destroy(struct blkcg_gq *blkg)
  *
  * Destroy all blkgs associated with @q.
  */
-static void blkg_destroy_all(struct request_queue *q)
+void blkg_destroy_all(struct request_queue *q)
 {
 	struct blkcg_gq *blkg, *n;
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index d00d1b0ec109..a72866458f22 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -816,6 +816,10 @@ static void __blk_release_queue(struct work_struct *work)
 	if (q->bio_split)
 		bioset_free(q->bio_split);
 
+	spin_lock_irq(q->queue_lock);
+	blkg_destroy_all(q);
+	spin_unlock_irq(q->queue_lock);
+
 	ida_simple_remove(&blk_queue_ida, q->id);
 	call_rcu(&q->rcu_head, blk_free_queue_rcu);
 }
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 6c666fd7de3c..3d60b1d1973d 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -179,6 +179,8 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 int blkcg_init_queue(struct request_queue *q);
 void blkcg_drain_queue(struct request_queue *q);
 void blkcg_exit_queue(struct request_queue *q);
+void blkg_destroy_all(struct request_queue *q);
+
 
 /* Blkio controller policy registration */
 int blkcg_policy_register(struct blkcg_policy *pol);
@@ -740,6 +742,7 @@ static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { ret
 static inline int blkcg_init_queue(struct request_queue *q) { return 0; }
 static inline void blkcg_drain_queue(struct request_queue *q) { }
 static inline void blkcg_exit_queue(struct request_queue *q) { }
+static inline void blkg_destroy_all(struct request_queue *q) { }
 static inline int blkcg_policy_register(struct blkcg_policy *pol) { return 0; }
 static inline void blkcg_policy_unregister(struct blkcg_policy *pol) { }
 static inline int blkcg_activate_policy(struct request_queue *q,