
block: Ensure that a request queue is dissociated from the cgroup controller

Message ID 20180412015852.26014-1-bart.vanassche@wdc.com (mailing list archive)
State New, archived

Commit Message

Bart Van Assche April 12, 2018, 1:58 a.m. UTC
Several block drivers call alloc_disk() followed by put_disk() if
something fails before device_add_disk() is called, without calling
blk_cleanup_queue(). Make sure that a request queue is dissociated
from the cgroup controller in this scenario as well. This patch
prevents the following kernel crash when the parport_pc, paride and
pf drivers are loaded:

BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
Read of size 4 at addr 0000000000000008 by task modprobe/744
Call Trace:
dump_stack+0x9a/0xeb
kasan_report+0x139/0x350
pi_init+0x42e/0x580 [paride]
pf_init+0x2bb/0x1000 [pf]
do_one_initcall+0x8e/0x405
do_init_module+0xd9/0x2f2
load_module+0x3ab4/0x4700
SYSC_finit_module+0x176/0x1a0
do_syscall_64+0xee/0x2b0
entry_SYSCALL_64_after_hwframe+0x42/0xb7

Reported-by: Alexandru Moise <00moses.alexander00@gmail.com>
Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the block cgroup controller")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Alexandru Moise <00moses.alexander00@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
---
 block/blk-sysfs.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

Comments

Alexandru Moise April 12, 2018, 4:20 a.m. UTC | #1
On Wed, Apr 11, 2018 at 07:58:52PM -0600, Bart Van Assche wrote:
> Several block drivers call alloc_disk() followed by put_disk() if
> something fails before device_add_disk() is called without calling
> blk_cleanup_queue(). Make sure that also for this scenario a request
> queue is dissociated from the cgroup controller. This patch avoids
> that loading the parport_pc, paride and pf drivers triggers the
> following kernel crash:
> 
> BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
> Read of size 4 at addr 0000000000000008 by task modprobe/744
> Call Trace:
> dump_stack+0x9a/0xeb
> kasan_report+0x139/0x350
> pi_init+0x42e/0x580 [paride]
> pf_init+0x2bb/0x1000 [pf]
> do_one_initcall+0x8e/0x405
> do_init_module+0xd9/0x2f2
> load_module+0x3ab4/0x4700
> SYSC_finit_module+0x176/0x1a0
> do_syscall_64+0xee/0x2b0
> entry_SYSCALL_64_after_hwframe+0x42/0xb7
> 
> Reported-by: Alexandru Moise <00moses.alexander00@gmail.com>
> Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the block cgroup controller")
> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: Alexandru Moise <00moses.alexander00@gmail.com>
> Cc: Tejun Heo <tj@kernel.org>
> ---
>  block/blk-sysfs.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index f8457d6f0190..2e134da78f82 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -788,11 +788,37 @@ static void blk_free_queue_rcu(struct rcu_head *rcu_head)
>  static void __blk_release_queue(struct work_struct *work)
>  {
>  	struct request_queue *q = container_of(work, typeof(*q), release_work);
> +	struct blkcg_gq *gq;
>  
>  	if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
>  		blk_stat_remove_callback(q, q->poll_cb);
>  	blk_stat_free_callback(q->poll_cb);
>  
> +	if (!blk_queue_dead(q)) {
> +		/*
> +		 * Last reference was dropped without having called
> +		 * blk_cleanup_queue().
> +		 */
> +		WARN_ONCE(blk_queue_init_done(q),
> +			  "request queue %p has been registered but blk_cleanup_queue() has not been called for that queue\n",
> +			  q);
> +		bdi_put(q->backing_dev_info);
> +		blkcg_exit_queue(q);
> +
> +		if (q->elevator) {
> +			ioc_clear_queue(q);
> +			elevator_exit(q, q->elevator);
> +		}
> +	}
> +
> +	rcu_read_lock();
> +	gq = blkg_lookup(&blkcg_root, q);
> +	rcu_read_unlock();
> +
> +	WARN(gq,
> +	     "request queue %p is being released but it has not yet been removed from the blkcg controller\n",
> +	     q);
> +
>  	blk_free_queue_stats(q->stats);

This solution is good. Thanks for fixing this!

Tested-by: Alexandru Moise <00moses.alexander00@gmail.com>

../Alex

>  
>  	blk_exit_rl(q, &q->root_rl);
> -- 
> 2.16.2
>
Alexandru Moise April 12, 2018, 4:32 a.m. UTC | #3
On Thu, Apr 12, 2018 at 06:20:34AM +0200, Alexandru Moise wrote:
> On Wed, Apr 11, 2018 at 07:58:52PM -0600, Bart Van Assche wrote:
> > Several block drivers call alloc_disk() followed by put_disk() if
> > something fails before device_add_disk() is called without calling
> > blk_cleanup_queue(). Make sure that also for this scenario a request
> > queue is dissociated from the cgroup controller. This patch avoids
> > that loading the parport_pc, paride and pf drivers triggers the
> > following kernel crash:
> > 
> > BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
> > Read of size 4 at addr 0000000000000008 by task modprobe/744
> > Call Trace:
> > dump_stack+0x9a/0xeb
> > kasan_report+0x139/0x350
> > pi_init+0x42e/0x580 [paride]
> > pf_init+0x2bb/0x1000 [pf]
> > do_one_initcall+0x8e/0x405
> > do_init_module+0xd9/0x2f2
> > load_module+0x3ab4/0x4700
> > SYSC_finit_module+0x176/0x1a0
> > do_syscall_64+0xee/0x2b0
> > entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > 
> > Reported-by: Alexandru Moise <00moses.alexander00@gmail.com>
> > Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the block cgroup controller")
> > Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
> > Cc: Alexandru Moise <00moses.alexander00@gmail.com>
> > Cc: Tejun Heo <tj@kernel.org>
> > ---
> >  block/blk-sysfs.c | 26 ++++++++++++++++++++++++++
> >  1 file changed, 26 insertions(+)
> > 
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index f8457d6f0190..2e134da78f82 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -788,11 +788,37 @@ static void blk_free_queue_rcu(struct rcu_head *rcu_head)
> >  static void __blk_release_queue(struct work_struct *work)
> >  {
> >  	struct request_queue *q = container_of(work, typeof(*q), release_work);
> > +	struct blkcg_gq *gq;
> >  
> >  	if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
> >  		blk_stat_remove_callback(q, q->poll_cb);
> >  	blk_stat_free_callback(q->poll_cb);
> >  
> > +	if (!blk_queue_dead(q)) {
> > +		/*
> > +		 * Last reference was dropped without having called
> > +		 * blk_cleanup_queue().
> > +		 */
> > +		WARN_ONCE(blk_queue_init_done(q),
> > +			  "request queue %p has been registered but blk_cleanup_queue() has not been called for that queue\n",
> > +			  q);
> > +		bdi_put(q->backing_dev_info);
> > +		blkcg_exit_queue(q);
> > +
> > +		if (q->elevator) {
> > +			ioc_clear_queue(q);
> > +			elevator_exit(q, q->elevator);
> > +		}
> > +	}
> > +
> > +	rcu_read_lock();
> > +	gq = blkg_lookup(&blkcg_root, q);
> > +	rcu_read_unlock();
> > +
> > +	WARN(gq,
> > +	     "request queue %p is being released but it has not yet been removed from the blkcg controller\n",
> > +	     q);
> > +
> >  	blk_free_queue_stats(q->stats);
> 
> This solution is good. Thanks for fixing this!
> 
> Tested-by: Alexandru Moise <00moses.alexander00@gmail.com>
> 
> ../Alex

Oh, you also need the no-op stubs for !CONFIG_BLK_CGROUP; there is no
blkcg_root in that configuration either.

Thanks,
../Alex
> 
> >  
> >  	blk_exit_rl(q, &q->root_rl);
> > -- 
> > 2.16.2
> >
Christoph Hellwig April 12, 2018, 5:34 a.m. UTC | #4
On Wed, Apr 11, 2018 at 07:58:52PM -0600, Bart Van Assche wrote:
> Several block drivers call alloc_disk() followed by put_disk() if
> something fails before device_add_disk() is called without calling
> blk_cleanup_queue(). Make sure that also for this scenario a request
> queue is dissociated from the cgroup controller. This patch avoids
> that loading the parport_pc, paride and pf drivers triggers the
> following kernel crash:

Can we move the cleanup to put_disk in general and not just for
this case?  Having alloc/free routines pair up generally avoids
a lot of confusion.
Bart Van Assche April 12, 2018, 11:52 a.m. UTC | #5
On Thu, 2018-04-12 at 07:34 +0200, Christoph Hellwig wrote:
> On Wed, Apr 11, 2018 at 07:58:52PM -0600, Bart Van Assche wrote:
> > Several block drivers call alloc_disk() followed by put_disk() if
> > something fails before device_add_disk() is called without calling
> > blk_cleanup_queue(). Make sure that also for this scenario a request
> > queue is dissociated from the cgroup controller. This patch avoids
> > that loading the parport_pc, paride and pf drivers triggers the
> > following kernel crash:
> 
> Can we move the cleanup to put_disk in general and not just for
> this case?  Having alloc/free routines pair up generally avoids
> a lot of confusion.

Hello Christoph,

At least the SCSI ULP drivers drop the last reference to a disk after
the blk_cleanup_queue() call. As explained in the description of commit
a063057d7c73, removing a request queue from blkcg must happen before
blk_cleanup_queue() finishes because a block driver may free the
request queue spinlock immediately after blk_cleanup_queue() returns.
So I don't think that we can move the code that removes a request
queue from blkcg into put_disk(). Another challenge is that some block
drivers (e.g. skd) clear the disk->queue pointer if device_add_disk()
has not been called, to prevent put_disk() from causing a request queue
reference count imbalance.

Bart.
Christoph Hellwig April 12, 2018, 1:14 p.m. UTC | #6
On Thu, Apr 12, 2018 at 11:52:11AM +0000, Bart Van Assche wrote:
> On Thu, 2018-04-12 at 07:34 +0200, Christoph Hellwig wrote:
> > On Wed, Apr 11, 2018 at 07:58:52PM -0600, Bart Van Assche wrote:
> > > Several block drivers call alloc_disk() followed by put_disk() if
> > > something fails before device_add_disk() is called without calling
> > > blk_cleanup_queue(). Make sure that also for this scenario a request
> > > queue is dissociated from the cgroup controller. This patch avoids
> > > that loading the parport_pc, paride and pf drivers triggers the
> > > following kernel crash:
> > 
> > Can we move the cleanup to put_disk in general and not just for
> > this case?  Having alloc/free routines pair up generally avoids
> > a lot of confusion.
> 
> Hello Christoph,
> 
> At least the SCSI ULP drivers drop the last reference to a disk after
> the blk_cleanup_queue() call. As explained in the description of commit
> a063057d7c73, removing a request queue from blkcg must happen before
> blk_cleanup_queue() finishes because a block driver may free the
> request queue spinlock immediately after blk_cleanup_queue() returns.
> So I don't think that we can move the code that removes a request
> queue from blkcg into put_disk(). Another challenge is that some block
> drivers (e.g. skd) clear the disk->queue pointer if device_add_disk()
> has not been called to avoid that put_disk() causes a request queue
> reference count imbalance.

Which sounds like a very good reason not to use a driver controller
lock for internals like blkcg.

In fact splitting the lock used for synchronizing access to queue
fields from the driver controller lock used to synchronize I/O
in the legacy path is long overdue.
Tejun Heo April 12, 2018, 1:48 p.m. UTC | #7
Hello,

On Thu, Apr 12, 2018 at 03:14:40PM +0200, hch@lst.de wrote:
> > At least the SCSI ULP drivers drop the last reference to a disk after
> > the blk_cleanup_queue() call. As explained in the description of commit
> > a063057d7c73, removing a request queue from blkcg must happen before
> > blk_cleanup_queue() finishes because a block driver may free the
> > request queue spinlock immediately after blk_cleanup_queue() returns.
> > So I don't think that we can move the code that removes a request
> > queue from blkcg into put_disk(). Another challenge is that some block
> > drivers (e.g. skd) clear the disk->queue pointer if device_add_disk()
> > has not been called to avoid that put_disk() causes a request queue
> > reference count imbalance.
> 
> Which sounds like a very good reason not to use a driver controller
> lock for internals like blkcq.
> 
> In fact splitting the lock used for synchronizing access to queue
> fields from the driver controller lock used to synchronize I/O
> in the legacy path in long overdue.

It'd probably be a lot easier to make sure the shared lock doesn't go
away till all the request_queues using it go away.  The choice is
between refcnting something which carries the lock and double locking
in hot paths.  Can't think of many downsides of the former approach.

Thanks.
Tejun Heo April 12, 2018, 1:51 p.m. UTC | #8
On Wed, Apr 11, 2018 at 07:58:52PM -0600, Bart Van Assche wrote:
> Several block drivers call alloc_disk() followed by put_disk() if
> something fails before device_add_disk() is called without calling
> blk_cleanup_queue(). Make sure that also for this scenario a request
> queue is dissociated from the cgroup controller. This patch avoids
> that loading the parport_pc, paride and pf drivers triggers the
> following kernel crash:
> 
> BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
> Read of size 4 at addr 0000000000000008 by task modprobe/744
> Call Trace:
> dump_stack+0x9a/0xeb
> kasan_report+0x139/0x350
> pi_init+0x42e/0x580 [paride]
> pf_init+0x2bb/0x1000 [pf]
> do_one_initcall+0x8e/0x405
> do_init_module+0xd9/0x2f2
> load_module+0x3ab4/0x4700
> SYSC_finit_module+0x176/0x1a0
> do_syscall_64+0xee/0x2b0
> entry_SYSCALL_64_after_hwframe+0x42/0xb7
> 
> Reported-by: Alexandru Moise <00moses.alexander00@gmail.com>
> Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the block cgroup controller")
> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: Alexandru Moise <00moses.alexander00@gmail.com>
> Cc: Tejun Heo <tj@kernel.org>

So, this might be correct for this reported case but I don't think
it's correct in general.  There's no synchronization between blkcg
q->lock usages and blk_cleanup_queue().  e.g. hotunplugging and
shutting down a device while file operations are still in progress can
easily blow this up.

Thanks.
Christoph Hellwig April 12, 2018, 1:56 p.m. UTC | #9
On Thu, Apr 12, 2018 at 06:48:12AM -0700, tj@kernel.org wrote:
> > Which sounds like a very good reason not to use a driver controller
> > lock for internals like blkcq.
> > 
> > In fact splitting the lock used for synchronizing access to queue
> > fields from the driver controller lock used to synchronize I/O
> > in the legacy path in long overdue.
> 
> It'd be probably a lot easier to make sure the shared lock doesn't go
> away till all the request_queues using it go away.  The choice is
> between refcnting something which carries the lock and double locking
> in hot paths.  Can't think of many downsides of the former approach.

We've stopped sharing request_queues between different devices a while
ago.  The problem is just that for legacy devices the driver still controls
the lock lifetime, and it might be shorter than the queue lifetime.
Note that for blk-mq we basically don't use the queue_lock at all,
and certainly not in the hot path.
Tejun Heo April 12, 2018, 1:58 p.m. UTC | #10
Hello,

On Thu, Apr 12, 2018 at 03:56:51PM +0200, hch@lst.de wrote:
> On Thu, Apr 12, 2018 at 06:48:12AM -0700, tj@kernel.org wrote:
> > > Which sounds like a very good reason not to use a driver controller
> > > lock for internals like blkcq.
> > > 
> > > In fact splitting the lock used for synchronizing access to queue
> > > fields from the driver controller lock used to synchronize I/O
> > > in the legacy path in long overdue.
> > 
> > It'd be probably a lot easier to make sure the shared lock doesn't go
> > away till all the request_queues using it go away.  The choice is
> > between refcnting something which carries the lock and double locking
> > in hot paths.  Can't think of many downsides of the former approach.
> 
> We've stopped sharing request_queues between different devices a while
> ago.  The problem is just that for legacy devices the driver still controls
> the lock life time, and it might be shorter than the queue lifetime.
> Note that for blk-mq we basically don't use the queue_lock at all,
> and certainly not in the hot path.

Cool, can we just factor out the queue lock from those drivers?  It's
just really messy to be switching locks like we do in the cleanup
path.

Thanks.
Tejun Heo April 12, 2018, 2:07 p.m. UTC | #11
On Thu, Apr 12, 2018 at 06:58:21AM -0700, tj@kernel.org wrote:
> Cool, can we just factor out the queue lock from those drivers?  It's
> just really messy to be switching locks like we do in the cleanup
> path.

So, looking at a couple drivers, it looks like all we'd need is a
struct which contains a spinlock and a refcnt.  Switching the drivers
to use that is a bit tedious but straight-forward and the block layer
can simply put that struct on queue release.

Thanks.
Bart Van Assche April 12, 2018, 2:09 p.m. UTC | #12
On 04/12/18 07:51, Tejun Heo wrote:
> On Wed, Apr 11, 2018 at 07:58:52PM -0600, Bart Van Assche wrote:
>> Several block drivers call alloc_disk() followed by put_disk() if
>> something fails before device_add_disk() is called without calling
>> blk_cleanup_queue(). Make sure that also for this scenario a request
>> queue is dissociated from the cgroup controller. This patch avoids
>> that loading the parport_pc, paride and pf drivers triggers the
>> following kernel crash:
>>
>> BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
>> Read of size 4 at addr 0000000000000008 by task modprobe/744
>> Call Trace:
>> dump_stack+0x9a/0xeb
>> kasan_report+0x139/0x350
>> pi_init+0x42e/0x580 [paride]
>> pf_init+0x2bb/0x1000 [pf]
>> do_one_initcall+0x8e/0x405
>> do_init_module+0xd9/0x2f2
>> load_module+0x3ab4/0x4700
>> SYSC_finit_module+0x176/0x1a0
>> do_syscall_64+0xee/0x2b0
>> entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>
>> Reported-by: Alexandru Moise <00moses.alexander00@gmail.com>
>> Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the block cgroup controller")
>> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
>> Cc: Alexandru Moise <00moses.alexander00@gmail.com>
>> Cc: Tejun Heo <tj@kernel.org>
> 
> So, this might be correct for this reported case but I don't think
> it's correct in general.  There's no synchronization between blkcg
> q->lock usages and blk_cleanup_queue().  e.g. hotunplugging and
> shutting down a device while file operations are still in progress can
> easily blow this up.

Hello Tejun,

I have retested hotunplugging by rerunning the srp-test software. It 
seems like you overlooked that this patch does not remove the 
blkcg_exit_queue() call from blk_cleanup_queue()? If a device is 
hotunplugged it is up to the block driver to call blk_cleanup_queue(). 
And blk_cleanup_queue() will call blkcg_exit_queue().

Bart.
Tejun Heo April 12, 2018, 3:37 p.m. UTC | #13
Hello, Bart.

On Thu, Apr 12, 2018 at 08:09:17AM -0600, Bart Van Assche wrote:
> I have retested hotunplugging by rerunning the srp-test software. It
> seems like you overlooked that this patch does not remove the
> blkcg_exit_queue() call from blk_cleanup_queue()? If a device is
> hotunplugged it is up to the block driver to call
> blk_cleanup_queue(). And blk_cleanup_queue() will call
> blkcg_exit_queue().

Hmm... what'd prevent blg_lookup_and_create() racing against that?

Thanks.
Bart Van Assche April 12, 2018, 4:03 p.m. UTC | #14
On Thu, 2018-04-12 at 08:37 -0700, Tejun Heo wrote:
> On Thu, Apr 12, 2018 at 08:09:17AM -0600, Bart Van Assche wrote:
> > I have retested hotunplugging by rerunning the srp-test software. It
> > seems like you overlooked that this patch does not remove the
> > blkcg_exit_queue() call from blk_cleanup_queue()? If a device is
> > hotunplugged it is up to the block driver to call
> > blk_cleanup_queue(). And blk_cleanup_queue() will call
> > blkcg_exit_queue().
> 
> Hmm... what'd prevent blg_lookup_and_create() racing against that?

Hello Tejun,

Did you perhaps mean blkg_lookup_create()? That function has one caller,
namely blkcg_bio_issue_check(). The only caller of that function is
generic_make_request_checks(). A patch was posted on the linux-block mailing
list recently that surrounds that call with blk_queue_enter() / blk_queue_exit().
I think that prevents blkcg_exit_queue() from being called concurrently
with blkg_lookup_create().

Bart.
Tejun Heo April 12, 2018, 4:12 p.m. UTC | #15
Hello,

On Thu, Apr 12, 2018 at 04:03:52PM +0000, Bart Van Assche wrote:
> On Thu, 2018-04-12 at 08:37 -0700, Tejun Heo wrote:
> > On Thu, Apr 12, 2018 at 08:09:17AM -0600, Bart Van Assche wrote:
> > > I have retested hotunplugging by rerunning the srp-test software. It
> > > seems like you overlooked that this patch does not remove the
> > > blkcg_exit_queue() call from blk_cleanup_queue()? If a device is
> > > hotunplugged it is up to the block driver to call
> > > blk_cleanup_queue(). And blk_cleanup_queue() will call
> > > blkcg_exit_queue().
> > 
> > Hmm... what'd prevent blg_lookup_and_create() racing against that?
> 
> Hello Tejun,
> 
> Did you perhaps mean blkg_lookup_create()? That function has one caller,

Ah yeah, sorry about the sloppiness.

> namely blkcg_bio_issue_check(). The only caller of that function is
> generic_make_request_checks(). A patch was posted on the linux-block mailing
> list recently that surrounds that call with blk_queue_enter() / blk_queue_exit().
> I think that prevents that blkcg_exit_queue() is called concurrently with
> blkg_lookup_create().

Yeah, that'd solve the problem for that particular path, but what
that's doing is adding another layer of refcnting which is used to
implement the revoke (or sever) semantics.  This is a fragile approach
tho because it isn't clear who's protecting what and there's nothing
in blkg's usage which suggests it'd be protected that way and we're
basically working around a different underlying problem (lock
switching) by expanding the coverage of a different lock.

I'd much prefer fixing the lock switching problem properly than
working around that shortcoming this way.

Thanks.
Bart Van Assche April 12, 2018, 4:29 p.m. UTC | #16
On Thu, 2018-04-12 at 09:12 -0700, tj@kernel.org wrote:
> > Did you perhaps mean blkg_lookup_create()? That function has one caller,
> > namely blkcg_bio_issue_check(). The only caller of that function is
> > generic_make_request_checks(). A patch was posted on the linux-block mailing
> > list recently that surrounds that call with blk_queue_enter() / blk_queue_exit().
> > I think that prevents that blkcg_exit_queue() is called concurrently with
> > blkg_lookup_create().
> 
> Yeah, that'd solve the problem for that particular path, but what
> that's doing is adding another layer of refcnting which is used to
> implement the revoke (or sever) semantics.  This is a fragile approach
> tho because it isn't clear who's protecting what and there's nothing
> in blkg's usage which suggests it'd be protected that way and we're
> basically working around a different underlying problem (lock
> switching) by expanding the coverage of a different lock.

Hello Tejun,

Any code that submits a bio or request needs blk_queue_enter() /
blk_queue_exit() anyway. Please have a look at the following commit - you will
see that that commit reduces the number of blk_queue_enter() / blk_queue_exit()
calls in the hot path:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-linus&id=37f9579f4c31a6d698dbf3016d7bf132f9288d30

Thanks,

Bart.
Tejun Heo April 12, 2018, 6:11 p.m. UTC | #17
Hello,

On Thu, Apr 12, 2018 at 04:29:09PM +0000, Bart Van Assche wrote:
> Any code that submits a bio or request needs blk_queue_enter() /
> blk_queue_exit() anyway. Please have a look at the following commit - you will
> see that that commit reduces the number of blk_queue_enter() / blk_queue_exit()
> calls in the hot path:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-linus&id=37f9579f4c31a6d698dbf3016d7bf132f9288d30

So, this can work, but it's still pretty fragile.

* Lock switching is fragile and we really should get rid of it.  This
  is very likely to come back and bite us.

* Right now, blk_queue_enter/exit() doesn't have any annotations.
  IOW, there's no way for paths which need enter locked to actually
  assert that.  If we're gonna protect more things with queue
  enter/exit, it gotta be annotated.

Thanks.
Bart Van Assche April 12, 2018, 6:56 p.m. UTC | #18
On Thu, 2018-04-12 at 11:11 -0700, tj@kernel.org wrote:
> * Right now, blk_queue_enter/exit() doesn't have any annotations.
>   IOW, there's no way for paths which need enter locked to actually
>   assert that.  If we're gonna protect more things with queue
>   enter/exit, it gotta be annotated.

Hello Tejun,

One possibility is to check the count for the local CPU of q->q_usage_counter
at runtime, e.g. using WARN_ON_ONCE(). However, that might have a runtime
overhead. Another possibility is to follow the example of kernfs and to use
a lockdep map and lockdep_assert_held(). There might be other alternatives I
have not thought of.

Bart.
Tejun Heo April 12, 2018, 7:09 p.m. UTC | #19
Hello,

On Thu, Apr 12, 2018 at 06:56:26PM +0000, Bart Van Assche wrote:
> On Thu, 2018-04-12 at 11:11 -0700, tj@kernel.org wrote:
> > * Right now, blk_queue_enter/exit() doesn't have any annotations.
> >   IOW, there's no way for paths which need enter locked to actually
> >   assert that.  If we're gonna protect more things with queue
> >   enter/exit, it gotta be annotated.
> 
> Hello Tejun,
> 
> One possibility is to check the count for the local CPU of q->q_usage_counter
> at runtime, e.g. using WARN_ON_ONCE(). However, that might have a runtime
> overhead. Another possibility is to follow the example of kernfs and to use
> a lockdep map and lockdep_assert_held(). There might be other alternatives I
> have not thought of.

Oh, I'd just do straight-forward lockdep annotation on
queue_enter/exit functions and provide the necessary assert helpers.

Thanks.

Patch

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f8457d6f0190..2e134da78f82 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -788,11 +788,37 @@  static void blk_free_queue_rcu(struct rcu_head *rcu_head)
 static void __blk_release_queue(struct work_struct *work)
 {
 	struct request_queue *q = container_of(work, typeof(*q), release_work);
+	struct blkcg_gq *gq;
 
 	if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
 		blk_stat_remove_callback(q, q->poll_cb);
 	blk_stat_free_callback(q->poll_cb);
 
+	if (!blk_queue_dead(q)) {
+		/*
+		 * Last reference was dropped without having called
+		 * blk_cleanup_queue().
+		 */
+		WARN_ONCE(blk_queue_init_done(q),
+			  "request queue %p has been registered but blk_cleanup_queue() has not been called for that queue\n",
+			  q);
+		bdi_put(q->backing_dev_info);
+		blkcg_exit_queue(q);
+
+		if (q->elevator) {
+			ioc_clear_queue(q);
+			elevator_exit(q, q->elevator);
+		}
+	}
+
+	rcu_read_lock();
+	gq = blkg_lookup(&blkcg_root, q);
+	rcu_read_unlock();
+
+	WARN(gq,
+	     "request queue %p is being released but it has not yet been removed from the blkcg controller\n",
+	     q);
+
 	blk_free_queue_stats(q->stats);
 
 	blk_exit_rl(q, &q->root_rl);