diff mbox series

[1/2] block: also mark disk-owned queues as dying in __blk_mark_disk_dead

Message ID 20241009113831.557606-2-hch@lst.de (mailing list archive)
State New, archived

Commit Message

Christoph Hellwig Oct. 9, 2024, 11:38 a.m. UTC
When del_gendisk shuts down access to a gendisk, it could lead to a
deadlock with sd or sr, which try to submit passthrough SCSI commands from
their ->release method under open_mutex.  The submission can be blocked
in blk_enter_queue while del_gendisk can't get to actually telling them
to stop and wake them up.

As the disk is going away there is no real point in sending these
commands, but we have no really good way to distinguish between the
cases.  For now mark even standalone (aka SCSI) queues as dying in
del_gendisk to avoid this deadlock, but the real fix will be to split
freeing a disk from freezing a queue for requests not associated with a
disk.

Reported-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 block/genhd.c          | 16 ++++++++++++++--
 include/linux/blkdev.h |  1 +
 2 files changed, 15 insertions(+), 2 deletions(-)
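
The flag handshake described above can be sketched in userspace C.  This
is only a model under stated assumptions: struct model_queue, the FLAG_*
values and the model_* helpers are hypothetical stand-ins built on C11
atomics, not the kernel's bitops or struct request_queue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative bit values; the kernel uses bit numbers in queue_flags. */
enum { FLAG_DYING = 1u << 0, FLAG_RESURRECT = 1u << 1 };

struct model_queue {
	atomic_uint flags;
};

/* Models test_and_set_bit(): returns true if the bit was already set. */
static bool model_test_and_set(struct model_queue *q, unsigned int bit)
{
	return atomic_fetch_or(&q->flags, bit) & bit;
}

/* Models the __blk_mark_disk_dead() hunk: remember that we set DYING. */
static void model_mark_disk_dead(struct model_queue *q)
{
	if (!model_test_and_set(q, FLAG_DYING))
		atomic_fetch_or(&q->flags, FLAG_RESURRECT);
}

/* Models the del_gendisk() hunk for a queue that outlives the disk. */
static void model_finish_del_gendisk(struct model_queue *q)
{
	unsigned int f = atomic_load(&q->flags);

	/* RESURRECT is only ever set after DYING has been set. */
	assert(!(f & FLAG_RESURRECT) || (f & FLAG_DYING));
	if (f & FLAG_RESURRECT)
		atomic_fetch_and(&q->flags, ~(FLAG_DYING | FLAG_RESURRECT));
}

static bool model_is_dying(struct model_queue *q)
{
	return atomic_load(&q->flags) & FLAG_DYING;
}
```

In this model a queue that was already dying before the disk was marked
dead stays dying afterwards, while a queue that was only marked dying on
behalf of the disk becomes usable again.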

Comments

Sergey Senozhatsky Oct. 9, 2024, 12:31 p.m. UTC | #1
On (24/10/09 13:38), Christoph Hellwig wrote:
[..]
> @@ -589,8 +589,16 @@ static void __blk_mark_disk_dead(struct gendisk *disk)
>  	if (test_and_set_bit(GD_DEAD, &disk->state))
>  		return;
>  
> -	if (test_bit(GD_OWNS_QUEUE, &disk->state))
> -		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
> +	/*
> +	 * Also mark the disk dead if it is not owned by the gendisk.  This
> +	 * means we can't allow /dev/sg passthrough or SCSI internal commands
> +	 * while unbinding a ULP.  That is more than just a bit ugly, but until
> +	 * we untangle q_usage_counter into one owned by the disk and one owned
> +	 * by the queue this is as good as it gets.  The flag will be cleared
> +	 * at the end of del_gendisk if it wasn't set before.
> +	 */
> +	if (!test_and_set_bit(QUEUE_FLAG_DYING, &disk->queue->queue_flags))
> +		set_bit(QUEUE_FLAG_RESURRECT, &disk->queue->queue_flags);
>  
>  	/*
>  	 * Stop buffered writers from dirtying pages that can't be written out.
> @@ -719,6 +727,10 @@ void del_gendisk(struct gendisk *disk)
>  	 * again.  Else leave the queue frozen to fail all I/O.
>  	 */
>  	if (!test_bit(GD_OWNS_QUEUE, &disk->state)) {
> +		if (test_bit(QUEUE_FLAG_RESURRECT, &q->queue_flags)) {
> +			clear_bit(QUEUE_FLAG_DYING, &q->queue_flags);
> +			clear_bit(QUEUE_FLAG_RESURRECT, &q->queue_flags);
> +		}

Christoph, shouldn't QUEUE_FLAG_RESURRECT handling be outside of
GD_OWNS_QUEUE if-block? Because __blk_mark_disk_dead() sets
QUEUE_FLAG_DYING/QUEUE_FLAG_RESURRECT regardless of GD_OWNS_QUEUE.


// A silly nit: it seems the code uses blk_queue_flag_set() and
// blk_queue_flag_clear() helpers, but there is no queue_flag_test().
// I don't know what the preference is here - stick to queue_flag
// helpers, or is it ok to mix them?

>  		blk_queue_flag_clear(QUEUE_FLAG_INIT_DONE, q);
>  		__blk_mq_unfreeze_queue(q, true);
Christoph Hellwig Oct. 9, 2024, 12:41 p.m. UTC | #2
On Wed, Oct 09, 2024 at 09:31:23PM +0900, Sergey Senozhatsky wrote:
> >  	if (!test_bit(GD_OWNS_QUEUE, &disk->state)) {
> > +		if (test_bit(QUEUE_FLAG_RESURRECT, &q->queue_flags)) {
> > +			clear_bit(QUEUE_FLAG_DYING, &q->queue_flags);
> > +			clear_bit(QUEUE_FLAG_RESURRECT, &q->queue_flags);
> > +		}
> 
> Christoph, shouldn't QUEUE_FLAG_RESURRECT handling be outside of
> GD_OWNS_QUEUE if-block? Because __blk_mark_disk_dead() sets
> QUEUE_FLAG_DYING/QUEUE_FLAG_RESURRECT regardless of GD_OWNS_QUEUE.

For !GD_OWNS_QUEUE the queue is freed right below, so there isn't much
of a point.

> // A silly nit: it seems the code uses blk_queue_flag_set() and
> // blk_queue_flag_clear() helpers, but there is no queue_flag_test(),
> // I don't know what if the preference here - stick to queue_flag
> // helpers, or is it ok to mix them.

Yeah.  I looked into a test_and_set wrapper, but then saw how pointless
the existing wrappers are.  So for now this just open codes it, and
once we're done with the fixes I plan to just send a patch to remove
the wrappers entirely.
Sergey Senozhatsky Oct. 9, 2024, 12:43 p.m. UTC | #3
On (24/10/09 14:41), Christoph Hellwig wrote:
> On Wed, Oct 09, 2024 at 09:31:23PM +0900, Sergey Senozhatsky wrote:
> > >  	if (!test_bit(GD_OWNS_QUEUE, &disk->state)) {
> > > +		if (test_bit(QUEUE_FLAG_RESURRECT, &q->queue_flags)) {
> > > +			clear_bit(QUEUE_FLAG_DYING, &q->queue_flags);
> > > +			clear_bit(QUEUE_FLAG_RESURRECT, &q->queue_flags);
> > > +		}
> > 
> > Christoph, shouldn't QUEUE_FLAG_RESURRECT handling be outside of
> > GD_OWNS_QUEUE if-block? Because __blk_mark_disk_dead() sets
> > QUEUE_FLAG_DYING/QUEUE_FLAG_RESURRECT regardless of GD_OWNS_QUEUE.
> 
> For !GD_OWNS_QUEUE the queue is freed right below, so there isn't much
> of a point.

Oh, right.

> > // A silly nit: it seems the code uses blk_queue_flag_set() and
> > // blk_queue_flag_clear() helpers, but there is no queue_flag_test(),
> > // I don't know what if the preference here - stick to queue_flag
> > // helpers, or is it ok to mix them.
> 
> Yeah.  I looked into a test_and_set wrapper, but then saw how pointless
> the existing wrappers are.

Likewise.

> So for now this just open codes it, and once we're done with the fixes
> I plan to just send a patch to remove the wrappers entirely.

Ack.
Jens Axboe Oct. 9, 2024, 1:49 p.m. UTC | #4
On 10/9/24 6:41 AM, Christoph Hellwig wrote:
>> // A silly nit: it seems the code uses blk_queue_flag_set() and
>> // blk_queue_flag_clear() helpers, but there is no queue_flag_test(),
>> // I don't know what if the preference here - stick to queue_flag
>> // helpers, or is it ok to mix them.
> 
> Yeah.  I looked into a test_and_set wrapper, but then saw how pointless
> the existing wrappers are.  So for now this just open codes it, and
> once we're done with the fixes I plan to just send a patch to remove
> the wrappers entirely.

Agree, but that's because you didn't do it back when you changed them to
be just set/clear bit operations ;-). They should definitely just go
away now.
YangYang Oct. 16, 2024, 4:14 a.m. UTC | #5
On 2024/10/9 19:38, Christoph Hellwig wrote:
> When del_gendisk shuts down access to a gendisk, it could lead to a
> deadlock with sd or sr, which try to submit passthrough SCSI commands from
> their ->release method under open_mutex.  The submission can be blocked
> in blk_enter_queue while del_gendisk can't get to actually telling them
> to stop and wake them up.
> 
> As the disk is going away there is no real point in sending these
> commands, but we have no really good way to distinguish between the
> cases.  For now mark even standalone (aka SCSI queues) as dying in
> del_gendisk to avoid this deadlock, but the real fix will be to split
> freeing a disk from freezing a queue for not disk associated requests.
> 
> Reported-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> ---
>   block/genhd.c          | 16 ++++++++++++++--
>   include/linux/blkdev.h |  1 +
>   2 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 1c05dd4c6980b5..7026569fa8a0be 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -589,8 +589,16 @@ static void __blk_mark_disk_dead(struct gendisk *disk)
>   	if (test_and_set_bit(GD_DEAD, &disk->state))
>   		return;
>   
> -	if (test_bit(GD_OWNS_QUEUE, &disk->state))
> -		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
> +	/*
> +	 * Also mark the disk dead if it is not owned by the gendisk.  This
> +	 * means we can't allow /dev/sg passthrough or SCSI internal commands
> +	 * while unbinding a ULP.  That is more than just a bit ugly, but until
> +	 * we untangle q_usage_counter into one owned by the disk and one owned
> +	 * by the queue this is as good as it gets.  The flag will be cleared
> +	 * at the end of del_gendisk if it wasn't set before.
> +	 */
> +	if (!test_and_set_bit(QUEUE_FLAG_DYING, &disk->queue->queue_flags))
> +		set_bit(QUEUE_FLAG_RESURRECT, &disk->queue->queue_flags);
>   
>   	/*
>   	 * Stop buffered writers from dirtying pages that can't be written out.
> @@ -719,6 +727,10 @@ void del_gendisk(struct gendisk *disk)
>   	 * again.  Else leave the queue frozen to fail all I/O.
>   	 */
>   	if (!test_bit(GD_OWNS_QUEUE, &disk->state)) {
> +		if (test_bit(QUEUE_FLAG_RESURRECT, &q->queue_flags)) {
> +			clear_bit(QUEUE_FLAG_DYING, &q->queue_flags);
> +			clear_bit(QUEUE_FLAG_RESURRECT, &q->queue_flags);
> +		}
>   		blk_queue_flag_clear(QUEUE_FLAG_INIT_DONE, q);
>   		__blk_mq_unfreeze_queue(q, true);
>   	} else {
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 50c3b959da2816..391e3eb3bb5e61 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -590,6 +590,7 @@ struct request_queue {
>   /* Keep blk_queue_flag_name[] in sync with the definitions below */
>   enum {
>   	QUEUE_FLAG_DYING,		/* queue being torn down */
> +	QUEUE_FLAG_RESURRECT,		/* temporarily dying */
>   	QUEUE_FLAG_NOMERGES,		/* disable merge attempts */
>   	QUEUE_FLAG_SAME_COMP,		/* complete on same CPU-group */
>   	QUEUE_FLAG_FAIL_IO,		/* fake timeout */


Looks good. Feel free to add:

Reviewed-by: Yang Yang <yang.yang@vivo.com>

Thanks.
Ming Lei Oct. 16, 2024, 11:09 a.m. UTC | #6
On Wed, Oct 09, 2024 at 01:38:20PM +0200, Christoph Hellwig wrote:
> When del_gendisk shuts down access to a gendisk, it could lead to a
> deadlock with sd or sr, which try to submit passthrough SCSI commands from
> their ->release method under open_mutex.  The submission can be blocked
> in blk_enter_queue while del_gendisk can't get to actually telling them
> to stop and wake them up.
> 
> As the disk is going away there is no real point in sending these
> commands, but we have no really good way to distinguish between the
> cases.  For now mark even standalone (aka SCSI queues) as dying in
> del_gendisk to avoid this deadlock, but the real fix will be to split
> freeing a disk from freezing a queue for not disk associated requests.
> 
> Reported-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> ---
>  block/genhd.c          | 16 ++++++++++++++--
>  include/linux/blkdev.h |  1 +
>  2 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 1c05dd4c6980b5..7026569fa8a0be 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -589,8 +589,16 @@ static void __blk_mark_disk_dead(struct gendisk *disk)
>  	if (test_and_set_bit(GD_DEAD, &disk->state))
>  		return;
>  
> -	if (test_bit(GD_OWNS_QUEUE, &disk->state))
> -		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
> +	/*
> +	 * Also mark the disk dead if it is not owned by the gendisk.  This
> +	 * means we can't allow /dev/sg passthrough or SCSI internal commands
> +	 * while unbinding a ULP.  That is more than just a bit ugly, but until
> +	 * we untangle q_usage_counter into one owned by the disk and one owned
> +	 * by the queue this is as good as it gets.  The flag will be cleared
> +	 * at the end of del_gendisk if it wasn't set before.
> +	 */
> +	if (!test_and_set_bit(QUEUE_FLAG_DYING, &disk->queue->queue_flags))
> +		set_bit(QUEUE_FLAG_RESURRECT, &disk->queue->queue_flags);

Setting QUEUE_FLAG_DYING may fail passthrough requests for
!GD_OWNS_QUEUE, I guess this may cause a SCSI regression.

blk_queue_enter() needs to wait until RESURRECT & DYING are cleared
instead of returning failure.
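
To make the difference concrete, here is a deliberately simplified
decision model.  It is a sketch, not the kernel's blk_queue_enter():
struct model_state and both functions are hypothetical, and the real
code sleeps on a waitqueue instead of returning a WAIT value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum enter_result { ENTER_OK, ENTER_WAIT, ENTER_FAIL };

struct model_state {
	bool frozen;     /* fast path unavailable: queue frozen or draining */
	bool dying;      /* models QUEUE_FLAG_DYING */
	bool resurrect;  /* models QUEUE_FLAG_RESURRECT */
};

/* As posted: a caller that cannot enter fails as soon as DYING is set. */
static enum enter_result enter_as_posted(const struct model_state *s)
{
	assert(s != NULL);
	if (!s->frozen)
		return ENTER_OK;
	return s->dying ? ENTER_FAIL : ENTER_WAIT;
}

/* Suggested alternative: keep waiting while DYING is only temporary. */
static enum enter_result enter_wait_on_resurrect(const struct model_state *s)
{
	assert(s != NULL);
	if (!s->frozen)
		return ENTER_OK;
	if (s->dying && !s->resurrect)
		return ENTER_FAIL;
	return ENTER_WAIT;
}
```

With frozen, dying and resurrect all set, the first function fails the
passthrough request and the second keeps it waiting, which is the
behavioural gap being discussed.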


Thanks, 
Ming
Christoph Hellwig Oct. 16, 2024, 12:32 p.m. UTC | #7
On Wed, Oct 16, 2024 at 07:09:48PM +0800, Ming Lei wrote:
> Setting QUEUE_FLAG_DYING may fail passthrough request for
> !GD_OWNS_QUEUE, I guess this may cause SCSI regression.

Yes, as clearly documented in the commit log.

> 
> blk_queue_enter() need to wait until RESURRECT & DYING are cleared
> instead of returning failure.

What we really need is to split the enter conditions between
disk and standalone queue.  But until then I think the current
version is reasonable enough.
Ming Lei Oct. 16, 2024, 12:49 p.m. UTC | #8
On Wed, Oct 16, 2024 at 02:32:40PM +0200, Christoph Hellwig wrote:
> On Wed, Oct 16, 2024 at 07:09:48PM +0800, Ming Lei wrote:
> > Setting QUEUE_FLAG_DYING may fail passthrough request for
> > !GD_OWNS_QUEUE, I guess this may cause SCSI regression.
> 
> Yes, as clearly documented in the commit log.

The change needs to Cc linux-scsi.

> As the disk is going away there is no real point in sending these
> commands, but we have no really good way to distinguish between the
> cases.  

The SCSI request queue has a very different lifetime from the gendisk, so
I'm not sure the above comment is correct.


Thanks,
Ming
Ming Lei Oct. 16, 2024, 1:35 p.m. UTC | #9
On Wed, Oct 09, 2024 at 01:38:20PM +0200, Christoph Hellwig wrote:
> When del_gendisk shuts down access to a gendisk, it could lead to a
> deadlock with sd or sr, which try to submit passthrough SCSI commands from
> their ->release method under open_mutex.  The submission can be blocked
> in blk_enter_queue while del_gendisk can't get to actually telling them
> to stop and wake them up.

When ->release() waits in blk_enter_queue(), the following code block

	mutex_lock(&disk->open_mutex);
	__blk_mark_disk_dead(disk);
	xa_for_each_start(&disk->part_tbl, idx, part, 1)
	        drop_partition(part);
	mutex_unlock(&disk->open_mutex);

in del_gendisk() should have been done.

Then del_gendisk() should move on and finally unfreeze the queue, so I
still don't see how the above deadlock is triggered.


Thanks,
Ming
Sergey Senozhatsky Oct. 19, 2024, 1:25 a.m. UTC | #10
On (24/10/16 21:35), Ming Lei wrote:
> On Wed, Oct 09, 2024 at 01:38:20PM +0200, Christoph Hellwig wrote:
> > When del_gendisk shuts down access to a gendisk, it could lead to a
> > deadlock with sd or sr, which try to submit passthrough SCSI commands from
> > their ->release method under open_mutex.  The submission can be blocked
> > in blk_enter_queue while del_gendisk can't get to actually telling them
> > to stop and wake them up.
> 
> When ->release() waits in blk_enter_queue(), the following code block
> 
> 	mutex_lock(&disk->open_mutex);
> 	__blk_mark_disk_dead(disk);
> 	xa_for_each_start(&disk->part_tbl, idx, part, 1)
> 	        drop_partition(part);
> 	mutex_unlock(&disk->open_mutex);

The task sleeping in blk_enter_queue()->schedule() holds ->open_mutex, so
that block of code sleeps on ->open_mutex. We can't drain under ->open_mutex.
Ming Lei Oct. 19, 2024, 12:32 p.m. UTC | #11
On Sat, Oct 19, 2024 at 10:25:41AM +0900, Sergey Senozhatsky wrote:
> On (24/10/16 21:35), Ming Lei wrote:
> > On Wed, Oct 09, 2024 at 01:38:20PM +0200, Christoph Hellwig wrote:
> > > When del_gendisk shuts down access to a gendisk, it could lead to a
> > > deadlock with sd or sr, which try to submit passthrough SCSI commands from
> > > their ->release method under open_mutex.  The submission can be blocked
> > > in blk_enter_queue while del_gendisk can't get to actually telling them
> > > to stop and wake them up.
> > 
> > When ->release() waits in blk_enter_queue(), the following code block
> > 
> > 	mutex_lock(&disk->open_mutex);
> > 	__blk_mark_disk_dead(disk);
> > 	xa_for_each_start(&disk->part_tbl, idx, part, 1)
> > 	        drop_partition(part);
> > 	mutex_unlock(&disk->open_mutex);
> 
> blk_enter_queue()->schedule() holds ->open_mutex, so that block
> of code sleeps on ->open_mutex. We can't drain under ->open_mutex.

We haven't started to drain yet, so why does blk_enter_queue() sleep, and
what is it waiting for?



Thanks,
Ming
Sergey Senozhatsky Oct. 19, 2024, 12:37 p.m. UTC | #12
On (24/10/19 20:32), Ming Lei wrote:
[..]
> > > When ->release() waits in blk_enter_queue(), the following code block
> > > 
> > > 	mutex_lock(&disk->open_mutex);
> > > 	__blk_mark_disk_dead(disk);
> > > 	xa_for_each_start(&disk->part_tbl, idx, part, 1)
> > > 	        drop_partition(part);
> > > 	mutex_unlock(&disk->open_mutex);
> > 
> > blk_enter_queue()->schedule() holds ->open_mutex, so that block
> > of code sleeps on ->open_mutex. We can't drain under ->open_mutex.
> 
> We don't start to drain yet, then why does blk_enter_queue() sleeps and
> it waits for what?

Unfortunately I don't have a device to repro this, but it happens to a
number of our customers (using different peripheral devices, but, as far
as I can tell, all running the 6.6 kernel).
Ming Lei Oct. 19, 2024, 12:50 p.m. UTC | #13
On Sat, Oct 19, 2024 at 09:37:27PM +0900, Sergey Senozhatsky wrote:
> On (24/10/19 20:32), Ming Lei wrote:
> [..]
> > > > When ->release() waits in blk_enter_queue(), the following code block
> > > > 
> > > > 	mutex_lock(&disk->open_mutex);
> > > > 	__blk_mark_disk_dead(disk);
> > > > 	xa_for_each_start(&disk->part_tbl, idx, part, 1)
> > > > 	        drop_partition(part);
> > > > 	mutex_unlock(&disk->open_mutex);
> > > 
> > > blk_enter_queue()->schedule() holds ->open_mutex, so that block
> > > of code sleeps on ->open_mutex. We can't drain under ->open_mutex.
> > 
> > We don't start to drain yet, then why does blk_enter_queue() sleeps and
> > it waits for what?
> 
> Unfortunately I don't have a device to repro this, but it happens to a
> number of our customers (using different peripheral devices, but, as far
> as I'm concerned, all running 6.6 kernel).

I can understand the issue on v6.6 because it doesn't have commit
7e04da2dc701 ("block: fix deadlock between sd_remove & sd_release").

But for the latest upstream, I don't see how it can happen.


Thanks,
Ming
Sergey Senozhatsky Oct. 19, 2024, 12:58 p.m. UTC | #14
On (24/10/19 20:50), Ming Lei wrote:
> On Sat, Oct 19, 2024 at 09:37:27PM +0900, Sergey Senozhatsky wrote:
> > On (24/10/19 20:32), Ming Lei wrote:
> > [..]
> > Unfortunately I don't have a device to repro this, but it happens to a
> > number of our customers (using different peripheral devices, but, as far
> > as I'm concerned, all running 6.6 kernel).
> 
> I can understand the issue on v6.6 because it doesn't have commit
> 7e04da2dc701 ("block: fix deadlock between sd_remove & sd_release").

We have that one in 6.6, as far as I can tell

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/block/genhd.c?h=v6.6.57#n663
Ming Lei Oct. 19, 2024, 1:09 p.m. UTC | #15
On Sat, Oct 19, 2024 at 09:58:04PM +0900, Sergey Senozhatsky wrote:
> On (24/10/19 20:50), Ming Lei wrote:
> > On Sat, Oct 19, 2024 at 09:37:27PM +0900, Sergey Senozhatsky wrote:
> > > On (24/10/19 20:32), Ming Lei wrote:
> > > [..]
> > > Unfortunately I don't have a device to repro this, but it happens to a
> > > number of our customers (using different peripheral devices, but, as far
> > > as I'm concerned, all running 6.6 kernel).
> > 
> > I can understand the issue on v6.6 because it doesn't have commit
> > 7e04da2dc701 ("block: fix deadlock between sd_remove & sd_release").
> 
> We have that one in 6.6, as far as I can tell
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/block/genhd.c?h=v6.6.57#n663

Then we need to root-cause it first.

If you can reproduce it, please provide dmesg log, and deadlock related
process stack trace log collected via sysrq control.


thanks,
Ming
Sergey Senozhatsky Oct. 19, 2024, 1:50 p.m. UTC | #16
On (24/10/19 21:09), Ming Lei wrote:
> On Sat, Oct 19, 2024 at 09:58:04PM +0900, Sergey Senozhatsky wrote:
> > On (24/10/19 20:50), Ming Lei wrote:
> > > On Sat, Oct 19, 2024 at 09:37:27PM +0900, Sergey Senozhatsky wrote:
[..]
> 
> Then we need to root-cause it first.
> 
> If you can reproduce it

I cannot.

All I have are backtraces from various crash reports; I posted
some of them earlier [1] (and in that entire thread).  This looks like a
close()->bio_queue_enter() vs usb_disconnect()->del_gendisk() deadlock,
and del_gendisk() cannot drain.  Draining under the same lock that the
things we want to drain currently hold looks troublesome in general.

[1] https://lore.kernel.org/linux-block/20241008051948.GB10794@google.com
Ming Lei Oct. 19, 2024, 3:03 p.m. UTC | #17
On Sat, Oct 19, 2024 at 10:50:10PM +0900, Sergey Senozhatsky wrote:
> On (24/10/19 21:09), Ming Lei wrote:
> > On Sat, Oct 19, 2024 at 09:58:04PM +0900, Sergey Senozhatsky wrote:
> > > On (24/10/19 20:50), Ming Lei wrote:
> > > > On Sat, Oct 19, 2024 at 09:37:27PM +0900, Sergey Senozhatsky wrote:
> [..]
> > 
> > Then we need to root-cause it first.
> > 
> > If you can reproduce it
> 
> I cannot.
> 
> All I'm having are backtraces from various crash reports, I posted
> some of them earlier [1] (and in that entire thread).  This loos like
> close()->bio_queue_enter() vs usb_disconnect()->del_gendisk() deadlock,
> and del_gendisk() cannot drain.  Doing drain under the same lock, that
> things we want to drain currently hold, looks troublesome in general.
> 
> [1] https://lore.kernel.org/linux-block/20241008051948.GB10794@google.com

Probably bio_queue_enter() waits for runtime PM, and the queue is in
->pm_only state, and BLK_MQ_REQ_PM isn't passed actually from
ioctl_internal_command() <- scsi_set_medium_removal().
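
A toy model of that hypothesis, under the assumption that a pm_only
queue admits only requests that carry the PM flag.  The names
pm_model_queue and pm_model_enter are made up for illustration, and the
real check also looks at the runtime PM status and the freeze depth,
which are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the queue state relevant to entering it. */
struct pm_model_queue {
	bool pm_only;   /* queue only admits power-management requests */
	bool dying;     /* models QUEUE_FLAG_DYING */
};

enum pm_enter { PM_ENTER_OK, PM_ENTER_SLEEP, PM_ENTER_FAIL };

/*
 * req_pm models whether the caller passed BLK_MQ_REQ_PM.  A pm_only queue
 * turns away every other request; such a caller sleeps unless the queue
 * has been marked dying, in which case it gets an error instead.
 */
static enum pm_enter pm_model_enter(const struct pm_model_queue *q, bool req_pm)
{
	assert(q != NULL);
	if (!q->pm_only || req_pm)
		return PM_ENTER_OK;
	return q->dying ? PM_ENTER_FAIL : PM_ENTER_SLEEP;
}
```

In this model a non-PM request against a pm_only queue sleeps until
either pm_only is dropped or the queue is marked dying.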

And if you have a vmcore collected, it shouldn't be hard to root cause.

Also I'd suggest collecting the intact related dmesg log in future, instead
of providing a selective log; for example, there isn't even a kernel version...


Thanks,
Ming
Sergey Senozhatsky Oct. 19, 2024, 3:11 p.m. UTC | #18
On (24/10/19 23:03), Ming Lei wrote:
> Probably bio_queue_enter() waits for runtime PM, and the queue is in
> ->pm_only state, and BLK_MQ_REQ_PM isn't passed actually from
> ioctl_internal_command() <- scsi_set_medium_removal().
> 
> And if you have vmcore collected, it shouldn't be not hard to root cause.

We don't collect those.

> Also I'd suggest to collect intact related dmesg log in future, instead of
> providing selective log, such as, there isn't even kernel version...

These "selected" backtraces are the only backtraces in the dmesg.
I literally have reports that have just two backtraces of tasks blocked
over 120 seconds, one close()->bio_queue_enter()->schedule (under
->open_mutex) and the other one del_gendisk()->mutex_lock()->schedule().
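
The two backtraces can be read as a two-node wait-for graph.  Below is a
minimal sketch of that reading, assuming each task waits on at most one
other task; the array encoding and in_wait_cycle() are illustrative only
and have nothing to do with how the kernel detects hung tasks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Each task waits for at most one other task; NO_WAIT means runnable. */
#define NO_WAIT ((size_t)-1)

/*
 * Returns true if following waits_for[] from `start` leads back to
 * `start`, i.e. the task is part of a wait cycle and cannot progress.
 */
static bool in_wait_cycle(const size_t *waits_for, size_t ntasks, size_t start)
{
	size_t cur = start;

	for (size_t steps = 0; steps < ntasks; steps++) {
		assert(cur < ntasks);
		cur = waits_for[cur];
		if (cur == NO_WAIT)
			return false;
		if (cur == start)
			return true;
	}
	return false;
}
```

With task 0 as the close() path (holds ->open_mutex, waits for the queue
that only the remover can release) and task 1 as del_gendisk() (waits
for ->open_mutex), waits_for = {1, 0} forms a cycle.  If the close()
path gets an error instead of sleeping, its entry becomes NO_WAIT and
the cycle is gone.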
Sergey Senozhatsky Oct. 19, 2024, 3:40 p.m. UTC | #19
On (24/10/19 23:03), Ming Lei wrote:
> 
> there isn't even kernel version...
> 

Well, that's on me, yes, I admit it.  I completely missed that, but
it was never a secret [1].  I missed it, probably, because I would
not have reached out to upstream with a 5.4 bug report; and 6.6, in that
part of the code, looked quite close to upstream.  But well, I forgot
to add the kernel version, yes.

[1]  https://lore.kernel.org/linux-block/20241003135504.GL11458@google.com
Sergey Senozhatsky Oct. 28, 2024, 5:44 a.m. UTC | #20
On (24/10/19 23:03), Ming Lei wrote:
> On Sat, Oct 19, 2024 at 10:50:10PM +0900, Sergey Senozhatsky wrote:
> > On (24/10/19 21:09), Ming Lei wrote:
> > > On Sat, Oct 19, 2024 at 09:58:04PM +0900, Sergey Senozhatsky wrote:
> > > > On (24/10/19 20:50), Ming Lei wrote:
> > > > > On Sat, Oct 19, 2024 at 09:37:27PM +0900, Sergey Senozhatsky wrote:
> > [..]
> 
> Probably bio_queue_enter() waits for runtime PM, and the queue is in
> ->pm_only state, and BLK_MQ_REQ_PM isn't passed actually from
> ioctl_internal_command() <- scsi_set_medium_removal().

Sorry for the delay.

Another report.
I see lots of buffer I/O errors

<6>[ 364.268167] usb-storage 3-3:1.0: USB Mass Storage device detected
<6>[ 364.268551] scsi host3: usb-storage 3-3:1.0
<3>[ 364.274806] Buffer I/O error on dev sdc1, logical block 0, lost async page write
<5>[ 365.318424] scsi 3:0:0:0: Direct-Access VendorCo ProductCode 2.00 PQ: 0 ANSI: 4
<5>[ 365.319898] sd 3:0:0:0: [sdc] 122880000 512-byte logical blocks: (62.9 GB/58.6 GiB)
<5>[ 365.320077] sd 3:0:0:0: [sdc] Write Protect is off
<7>[ 365.320085] sd 3:0:0:0: [sdc] Mode Sense: 03 00 00 00
<4>[ 365.320255] sd 3:0:0:0: [sdc] No Caching mode page found
<4>[ 365.320262] sd 3:0:0:0: [sdc] Assuming drive cache: write through
<6>[ 365.322483] sdc: sdc1
<5>[ 365.323130] sd 3:0:0:0: [sdc] Attached SCSI removable disk
<6>[ 369.083225] usb 3-3: USB disconnect, device number 49

Then PM suspend/resume.

After resume

<7>[ 1338.847937] PM: resume of devices complete after 291.422 msecs
<6>[ 1338.854215] OOM killer enabled.
<6>[ 1338.854235] Restarting tasks ...
<6>[ 1338.854797] mei_hdcp 0000:00:16.0-(UUID: 7): bound 0000:00:02.0 (ops 0xffffffffb8f03e50)
<6>[ 1338.857745] mei_pxp 0000:00:16.0-(UUID: 2): bound 0000:00:02.0 (ops 0xffffffffb8f16a80)
<4>[ 1338.859663] done.
<5>[ 1338.859683] random: crng reseeded on system resumption
<12>[ 1338.868200] init: cupsd main process ended, respawning
<6>[ 1338.868541] Resume caused by IRQ 9, acpi
<6>[ 1338.868549] Resume caused by IRQ 98, chromeos-ec
<6>[ 1338.868555] PM: suspend exit

lots of buffer I/O errors again and eventually a deadlock.  The deadlock
happens much later than 120 seconds after resume, so I cannot directly
connect those events.

[..]
<6>[ 1859.660882] usb-storage 3-3:1.0: USB Mass Storage device detected
<6>[ 1859.661457] scsi host4: usb-storage 3-3:1.0
<3>[ 1859.668180] Buffer I/O error on dev sdd1, logical block 0, lost async page write
<5>[ 1860.697826] scsi 4:0:0:0: Direct-Access VendorCo ProductCode 2.00 PQ: 0 ANSI: 4
<5>[ 1860.699222] sd 4:0:0:0: [sdd] 122880000 512-byte logical blocks: (62.9 GB/58.6 GiB)
<5>[ 1860.699373] sd 4:0:0:0: [sdd] Write Protect is off
<7>[ 1860.699380] sd 4:0:0:0: [sdd] Mode Sense: 03 00 00 00
<4>[ 1860.699522] sd 4:0:0:0: [sdd] No Caching mode page found
<4>[ 1860.699526] sd 4:0:0:0: [sdd] Assuming drive cache: write through
<6>[ 1860.701393] sdd: sdd1
<5>[ 1860.701886] sd 4:0:0:0: [sdd] Attached SCSI removable disk
<6>[ 1862.077109] usb 3-3: USB disconnect, device number 110
<6>[ 1862.338159] usb 3-3: new high-speed USB device number 111 using xhci_hcd
<6>[ 1862.468090] usb 3-3: New USB device found, idVendor=346d, idProduct=5678, bcdDevice= 2.00
<6>[ 1862.468105] usb 3-3: New USB device strings: Mfr=1, Product=2, SerialNumber=(Serial: 8)
<6>[ 1862.468111] usb 3-3: Product: Disk 2.0
<6>[ 1862.468115] usb 3-3: Manufacturer: USB
<6>[ 1862.468119] usb 3-3: SerialNumber: (Serial: 9)
<6>[ 1862.469962] usb-storage 3-3:1.0: USB Mass Storage device detected
<6>[ 1862.470642] scsi host3: usb-storage 3-3:1.0
<3>[ 1862.476447] Buffer I/O error on dev sdd1, logical block 0, lost async page write
<5>[ 1863.514018] scsi 3:0:0:0: Direct-Access VendorCo ProductCode 2.00 PQ: 0 ANSI: 4
<5>[ 1863.515489] sd 3:0:0:0: [sdd] 122880000 512-byte logical blocks: (62.9 GB/58.6 GiB)
<5>[ 1863.515640] sd 3:0:0:0: [sdd] Write Protect is off
<7>[ 1863.515646] sd 3:0:0:0: [sdd] Mode Sense: 03 00 00 00
<4>[ 1863.515797] sd 3:0:0:0: [sdd] No Caching mode page found
<4>[ 1863.515802] sd 3:0:0:0: [sdd] Assuming drive cache: write through
<6>[ 1863.518227] sdd: sdd1
<5>[ 1863.518551] sd 3:0:0:0: [sdd] Attached SCSI removable disk
<6>[ 1865.018356] usb 3-3: USB disconnect, device number 111
<6>[ 1865.285091] usb 3-3: new high-speed USB device number 112 using xhci_hcd
<3>[ 1865.605088] usb 3-3: device descriptor read/64, error -71
<6>[ 1865.844873] usb 3-3: New USB device found, idVendor=346d, idProduct=5678, bcdDevice= 2.00
<6>[ 1865.844892] usb 3-3: New USB device strings: Mfr=1, Product=2, SerialNumber=(Serial: 8)
<6>[ 1865.844898] usb 3-3: Product: Disk 2.0
<6>[ 1865.844903] usb 3-3: Manufacturer: USB
<6>[ 1865.844906] usb 3-3: SerialNumber: (Serial: 9)
<6>[ 1865.847205] usb-storage 3-3:1.0: USB Mass Storage device detected
<6>[ 1865.847806] scsi host4: usb-storage 3-3:1.0
<3>[ 1865.853941] Buffer I/O error on dev sdd1, logical block 0, lost async page write
<6>[ 1866.436729] usb 3-3: USB disconnect, device number 112
<6>[ 1866.700998] usb 3-3: new high-speed USB device number 113 using xhci_hcd
<6>[ 1866.829449] usb 3-3: New USB device found, idVendor=346d, idProduct=5678, bcdDevice= 2.00
<6>[ 1866.829466] usb 3-3: New USB device strings: Mfr=1, Product=2, SerialNumber=(Serial: 8)
<6>[ 1866.829473] usb 3-3: Product: Disk 2.0
<6>[ 1866.829478] usb 3-3: Manufacturer: USB
<6>[ 1866.829482] usb 3-3: SerialNumber: (Serial: 9)
<6>[ 1866.831605] usb-storage 3-3:1.0: USB Mass Storage device detected
<6>[ 1866.832173] scsi host3: usb-storage 3-3:1.0
<5>[ 1867.866118] scsi 3:0:0:0: Direct-Access VendorCo ProductCode 2.00 PQ: 0 ANSI: 4
<5>[ 1867.868213] sd 3:0:0:0: [sdd] 122880000 512-byte logical blocks: (62.9 GB/58.6 GiB)
<5>[ 1867.868604] sd 3:0:0:0: [sdd] Write Protect is off
<7>[ 1867.868616] sd 3:0:0:0: [sdd] Mode Sense: 03 00 00 00
<4>[ 1867.869071] sd 3:0:0:0: [sdd] No Caching mode page found
<4>[ 1867.869081] sd 3:0:0:0: [sdd] Assuming drive cache: write through
<6>[ 1867.871429] sdd: sdd1
<5>[ 1867.871857] sd 3:0:0:0: [sdd] Attached SCSI removable disk
<6>[ 1868.423593] usb 3-3: USB disconnect, device number 113
<6>[ 1868.431172] sdd: detected capacity change from 122880000 to 0
<28>[ 1928.670962] udevd[203]: sdd: Worker [9839] processing SEQNUM=6508 is taking a long time
<3>[ 2004.633104] INFO: task kworker/0:3:187 blocked for more than 122 seconds.
<3>[ 2004.633125] Tainted: G U 6.6.41-03520-gd3d77f15f842 #1
<3>[ 2004.633131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[ 2004.633149] task:kworker/0:3 state:D stack:0 pid:187 ppid:2 flags:0x00004000
<6>[ 2004.633149] Workqueue: usb_hub_wq hub_event
<6>[ 2004.633166] Call Trace:
<6>[ 2004.633172] <TASK>
<6>[ 2004.633179] schedule+0x4f4/0x1540
<6>[ 2004.633190] ? default_wake_function+0x388/0xcd0
<6>[ 2004.633200] schedule_preempt_disabled+0x15/0x30
<6>[ 2004.633206] __mutex_lock_slowpath+0x2b5/0x4d0
<6>[ 2004.633212] del_gendisk+0x136/0x370
<6>[ 2004.633222] sd_remove+0x30/0x60
<6>[ 2004.633230] device_release_driver_internal+0x1a2/0x2a0
<6>[ 2004.633239] bus_remove_device+0x154/0x180
<6>[ 2004.633248] device_del+0x207/0x370
<6>[ 2004.633256] ? __pfx_transport_remove_classdev+0x10/0x10
<6>[ 2004.633264] ? attribute_container_device_trigger+0xe3/0x110
<6>[ 2004.633272] __scsi_remove_device+0xc0/0x170
<6>[ 2004.633279] scsi_forget_host+0x45/0x60
<6>[ 2004.633287] scsi_remove_host+0x87/0x170
<6>[ 2004.633295] usb_stor_disconnect+0x63/0xb0
<6>[ 2004.633302] usb_unbind_interface+0xbe/0x250
<6>[ 2004.633309] device_release_driver_internal+0x1a2/0x2a0
<6>[ 2004.633315] bus_remove_device+0x154/0x180
<6>[ 2004.633322] device_del+0x207/0x370
<6>[ 2004.633328] ? kobject_release+0x56/0xb0
<6>[ 2004.633336] usb_disable_device+0x72/0x170
<6>[ 2004.633342] usb_disconnect+0xeb/0x280
<6>[ 2004.633350] hub_event+0xac7/0x1760
<6>[ 2004.633359] worker_thread+0x355/0x900
<6>[ 2004.633367] kthread+0xed/0x110
<6>[ 2004.633374] ? __pfx_worker_thread+0x10/0x10
<6>[ 2004.633381] ? __pfx_kthread+0x10/0x10
<6>[ 2004.633387] ret_from_fork+0x38/0x50
<6>[ 2004.633393] ? __pfx_kthread+0x10/0x10
<6>[ 2004.633399] ret_from_fork_asm+0x1b/0x30
<6>[ 2004.633407] </TASK>
<3>[ 2004.633496] INFO: task cros-disks:1614 blocked for more than 122 seconds.
<3>[ 2004.633502] Tainted: G U 6.6.41-03520-gd3d77f15f842 #1
<3>[ 2004.633506] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[ 2004.633519] task:cros-disks state:D stack:0 pid:1614 ppid:1 flags:0x00004002
<6>[ 2004.633519] Call Trace:
<6>[ 2004.633523] <TASK>
<6>[ 2004.633527] schedule+0x4f4/0x1540
<6>[ 2004.633533] ? xas_store+0xc57/0xcc0
<6>[ 2004.633539] ? lru_add_drain+0x4d8/0x6e0
<6>[ 2004.633548] blk_queue_enter+0x172/0x250
<6>[ 2004.633557] ? __pfx_autoremove_wake_function+0x10/0x10
<6>[ 2004.633565] blk_mq_alloc_request+0x167/0x210
<6>[ 2004.633573] scsi_execute_cmd+0x65/0x240
<6>[ 2004.633580] ioctl_internal_command+0x6c/0x150
<6>[ 2004.633590] scsi_set_medium_removal+0x63/0xc0
<6>[ 2004.633598] sd_release+0x42/0x50
<6>[ 2004.633606] blkdev_put+0x13b/0x1f0
<6>[ 2004.633615] blkdev_release+0x2b/0x40
<6>[ 2004.633623] __fput_sync+0x9b/0x2c0
<6>[ 2004.633632] __se_sys_close+0x69/0xc0
<6>[ 2004.633639] do_syscall_64+0x60/0x90
<6>[ 2004.633649] ? exit_to_user_mode_prepare+0x49/0x130
<6>[ 2004.633657] ? do_syscall_64+0x6f/0x90
<6>[ 2004.633665] ? do_syscall_64+0x6f/0x90
<6>[ 2004.633672] ? do_syscall_64+0x6f/0x90
<6>[ 2004.633680] ? irq_exit_rcu+0x38/0x90
<6>[ 2004.633687] ? exit_to_user_mode_prepare+0x49/0x130
<6>[ 2004.633694] entry_SYSCALL_64_after_hwframe+0x73/0xdd
<6>[ 2004.633703] RIP: 0033:0x786d55239960
<6>[ 2004.633711] RSP: 002b:00007ffd1c6d8c28 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
<6>[ 2004.633719] RAX: ffffffffffffffda RBX: 00005a5ffe743fd0 RCX: 0000786d55239960
<6>[ 2004.633725] RDX: 0000786d55307b00 RSI: 0000000000000000 RDI: 000000000000000c
<6>[ 2004.633730] RBP: 00007ffd1c6d8d30 R08: 0000000000000007 R09: 00005a5ffe78a9f0
<6>[ 2004.633735] R10: 8a1ecef621fff8a0 R11: 0000000000000202 R12: 0000000000000831
<6>[ 2004.633741] R13: 00005a5ffe743f60 R14: 00005a5ffe743f80 R15: 000000000000000c
<6>[ 2004.633746] </TASK>

Patch

diff --git a/block/genhd.c b/block/genhd.c
index 1c05dd4c6980b5..7026569fa8a0be 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -589,8 +589,16 @@  static void __blk_mark_disk_dead(struct gendisk *disk)
 	if (test_and_set_bit(GD_DEAD, &disk->state))
 		return;
 
-	if (test_bit(GD_OWNS_QUEUE, &disk->state))
-		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
+	/*
+	 * Also mark the disk dead if it is not owned by the gendisk.  This
+	 * means we can't allow /dev/sg passthrough or SCSI internal commands
+	 * while unbinding a ULP.  That is more than just a bit ugly, but until
+	 * we untangle q_usage_counter into one owned by the disk and one owned
+	 * by the queue this is as good as it gets.  The flag will be cleared
+	 * at the end of del_gendisk if it wasn't set before.
+	 */
+	if (!test_and_set_bit(QUEUE_FLAG_DYING, &disk->queue->queue_flags))
+		set_bit(QUEUE_FLAG_RESURRECT, &disk->queue->queue_flags);
 
 	/*
 	 * Stop buffered writers from dirtying pages that can't be written out.
@@ -719,6 +727,10 @@  void del_gendisk(struct gendisk *disk)
 	 * again.  Else leave the queue frozen to fail all I/O.
 	 */
 	if (!test_bit(GD_OWNS_QUEUE, &disk->state)) {
+		if (test_bit(QUEUE_FLAG_RESURRECT, &q->queue_flags)) {
+			clear_bit(QUEUE_FLAG_DYING, &q->queue_flags);
+			clear_bit(QUEUE_FLAG_RESURRECT, &q->queue_flags);
+		}
 		blk_queue_flag_clear(QUEUE_FLAG_INIT_DONE, q);
 		__blk_mq_unfreeze_queue(q, true);
 	} else {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 50c3b959da2816..391e3eb3bb5e61 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -590,6 +590,7 @@  struct request_queue {
 /* Keep blk_queue_flag_name[] in sync with the definitions below */
 enum {
 	QUEUE_FLAG_DYING,		/* queue being torn down */
+	QUEUE_FLAG_RESURRECT,		/* temporarily dying */
 	QUEUE_FLAG_NOMERGES,		/* disable merge attempts */
 	QUEUE_FLAG_SAME_COMP,		/* complete on same CPU-group */
 	QUEUE_FLAG_FAIL_IO,		/* fake timeout */