diff mbox series

SCSI low level driver: how to prevent I/O upon hibernation?

Message ID HK0P153MB0273A1B109CE3A33F0984D34BFDE0@HK0P153MB0273.APCP153.PROD.OUTLOOK.COM (mailing list archive)
State Not Applicable
Headers show
Series SCSI low level driver: how to prevent I/O upon hibernation? | expand

Commit Message

Dexuan Cui April 10, 2020, 5:44 a.m. UTC
Hi all,
Can you please recommend the standard way to prevent the upper layer SCSI
driver from submitting new I/O requests when the system is doing hibernation?

Actually I already asked the question on 5/30 last year:
https://marc.info/?l=linux-scsi&m=155918927116283&w=2
and I thought all the sdevs are suspended and resumed automatically in
drivers/scsi/scsi_pm.c, and the low level SCSI adapter driver (i.e. hv_storvsc)
only needs to suspend/resume the state of the adapter itself. However, it looks
this is not true, because today I got such a panic in a v5.6 Linux VM running on
Hyper-V: the 'suspend' part of the hibernation process finished without any
issue, but when the VM was trying to resume back from the 'new' kernel to the
'old' kernel, these events happened:

1. the new kernel loaded the saved state from disk to memory.

2. the new kernel quiesced the devices, including the SCSI DVD device
controlled by the hv_storvsc low level SCSI driver, i.e.
drivers/scsi/storvsc_drv.c: storvsc_suspend() was called and the related vmbus
ringbuffer was freed.

3. However, disk_events_workfn() -> ... -> cdrom_check_events() -> ...
   -> scsi_queue_rq() -> ... -> storvsc_queuecommand() was still trying to
submit I/O commands to the freed vmbus ringbuffer, and as a result, a NULL
pointer dereference panic happened.

Can anyone please explain the symptom?

I made the below patch, with which it seems I can no longer reproduce the panic.

But I don't know how scsi_block_requests() can reliably prevent new I/O requests
from being issued concurrently on a different CPU -- the function only sets a
flag?

Looking forward to your insights!


This is the log of the panic:

[    8.565615] PM: Adding info for No Bus:vcsa63
[    8.590118] Freezing user space processes ... (elapsed 0.020 seconds) done.
[    8.619143] OOM killer disabled.
[    8.645914] PM: Using 3 thread(s) for decompression
[    8.650805] PM: Loading and decompressing image data (307144 pages)...
[    8.693765] PM: Image loading progress:   0%
[    9.286720] PM: Image loading progress:  10%
[    9.541665] PM: Image loading progress:  20%
[    9.777528] PM: Image loading progress:  30%
[   10.062504] PM: Image loading progress:  40%
[   10.317178] PM: Image loading progress:  50%
[   10.588564] PM: Image loading progress:  60%
[   10.796801] PM: Image loading progress:  70%
[   11.029323] PM: Image loading progress:  80%
[   11.327868] PM: Image loading progress:  90%
[   11.650745] PM: Image loading progress: 100%
[   11.655851] PM: Image loading done
[   11.659596] PM: hibernation: Read 1228576 kbytes in 2.99 seconds (410.89 MB/s)
[   11.668702] input input1: type quiesce
[   11.668741] sr 0:0:0:1: bus quiesce
[   11.668804] sd 0:0:0:0: bus quiesce
[   11.672970] input input0: type quiesce
[   11.698082] scsi target0:0:0: bus quiesce
[   11.703296] scsi host0: bus quiesce
[   11.707448] alarmtimer alarmtimer.0.auto: bus quiesce
[   11.712782] rtc rtc0: class quiesce
[   11.716560] platform Fixed MDIO bus.0: bus quiesce
[   11.721911] serial8250 serial8250: bus quiesce
[   11.727220] simple-framebuffer simple-framebuffer.0: bus quiesce
[   11.734066] platform pcspkr: bus quiesce
[   11.738726] rtc_cmos 00:02: bus quiesce
[   11.743353] serial 00:01: bus quiesce
[   11.747433] serial 00:00: bus quiesce
[   11.751654] platform efivars.0: bus quiesce
[   11.756316] platform rtc-efi.0: bus quiesce
[   11.760883] platform HYPER_V_GEN_COUN:00: bus quiesce
[   11.766255] platform VMBUS:00: bus quiesce
[   11.770668] platform PNP0003:00: bus quiesce
[   11.781730] hv_storvsc bf78936f-7d8f-45ce-ab03-6c341452e55d: noirq bus quiesce
[   11.796479] hv_netvsc dda5a2be-b8b8-4237-b330-be8a516a72c0: noirq bus quiesce
[   11.804042] BUG: kernel NULL pointer dereference, address: 0000000000000090
[   11.804996] #PF: supervisor read access in kernel mode
[   11.804996] #PF: error_code(0x0000) - not-present page
[   11.804996] PGD 0 P4D 0
[   11.804996] Oops: 0000 [#1] SMP PTI
[   11.804996] CPU: 18 PID: 353 Comm: kworker/18:1 Not tainted 5.6.0+ #1
[   11.804996] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 05/16/2019
[   11.804996] Workqueue: events_freezable_power_ disk_events_workfn
[   11.804996] RIP: 0010:storvsc_queuecommand+0x261/0x714 [hv_storvsc]
[   11.804996] Code: ...
[   11.804996] RSP: 0018:ffffa331c2347af0 EFLAGS: 00010246
[   11.804996] RAX: 0000000000000000 RBX: ffff8e6a32cec5a0 RCX: 0000000000000000
[   11.804996] RDX: 0000000000000012 RSI: ffff8e6a32cec3e0 RDI: ffff8e6a32cec710
[   11.804996] RBP: ffff8e6b6d58c800 R08: 0000000000000010 R09: ffff8e6a32aa6060
[   11.804996] R10: ffffc8a8c4c81980 R11: 0000000000000000 R12: 0000000000000012
[   11.804996] R13: 0000000000000012 R14: ffff8e6a32cec710 R15: ffff8e6a32cec3d8
[   11.804996] FS:  0000000000000000(0000) GS:ffff8e6a42c80000(0000) knlGS:0000000000000000
[   11.804996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.804996] CR2: 0000000000000090 CR3: 000000013a428006 CR4: 00000000003606e0
[   11.804996] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   11.804996] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   11.804996] Call Trace:
[   11.804996]  scsi_queue_rq+0x593/0xa10
[   11.804996]  blk_mq_dispatch_rq_list+0x8d/0x510
[   11.804996]  blk_mq_sched_dispatch_requests+0xed/0x170
[   11.804996]  __blk_mq_run_hw_queue+0x55/0x110
[   11.804996]  __blk_mq_delay_run_hw_queue+0x141/0x160
[   11.804996]  blk_mq_sched_insert_request+0xc3/0x170
[   11.804996]  blk_execute_rq+0x4b/0xa0
[   11.804996]  __scsi_execute+0xeb/0x250
[   11.804996]  sr_check_events+0x9f/0x270 [sr_mod]
[   11.804996]  cdrom_check_events+0x1a/0x30 [cdrom]
[   11.804996]  sr_block_check_events+0xcc/0x110 [sr_mod]
[   11.804996]  disk_check_events+0x68/0x160
[   11.804996]  process_one_work+0x20c/0x3d0
[   11.804996]  worker_thread+0x2d/0x3e0
[   11.804996]  kthread+0x10c/0x130
[   11.804996]  ret_from_fork+0x35/0x40

Thanks,
-- Dexuan

Comments

Ming Lei April 10, 2020, 7:42 a.m. UTC | #1
Hello Dexuan,

On Fri, Apr 10, 2020 at 1:44 PM Dexuan Cui <decui@microsoft.com> wrote:
>
> Hi all,
> Can you please recommend the standard way to prevent the upper layer SCSI
> driver from submitting new I/O requests when the system is doing hibernation?
>
> Actually I already asked the question on 5/30 last year:
> https://marc.info/?l=linux-scsi&m=155918927116283&w=2
> and I thought all the sdevs are suspended and resumed automatically in
> drivers/scsi/scsi_pm.c, and the low level SCSI adapter driver (i.e. hv_storvsc)
> only needs to suspend/resume the state of the adapter itself. However, it looks
> this is not true, because today I got such a panic in a v5.6 Linux VM running on
> Hyper-V: the 'suspend' part of the hibernation process finished without any
> issue, but when the VM was trying to resume back from the 'new' kernel to the
> 'old' kernel, these events happened:
>
> 1. the new kernel loaded the saved state from disk to memory.
>
> 2. the new kernel quiesced the devices, including the SCSI DVD device
> controlled by the hv_storvsc low level SCSI driver, i.e.
> drivers/scsi/storvsc_drv.c: storvsc_suspend() was called and the related vmbus
> ringbuffer was freed.
>
> 3. However, disk_events_workfn() -> ... -> cdrom_check_events() -> ...
>    -> scsi_queue_rq() -> ... -> storvsc_queuecommand() was still trying to
> submit I/O commands to the freed vmbus ringbuffer, and as a result, a NULL
> pointer dereference panic happened.

Last time I replied to you in above link:

"scsi_device_quiesce() has been called by scsi_dev_type_suspend() to prevent
any non-pm request from entering queue."

That meant no any normal FS request can enter scsi queue after suspend, however
request with BLK_MQ_REQ_PREEMPT is still allowed to be queued to LLD
after suspend.

So you can't free related vmbus ringbuffer cause  BLK_MQ_REQ_PREEMPT request
is still to be handled.

Thanks,
Ming Lei
Dexuan Cui April 11, 2020, 6:01 a.m. UTC | #2
> From: Ming Lei <tom.leiming@gmail.com>
> Sent: Friday, April 10, 2020 12:43 AM
> To: Dexuan Cui <decui@microsoft.com>
> 
> Hello Dexuan,
> 
> On Fri, Apr 10, 2020 at 1:44 PM Dexuan Cui <decui@microsoft.com> wrote:
> >
> > Hi all,
> > Can you please recommend the standard way to prevent the upper layer SCSI
> > driver from submitting new I/O requests when the system is doing
> > hibernation?
> >
> > Actually I already asked the question on 5/30 last year ...
> > and I thought all the sdevs are suspended and resumed automatically in
> > drivers/scsi/scsi_pm.c, and the low level SCSI adapter driver (i.e. hv_storvsc)
> > only needs to suspend/resume the state of the adapter itself. However, it
> > looks
> > this is not true, because today I got such a panic in a v5.6 Linux VM running
> > on Hyper-V: the 'suspend' part of the hibernation process finished without 
> > any issue, but when the VM was trying to resume back from the 'new' 
> > kernel to the 'old' kernel, these events happened:
> >
> > 1. the new kernel loaded the saved state from disk to memory.
> >
> > 2. the new kernel quiesced the devices, including the SCSI DVD device
> > controlled by the hv_storvsc low level SCSI driver, i.e.
> > drivers/scsi/storvsc_drv.c: storvsc_suspend() was called and the related
> > vmbus ringbuffer was freed.
> >
> > 3. However, disk_events_workfn() -> ... -> cdrom_check_events() -> ...
> >    -> scsi_queue_rq() -> ... -> storvsc_queuecommand() was still trying to
> > submit I/O commands to the freed vmbus ringbuffer, and as a result, a NULL
> > pointer dereference panic happened.
> 
> Last time I replied to you in above link:
> 
> "scsi_device_quiesce() has been called by scsi_dev_type_suspend() to prevent
> any non-pm request from entering queue."
> 
> That meant no any normal FS request can enter scsi queue after suspend,
> however request with BLK_MQ_REQ_PREEMPT is still allowed to be queued
> to LLD after suspend.
> 
> So you can't free related vmbus ringbuffer cause  BLK_MQ_REQ_PREEMPT
> request is still to be handled.
> 
> Thanks,
> Ming Lei

Actually I think I found the APIs with the help of Long Li:
scsi_host_block and scsi_host_unblock(). The new APIs were just added
on 2/28. :-)

Unluckily scsi_host_block() doesn't allow a state transition from 
SDEV_QUIESCE to SDEV_BLOCK and returns -EINVAL for that, so I made the
below patch and it looks hibernation can work reliably for me now.

Please let me know if the change to scsi_device_set_state() is OK.

If the patch looks good, I plan to split and post it sometime next week.

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 47835c4b4ee0..06c260f6cdae 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2284,6 +2284,7 @@ scsi_device_set_state(struct scsi_device *sdev, enum scsi_device_state state)
                switch (oldstate) {
                case SDEV_RUNNING:
                case SDEV_CREATED_BLOCK:
+               case SDEV_QUIESCE:
                case SDEV_OFFLINE:
                        break;
                default:
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index fb41636519ee..fd51d2f03778 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1948,6 +1948,11 @@ static int storvsc_suspend(struct hv_device *hv_dev)
        struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
        struct Scsi_Host *host = stor_device->host;
        struct hv_host_device *host_dev = shost_priv(host);
+       int ret;
+
+       ret = scsi_host_block(host);
+       if (ret)
+               return ret;

        storvsc_wait_to_drain(stor_device);

@@ -1968,10 +1973,15 @@ static int storvsc_suspend(struct hv_device *hv_dev)

 static int storvsc_resume(struct hv_device *hv_dev)
 {
+       struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
+       struct Scsi_Host *host = stor_device->host;
        int ret;

        ret = storvsc_connect_to_vsp(hv_dev, storvsc_ringbuffer_size,
                                     hv_dev_is_fc(hv_dev));
+       if (!ret)
+               ret = scsi_host_unblock(host, SDEV_RUNNING);
+
        return ret;
 }

Thanks,
-- Dexuan
Bart Van Assche April 11, 2020, 3:03 p.m. UTC | #3
On 2020-04-10 23:01, Dexuan Cui wrote:
> Please let me know if the change to scsi_device_set_state() is OK.

Hadn't Ming Lei already root-caused this issue for you? From his e-mail:
"So you can't free related vmbus ringbuffer cause BLK_MQ_REQ_PREEMPT
request is still to be handled."

Please follow that advice.

Thanks,

Bart.
Dexuan Cui April 11, 2020, 5:20 p.m. UTC | #4
> From: Bart Van Assche <bvanassche@acm.org>
> Sent: Saturday, April 11, 2020 8:03 AM
> 
> On 2020-04-10 23:01, Dexuan Cui wrote:
> > Please let me know if the change to scsi_device_set_state() is OK.
> 
> Hadn't Ming Lei already root-caused this issue for you? From his e-mail:
> "So you can't free related vmbus ringbuffer cause BLK_MQ_REQ_PREEMPT
> request is still to be handled."
> 
> Please follow that advice.
> 
> Bart.

Hi Bart, Ming,
I agree Ming has root-caused the issue, but it looks the advice can not
apply to the hibernation scenario. :-)

Sorry for my lack of knowledge of the complex SCSI subsystems -- could
you please elaborate on what a low level SCSI device driver (like hv_storvsc)
should do to safely save/restore the device state upon hibernation?

The nature of "free related vmbus ringbuffer" in hv_storvsc is that: the 
driver can not handle any I/O after the device is quiesced in 
software_resume() -> load_image_and_restore() -> hibernation_restore()
-> dpm_suspend_start() -> ... -> storvsc_suspend(). BTW, after the SCSI
device is quiesced, the hibernation's resume path also quiesces other 
devices, disables non-boot CPUs, and finally jumps to the old kernel's
entry point where the old kernel was suspended, and the old kernel will
resume back.

My intuition is that the upper level SCSI layer should provide an API to
flush any pending I/O and block any new I/O after a SCSI device is
"quiesced"? -- it looks scsi_host_block()/scsi_host_unblock() are such 
APIs, which are already used by 
drivers/scsi/aacraid/linit.c: aac_suspend()/aac_resume().

That's why I proposed the patch of the same thing for hv_storvsc, and
it looks the patch works for me: without the patch I can easily hit the
panic I reported in the first email; with the patch, I have successfully
done more than 30 rounds of hibernation without the panic.

However, it looks you implied my intuition is wrong and it's *expected*
that the upper level SCSI layer can still submit I/O requests with the 
BLK_MQ_REQ_PREEMPT flag after the SCSI device is "quiesced"?

If this is the case, then how is hv_storvsc supposed to handle the I/O
after the SCSI device is quiesced? I can keep the related vmbus ringbuffer,
but the real issue is: the driver is unable to handle any I/O at all since the
vmbus connection to the Hyper-V host is disconnected soon, after
the SCSI device is quiesced. Should hv_storvsc return an error for such I/O,
or block such I/O until the SCSI device is resumed? -- These don't look
good to me, and I really think the upper level SCSI layer should provide
an API to block any new I/O after a SCSI device is "quiesced" -- again, can
you please clarify if scsi_host_block()/scsi_host_unblock() are such APIs?

Looking forward to your replies!

Thanks,
-- Dexuan
Dexuan Cui April 14, 2020, 5:09 p.m. UTC | #5
> From: linux-hyperv-owner@vger.kernel.org
> <linux-hyperv-owner@vger.kernel.org> On Behalf Of Dexuan Cui
> Sent: Saturday, April 11, 2020 10:20 AM
> To: Bart Van Assche <bvanassche@acm.org>; Ming Lei
> <tom.leiming@gmail.com>; Martin K. Petersen <martin.petersen@oracle.com>
> Cc: linux-scsi@vger.kernel.org; Christoph Hellwig <hch@lst.de>;
> linux-hyperv@vger.kernel.org; Long Li <longli@microsoft.com>; vkuznets
> <vkuznets@redhat.com>; Michael Kelley <mikelley@microsoft.com>; KY
> Srinivasan <kys@microsoft.com>; Olaf Hering <olaf@aepfle.de>; Stephen
> Hemminger <sthemmin@microsoft.com>; linux-kernel@vger.kernel.org
> Subject: RE: SCSI low level driver: how to prevent I/O upon hibernation?
> 
> > From: Bart Van Assche <bvanassche@acm.org>
> > Sent: Saturday, April 11, 2020 8:03 AM
> >
> > On 2020-04-10 23:01, Dexuan Cui wrote:
> > > Please let me know if the change to scsi_device_set_state() is OK.
> >
> > Hadn't Ming Lei already root-caused this issue for you? From his e-mail:
> > "So you can't free related vmbus ringbuffer cause BLK_MQ_REQ_PREEMPT
> > request is still to be handled."
> >
> > Please follow that advice.
> >
> > Bart.
> 
> Hi Bart, Ming,
> I agree Ming has root-caused the issue, but it looks the advice can not
> apply to the hibernation scenario. :-)
> 
> Sorry for my lack of knowledge of the complex SCSI subsystems -- could
> you please elaborate on what a low level SCSI device driver (like hv_storvsc)
> should do to safely save/restore the device state upon hibernation?
> 
> The nature of "free related vmbus ringbuffer" in hv_storvsc is that: the
> driver can not handle any I/O after the device is quiesced in
> software_resume() -> load_image_and_restore() -> hibernation_restore()
> -> dpm_suspend_start() -> ... -> storvsc_suspend(). BTW, after the SCSI
> device is quiesced, the hibernation's resume path also quiesces other
> devices, disables non-boot CPUs, and finally jumps to the old kernel's
> entry point where the old kernel was suspended, and the old kernel will
> resume back.
> 
> My intuition is that the upper level SCSI layer should provide an API to
> flush any pending I/O and block any new I/O after a SCSI device is
> "quiesced"? -- it looks scsi_host_block()/scsi_host_unblock() are such
> APIs, which are already used by
> drivers/scsi/aacraid/linit.c: aac_suspend()/aac_resume().
> 
> That's why I proposed the patch of the same thing for hv_storvsc, and
> it looks the patch works for me: without the patch I can easily hit the
> panic I reported in the first email; with the patch, I have successfully
> done more than 30 rounds of hibernation without the panic.
> 
> However, it looks you implied my intuition is wrong and it's *expected*
> that the upper level SCSI layer can still submit I/O requests with the
> BLK_MQ_REQ_PREEMPT flag after the SCSI device is "quiesced"?
> 
> If this is the case, then how is hv_storvsc supposed to handle the I/O
> after the SCSI device is quiesced? I can keep the related vmbus ringbuffer,
> but the real issue is: the driver is unable to handle any I/O at all since the
> vmbus connection to the Hyper-V host is disconnected soon, after
> the SCSI device is quiesced. Should hv_storvsc return an error for such I/O,
> or block such I/O until the SCSI device is resumed? -- These don't look
> good to me, and I really think the upper level SCSI layer should provide
> an API to block any new I/O after a SCSI device is "quiesced" -- again, can
> you please clarify if scsi_host_block()/scsi_host_unblock() are such APIs?
> 
> Looking forward to your replies!
> 
> Thanks,
> -- Dexuan

It looks scsi_host_block() and scsi_host_unblock() are just the APIs
I need. I have run my hibernation test with the APIs more than 1000 rounds
and the VM is still running fine without the panic I reported previously when
I didn't use the APIs.

FYI: the aacraid driver is the first user of the APIs:

commit 3d3ca53b163914c1397289d0c2ee6d2f52362dcc
Author: Hannes Reinecke <hare@suse.de>
Date:   Fri Feb 28 08:53:14 2020 +0100

    scsi: aacraid: use scsi_host_(block,unblock) to block I/O

    Use scsi_host_block() and scsi_host_unblock() instead of
    scsi_block_requests()/scsi_unblock_requests() to block and unblock I/O.
    This has the advantage that the block layer will stop sending I/O to the
    adapter instead of having the SCSI midlayer requeueing I/O internally.

    Link: https://lore.kernel.org/r/20200228075318.91255-10-hare@suse.de
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Balsundar P < Balsundar.P@microchip.com>
    Signed-off-by: Hannes Reinecke <hare@suse.de>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

BTW, I suspect aac_suspend() -> scsi_host_block() is not relly working
because scsi_host_block() doesn't allow a state transition from 
SDEV_QUIESCE to SDEV_BLOCK and returns -EINVAL for that. aac_suspend()
should check the return value of scsi_host_block(), and my change to
scsi_device_set_state() should be needed to avoid the -EINVAL error.

If there is no objection, I plan to send some patches later.

Thanks,
-- Dexuan
diff mbox series

Patch

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index fb41636519ee..dcfb0a820977 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1949,6 +1949,8 @@  static int storvsc_suspend(struct hv_device *hv_dev)
        struct Scsi_Host *host = stor_device->host;
        struct hv_host_device *host_dev = shost_priv(host);

+       scsi_block_requests(host);
+
        storvsc_wait_to_drain(stor_device);

        drain_workqueue(host_dev->handle_error_wq);
@@ -1968,10 +1970,14 @@  static int storvsc_suspend(struct hv_device *hv_dev)

 static int storvsc_resume(struct hv_device *hv_dev)
 {
+       struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
+       struct Scsi_Host *host = stor_device->host;
        int ret;

        ret = storvsc_connect_to_vsp(hv_dev, storvsc_ringbuffer_size,
                                     hv_dev_is_fc(hv_dev));
+       if (!ret)
+               scsi_unblock_requests(host);
        return ret;
 }