Message ID | 20170926172235.29530-1-lduncan@suse.com (mailing list archive) |
---|---|
State | Changes Requested |
Headers | show |
On Tue, 2017-09-26 at 10:22 -0700, Lee Duncan wrote: > The SCSI ioctl reset path is smart enough to set the > flag tmf_in_progress when a user-requested reset is > processed, but it does not wait for IO that is in > flight. This can result in lost IOs and hung > processes. We should wait for a reasonable amount > of time for either the IOs to complete or to fail > the request. Hello Lee, I'm using this functionality all the time to test how SCSI target code handles TMFs while SCSI commands are in progress. So I would regret if the SCSI reset ioctl code would be modified such that it waits for outstanding requests. Isn't the behavior you described a SCSI LLD bug? Shouldn't such bugs be fixed instead of implementing a work-around in the SCSI core? Thanks, Bart.
On Tue, 2017-09-26 at 10:22 -0700, Lee Duncan wrote: > The SCSI ioctl reset path is smart enough to set the > flag tmf_in_progress when a user-requested reset is > processed, but it does not wait for IO that is in > flight. This can result in lost IOs and hung > processes. We should wait for a reasonable amount > of time for either the IOs to complete or to fail > the request. The idea of the SCSI ioctl was to be async: issue reset and return. Do we have nothing in userspace that expects async behaviour? (if we have, you could still add a flag or a new ioctl). However: > Signed-off-by: Lee Duncan <lduncan@suse.com> > --- > drivers/scsi/scsi_error.c | 26 ++++++++++++++++++++++++++ > 1 file changed, 26 insertions(+) > > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c > index 38942050b265..b964152611c3 100644 > --- a/drivers/scsi/scsi_error.c > +++ b/drivers/scsi/scsi_error.c > @@ -57,6 +57,14 @@ > #define BUS_RESET_SETTLE_TIME (10) > #define HOST_RESET_SETTLE_TIME (10) > > +/* > + * Time to wait for outstanding IOs when about to send > + * a device reset, e.g. sg_reset. The msecs to wait must > + * be an multiple of the msecs to wait per try. > + */ > +#define MSECS_PER_TRY_FOR_IO_ON_RESET 500 > +#define MSECS_TO_WAIT_FOR_IO_ON_RESET (MSECS_PER_TRY_FOR_IO_O > N_RESET * 10) > + > static int scsi_eh_try_stu(struct scsi_cmnd *scmd); > static int scsi_try_to_abort_cmd(struct scsi_host_template *, > struct scsi_cmnd *); > @@ -2269,6 +2277,7 @@ void scsi_report_device_reset(struct Scsi_Host > *shost, int channel, int target) > struct request *rq; > unsigned long flags; > int error = 0, rtn, val; > + unsigned int msecs_to_wait = MSECS_TO_WAIT_FOR_IO_ON_RESET; scsi_report_device_reset is the wrong place to do the wait: that's used by low level device drivers to report that they've detected a bus reset and is expected to return immediately. James
On 09/26/2017 08:24 PM, Bart Van Assche wrote: > On Tue, 2017-09-26 at 10:22 -0700, Lee Duncan wrote: >> The SCSI ioctl reset path is smart enough to set the >> flag tmf_in_progress when a user-requested reset is >> processed, but it does not wait for IO that is in >> flight. This can result in lost IOs and hung >> processes. We should wait for a reasonable amount >> of time for either the IOs to complete or to fail >> the request. > > Hello Lee, > > I'm using this functionality all the time to test how SCSI target code handles > TMFs while SCSI commands are in progress. So I would regret if the SCSI reset > ioctl code would be modified such that it waits for outstanding requests. > Isn't the behavior you described a SCSI LLD bug? Shouldn't such bugs be fixed > instead of implementing a work-around in the SCSI core? > Well, thing is that there is an asymmetry here; originally all SCSI EH functions were supposed to run with no I/O in flight. (I've modified that with the asynchronous ABORT TASK TMF, but still). But when called with sg_reset this is no longer true, we're disallowing new requests, but do not wait for the in-flight I/O to complete. And we've had a customer report where calling sg_reset -t on an iSCSI device caused I/O to become stuck as the in-flight I/O was terminated by the target reset, but the iSCSI stack never sent a completion for that I/O. However, we could also defer this problem until my SCSI EH rework goes in; that clears up the sg_reset path and might clarify things a bit. Cheers, Hannes
On 09/26/2017 11:58 PM, Hannes Reinecke wrote: > On 09/26/2017 08:24 PM, Bart Van Assche wrote: >> On Tue, 2017-09-26 at 10:22 -0700, Lee Duncan wrote: >>> The SCSI ioctl reset path is smart enough to set the >>> flag tmf_in_progress when a user-requested reset is >>> processed, but it does not wait for IO that is in >>> flight. This can result in lost IOs and hung >>> processes. We should wait for a reasonable amount >>> of time for either the IOs to complete or to fail >>> the request. >> >> Hello Lee, >> >> I'm using this functionality all the time to test how SCSI target code handles >> TMFs while SCSI commands are in progress. So I would regret if the SCSI reset >> ioctl code would be modified such that it waits for outstanding requests. >> Isn't the behavior you described a SCSI LLD bug? Shouldn't such bugs be fixed >> instead of implementing a work-around in the SCSI core? >> > Well, thing is that there is an asymmetry here; originally all SCSI EH > functions were supposed to run with no I/O in flight. > (I've modified that with the asynchronous ABORT TASK TMF, but still). > But when called with sg_reset this is no longer true, we're disallowing > new requests, but do not wait for the in-flight I/O to complete. > And we've had a customer report where calling sg_reset -t on an iSCSI > device caused I/O to become stuck as the in-flight I/O was terminated by > the target reset, but the iSCSI stack never sent a completion for that I/O. > > However, we could also defer this problem until my SCSI EH rework goes > in; that clears up the sg_reset path and might clarify things a bit. > > Cheers, > > Hannes > I will wait and see if the problem still exists there, and address it if it does. Thank you.
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 38942050b265..b964152611c3 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -57,6 +57,14 @@ #define BUS_RESET_SETTLE_TIME (10) #define HOST_RESET_SETTLE_TIME (10) +/* + * Time to wait for outstanding IOs when about to send + * a device reset, e.g. sg_reset. The msecs to wait must + * be an multiple of the msecs to wait per try. + */ +#define MSECS_PER_TRY_FOR_IO_ON_RESET 500 +#define MSECS_TO_WAIT_FOR_IO_ON_RESET (MSECS_PER_TRY_FOR_IO_ON_RESET * 10) + static int scsi_eh_try_stu(struct scsi_cmnd *scmd); static int scsi_try_to_abort_cmd(struct scsi_host_template *, struct scsi_cmnd *); @@ -2269,6 +2277,7 @@ void scsi_report_device_reset(struct Scsi_Host *shost, int channel, int target) struct request *rq; unsigned long flags; int error = 0, rtn, val; + unsigned int msecs_to_wait = MSECS_TO_WAIT_FOR_IO_ON_RESET; if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO)) return -EACCES; @@ -2301,6 +2310,22 @@ void scsi_report_device_reset(struct Scsi_Host *shost, int channel, int target) spin_lock_irqsave(shost->host_lock, flags); shost->tmf_in_progress = 1; + + /* if any IOs in progress wait for them a while */ + while ((atomic_read(&shost->host_busy) > 0) && (msecs_to_wait > 0)) { + spin_unlock_irqrestore(shost->host_lock, flags); + msleep(MSECS_PER_TRY_FOR_IO_ON_RESET); + msecs_to_wait -= MSECS_PER_TRY_FOR_IO_ON_RESET; + spin_lock_irqsave(shost->host_lock, flags); + } + if (atomic_read(&shost->host_busy)) { + shost->tmf_in_progress = 0; + spin_unlock_irqrestore(shost->host_lock, flags); + SCSI_LOG_ERROR_RECOVERY(3, + printk("%s: device reset failed: outstanding IO\n", __func__)); + goto out_put_scmd_and_free; + } + spin_unlock_irqrestore(shost->host_lock, flags); switch (val & ~SG_SCSI_RESET_NO_ESCALATE) { @@ -2349,6 +2374,7 @@ void scsi_report_device_reset(struct Scsi_Host *shost, int channel, int target) wake_up(&shost->host_wait); scsi_run_host_queues(shost); +out_put_scmd_and_free: scsi_put_command(scmd); kfree(rq);
The SCSI ioctl reset path is smart enough to set the flag tmf_in_progress when a user-requested reset is processed, but it does not wait for IO that is in flight. This can result in lost IOs and hung processes. We should wait for a reasonable amount of time for either the IOs to complete or to fail the request. Signed-off-by: Lee Duncan <lduncan@suse.com> --- drivers/scsi/scsi_error.c | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+)