Message ID | CAMGffE=C8mr_G=X8quAO44_0+1u_YNHDULwpBEJzSJnU4A4q9Q@mail.gmail.com (mailing list archive) |
---|---|
State | Changes Requested, archived |
Headers | show |
On 04/29/2016 02:49 PM, Jinpu Wang wrote: > Hi, all > > We hit IO error on fsync, it turns out was because sd treat succeeded > SYNC as error. From what I checked in SBC spec there is no indication > we should fail IO in this case, so we create this patch. > > > Best Regards, > > Jack Wang > > v2: > No change on patch itself, only resend in body as suggested by Bart, > still keep the attachment in case mail client break the format. > > From 5d1f72d9643ce61cd9f3d312377378c43f171d0c Mon Sep 17 00:00:00 2001 > From: Jack Wang <jinpu.wang@profitbricks.com> > Date: Mon, 25 Apr 2016 12:05:22 +0200 > Subject: [PATCH] sd: Don't treat succeeded SYNC as error > > We hit IO error in our production on multipath devices during resize > device on target side, the problem turns out sd driver passes up as IO > error when sense data is UNIT_ATTENTION and ASC && ASCQ indicate > Capacity data has changed, even storage side sync the data properly. > > In order to fix this check in sd_done, report success if condition > matches. > > Sebastian Parschauer report/analyze the bug here: > https://sourceforge.net/p/scst/mailman/message/34953416/ > > Signed-off-by: Sebastian Parschauer <s.parschauer@gmx.de> > Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> > --- > drivers/scsi/sd.c | 13 +++++++++++++ > 1 file changed, 13 insertions(+) > Well. Is there anything which guarantees us that 'capacity data has changed' will be the only sense code which we'll be seeing as a response to SYNCHRONIZE CACHE? I sincerely doubt so. So why don't you fall back to the default action (ie retry the command) whenever you hit an UNIT ATTENTION? This way we would cove any resulting sense code, _and_ would get rid of the rather ugly special case here. Cheers, Hannes
On Mon, 2016-05-02 at 12:05 +0200, Hannes Reinecke wrote: > On 04/29/2016 02:49 PM, Jinpu Wang wrote: > > Hi, all > > > > We hit IO error on fsync, it turns out was because sd treat > > succeeded > > SYNC as error. From what I checked in SBC spec there is no > > indication > > we should fail IO in this case, so we create this patch. > > > > > > Best Regards, > > > > Jack Wang > > > > v2: > > No change on patch itself, only resend in body as suggested by > > Bart, > > still keep the attachment in case mail client break the format. > > > > From 5d1f72d9643ce61cd9f3d312377378c43f171d0c Mon Sep 17 00:00:00 > > 2001 > > From: Jack Wang <jinpu.wang@profitbricks.com> > > Date: Mon, 25 Apr 2016 12:05:22 +0200 > > Subject: [PATCH] sd: Don't treat succeeded SYNC as error > > > > We hit IO error in our production on multipath devices during > > resize > > device on target side, the problem turns out sd driver passes up as > > IO > > error when sense data is UNIT_ATTENTION and ASC && ASCQ indicate > > Capacity data has changed, even storage side sync the data > > properly. > > > > In order to fix this check in sd_done, report success if condition > > matches. > > > > Sebastian Parschauer report/analyze the bug here: > > https://sourceforge.net/p/scst/mailman/message/34953416/ > > > > Signed-off-by: Sebastian Parschauer <s.parschauer@gmx.de> > > Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> > > --- > > drivers/scsi/sd.c | 13 +++++++++++++ > > 1 file changed, 13 insertions(+) > > > Well. > Is there anything which guarantees us that 'capacity data has > changed' will be the only sense code which we'll be seeing as a > response to SYNCHRONIZE CACHE? > I sincerely doubt so. > So why don't you fall back to the default action (ie retry the > command) whenever you hit an UNIT ATTENTION? > This way we would cove any resulting sense code, _and_ would get rid > of the rather ugly special case here. Actually, why are we getting here at all? should we be eating this unit attention once we've reported it in scsi_check_sense()? I also don't quite understand why the normal retry mechanism in scsi_io_completion() (called after drv->done()) isn't handling this. We set retries on a flush command and we give sd_sync_cache three goes. Any one of those should also cause the CC/UA to be ignored. James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2016-05-02 at 06:44 -0700, James Bottomley wrote: > On Mon, 2016-05-02 at 12:05 +0200, Hannes Reinecke wrote: > > On 04/29/2016 02:49 PM, Jinpu Wang wrote: > > > Hi, all > > > > > > We hit IO error on fsync, it turns out was because sd treat > > > succeeded > > > SYNC as error. From what I checked in SBC spec there is no > > > indication > > > we should fail IO in this case, so we create this patch. > > > > > > > > > Best Regards, > > > > > > Jack Wang > > > > > > v2: > > > No change on patch itself, only resend in body as suggested by > > > Bart, > > > still keep the attachment in case mail client break the format. > > > > > > From 5d1f72d9643ce61cd9f3d312377378c43f171d0c Mon Sep 17 00:00:00 > > > 2001 > > > From: Jack Wang <jinpu.wang@profitbricks.com> > > > Date: Mon, 25 Apr 2016 12:05:22 +0200 > > > Subject: [PATCH] sd: Don't treat succeeded SYNC as error > > > > > > We hit IO error in our production on multipath devices during > > > resize > > > device on target side, the problem turns out sd driver passes up > > > as > > > IO > > > error when sense data is UNIT_ATTENTION and ASC && ASCQ indicate > > > Capacity data has changed, even storage side sync the data > > > properly. > > > > > > In order to fix this check in sd_done, report success if > > > condition > > > matches. > > > > > > Sebastian Parschauer report/analyze the bug here: > > > https://sourceforge.net/p/scst/mailman/message/34953416/ > > > > > > Signed-off-by: Sebastian Parschauer <s.parschauer@gmx.de> > > > Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> > > > --- > > > drivers/scsi/sd.c | 13 +++++++++++++ > > > 1 file changed, 13 insertions(+) > > > > > Well. > > Is there anything which guarantees us that 'capacity data has > > changed' will be the only sense code which we'll be seeing as a > > response to SYNCHRONIZE CACHE? > > I sincerely doubt so. > > So why don't you fall back to the default action (ie retry the > > command) whenever you hit an UNIT ATTENTION? > > This way we would cove any resulting sense code, _and_ would get > > rid > > of the rather ugly special case here. > > Actually, why are we getting here at all? should we be eating this > unit attention once we've reported it in scsi_check_sense()? > > I also don't quite understand why the normal retry mechanism in > scsi_io_completion() (called after drv->done()) isn't handling this. > We set retries on a flush command and we give sd_sync_cache three > goes. Any one of those should also cause the CC/UA to be ignored. Actually, there's another problem with this patch: you're clearing the error and indicating success, meaning you never retry. CC/UA for notifications is usually signalled in the device before acting on the command, so the chances are your SYNC request was never executed and if you never retry we'll be operating in a cache unstaged situation where we're assuming data will be on the medium. James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, May 2, 2016 at 3:44 PM, James Bottomley <jejb@linux.vnet.ibm.com> wrote: > On Mon, 2016-05-02 at 12:05 +0200, Hannes Reinecke wrote: >> On 04/29/2016 02:49 PM, Jinpu Wang wrote: >> > Hi, all >> > >> > We hit IO error on fsync, it turns out was because sd treat >> > succeeded >> > SYNC as error. From what I checked in SBC spec there is no >> > indication >> > we should fail IO in this case, so we create this patch. >> > >> > >> > Best Regards, >> > >> > Jack Wang >> > >> > v2: >> > No change on patch itself, only resend in body as suggested by >> > Bart, >> > still keep the attachment in case mail client break the format. >> > >> > From 5d1f72d9643ce61cd9f3d312377378c43f171d0c Mon Sep 17 00:00:00 >> > 2001 >> > From: Jack Wang <jinpu.wang@profitbricks.com> >> > Date: Mon, 25 Apr 2016 12:05:22 +0200 >> > Subject: [PATCH] sd: Don't treat succeeded SYNC as error >> > >> > We hit IO error in our production on multipath devices during >> > resize >> > device on target side, the problem turns out sd driver passes up as >> > IO >> > error when sense data is UNIT_ATTENTION and ASC && ASCQ indicate >> > Capacity data has changed, even storage side sync the data >> > properly. >> > >> > In order to fix this check in sd_done, report success if condition >> > matches. >> > >> > Sebastian Parschauer report/analyze the bug here: >> > https://sourceforge.net/p/scst/mailman/message/34953416/ >> > >> > Signed-off-by: Sebastian Parschauer <s.parschauer@gmx.de> >> > Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> >> > --- >> > drivers/scsi/sd.c | 13 +++++++++++++ >> > 1 file changed, 13 insertions(+) >> > >> Well. >> Is there anything which guarantees us that 'capacity data has >> changed' will be the only sense code which we'll be seeing as a >> response to SYNCHRONIZE CACHE? >> I sincerely doubt so. >> So why don't you fall back to the default action (ie retry the >> command) whenever you hit an UNIT ATTENTION? >> This way we would cove any resulting sense code, _and_ would get rid >> of the rather ugly special case here. > > Actually, why are we getting here at all? should we be eating this > unit attention once we've reported it in scsi_check_sense()? > > I also don't quite understand why the normal retry mechanism in > scsi_io_completion() (called after drv->done()) isn't handling this. > We set retries on a flush command and we give sd_sync_cache three > goes. Any one of those should also cause the CC/UA to be ignored. > > James > > Sorry for delay, I agree safer to retry this command. I checked the code path, in scsi_io_completion, we call __scsi_error_from_host_byte for FLUSH request, and we set error to EIO by default, somehow the code report error directly to user space without retry. [ 647.206270] sd 1:0:0:0: [sdb] tag#0 Send: scmd 0xffff8800b6558480 [ 647.206422] sd 1:0:0:0: [sdb] tag#0 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00 [ 647.209748] sd 1:0:0:0: Capacity data has changed [ 647.209896] sd 1:0:0:0: [sdb] tag#0 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK [ 647.210157] sd 1:0:0:0: [sdb] tag#0 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00 [ 647.210419] sd 1:0:0:0: [sdb] tag#0 Sense Key : Unit Attention [current] [ 647.210567] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Capacity data has changed [ 647.210741] sd 1:0:0:0: [sdb] tag#0 scsi host busy 1 failed 0 [ 647.210888] sd 1:0:0:0: Notifying upper driver of completion (result 8000002) [ 647.211035] sd 1:0:0:0: [sdb] tag#0 sd_done: completed 0 of 0 bytes [ 647.211182] sd 1:0:0:0: [sdb] tag#0 0 sectors total, 0 bytes done, error -5 [ 647.211330] blk_update_request: I/O error, dev sdb, sector 0 Will figure out why retry doesn't work. Thanks James and Hannes for all your input. Regards, Jack -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 4, 2016 at 7:02 PM, Jinpu Wang <jinpu.wang@profitbricks.com> wrote: > On Mon, May 2, 2016 at 3:44 PM, James Bottomley <jejb@linux.vnet.ibm.com> wrote: >> On Mon, 2016-05-02 at 12:05 +0200, Hannes Reinecke wrote: >>> On 04/29/2016 02:49 PM, Jinpu Wang wrote: >>> > Hi, all >>> > >>> > We hit IO error on fsync, it turns out was because sd treat >>> > succeeded >>> > SYNC as error. From what I checked in SBC spec there is no >>> > indication >>> > we should fail IO in this case, so we create this patch. >>> > >>> > >>> > Best Regards, >>> > >>> > Jack Wang >>> > >>> > v2: >>> > No change on patch itself, only resend in body as suggested by >>> > Bart, >>> > still keep the attachment in case mail client break the format. >>> > >>> > From 5d1f72d9643ce61cd9f3d312377378c43f171d0c Mon Sep 17 00:00:00 >>> > 2001 >>> > From: Jack Wang <jinpu.wang@profitbricks.com> >>> > Date: Mon, 25 Apr 2016 12:05:22 +0200 >>> > Subject: [PATCH] sd: Don't treat succeeded SYNC as error >>> > >>> > We hit IO error in our production on multipath devices during >>> > resize >>> > device on target side, the problem turns out sd driver passes up as >>> > IO >>> > error when sense data is UNIT_ATTENTION and ASC && ASCQ indicate >>> > Capacity data has changed, even storage side sync the data >>> > properly. >>> > >>> > In order to fix this check in sd_done, report success if condition >>> > matches. >>> > >>> > Sebastian Parschauer report/analyze the bug here: >>> > https://sourceforge.net/p/scst/mailman/message/34953416/ >>> > >>> > Signed-off-by: Sebastian Parschauer <s.parschauer@gmx.de> >>> > Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> >>> > --- >>> > drivers/scsi/sd.c | 13 +++++++++++++ >>> > 1 file changed, 13 insertions(+) >>> > >>> Well. >>> Is there anything which guarantees us that 'capacity data has >>> changed' will be the only sense code which we'll be seeing as a >>> response to SYNCHRONIZE CACHE? >>> I sincerely doubt so. >>> So why don't you fall back to the default action (ie retry the >>> command) whenever you hit an UNIT ATTENTION? >>> This way we would cove any resulting sense code, _and_ would get rid >>> of the rather ugly special case here. >> >> Actually, why are we getting here at all? should we be eating this >> unit attention once we've reported it in scsi_check_sense()? >> >> I also don't quite understand why the normal retry mechanism in >> scsi_io_completion() (called after drv->done()) isn't handling this. >> We set retries on a flush command and we give sd_sync_cache three >> goes. Any one of those should also cause the CC/UA to be ignored. >> >> James >> >> > > Sorry for delay, I agree safer to retry this command. > I checked the code path, in scsi_io_completion, we call > __scsi_error_from_host_byte for FLUSH request, > and we set error to EIO by default, somehow the code report error > directly to user space without retry. > [ 647.206270] sd 1:0:0:0: [sdb] tag#0 Send: scmd 0xffff8800b6558480 > [ 647.206422] sd 1:0:0:0: [sdb] tag#0 CDB: Synchronize Cache(10) 35 > 00 00 00 00 00 00 00 00 00 > [ 647.209748] sd 1:0:0:0: Capacity data has changed > [ 647.209896] sd 1:0:0:0: [sdb] tag#0 Done: SUCCESS Result: > hostbyte=DID_OK driverbyte=DRIVER_OK > [ 647.210157] sd 1:0:0:0: [sdb] tag#0 CDB: Synchronize Cache(10) 35 > 00 00 00 00 00 00 00 00 00 > [ 647.210419] sd 1:0:0:0: [sdb] tag#0 Sense Key : Unit Attention [current] > [ 647.210567] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Capacity data has changed > [ 647.210741] sd 1:0:0:0: [sdb] tag#0 scsi host busy 1 failed 0 > [ 647.210888] sd 1:0:0:0: Notifying upper driver of completion (result 8000002) > [ 647.211035] sd 1:0:0:0: [sdb] tag#0 sd_done: completed 0 of 0 bytes > [ 647.211182] sd 1:0:0:0: [sdb] tag#0 0 sectors total, 0 bytes done, error -5 > [ 647.211330] blk_update_request: I/O error, dev sdb, sector 0 > > Will figure out why retry doesn't work. > > Thanks James and Hannes for all your input. > > Regards, > Jack Hi James, Hannes and all, I find out it's code below which report error directly back to user space without any retry. 913 /* 914 * If we finished all bytes in the request we are done now. 915 */ 916 if (!scsi_end_request(req, error, good_bytes, 0)) 917 return; But not sure, what's the best way to fix the behavior to let it retry, maybe add condition with sense key && asc && ascq direct go to requeue before line 913? Thanks Jack -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From 5d1f72d9643ce61cd9f3d312377378c43f171d0c Mon Sep 17 00:00:00 2001 From: Jack Wang <jinpu.wang@profitbricks.com> Date: Mon, 25 Apr 2016 12:05:22 +0200 Subject: [PATCH] sd: Don't treat succeeded SYNC as error We hit IO error in our production on multipath devices during resize device on target side, the problem turns out sd driver passes up as IO error when sense data is UNIT_ATTENTION and ASC && ASCQ indicate Capacity data has changed, even storage side sync the data properly. In order to fix this check in sd_done, report success if condition matches. Sebastian Parschauer report/analyze the bug here: https://sourceforge.net/p/scst/mailman/message/34953416/ Signed-off-by: Sebastian Parschauer <s.parschauer@gmx.de> Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> --- drivers/scsi/sd.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 5a5457a..e9bfe01 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -1833,6 +1833,19 @@ static int sd_done(struct scsi_cmnd *SCpnt) } } break; + case UNIT_ATTENTION: + /* Capacity data has changed */ + if (sshdr.asc == 0x2a && sshdr.ascq == 0x09) { + switch (op) { + /* don't treat succeeded fsync() as error */ + case SYNCHRONIZE_CACHE: + case SYNCHRONIZE_CACHE_16: + if (good_bytes == scsi_bufflen(SCpnt)) + SCpnt->result = 0; + break; + } + } + break; default: break; } -- 1.9.1