Message ID | 1443223134-9886-1-git-send-email-mrochs@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 "Matthew R. Ochs" <mrochs@linux.vnet.ibm.com> writes: > The process_sense() routine can perform a read capacity which > can take some time to complete. If an EEH occurs while waiting > on the read capacity, the EEH handler is unable to obtain the > context's mutex in order to put the context in an error state. > The EEH handler will sit and wait until the context is free, > but this wait can last longer than the EEH handler tolerates, > leading to a failed recovery. I'm not quite clear on what you mean by the EEH handler timing out. AFAIK there's nothing in eehd and the EEH core that times out if a driver doesn't respond - indeed, it's pretty easy to hang eehd with a misbehaving driver. Are you referring to your own internal timeouts? cxlflash_wait_for_pci_err_recovery and anything else that uses CXLFLASH_PCI_ERROR_RECOVERY_TIMEOUT? Regards, Daniel > > To address this issue, make the context unavailable to new, > non-system owned threads and release the context while calling > into process_sense(). After returning from process_sense() the > context mutex is reacquired and the context is made available > again. The context can be safely moved to the error state if > needed during the unavailable window as no other threads will > hold its reference. > > Signed-off-by: Matthew R. Ochs <mrochs@linux.vnet.ibm.com> > Signed-off-by: Manoj N. Kumar <manoj@linux.vnet.ibm.com> > Reviewed-by: Brian King <brking@linux.vnet.ibm.com> > --- > drivers/scsi/cxlflash/superpipe.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > diff --git a/drivers/scsi/cxlflash/superpipe.c b/drivers/scsi/cxlflash/superpipe.c > index a6316f5..7283e83 100644 > --- a/drivers/scsi/cxlflash/superpipe.c > +++ b/drivers/scsi/cxlflash/superpipe.c > @@ -1787,12 +1787,21 @@ static int cxlflash_disk_verify(struct scsi_device *sdev, > * inquiry (i.e. the Unit attention is due to the WWN changing). > */ > if (verify->hint & DK_CXLFLASH_VERIFY_HINT_SENSE) { > + /* Can't hold mutex across process_sense/read_cap16, > + * since we could have an intervening EEH event. > + */ > + ctxi->unavail = true; > + mutex_unlock(&ctxi->mutex); > rc = process_sense(sdev, verify); > if (unlikely(rc)) { > dev_err(dev, "%s: Failed to validate sense data (%d)\n", > __func__, rc); > + mutex_lock(&ctxi->mutex); > + ctxi->unavail = false; > goto out; > } > + mutex_lock(&ctxi->mutex); > + ctxi->unavail = false; > } > > switch (gli->mode) { > -- > 2.1.0 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 Comment: GPGTools - https://gpgtools.org iQIcBAEBCgAGBQJWCeieAAoJEPC3R3P2I92F+hMP/1OdLQCin+kKbOb9qxf952bH DUAkmEhc0oD7xZFQI8HgDmHRxpes5HHxXtwXFsLgsr8QYG+aOIV568GXIZtTbrl0 aCFMqtKXZ6jVqv5L60r1tgzcWxmWdshMLd1op6t3BwA67nUc5Edcr94ePUyDDLj1 at335wCnxuGxn0kdB0Ud/lbPzTsgDPcuV6tCLy0o4J15KFOyFt9hCjO4nmL/wcIt kmjyn5SHbdgje+73uaRQnXkli4wDA9x7x6/8wFgLspnOxgMEJgnHmm+HYbOXnHyX nFFHw9+X2ETUcucVWuKNaFzW1vH+WJDteEZbjS7t7liJIkmIiZSFHyUTtVGdBkl1 FsWswA0pkzuGq94Wsb0nGtNHbsMw+WeWTcTlNN46DMG/wqz75iO3yMGK9MZuddSX 9jUokiM0kQvvfwAoujmvpMCVB4b2oseRRG4/yJ0lKSCcC8kETQTXgVHbT8oLmCdk rUA0hxbbKzVQsDzw8s5HqYZjqHdLp3sDPeyukPeJl2CNhysrmnyHXpq8XgcLi3op kbuuiR3z8UH3MW4BDpplnjhZ+5Wyw9cSI57vRF2Kr80NnU+5hBvftNh4rBneeny2 0gCDlPHDvB7Ks9HkcxkK9MW78FTgj50ePofS/dUUod4M9ohDd4MSwRKjpwQ+H3By jmxnzfvWO/oTlL1D9+2W =3BcU -----END PGP SIGNATURE----- -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> On Sep 28, 2015, at 8:25 PM, Daniel Axtens <dja@axtens.net> wrote: > > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA512 > > "Matthew R. Ochs" <mrochs@linux.vnet.ibm.com> writes: > > >> The process_sense() routine can perform a read capacity which >> can take some time to complete. If an EEH occurs while waiting >> on the read capacity, the EEH handler is unable to obtain the >> context's mutex in order to put the context in an error state. >> The EEH handler will sit and wait until the context is free, >> but this wait can last longer than the EEH handler tolerates, >> leading to a failed recovery. > > I'm not quite clear on what you mean by the EEH handler timing > out. AFAIK there's nothing in eehd and the EEH core that times out if a > driver doesn't respond - indeed, it's pretty easy to hang eehd with a > misbehaving driver. > > Are you referring to your own internal timeouts? > cxlflash_wait_for_pci_err_recovery and anything else that uses > CXLFLASH_PCI_ERROR_RECOVERY_TIMEOUT? Reading through this again I can see how this is misleading. This is actually similar and related to the deadlock scenario described in "Fix to avoid potential deadlock on EEH". Without this fix, you'd end up in a similar situation but deadlocked on the context mutex instead of the ioctl semaphore. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
"Matthew R. Ochs" <mrochs@linux.vnet.ibm.com> writes: >>> The process_sense() routine can perform a read capacity which >>> can take some time to complete. If an EEH occurs while waiting >>> on the read capacity, the EEH handler is unable to obtain the >>> context's mutex in order to put the context in an error state. >>> The EEH handler will sit and wait until the context is free, >>> but this wait can last longer than the EEH handler tolerates, >>> leading to a failed recovery. >> >> I'm not quite clear on what you mean by the EEH handler timing >> out. AFAIK there's nothing in eehd and the EEH core that times out if a >> driver doesn't respond - indeed, it's pretty easy to hang eehd with a >> misbehaving driver. >> >> Are you referring to your own internal timeouts? >> cxlflash_wait_for_pci_err_recovery and anything else that uses >> CXLFLASH_PCI_ERROR_RECOVERY_TIMEOUT? > > Reading through this again I can see how this is misleading. This is > actually similar and related to the deadlock scenario described in > "Fix to avoid potential deadlock on EEH". Without this fix, you'd end > up in a similar situation but deadlocked on the context mutex instead > of the ioctl semaphore. That makes _much_ more sense. If you could please revise the commit message to explain that, you can include this in the next version: Reviewed-by: Daniel Axtens <dja@axtens.net> Regards, Daniel -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/scsi/cxlflash/superpipe.c b/drivers/scsi/cxlflash/superpipe.c index a6316f5..7283e83 100644 --- a/drivers/scsi/cxlflash/superpipe.c +++ b/drivers/scsi/cxlflash/superpipe.c @@ -1787,12 +1787,21 @@ static int cxlflash_disk_verify(struct scsi_device *sdev, * inquiry (i.e. the Unit attention is due to the WWN changing). */ if (verify->hint & DK_CXLFLASH_VERIFY_HINT_SENSE) { + /* Can't hold mutex across process_sense/read_cap16, + * since we could have an intervening EEH event. + */ + ctxi->unavail = true; + mutex_unlock(&ctxi->mutex); rc = process_sense(sdev, verify); if (unlikely(rc)) { dev_err(dev, "%s: Failed to validate sense data (%d)\n", __func__, rc); + mutex_lock(&ctxi->mutex); + ctxi->unavail = false; goto out; } + mutex_lock(&ctxi->mutex); + ctxi->unavail = false; } switch (gli->mode) {