Message ID | 1535690081-16116-1-git-send-email-suganath-prabu.subramani@broadcom.com (mailing list archive) |
---|---|
Headers | show |
Series | mpt3sas: Hot-Plug Surprise removal support on IOC. | expand |
+Cc: Lukas (I think he might be interested to look at this) On Fri, Aug 31, 2018 at 7:37 AM Suganath Prabu S <suganath-prabu.subramani@broadcom.com> wrote: > > Posting below set of patches to support PCIe Hot Plug surprise removal, > and few defect fixes. > > This is NOT the normal PCIe Hot Plug support, whereby the user informs the > OS that a hot removal is desired, the OS does an orderly shutdown of the driver > on the device, special hot plug circuitry removes power from the PCIe slot, > then the user can remove the device and replace it > (where orderly bring-up of the device is done). > > With a true surprise removal (just removing HBA from a slot) > there is a possibility to get all kinds of PCIe transaction errors, > Below patches addresses those issues and remove HBA without bringing > the system down. > > For surprise removal detection, driver does a PCI > read of IOC's vendor field in IOC's PCI configuration space. > If the read value is 0xFFFFFFFF this indicates that the device > might have hot removed and the device will be removed from driver. > > V1 changes: > In Patch 0001 - unlock mutex, if active reset is in progress. > > Suganath Prabu S (7): > mpt3sas: Introduce mpt3sas_base_pci_device_is_unplugged > mpt3sas: Add HBA hot plug watchdog thread. > mpt3sas: Seperate out mpt3sas_wait_for_ioc_to_operational > mpt3sas: Introdude _scsih_get_shost_and_ioc. > mpt3sas: Fix Sync cache command failure during driver unload. > mpt3sas: Fix driver modifying NVRAM/persistent data. > mpt3sas: Bump driver version to 27.100.00.00. > > drivers/scsi/mpt3sas/mpt3sas_base.c | 248 ++++++++++++++++++++++++++----- > drivers/scsi/mpt3sas/mpt3sas_base.h | 17 ++- > drivers/scsi/mpt3sas/mpt3sas_config.c | 32 +--- > drivers/scsi/mpt3sas/mpt3sas_ctl.c | 26 +--- > drivers/scsi/mpt3sas/mpt3sas_scsih.c | 188 ++++++++++++++++++++--- > drivers/scsi/mpt3sas/mpt3sas_transport.c | 82 +++------- > 6 files changed, 414 insertions(+), 179 deletions(-) > > -- > 1.8.3.1 >
[cc += linux-pci, benh] On Fri, Aug 31, 2018 at 7:37 AM Suganath Prabu S <suganath-prabu.subramani@broadcom.com> wrote: > Posting below set of patches to support PCIe Hot Plug surprise removal, > and few defect fixes. Please cross-post to linux-pci in the future. Regarding [PATCH 1/7] mpt3sas: Introduce mpt3sas_base_pci_device_is_unplugged: https://www.spinics.net/lists/linux-scsi/msg122962.html * mpt3sas_base_pci_device_is_unplugged() is a duplication of the existing pci_device_is_present(). * Just reading the vendor ID may not be sufficient to detect unplug, it may also read as "all ones" if the link is down due to error recovery by DPC. Regarding [PATCH 2/7] mpt3sas: Add HBA hot plug watchdog thread: https://www.spinics.net/lists/linux-scsi/msg122963.html * I don't see why you need to poll for the device's removal from a watchdog thread. pciehp will invoke your driver's ->remove hook once the device is gone. * A recent discussion initiated by Benjamin Herrenschmidt came to the conclusion that device removal should be treated as a type of error state (either pci_channel_io_perm_failure or another, newly introduced state). It will then be possible to detect the device's inaccessibility with pci_channel_offline(). Please help work towards such a future solution in the PCI core instead of solutions localized to a single device driver. Sorry, the discussion was lengthy, it is available here: https://www.spinics.net/lists/linux-pci/msg75425.html Thanks, Lukas
On Fri, Aug 31, 2018 at 2:25 PM, Lukas Wunner <lukas@wunner.de> wrote: > [cc += linux-pci, benh] > > On Fri, Aug 31, 2018 at 7:37 AM Suganath Prabu S <suganath-prabu.subramani@broadcom.com> wrote: >> Posting below set of patches to support PCIe Hot Plug surprise removal, >> and few defect fixes. > > Please cross-post to linux-pci in the future. > > > Regarding [PATCH 1/7] mpt3sas: Introduce mpt3sas_base_pci_device_is_unplugged: > https://www.spinics.net/lists/linux-scsi/msg122962.html > > * mpt3sas_base_pci_device_is_unplugged() is a duplication of the existing > pci_device_is_present(). Thanks for pointing this pci_device_is_present() API, we will replace mpt3sas_base_pci_device_is_unplugged() with pci_device_is_present(). > > * Just reading the vendor ID may not be sufficient to detect unplug, > it may also read as "all ones" if the link is down due to error > recovery by DPC. So, is their any other way to detect pci device unplug apart from reading the vendor ID, I mean we have check any other flags, etc? > > > Regarding [PATCH 2/7] mpt3sas: Add HBA hot plug watchdog thread: > https://www.spinics.net/lists/linux-scsi/msg122963.html > > * I don't see why you need to poll for the device's removal from a > watchdog thread. pciehp will invoke your driver's ->remove hook > once the device is gone. If we have some three to four PCI devices and all pci devices are hot unplugged simultaneously, then we observed that driver's-remove hook is called sequentially. So it takes some time to call fourth PCI device driver's->remove hook. so during this time we want all the outstanding commands to be gracefully terminated and hence we added this watchdog thread to quickly detect the hba unplug and take necessary steps such as gracefully terminate the outstanding IOs and stop receiving further IOs on it. At later time when PCI subsystem calls driver's-remove hook then driver can quickly release the resources allocated for this unplugged device. > > * A recent discussion initiated by Benjamin Herrenschmidt came to the > conclusion that device removal should be treated as a type of > error state (either pci_channel_io_perm_failure or another, newly > introduced state). It will then be possible to detect the device's > inaccessibility with pci_channel_offline(). Please help work towards > such a future solution in the PCI core instead of solutions localized > to a single device driver. Sorry, the discussion was lengthy, it is > available here: > https://www.spinics.net/lists/linux-pci/msg75425.html Oh great, sure. We have very limited knowledge on PCI subsystem but we try our best in future to provide solutions in the PCI core. > > Thanks, > > Lukas
On Tue, Sep 04, 2018 at 11:19:04AM +0530, Sreekanth Reddy wrote: > On Fri, Aug 31, 2018 at 2:25 PM, Lukas Wunner <lukas@wunner.de> wrote: > > * Just reading the vendor ID may not be sufficient to detect unplug, > > it may also read as "all ones" if the link is down due to error > > recovery by DPC. > > So, is their any other way to detect pci device unplug apart from > reading the vendor ID, I mean we have check any other flags, etc? Many scsi drivers call pci_channel_offline() to detect inaccessibility of the device due to a PCI error: https://elixir.bootlin.com/linux/v4.19-rc2/ident/pci_channel_offline A patch is pending such that surprise removal can also be queried with that same function: https://www.spinics.net/lists/linux-pci/msg75722.html > > * I don't see why you need to poll for the device's removal from a > > watchdog thread. pciehp will invoke your driver's ->remove hook > > once the device is gone. > > If we have some three to four PCI devices and all pci devices are hot > unplugged simultaneously, then we observed that driver's-remove hook > is called sequentially. So it takes some time to call fourth PCI > device driver's->remove hook. so during this time we want all the > outstanding commands to be gracefully terminated and hence we added > this watchdog thread to quickly detect the hba unplug and take > necessary steps such as gracefully terminate the outstanding IOs and > stop receiving further IOs on it. At later time when PCI subsystem > calls driver's-remove hook then driver can quickly release the > resources allocated for this unplugged device. pciehp does the following as soon as the surprise removal event is handled: pci_dev_set_disconnected(dev, NULL); if (pci_has_subordinate(dev)) pci_walk_bus(dev->subordinate, pci_dev_set_disconnected, NULL); Thus, once the above-linked patch lands, you'll be able to detect surprise removal by calling pci_channel_offline() or checking pdev->error_state == pci_channel_io_perm_failure, obviating the need to poll for surprise removal from a kthread. There's another patch pending to move pci_walk_bus() out of the pci_lock_rescan_remove() so that it runs even earlier, without pointlessly waiting for a lock: https://www.spinics.net/lists/linux-pci/msg75438.html Thanks, Lukas
l On Tue, Sep 4, 2018 at 3:12 PM, Lukas Wunner <lukas@wunner.de> wrote: > On Tue, Sep 04, 2018 at 11:19:04AM +0530, Sreekanth Reddy wrote: >> On Fri, Aug 31, 2018 at 2:25 PM, Lukas Wunner <lukas@wunner.de> wrote: >> > * Just reading the vendor ID may not be sufficient to detect unplug, >> > it may also read as "all ones" if the link is down due to error >> > recovery by DPC. >> >> So, is their any other way to detect pci device unplug apart from >> reading the vendor ID, I mean we have check any other flags, etc? > > Many scsi drivers call pci_channel_offline() to detect inaccessibility > of the device due to a PCI error: > https://elixir.bootlin.com/linux/v4.19-rc2/ident/pci_channel_offline > > A patch is pending such that surprise removal can also be queried > with that same function: > https://www.spinics.net/lists/linux-pci/msg75722.html Lukas, thanks for pointing to this pci_dev_is_disconnected() API. So we can use this API directly instead of reading the vendor Id and checking for all one's once this patch get accepted? > > >> > * I don't see why you need to poll for the device's removal from a >> > watchdog thread. pciehp will invoke your driver's ->remove hook >> > once the device is gone. >> >> If we have some three to four PCI devices and all pci devices are hot >> unplugged simultaneously, then we observed that driver's-remove hook >> is called sequentially. So it takes some time to call fourth PCI >> device driver's->remove hook. so during this time we want all the >> outstanding commands to be gracefully terminated and hence we added >> this watchdog thread to quickly detect the hba unplug and take >> necessary steps such as gracefully terminate the outstanding IOs and >> stop receiving further IOs on it. At later time when PCI subsystem >> calls driver's-remove hook then driver can quickly release the >> resources allocated for this unplugged device. > > pciehp does the following as soon as the surprise removal event is handled: > > pci_dev_set_disconnected(dev, NULL); > if (pci_has_subordinate(dev)) > pci_walk_bus(dev->subordinate, pci_dev_set_disconnected, NULL); > > Thus, once the above-linked patch lands, you'll be able to detect > surprise removal by calling pci_channel_offline() or checking > pdev->error_state == pci_channel_io_perm_failure, obviating the > need to poll for surprise removal from a kthread. > > There's another patch pending to move pci_walk_bus() out of the > pci_lock_rescan_remove() so that it runs even earlier, without > pointlessly waiting for a lock: > https://www.spinics.net/lists/linux-pci/msg75438.html Yes due to this pci_lock_rescan_remove() lock, we observed that driver's->remove hook called sequentially even though multiple HBA devices are hot unplugged simultaneously. I have one more instance where still we need this poll kthread, i.e during the device probe time we have some commands which has more timeout value (e.g. 300 seconds), so if user has unplugged this device just after sending this more time-out valued command then driver has to wait until this time-out value expires. i.e. this device is still visible in lspci output until this 300 seconds timeout value expires even though device is unplugged. if we have a poll kthread (which will poll for every one second) then driver can early detect the unplugged state and can terminate the outstanding commands and hence probe operation can be completed quickly. Also whether we need to wait for below patches get accepted? so that we can post the new set of patches accommodating according to your suggestions, https://www.spinics.net/lists/linux-pci/msg75722.html https://www.spinics.net/lists/linux-pci/msg75438.html > > Thanks, > > Lukas
On Wed, Sep 05, 2018 at 11:45:45AM +0530, Sreekanth Reddy wrote: > On Tue, Sep 4, 2018 at 3:12 PM, Lukas Wunner <lukas@wunner.de> wrote: > > Many scsi drivers call pci_channel_offline() to detect inaccessibility > > of the device due to a PCI error: > > https://elixir.bootlin.com/linux/v4.19-rc2/ident/pci_channel_offline > > > > A patch is pending such that surprise removal can also be queried > > with that same function: > > https://www.spinics.net/lists/linux-pci/msg75722.html > > Lukas, thanks for pointing to this pci_dev_is_disconnected() API. So > we can use this API directly instead of reading the vendor Id and > checking for all one's once this patch get accepted? Yes, except pci_dev_is_disconnected() is private to the PCI core, but dev->error_state and pci_channel_offline() is public. > I have one more instance where still we need this poll kthread, i.e > during the device probe time we have some commands which has more > timeout value (e.g. 300 seconds), so if user has unplugged this device > just after sending this more time-out valued command then driver has > to wait until this time-out value expires. i.e. this device is still > visible in lspci output until this 300 seconds timeout value expires > even though device is unplugged. if we have a poll kthread (which will > poll for every one second) then driver can early detect the unplugged > state and can terminate the outstanding commands and hence probe > operation can be completed quickly. The only instances I can see in your driver where it waits for 300 s is in _base_diag_reset(), which does an msleep(256) in a loop for up to 300 s, and scsih_scan_finished(), which is called in a loop with an msleep(10) by do_scsi_scan_host(). Any harm in simply checking for removal of the device in those loops and bailing out if so? Instead of the poll kthread to achieve the same? > Also whether we need to wait for below patches get accepted? so that > we can post the new set of patches accommodating according to your > suggestions, > https://www.spinics.net/lists/linux-pci/msg75722.html > https://www.spinics.net/lists/linux-pci/msg75438.html I can't tell you whether and when those patches get accepted, that's for Bjorn Helgaas to decide. Also, what does get accepted might differ. E.g. the first patch uses the existing pci_channel_io_perm_failure state for removal. There was debate whether to introduce a new state for removed devices to avoid overloading the existing state, which is used for error recovery. All I can recommend is to follow linux-pci, test the patches that have already been brought forward to ascertain they fulfil your needs, and generally participate in the debate so that your use cases are covered. Thanks, Lukas
On Wed, Sep 05, 2018 at 09:38:16AM +0200, Lukas Wunner wrote: > On Wed, Sep 05, 2018 at 11:45:45AM +0530, Sreekanth Reddy wrote: > > On Tue, Sep 4, 2018 at 3:12 PM, Lukas Wunner <lukas@wunner.de> wrote: > > > Many scsi drivers call pci_channel_offline() to detect inaccessibility > > > of the device due to a PCI error: > > > https://elixir.bootlin.com/linux/v4.19-rc2/ident/pci_channel_offline > > > > > > A patch is pending such that surprise removal can also be queried > > > with that same function: > > > https://www.spinics.net/lists/linux-pci/msg75722.html > > > > Lukas, thanks for pointing to this pci_dev_is_disconnected() API. So > > we can use this API directly instead of reading the vendor Id and > > checking for all one's once this patch get accepted? > > Yes, except pci_dev_is_disconnected() is private to the PCI core, > but dev->error_state and pci_channel_offline() is public. The exported function to call is pci_device_is_present().
On Wed, Sep 5, 2018 at 1:08 PM, Lukas Wunner <lukas@wunner.de> wrote: > On Wed, Sep 05, 2018 at 11:45:45AM +0530, Sreekanth Reddy wrote: >> On Tue, Sep 4, 2018 at 3:12 PM, Lukas Wunner <lukas@wunner.de> wrote: >> > Many scsi drivers call pci_channel_offline() to detect inaccessibility >> > of the device due to a PCI error: >> > https://elixir.bootlin.com/linux/v4.19-rc2/ident/pci_channel_offline >> > >> > A patch is pending such that surprise removal can also be queried >> > with that same function: >> > https://www.spinics.net/lists/linux-pci/msg75722.html >> >> Lukas, thanks for pointing to this pci_dev_is_disconnected() API. So >> we can use this API directly instead of reading the vendor Id and >> checking for all one's once this patch get accepted? > > Yes, except pci_dev_is_disconnected() is private to the PCI core, > but dev->error_state and pci_channel_offline() is public. As Busch suggested, we will use pci_device_is_present() API. We have tested HBA hot-unplug operations with this API and the things are working fine. > > >> I have one more instance where still we need this poll kthread, i.e >> during the device probe time we have some commands which has more >> timeout value (e.g. 300 seconds), so if user has unplugged this device >> just after sending this more time-out valued command then driver has >> to wait until this time-out value expires. i.e. this device is still >> visible in lspci output until this 300 seconds timeout value expires >> even though device is unplugged. if we have a poll kthread (which will >> poll for every one second) then driver can early detect the unplugged >> state and can terminate the outstanding commands and hence probe >> operation can be completed quickly. > > The only instances I can see in your driver where it waits for 300 s > is in _base_diag_reset(), which does an msleep(256) in a loop for up > to 300 s, and scsih_scan_finished(), which is called in a loop with an > msleep(10) by do_scsi_scan_host(). > > Any harm in simply checking for removal of the device in those loops > and bailing out if so? Instead of the poll kthread to achieve the same? we can do for this port enable request but still we have other requests where we don't have this type of loops and used wait_for_completion_timeout () API where we can't bailout the request in-between and we have to wait until the timeout value expires. For these types of request terminating it though watchdog thread will be simple. [PATCH 2/7] mpt3sas: Add HBA hot plug watchdog thread: https://www.spinics.net/lists/linux-scsi/msg122963.html In the above patch and in the worker thread function we are just setting ioc->remove_host flag, Apart from resetting this flag we have to terminate all the outstanding commands at the HBA level and that is missing in the above patch. we will post new set of patches. We feel that this watch dog thread is needed, as we need to gracefully terminate all the outstanding command at the HBA level apart from stop receiving the new commands. With watch dog thread we can easily terminate all the outstanding commands. > > >> Also whether we need to wait for below patches get accepted? so that >> we can post the new set of patches accommodating according to your >> suggestions, >> https://www.spinics.net/lists/linux-pci/msg75722.html >> https://www.spinics.net/lists/linux-pci/msg75438.html > > I can't tell you whether and when those patches get accepted, that's > for Bjorn Helgaas to decide. Also, what does get accepted might differ. > E.g. the first patch uses the existing pci_channel_io_perm_failure > state for removal. There was debate whether to introduce a new state > for removed devices to avoid overloading the existing state, which is > used for error recovery. > > All I can recommend is to follow linux-pci, test the patches that have > already been brought forward to ascertain they fulfil your needs, > and generally participate in the debate so that your use cases are > covered. > > Thanks, > > Lukas
On Wed, Sep 12, 2018 at 03:31:07PM +0530, Sreekanth Reddy wrote: > On Wed, Sep 5, 2018 at 1:08 PM, Lukas Wunner <lukas@wunner.de> wrote: > > On Wed, Sep 05, 2018 at 11:45:45AM +0530, Sreekanth Reddy wrote: > > > I have one more instance where still we need this poll kthread, i.e > > > during the device probe time we have some commands which has more > > > timeout value (e.g. 300 seconds), so if user has unplugged this device > > > just after sending this more time-out valued command then driver has > > > to wait until this time-out value expires. i.e. this device is still > > > visible in lspci output until this 300 seconds timeout value expires > > > even though device is unplugged. if we have a poll kthread (which will > > > poll for every one second) then driver can early detect the unplugged > > > state and can terminate the outstanding commands and hence probe > > > operation can be completed quickly. > > > > The only instances I can see in your driver where it waits for 300 s > > is in _base_diag_reset(), which does an msleep(256) in a loop for up > > to 300 s, and scsih_scan_finished(), which is called in a loop with an > > msleep(10) by do_scsi_scan_host(). > > > > Any harm in simply checking for removal of the device in those loops > > and bailing out if so? Instead of the poll kthread to achieve the same? > > we can do for this port enable request but still we have other > requests where we don't have this type of loops and used > wait_for_completion_timeout () API where we can't bailout the request > in-between and we have to wait until the timeout value expires. For > these types of request terminating it though watchdog thread will be > simple. When the HBA is hot-removed, its driver's ->remove callback is invoked. You could just check at the beginning of the ->remove callback whether the device is no longer present, and if it isn't, complete() any completions that may be pending. I think that would obviate the need for a watchdog. Thanks, Lukas