Message ID | 20200927032829.11321-3-haifeng.zhao@intel.com (mailing list archive)
---|---
State | Superseded, archived
Series | Fix DPC hotplug race and enhance error handling
On 9/26/2020 11:28 PM, Ethan Zhao wrote:
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> index 53433b37e181..6f271160f18d 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
>  	down_read(&ctrl->reset_lock);
>  	if (events & DISABLE_SLOT)
>  		pciehp_handle_disable_request(ctrl);
> -	else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
> +	else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) {
> +		pci_wait_port_outdpc(pdev);
>  		pciehp_handle_presence_or_link_change(ctrl, events);
> +	}
>  	up_read(&ctrl->reset_lock);

This looks like a hack TBH.

Lukas, Keith;

What is your take on this?
Why is the device lock not protecting this situation?

Is there a lock missing in the hotplug driver?

Sinan
Sinan,
    I explained the reason why locks don't protect this case in the patch description part.
Write side and read side hold different semaphore and mutex.

Thanks,
Ethan

-----Original Message-----
From: Sinan Kaya <okaya@kernel.org>
Sent: Sunday, September 27, 2020 11:28 PM
To: Zhao, Haifeng <haifeng.zhao@intel.com>; bhelgaas@google.com; oohall@gmail.com; ruscur@russell.cc; lukas@wunner.de; andriy.shevchenko@linux.intel.com; stuart.w.hayes@gmail.com; mr.nuke.me@gmail.com; mika.westerberg@linux.intel.com; Keith Busch <keith.busch@intel.com>
Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Jia, Pei P <pei.p.jia@intel.com>; ashok.raj@linux.intel.com; Kuppuswamy, Sathyanarayanan <sathyanarayanan.kuppuswamy@intel.com>
Subject: Re: [PATCH 2/5 V2] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC
On 9/27/2020 10:01 PM, Zhao, Haifeng wrote:
> Sinan,
>     I explained the reason why locks don't protect this case in the patch description part.
> Write side and read side hold different semaphore and mutex.
>
I have been thinking about it for some time, but is there any reason why we
have to handle all port AER/DPC/HP events in different threads?

Can we go to a single-threaded event loop for all port driver events?

This will require some refactoring, but it will eliminate the lock
nightmares we are having.

This means no sleeping. All sleeps need to happen outside of the loop.

I wanted to see what you all are thinking about this.

It might become a performance problem if the system is
continuously observing hotplug/aer/dpc events.

I always think that these should be rare events.
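Sinan's suggestion can be sketched in plain userspace C (not kernel code; all names below are illustrative, not the kernel's): a single thread drains one queue of port-service events, so the DPC, AER and hotplug handlers can never run concurrently and need no cross-service locking.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative event types for a hypothetical unified port-service loop. */
enum port_event { EV_HOTPLUG, EV_AER, EV_DPC };

const char *handled[16];
int nhandled;

static void handle(enum port_event ev)
{
	/* No sleeping here: per the proposal above, anything that must
	 * sleep would be deferred outside the loop. */
	switch (ev) {
	case EV_DPC:     handled[nhandled++] = "dpc";     break;
	case EV_HOTPLUG: handled[nhandled++] = "hotplug"; break;
	case EV_AER:     handled[nhandled++] = "aer";     break;
	}
}

/* The single-threaded loop: events are serialized by construction,
 * so a DPC recovery always finishes before the hotplug handler sees
 * the resulting DLLSC/PDC events. */
void process_events(const enum port_event *queue, size_t n)
{
	for (size_t i = 0; i < n; i++)
		handle(queue[i]);
}
```

The serialization itself replaces the locking: ordering in the queue decides ordering of the handlers.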
On 9/28/2020 7:10 AM, Sinan Kaya wrote:
> On 9/27/2020 10:01 PM, Zhao, Haifeng wrote:
>> Sinan,
>>     I explained the reason why locks don't protect this case in the patch description part.
>> Write side and read side hold different semaphore and mutex.
>>
> I have been thinking about it for some time, but is there any reason why we
> have to handle all port AER/DPC/HP events in different threads?
>
> Can we go to a single-threaded event loop for all port driver events?
>
> This will require some refactoring, but it will eliminate the lock
> nightmares we are having.
>
> This means no sleeping. All sleeps need to happen outside of the loop.
>
> I wanted to see what you all are thinking about this.
>
> It might become a performance problem if the system is
> continuously observing hotplug/aer/dpc events.
>
> I always think that these should be rare events.

If restructuring would be too costly, the preferred solution should be
to fix the locks in the hotplug driver rather than throwing a random
wait call in there.
On 9/28/20 9:43 AM, Sinan Kaya wrote:
> On 9/28/2020 7:10 AM, Sinan Kaya wrote:
>> On 9/27/2020 10:01 PM, Zhao, Haifeng wrote:
>>> Sinan,
>>>     I explained the reason why locks don't protect this case in the patch description part.
>>> Write side and read side hold different semaphore and mutex.
>>>
>> I have been thinking about it for some time, but is there any reason why we
>> have to handle all port AER/DPC/HP events in different threads?
>>
>> Can we go to a single-threaded event loop for all port driver events?
>>
>> This will require some refactoring, but it will eliminate the lock
>> nightmares we are having.
>>
>> This means no sleeping. All sleeps need to happen outside of the loop.
>>
>> I wanted to see what you all are thinking about this.
>>
>> It might become a performance problem if the system is
>> continuously observing hotplug/aer/dpc events.
>>
>> I always think that these should be rare events.
> If restructuring would be too costly, the preferred solution should be
> to fix the locks in the hotplug driver rather than throwing a random
> wait call in there.

Since the current race condition is detected between DPC and
hotplug, I recommend synchronizing them.
On Tue, Sep 29, 2020 at 12:45 AM Kuppuswamy, Sathyanarayanan
<sathyanarayanan.kuppuswamy@intel.com> wrote:
>
> On 9/28/20 9:43 AM, Sinan Kaya wrote:
> > On 9/28/2020 7:10 AM, Sinan Kaya wrote:
> >> On 9/27/2020 10:01 PM, Zhao, Haifeng wrote:
> >>> Sinan,
> >>>     I explained the reason why locks don't protect this case in the patch description part.
> >>> Write side and read side hold different semaphore and mutex.
> >>>
> >> I have been thinking about it for some time, but is there any reason why we
> >> have to handle all port AER/DPC/HP events in different threads?
> >>
> >> Can we go to a single-threaded event loop for all port driver events?
> >>
> >> This will require some refactoring, but it will eliminate the lock
> >> nightmares we are having.
> >>
> >> This means no sleeping. All sleeps need to happen outside of the loop.
> >>
> >> I wanted to see what you all are thinking about this.
> >>
> >> It might become a performance problem if the system is
> >> continuously observing hotplug/aer/dpc events.
> >>
> >> I always think that these should be rare events.
> > If restructuring would be too costly, the preferred solution should be
> > to fix the locks in the hotplug driver rather than throwing a random
> > wait call in there.
> Since the current race condition is detected between DPC and
> hotplug, I recommend synchronizing them.

The locks are the first place to root-cause and try to fix, but it is
not so easy to refactor the remove/rescan semaphore and the bus-walk
mutex. That is too expensive: it means reworking every piece of code
that uses them.

Thanks,
Ethan

> --
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer
On Tue, Sep 29, 2020 at 12:44 AM Sinan Kaya <okaya@kernel.org> wrote:
>
> On 9/28/2020 7:10 AM, Sinan Kaya wrote:
> > On 9/27/2020 10:01 PM, Zhao, Haifeng wrote:
> >> Sinan,
> >>     I explained the reason why locks don't protect this case in the patch description part.
> >> Write side and read side hold different semaphore and mutex.
> >>
> > I have been thinking about it for some time, but is there any reason why we
> > have to handle all port AER/DPC/HP events in different threads?
> >
> > Can we go to a single-threaded event loop for all port driver events?
> >
> > This will require some refactoring, but it will eliminate the lock
> > nightmares we are having.
> >
> > This means no sleeping. All sleeps need to happen outside of the loop.
> >
> > I wanted to see what you all are thinking about this.
> >
> > It might become a performance problem if the system is
> > continuously observing hotplug/aer/dpc events.
> >
> > I always think that these should be rare events.
>
> If restructuring would be too costly, the preferred solution should be
> to fix the locks in the hotplug driver rather than throwing a random
> wait call in there.

My first thought was to unify pci_bus_sem and pci_rescan_remove_lock
into one sleepable lock, but verifying every locking scenario to sort
out the deadlock warnings is a horrible job. I gave up and instead
played the device-status-waiting trick to work around it:

index 03d37128a24f..477d4c499f87 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -3223,17 +3223,19 @@ EXPORT_SYMBOL_GPL(pci_rescan_bus);
  * pci_rescan_bus(), pci_rescan_bus_bridge_resize() and PCI device removal
  * routines should always be executed under this mutex.
  */
-static DEFINE_MUTEX(pci_rescan_remove_lock);
+/* static DEFINE_MUTEX(pci_rescan_remove_lock); */

 void pci_lock_rescan_remove(void)
 {
-	mutex_lock(&pci_rescan_remove_lock);
+	/* mutex_lock(&pci_rescan_remove_lock); */
+	down_write(&pci_bus_sem);
 }
 EXPORT_SYMBOL_GPL(pci_lock_rescan_remove);

 void pci_unlock_rescan_remove(void)
 {
-	mutex_unlock(&pci_rescan_remove_lock);
+	/* mutex_unlock(&pci_rescan_remove_lock); */
+	up_write(&pci_bus_sem);
 }
 EXPORT_SYMBOL_GPL(pci_unlock_rescan_remove);

Thanks,
Ethan
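The semantics of this unification can be sketched in userspace C (a toy single-threaded model, not a real lock and not kernel code): remove/rescan takes the write side of the same rw-semaphore whose read side protects bus walks, so the two paths finally exclude each other instead of holding unrelated locks.

```c
/* Toy model of a kernel rw_semaphore: "would block" is represented by
 * returning -1 instead of actually sleeping. */
struct rwsem { int readers; int writer; };

struct rwsem bus_sem;	/* models pci_bus_sem */

static int down_read_try(struct rwsem *s)
{
	if (s->writer)
		return -1;	/* would block: remove/rescan in progress */
	s->readers++;
	return 0;
}

static void up_read_done(struct rwsem *s) { s->readers--; }

static int down_write_try(struct rwsem *s)
{
	if (s->writer || s->readers)
		return -1;	/* would block until all bus walkers are gone */
	s->writer = 1;
	return 0;
}

static void up_write_done(struct rwsem *s) { s->writer = 0; }

/* pci_lock_rescan_remove() after the experiment above: the write side
 * of bus_sem rather than a separate pci_rescan_remove_lock mutex. */
int lock_rescan_remove(void)    { return down_write_try(&bus_sem); }
void unlock_rescan_remove(void) { up_write_done(&bus_sem); }
```

With one lock, a bus walk (read side) blocks a concurrent remove, and an in-progress remove blocks new walkers; with two independent locks, neither exclusion holds, which is exactly the race under discussion.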
On Sun, Sep 27, 2020 at 11:27:46AM -0400, Sinan Kaya wrote:
> On 9/26/2020 11:28 PM, Ethan Zhao wrote:
> > --- a/drivers/pci/hotplug/pciehp_hpc.c
> > +++ b/drivers/pci/hotplug/pciehp_hpc.c
> > @@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
> >  	down_read(&ctrl->reset_lock);
> >  	if (events & DISABLE_SLOT)
> >  		pciehp_handle_disable_request(ctrl);
> > -	else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
> > +	else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) {
> > +		pci_wait_port_outdpc(pdev);
> >  		pciehp_handle_presence_or_link_change(ctrl, events);
> > +	}
> >  	up_read(&ctrl->reset_lock);
>
> This looks like a hack TBH.
>
> Lukas, Keith;
>
> What is your take on this?
> Why is device lock not protecting this situation?
>
> Is there a lock missing in hotplug driver?

According to Ethan's commit message, there are two issues here:
One, that pciehp may remove a device even though DPC recovered the error,
and two, that a null pointer deref occurs.

The latter is most certainly not a locking issue but failure of DPC
to hold a reference on the pci_dev.

Thanks,

Lukas
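The reference-counting pattern Lukas refers to (the kernel's pci_dev_get()/pci_dev_put()) can be illustrated in userspace C; the struct and function names below are illustrative stand-ins, not kernel APIs. If the DPC path takes a reference before recovery and drops it afterwards, a concurrent hot-removal cannot free the device under DPC's feet.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Stand-in for struct pci_dev with an embedded reference count. */
struct dev {
	atomic_int refcount;
};

int freed_count;	/* counts actual frees, for demonstration */

struct dev *dev_get(struct dev *d)
{
	if (d)
		atomic_fetch_add(&d->refcount, 1);
	return d;
}

void dev_put(struct dev *d)
{
	/* The holder of the last reference frees the object. */
	if (d && atomic_fetch_sub(&d->refcount, 1) == 1) {
		freed_count++;
		free(d);
	}
}
```

In this model, even if hot-removal drops its reference mid-recovery, the object stays valid until the DPC side calls its own put; without the get, the null/use-after-free deref described in the commit message becomes possible.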
On Tue, Sep 29, 2020 at 4:29 PM Lukas Wunner <lukas@wunner.de> wrote:
>
> On Sun, Sep 27, 2020 at 11:27:46AM -0400, Sinan Kaya wrote:
> > On 9/26/2020 11:28 PM, Ethan Zhao wrote:
> > > --- a/drivers/pci/hotplug/pciehp_hpc.c
> > > +++ b/drivers/pci/hotplug/pciehp_hpc.c
> > > @@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
> > >  	down_read(&ctrl->reset_lock);
> > >  	if (events & DISABLE_SLOT)
> > >  		pciehp_handle_disable_request(ctrl);
> > > -	else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
> > > +	else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) {
> > > +		pci_wait_port_outdpc(pdev);
> > >  		pciehp_handle_presence_or_link_change(ctrl, events);
> > > +	}
> > >  	up_read(&ctrl->reset_lock);
> >
> > This looks like a hack TBH.
> >
> > Lukas, Keith;
> >
> > What is your take on this?
> > Why is device lock not protecting this situation?
> >
> > Is there a lock missing in hotplug driver?
>
> According to Ethan's commit message, there are two issues here:
> One, that pciehp may remove a device even though DPC recovered the error,
> and two, that a null pointer deref occurs.
>
> The latter is most certainly not a locking issue but failure of DPC
> to hold a reference on the pci_dev.

That is what patch 3/5 proposes to fix, while this one re-orders the
mixed DPC recovery procedure and DLLSC/PDC event handling, so that
pciehp knows the exact result of DPC recovery for the malfunctioning
device: link recovered and device still there, or device removed from
the slot.

Thanks,
Ethan

> Thanks,
>
> Lukas
On Tue, Sep 29, 2020 at 05:46:41PM +0800, Ethan Zhao wrote:
> On Tue, Sep 29, 2020 at 4:29 PM Lukas Wunner <lukas@wunner.de> wrote:
> > On Sun, Sep 27, 2020 at 11:27:46AM -0400, Sinan Kaya wrote:
> > > On 9/26/2020 11:28 PM, Ethan Zhao wrote:
> > > > --- a/drivers/pci/hotplug/pciehp_hpc.c
> > > > +++ b/drivers/pci/hotplug/pciehp_hpc.c
> > > > @@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
> > > >  	down_read(&ctrl->reset_lock);
> > > >  	if (events & DISABLE_SLOT)
> > > >  		pciehp_handle_disable_request(ctrl);
> > > > -	else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
> > > > +	else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) {
> > > > +		pci_wait_port_outdpc(pdev);
> > > >  		pciehp_handle_presence_or_link_change(ctrl, events);
> > > > +	}
> > > >  	up_read(&ctrl->reset_lock);
> > >
> > > This looks like a hack TBH.
[...]
> > > Why is device lock not protecting this situation?
> > > Is there a lock missing in hotplug driver?
> >
> > According to Ethan's commit message, there are two issues here:
> > One, that pciehp may remove a device even though DPC recovered the error,
> > and two, that a null pointer deref occurs.
> >
> > The latter is most certainly not a locking issue but failure of DPC
> > to hold a reference on the pci_dev.
>
> This is what patch 3/5 proposed to fix.

Please reorder the series to fix the null pointer deref first,
i.e. move patch 3 before patch 2. If the null pointer deref is
fixed by patch 3, do not mention it in patch 2.

Thanks,

Lukas
On Tue, Sep 29, 2020 at 6:08 PM Lukas Wunner <lukas@wunner.de> wrote:
>
> On Tue, Sep 29, 2020 at 05:46:41PM +0800, Ethan Zhao wrote:
> > On Tue, Sep 29, 2020 at 4:29 PM Lukas Wunner <lukas@wunner.de> wrote:
> > > According to Ethan's commit message, there are two issues here:
> > > One, that pciehp may remove a device even though DPC recovered the error,
> > > and two, that a null pointer deref occurs.
> > >
> > > The latter is most certainly not a locking issue but failure of DPC
> > > to hold a reference on the pci_dev.
> >
> > This is what patch 3/5 proposed to fix.
>
> Please reorder the series to fix the null pointer deref first,
> i.e. move patch 3 before patch 2. If the null pointer deref is
> fixed by patch 3, do not mention it in patch 2.

Makes sense.

Thanks,
Ethan

> Thanks,
>
> Lukas
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 53433b37e181..6f271160f18d 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
 	down_read(&ctrl->reset_lock);
 	if (events & DISABLE_SLOT)
 		pciehp_handle_disable_request(ctrl);
-	else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
+	else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) {
+		pci_wait_port_outdpc(pdev);
 		pciehp_handle_presence_or_link_change(ctrl, events);
+	}
 	up_read(&ctrl->reset_lock);
 	ret = IRQ_HANDLED;
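The shape of a wait such as pci_wait_port_outdpc() (introduced earlier in this series, not in mainline) can be sketched in userspace C: poll a "DPC triggered" status flag until it clears or a retry budget runs out. In the kernel this would read the port's DPC status register and sleep between polls; here the flag and the recovery event are simulated, and all names are illustrative.

```c
#include <stdbool.h>

bool dpc_triggered;	/* models the DPC Trigger Status bit */
int polls_used;		/* instrumentation for the demonstration */

/* Returns true once the port has left DPC (recovery finished), false
 * if the retry budget is exhausted, in which case the caller would
 * treat the device as failed/removed. */
bool wait_port_out_dpc(int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		polls_used++;
		if (!dpc_triggered)
			return true;	/* safe to handle PDC/DLLSC now */
		if (i == 2)
			dpc_triggered = false;	/* simulate recovery completing */
		/* kernel code would msleep() here between polls */
	}
	return false;
}
```

The point of placing such a wait before pciehp_handle_presence_or_link_change() is ordering: the hotplug handler only looks at presence/link state after DPC recovery has reached a final verdict, instead of racing with it.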