[1/2] IB/hfi1: Try slot reset before secondary bus reset

Message ID	1524167784-5911-1-git-send-email-okaya@codeaurora.org (mailing list archive)
State	New, archived
Delegated to:	Bjorn Helgaas
Headers	From: Sinan Kaya <okaya@codeaurora.org>
	To: linux-pci@vger.kernel.org, sulrich@codeaurora.org, timur@codeaurora.org
	Cc: linux-arm-msm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, Sinan Kaya <okaya@codeaurora.org>, Mike Marciniszyn <mike.marciniszyn@intel.com>, Dennis Dalessandro <dennis.dalessandro@intel.com>, Doug Ledford <dledford@redhat.com>, Jason Gunthorpe <jgg@ziepe.ca>, linux-rdma@vger.kernel.org (open list:HFI1 DRIVER), linux-kernel@vger.kernel.org (open list)
	Subject: [PATCH 1/2] IB/hfi1: Try slot reset before secondary bus reset
	Date: Thu, 19 Apr 2018 15:56:23 -0400

Sinan Kaya April 19, 2018, 7:56 p.m. UTC

The InfiniBand adapter might be connected to a PCI hotplug slot. Performing
a secondary bus reset on a hotplug slot causes PCI link up/down interrupts.

The hotplug driver removes the device from the system when a link down
interrupt is observed and re-enumerates it when a link up interrupt is
observed.

This conflicts with what this code is trying to do. Try the secondary bus
reset only if pci_reset_slot() fails or is unsupported.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/infiniband/hw/hfi1/pcie.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Jason Gunthorpe April 19, 2018, 8:26 p.m. UTC | #1

On Thu, Apr 19, 2018 at 03:56:23PM -0400, Sinan Kaya wrote:
> The infiniband adapter might be connected to a PCI hotplug slot. Performing
> secondary bus reset on a hotplug slot causes PCI link up/down interrupts.
> 
> Hotplug driver removes the device from system when a link down interrupt
> is observed and performs re-enumeration when link up interrupt is observed.
> 
> This conflicts with what this code is trying to do. Try secondary bus reset
> only if pci_reset_slot() fails/unsupported.
> 
> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
>  drivers/infiniband/hw/hfi1/pcie.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/hw/hfi1/pcie.c b/drivers/infiniband/hw/hfi1/pcie.c
> index 83d66e8..75f49e3 100644
> +++ b/drivers/infiniband/hw/hfi1/pcie.c
> @@ -908,7 +908,8 @@ static int trigger_sbr(struct hfi1_devdata *dd)

The code above this hunk is:

/*
 * Trigger a secondary bus reset (SBR) on ourselves using our parent.
 *
 * Based on pci_parent_bus_reset() which is not exported by the
 * kernel core.
 */
static int trigger_sbr(struct hfi1_devdata *dd)
{

[..]

This really seems like something the PCI core should be helping with;
drivers shouldn't be doing stuff like this. I get the feeling this
should be a common need if drivers support various error recovery
schemes?

Bjorn, would it be appropriate to export pci_parent_bus_reset() or some
variation thereof?

Thanks,
Jason

Sinan Kaya April 19, 2018, 8:35 p.m. UTC | #2

On 4/19/2018 4:26 PM, Jason Gunthorpe wrote:
> On Thu, Apr 19, 2018 at 03:56:23PM -0400, Sinan Kaya wrote:
>> The infiniband adapter might be connected to a PCI hotplug slot. Performing
>> secondary bus reset on a hotplug slot causes PCI link up/down interrupts.
>>
>> Hotplug driver removes the device from system when a link down interrupt
>> is observed and performs re-enumeration when link up interrupt is observed.
>>
>> This conflicts with what this code is trying to do. Try secondary bus reset
>> only if pci_reset_slot() fails/unsupported.
>>
>> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
>>  drivers/infiniband/hw/hfi1/pcie.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/infiniband/hw/hfi1/pcie.c b/drivers/infiniband/hw/hfi1/pcie.c
>> index 83d66e8..75f49e3 100644
>> +++ b/drivers/infiniband/hw/hfi1/pcie.c
>> @@ -908,7 +908,8 @@ static int trigger_sbr(struct hfi1_devdata *dd)
> 
> The code above this hunk is:
> 
> /*
>  * Trigger a secondary bus reset (SBR) on ourselves using our parent.
>  *
>  * Based on pci_parent_bus_reset() which is not exported by the
>  * kernel core.
>  */
> static int trigger_sbr(struct hfi1_devdata *dd)
> {
> 
> [..]
> 
> This really seems like something the PCI core should be helping with,
> drivers shouldn't be doing stuff like this. I get the feeling this
> should be a common need if drivers support various error recovery
> schemes?

pci_parent_bus_reset() still doesn't deal with hotplug. We need to call
a variation of pci_slot_reset() before calling pci_parent_bus_reset().

	rc = pci_dev_reset_slot_function(dev, 0);
	if (rc != -ENOTTY)
		return rc;
	return pci_parent_bus_reset(dev, 0);

The VFIO driver does this:

			/* User has access, do the reset */
			ret = slot ? pci_try_reset_slot(vdev->pdev->slot) :
				     pci_try_reset_bus(vdev->pdev->bus);

I assumed it is the driver's responsibility to call whichever API suits it.
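
A minimal userspace sketch of the ordering described above: try the slot
reset first and fall back to the bus reset only when the slot reset reports
-ENOTTY. The struct reset_ops type, reset_slot_then_bus() and the stub_*
functions are hypothetical names invented for illustration; they are not
kernel APIs and do not reproduce the kernel's signatures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a reset routine; NOT a kernel signature. */
typedef int (*reset_fn)(void *ctx);

struct reset_ops {
	reset_fn slot_reset;	/* models a hotplug-aware slot reset */
	reset_fn bus_reset;	/* models a secondary bus reset */
};

/*
 * Try the slot reset first. Fall back to the bus reset only when the
 * slot reset reports "not supported" (-ENOTTY). Any other result,
 * success or a real error, is returned to the caller unchanged.
 */
int reset_slot_then_bus(const struct reset_ops *ops, void *ctx)
{
	int rc = -ENOTTY;

	assert(ops != NULL);

	if (ops->slot_reset)
		rc = ops->slot_reset(ctx);
	if (rc != -ENOTTY)
		return rc;
	return ops->bus_reset ? ops->bus_reset(ctx) : -ENOTTY;
}

/* Example stubs (assumptions): ctx points at an int call counter. */
int stub_unsupported(void *ctx) { (void)ctx; return -ENOTTY; }
int stub_ok(void *ctx) { ++*(int *)ctx; return 0; }
int stub_io_error(void *ctx) { (void)ctx; return -EIO; }
```

Note that the hunk in this patch falls back on any non-zero return from
pci_reset_slot(), whereas this sketch follows the -ENOTTY check quoted above
and propagates other errors without attempting the bus reset.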

Bjorn Helgaas April 19, 2018, 9:47 p.m. UTC | #3

[+cc Alex, who might know why DRM drivers have their own PCIe Gen3 code]

On Thu, Apr 19, 2018 at 02:26:32PM -0600, Jason Gunthorpe wrote:
> On Thu, Apr 19, 2018 at 03:56:23PM -0400, Sinan Kaya wrote:
> > The infiniband adapter might be connected to a PCI hotplug slot. Performing
> > secondary bus reset on a hotplug slot causes PCI link up/down interrupts.
> > 
> > Hotplug driver removes the device from system when a link down interrupt
> > is observed and performs re-enumeration when link up interrupt is observed.
> > 
> > This conflicts with what this code is trying to do. Try secondary bus reset
> > only if pci_reset_slot() fails/unsupported.
> > 
> > Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> >  drivers/infiniband/hw/hfi1/pcie.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/infiniband/hw/hfi1/pcie.c b/drivers/infiniband/hw/hfi1/pcie.c
> > index 83d66e8..75f49e3 100644
> > +++ b/drivers/infiniband/hw/hfi1/pcie.c
> > @@ -908,7 +908,8 @@ static int trigger_sbr(struct hfi1_devdata *dd)
> 
> The code above this hunk is:
> 
> /*
>  * Trigger a secondary bus reset (SBR) on ourselves using our parent.
>  *
>  * Based on pci_parent_bus_reset() which is not exported by the
>  * kernel core.
>  */
> static int trigger_sbr(struct hfi1_devdata *dd)
> {
> 
> [..]
> 
> This really seems like something the PCI core should be helping with,
> drivers shouldn't be doing stuff like this. I get the feeling this
> should be a common need if drivers support various error recovery
> schemes?
> 
> Bjorn, would be appropriate to export pci_parent_bus_reset() or some
> variation therin??

I agree it would be really nice if the PCI core could help out somehow
so we could get some of this code out of individual drivers.

In fact, stepping back a few paces, this HFI reset path is part of a
transition to PCIe gen3 signaling, and I'm not sure why *that* is in
the driver either.

There's an ongoing discussion [1] about why this gen3 code is in the
driver.  Several DRM drivers include similar code
(cik_pcie_gen3_enable(), si_pcie_gen3_enable()).

I *thought* the hardware was supposed to automatically negotiate to
the highest rate supported by both sides without any help at all from
software.  But since several drivers have code to do it themselves, I
wonder if I'm missing something, or maybe there's something the PCI
core should be doing that it isn't, and the driver code is basically
working around that PCI core deficiency.

[1] https://lkml.kernel.org/r/20180417171956.GJ28657@bhelgaas-glaptop.roam.corp.google.com

Alex Deucher April 19, 2018, 10:11 p.m. UTC | #4

> -----Original Message-----
> From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> Sent: Thursday, April 19, 2018 5:47 PM
> To: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Sinan Kaya <okaya@codeaurora.org>; Bjorn Helgaas
> <bhelgaas@google.com>; linux-pci@vger.kernel.org;
> sulrich@codeaurora.org; timur@codeaurora.org; linux-arm-
> msm@vger.kernel.org; linux-arm-kernel@lists.infradead.org; Mike
> Marciniszyn <mike.marciniszyn@intel.com>; Dennis Dalessandro
> <dennis.dalessandro@intel.com>; Doug Ledford <dledford@redhat.com>;
> open list:HFI1 DRIVER <linux-rdma@vger.kernel.org>; open list <linux-
> kernel@vger.kernel.org>; Deucher, Alexander
> <Alexander.Deucher@amd.com>
> Subject: Re: [PATCH 1/2] IB/hfi1: Try slot reset before secondary bus reset
> 
> [+cc Alex, who might know why DRM drivers have their own PCIe Gen3
> code]
> 
> On Thu, Apr 19, 2018 at 02:26:32PM -0600, Jason Gunthorpe wrote:
> > On Thu, Apr 19, 2018 at 03:56:23PM -0400, Sinan Kaya wrote:
> > > The infiniband adapter might be connected to a PCI hotplug slot.
> > > Performing secondary bus reset on a hotplug slot causes PCI link
> up/down interrupts.
> > >
> > > Hotplug driver removes the device from system when a link down
> > > interrupt is observed and performs re-enumeration when link up
> interrupt is observed.
> > >
> > > This conflicts with what this code is trying to do. Try secondary
> > > bus reset only if pci_reset_slot() fails/unsupported.
> > >
> > > Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> > > drivers/infiniband/hw/hfi1/pcie.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/infiniband/hw/hfi1/pcie.c
> > > b/drivers/infiniband/hw/hfi1/pcie.c
> > > index 83d66e8..75f49e3 100644
> > > +++ b/drivers/infiniband/hw/hfi1/pcie.c
> > > @@ -908,7 +908,8 @@ static int trigger_sbr(struct hfi1_devdata *dd)
> >
> > The code above this hunk is:
> >
> > /*
> >  * Trigger a secondary bus reset (SBR) on ourselves using our parent.
> >  *
> >  * Based on pci_parent_bus_reset() which is not exported by the
> >  * kernel core.
> >  */
> > static int trigger_sbr(struct hfi1_devdata *dd) {
> >
> > [..]
> >
> > This really seems like something the PCI core should be helping with,
> > drivers shouldn't be doing stuff like this. I get the feeling this
> > should be a common need if drivers support various error recovery
> > schemes?
> >
> > Bjorn, would be appropriate to export pci_parent_bus_reset() or some
> > variation therin??
> 
> I agree it would be really nice if the PCI core could help out somehow so we
> could get some of this code out of individual drivers.
> 
> If fact, stepping back a few paces, this HFI reset path is part of a transition to
> PCIe gen3 signaling, and I'm not sure why *that* is in the driver either.
> 
> There's an ongoing discussion [1] about why this gen3 code is in the driver.
> Several DRM drivers include similar code (cik_pcie_gen3_enable(),
> si_pcie_gen3_enable()).
> 
> I *thought* the hardware was supposed to automatically negotiate to the
> highest rate supported by both sides without any help at all from software.
> But since several drivers have code to do it themselves, I wonder if I'm
> missing something, or maybe there's something the PCI core should be doing
> that it isn't, and the driver code is basically working around that PCI core
> deficiency.

My understanding was that some platforms only bring up the link in gen 1 mode
for compatibility reasons.  TBH, I'm not that familiar with how the links come
up on different platforms.

Alex

Sinan Kaya April 19, 2018, 10:19 p.m. UTC | #5

On 4/19/2018 5:47 PM, Bjorn Helgaas wrote:
>> Bjorn, would be appropriate to export pci_parent_bus_reset() or some
>> variation therin??
> I agree it would be really nice if the PCI core could help out somehow
> so we could get some of this code out of individual drivers.

I can create a function called pci_reset_link() and move both the slot reset
and the secondary bus reset inside it.

Will this work?

Bjorn Helgaas April 20, 2018, 2 p.m. UTC | #6

[+cc Rajat, Alex because of their interest in the reset/hotplug issue]

For context, Sinan's patch is this:

> diff --git a/drivers/infiniband/hw/hfi1/pcie.c b/drivers/infiniband/hw/hfi1/pcie.c
> index 83d66e8..75f49e3 100644
> --- a/drivers/infiniband/hw/hfi1/pcie.c
> +++ b/drivers/infiniband/hw/hfi1/pcie.c
> @@ -908,7 +908,8 @@ static int trigger_sbr(struct hfi1_devdata *dd)
>          * delay after a reset is required.  Per spec requirements,
>          * the link is either working or not after that point.
>          */
> -       pci_reset_bridge_secondary_bus(dev->bus->self);
> +       if (pci_reset_slot(dev->slot))
> +               pci_reset_bridge_secondary_bus(dev->bus->self);

On Thu, Apr 19, 2018 at 06:19:32PM -0400, Sinan Kaya wrote:
> On 4/19/2018 5:47 PM, Bjorn Helgaas wrote:
> >> Bjorn, would be appropriate to export pci_parent_bus_reset() or some
> >> variation therin??
> > I agree it would be really nice if the PCI core could help out somehow
> > so we could get some of this code out of individual drivers.

What I was really thinking here was about the whole Gen3 transition
thing, not the reset thing by itself.

> I can create a function called pci_reset_link() and move both slot and
> secondary bus reset inside.

What exactly is your patch fixing?  Is it the following?

  If the HFI link is not operating at 8GT/s, the driver's .probe()
  method tries to transition it to 8GT/s, which involves resetting the
  HFI device with pci_reset_bridge_secondary_bus().  If the HFI device
  is in a hotplug slot, the reset causes a "Link Down" event, which
  causes the pciehp driver to remove the HFI device and re-enumerate
  it when the link comes back up.

  When pciehp removes the device, it calls the HFI .remove() method,
  which is a problem because the .probe() method is still active.

  It looks like this should deadlock because __device_attach() holds
  the device_lock while calling .probe() and the
  device_release_driver() path tries to acquire it.

Your patch uses pci_reset_slot(), which connects with Rajat's work
(06a8d89af551 ("PCI: pciehp: Disable link notification across slot
reset")) to avoid hotplug events for intentional resets.

So I think I just reverse-engineered the whole rationale for your
patch :)  Sorry about the long detour.

I'm having a hard time articulating my thoughts here.  I think my
concern is that knowledge about this reset/link down/hotplug issue is
leaking out and we'll end up with different reset interfaces that may
or may not result in hotplug events.  This seems like a confusing API
because it's hard to explain which interface a driver should use.

Bjorn
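
A toy userspace model of the interaction described above, offered only as a
sketch. Everything here (struct toy_device and the toy_* functions) is
hypothetical and single-threaded: the remove path uses a try-lock and counts
failures, where the real device_release_driver() path would block on the
device lock instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one device; not a kernel structure. */
struct toy_device {
	bool locked;		/* models the device lock being held */
	bool link_notify;	/* hotplug reacts to link changes when true */
	int remove_blocked;	/* remove attempts that could not take the lock */
};

static bool toy_trylock(struct toy_device *d)
{
	if (d->locked)
		return false;
	d->locked = true;
	return true;
}

/* Models the hotplug handler: on Link Down it wants to unbind the driver. */
static void toy_link_down(struct toy_device *d)
{
	if (!d->link_notify)
		return;
	if (!toy_trylock(d)) {
		d->remove_blocked++;	/* probe still owns the lock */
		return;
	}
	d->locked = false;		/* unbind finished, drop the lock */
}

/* A plain bus reset: the link drops and hotplug is notified. */
void toy_bus_reset(struct toy_device *d)
{
	toy_link_down(d);
}

/* A slot reset: link notification is masked across the reset. */
void toy_slot_reset(struct toy_device *d)
{
	bool saved = d->link_notify;

	d->link_notify = false;
	toy_link_down(d);
	d->link_notify = saved;
}

/* Models .probe() running with the lock held while it resets the device. */
int toy_probe(struct toy_device *d, void (*reset)(struct toy_device *))
{
	assert(d != NULL && reset != NULL);

	d->locked = true;
	reset(d);
	d->locked = false;
	return d->remove_blocked;
}
```

In this model, toy_probe() with toy_bus_reset reports one blocked remove
attempt, while toy_probe() with toy_slot_reset reports none, which is the
difference the patch relies on.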

Dennis Dalessandro April 20, 2018, 2:12 p.m. UTC | #7

On 4/19/2018 6:11 PM, Deucher, Alexander wrote:
>> -----Original Message-----
>> From: Bjorn Helgaas [mailto:helgaas@kernel.org]
>> Sent: Thursday, April 19, 2018 5:47 PM
>> To: Jason Gunthorpe <jgg@ziepe.ca>
>> Cc: Sinan Kaya <okaya@codeaurora.org>; Bjorn Helgaas
>> <bhelgaas@google.com>; linux-pci@vger.kernel.org;
>> sulrich@codeaurora.org; timur@codeaurora.org; linux-arm-
>> msm@vger.kernel.org; linux-arm-kernel@lists.infradead.org; Mike
>> Marciniszyn <mike.marciniszyn@intel.com>; Dennis Dalessandro
>> <dennis.dalessandro@intel.com>; Doug Ledford <dledford@redhat.com>;
>> open list:HFI1 DRIVER <linux-rdma@vger.kernel.org>; open list <linux-
>> kernel@vger.kernel.org>; Deucher, Alexander
>> <Alexander.Deucher@amd.com>
>> Subject: Re: [PATCH 1/2] IB/hfi1: Try slot reset before secondary bus reset
>>
>> [+cc Alex, who might know why DRM drivers have their own PCIe Gen3
>> code]
>>
>> On Thu, Apr 19, 2018 at 02:26:32PM -0600, Jason Gunthorpe wrote:
>>> On Thu, Apr 19, 2018 at 03:56:23PM -0400, Sinan Kaya wrote:
>>>> The infiniband adapter might be connected to a PCI hotplug slot.
>>>> Performing secondary bus reset on a hotplug slot causes PCI link
>> up/down interrupts.
>>>>
>>>> Hotplug driver removes the device from system when a link down
>>>> interrupt is observed and performs re-enumeration when link up
>> interrupt is observed.
>>>>
>>>> This conflicts with what this code is trying to do. Try secondary
>>>> bus reset only if pci_reset_slot() fails/unsupported.
>>>>
>>>> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
>>>> drivers/infiniband/hw/hfi1/pcie.c | 3 ++-
>>>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/infiniband/hw/hfi1/pcie.c
>>>> b/drivers/infiniband/hw/hfi1/pcie.c
>>>> index 83d66e8..75f49e3 100644
>>>> +++ b/drivers/infiniband/hw/hfi1/pcie.c
>>>> @@ -908,7 +908,8 @@ static int trigger_sbr(struct hfi1_devdata *dd)
>>>
>>> The code above this hunk is:
>>>
>>> /*
>>>   * Trigger a secondary bus reset (SBR) on ourselves using our parent.
>>>   *
>>>   * Based on pci_parent_bus_reset() which is not exported by the
>>>   * kernel core.
>>>   */
>>> static int trigger_sbr(struct hfi1_devdata *dd) {
>>>
>>> [..]
>>>
>>> This really seems like something the PCI core should be helping with,
>>> drivers shouldn't be doing stuff like this. I get the feeling this
>>> should be a common need if drivers support various error recovery
>>> schemes?
>>>
>>> Bjorn, would be appropriate to export pci_parent_bus_reset() or some
>>> variation therin??
>>
>> I agree it would be really nice if the PCI core could help out somehow so we
>> could get some of this code out of individual drivers.
>>
>> If fact, stepping back a few paces, this HFI reset path is part of a transition to
>> PCIe gen3 signaling, and I'm not sure why *that* is in the driver either.
>>
>> There's an ongoing discussion [1] about why this gen3 code is in the driver.
>> Several DRM drivers include similar code (cik_pcie_gen3_enable(),
>> si_pcie_gen3_enable()).
>>
>> I *thought* the hardware was supposed to automatically negotiate to the
>> highest rate supported by both sides without any help at all from software.
>> But since several drivers have code to do it themselves, I wonder if I'm
>> missing something, or maybe there's something the PCI core should be doing
>> that it isn't, and the driver code is basically working around that PCI core
>> deficiency.
> 
> My understanding was that some platfoms only bring up the link in gen 1 mode for compatibility reasons.  TBH, I'm not that familiar with how the links come up on different platforms.
> 
> Alex
> 

I'm checking with our HW folks and the guys that wrote the PCI code in 
our driver. I quite frankly just don't recall what all is going on here. 
Let me find out for sure before I give out wrong information.

-Denny

Sinan Kaya April 20, 2018, 2:23 p.m. UTC | #8

On 4/20/2018 10:00 AM, Bjorn Helgaas wrote:
> [+cc Rajat, Alex because of their interest in the reset/hotplug issue]
> 
> For context, Sinan's patch is this:
> 
>> diff --git a/drivers/infiniband/hw/hfi1/pcie.c b/drivers/infiniband/hw/hfi1/pcie.c
>> index 83d66e8..75f49e3 100644
>> --- a/drivers/infiniband/hw/hfi1/pcie.c
>> +++ b/drivers/infiniband/hw/hfi1/pcie.c
>> @@ -908,7 +908,8 @@ static int trigger_sbr(struct hfi1_devdata *dd)
>>          * delay after a reset is required.  Per spec requirements,
>>          * the link is either working or not after that point.
>>          */
>> -       pci_reset_bridge_secondary_bus(dev->bus->self);
>> +       if (pci_reset_slot(dev->slot))
>> +               pci_reset_bridge_secondary_bus(dev->bus->self);
> 
> On Thu, Apr 19, 2018 at 06:19:32PM -0400, Sinan Kaya wrote:
>> On 4/19/2018 5:47 PM, Bjorn Helgaas wrote:
>>>> Bjorn, would be appropriate to export pci_parent_bus_reset() or some
>>>> variation therin??
>>> I agree it would be really nice if the PCI core could help out somehow
>>> so we could get some of this code out of individual drivers.
> 
> What I was really thinking here was about the whole Gen3 transition
> thing, not the reset thing by itself.
> 
>> I can create a function called pci_reset_link() and move both slot and
>> secondary bus reset inside.
> 
> What exactly is your patch fixing?  Is it the following?
> 
>   If the HFI link is not operating at 8GT/s, the driver's .probe()
>   method tries to transition it to 8GT/s, which involves resetting the
>   HFI device with pci_reset_bridge_secondary_bus().  If the HFI device
>   is in a hotplug slot, the reset causes a "Link Down" event, which
>   causes the pciehp driver to remove the HFI device and re-enumerate
>   it when the link comes back up.
> 
>   When pciehp removes the device, it calls the HFI .remove() method,
>   which is a problem because the .probe() method is still active.
> 
>   It looks like this should deadlock because __device_attach() holds
>   the device_lock while calling .probe() and the
>   device_release_driver() path tries to acquire it.
> 
> Your patch uses pci_reset_slot(), which connects with Rajat's work
> (06a8d89af551 ("PCI: pciehp: Disable link notification across slot
> reset")) to avoid hotplug events for intentional resets.
> 
> So I think I just reverse-engineered the whole rationale for your
> patch :)  Sorry about the long detour.

Yes, you are on track. Basically, for all callers in the drivers directory,
I was trying to call the hotplug slot reset, as in patch 1/2 of this series,
before the secondary bus reset.

I learnt about this during our DPC+hotplug interaction thread here:

https://patchwork.kernel.org/project/linux-pci/list/?submitter=77241

Existing issues:
https://marc.info/?l=linux-pci&m=152336615707640&w=2
https://www.spinics.net/lists/linux-pci/msg70614.html


> 
> I'm having a hard time articulating my thoughts here.  I think my
> concern is that knowledge about this reset/link down/hotplug issue is
> leaking out and we'll end up with different reset interfaces that may
> or may not result in hotplug events.  This seems like a confusing API
> because it's hard to explain which interface a driver should use.

I think we should go ahead and combine slot reset and secondary bus
reset into a single API and hide the other ones (pci_reset_slot() and
pci_reset_bridge_secondary_bus()) from external users. This way,
people don't have to query whether the system supports hotplug, as VFIO
does.

Other drivers (AER/IB) that don't do this are broken on hotplug systems today.


> 
> Bjorn
>

Jason Gunthorpe April 20, 2018, 2:54 p.m. UTC | #9

On Thu, Apr 19, 2018 at 04:47:23PM -0500, Bjorn Helgaas wrote:

> I *thought* the hardware was supposed to automatically negotiate to
> the highest rate supported by both sides without any help at all from
> software.  But since several drivers have code to do it themselves, I
> wonder if I'm missing something, or maybe there's something the PCI
> core should be doing that it isn't, and the driver code is basically
> working around that PCI core deficiency.

The HW is supposed to do that, but Gen3 is electrically *hard*.

I'm not surprised that some HW has run into trouble where the driver
might have to be involved to tweak the SERDES. E.g., there is now often
on-device software involved here, and updating the SERDES 'firmware'
may be needed.

Jason

Jason Gunthorpe April 20, 2018, 2:55 p.m. UTC | #10

On Thu, Apr 19, 2018 at 10:11:27PM +0000, Deucher, Alexander wrote:

> My understanding was that some platfoms only bring up the link in
> gen 1 mode for compatibility reasons.  TBH, I'm not that familiar
> with how the links come up on different platforms.

Then the platform firmware or the platform's Linux PCI root complex driver
should perform speed negotiation.

Something like that shouldn't be a driver problem to solve.

The driver should only be involved if it has to alter the device (e.g.
new device firmware or something).

Jason

Alex Williamson April 20, 2018, 3:04 p.m. UTC | #11

On Fri, 20 Apr 2018 09:00:49 -0500
Bjorn Helgaas <helgaas@kernel.org> wrote:

> [+cc Rajat, Alex because of their interest in the reset/hotplug issue]
> 
> For context, Sinan's patch is this:
> 
> > diff --git a/drivers/infiniband/hw/hfi1/pcie.c b/drivers/infiniband/hw/hfi1/pcie.c
> > index 83d66e8..75f49e3 100644
> > --- a/drivers/infiniband/hw/hfi1/pcie.c
> > +++ b/drivers/infiniband/hw/hfi1/pcie.c
> > @@ -908,7 +908,8 @@ static int trigger_sbr(struct hfi1_devdata *dd)
> >          * delay after a reset is required.  Per spec requirements,
> >          * the link is either working or not after that point.
> >          */
> > -       pci_reset_bridge_secondary_bus(dev->bus->self);
> > +       if (pci_reset_slot(dev->slot))
> > +               pci_reset_bridge_secondary_bus(dev->bus->self);  
> 
> On Thu, Apr 19, 2018 at 06:19:32PM -0400, Sinan Kaya wrote:
> > On 4/19/2018 5:47 PM, Bjorn Helgaas wrote:  
> > >> Bjorn, would be appropriate to export pci_parent_bus_reset() or some
> > >> variation therin??  
> > > I agree it would be really nice if the PCI core could help out somehow
> > > so we could get some of this code out of individual drivers.  
> 
> What I was really thinking here was about the whole Gen3 transition
> thing, not the reset thing by itself.
> 
> > I can create a function called pci_reset_link() and move both slot and
> > secondary bus reset inside.  
> 
> What exactly is your patch fixing?  Is it the following?
> 
>   If the HFI link is not operating at 8GT/s, the driver's .probe()
>   method tries to transition it to 8GT/s, which involves resetting the
>   HFI device with pci_reset_bridge_secondary_bus().  If the HFI device
>   is in a hotplug slot, the reset causes a "Link Down" event, which
>   causes the pciehp driver to remove the HFI device and re-enumerate
>   it when the link comes back up.
> 
>   When pciehp removes the device, it calls the HFI .remove() method,
>   which is a problem because the .probe() method is still active.
> 
>   It looks like this should deadlock because __device_attach() holds
>   the device_lock while calling .probe() and the
>   device_release_driver() path tries to acquire it.
> 
> Your patch uses pci_reset_slot(), which connects with Rajat's work
> (06a8d89af551 ("PCI: pciehp: Disable link notification across slot
> reset")) to avoid hotplug events for intentional resets.
> 
> So I think I just reverse-engineered the whole rationale for your
> patch :)  Sorry about the long detour.
> 
> I'm having a hard time articulating my thoughts here.  I think my
> concern is that knowledge about this reset/link down/hotplug issue is
> leaking out and we'll end up with different reset interfaces that may
> or may not result in hotplug events.  This seems like a confusing API
> because it's hard to explain which interface a driver should use.

Is there a concern here about whether the endpoint device driver or the
PCI core really knows better about link retraining?  This reminds me of
my unfinished Pericom quirk [1] (which I need to revisit), where
errata in the PCIe switch require that upstream and downstream links
are balanced (i.e. same link rate) or else enabling ACS results in
packets not properly flowing through the switch.  If an endpoint driver
starts deciding to retrain links, overriding quirks in the PCI core,
then such topology manipulation isn't possible.  Why does the
driver .probe() function think it can retrain at a higher link rate
than the PCI core?  Thanks,

Alex

[1] https://lists.linuxfoundation.org/pipermail/iommu/2016-October/018890.html
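
The "balanced links" constraint mentioned above can be sketched as a small
check over per-link rates. This is a userspace illustration only: enum
link_gen, links_balanced() and balanced_target() are hypothetical names, and
settling on the lowest rate among the links is an assumption; the thread does
not say which rate the actual quirk selects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Link rates by PCIe generation; the numeric values are illustrative. */
enum link_gen { GEN1 = 1, GEN2 = 2, GEN3 = 3 };

/* True when every downstream link runs at the same rate as the upstream link. */
bool links_balanced(enum link_gen upstream,
		    const enum link_gen *downstream, size_t n)
{
	assert(downstream != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (downstream[i] != upstream)
			return false;
	return true;
}

/*
 * Assumed policy: the only rate every link can run at is the lowest
 * rate among them, so use that as the common target.
 */
enum link_gen balanced_target(enum link_gen upstream,
			      const enum link_gen *downstream, size_t n)
{
	enum link_gen target = upstream;

	assert(downstream != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (downstream[i] < target)
			target = downstream[i];
	return target;
}
```

A driver that retrains one downstream link to a higher rate on its own would
make links_balanced() false again, which is the conflict raised above.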

Sinan Kaya April 23, 2018, 5:28 p.m. UTC | #12

On 4/20/2018 11:04 AM, Alex Williamson wrote:
> Is there a concern here about whether the endpoint device driver or the
> PCI core really knows better about link retraining?  This makes me
> remember my unfinished (and need to revisit) Pericom quirk[1] where
> errata in the PCIe switch requires that upstream and downstream links
> are balanced (ie. same link rate) or else enabling ACS results in
> packets not properly flowing through the switch.  If an endpoint driver
> starts deciding to retrain links, overriding quirks in the PCI core,
> then such topology manipulation isn't possible.  Why does the
> driver .probe() function think it can retrain at a higher link rate
> than PCI core?  Thanks,

The example given is a serdes firmware load that happens in probe, followed
by a retrain to reach a better speed.

It becomes a chicken and egg problem. 

1. Endpoint HW trains to gen1 by default pre-boot.
2. PCI core enumerates the device.
3. Endpoint driver gets loaded.
4. Driver does the firmware programming followed by a link retrain.

I think it is the responsibility of the PCI core to provide reset APIs.
However, expecting endpoint drivers to be knowledgeable about hotplug is
too much.

We can certainly contain the AER change within the pci directory by moving
the slot reset function to the drivers/pci.h file.

But, we need to think about what to do about VFIO and other endpoint
initiated reset cases. My suggestion was to move this into a single API and
remove all other APIs from include/linux/pci.h.

Coming back to this patch...

Do we need a retrain API that takes the speed the driver wants to reach?
The API can return what was actually accomplished. The quirk from Alex can
run inside this API to decide what speed we actually want to reach for a
given PCI topology by reprogramming the target link speed field.
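
One possible shape for such an API, sketched in userspace C. The struct
toy_link type, its fields and toy_retrain_link() are hypothetical; no such
interface exists in the thread, and the clamping policy shown is an
assumption rather than a description of PCI core behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of one link; generations are plain integers (1, 2, 3). */
struct toy_link {
	int endpoint_max_gen;	/* highest generation the endpoint supports */
	int port_max_gen;	/* highest generation the upstream port supports */
	int quirk_cap_gen;	/* 0 = no cap, else a limit imposed by a core quirk */
	int current_gen;	/* generation the link currently runs at */
};

/*
 * Ask for a retrain to requested_gen. The request is clamped to what
 * both link partners support and to any quirk-imposed cap, and the
 * generation actually selected is returned to the caller.
 */
int toy_retrain_link(struct toy_link *link, int requested_gen)
{
	int target = requested_gen;

	assert(link != NULL);

	if (target > link->endpoint_max_gen)
		target = link->endpoint_max_gen;
	if (target > link->port_max_gen)
		target = link->port_max_gen;
	if (link->quirk_cap_gen && target > link->quirk_cap_gen)
		target = link->quirk_cap_gen;
	if (target < 1)
		target = 1;	/* gen1 is the floor in this model */

	link->current_gen = target;
	return target;
}
```

With this shape the driver states what it wants and learns what it got,
while topology decisions such as a quirk cap stay inside the core.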

Alex Williamson April 23, 2018, 7:10 p.m. UTC | #13

On Mon, 23 Apr 2018 13:28:22 -0400
Sinan Kaya <okaya@codeaurora.org> wrote:

> On 4/20/2018 11:04 AM, Alex Williamson wrote:
> > Is there a concern here about whether the endpoint device driver or the
> > PCI core really knows better about link retraining?  This makes me
> > remember my unfinished (and need to revisit) Pericom quirk[1] where
> > errata in the PCIe switch requires that upstream and downstream links
> > are balanced (ie. same link rate) or else enabling ACS results in
> > packets not properly flowing through the switch.  If an endpoint driver
> > starts deciding to retrain links, overriding quirks in the PCI core,
> > then such topology manipulation isn't possible.  Why does the
> > driver .probe() function think it can retrain at a higher link rate
> > than PCI core?  Thanks,  
> 
> The example given is for some serdes firmware load to happen in probe and
> then performing a retrain to reach to a better speed.
> 
> It becomes a chicken and egg problem. 
> 
> 1. Endpoint HW trains to gen1 by default pre-boot.
> 2. PCI core enumerates the device.
> 3. Endpoint driver gets loaded
> 4. Driver does the firmware programming followed by a link retrain.
> 
> I think it is the responsibility of the PCI core to provide reset APIs.
> However, expecting endpoint drivers to be knowledgeable about hotplug is
> too much.
> 
> We can certainly contain AER change into pci directory by moving the slot
> reset function to drivers/pci.h file.
> 
> But, we need to think about what to do about VFIO and other endpoint
> initiated reset cases. My suggestion was to move this into a single API and
> remove all other APIs from include/linux/pci.h.

I'm a little confused about the relation between reset and retrain.
AIUI we can retrain the link without any sort of endpoint reset and if
some sort of driver/firmware setup is required on the endpoint to
achieve the target link speed, then I'd think we only want to retrain.
How this is going to work with vfio is an interesting question.  We're
only providing access to the device, not the link to the device.
Multifunction endpoints become a big problem if one function starts
requesting link retraining while another is in use elsewhere.
 
> Coming back to this patch...
> 
> Do we need a retrain API with the speed that driver wants to reach to?
> API can return what was actually accomplished. The quirk from Alex can
> run inside this API to make a decision on what speed do we actually want
> to reach to for a given PCI topology by reprogramming the target link speed
> field.

Yes, I think the core should provide a retraining API; that would also
make the Pericom quirk easier to implement.  We'd want a field/flag on
the pcidev that could be set by quirks to limit the highest target
rate, but it makes sense that this should be an interface provided by,
and a control point for, the PCI core.  Thanks,

Alex
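
A rough user-space sketch of the interface described above, under stated assumptions: `struct model_link_dev`, its `quirk_cap` field and `model_retrain_link()` are hypothetical names, not existing kernel symbols, and the hardware negotiation is reduced to clamping a number. It only shows the shape of "driver asks for a speed, core applies any quirk limit, core reports what was reached":

```c
#include <assert.h>
#include <stddef.h>

/* Link speeds; the values mirror PCIe generations. */
enum link_speed { SPEED_UNSET = 0, SPEED_GEN1 = 1, SPEED_GEN2 = 2, SPEED_GEN3 = 3 };

/* Toy stand-in for a PCI device; every field name is an assumption. */
struct model_link_dev {
	enum link_speed supported;  /* highest speed both link partners support */
	enum link_speed quirk_cap;  /* limit set by a quirk; SPEED_UNSET = no limit */
	enum link_speed cur_speed;  /* speed the link is currently running at */
};

/*
 * Core-owned retrain: the driver states the speed it wants, the core
 * clamps it to what the hardware supports and to any quirk-imposed cap,
 * and returns the speed actually reached.
 */
enum link_speed model_retrain_link(struct model_link_dev *dev,
				   enum link_speed wanted)
{
	enum link_speed target = wanted;

	assert(dev != NULL);
	if (target > dev->supported)
		target = dev->supported;
	if (dev->quirk_cap != SPEED_UNSET && target > dev->quirk_cap)
		target = dev->quirk_cap;

	/* Real hardware would be programmed and retrained at this point. */
	dev->cur_speed = target;
	return target;
}
```

Because the cap lives with the core rather than the driver, a driver asking for gen3 behind a capped port simply gets back a lower speed instead of overriding the quirk.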

Sinan Kaya April 23, 2018, 8:17 p.m. UTC | #14

On 4/23/2018 3:10 PM, Alex Williamson wrote:
>> But, we need to think about what to do about VFIO and other endpoint
>> initiated reset cases. My suggestion was to move this into a single API and
>> remove all other APIs from include/linux/pci.h.
> I'm a little confused about the relation between reset and retrain.
> AIUI we can retrain the link without any sort of endpoint reset and if
> some sort of driver/firmware setup is required on the endpoint to
> achieve the target link speed, then I'd think we only want to retrain.

I'm guessing at why you might want a secondary bus reset instead of just
setting the Retrain Link bit in the Link Control register of the PCI Express
capability...

The maximum link speed is embedded into the TS1s that are exchanged during
initial link training.

If the device only advertises gen1 during boot, then no matter what you do
with the retrain bit, the link may not reach gen3.

A lot of guessing here.

Bjorn Helgaas April 23, 2018, 8:23 p.m. UTC | #15

On Mon, Apr 23, 2018 at 01:10:44PM -0600, Alex Williamson wrote:
> On Mon, 23 Apr 2018 13:28:22 -0400
> Sinan Kaya <okaya@codeaurora.org> wrote:
> 
> > On 4/20/2018 11:04 AM, Alex Williamson wrote:
> > > Is there a concern here about whether the endpoint device driver or the
> > > PCI core really knows better about link retraining?  This makes me
> > > remember my unfinished (and need to revisit) Pericom quirk[1] where
> > > errata in the PCIe switch requires that upstream and downstream links
> > > are balanced (ie. same link rate) or else enabling ACS results in
> > > packets not properly flowing through the switch.  If an endpoint driver
> > > starts deciding to retrain links, overriding quirks in the PCI core,
> > > then such topology manipulation isn't possible.  Why does the
> > > driver .probe() function think it can retrain at a higher link rate
> > > than PCI core?  Thanks,  
> > 
> > The example given is for some serdes firmware load to happen in probe and
> > then performing a retrain to reach to a better speed.
> > 
> > It becomes a chicken and egg problem. 
> > 
> > 1. Endpoint HW trains to gen1 by default pre-boot.
> > 2. PCI core enumerates the device.
> > 3. Endpoint driver gets loaded
> > 4. Driver does the firmware programming followed by a link retrain.
> > 
> > I think it is the responsibility of the PCI core to provide reset APIs.
> > However, expecting endpoint drivers to be knowledgeable about hotplug is
> > too much.
> > 
> > We can certainly contain AER change into pci directory by moving the slot
> > reset function to drivers/pci.h file.
> > 
> > But, we need to think about what to do about VFIO and other endpoint
> > initiated reset cases. My suggestion was to move this into a single API and
> > remove all other APIs from include/linux/pci.h.
> 
> I'm a little confused about the relation between reset and retrain.
> AIUI we can retrain the link without any sort of endpoint reset and if
> some sort of driver/firmware setup is required on the endpoint to
> achieve the target link speed, then I'd think we only want to retrain.

In hfi1, do_pcie_gen3_transition() resets the device.  I don't know if
retraining the link would be sufficient; maybe the reset is required
to make the device use the new firmware.  I guess we already export
reset interfaces, so if we add a retrain interface, drivers could
choose what they need.

> How this is going to work with vfio is an interesting question.  We're
> only providing access to the device, not the link to the device.

> Multifunction endpoints become a big problem if one function starts
> requesting link retraining while another is in use elsewhere.

Can we just make it the driver's problem by returning -EPERM if one
function requests a retrain while another function is in use?

Bjorn
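
A minimal sketch of the -EPERM idea, assuming a toy model in which each function of a multifunction endpoint carries an `in_use` flag. `struct model_func` and `model_request_retrain()` are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one function of a multifunction endpoint. */
struct model_func {
	int id;       /* function number */
	bool in_use;  /* bound to a driver or assigned to a user */
};

/*
 * Refuse to retrain the shared link if any sibling function is in use.
 * Returns 0 when the retrain may proceed, -EPERM otherwise.
 */
int model_request_retrain(const struct model_func *funcs, size_t nfuncs,
			  int requester)
{
	size_t i;

	assert(funcs != NULL);
	for (i = 0; i < nfuncs; i++) {
		if (funcs[i].id != requester && funcs[i].in_use)
			return -EPERM;
	}
	return 0;
}
```

The requester itself is skipped in the loop, so a single-function device, or one whose siblings are all idle, can always retrain.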

Alex Williamson April 23, 2018, 8:41 p.m. UTC | #16

On Mon, 23 Apr 2018 15:23:11 -0500
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Mon, Apr 23, 2018 at 01:10:44PM -0600, Alex Williamson wrote:
> > On Mon, 23 Apr 2018 13:28:22 -0400
> > Sinan Kaya <okaya@codeaurora.org> wrote:
> >   
> > > On 4/20/2018 11:04 AM, Alex Williamson wrote:  
> > > > Is there a concern here about whether the endpoint device driver or the
> > > > PCI core really knows better about link retraining?  This makes me
> > > > remember my unfinished (and need to revisit) Pericom quirk[1] where
> > > > errata in the PCIe switch requires that upstream and downstream links
> > > > are balanced (ie. same link rate) or else enabling ACS results in
> > > > packets not properly flowing through the switch.  If an endpoint driver
> > > > starts deciding to retrain links, overriding quirks in the PCI core,
> > > > then such topology manipulation isn't possible.  Why does the
> > > > driver .probe() function think it can retrain at a higher link rate
> > > > than PCI core?  Thanks,    
> > > 
> > > The example given is for some serdes firmware load to happen in probe and
> > > then performing a retrain to reach to a better speed.
> > > 
> > > It becomes a chicken and egg problem. 
> > > 
> > > 1. Endpoint HW trains to gen1 by default pre-boot.
> > > 2. PCI core enumerates the device.
> > > 3. Endpoint driver gets loaded
> > > 4. Driver does the firmware programming followed by a link retrain.
> > > 
> > > I think it is the responsibility of the PCI core to provide reset APIs.
> > > However, expecting endpoint drivers to be knowledgeable about hotplug is
> > > too much.
> > > 
> > > We can certainly contain AER change into pci directory by moving the slot
> > > reset function to drivers/pci.h file.
> > > 
> > > But, we need to think about what to do about VFIO and other endpoint
> > > initiated reset cases. My suggestion was to move this into a single API and
> > > remove all other APIs from include/linux/pci.h.  
> > 
> > I'm a little confused about the relation between reset and retrain.
> > AIUI we can retrain the link without any sort of endpoint reset and if
> > some sort of driver/firmware setup is required on the endpoint to
> > achieve the target link speed, then I'd think we only want to retrain.  
> 
> In hfi1, do_pcie_gen3_transition() resets the device.  I don't know if
> retraining the link would be sufficient; maybe the reset is required
> to make the device use the new firmware.  I guess we already export
> reset interfaces, so if we add a retrain interface, drivers could
> choose what they need.
> 
> > How this is going to work with vfio is an interesting question.  We're
> > only providing access to the device, not the link to the device.  
> 
> > Multifunction endpoints become a big problem if one function starts
> > requesting link retraining while another is in use elsewhere.  
> 
> Can we just make it the driver's problem by returning -EPERM if one
> function requests a retrain while another function is in use?

Yes, this is basically how we handle bus resets through vfio-pci, if
there are multiple devices affected by a bus reset, slots or functions,
the caller needs to prove ownership of all of them before we do a bus
reset.  That's great for video cards where we have the GPU and audio
functions typically assigned together anyway, but I don't know that
it's a great solution for IB where you're more likely to want to assign
separate functions to separate users and nobody can get the device to
run at full speed unless all the functions are assigned to one user.
Thanks,

Alex

Sinan Kaya April 25, 2018, 2:13 p.m. UTC | #17

On 4/23/2018 3:10 PM, Alex Williamson wrote:
>> Do we need a retrain API with the speed that driver wants to reach to?
>> API can return what was actually accomplished. The quirk from Alex can
>> run inside this API to make a decision on what speed do we actually want
>> to reach to for a given PCI topology by reprogramming the target link speed
>> field.
> Yes, I think the core should provide a retraining API, that would also
> make the Pericom quirk easier to implement.  We'd want a field/flag on
> the pcidev that could be set by quirks to limit the highest target
> rate, but it makes sense that this should be an interface provided by
> and control point for the PCI core.  Thanks,

Maybe we could live with the existing secondary bus reset API. The speed that
the driver wants to reach is in the target speed field. A quirk can hook into
the secondary bus reset call, read the target speed field to see what speed
we are trying to negotiate, and override the target speed register based on
the upstream and downstream speeds.
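
One way to picture this, as a user-space sketch only: `struct model_port`, `model_quirk_balance_speed()` and `model_reset_secondary_bus()` are hypothetical names, and the balancing rule (force the downstream target to match the upstream link) is an assumption drawn from the Pericom description earlier in the thread:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a switch port; field names are hypothetical. */
struct model_port {
	int upstream_speed;  /* current speed of the upstream link (1 = gen1, ...) */
	int target_speed;    /* value the driver wrote to the target speed field */
};

/*
 * Quirk hook run from the bus reset path: read the target speed the
 * driver asked for and, if it differs from the upstream link, override
 * it so both links stay balanced. Returns the speed that will be used.
 */
int model_quirk_balance_speed(struct model_port *port)
{
	assert(port != NULL);
	if (port->target_speed != port->upstream_speed)
		port->target_speed = port->upstream_speed;
	return port->target_speed;
}

/* Bus reset wrapper: apply the quirk first, then reset and retrain. */
int model_reset_secondary_bus(struct model_port *port)
{
	int speed = model_quirk_balance_speed(port);

	/* The actual secondary bus reset and link training would happen here. */
	return speed;
}
```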

Bjorn Helgaas June 19, 2018, 9:43 p.m. UTC | #18

On Thu, Apr 19, 2018 at 03:56:23PM -0400, Sinan Kaya wrote:
> The infiniband adapter might be connected to a PCI hotplug slot. Performing
> secondary bus reset on a hotplug slot causes PCI link up/down interrupts.
> 
> Hotplug driver removes the device from system when a link down interrupt
> is observed and performs re-enumeration when link up interrupt is observed.
> 
> This conflicts with what this code is trying to do. Try secondary bus reset
> only if pci_reset_slot() fails/unsupported.

Hi Sinan,

We had a bunch of discussion here but not sure we ever reached a
consensus.  It did seem like we'd like to avoid putting hotplug
knowledge in drivers, though.  What do you think the path forward is?

> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> ---
>  drivers/infiniband/hw/hfi1/pcie.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/hw/hfi1/pcie.c b/drivers/infiniband/hw/hfi1/pcie.c
> index 83d66e8..75f49e3 100644
> --- a/drivers/infiniband/hw/hfi1/pcie.c
> +++ b/drivers/infiniband/hw/hfi1/pcie.c
> @@ -908,7 +908,8 @@ static int trigger_sbr(struct hfi1_devdata *dd)
>  	 * delay after a reset is required.  Per spec requirements,
>  	 * the link is either working or not after that point.
>  	 */
> -	pci_reset_bridge_secondary_bus(dev->bus->self);
> +	if (pci_reset_slot(dev->slot))
> +		pci_reset_bridge_secondary_bus(dev->bus->self);
>  
>  	return 0;
>  }
> -- 
> 2.7.4
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

Sinan Kaya June 19, 2018, 10:21 p.m. UTC | #19

On 6/19/2018 5:43 PM, Bjorn Helgaas wrote:
>> Hotplug driver removes the device from system when a link down interrupt
>> is observed and performs re-enumeration when link up interrupt is observed.
>>
>> This conflicts with what this code is trying to do. Try secondary bus reset
>> only if pci_reset_slot() fails/unsupported.
> Hi Sinan,
> 
> We had a bunch of discussion here but not sure we ever reached a
> consensus.  It did seem like we'd like to avoid putting hotplug
> knowledge in drivers, though.  What do you think the path forward is?
> 

There are multiple issues. 

We discussed the need for a retrain API on this thread. However, a retrain API
may not be enough for this particular usage, as the device might need a full
link training sequence following firmware load, including a hot reset message.
I don't think we can remove the bus reset usage in this code.

I'd like to start with a small one to address your comment here.

"I think my concern is that knowledge about this reset/link down/hotplug issue is
leaking out and we'll end up with different reset interfaces that may
or may not result in hotplug events.  This seems like a confusing API
because it's hard to explain which interface a driver should use."

I'm thinking of removing the pci_reset_slot() and pci_try_reset_slot() functions
as exported APIs and folding them into the pci_reset_bus() and
pci_try_reset_bus() APIs respectively.

This is what happens when you inject a fatal error into a hotplug slot. Note the
multiple link up/down messages.

/ # [  333.699731] pcieport 0001:00:00.0: AER: Uncorrected (Fatal) error received: id=0000
[  333.748116] pcieport 0001:00:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id
[  333.816044] pcieport 0001:00:00.0:   device [17cb:0404] error status/mask=00040000/00400000
[  333.871581] pcieport 0001:00:00.0:    [18] Malformed TLP          (First)
[  333.914675] pcieport 0001:00:00.0:   TLP Header: 40000001 000000ff 00000000 00000000
[  333.968397] pciehp 0001:00:00.0:pcie004: Slot(2): Link Down
[  334.043234] iommu: Removing device 0001:01:00.4 from group 0
[  334.095952] iommu: Removing device 0001:01:00.3 from group 0
[  334.144644] iommu: Removing device 0001:01:00.2 from group 0
[  334.194653] iommu: Removing device 0001:01:00.1 from group 0
[  334.243564] pciehp 0001:00:00.0:pcie004: Slot(2): Link Up
[  334.282556] iommu: Removing device 0001:01:00.0 from group 0
[  334.330994] pciehp 0001:00:00.0:pcie004: Slot(2): Link Up event queued; currently getting powered off
[  334.890587] pciehp 0001:00:00.0:pcie004: Timeout on hotplug command 0x13f1 (issued 282900 msec ago)
[  335.070190] pciehp 0001:00:00.0:pcie004: Slot(2): Link Down
[  335.106960] pciehp 0001:00:00.0:pcie004: Slot(2): Link Down event queued; currently getting powered on
[  335.191119] pcieport 0001:00:00.0: AER: Device recovery failed
[  346.590153] pciehp 0001:00:00.0:pcie004: Timeout on hotplug command 0x17f1 (issued 10250 msec ago)

As a suggestion:

1. If the device belongs to a slot, do slot reset. 
2. Otherwise, do bus reset.
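
The two steps above can be sketched as a single entry point. This is a user-space model, not kernel code: `struct model_slot_dev` and the `model_*` functions are hypothetical, and the `-ENOTTY` return for "no slot reset available" is chosen for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum reset_kind { RESET_NONE = 0, RESET_SLOT, RESET_BUS };

/* Toy device; has_slot and slot_reset_ok are hypothetical stand-ins. */
struct model_slot_dev {
	bool has_slot;               /* device sits in a (hotplug) slot */
	bool slot_reset_ok;          /* the slot has its own reset mechanism */
	enum reset_kind last_reset;  /* which reset was performed last */
};

static int model_reset_slot(struct model_slot_dev *dev)
{
	if (!dev->has_slot || !dev->slot_reset_ok)
		return -ENOTTY;  /* slot reset not available */
	dev->last_reset = RESET_SLOT;
	return 0;
}

static int model_reset_bus(struct model_slot_dev *dev)
{
	dev->last_reset = RESET_BUS;
	return 0;
}

/* Single entry point: prefer the slot reset, fall back to the bus reset. */
int model_reset_slot_or_bus(struct model_slot_dev *dev)
{
	assert(dev != NULL);
	if (model_reset_slot(dev) == 0)
		return 0;
	return model_reset_bus(dev);
}
```

With one entry point, a caller never has to know whether the device sits behind a hotplug slot.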

Since Oza's DPC/AER patch to refactor fatal error handling, both the hotplug
driver and the AER/DPC driver will try to remove devices and perform
enumeration on link events/AER events.

Without a change, this is a perfect environment for a race condition.

Bjorn Helgaas June 22, 2018, 2:01 p.m. UTC | #20

[+cc Alex]

On Tue, Jun 19, 2018 at 06:21:43PM -0400, Sinan Kaya wrote:
> On 6/19/2018 5:43 PM, Bjorn Helgaas wrote:
> >> Hotplug driver removes the device from system when a link down interrupt
> >> is observed and performs re-enumeration when link up interrupt is observed.
> >>
> >> This conflicts with what this code is trying to do. Try secondary bus reset
> >> only if pci_reset_slot() fails/unsupported.
> > Hi Sinan,
> > 
> > We had a bunch of discussion here but not sure we ever reached a
> > consensus.  It did seem like we'd like to avoid putting hotplug
> > knowledge in drivers, though.  What do you think the path forward is?
> > 
> 
> There are multiple issues. 
> 
> We discussed the need for a retrain API on this thread. However,
> retrain API may not be enough for this particular usage as the
> device might need a full link training sequence following firmware
> load including a hot reset message. I don't think we can remove the
> bus reset usage in this code.
> 
> I'd like to start with a small one to address your comment here.
> 
> "I think my concern is that knowledge about this reset/link
> down/hotplug issue is leaking out and we'll end up with different
> reset interfaces that may or may not result in hotplug events.  This
> seems like a confusing API because it's hard to explain which
> interface a driver should use."
> 
> I'm thinking of removing pci_reset_slot() and pci_try_reset_slot()
> functions as an exported API and fold them into pci_reset_bus() and
> pci_try_reset_bus() API respectively.

pci_try_reset_slot() and pci_try_reset_bus() are both used by VFIO,
but I don't see any callers of either pci_reset_slot() or
pci_reset_bus().

I suspect what happened was that we added pci_reset_slot(), used it
in VFIO, found a deadlock, added pci_try_reset_slot(), and converted
VFIO to use pci_try_reset_slot() to fix the deadlock, leaving no
callers of pci_reset_slot() itself.

  090a3c5322e9 ("PCI: Add pci_reset_slot() and pci_reset_bus()")
  8b27ee60bfd6 ("vfio-pci: PCI hot reset interface")
  61cf16d8bd38 ("PCI: Add pci_try_reset_function(), pci_try_reset_slot(), pci_try_reset_bus()")
  890ed578df82 ("vfio-pci: Use pci "try" reset interface")

I *think* pci_try_reset_slot() is already equivalent to pci_reset_slot()
except that it returns -EAGAIN if it can't lock the slot.  But if you
remove pci_reset_slot(), you could rename pci_try_reset_slot() to
pci_reset_slot().  It doesn't seem like keeping "try" in the function
name would be necessary.
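
For illustration, the "try" behavior described above can be modeled in user space as follows. `struct model_slot` and `model_try_reset_slot()` are hypothetical, and the lock is a plain flag in a single-threaded model rather than a real mutex:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy slot with a non-blocking lock; names are hypothetical. */
struct model_slot {
	bool locked;      /* true while some other path holds the slot */
	int reset_count;  /* number of resets performed so far */
};

/*
 * "try" flavour of a slot reset: it never waits for the lock.
 * Returns -EAGAIN if the slot is busy, 0 after a successful reset.
 */
int model_try_reset_slot(struct model_slot *slot)
{
	assert(slot != NULL);
	if (slot->locked)
		return -EAGAIN;

	slot->locked = true;   /* take the lock */
	slot->reset_count++;   /* the actual reset would happen here */
	slot->locked = false;  /* release the lock */
	return 0;
}
```

A caller that gets -EAGAIN can back off and retry instead of blocking while holding other locks, which is the property a blocking variant lacks.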

> This is what happens when you insert a fatal error to a hotplug
> slot. See multiple link up/down messages.
> 
> / # [  333.699731] pcieport 0001:00:00.0: AER: Uncorrected (Fatal) error received: id=0000
> [  333.748116] pcieport 0001:00:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id
> [  333.816044] pcieport 0001:00:00.0:   device [17cb:0404] error status/mask=00040000/00400000
> [  333.871581] pcieport 0001:00:00.0:    [18] Malformed TLP          (First)
> [  333.914675] pcieport 0001:00:00.0:   TLP Header: 40000001 000000ff 00000000 00000000
> [  333.968397] pciehp 0001:00:00.0:pcie004: Slot(2): Link Down
> [  334.043234] iommu: Removing device 0001:01:00.4 from group 0
> [  334.095952] iommu: Removing device 0001:01:00.3 from group 0
> [  334.144644] iommu: Removing device 0001:01:00.2 from group 0
> [  334.194653] iommu: Removing device 0001:01:00.1 from group 0
> [  334.243564] pciehp 0001:00:00.0:pcie004: Slot(2): Link Up
> [  334.282556] iommu: Removing device 0001:01:00.0 from group 0
> [  334.330994] pciehp 0001:00:00.0:pcie004: Slot(2): Link Up event queued; currently getting powered off
> [  334.890587] pciehp 0001:00:00.0:pcie004: Timeout on hotplug command 0x13f1 (issued 282900 msec ago)
> [  335.070190] pciehp 0001:00:00.0:pcie004: Slot(2): Link Down
> [  335.106960] pciehp 0001:00:00.0:pcie004: Slot(2): Link Down event queued; currently getting powered on
> [  335.191119] pcieport 0001:00:00.0: AER: Device recovery failed
> [  346.590153] pciehp 0001:00:00.0:pcie004: Timeout on hotplug command 0x17f1 (issued 10250 msec ago)
> 
> As a suggestion:
> 
> 1. If the device belongs to a slot, do slot reset. 
> 2. Otherwise, do bus reset.

I assume this refers to pci_try_reset_slot() and pci_try_reset_bus(),
which are only used by VFIO in vfio_pci_ioctl() and
vfio_pci_try_bus_reset().

Both of those callers use pci_probe_reset_slot() to decide whether to
use pci_try_reset_slot() or pci_try_reset_bus().  If you're suggesting
to pull that slot/bus distinction into the PCI core somehow, that
would be fine with me, although VFIO does use the
pci_probe_reset_slot() result for other purposes in those functions.

> Since Oza's DPC/AER patch to refactor fatal error handling, both
> hotplug driver and AER/DPC driver will try removing devices and
> perform enumeration on link events/AER events.
> 
> Perfect environment for race condition without a change.

Yeah, this looks like a bit of a mess.  I guess we're getting two
interrupts (AER interrupt and hotplug interrupt) and we should
coordinate their handling somehow.  I don't have a proposal.  This
race could happen independent of the device reset paths, of course.

Bjorn

Sinan Kaya June 22, 2018, 4:04 p.m. UTC | #21

On 6/22/2018 10:01 AM, Bjorn Helgaas wrote:
>> Since Oza's DPC/AER patch to refactor fatal error handling, both
>> hotplug driver and AER/DPC driver will try removing devices and
>> perform enumeration on link events/AER events.
>>
>> Perfect environment for race condition without a change.
> Yeah, this looks like a bit of a mess.  I guess we're getting two
> interrupts (AER interrupt and hotplug interrupt) and we should
> coordinate their handling somehow.  I don't have a proposal.  This
> race could happen independent of the device reset paths, of course.

I was hoping pci_reset_slot() would avoid this, but it turns out the root
port does not have a slot pointer set on my system even though hotplug is
enabled.

The slot pointer is only set for the child objects.

https://patchwork.kernel.org/patch/10351515/

 * A PCI bus may host multiple slots, each slot may support a reset mechanism
 * independent of other slots.  For instance, some slots may support slot power
 * control.  In the case of a 1:1 bus to slot architecture, this function may
 * wrap the bus reset to avoid spurious slot related events such as hotplug.
 * Generally a slot reset should be attempted before a bus reset.  All of the
 * function of the slot and any subordinate buses behind the slot are reset
 * through this function.  PCI config space of all devices in the slot and
 * behind the slot is saved before and restored after reset.
