[RFC] PCI/AER: Block runtime suspend when handling errors

Message ID	20240123090016.281252-1-stanislaw.gruszka@linux.intel.com (mailing list archive)
State	Handled Elsewhere, archived
Headers	show Received: from mgamail.intel.com (mgamail.intel.com [192.55.52.120]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1D32F51C54; Tue, 23 Jan 2024 09:00:30 +0000 (UTC) From: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> To: Bjorn Helgaas <helgaas@kernel.org>, "Rafael J. Wysocki" <rafael@kernel.org> Cc: linux-pci@vger.kernel.org, linux-pm@vger.kernel.org, Mahesh J Salgaonkar <mahesh@linux.ibm.com>, Oliver O'Halloran <oohall@gmail.com>, Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>, Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Subject: [RFC] PCI/AER: Block runtime suspend when handling errors Date: Tue, 23 Jan 2024 10:00:16 +0100 Message-Id: <20240123090016.281252-1-stanislaw.gruszka@linux.intel.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[RFC] PCI/AER: Block runtime suspend when handling errors \| expand [RFC] PCI/AER: Block runtime suspend when handling errors

Stanislaw Gruszka Jan. 23, 2024, 9 a.m. UTC

PM runtime can be done simultaneously with AER error handling.
Avoid that by using pm_runtime_get_sync() just after pci_dev_get()
and pm_runtime_put() just before pci_dev_put() in AER recovery
procedures.

I'm not sure about DPC case since I do not see get/put there. It
just call pci_do_recovery() from threaded irq dcd_handler().
I think pm_runtime* should be added to this handler as well.

pm_runtime_get_sync() will increase dev->power.usage_count counter to
prevent any rpm actives. When there is suspending pending, it will wait
for it and do the rpm resume. Not sure if that problem, on my testing
I did not encounter issues with that.

I tested with igc device by doing simultaneous aer_inject and 
rpm suspend/resume via /sys/bus/pci/devices/PCI_ID/power/control
and can reproduce: 

[  853.253938] igc 0000:02:00.0: not ready 65535ms after bus reset; giving up
[  853.253973] pcieport 0000:00:1c.2: AER: Root Port link has been reset (-25)
[  853.253996] pcieport 0000:00:1c.2: AER: subordinate device reset failed
[  853.254099] pcieport 0000:00:1c.2: AER: device recovery failed
[  853.254178] igc 0000:02:00.0: Unable to change power state from D3hot to D0, device inaccessible

The problem disappears when applied this patch.

Signed-off-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
---
 drivers/pci/pcie/aer.c | 8 ++++++++
 drivers/pci/pcie/edr.c | 3 +++
 2 files changed, 11 insertions(+)

Bjorn Helgaas Jan. 30, 2024, 12:14 a.m. UTC | #1

On Tue, Jan 23, 2024 at 10:00:16AM +0100, Stanislaw Gruszka wrote:
> PM runtime can be done simultaneously with AER error handling.
> Avoid that by using pm_runtime_get_sync() just after pci_dev_get()
> and pm_runtime_put() just before pci_dev_put() in AER recovery
> procedures.

I guess there must be a general rule here, like "PCI core must use
pm_runtime_get_sync() whenever touching the device asynchronously,
i.e., when it's doing something unrelated to a call from the driver"?

Probably would apply to all subsystem cores, not just PCI.

> I'm not sure about DPC case since I do not see get/put there. It
> just call pci_do_recovery() from threaded irq dcd_handler().
> I think pm_runtime* should be added to this handler as well.

s/dcd_handler/dpc_handler/

I'm guessing the "threaded" part really doesn't matter; just the fact
that this is in response to an interrupt, not something directly
called by a driver?

> pm_runtime_get_sync() will increase dev->power.usage_count counter to
> prevent any rpm actives. When there is suspending pending, it will wait
> for it and do the rpm resume. Not sure if that problem, on my testing
> I did not encounter issues with that.

Sorry, I didn't catch your meaning here.  IIUC, you can reproduce the
problem with the simultaneous aer_inject and rpm suspend/resume, and
this patch fixes it.

But there's some other scenario where you *don't* see the problem?

> I tested with igc device by doing simultaneous aer_inject and 
> rpm suspend/resume via /sys/bus/pci/devices/PCI_ID/power/control
> and can reproduce: 
> 
> [  853.253938] igc 0000:02:00.0: not ready 65535ms after bus reset; giving up
> [  853.253973] pcieport 0000:00:1c.2: AER: Root Port link has been reset (-25)
> [  853.253996] pcieport 0000:00:1c.2: AER: subordinate device reset failed
> [  853.254099] pcieport 0000:00:1c.2: AER: device recovery failed
> [  853.254178] igc 0000:02:00.0: Unable to change power state from D3hot to D0, device inaccessible

Drop the timestamps; they don't add to understanding the problem.

> The problem disappears when applied this patch.
> 
> Signed-off-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
> ---
>  drivers/pci/pcie/aer.c | 8 ++++++++
>  drivers/pci/pcie/edr.c | 3 +++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 42a3bd35a3e1..9b56460edc76 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -23,6 +23,7 @@
>  #include <linux/kernel.h>
>  #include <linux/errno.h>
>  #include <linux/pm.h>
> +#include <linux/pm_runtime.h>
>  #include <linux/init.h>
>  #include <linux/interrupt.h>
>  #include <linux/delay.h>
> @@ -813,6 +814,7 @@ static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
>  {
>  	if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
>  		e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
> +		pm_runtime_get_sync(&dev->dev);
>  		e_info->error_dev_num++;
>  		return 0;
>  	}
> @@ -1111,6 +1113,8 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>  {
>  	cxl_rch_handle_error(dev, info);
>  	pci_aer_handle_error(dev, info);
> +
> +	pm_runtime_put(&dev->dev);
>  	pci_dev_put(dev);
>  }
>  
> @@ -1143,6 +1147,8 @@ static void aer_recover_work_func(struct work_struct *work)
>  			       PCI_SLOT(entry.devfn), PCI_FUNC(entry.devfn));
>  			continue;
>  		}
> +		pm_runtime_get_sync(&pdev->dev);
> +
>  		pci_print_aer(pdev, entry.severity, entry.regs);
>  		/*
>  		 * Memory for aer_capability_regs(entry.regs) is being allocated from the
> @@ -1159,6 +1165,8 @@ static void aer_recover_work_func(struct work_struct *work)
>  		else if (entry.severity == AER_FATAL)
>  			pcie_do_recovery(pdev, pci_channel_io_frozen,
>  					 aer_root_reset);
> +
> +		pm_runtime_put(&pdev->dev);
>  		pci_dev_put(pdev);
>  	}
>  }
> diff --git a/drivers/pci/pcie/edr.c b/drivers/pci/pcie/edr.c
> index 5f4914d313a1..bd96babd7249 100644
> --- a/drivers/pci/pcie/edr.c
> +++ b/drivers/pci/pcie/edr.c
> @@ -10,6 +10,7 @@
>  
>  #include <linux/pci.h>
>  #include <linux/pci-acpi.h>
> +#include <linux/pm_runtime.h>
>  
>  #include "portdrv.h"
>  #include "../pci.h"
> @@ -169,6 +170,7 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
>  		return;
>  	}
>  
> +	pm_runtime_get_sync(&edev->dev);
>  	pci_dbg(pdev, "Reported EDR dev: %s\n", pci_name(edev));
>  
>  	/* If port does not support DPC, just send the OST */
> @@ -209,6 +211,7 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
>  		acpi_send_edr_status(pdev, edev, EDR_OST_FAILED);
>  	}
>  
> +	pm_runtime_put(&edev->dev);
>  	pci_dev_put(edev);
>  }
>  
> -- 
> 2.34.1
>

Stanislaw Gruszka Feb. 1, 2024, 9:36 a.m. UTC | #2

On Mon, Jan 29, 2024 at 06:14:36PM -0600, Bjorn Helgaas wrote:
> On Tue, Jan 23, 2024 at 10:00:16AM +0100, Stanislaw Gruszka wrote:
> > PM runtime can be done simultaneously with AER error handling.
> > Avoid that by using pm_runtime_get_sync() just after pci_dev_get()
> > and pm_runtime_put() just before pci_dev_put() in AER recovery
> > procedures.
> 
> I guess there must be a general rule here, like "PCI core must use
> pm_runtime_get_sync() whenever touching the device asynchronously,
> i.e., when it's doing something unrelated to a call from the driver"?

Clear rules would be nice, that's for sure.

> Probably would apply to all subsystem cores, not just PCI.

If they have something similar like AER.

> > I'm not sure about DPC case since I do not see get/put there. It
> > just call pci_do_recovery() from threaded irq dcd_handler().
> > I think pm_runtime* should be added to this handler as well.
> 
> s/dcd_handler/dpc_handler/
>
> I'm guessing the "threaded" part really doesn't matter; just the fact
> that this is in response to an interrupt, not something directly
> called by a driver?

Yes. I added "threaded" just to emphasis that it's safe to add sleeping
functions there. In context of possible solution, not related to
the problem itself.
 
> > pm_runtime_get_sync() will increase dev->power.usage_count counter to
> > prevent any rpm actives. When there is suspending pending, it will wait
> > for it and do the rpm resume. Not sure if that problem, on my testing
> > I did not encounter issues with that.
> 
> Sorry, I didn't catch your meaning here.
I tired to write two things:

First, pm_runtime_get_sync() after exit prevents possibility of
runtime suspend, so we are sure device will not be powered off

Second, if during pm_runtime_get_sync(), there is pending attempt
to suspend device, it will be synchronized and device will
be resumed. This can be problematic as device is in error state.
But at least from software perspective we should end in device
being in active state and then we can perform reset of it.

[ Third scenario, i.e device is suspended, and device has to be resumed
by pm_runtime_get_sync() can not really happen, because we can not
get error from disabled device. ]

... after writing this I realized that is not true, as we can do
recovery for several devices on subordinate bus, some of them can
be rpm suspended, I presume. Need to think more about this case.

> IIUC, you can reproduce the
> problem with the simultaneous aer_inject and rpm suspend/resume, and
> this patch fixes it.

Yes.

> But there's some other scenario where you *don't* see the problem?

For example there are no runtime pm related race conditions with
.probe, .remove , (system) .suspend, .resume. All of those have 
proper synchronization with runtime pm in the device core.

> > I tested with igc device by doing simultaneous aer_inject and 
> > rpm suspend/resume via /sys/bus/pci/devices/PCI_ID/power/control
> > and can reproduce: 
> > 
> > [  853.253938] igc 0000:02:00.0: not ready 65535ms after bus reset; giving up
> > [  853.253973] pcieport 0000:00:1c.2: AER: Root Port link has been reset (-25)
> > [  853.253996] pcieport 0000:00:1c.2: AER: subordinate device reset failed
> > [  853.254099] pcieport 0000:00:1c.2: AER: device recovery failed
> > [  853.254178] igc 0000:02:00.0: Unable to change power state from D3hot to D0, device inaccessible
> 
> Drop the timestamps; they don't add to understanding the problem.

Ok.

Regards
Stanislaw
> 
> > The problem disappears when applied this patch.
> > 
> > Signed-off-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
> > ---
> >  drivers/pci/pcie/aer.c | 8 ++++++++
> >  drivers/pci/pcie/edr.c | 3 +++
> >  2 files changed, 11 insertions(+)
> > 
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index 42a3bd35a3e1..9b56460edc76 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -23,6 +23,7 @@
> >  #include <linux/kernel.h>
> >  #include <linux/errno.h>
> >  #include <linux/pm.h>
> > +#include <linux/pm_runtime.h>
> >  #include <linux/init.h>
> >  #include <linux/interrupt.h>
> >  #include <linux/delay.h>
> > @@ -813,6 +814,7 @@ static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
> >  {
> >  	if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
> >  		e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
> > +		pm_runtime_get_sync(&dev->dev);
> >  		e_info->error_dev_num++;
> >  		return 0;
> >  	}
> > @@ -1111,6 +1113,8 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
> >  {
> >  	cxl_rch_handle_error(dev, info);
> >  	pci_aer_handle_error(dev, info);
> > +
> > +	pm_runtime_put(&dev->dev);
> >  	pci_dev_put(dev);
> >  }
> >  
> > @@ -1143,6 +1147,8 @@ static void aer_recover_work_func(struct work_struct *work)
> >  			       PCI_SLOT(entry.devfn), PCI_FUNC(entry.devfn));
> >  			continue;
> >  		}
> > +		pm_runtime_get_sync(&pdev->dev);
> > +
> >  		pci_print_aer(pdev, entry.severity, entry.regs);
> >  		/*
> >  		 * Memory for aer_capability_regs(entry.regs) is being allocated from the
> > @@ -1159,6 +1165,8 @@ static void aer_recover_work_func(struct work_struct *work)
> >  		else if (entry.severity == AER_FATAL)
> >  			pcie_do_recovery(pdev, pci_channel_io_frozen,
> >  					 aer_root_reset);
> > +
> > +		pm_runtime_put(&pdev->dev);
> >  		pci_dev_put(pdev);
> >  	}
> >  }
> > diff --git a/drivers/pci/pcie/edr.c b/drivers/pci/pcie/edr.c
> > index 5f4914d313a1..bd96babd7249 100644
> > --- a/drivers/pci/pcie/edr.c
> > +++ b/drivers/pci/pcie/edr.c
> > @@ -10,6 +10,7 @@
> >  
> >  #include <linux/pci.h>
> >  #include <linux/pci-acpi.h>
> > +#include <linux/pm_runtime.h>
> >  
> >  #include "portdrv.h"
> >  #include "../pci.h"
> > @@ -169,6 +170,7 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
> >  		return;
> >  	}
> >  
> > +	pm_runtime_get_sync(&edev->dev);
> >  	pci_dbg(pdev, "Reported EDR dev: %s\n", pci_name(edev));
> >  
> >  	/* If port does not support DPC, just send the OST */
> > @@ -209,6 +211,7 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
> >  		acpi_send_edr_status(pdev, edev, EDR_OST_FAILED);
> >  	}
> >  
> > +	pm_runtime_put(&edev->dev);
> >  	pci_dev_put(edev);
> >  }
> >  
> > -- 
> > 2.34.1
> > 
>

Rafael J. Wysocki Feb. 1, 2024, 2:10 p.m. UTC | #3

On Thu, Feb 1, 2024 at 10:36 AM Stanislaw Gruszka
<stanislaw.gruszka@linux.intel.com> wrote:
>
> On Mon, Jan 29, 2024 at 06:14:36PM -0600, Bjorn Helgaas wrote:
> > On Tue, Jan 23, 2024 at 10:00:16AM +0100, Stanislaw Gruszka wrote:
> > > PM runtime can be done simultaneously with AER error handling.
> > > Avoid that by using pm_runtime_get_sync() just after pci_dev_get()
> > > and pm_runtime_put() just before pci_dev_put() in AER recovery
> > > procedures.
> >
> > I guess there must be a general rule here, like "PCI core must use
> > pm_runtime_get_sync() whenever touching the device asynchronously,
> > i.e., when it's doing something unrelated to a call from the driver"?
>
> Clear rules would be nice, that's for sure.
>
> > Probably would apply to all subsystem cores, not just PCI.
>
> If they have something similar like AER.
>
> > > I'm not sure about DPC case since I do not see get/put there. It
> > > just call pci_do_recovery() from threaded irq dcd_handler().
> > > I think pm_runtime* should be added to this handler as well.
> >
> > s/dcd_handler/dpc_handler/
> >
> > I'm guessing the "threaded" part really doesn't matter; just the fact
> > that this is in response to an interrupt, not something directly
> > called by a driver?
>
> Yes. I added "threaded" just to emphasis that it's safe to add sleeping
> functions there. In context of possible solution, not related to
> the problem itself.
>
> > > pm_runtime_get_sync() will increase dev->power.usage_count counter to
> > > prevent any rpm actives. When there is suspending pending, it will wait
> > > for it and do the rpm resume. Not sure if that problem, on my testing
> > > I did not encounter issues with that.
> >
> > Sorry, I didn't catch your meaning here.
> I tired to write two things:
>
> First, pm_runtime_get_sync() after exit prevents possibility of
> runtime suspend, so we are sure device will not be powered off
>
> Second, if during pm_runtime_get_sync(), there is pending attempt
> to suspend device, it will be synchronized and device will
> be resumed.

Not exactly.  pm_runtime_get_sync() will resume the device
synchronously no matter what.

> This can be problematic as device is in error state.

If this is a real possibility (I mean, device in a low-power state and
in an error state at the same time), it would be better to call
__pm_runtime_disable(dev, false) instead of pm_runtime_get_sync(), as
that would prevent runtime PM from changing the device state.

> But at least from software perspective we should end in device
> being in active state and then we can perform reset of it.

I'm not sure about this.  It may be better to power-cycle the device
in D3hot instead of attempting to put it into D0 beforehand.

Stanislaw Gruszka Feb. 5, 2024, 7:19 p.m. UTC | #4

On Thu, Feb 01, 2024 at 03:10:35PM +0100, Rafael J. Wysocki wrote:
> On Thu, Feb 1, 2024 at 10:36 AM Stanislaw Gruszka
> <stanislaw.gruszka@linux.intel.com> wrote:
> >
> > On Mon, Jan 29, 2024 at 06:14:36PM -0600, Bjorn Helgaas wrote:
> > > On Tue, Jan 23, 2024 at 10:00:16AM +0100, Stanislaw Gruszka wrote:
> > > > PM runtime can be done simultaneously with AER error handling.
> > > > Avoid that by using pm_runtime_get_sync() just after pci_dev_get()
> > > > and pm_runtime_put() just before pci_dev_put() in AER recovery
> > > > procedures.
> > >
> > > I guess there must be a general rule here, like "PCI core must use
> > > pm_runtime_get_sync() whenever touching the device asynchronously,
> > > i.e., when it's doing something unrelated to a call from the driver"?
> >
> > Clear rules would be nice, that's for sure.
> >
> > > Probably would apply to all subsystem cores, not just PCI.
> >
> > If they have something similar like AER.
> >
> > > > I'm not sure about DPC case since I do not see get/put there. It
> > > > just call pci_do_recovery() from threaded irq dcd_handler().
> > > > I think pm_runtime* should be added to this handler as well.
> > >
> > > s/dcd_handler/dpc_handler/
> > >
> > > I'm guessing the "threaded" part really doesn't matter; just the fact
> > > that this is in response to an interrupt, not something directly
> > > called by a driver?
> >
> > Yes. I added "threaded" just to emphasis that it's safe to add sleeping
> > functions there. In context of possible solution, not related to
> > the problem itself.
> >
> > > > pm_runtime_get_sync() will increase dev->power.usage_count counter to
> > > > prevent any rpm actives. When there is suspending pending, it will wait
> > > > for it and do the rpm resume. Not sure if that problem, on my testing
> > > > I did not encounter issues with that.
> > >
> > > Sorry, I didn't catch your meaning here.
> > I tired to write two things:
> >
> > First, pm_runtime_get_sync() after exit prevents possibility of
> > runtime suspend, so we are sure device will not be powered off
> >
> > Second, if during pm_runtime_get_sync(), there is pending attempt
> > to suspend device, it will be synchronized and device will
> > be resumed.
> 
> Not exactly.  pm_runtime_get_sync() will resume the device
> synchronously no matter what.
> 
> > This can be problematic as device is in error state.
> 
> If this is a real possibility (I mean, device in a low-power state and
> in an error state at the same time), it would be better to call
> __pm_runtime_disable(dev, false) instead of pm_runtime_get_sync(), as
> that would prevent runtime PM from changing the device state.

__pm_runtime_disable(dev, false) does not work as reliable in my
test as pm_runtime_get_sync(), the

igc 0000:02:00.0: Unable to change power state from D3hot to D0, device inaccessible

message disappears, but sill have:

igc 0000:02:00.0: not ready 65535ms after bus reset; giving up
pcieport 0000:00:1c.2: AER: Root Port link has been reset (-25)
pcieport 0000:00:1c.2: AER: subordinate device reset failed
pcieport 0000:00:1c.2: AER: device recovery fail

> > But at least from software perspective we should end in device
> > being in active state and then we can perform reset of it.
> 
> I'm not sure about this.  It may be better to power-cycle the device
> in D3hot instead of attempting to put it into D0 beforehand.

Me nigher, but in pci_reset_function() and similar resetting 
procedures we always call pci_dev_save_and_disable() before
actual reset and it set device in D0, so I think it's ok to do so.

Regards
Stanislaw

Rafael J. Wysocki Feb. 5, 2024, 9 p.m. UTC | #5

On Mon, Feb 5, 2024 at 9:11 PM Stanislaw Gruszka
<stanislaw.gruszka@linux.intel.com> wrote:
>
> On Thu, Feb 01, 2024 at 03:10:35PM +0100, Rafael J. Wysocki wrote:
> > On Thu, Feb 1, 2024 at 10:36 AM Stanislaw Gruszka
> > <stanislaw.gruszka@linux.intel.com> wrote:
> > >
> > > On Mon, Jan 29, 2024 at 06:14:36PM -0600, Bjorn Helgaas wrote:
> > > > On Tue, Jan 23, 2024 at 10:00:16AM +0100, Stanislaw Gruszka wrote:
> > > > > PM runtime can be done simultaneously with AER error handling.
> > > > > Avoid that by using pm_runtime_get_sync() just after pci_dev_get()
> > > > > and pm_runtime_put() just before pci_dev_put() in AER recovery
> > > > > procedures.
> > > >
> > > > I guess there must be a general rule here, like "PCI core must use
> > > > pm_runtime_get_sync() whenever touching the device asynchronously,
> > > > i.e., when it's doing something unrelated to a call from the driver"?
> > >
> > > Clear rules would be nice, that's for sure.
> > >
> > > > Probably would apply to all subsystem cores, not just PCI.
> > >
> > > If they have something similar like AER.
> > >
> > > > > I'm not sure about DPC case since I do not see get/put there. It
> > > > > just call pci_do_recovery() from threaded irq dcd_handler().
> > > > > I think pm_runtime* should be added to this handler as well.
> > > >
> > > > s/dcd_handler/dpc_handler/
> > > >
> > > > I'm guessing the "threaded" part really doesn't matter; just the fact
> > > > that this is in response to an interrupt, not something directly
> > > > called by a driver?
> > >
> > > Yes. I added "threaded" just to emphasis that it's safe to add sleeping
> > > functions there. In context of possible solution, not related to
> > > the problem itself.
> > >
> > > > > pm_runtime_get_sync() will increase dev->power.usage_count counter to
> > > > > prevent any rpm actives. When there is suspending pending, it will wait
> > > > > for it and do the rpm resume. Not sure if that problem, on my testing
> > > > > I did not encounter issues with that.
> > > >
> > > > Sorry, I didn't catch your meaning here.
> > > I tired to write two things:
> > >
> > > First, pm_runtime_get_sync() after exit prevents possibility of
> > > runtime suspend, so we are sure device will not be powered off
> > >
> > > Second, if during pm_runtime_get_sync(), there is pending attempt
> > > to suspend device, it will be synchronized and device will
> > > be resumed.
> >
> > Not exactly.  pm_runtime_get_sync() will resume the device
> > synchronously no matter what.
> >
> > > This can be problematic as device is in error state.
> >
> > If this is a real possibility (I mean, device in a low-power state and
> > in an error state at the same time), it would be better to call
> > __pm_runtime_disable(dev, false) instead of pm_runtime_get_sync(), as
> > that would prevent runtime PM from changing the device state.
>
> __pm_runtime_disable(dev, false) does not work as reliable in my
> test as pm_runtime_get_sync(), the
>
> igc 0000:02:00.0: Unable to change power state from D3hot to D0, device inaccessible
>
> message disappears, but sill have:
>
> igc 0000:02:00.0: not ready 65535ms after bus reset; giving up
> pcieport 0000:00:1c.2: AER: Root Port link has been reset (-25)
> pcieport 0000:00:1c.2: AER: subordinate device reset failed
> pcieport 0000:00:1c.2: AER: device recovery fail

But what exactly do you do?

(1) __pm_runtime_disable(dev, false)
(2) Check power state
     (a) If D0 (and device runtime-active), proceed
     (b) If > D0, remove power (if possible) and put into D0

or something else?

> > > But at least from software perspective we should end in device
> > > being in active state and then we can perform reset of it.
> >
> > I'm not sure about this.  It may be better to power-cycle the device
> > in D3hot instead of attempting to put it into D0 beforehand.
>
> Me nigher, but in pci_reset_function() and similar resetting
> procedures we always call pci_dev_save_and_disable() before
> actual reset and it set device in D0, so I think it's ok to do so.

OK

Stanislaw Gruszka Feb. 6, 2024, 3:37 p.m. UTC | #6

On Mon, Feb 05, 2024 at 10:00:45PM +0100, Rafael J. Wysocki wrote:
> On Mon, Feb 5, 2024 at 9:11 PM Stanislaw Gruszka
> <stanislaw.gruszka@linux.intel.com> wrote:
> >
> > On Thu, Feb 01, 2024 at 03:10:35PM +0100, Rafael J. Wysocki wrote:
> > > On Thu, Feb 1, 2024 at 10:36 AM Stanislaw Gruszka
> > > <stanislaw.gruszka@linux.intel.com> wrote:
> > > >
> > > > On Mon, Jan 29, 2024 at 06:14:36PM -0600, Bjorn Helgaas wrote:
> > > > > On Tue, Jan 23, 2024 at 10:00:16AM +0100, Stanislaw Gruszka wrote:
> > > > > > PM runtime can be done simultaneously with AER error handling.
> > > > > > Avoid that by using pm_runtime_get_sync() just after pci_dev_get()
> > > > > > and pm_runtime_put() just before pci_dev_put() in AER recovery
> > > > > > procedures.
> > > > >
> > > > > I guess there must be a general rule here, like "PCI core must use
> > > > > pm_runtime_get_sync() whenever touching the device asynchronously,
> > > > > i.e., when it's doing something unrelated to a call from the driver"?
> > > >
> > > > Clear rules would be nice, that's for sure.
> > > >
> > > > > Probably would apply to all subsystem cores, not just PCI.
> > > >
> > > > If they have something similar like AER.
> > > >
> > > > > > I'm not sure about DPC case since I do not see get/put there. It
> > > > > > just call pci_do_recovery() from threaded irq dcd_handler().
> > > > > > I think pm_runtime* should be added to this handler as well.
> > > > >
> > > > > s/dcd_handler/dpc_handler/
> > > > >
> > > > > I'm guessing the "threaded" part really doesn't matter; just the fact
> > > > > that this is in response to an interrupt, not something directly
> > > > > called by a driver?
> > > >
> > > > Yes. I added "threaded" just to emphasis that it's safe to add sleeping
> > > > functions there. In context of possible solution, not related to
> > > > the problem itself.
> > > >
> > > > > > pm_runtime_get_sync() will increase dev->power.usage_count counter to
> > > > > > prevent any rpm actives. When there is suspending pending, it will wait
> > > > > > for it and do the rpm resume. Not sure if that problem, on my testing
> > > > > > I did not encounter issues with that.
> > > > >
> > > > > Sorry, I didn't catch your meaning here.
> > > > I tired to write two things:
> > > >
> > > > First, pm_runtime_get_sync() after exit prevents possibility of
> > > > runtime suspend, so we are sure device will not be powered off
> > > >
> > > > Second, if during pm_runtime_get_sync(), there is pending attempt
> > > > to suspend device, it will be synchronized and device will
> > > > be resumed.
> > >
> > > Not exactly.  pm_runtime_get_sync() will resume the device
> > > synchronously no matter what.
> > >
> > > > This can be problematic as device is in error state.
> > >
> > > If this is a real possibility (I mean, device in a low-power state and
> > > in an error state at the same time), it would be better to call
> > > __pm_runtime_disable(dev, false) instead of pm_runtime_get_sync(), as
> > > that would prevent runtime PM from changing the device state.
> >
> > __pm_runtime_disable(dev, false) does not work as reliable in my
> > test as pm_runtime_get_sync(), the
> >
> > igc 0000:02:00.0: Unable to change power state from D3hot to D0, device inaccessible
> >
> > message disappears, but sill have:
> >
> > igc 0000:02:00.0: not ready 65535ms after bus reset; giving up
> > pcieport 0000:00:1c.2: AER: Root Port link has been reset (-25)
> > pcieport 0000:00:1c.2: AER: subordinate device reset failed
> > pcieport 0000:00:1c.2: AER: device recovery fail
> 
> But what exactly do you do?
> 
> (1) __pm_runtime_disable(dev, false)
> (2) Check power state
>      (a) If D0 (and device runtime-active), proceed
>      (b) If > D0, remove power (if possible) and put into D0
> 
> or something else?

I just did point (1), did not check power state (2).
I tested below patch with replaced:

  pm_runtime_get_sync -> __pm_runtime_disable(false)
  pm_runtime_put -> pm_runtime_enable()

I could try to test with (1)+(2), but now sure how do do step (2b),
by:

pci_set_power_state(D3cold)
pci_set_power_state(D0)

?

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 59c90d04a609..705893b5f7b0 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -13,6 +13,7 @@
 #define dev_fmt(fmt) "AER: " fmt
 
 #include <linux/pci.h>
+#include <linux/pm_runtime.h>
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/errno.h>
@@ -85,6 +86,18 @@ static int report_error_detected(struct pci_dev *dev,
 	return 0;
 }
 
+static int pci_pm_runtime_get_sync(struct pci_dev *pdev, void *data)
+{
+	pm_runtime_get_sync(&pdev->dev);
+	return 0;
+}
+
+static int pci_pm_runtime_put(struct pci_dev *pdev, void *data)
+{
+	pm_runtime_put(&pdev->dev);
+	return 0;
+}
+
 static int report_frozen_detected(struct pci_dev *dev, void *data)
 {
 	return report_error_detected(dev, pci_channel_io_frozen, data);
@@ -207,6 +220,8 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	else
 		bridge = pci_upstream_bridge(dev);
 
+	pci_walk_bridge(bridge, pci_pm_runtime_get_sync, NULL);
+
 	pci_dbg(bridge, "broadcast error_detected message\n");
 	if (state == pci_channel_io_frozen) {
 		pci_walk_bridge(bridge, report_frozen_detected, &status);
@@ -251,10 +266,15 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 		pcie_clear_device_status(dev);
 		pci_aer_clear_nonfatal_status(dev);
 	}
+
+	pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);
+
 	pci_info(bridge, "device recovery successful\n");
 	return status;
 
 failed:
+	pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);
+
 	pci_uevent_ers(bridge, PCI_ERS_RESULT_DISCONNECT);
 
 	/* TODO: Should kernel panic here? */

Stanislaw Gruszka Feb. 8, 2024, 1:49 p.m. UTC | #7

On Tue, Feb 06, 2024 at 04:37:36PM +0100, Stanislaw Gruszka wrote:
> > > > If this is a real possibility (I mean, device in a low-power state and
> > > > in an error state at the same time), it would be better to call
> > > > __pm_runtime_disable(dev, false) instead of pm_runtime_get_sync(), as
> > > > that would prevent runtime PM from changing the device state.
> > >
> > > __pm_runtime_disable(dev, false) does not work as reliable in my
> > > test as pm_runtime_get_sync(), the
> > >
> > > igc 0000:02:00.0: Unable to change power state from D3hot to D0, device inaccessible
> > >
> > > message disappears, but sill have:
> > >
> > > igc 0000:02:00.0: not ready 65535ms after bus reset; giving up
> > > pcieport 0000:00:1c.2: AER: Root Port link has been reset (-25)
> > > pcieport 0000:00:1c.2: AER: subordinate device reset failed
> > > pcieport 0000:00:1c.2: AER: device recovery fail
> > 
> > But what exactly do you do?
> > 
> > (1) __pm_runtime_disable(dev, false)
> > (2) Check power state
> >      (a) If D0 (and device runtime-active), proceed
> >      (b) If > D0, remove power (if possible) and put into D0
> > 
> > or something else?
> 
> I just did point (1), did not check power state (2).
> I tested below patch with replaced:
> 
>   pm_runtime_get_sync -> __pm_runtime_disable(false)
>   pm_runtime_put -> pm_runtime_enable()
> 
> I could try to test with (1)+(2), but now sure how do do step (2b),
> by:
> 
> pci_set_power_state(D3cold)
> pci_set_power_state(D0)

The problematic case is indeed when after __pm_runtime_disable(), device
is in D3hot state. In such case we need to do the same operations as
pci_pm_runtime_resume() does, otherwise AER code is not able to work.
I think, just doing pm_runtime_get_sync() is better.

While I'm able to reproduce D3hot & error state using aer_inject
on the same smae device, more practical case is recovery running on all
devices connected to a port (else case in pcie_do_recovery 
type == PCIE_EXP_TYPE* check). On such case some devices can be suspend,
and from AER code comments I conclude they have to be reset.

> +static int pci_pm_runtime_put(struct pci_dev *pdev, void *data)
> +{
> +	pm_runtime_put(&pdev->dev);
> +	return 0;
> +}
> +
<snip>
> +
> +	pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);

This can happen after driver is unbind from device. I had concern about
that, but after drivers/base/ code examination, seems to be fine do
do pm_runtime_put() after unbind.

Regards
Stanislaw

[RFC] PCI/AER: Block runtime suspend when handling errors

Commit Message

Comments

Patch