diff mbox series

[net-next,v2] net: wwan: t7xx: reset device if suspend fails

Message ID 20241022084348.4571-1-jinjian.song@fibocom.com (mailing list archive)
State New, archived
Headers show
Series [net-next,v2] net: wwan: t7xx: reset device if suspend fails | expand

Commit Message

Jinjian Song Oct. 22, 2024, 8:43 a.m. UTC
If driver fails to set the device to suspend, it means that the
device is abnormal. In this case, reset the device to recover
when PCIe device is offline.

Signed-off-by: Jinjian Song <jinjian.song@fibocom.com>
---
V2:
 * Add judgment, reset when device is offline 
---
 drivers/net/wwan/t7xx/t7xx_pci.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Sergey Ryazanov Oct. 29, 2024, 12:58 a.m. UTC | #1
Hello Jinjian,

On 22.10.2024 11:43, Jinjian Song wrote:
> If driver fails to set the device to suspend, it means that the
> device is abnormal. In this case, reset the device to recover
> when PCIe device is offline.

Is it a reproducible or a speculative issue? Does the fix recover modem 
from a problematic state?

Anyway we need someone more familiar with this hardware (Intel or 
MediaTek engineer) to Ack the change to make sure we are not going to 
put a system in a more complicated state.

> Signed-off-by: Jinjian Song <jinjian.song@fibocom.com>
> ---
> V2:
>   * Add judgment, reset when device is offline
> ---
>   drivers/net/wwan/t7xx/t7xx_pci.c | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/net/wwan/t7xx/t7xx_pci.c b/drivers/net/wwan/t7xx/t7xx_pci.c
> index e556e5bd49ab..4f89a353588b 100644
> --- a/drivers/net/wwan/t7xx/t7xx_pci.c
> +++ b/drivers/net/wwan/t7xx/t7xx_pci.c
> @@ -427,6 +427,10 @@ static int __t7xx_pci_pm_suspend(struct pci_dev *pdev)
>   	iowrite32(T7XX_L1_BIT(0), IREG_BASE(t7xx_dev) + ENABLE_ASPM_LOWPWR);
>   	atomic_set(&t7xx_dev->md_pm_state, MTK_PM_RESUMED);
>   	t7xx_pcie_mac_set_int(t7xx_dev, SAP_RGU_INT);
> +	if (pci_channel_offline(pdev)) {
> +		dev_err(&pdev->dev, "Device offline, reset to recover\n");
> +		t7xx_reset_device(t7xx_dev, PLDR);
> +	}
>   	return ret;
>   }

--
Sergey
Jinjian Song Oct. 29, 2024, 3:46 a.m. UTC | #2
From: Sergey Ryazanov <ryazanov.s.a@gmail.com>

>Hello Jinjian,
>
>On 22.10.2024 11:43, Jinjian Song wrote:
>> If driver fails to set the device to suspend, it means that the
>> device is abnormal. In this case, reset the device to recover
>> when PCIe device is offline.
>
>Is it a reproducible or a speculative issue? Does the fix recover modem 
>from a problematic state?
>
>Anyway we need someone more familiar with this hardware (Intel or 
>MediaTek engineer) to Ack the change to make sure we are not going to 
>put a system in a more complicated state.

Hi Sergey,

This is a very difficult issue to replicate onece occured and fixed.

The issue occured when driver and device lost the connection. I have
encountered this problem twice so far:
1. During suspend/resume stress test, there was a probabilistic D3L2
time sequence issue with the BIOS, result in PCIe link down, driver
read and write the register of device invalid, so suspend failed.
This issue was eventually fixed in the BIOS and I was able to restore
it through the reset module after reproducing the problem.

2. During idle test, the modem probabilistic hang up, result in PCIe
link down, driver read and write the register of device invalid, so
suspend failed. This issue was eventually fiex in device modem firmware
by adjust a certain power supply voltage, and reset modem as a workround
to restore when the MBIM port command timeout in userspace applycations.

Hardware reset modem to recover was discussed with MTK, and they said
that if we don't want to keep the on-site problem location in case of
suspend failure, we can use the recover solution. 

Both the ocurred issues result in the PCIe link issue, driver can't 
read and writer the register of WWAN device, so I want to add this path
to restore, hardware reset modem can recover modem, but using the 
pci_channle_offline() as the judgment is my inference.

Thanks.

>> Signed-off-by: Jinjian Song <jinjian.song@fibocom.com>
>> ---
>> V2:
>>   * Add judgment, reset when device is offline
>> ---
>>   drivers/net/wwan/t7xx/t7xx_pci.c | 4 ++++
>>   1 file changed, 4 insertions(+)
>> 
>> diff --git a/drivers/net/wwan/t7xx/t7xx_pci.c b/drivers/net/wwan/t7xx/t7xx_pci.c
>> index e556e5bd49ab..4f89a353588b 100644
>> --- a/drivers/net/wwan/t7xx/t7xx_pci.c
>> +++ b/drivers/net/wwan/t7xx/t7xx_pci.c
>> @@ -427,6 +427,10 @@ static int __t7xx_pci_pm_suspend(struct pci_dev *pdev)
>>   	iowrite32(T7XX_L1_BIT(0), IREG_BASE(t7xx_dev) + ENABLE_ASPM_LOWPWR);
>>   	atomic_set(&t7xx_dev->md_pm_state, MTK_PM_RESUMED);
>>   	t7xx_pcie_mac_set_int(t7xx_dev, SAP_RGU_INT);
>> +	if (pci_channel_offline(pdev)) {
>> +		dev_err(&pdev->dev, "Device offline, reset to recover\n");
>> +		t7xx_reset_device(t7xx_dev, PLDR);
>> +	}
>>   	return ret;
>>   }
>
>--
>Sergey
>

Best Regards,
Jinjian.
Sergey Ryazanov Oct. 30, 2024, 6:33 p.m. UTC | #3
Hi Jinjian,

On 29.10.2024 05:46, Jinjian Song wrote:
> From: Sergey Ryazanov <ryazanov.s.a@gmail.com>
>> On 22.10.2024 11:43, Jinjian Song wrote:
>>> If driver fails to set the device to suspend, it means that the
>>> device is abnormal. In this case, reset the device to recover
>>> when PCIe device is offline.
>>
>> Is it a reproducible or a speculative issue? Does the fix recover 
>> modem from a problematic state?
>>
>> Anyway we need someone more familiar with this hardware (Intel or 
>> MediaTek engineer) to Ack the change to make sure we are not going to 
>> put a system in a more complicated state.
> 
> Hi Sergey,
> 
> This is a very difficult issue to replicate onece occured and fixed.
> 
> The issue occured when driver and device lost the connection. I have
> encountered this problem twice so far:
> 1. During suspend/resume stress test, there was a probabilistic D3L2
> time sequence issue with the BIOS, result in PCIe link down, driver
> read and write the register of device invalid, so suspend failed.
> This issue was eventually fixed in the BIOS and I was able to restore
> it through the reset module after reproducing the problem.
> 
> 2. During idle test, the modem probabilistic hang up, result in PCIe
> link down, driver read and write the register of device invalid, so
> suspend failed. This issue was eventually fiex in device modem firmware
> by adjust a certain power supply voltage, and reset modem as a workround
> to restore when the MBIM port command timeout in userspace applycations.
> 
> Hardware reset modem to recover was discussed with MTK, and they said
> that if we don't want to keep the on-site problem location in case of
> suspend failure, we can use the recover solution.
> Both the ocurred issues result in the PCIe link issue, driver can't read 
> and writer the register of WWAN device, so I want to add this path
> to restore, hardware reset modem can recover modem, but using the 
> pci_channle_offline() as the judgment is my inference.

Thank you for the clarification. Let me summarize what I've understood 
from the explanation:
a) there were hardware (firmware) issues,
b) issues already were solved,
c) issues were not directly related to the device suspension procedure,
d) you want to implement a backup plan to make the modem support robust.

If got it right, then I would like to recommend to implement a generic 
error handling solution for the PCIe interface. You can check this 
document: Documentation/PCI/pci-error-recovery.rst

Suddenly, I am not an expert in the PCIe link recovery procedure, so 
I've CCed this message to PCI subsystem maintainers. I hope they can 
suggest a conceptually correct way to handle these cases.

>>> Signed-off-by: Jinjian Song <jinjian.song@fibocom.com>
>>> ---
>>> V2:
>>>   * Add judgment, reset when device is offline
>>> ---
>>>   drivers/net/wwan/t7xx/t7xx_pci.c | 4 ++++
>>>   1 file changed, 4 insertions(+)
>>>
>>> diff --git a/drivers/net/wwan/t7xx/t7xx_pci.c b/drivers/net/wwan/ 
>>> t7xx/t7xx_pci.c
>>> index e556e5bd49ab..4f89a353588b 100644
>>> --- a/drivers/net/wwan/t7xx/t7xx_pci.c
>>> +++ b/drivers/net/wwan/t7xx/t7xx_pci.c
>>> @@ -427,6 +427,10 @@ static int __t7xx_pci_pm_suspend(struct pci_dev 
>>> *pdev)
>>>       iowrite32(T7XX_L1_BIT(0), IREG_BASE(t7xx_dev) + 
>>> ENABLE_ASPM_LOWPWR);
>>>       atomic_set(&t7xx_dev->md_pm_state, MTK_PM_RESUMED);
>>>       t7xx_pcie_mac_set_int(t7xx_dev, SAP_RGU_INT);
>>> +    if (pci_channel_offline(pdev)) {
>>> +        dev_err(&pdev->dev, "Device offline, reset to recover\n");
>>> +        t7xx_reset_device(t7xx_dev, PLDR);
>>> +    }
>>>       return ret;
>>>   }

--
Sergey
Jinjian Song Oct. 31, 2024, 1:09 p.m. UTC | #4
From: Sergey Ryazanov <ryazanov.s.a@gmail.com>

>Hi Jinjian,
>
>On 29.10.2024 05:46, Jinjian Song wrote:
>> From: Sergey Ryazanov <ryazanov.s.a@gmail.com>
>>> On 22.10.2024 11:43, Jinjian Song wrote:
>>>> If driver fails to set the device to suspend, it means that the
>>>> device is abnormal. In this case, reset the device to recover
>>>> when PCIe device is offline.
>>>
>>> Is it a reproducible or a speculative issue? Does the fix recover 
>>> modem from a problematic state?
>>>
>>> Anyway we need someone more familiar with this hardware (Intel or 
>>> MediaTek engineer) to Ack the change to make sure we are not going to 
>>> put a system in a more complicated state.
>> 
>> Hi Sergey,
>> 
>> This is a very difficult issue to replicate onece occured and fixed.
>> 
>> The issue occured when driver and device lost the connection. I have
>> encountered this problem twice so far:
>> 1. During suspend/resume stress test, there was a probabilistic D3L2
>> time sequence issue with the BIOS, result in PCIe link down, driver
>> read and write the register of device invalid, so suspend failed.
>> This issue was eventually fixed in the BIOS and I was able to restore
>> it through the reset module after reproducing the problem.
>> 
>> 2. During idle test, the modem probabilistic hang up, result in PCIe
>> link down, driver read and write the register of device invalid, so
>> suspend failed. This issue was eventually fiex in device modem firmware
>> by adjust a certain power supply voltage, and reset modem as a workround
>> to restore when the MBIM port command timeout in userspace applycations.
>> 
>> Hardware reset modem to recover was discussed with MTK, and they said
>> that if we don't want to keep the on-site problem location in case of
>> suspend failure, we can use the recover solution.
>> Both the ocurred issues result in the PCIe link issue, driver can't read 
>> and writer the register of WWAN device, so I want to add this path
>> to restore, hardware reset modem can recover modem, but using the 
>> pci_channle_offline() as the judgment is my inference.
>
>Thank you for the clarification. Let me summarize what I've understood 
>from the explanation:
>a) there were hardware (firmware) issues,
>b) issues already were solved,
>c) issues were not directly related to the device suspension procedure,
>d) you want to implement a backup plan to make the modem support robust.
>
>If got it right, then I would like to recommend to implement a generic 
>error handling solution for the PCIe interface. You can check this 
>document: Documentation/PCI/pci-error-recovery.rst
Hi Sergey,

Yes, got it right.
I want to identify the scenario and then recover by reset device,
otherwise suspend failure will aways prevent the system from suspending
if it occurs.

>Suddenly, I am not an expert in the PCIe link recovery procedure, so 
>I've CCed this message to PCI subsystem maintainers. I hope they can 
>suggest a conceptually correct way to handle these cases.

Thanks.

>>>> Signed-off-by: Jinjian Song <jinjian.song@fibocom.com>
>>>> ---
>>>> V2:
>>>>   * Add judgment, reset when device is offline
>>>> ---
>>>>   drivers/net/wwan/t7xx/t7xx_pci.c | 4 ++++
>>>>   1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/drivers/net/wwan/t7xx/t7xx_pci.c b/drivers/net/wwan/ 
>>>> t7xx/t7xx_pci.c
>>>> index e556e5bd49ab..4f89a353588b 100644
>>>> --- a/drivers/net/wwan/t7xx/t7xx_pci.c
>>>> +++ b/drivers/net/wwan/t7xx/t7xx_pci.c
>>>> @@ -427,6 +427,10 @@ static int __t7xx_pci_pm_suspend(struct pci_dev 
>>>> *pdev)
>>>>       iowrite32(T7XX_L1_BIT(0), IREG_BASE(t7xx_dev) + 
>>>> ENABLE_ASPM_LOWPWR);
>>>>       atomic_set(&t7xx_dev->md_pm_state, MTK_PM_RESUMED);
>>>>       t7xx_pcie_mac_set_int(t7xx_dev, SAP_RGU_INT);
>>>> +    if (pci_channel_offline(pdev)) {
>>>> +        dev_err(&pdev->dev, "Device offline, reset to recover\n");
>>>> +        t7xx_reset_device(t7xx_dev, PLDR);
>>>> +    }
>>>>       return ret;
>>>>   }
>
>--
>Sergey
>

Best Regards,
Jinjian
Bjorn Helgaas Oct. 31, 2024, 5:04 p.m. UTC | #5
On Thu, Oct 31, 2024 at 09:09:30PM +0800, Jinjian Song wrote:
> From: Sergey Ryazanov <ryazanov.s.a@gmail.com>
> > On 29.10.2024 05:46, Jinjian Song wrote:
> > > From: Sergey Ryazanov <ryazanov.s.a@gmail.com>
> > > > On 22.10.2024 11:43, Jinjian Song wrote:
> > > > > If driver fails to set the device to suspend, it means that the
> > > > > device is abnormal. In this case, reset the device to recover
> > > > > when PCIe device is offline.
> > > > 
> > > > Is it a reproducible or a speculative issue? Does the fix
> > > > recover modem from a problematic state?
> > > > 
> > > > Anyway we need someone more familiar with this hardware (Intel
> > > > or MediaTek engineer) to Ack the change to make sure we are not
> > > > going to put a system in a more complicated state.
> > > 
> > > This is a very difficult issue to replicate onece occured and fixed.
> > > 
> > > The issue occured when driver and device lost the connection. I have
> > > encountered this problem twice so far:
> > > 1. During suspend/resume stress test, there was a probabilistic D3L2
> > > time sequence issue with the BIOS, result in PCIe link down, driver
> > > read and write the register of device invalid, so suspend failed.
> > > This issue was eventually fixed in the BIOS and I was able to restore
> > > it through the reset module after reproducing the problem.
> > > 
> > > 2. During idle test, the modem probabilistic hang up, result in PCIe
> > > link down, driver read and write the register of device invalid, so
> > > suspend failed. This issue was eventually fiex in device modem firmware
> > > by adjust a certain power supply voltage, and reset modem as a workround
> > > to restore when the MBIM port command timeout in userspace applycations.
> > > 
> > > Hardware reset modem to recover was discussed with MTK, and they said
> > > that if we don't want to keep the on-site problem location in case of
> > > suspend failure, we can use the recover solution.
> > > Both the ocurred issues result in the PCIe link issue, driver can't
> > > read and writer the register of WWAN device, so I want to add this
> > > path
> > > to restore, hardware reset modem can recover modem, but using the
> > > pci_channle_offline() as the judgment is my inference.
> > 
> > Thank you for the clarification. Let me summarize what I've understood
> > from the explanation:
> > a) there were hardware (firmware) issues,
> > b) issues already were solved,
> > c) issues were not directly related to the device suspension procedure,
> > d) you want to implement a backup plan to make the modem support robust.
> > 
> > If got it right, then I would like to recommend to implement a generic
> > error handling solution for the PCIe interface. You can check this
> > document: Documentation/PCI/pci-error-recovery.rst
> 
> Yes, got it right.
> I want to identify the scenario and then recover by reset device,
> otherwise suspend failure will aways prevent the system from suspending
> if it occurs.

If a PCIe link goes down here's my understanding of what happens:

  - Writes to the device are silently dropped.

  - Reads from the device return ~0 (PCI_POSSIBLE_ERROR()).

  - If the device is in a slot and pciehp is enabled, a Data Link
    Layer State Changed interrupt will cause pciehp_unconfigure_device()
    to detach the driver and remove the pci_dev.

  - If AER is enabled, a failed access to the device will cause an AER
    interrupt.  If the driver has registered pci_error_handlers, the
    driver callbacks will be called, and based on what the driver
    returns, the PCI core may reset the device.

The pciehp and AER interrupts are *asynchronous* to link down events
and to any driver access to the device, so they may be delayed an
arbitrary amount of time.

Both interrupt paths may lead to the device being marked as "offline".
Obviously this is asynchronous with respect to the driver.

> > > > > +++ b/drivers/net/wwan/t7xx/t7xx_pci.c
> > > > > @@ -427,6 +427,10 @@ static int __t7xx_pci_pm_suspend(struct
> > > > > pci_dev *pdev)
> > > > >       iowrite32(T7XX_L1_BIT(0), IREG_BASE(t7xx_dev) +
> > > > > ENABLE_ASPM_LOWPWR);
> > > > >       atomic_set(&t7xx_dev->md_pm_state, MTK_PM_RESUMED);
> > > > >       t7xx_pcie_mac_set_int(t7xx_dev, SAP_RGU_INT);
> > > > > +    if (pci_channel_offline(pdev)) {
> > > > > +        dev_err(&pdev->dev, "Device offline, reset to recover\n");
> > > > > +        t7xx_reset_device(t7xx_dev, PLDR);
> > > > > +    }

This looks like an unreliable way to detect issues.  It only works if
AER is enabled, and the device is only marked "offline" some arbitrary
length of time *after* a driver access to the device has failed.

You can't reliably detect errors on writes to the device.

You can only reliably detect errors on reads from the device by
looking for PCI_POSSIBLE_ERROR().  Obviously ~0 might be a valid value
to read from some registers, so you need device-specific knowledge to
know whether ~0 is valid or indicates an error.

If AER or DPC are enabled, the driver can be *notified* about read
errors and some write errors via pci_error_handlers, but the
notification is long after the error.

Bjorn
Linas Vepstas Nov. 3, 2024, 1:24 a.m. UTC | #6
Top-post reply.

What Bjorn says is exactly right. Just one clarifying remark: When PCI
error recovery was designed, it was envisioned for high-end,
high-availability servers so that they could reset a failing device
without forcing a reboot of the OS.  Concepts like suspend/resume did
not perturb even the unconscious sleep of the engineers. Thus,
stepping through error recovery during suspend would be novel,
untested, and perhaps prone to confusion. The error recovery procedure
is trying to reset the device into a fully-powered-on,
fully-functional and connected state. This might be problematic if the
suspend has already walked half-way through power-down. The fact that
error recovery might run a very long fraction of a second after the
error is detected further complicates things. The fact that error
recovery  usually has the driver fully reinitializing the device,
including reloading the firmware, could be a problem if the firmware
is not in RAM or is sitting on a suspended block device. So it all
could be an adventure.

As to testing: I asked one of the guys in the lab how he did it, and
he said he would brush a metal wire against the PCI pins. I winced.
But I guess it's cleaner than pouring coffee on it. Later we developed
a software tool to artificially inject errors.

--linas

On Fri, Nov 1, 2024 at 8:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, Oct 31, 2024 at 09:09:30PM +0800, Jinjian Song wrote:
> > From: Sergey Ryazanov <ryazanov.s.a@gmail.com>
> > > On 29.10.2024 05:46, Jinjian Song wrote:
> > > > From: Sergey Ryazanov <ryazanov.s.a@gmail.com>
> > > > > On 22.10.2024 11:43, Jinjian Song wrote:
> > > > > > If driver fails to set the device to suspend, it means that the
> > > > > > device is abnormal. In this case, reset the device to recover
> > > > > > when PCIe device is offline.
> > > > >
> > > > > Is it a reproducible or a speculative issue? Does the fix
> > > > > recover modem from a problematic state?
> > > > >
> > > > > Anyway we need someone more familiar with this hardware (Intel
> > > > > or MediaTek engineer) to Ack the change to make sure we are not
> > > > > going to put a system in a more complicated state.
> > > >
> > > > This is a very difficult issue to replicate onece occured and fixed.
> > > >
> > > > The issue occured when driver and device lost the connection. I have
> > > > encountered this problem twice so far:
> > > > 1. During suspend/resume stress test, there was a probabilistic D3L2
> > > > time sequence issue with the BIOS, result in PCIe link down, driver
> > > > read and write the register of device invalid, so suspend failed.
> > > > This issue was eventually fixed in the BIOS and I was able to restore
> > > > it through the reset module after reproducing the problem.
> > > >
> > > > 2. During idle test, the modem probabilistic hang up, result in PCIe
> > > > link down, driver read and write the register of device invalid, so
> > > > suspend failed. This issue was eventually fiex in device modem firmware
> > > > by adjust a certain power supply voltage, and reset modem as a workround
> > > > to restore when the MBIM port command timeout in userspace applycations.
> > > >
> > > > Hardware reset modem to recover was discussed with MTK, and they said
> > > > that if we don't want to keep the on-site problem location in case of
> > > > suspend failure, we can use the recover solution.
> > > > Both the ocurred issues result in the PCIe link issue, driver can't
> > > > read and writer the register of WWAN device, so I want to add this
> > > > path
> > > > to restore, hardware reset modem can recover modem, but using the
> > > > pci_channle_offline() as the judgment is my inference.
> > >
> > > Thank you for the clarification. Let me summarize what I've understood
> > > from the explanation:
> > > a) there were hardware (firmware) issues,
> > > b) issues already were solved,
> > > c) issues were not directly related to the device suspension procedure,
> > > d) you want to implement a backup plan to make the modem support robust.
> > >
> > > If got it right, then I would like to recommend to implement a generic
> > > error handling solution for the PCIe interface. You can check this
> > > document: Documentation/PCI/pci-error-recovery.rst
> >
> > Yes, got it right.
> > I want to identify the scenario and then recover by reset device,
> > otherwise suspend failure will aways prevent the system from suspending
> > if it occurs.
>
> If a PCIe link goes down here's my understanding of what happens:
>
>   - Writes to the device are silently dropped.
>
>   - Reads from the device return ~0 (PCI_POSSIBLE_ERROR()).
>
>   - If the device is in a slot and pciehp is enabled, a Data Link
>     Layer State Changed interrupt will cause pciehp_unconfigure_device()
>     to detach the driver and remove the pci_dev.
>
>   - If AER is enabled, a failed access to the device will cause an AER
>     interrupt.  If the driver has registered pci_error_handlers, the
>     driver callbacks will be called, and based on what the driver
>     returns, the PCI core may reset the device.
>
> The pciehp and AER interrupts are *asynchronous* to link down events
> and to any driver access to the device, so they may be delayed an
> arbitrary amount of time.
>
> Both interrupt paths may lead to the device being marked as "offline".
> Obviously this is asynchronous with respect to the driver.
>
> > > > > > +++ b/drivers/net/wwan/t7xx/t7xx_pci.c
> > > > > > @@ -427,6 +427,10 @@ static int __t7xx_pci_pm_suspend(struct
> > > > > > pci_dev *pdev)
> > > > > >       iowrite32(T7XX_L1_BIT(0), IREG_BASE(t7xx_dev) +
> > > > > > ENABLE_ASPM_LOWPWR);
> > > > > >       atomic_set(&t7xx_dev->md_pm_state, MTK_PM_RESUMED);
> > > > > >       t7xx_pcie_mac_set_int(t7xx_dev, SAP_RGU_INT);
> > > > > > +    if (pci_channel_offline(pdev)) {
> > > > > > +        dev_err(&pdev->dev, "Device offline, reset to recover\n");
> > > > > > +        t7xx_reset_device(t7xx_dev, PLDR);
> > > > > > +    }
>
> This looks like an unreliable way to detect issues.  It only works if
> AER is enabled, and the device is only marked "offline" some arbitrary
> length of time *after* a driver access to the device has failed.
>
> You can't reliably detect errors on writes to the device.
>
> You can only reliably detect errors on reads from the device by
> looking for PCI_POSSIBLE_ERROR().  Obviously ~0 might be a valid value
> to read from some registers, so you need device-specific knowledge to
> know whether ~0 is valid or indicates an error.
>
> If AER or DPC are enabled, the driver can be *notified* about read
> errors and some write errors via pci_error_handlers, but the
> notification is long after the error.
>
> Bjorn
Sergey Ryazanov Nov. 3, 2024, 10:02 p.m. UTC | #7
Hello Bjorn, Linas,

many thanks for clarifying this tricky topic! Hope this information can 
serve a good starting point for Jinjian to develop a proper solution for 
that case.

On 03.11.2024 03:24, Linas Vepstas wrote:
> Top-post reply.
> 
> What Bjorn says is exactly right. Just one clarifying remark: When PCI
> error recovery was designed, it was envisioned for high-end,
> high-availability servers so that they could reset a failing device
> without forcing a reboot of the OS.  Concepts like suspend/resume did
> not perturb even the unconscious sleep of the engineers. Thus,
> stepping through error recovery during suspend would be novel,
> untested, and perhaps prone to confusion. The error recovery procedure
> is trying to reset the device into a fully-powered-on,
> fully-functional and connected state. This might be problematic if the
> suspend has already walked half-way through power-down. The fact that
> error recovery might run a very long fraction of a second after the
> error is detected further complicates things. The fact that error
> recovery  usually has the driver fully reinitializing the device,
> including reloading the firmware, could be a problem if the firmware
> is not in RAM or is sitting on a suspended block device. So it all
> could be an adventure.
> 
> As to testing: I asked one of the guys in the lab how he did it, and
> he said he would brush a metal wire against the PCI pins. I winced.
> But I guess it's cleaner than pouring coffee on it. Later we developed
> a software tool to artificially inject errors.
> 
> On Fri, Nov 1, 2024 at 8:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>> On Thu, Oct 31, 2024 at 09:09:30PM +0800, Jinjian Song wrote:
>>> From: Sergey Ryazanov <ryazanov.s.a@gmail.com>
>>>> On 29.10.2024 05:46, Jinjian Song wrote:
>>>>> From: Sergey Ryazanov <ryazanov.s.a@gmail.com>
>>>>>> On 22.10.2024 11:43, Jinjian Song wrote:
>>>>>>> If driver fails to set the device to suspend, it means that the
>>>>>>> device is abnormal. In this case, reset the device to recover
>>>>>>> when PCIe device is offline.
>>>>>>
>>>>>> Is it a reproducible or a speculative issue? Does the fix
>>>>>> recover modem from a problematic state?
>>>>>>
>>>>>> Anyway we need someone more familiar with this hardware (Intel
>>>>>> or MediaTek engineer) to Ack the change to make sure we are not
>>>>>> going to put a system in a more complicated state.
>>>>>
>>>>> This is a very difficult issue to replicate onece occured and fixed.
>>>>>
>>>>> The issue occured when driver and device lost the connection. I have
>>>>> encountered this problem twice so far:
>>>>> 1. During suspend/resume stress test, there was a probabilistic D3L2
>>>>> time sequence issue with the BIOS, result in PCIe link down, driver
>>>>> read and write the register of device invalid, so suspend failed.
>>>>> This issue was eventually fixed in the BIOS and I was able to restore
>>>>> it through the reset module after reproducing the problem.
>>>>>
>>>>> 2. During idle test, the modem probabilistic hang up, result in PCIe
>>>>> link down, driver read and write the register of device invalid, so
>>>>> suspend failed. This issue was eventually fiex in device modem firmware
>>>>> by adjust a certain power supply voltage, and reset modem as a workround
>>>>> to restore when the MBIM port command timeout in userspace applycations.
>>>>>
>>>>> Hardware reset modem to recover was discussed with MTK, and they said
>>>>> that if we don't want to keep the on-site problem location in case of
>>>>> suspend failure, we can use the recover solution.
>>>>> Both the ocurred issues result in the PCIe link issue, driver can't
>>>>> read and writer the register of WWAN device, so I want to add this
>>>>> path
>>>>> to restore, hardware reset modem can recover modem, but using the
>>>>> pci_channle_offline() as the judgment is my inference.
>>>>
>>>> Thank you for the clarification. Let me summarize what I've understood
>>>> from the explanation:
>>>> a) there were hardware (firmware) issues,
>>>> b) issues already were solved,
>>>> c) issues were not directly related to the device suspension procedure,
>>>> d) you want to implement a backup plan to make the modem support robust.
>>>>
>>>> If got it right, then I would like to recommend to implement a generic
>>>> error handling solution for the PCIe interface. You can check this
>>>> document: Documentation/PCI/pci-error-recovery.rst
>>>
>>> Yes, got it right.
>>> I want to identify the scenario and then recover by reset device,
>>> otherwise suspend failure will aways prevent the system from suspending
>>> if it occurs.
>>
>> If a PCIe link goes down here's my understanding of what happens:
>>
>>    - Writes to the device are silently dropped.
>>
>>    - Reads from the device return ~0 (PCI_POSSIBLE_ERROR()).
>>
>>    - If the device is in a slot and pciehp is enabled, a Data Link
>>      Layer State Changed interrupt will cause pciehp_unconfigure_device()
>>      to detach the driver and remove the pci_dev.
>>
>>    - If AER is enabled, a failed access to the device will cause an AER
>>      interrupt.  If the driver has registered pci_error_handlers, the
>>      driver callbacks will be called, and based on what the driver
>>      returns, the PCI core may reset the device.
>>
>> The pciehp and AER interrupts are *asynchronous* to link down events
>> and to any driver access to the device, so they may be delayed an
>> arbitrary amount of time.
>>
>> Both interrupt paths may lead to the device being marked as "offline".
>> Obviously this is asynchronous with respect to the driver.
>>
>>>>>>> +++ b/drivers/net/wwan/t7xx/t7xx_pci.c
>>>>>>> @@ -427,6 +427,10 @@ static int __t7xx_pci_pm_suspend(struct
>>>>>>> pci_dev *pdev)
>>>>>>>        iowrite32(T7XX_L1_BIT(0), IREG_BASE(t7xx_dev) +
>>>>>>> ENABLE_ASPM_LOWPWR);
>>>>>>>        atomic_set(&t7xx_dev->md_pm_state, MTK_PM_RESUMED);
>>>>>>>        t7xx_pcie_mac_set_int(t7xx_dev, SAP_RGU_INT);
>>>>>>> +    if (pci_channel_offline(pdev)) {
>>>>>>> +        dev_err(&pdev->dev, "Device offline, reset to recover\n");
>>>>>>> +        t7xx_reset_device(t7xx_dev, PLDR);
>>>>>>> +    }
>>
>> This looks like an unreliable way to detect issues.  It only works if
>> AER is enabled, and the device is only marked "offline" some arbitrary
>> length of time *after* a driver access to the device has failed.
>>
>> You can't reliably detect errors on writes to the device.
>>
>> You can only reliably detect errors on reads from the device by
>> looking for PCI_POSSIBLE_ERROR().  Obviously ~0 might be a valid value
>> to read from some registers, so you need device-specific knowledge to
>> know whether ~0 is valid or indicates an error.
>>
>> If AER or DPC are enabled, the driver can be *notified* about read
>> errors and some write errors via pci_error_handlers, but the
>> notification is long after the error.

--
Sergey
diff mbox series

Patch

diff --git a/drivers/net/wwan/t7xx/t7xx_pci.c b/drivers/net/wwan/t7xx/t7xx_pci.c
index e556e5bd49ab..4f89a353588b 100644
--- a/drivers/net/wwan/t7xx/t7xx_pci.c
+++ b/drivers/net/wwan/t7xx/t7xx_pci.c
@@ -427,6 +427,10 @@  static int __t7xx_pci_pm_suspend(struct pci_dev *pdev)
 	iowrite32(T7XX_L1_BIT(0), IREG_BASE(t7xx_dev) + ENABLE_ASPM_LOWPWR);
 	atomic_set(&t7xx_dev->md_pm_state, MTK_PM_RESUMED);
 	t7xx_pcie_mac_set_int(t7xx_dev, SAP_RGU_INT);
+	if (pci_channel_offline(pdev)) {
+		dev_err(&pdev->dev, "Device offline, reset to recover\n");
+		t7xx_reset_device(t7xx_dev, PLDR);
+	}
 	return ret;
 }