[v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel

Message ID	1589251566-32126-1-git-send-email-pkushwaha@marvell.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=E2i/=62=lists.infradead.org=linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 63E642070B From: Prabhakar Kushwaha <pkushwaha@marvell.com> To: <linux-arm-kernel@lists.infradead.org>, <kexec@lists.infradead.org>, <robin.murphy@arm.com>, <maz@kernel.org>, <will@kernel.org> Subject: [PATCH][v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel Date: Mon, 11 May 2020 19:46:06 -0700 Message-ID: <1589251566-32126-1-git-send-email-pkushwaha@marvell.com> MIME-Version: 1.0 summary: Content analysis details: (-0.5 points) pts rule name description ---- ---------------------- -------------------------------------------------- -0.7 RCVD_IN_DNSWL_LOW RBL: Sender listed at https://www.dnswl.org/, low trust [67.231.148.174 listed in list.dnswl.org] -0.0 SPF_PASS SPF: sender matches SPF record 0.0 SPF_HELO_NONE SPF: HELO does not publish an SPF Record 0.1 DKIM_SIGNED Message has a DKIM or DK signature, not necessarily valid 0.1 DKIM_INVALID DKIM or DK signature exists, but is not valid Precedence: list Cc: bhsharma@redhat.com, Prabhakar Kushwaha <pkushwaha@marvell.com>, helgaas@kernel.org, gkulkarni@marvell.com, prabhakar.pkin@gmail.com Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org
Series	[v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel \| expand [v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel

Prabhakar Kushwaha May 12, 2020, 2:46 a.m. UTC

An SMMU Stream table is created by the primary kernel. This table is
used by the SMMU to perform address translations for device-originated
transactions. Any crash (if happened) launches the kdump kernel which
re-creates the SMMU Stream table. New transactions will be translated
via this new table.

There are scenarios, where devices are still having old pending
transactions (configured in the primary kernel). These transactions
come in-between Stream table creation and device-driver probe.
As new stream table does not have entry for older transactions,
it will be aborted by SMMU.

Similar observations were found with PCIe-Intel 82576 Gigabit
Network card. It sends old Memory Read transaction in kdump kernel.
Transactions configured for older Stream table entries, that do not
exist any longer in the new table, will cause a PCIe Completion Abort.
Returned PCIe completion abort further leads to AER Errors from APEI
Generic Hardware Error Source (GHES) with completion timeout.
A network device hang is observed even after continuous
reset/recovery from driver, Hence device is no more usable.

So, If we are in a kdump kernel try to copy SMMU Stream table from
primary/old kernel to preserve the mappings until the device driver
takes over.

Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
---
Changes for v2: Used memremap in-place of ioremap

V2 patch has been sanity tested. 

V1 patch has been tested with
A) PCIe-Intel 82576 Gigabit Network card in following
configurations with "no AER error". Each iteration has
been tested on both Suse kdump rfs And default Centos distro rfs.

 1)  with 2 level stream table 
       ----------------------------------------------------
       SMMU               |  Normal Ping   | Flood Ping
       -----------------------------------------------------
       Default Operation  |  100 times     | 10 times
       -----------------------------------------------------
       IOMMU bypass       |  41 times      | 10 times
       -----------------------------------------------------

 2)  with Linear stream table. 
       -----------------------------------------------------
       SMMU               |  Normal Ping   | Flood Ping
       ------------------------------------------------------
       Default Operation  |  100 times     | 10 times
       ------------------------------------------------------
       IOMMU bypass       |  55 times      | 10 times
       -------------------------------------------------------

B) This patch is also tested with Micron Technology Inc 9200 PRO NVMe
SSD card with 2 level stream table using "fio" in mixed read/write and
only read configurations. It is tested for both Default Operation and
IOMMU bypass mode for minimum 10 iterations across Centos kdump rfs and
default Centos ditstro rfs.

This patch is not full proof solution. Issue can still come
from the point device is discovered and driver probe called. 
This patch has reduced window of scenario from "SMMU Stream table 
creation - device-driver" to "device discovery - device-driver".
Usually, device discovery to device-driver is very small time. So
the probability is very low. 

Note: device-discovery will overwrite existing stream table entries 
with both SMMU stage as by-pass.


 drivers/iommu/arm-smmu-v3.c | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

Bjorn Helgaas May 12, 2020, 10:03 p.m. UTC | #1

[+cc linux-pci]

On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> An SMMU Stream table is created by the primary kernel. This table is
> used by the SMMU to perform address translations for device-originated
> transactions. Any crash (if happened) launches the kdump kernel which
> re-creates the SMMU Stream table. New transactions will be translated
> via this new table.
> 
> There are scenarios, where devices are still having old pending
> transactions (configured in the primary kernel). These transactions
> come in-between Stream table creation and device-driver probe.
> As new stream table does not have entry for older transactions,
> it will be aborted by SMMU.
> 
> Similar observations were found with PCIe-Intel 82576 Gigabit
> Network card. It sends old Memory Read transaction in kdump kernel.
> Transactions configured for older Stream table entries, that do not
> exist any longer in the new table, will cause a PCIe Completion Abort.

That sounds like exactly what we want, doesn't it?

Or do you *want* DMA from the previous kernel to complete?  That will
read or scribble on something, but maybe that's not terrible as long
as it's not memory used by the kdump kernel.

> Returned PCIe completion abort further leads to AER Errors from APEI
> Generic Hardware Error Source (GHES) with completion timeout.
> A network device hang is observed even after continuous
> reset/recovery from driver, Hence device is no more usable.

The fact that the device is no longer usable is definitely a problem.
But in principle we *should* be able to recover from these errors.  If
we could recover and reliably use the device after the error, that
seems like it would be a more robust solution that having to add
special cases in every IOMMU driver.

If you have details about this sort of error, I'd like to try to fix
it because we want to recover from that sort of error in normal
(non-crash) situations as well.

> So, If we are in a kdump kernel try to copy SMMU Stream table from
> primary/old kernel to preserve the mappings until the device driver
> takes over.
> 
> Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
> ---
> Changes for v2: Used memremap in-place of ioremap
> 
> V2 patch has been sanity tested. 
> 
> V1 patch has been tested with
> A) PCIe-Intel 82576 Gigabit Network card in following
> configurations with "no AER error". Each iteration has
> been tested on both Suse kdump rfs And default Centos distro rfs.
> 
>  1)  with 2 level stream table 
>        ----------------------------------------------------
>        SMMU               |  Normal Ping   | Flood Ping
>        -----------------------------------------------------
>        Default Operation  |  100 times     | 10 times
>        -----------------------------------------------------
>        IOMMU bypass       |  41 times      | 10 times
>        -----------------------------------------------------
> 
>  2)  with Linear stream table. 
>        -----------------------------------------------------
>        SMMU               |  Normal Ping   | Flood Ping
>        ------------------------------------------------------
>        Default Operation  |  100 times     | 10 times
>        ------------------------------------------------------
>        IOMMU bypass       |  55 times      | 10 times
>        -------------------------------------------------------
> 
> B) This patch is also tested with Micron Technology Inc 9200 PRO NVMe
> SSD card with 2 level stream table using "fio" in mixed read/write and
> only read configurations. It is tested for both Default Operation and
> IOMMU bypass mode for minimum 10 iterations across Centos kdump rfs and
> default Centos ditstro rfs.
> 
> This patch is not full proof solution. Issue can still come
> from the point device is discovered and driver probe called. 
> This patch has reduced window of scenario from "SMMU Stream table 
> creation - device-driver" to "device discovery - device-driver".
> Usually, device discovery to device-driver is very small time. So
> the probability is very low. 
> 
> Note: device-discovery will overwrite existing stream table entries 
> with both SMMU stage as by-pass.
> 
> 
>  drivers/iommu/arm-smmu-v3.c | 36 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 35 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 82508730feb7..d492d92c2dd7 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -1847,7 +1847,13 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  			break;
>  		case STRTAB_STE_0_CFG_S1_TRANS:
>  		case STRTAB_STE_0_CFG_S2_TRANS:
> -			ste_live = true;
> +			/*
> +			 * As kdump kernel copy STE table from previous
> +			 * kernel. It still may have valid stream table entries.
> +			 * Forcing entry as false to allow overwrite.
> +			 */
> +			if (!is_kdump_kernel())
> +				ste_live = true;
>  			break;
>  		case STRTAB_STE_0_CFG_ABORT:
>  			BUG_ON(!disable_bypass);
> @@ -3264,6 +3270,9 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
>  		return -ENOMEM;
>  	}
>  
> +	if (is_kdump_kernel())
> +		return 0;
> +
>  	for (i = 0; i < cfg->num_l1_ents; ++i) {
>  		arm_smmu_write_strtab_l1_desc(strtab, &cfg->l1_desc[i]);
>  		strtab += STRTAB_L1_DESC_DWORDS << 3;
> @@ -3272,6 +3281,23 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
>  	return 0;
>  }
>  
> +static void arm_smmu_copy_table(struct arm_smmu_device *smmu,
> +			       struct arm_smmu_strtab_cfg *cfg, u32 size)
> +{
> +	struct arm_smmu_strtab_cfg rdcfg;
> +
> +	rdcfg.strtab_dma = readq_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE);
> +	rdcfg.strtab_base_cfg = readq_relaxed(smmu->base
> +					      + ARM_SMMU_STRTAB_BASE_CFG);
> +
> +	rdcfg.strtab_dma &= STRTAB_BASE_ADDR_MASK;
> +	rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);
> +
> +	memcpy_fromio(cfg->strtab, rdcfg.strtab, size);
> +
> +	cfg->strtab_base_cfg = rdcfg.strtab_base_cfg;
> +}
> +
>  static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
>  {
>  	void *strtab;
> @@ -3307,6 +3333,9 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
>  	reg |= FIELD_PREP(STRTAB_BASE_CFG_SPLIT, STRTAB_SPLIT);
>  	cfg->strtab_base_cfg = reg;
>  
> +	if (is_kdump_kernel())
> +		arm_smmu_copy_table(smmu, cfg, l1size);
> +
>  	return arm_smmu_init_l1_strtab(smmu);
>  }
>  
> @@ -3334,6 +3363,11 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
>  	reg |= FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE, smmu->sid_bits);
>  	cfg->strtab_base_cfg = reg;
>  
> +	if (is_kdump_kernel()) {
> +		arm_smmu_copy_table(smmu, cfg, size);
> +		return 0;
> +	}
> +
>  	arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents);
>  	return 0;
>  }
> -- 
> 2.18.2
>

Prabhakar Kushwaha May 14, 2020, 7:17 a.m. UTC | #2

Thanks Bjorn for replying on this thread.

On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc linux-pci]
>
> On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > An SMMU Stream table is created by the primary kernel. This table is
> > used by the SMMU to perform address translations for device-originated
> > transactions. Any crash (if happened) launches the kdump kernel which
> > re-creates the SMMU Stream table. New transactions will be translated
> > via this new table..
> >
> > There are scenarios, where devices are still having old pending
> > transactions (configured in the primary kernel). These transactions
> > come in-between Stream table creation and device-driver probe.
> > As new stream table does not have entry for older transactions,
> > it will be aborted by SMMU.
> >
> > Similar observations were found with PCIe-Intel 82576 Gigabit
> > Network card. It sends old Memory Read transaction in kdump kernel.
> > Transactions configured for older Stream table entries, that do not
> > exist any longer in the new table, will cause a PCIe Completion Abort.
>
> That sounds like exactly what we want, doesn't it?
>
> Or do you *want* DMA from the previous kernel to complete?  That will
> read or scribble on something, but maybe that's not terrible as long
> as it's not memory used by the kdump kernel.
>

Yes, Abort should happen. But it should happen in context of driver.
But current abort is happening because of SMMU and no driver/pcie
setup present at this moment.

Solution of this issue should be at 2 place
a) SMMU level: I still believe, this patch has potential to overcome
issue till finally driver's probe takeover.
b) Device level: Even if something goes wrong. Driver/device should
able to recover.


> > Returned PCIe completion abort further leads to AER Errors from APEI
> > Generic Hardware Error Source (GHES) with completion timeout.
> > A network device hang is observed even after continuous
> > reset/recovery from driver, Hence device is no more usable.
>
> The fact that the device is no longer usable is definitely a problem.
> But in principle we *should* be able to recover from these errors.  If
> we could recover and reliably use the device after the error, that
> seems like it would be a more robust solution that having to add
> special cases in every IOMMU driver.
>
> If you have details about this sort of error, I'd like to try to fix
> it because we want to recover from that sort of error in normal
> (non-crash) situations as well.
>
Completion abort case should be gracefully handled.  And device should
always remain usable.

There are 2 scenario which I am testing with Ethernet card PCIe-Intel
82576 Gigabit Network card.

I)  Crash testing using kdump root file system: De-facto scenario
    -  kdump file system does not have Ethernet driver
    -  A lot of AER prints [1], making it impossible to work on shell
of kdump root file system.
    -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.

II) Crash testing using default root file system: Specific case to
test Ethernet driver in second kernel
   -  Default root file system have Ethernet driver
   -  AER error comes even before the driver probe starts.
   -  Driver does reset Ethernet card as part of probe but no success.
   -  AER also tries to recover. but no success.  [2]
   -  I also tries to remove AER errors by using "pci=noaer" bootargs
and commenting ghes_handle_aer() from GHES driver..
          than different set of errors come which also never able to recover [3]

As per my understanding, possible solutions are
 - Copy SMMU table i.e. this patch
OR
 - Doing pci_reset_function() during enumeration phase.
I also tried clearing "M" bit using pci_clear_master during
enumeration but it did not help. Because driver re-set M bit causing
same AER error again.


-pk

---------------------------------------------------------------------------------------------------------------------------
[1] with bootargs having pci=noaer

[   22.494648] {4}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 1
[   22.512773] {4}[Hardware Error]: event severity: recoverable
[   22.518419] {4}[Hardware Error]:  Error 0, type: recoverable
[   22.544804] {4}[Hardware Error]:   section_type: PCIe error
[   22.550363] {4}[Hardware Error]:   port_type: 0, PCIe end point
[   22.556268] {4}[Hardware Error]:   version: 3.0
[   22.560785] {4}[Hardware Error]:   command: 0x0507, status: 0x4010
[   22.576852] {4}[Hardware Error]:   device_id: 0000:09:00.1
[   22.582323] {4}[Hardware Error]:   slot: 0
[   22.586406] {4}[Hardware Error]:   secondary_bus: 0x00
[   22.591530] {4}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   22.608900] {4}[Hardware Error]:   class_code: 000002
[   22.613938] {4}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   22.803534] pci 0000:09:00.1: AER: aer_status: 0x00004000,
aer_mask: 0x00000000
[   22.810838] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
[   22.817613] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
aer_agent=Requester ID
[   22.847374] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
[   22.866161] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED,
total mem (8153768 kB)
[   22.946178] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
[   22.995142] pci 0000:09:00.1: AER: can't recover (no error_detected callback)
[   23.002300] pcieport 0000:00:09.0: AER: device recovery failed
[   23.027607] pci 0000:09:00.1: AER: aer_status: 0x00004000,
aer_mask: 0x00000000
[   23.044109] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
[   23.060713] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
aer_agent=Requester ID
[   23.068616] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
[   23.122056] pci 0000:09:00.0: AER: can't recover (no error_detected callback)


----------------------------------------------------------------------------------------------------------------------------
[2] Normal bootargs.

[   54.252454] {6}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 1
[   54.265827] {6}[Hardware Error]: event severity: recoverable
[   54.271473] {6}[Hardware Error]:  Error 0, type: recoverable
[   54.281605] {6}[Hardware Error]:   section_type: PCIe error
[   54.287163] {6}[Hardware Error]:   port_type: 0, PCIe end point
[   54.296955] {6}[Hardware Error]:   version: 3.0
[   54.301471] {6}[Hardware Error]:   command: 0x0507, status: 0x4010
[   54.312520] {6}[Hardware Error]:   device_id: 0000:09:00.1
[   54.317991] {6}[Hardware Error]:   slot: 0
[   54.322074] {6}[Hardware Error]:   secondary_bus: 0x00
[   54.327197] {6}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   54.333797] {6}[Hardware Error]:   class_code: 000002
[   54.351312] {6}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   54.358001] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[   54.376852] pcieport 0000:00:09.0: AER: device recovery successful
[   54.383034] igb 0000:09:00.1: AER: aer_status: 0x00004000,
aer_mask: 0x00000000
[   54.390348] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
[   54.397144] igb 0000:09:00.1: AER: aer_layer=Transaction Layer,
aer_agent=Requester ID
[   54.409555] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
[   54.551370] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[   54.705214] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[   54.758703] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[   54.865445] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[   54.888751] pcieport 0000:00:09.0: AER: device recovery successful
[   54.894933] igb 0000:09:00.1: AER: aer_status: 0x00004000,
aer_mask: 0x00000000
[   54.902228] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
[   54.916059] igb 0000:09:00.1: AER: aer_layer=Transaction Layer,
aer_agent=Requester ID
[   54.923972] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
[   55.057272] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[  274.571401] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[  274.686138] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[  274.786134] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[  274.886141] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[  397.792897] Workqueue: events aer_recover_work_func
[  397.797760] Call trace:
[  397.800199]  __switch_to+0xcc/0x108
[  397.803675]  __schedule+0x2c0/0x700
[  397.807150]  schedule+0x58/0xe8
[  397.810283]  schedule_preempt_disabled+0x18/0x28
[  397.810788] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[  397.814887]  __mutex_lock.isra.9+0x288/0x5c8
[  397.814890]  __mutex_lock_slowpath+0x1c/0x28
[  397.830962]  mutex_lock+0x4c/0x68
[  397.834264]  report_slot_reset+0x30/0xa0
[  397.838178]  pci_walk_bus+0x68/0xc0
[  397.841653]  pcie_do_recovery+0xe8/0x248
[  397.845562]  aer_recover_work_func+0x100/0x138
[  397.849995]  process_one_work+0x1bc/0x458
[  397.853991]  worker_thread+0x150/0x500
[  397.857727]  kthread+0x114/0x118
[  397.860945]  ret_from_fork+0x10/0x18
[  397.864525] INFO: task kworker/223:2:2939 blocked for more than 122 seconds.
[  397.871564]       Not tainted 5.7.0-rc3+ #68
[  397.875819] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  397.883638] kworker/223:2   D    0  2939      2 0x00000228
[  397.889121] Workqueue: ipv6_addrconf addrconf_verify_work
[  397.894505] Call trace:
[  397.896940]  __switch_to+0xcc/0x108
[  397.900419]  __schedule+0x2c0/0x700
[  397.903894]  schedule+0x58/0xe8
[  397.907023]  schedule_preempt_disabled+0x18/0x28
[  397.910798] AER: AER recover: Buffer overflow when recovering AER
for 0000:09:00:1
[  397.911630]  __mutex_lock.isra.9+0x288/0x5c8
[  397.923440]  __mutex_lock_slowpath+0x1c/0x28
[  397.927696]  mutex_lock+0x4c/0x68
[  397.931005]  rtnl_lock+0x24/0x30
[  397.934220]  addrconf_verify_work+0x18/0x30
[  397.938394]  process_one_work+0x1bc/0x458
[  397.942390]  worker_thread+0x150/0x500
[  397.946126]  kthread+0x114/0x118
[  397.949345]  ret_from_fork+0x10/0x18

---------------------------------------------------------------------------------------------------------------------------------
[3] with bootargs as pci=noaer and comment ghes_halder_aer() from AER driver

[   69.037035] igb 0000:09:00.1 enp9s0f1: Reset adapter
[   69.348446] {9}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 0
[   69.356698] {9}[Hardware Error]: It has been corrected by h/w and
requires no further action
[   69.365121] {9}[Hardware Error]: event severity: corrected
[   69.370593] {9}[Hardware Error]:  Error 0, type: corrected
[   69.376064] {9}[Hardware Error]:   section_type: PCIe error
[   69.381623] {9}[Hardware Error]:   port_type: 4, root port
[   69.387094] {9}[Hardware Error]:   version: 3.0
[   69.391611] {9}[Hardware Error]:   command: 0x0106, status: 0x4010
[   69.397777] {9}[Hardware Error]:   device_id: 0000:00:09.0
[   69.403248] {9}[Hardware Error]:   slot: 0
[   69.407331] {9}[Hardware Error]:   secondary_bus: 0x09
[   69.412455] {9}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
[   69.419055] {9}[Hardware Error]:   class_code: 000406
[   69.424093] {9}[Hardware Error]:   bridge: secondary_status:
0x6000, control: 0x0002
[   72.118132] igb 0000:09:00.1 enp9s0f1: igb: enp9s0f1 NIC Link is Up
1000 Mbps Full Duplex, Flow Control: RX
[   73.995068] igb 0000:09:00.1: Detected Tx Unit Hang
[   73.995068]   Tx Queue             <2>
[   73.995068]   TDH                  <0>
[   73.995068]   TDT                  <1>
[   73.995068]   next_to_use          <1>
[   73.995068]   next_to_clean        <0>
[   73.995068] buffer_info[next_to_clean]
[   73.995068]   time_stamp           <ffff9c1a>
[   73.995068]   next_to_watch        <0000000097d42934>
[   73.995068]   jiffies              <ffff9cd0>
[   73.995068]   desc.status          <168000>
[   75.987323] igb 0000:09:00.1: Detected Tx Unit Hang
[   75.987323]   Tx Queue             <2>
[   75.987323]   TDH                  <0>
[   75.987323]   TDT                  <1>
[   75.987323]   next_to_use          <1>
[   75.987323]   next_to_clean        <0>
[   75.987323] buffer_info[next_to_clean]
[   75.987323]   time_stamp           <ffff9c1a>
[   75.987323]   next_to_watch        <0000000097d42934>
[   75.987323]   jiffies              <ffff9d98>
[   75.987323]   desc.status          <168000>
[   77.952661] {10}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 1
[   77.971790] {10}[Hardware Error]: event severity: recoverable
[   77.977522] {10}[Hardware Error]:  Error 0, type: recoverable
[   77.983254] {10}[Hardware Error]:   section_type: PCIe error
[   77.999930] {10}[Hardware Error]:   port_type: 0, PCIe end point
[   78.005922] {10}[Hardware Error]:   version: 3.0
[   78.010526] {10}[Hardware Error]:   command: 0x0507, status: 0x4010
[   78.016779] {10}[Hardware Error]:   device_id: 0000:09:00.1
[   78.033107] {10}[Hardware Error]:   slot: 0
[   78.037276] {10}[Hardware Error]:   secondary_bus: 0x00
[   78.066253] {10}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   78.072940] {10}[Hardware Error]:   class_code: 000002
[   78.078064] {10}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   78.096202] igb 0000:09:00.1: Detected Tx Unit Hang
[   78.096202]   Tx Queue             <2>
[   78.096202]   TDH                  <0>
[   78.096202]   TDT                  <1>
[   78.096202]   next_to_use          <1>
[   78.096202]   next_to_clean        <0>
[   78.096202] buffer_info[next_to_clean]
[   78.096202]   time_stamp           <ffff9c1a>
[   78.096202]   next_to_watch        <0000000097d42934>
[   78.096202]   jiffies              <ffff9e6a>
[   78.096202]   desc.status          <168000>
[   79.587406] {11}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 0
[   79.595744] {11}[Hardware Error]: It has been corrected by h/w and
requires no further action
[   79.604254] {11}[Hardware Error]: event severity: corrected
[   79.609813] {11}[Hardware Error]:  Error 0, type: corrected
[   79.615371] {11}[Hardware Error]:   section_type: PCIe error
[   79.621016] {11}[Hardware Error]:   port_type: 4, root port
[   79.626574] {11}[Hardware Error]:   version: 3.0
[   79.631177] {11}[Hardware Error]:   command: 0x0106, status: 0x4010
[   79.637430] {11}[Hardware Error]:   device_id: 0000:00:09.0
[   79.642988] {11}[Hardware Error]:   slot: 0
[   79.647157] {11}[Hardware Error]:   secondary_bus: 0x09
[   79.652368] {11}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
[   79.659055] {11}[Hardware Error]:   class_code: 000406
[   79.664180] {11}[Hardware Error]:   bridge: secondary_status:
0x6000, control: 0x0002
[   79.987052] igb 0000:09:00.1: Detected Tx Unit Hang
[   79.987052]   Tx Queue             <2>
[   79.987052]   TDH                  <0>
[   79.987052]   TDT                  <1>
[   79.987052]   next_to_use          <1>
[   79.987052]   next_to_clean        <0>
[   79.987052] buffer_info[next_to_clean]
[   79.987052]   time_stamp           <ffff9c1a>
[   79.987052]   next_to_watch        <0000000097d42934>
[   79.987052]   jiffies              <ffff9f28>
[   79.987052]   desc.status          <168000>
[   79.987056] igb 0000:09:00.1: Detected Tx Unit Hang
[   79.987056]   Tx Queue             <3>
[   79.987056]   TDH                  <0>
[   79.987056]   TDT                  <1>
[   79.987056]   next_to_use          <1>
[   79.987056]   next_to_clean        <0>
[   79.987056] buffer_info[next_to_clean]
[   79.987056]   time_stamp           <ffff9e43>
[   79.987056]   next_to_watch        <000000008da33deb>
[   79.987056]   jiffies              <ffff9f28>
[   79.987056]   desc.status          <514000>
[   81.986688] igb 0000:09:00.1 enp9s0f1: Reset adapter
[   81.986842] igb 0000:09:00.1: Detected Tx Unit Hang
[   81.986842]   Tx Queue             <2>
[   81.986842]   TDH                  <0>
[   81.986842]   TDT                  <1>
[   81.986842]   next_to_use          <1>
[   81.986842]   next_to_clean        <0>
[   81.986842] buffer_info[next_to_clean]
[   81.986842]   time_stamp           <ffff9c1a>
[   81.986842]   next_to_watch        <0000000097d42934>
[   81.986842]   jiffies              <ffff9ff0>
[   81.986842]   desc.status          <168000>
[   81.986844] igb 0000:09:00.1: Detected Tx Unit Hang
[   81.986844]   Tx Queue             <3>
[   81.986844]   TDH                  <0>
[   81.986844]   TDT                  <1>
[   81.986844]   next_to_use          <1>
[   81.986844]   next_to_clean        <0>
[   81.986844] buffer_info[next_to_clean]
[   81.986844]   time_stamp           <ffff9e43>
[   81.986844]   next_to_watch        <000000008da33deb>
[   81.986844]   jiffies              <ffff9ff0>
[   81.986844]   desc.status          <514000>
[   85.346515] {12}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 0
[   85.354854] {12}[Hardware Error]: It has been corrected by h/w and
requires no further action
[   85.363365] {12}[Hardware Error]: event severity: corrected
[   85.368924] {12}[Hardware Error]:  Error 0, type: corrected
[   85.374483] {12}[Hardware Error]:   section_type: PCIe error
[   85.380129] {12}[Hardware Error]:   port_type: 0, PCIe end point
[   85.386121] {12}[Hardware Error]:   version: 3.0
[   85.390725] {12}[Hardware Error]:   command: 0x0507, status: 0x0010
[   85.396980] {12}[Hardware Error]:   device_id: 0000:09:00.0
[   85.402540] {12}[Hardware Error]:   slot: 0
[   85.406710] {12}[Hardware Error]:   secondary_bus: 0x00
[   85.411921] {12}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   85.418609] {12}[Hardware Error]:   class_code: 000002
[   85.423733] {12}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   85.826695] igb 0000:09:00.1 enp9s0f1: igb: enp9s0f1 NIC Link is Up
1000 Mbps Full Duplex, Flow Control: RX





> > So, If we are in a kdump kernel try to copy SMMU Stream table from
> > primary/old kernel to preserve the mappings until the device driver
> > takes over.
> >
> > Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
> > ---
> > Changes for v2: Used memremap in-place of ioremap
> >
> > V2 patch has been sanity tested.
> >
> > V1 patch has been tested with
> > A) PCIe-Intel 82576 Gigabit Network card in following
> > configurations with "no AER error". Each iteration has
> > been tested on both Suse kdump rfs And default Centos distro rfs.
> >
> >  1)  with 2 level stream table
> >        ----------------------------------------------------
> >        SMMU               |  Normal Ping   | Flood Ping
> >        -----------------------------------------------------
> >        Default Operation  |  100 times     | 10 times
> >        -----------------------------------------------------
> >        IOMMU bypass       |  41 times      | 10 times
> >        -----------------------------------------------------
> >
> >  2)  with Linear stream table.
> >        -----------------------------------------------------
> >        SMMU               |  Normal Ping   | Flood Ping
> >        ------------------------------------------------------
> >        Default Operation  |  100 times     | 10 times
> >        ------------------------------------------------------
> >        IOMMU bypass       |  55 times      | 10 times
> >        -------------------------------------------------------
> >
> > B) This patch is also tested with Micron Technology Inc 9200 PRO NVMe
> > SSD card with 2 level stream table using "fio" in mixed read/write and
> > only read configurations. It is tested for both Default Operation and
> > IOMMU bypass mode for minimum 10 iterations across Centos kdump rfs and
> > default Centos ditstro rfs.
> >
> > This patch is not full proof solution. Issue can still come
> > from the point device is discovered and driver probe called.
> > This patch has reduced window of scenario from "SMMU Stream table
> > creation - device-driver" to "device discovery - device-driver".
> > Usually, device discovery to device-driver is very small time. So
> > the probability is very low.
> >
> > Note: device-discovery will overwrite existing stream table entries
> > with both SMMU stage as by-pass.
> >
> >
> >  drivers/iommu/arm-smmu-v3.c | 36 +++++++++++++++++++++++++++++++++++-
> >  1 file changed, 35 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > index 82508730feb7..d492d92c2dd7 100644
> > --- a/drivers/iommu/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm-smmu-v3.c
> > @@ -1847,7 +1847,13 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
> >                       break;
> >               case STRTAB_STE_0_CFG_S1_TRANS:
> >               case STRTAB_STE_0_CFG_S2_TRANS:
> > -                     ste_live = true;
> > +                     /*
> > +                      * As kdump kernel copy STE table from previous
> > +                      * kernel. It still may have valid stream table entries.
> > +                      * Forcing entry as false to allow overwrite.
> > +                      */
> > +                     if (!is_kdump_kernel())
> > +                             ste_live = true;
> >                       break;
> >               case STRTAB_STE_0_CFG_ABORT:
> >                       BUG_ON(!disable_bypass);
> > @@ -3264,6 +3270,9 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> >               return -ENOMEM;
> >       }
> >
> > +     if (is_kdump_kernel())
> > +             return 0;
> > +
> >       for (i = 0; i < cfg->num_l1_ents; ++i) {
> >               arm_smmu_write_strtab_l1_desc(strtab, &cfg->l1_desc[i]);
> >               strtab += STRTAB_L1_DESC_DWORDS << 3;
> > @@ -3272,6 +3281,23 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> >       return 0;
> >  }
> >
> > +static void arm_smmu_copy_table(struct arm_smmu_device *smmu,
> > +                            struct arm_smmu_strtab_cfg *cfg, u32 size)
> > +{
> > +     struct arm_smmu_strtab_cfg rdcfg;
> > +
> > +     rdcfg.strtab_dma = readq_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE);
> > +     rdcfg.strtab_base_cfg = readq_relaxed(smmu->base
> > +                                           + ARM_SMMU_STRTAB_BASE_CFG);
> > +
> > +     rdcfg.strtab_dma &= STRTAB_BASE_ADDR_MASK;
> > +     rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);
> > +
> > +     memcpy_fromio(cfg->strtab, rdcfg.strtab, size);
> > +
> > +     cfg->strtab_base_cfg = rdcfg.strtab_base_cfg;
> > +}
> > +
> >  static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
> >  {
> >       void *strtab;
> > @@ -3307,6 +3333,9 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
> >       reg |= FIELD_PREP(STRTAB_BASE_CFG_SPLIT, STRTAB_SPLIT);
> >       cfg->strtab_base_cfg = reg;
> >
> > +     if (is_kdump_kernel())
> > +             arm_smmu_copy_table(smmu, cfg, l1size);
> > +
> >       return arm_smmu_init_l1_strtab(smmu);
> >  }
> >
> > @@ -3334,6 +3363,11 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
> >       reg |= FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE, smmu->sid_bits);
> >       cfg->strtab_base_cfg = reg;
> >
> > +     if (is_kdump_kernel()) {
> > +             arm_smmu_copy_table(smmu, cfg, size);
> > +             return 0;
> > +     }
> > +
> >       arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents);
> >       return 0;
> >  }
> > --
> > 2.18.2
> >

Will Deacon May 18, 2020, 3:55 p.m. UTC | #3

On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> An SMMU Stream table is created by the primary kernel. This table is
> used by the SMMU to perform address translations for device-originated
> transactions. Any crash (if happened) launches the kdump kernel which
> re-creates the SMMU Stream table. New transactions will be translated
> via this new table.
> 
> There are scenarios, where devices are still having old pending
> transactions (configured in the primary kernel). These transactions
> come in-between Stream table creation and device-driver probe.
> As new stream table does not have entry for older transactions,
> it will be aborted by SMMU.
> 
> Similar observations were found with PCIe-Intel 82576 Gigabit
> Network card. It sends old Memory Read transaction in kdump kernel.
> Transactions configured for older Stream table entries, that do not
> exist any longer in the new table, will cause a PCIe Completion Abort.
> Returned PCIe completion abort further leads to AER Errors from APEI
> Generic Hardware Error Source (GHES) with completion timeout.
> A network device hang is observed even after continuous
> reset/recovery from driver, Hence device is no more usable.
> 
> So, If we are in a kdump kernel try to copy SMMU Stream table from
> primary/old kernel to preserve the mappings until the device driver
> takes over.
> 
> Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
> ---
> Changes for v2: Used memremap in-place of ioremap
> 
> V2 patch has been sanity tested.

Are you sure?

> V1 patch has been tested with
> A) PCIe-Intel 82576 Gigabit Network card in following
> configurations with "no AER error". Each iteration has
> been tested on both Suse kdump rfs And default Centos distro rfs.
> 
>  1)  with 2 level stream table 
>        ----------------------------------------------------
>        SMMU               |  Normal Ping   | Flood Ping
>        -----------------------------------------------------
>        Default Operation  |  100 times     | 10 times
>        -----------------------------------------------------
>        IOMMU bypass       |  41 times      | 10 times
>        -----------------------------------------------------
> 
>  2)  with Linear stream table. 
>        -----------------------------------------------------
>        SMMU               |  Normal Ping   | Flood Ping
>        ------------------------------------------------------
>        Default Operation  |  100 times     | 10 times
>        ------------------------------------------------------
>        IOMMU bypass       |  55 times      | 10 times
>        -------------------------------------------------------
> 
> B) This patch is also tested with Micron Technology Inc 9200 PRO NVMe
> SSD card with 2 level stream table using "fio" in mixed read/write and
> only read configurations. It is tested for both Default Operation and
> IOMMU bypass mode for minimum 10 iterations across Centos kdump rfs and
> default Centos ditstro rfs.
> 
> This patch is not full proof solution. Issue can still come
> from the point device is discovered and driver probe called. 
> This patch has reduced window of scenario from "SMMU Stream table 
> creation - device-driver" to "device discovery - device-driver".
> Usually, device discovery to device-driver is very small time. So
> the probability is very low. 
> 
> Note: device-discovery will overwrite existing stream table entries 
> with both SMMU stage as by-pass.
> 
> 
>  drivers/iommu/arm-smmu-v3.c | 36 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 35 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 82508730feb7..d492d92c2dd7 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -1847,7 +1847,13 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  			break;
>  		case STRTAB_STE_0_CFG_S1_TRANS:
>  		case STRTAB_STE_0_CFG_S2_TRANS:
> -			ste_live = true;
> +			/*
> +			 * As kdump kernel copy STE table from previous
> +			 * kernel. It still may have valid stream table entries.
> +			 * Forcing entry as false to allow overwrite.
> +			 */
> +			if (!is_kdump_kernel())
> +				ste_live = true;
>  			break;
>  		case STRTAB_STE_0_CFG_ABORT:
>  			BUG_ON(!disable_bypass);
> @@ -3264,6 +3270,9 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
>  		return -ENOMEM;
>  	}
>  
> +	if (is_kdump_kernel())
> +		return 0;
> +
>  	for (i = 0; i < cfg->num_l1_ents; ++i) {
>  		arm_smmu_write_strtab_l1_desc(strtab, &cfg->l1_desc[i]);
>  		strtab += STRTAB_L1_DESC_DWORDS << 3;
> @@ -3272,6 +3281,23 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
>  	return 0;
>  }
>  
> +static void arm_smmu_copy_table(struct arm_smmu_device *smmu,
> +			       struct arm_smmu_strtab_cfg *cfg, u32 size)
> +{
> +	struct arm_smmu_strtab_cfg rdcfg;
> +
> +	rdcfg.strtab_dma = readq_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE);
> +	rdcfg.strtab_base_cfg = readq_relaxed(smmu->base
> +					      + ARM_SMMU_STRTAB_BASE_CFG);
> +
> +	rdcfg.strtab_dma &= STRTAB_BASE_ADDR_MASK;
> +	rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);
> +
> +	memcpy_fromio(cfg->strtab, rdcfg.strtab, size);
> +
> +	cfg->strtab_base_cfg = rdcfg.strtab_base_cfg;

Sorry, but this is unacceptable. These things were allocated by the DMA API
so you can't just memcpy them around and hope for the best.

Either you reinitialise the DMA masters you care about or you disable DMA. I
don't see a viable third option.

Will

Prabhakar Kushwaha May 19, 2020, 2:54 a.m. UTC | #4

Hi Will,

Sorry, I replied 1:1. Now replying with mailing list

On Mon, May 18, 2020 at 9:25 PM Will Deacon <will@kernel.org> wrote:
>
> On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > An SMMU Stream table is created by the primary kernel. This table is
> > used by the SMMU to perform address translations for device-originated
> > transactions. Any crash (if happened) launches the kdump kernel which
> > re-creates the SMMU Stream table. New transactions will be translated
> > via this new table.
> >
> > There are scenarios, where devices are still having old pending
> > transactions (configured in the primary kernel). These transactions
> > come in-between Stream table creation and device-driver probe.
> > As new stream table does not have entry for older transactions,
> > it will be aborted by SMMU.
> >
> > Similar observations were found with PCIe-Intel 82576 Gigabit
> > Network card. It sends old Memory Read transaction in kdump kernel.
> > Transactions configured for older Stream table entries, that do not
> > exist any longer in the new table, will cause a PCIe Completion Abort.
> > Returned PCIe completion abort further leads to AER Errors from APEI
> > Generic Hardware Error Source (GHES) with completion timeout.
> > A network device hang is observed even after continuous
> > reset/recovery from driver, Hence device is no more usable.
> >
> > So, If we are in a kdump kernel try to copy SMMU Stream table from
> > primary/old kernel to preserve the mappings until the device driver
> > takes over.
> >
> > Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
> > ---
> > Changes for v2: Used memremap in-place of ioremap
> >
> > V2 patch has been sanity tested.
>
> Are you sure?
>

I tested v1 patch thoroughly.

After replacing ioremap with memremap, I tested 1-2 cycle per type.
I can test this patch thoroughly to check any kind of possible error.

> > V1 patch has been tested with
> > A) PCIe-Intel 82576 Gigabit Network card in following
> > configurations with "no AER error". Each iteration has
> > been tested on both Suse kdump rfs And default Centos distro rfs.
> >
> >  1)  with 2 level stream table
> >        ----------------------------------------------------
> >        SMMU               |  Normal Ping   | Flood Ping
> >        -----------------------------------------------------
> >        Default Operation  |  100 times     | 10 times
> >        -----------------------------------------------------
> >        IOMMU bypass       |  41 times      | 10 times
> >        -----------------------------------------------------
> >
> >  2)  with Linear stream table.
> >        -----------------------------------------------------
> >        SMMU               |  Normal Ping   | Flood Ping
> >        ------------------------------------------------------
> >        Default Operation  |  100 times     | 10 times
> >        ------------------------------------------------------
> >        IOMMU bypass       |  55 times      | 10 times
> >        -------------------------------------------------------
> >
> > B) This patch is also tested with Micron Technology Inc 9200 PRO NVMe
> > SSD card with 2 level stream table using "fio" in mixed read/write and
> > only read configurations. It is tested for both Default Operation and
> > IOMMU bypass mode for minimum 10 iterations across Centos kdump rfs and
> > default Centos ditstro rfs.
> >
> > This patch is not full proof solution. Issue can still come
> > from the point device is discovered and driver probe called.
> > This patch has reduced window of scenario from "SMMU Stream table
> > creation - device-driver" to "device discovery - device-driver".
> > Usually, device discovery to device-driver is very small time. So
> > the probability is very low.
> >
> > Note: device-discovery will overwrite existing stream table entries
> > with both SMMU stage as by-pass.
> >
> >
> >  drivers/iommu/arm-smmu-v3.c | 36 +++++++++++++++++++++++++++++++++++-
> >  1 file changed, 35 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > index 82508730feb7..d492d92c2dd7 100644
> > --- a/drivers/iommu/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm-smmu-v3.c
> > @@ -1847,7 +1847,13 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
> >                       break;
> >               case STRTAB_STE_0_CFG_S1_TRANS:
> >               case STRTAB_STE_0_CFG_S2_TRANS:
> > -                     ste_live = true;
> > +                     /*
> > +                      * As kdump kernel copy STE table from previous
> > +                      * kernel. It still may have valid stream table entries.
> > +                      * Forcing entry as false to allow overwrite.
> > +                      */
> > +                     if (!is_kdump_kernel())
> > +                             ste_live = true;
> >                       break;
> >               case STRTAB_STE_0_CFG_ABORT:
> >                       BUG_ON(!disable_bypass);
> > @@ -3264,6 +3270,9 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> >               return -ENOMEM;
> >       }
> >
> > +     if (is_kdump_kernel())
> > +             return 0;
> > +
> >       for (i = 0; i < cfg->num_l1_ents; ++i) {
> >               arm_smmu_write_strtab_l1_desc(strtab, &cfg->l1_desc[i]);
> >               strtab += STRTAB_L1_DESC_DWORDS << 3;
> > @@ -3272,6 +3281,23 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> >       return 0;
> >  }
> >
> > +static void arm_smmu_copy_table(struct arm_smmu_device *smmu,
> > +                            struct arm_smmu_strtab_cfg *cfg, u32 size)
> > +{
> > +     struct arm_smmu_strtab_cfg rdcfg;
> > +
> > +     rdcfg.strtab_dma = readq_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE);
> > +     rdcfg.strtab_base_cfg = readq_relaxed(smmu->base
> > +                                           + ARM_SMMU_STRTAB_BASE_CFG);
> > +
> > +     rdcfg.strtab_dma &= STRTAB_BASE_ADDR_MASK;
> > +     rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);
> > +
> > +     memcpy_fromio(cfg->strtab, rdcfg.strtab, size);
> > +

this need a fix. It should be memcpy.

> > +     cfg->strtab_base_cfg = rdcfg.strtab_base_cfg;
>
> Sorry, but this is unacceptable. These things were allocated by the DMA API
> so you can't just memcpy them around and hope for the best.
>

I was referring copy_context_table() in drivers/iommu/intel-iommu.c.
here i see usage of memremap and memcpy to copy older iommu table.
did I take wrong reference?

What kind of issue you are foreseeing in using memcpy(). May be we can
try to find a solution.

-pk

Bjorn Helgaas May 19, 2020, 11:22 p.m. UTC | #5

[+cc Sathy, Vijay, Myron]

On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
> On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > An SMMU Stream table is created by the primary kernel. This table is
> > > used by the SMMU to perform address translations for device-originated
> > > transactions. Any crash (if happened) launches the kdump kernel which
> > > re-creates the SMMU Stream table. New transactions will be translated
> > > via this new table..
> > >
> > > There are scenarios, where devices are still having old pending
> > > transactions (configured in the primary kernel). These transactions
> > > come in-between Stream table creation and device-driver probe.
> > > As new stream table does not have entry for older transactions,
> > > it will be aborted by SMMU.
> > >
> > > Similar observations were found with PCIe-Intel 82576 Gigabit
> > > Network card. It sends old Memory Read transaction in kdump kernel.
> > > Transactions configured for older Stream table entries, that do not
> > > exist any longer in the new table, will cause a PCIe Completion Abort.
> >
> > That sounds like exactly what we want, doesn't it?
> >
> > Or do you *want* DMA from the previous kernel to complete?  That will
> > read or scribble on something, but maybe that's not terrible as long
> > as it's not memory used by the kdump kernel.
> 
> Yes, Abort should happen. But it should happen in context of driver.
> But current abort is happening because of SMMU and no driver/pcie
> setup present at this moment.

I don't understand what you mean by "in context of driver."  The whole
problem is that we can't control *when* the abort happens, so it may
happen in *any* context.  It may happen when a NIC receives a packet
or at some other unpredictable time.

> Solution of this issue should be at 2 place
> a) SMMU level: I still believe, this patch has potential to overcome
> issue till finally driver's probe takeover.
> b) Device level: Even if something goes wrong. Driver/device should
> able to recover.
> 
> > > Returned PCIe completion abort further leads to AER Errors from APEI
> > > Generic Hardware Error Source (GHES) with completion timeout.
> > > A network device hang is observed even after continuous
> > > reset/recovery from driver, Hence device is no more usable.
> >
> > The fact that the device is no longer usable is definitely a problem.
> > But in principle we *should* be able to recover from these errors.  If
> > we could recover and reliably use the device after the error, that
> > seems like it would be a more robust solution that having to add
> > special cases in every IOMMU driver.
> >
> > If you have details about this sort of error, I'd like to try to fix
> > it because we want to recover from that sort of error in normal
> > (non-crash) situations as well.
> >
> Completion abort case should be gracefully handled.  And device should
> always remain usable.
> 
> There are 2 scenario which I am testing with Ethernet card PCIe-Intel
> 82576 Gigabit Network card.
> 
> I)  Crash testing using kdump root file system: De-facto scenario
>     -  kdump file system does not have Ethernet driver
>     -  A lot of AER prints [1], making it impossible to work on shell
> of kdump root file system.

In this case, I think report_error_detected() is deciding that because
the device has no driver, we can't do anything.  The flow is like
this:

  aer_recover_work_func               # aer_recover_work
    kfifo_get(aer_recover_ring, entry)
    dev = pci_get_domain_bus_and_slot
    cper_print_aer(dev, ...)
      pci_err("AER: aer_status:")
      pci_err("AER:   [14] CmpltTO")
      pci_err("AER: aer_layer=")
    if (AER_NONFATAL)
      pcie_do_recovery(dev, pci_channel_io_normal)
	status = CAN_RECOVER
        pci_walk_bus(report_normal_detected)
	  report_error_detected
	    if (!dev->driver)
	      vote = NO_AER_DRIVER
	      pci_info("can't recover (no error_detected callback)")
	    *result = merge_result(*, NO_AER_DRIVER)
	    # always NO_AER_DRIVER
	status is now NO_AER_DRIVER

So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(), 
and status is not RECOVERED, so it skips .resume().

I don't remember the history there, but if a device has no driver and
the device generates errors, it seems like we ought to be able to
reset it.

We should be able to field one (or a few) AER errors, reset the
device, and you should be able to use the shell in the kdump kernel.

>     -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.
> 
> II) Crash testing using default root file system: Specific case to
> test Ethernet driver in second kernel
>    -  Default root file system have Ethernet driver
>    -  AER error comes even before the driver probe starts.
>    -  Driver does reset Ethernet card as part of probe but no success.
>    -  AER also tries to recover. but no success.  [2]
>    -  I also tries to remove AER errors by using "pci=noaer" bootargs
> and commenting ghes_handle_aer() from GHES driver..
>           than different set of errors come which also never able to recover [3]
> 
> As per my understanding, possible solutions are
>  - Copy SMMU table i.e. this patch
> OR
>  - Doing pci_reset_function() during enumeration phase.
> I also tried clearing "M" bit using pci_clear_master during
> enumeration but it did not help. Because driver re-set M bit causing
> same AER error again.
> 
> 
> -pk
> 
> ---------------------------------------------------------------------------------------------------------------------------
> [1] with bootargs having pci=noaer
> 
> [   22.494648] {4}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 1
> [   22.512773] {4}[Hardware Error]: event severity: recoverable
> [   22.518419] {4}[Hardware Error]:  Error 0, type: recoverable
> [   22.544804] {4}[Hardware Error]:   section_type: PCIe error
> [   22.550363] {4}[Hardware Error]:   port_type: 0, PCIe end point
> [   22.556268] {4}[Hardware Error]:   version: 3.0
> [   22.560785] {4}[Hardware Error]:   command: 0x0507, status: 0x4010
> [   22.576852] {4}[Hardware Error]:   device_id: 0000:09:00.1
> [   22.582323] {4}[Hardware Error]:   slot: 0
> [   22.586406] {4}[Hardware Error]:   secondary_bus: 0x00
> [   22.591530] {4}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> [   22.608900] {4}[Hardware Error]:   class_code: 000002
> [   22.613938] {4}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> [   22.803534] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> aer_mask: 0x00000000
> [   22.810838] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> [   22.817613] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> aer_agent=Requester ID
> [   22.847374] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> [   22.866161] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED,
> total mem (8153768 kB)
> [   22.946178] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> [   22.995142] pci 0000:09:00.1: AER: can't recover (no error_detected callback)
> [   23.002300] pcieport 0000:00:09.0: AER: device recovery failed
> [   23.027607] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> aer_mask: 0x00000000
> [   23.044109] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> [   23.060713] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> aer_agent=Requester ID
> [   23.068616] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> [   23.122056] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> 
> 
> ----------------------------------------------------------------------------------------------------------------------------
> [2] Normal bootargs.
> 
> [   54.252454] {6}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 1
> [   54.265827] {6}[Hardware Error]: event severity: recoverable
> [   54.271473] {6}[Hardware Error]:  Error 0, type: recoverable
> [   54.281605] {6}[Hardware Error]:   section_type: PCIe error
> [   54.287163] {6}[Hardware Error]:   port_type: 0, PCIe end point
> [   54.296955] {6}[Hardware Error]:   version: 3.0
> [   54.301471] {6}[Hardware Error]:   command: 0x0507, status: 0x4010
> [   54.312520] {6}[Hardware Error]:   device_id: 0000:09:00.1
> [   54.317991] {6}[Hardware Error]:   slot: 0
> [   54.322074] {6}[Hardware Error]:   secondary_bus: 0x00
> [   54.327197] {6}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> [   54.333797] {6}[Hardware Error]:   class_code: 000002
> [   54.351312] {6}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> [   54.358001] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [   54.376852] pcieport 0000:00:09.0: AER: device recovery successful
> [   54.383034] igb 0000:09:00.1: AER: aer_status: 0x00004000,
> aer_mask: 0x00000000
> [   54.390348] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
> [   54.397144] igb 0000:09:00.1: AER: aer_layer=Transaction Layer,
> aer_agent=Requester ID
> [   54.409555] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> [   54.551370] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [   54.705214] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [   54.758703] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [   54.865445] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [   54.888751] pcieport 0000:00:09.0: AER: device recovery successful
> [   54.894933] igb 0000:09:00.1: AER: aer_status: 0x00004000,
> aer_mask: 0x00000000
> [   54.902228] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
> [   54.916059] igb 0000:09:00.1: AER: aer_layer=Transaction Layer,
> aer_agent=Requester ID
> [   54.923972] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> [   55.057272] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [  274.571401] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [  274.686138] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [  274.786134] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [  274.886141] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [  397.792897] Workqueue: events aer_recover_work_func
> [  397.797760] Call trace:
> [  397.800199]  __switch_to+0xcc/0x108
> [  397.803675]  __schedule+0x2c0/0x700
> [  397.807150]  schedule+0x58/0xe8
> [  397.810283]  schedule_preempt_disabled+0x18/0x28
> [  397.810788] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [  397.814887]  __mutex_lock.isra.9+0x288/0x5c8
> [  397.814890]  __mutex_lock_slowpath+0x1c/0x28
> [  397.830962]  mutex_lock+0x4c/0x68
> [  397.834264]  report_slot_reset+0x30/0xa0
> [  397.838178]  pci_walk_bus+0x68/0xc0
> [  397.841653]  pcie_do_recovery+0xe8/0x248
> [  397.845562]  aer_recover_work_func+0x100/0x138
> [  397.849995]  process_one_work+0x1bc/0x458
> [  397.853991]  worker_thread+0x150/0x500
> [  397.857727]  kthread+0x114/0x118
> [  397.860945]  ret_from_fork+0x10/0x18
> [  397.864525] INFO: task kworker/223:2:2939 blocked for more than 122 seconds.
> [  397.871564]       Not tainted 5.7.0-rc3+ #68
> [  397.875819] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  397.883638] kworker/223:2   D    0  2939      2 0x00000228
> [  397.889121] Workqueue: ipv6_addrconf addrconf_verify_work
> [  397.894505] Call trace:
> [  397.896940]  __switch_to+0xcc/0x108
> [  397.900419]  __schedule+0x2c0/0x700
> [  397.903894]  schedule+0x58/0xe8
> [  397.907023]  schedule_preempt_disabled+0x18/0x28
> [  397.910798] AER: AER recover: Buffer overflow when recovering AER
> for 0000:09:00:1
> [  397.911630]  __mutex_lock.isra.9+0x288/0x5c8
> [  397.923440]  __mutex_lock_slowpath+0x1c/0x28
> [  397.927696]  mutex_lock+0x4c/0x68
> [  397.931005]  rtnl_lock+0x24/0x30
> [  397.934220]  addrconf_verify_work+0x18/0x30
> [  397.938394]  process_one_work+0x1bc/0x458
> [  397.942390]  worker_thread+0x150/0x500
> [  397.946126]  kthread+0x114/0x118
> [  397.949345]  ret_from_fork+0x10/0x18
> 
> ---------------------------------------------------------------------------------------------------------------------------------
> [3] with bootargs as pci=noaer and comment ghes_halder_aer() from AER driver
> 
> [   69.037035] igb 0000:09:00.1 enp9s0f1: Reset adapter
> [   69.348446] {9}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 0
> [   69.356698] {9}[Hardware Error]: It has been corrected by h/w and
> requires no further action
> [   69.365121] {9}[Hardware Error]: event severity: corrected
> [   69.370593] {9}[Hardware Error]:  Error 0, type: corrected
> [   69.376064] {9}[Hardware Error]:   section_type: PCIe error
> [   69.381623] {9}[Hardware Error]:   port_type: 4, root port
> [   69.387094] {9}[Hardware Error]:   version: 3.0
> [   69.391611] {9}[Hardware Error]:   command: 0x0106, status: 0x4010
> [   69.397777] {9}[Hardware Error]:   device_id: 0000:00:09.0
> [   69.403248] {9}[Hardware Error]:   slot: 0
> [   69.407331] {9}[Hardware Error]:   secondary_bus: 0x09
> [   69.412455] {9}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> [   69.419055] {9}[Hardware Error]:   class_code: 000406
> [   69.424093] {9}[Hardware Error]:   bridge: secondary_status:
> 0x6000, control: 0x0002
> [   72.118132] igb 0000:09:00.1 enp9s0f1: igb: enp9s0f1 NIC Link is Up
> 1000 Mbps Full Duplex, Flow Control: RX
> [   73.995068] igb 0000:09:00.1: Detected Tx Unit Hang
> [   73.995068]   Tx Queue             <2>
> [   73.995068]   TDH                  <0>
> [   73.995068]   TDT                  <1>
> [   73.995068]   next_to_use          <1>
> [   73.995068]   next_to_clean        <0>
> [   73.995068] buffer_info[next_to_clean]
> [   73.995068]   time_stamp           <ffff9c1a>
> [   73.995068]   next_to_watch        <0000000097d42934>
> [   73.995068]   jiffies              <ffff9cd0>
> [   73.995068]   desc.status          <168000>
> [   75.987323] igb 0000:09:00.1: Detected Tx Unit Hang
> [   75.987323]   Tx Queue             <2>
> [   75.987323]   TDH                  <0>
> [   75.987323]   TDT                  <1>
> [   75.987323]   next_to_use          <1>
> [   75.987323]   next_to_clean        <0>
> [   75.987323] buffer_info[next_to_clean]
> [   75.987323]   time_stamp           <ffff9c1a>
> [   75.987323]   next_to_watch        <0000000097d42934>
> [   75.987323]   jiffies              <ffff9d98>
> [   75.987323]   desc.status          <168000>
> [   77.952661] {10}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 1
> [   77.971790] {10}[Hardware Error]: event severity: recoverable
> [   77.977522] {10}[Hardware Error]:  Error 0, type: recoverable
> [   77.983254] {10}[Hardware Error]:   section_type: PCIe error
> [   77.999930] {10}[Hardware Error]:   port_type: 0, PCIe end point
> [   78.005922] {10}[Hardware Error]:   version: 3.0
> [   78.010526] {10}[Hardware Error]:   command: 0x0507, status: 0x4010
> [   78.016779] {10}[Hardware Error]:   device_id: 0000:09:00.1
> [   78.033107] {10}[Hardware Error]:   slot: 0
> [   78.037276] {10}[Hardware Error]:   secondary_bus: 0x00
> [   78.066253] {10}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> [   78.072940] {10}[Hardware Error]:   class_code: 000002
> [   78.078064] {10}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> [   78.096202] igb 0000:09:00.1: Detected Tx Unit Hang
> [   78.096202]   Tx Queue             <2>
> [   78.096202]   TDH                  <0>
> [   78.096202]   TDT                  <1>
> [   78.096202]   next_to_use          <1>
> [   78.096202]   next_to_clean        <0>
> [   78.096202] buffer_info[next_to_clean]
> [   78.096202]   time_stamp           <ffff9c1a>
> [   78.096202]   next_to_watch        <0000000097d42934>
> [   78.096202]   jiffies              <ffff9e6a>
> [   78.096202]   desc.status          <168000>
> [   79.587406] {11}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 0
> [   79.595744] {11}[Hardware Error]: It has been corrected by h/w and
> requires no further action
> [   79.604254] {11}[Hardware Error]: event severity: corrected
> [   79.609813] {11}[Hardware Error]:  Error 0, type: corrected
> [   79.615371] {11}[Hardware Error]:   section_type: PCIe error
> [   79.621016] {11}[Hardware Error]:   port_type: 4, root port
> [   79.626574] {11}[Hardware Error]:   version: 3.0
> [   79.631177] {11}[Hardware Error]:   command: 0x0106, status: 0x4010
> [   79.637430] {11}[Hardware Error]:   device_id: 0000:00:09.0
> [   79.642988] {11}[Hardware Error]:   slot: 0
> [   79.647157] {11}[Hardware Error]:   secondary_bus: 0x09
> [   79.652368] {11}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> [   79.659055] {11}[Hardware Error]:   class_code: 000406
> [   79.664180] {11}[Hardware Error]:   bridge: secondary_status:
> 0x6000, control: 0x0002
> [   79.987052] igb 0000:09:00.1: Detected Tx Unit Hang
> [   79.987052]   Tx Queue             <2>
> [   79.987052]   TDH                  <0>
> [   79.987052]   TDT                  <1>
> [   79.987052]   next_to_use          <1>
> [   79.987052]   next_to_clean        <0>
> [   79.987052] buffer_info[next_to_clean]
> [   79.987052]   time_stamp           <ffff9c1a>
> [   79.987052]   next_to_watch        <0000000097d42934>
> [   79.987052]   jiffies              <ffff9f28>
> [   79.987052]   desc.status          <168000>
> [   79.987056] igb 0000:09:00.1: Detected Tx Unit Hang
> [   79.987056]   Tx Queue             <3>
> [   79.987056]   TDH                  <0>
> [   79.987056]   TDT                  <1>
> [   79.987056]   next_to_use          <1>
> [   79.987056]   next_to_clean        <0>
> [   79.987056] buffer_info[next_to_clean]
> [   79.987056]   time_stamp           <ffff9e43>
> [   79.987056]   next_to_watch        <000000008da33deb>
> [   79.987056]   jiffies              <ffff9f28>
> [   79.987056]   desc.status          <514000>
> [   81.986688] igb 0000:09:00.1 enp9s0f1: Reset adapter
> [   81.986842] igb 0000:09:00.1: Detected Tx Unit Hang
> [   81.986842]   Tx Queue             <2>
> [   81.986842]   TDH                  <0>
> [   81.986842]   TDT                  <1>
> [   81.986842]   next_to_use          <1>
> [   81.986842]   next_to_clean        <0>
> [   81.986842] buffer_info[next_to_clean]
> [   81.986842]   time_stamp           <ffff9c1a>
> [   81.986842]   next_to_watch        <0000000097d42934>
> [   81.986842]   jiffies              <ffff9ff0>
> [   81.986842]   desc.status          <168000>
> [   81.986844] igb 0000:09:00.1: Detected Tx Unit Hang
> [   81.986844]   Tx Queue             <3>
> [   81.986844]   TDH                  <0>
> [   81.986844]   TDT                  <1>
> [   81.986844]   next_to_use          <1>
> [   81.986844]   next_to_clean        <0>
> [   81.986844] buffer_info[next_to_clean]
> [   81.986844]   time_stamp           <ffff9e43>
> [   81.986844]   next_to_watch        <000000008da33deb>
> [   81.986844]   jiffies              <ffff9ff0>
> [   81.986844]   desc.status          <514000>
> [   85.346515] {12}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 0
> [   85.354854] {12}[Hardware Error]: It has been corrected by h/w and
> requires no further action
> [   85.363365] {12}[Hardware Error]: event severity: corrected
> [   85.368924] {12}[Hardware Error]:  Error 0, type: corrected
> [   85.374483] {12}[Hardware Error]:   section_type: PCIe error
> [   85.380129] {12}[Hardware Error]:   port_type: 0, PCIe end point
> [   85.386121] {12}[Hardware Error]:   version: 3.0
> [   85.390725] {12}[Hardware Error]:   command: 0x0507, status: 0x0010
> [   85.396980] {12}[Hardware Error]:   device_id: 0000:09:00.0
> [   85.402540] {12}[Hardware Error]:   slot: 0
> [   85.406710] {12}[Hardware Error]:   secondary_bus: 0x00
> [   85.411921] {12}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> [   85.418609] {12}[Hardware Error]:   class_code: 000002
> [   85.423733] {12}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> [   85.826695] igb 0000:09:00.1 enp9s0f1: igb: enp9s0f1 NIC Link is Up
> 1000 Mbps Full Duplex, Flow Control: RX
> 
> 
> 
> 
> 
> > > So, If we are in a kdump kernel try to copy SMMU Stream table from
> > > primary/old kernel to preserve the mappings until the device driver
> > > takes over.
> > >
> > > Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
> > > ---
> > > Changes for v2: Used memremap in-place of ioremap
> > >
> > > V2 patch has been sanity tested.
> > >
> > > V1 patch has been tested with
> > > A) PCIe-Intel 82576 Gigabit Network card in following
> > > configurations with "no AER error". Each iteration has
> > > been tested on both Suse kdump rfs And default Centos distro rfs.
> > >
> > >  1)  with 2 level stream table
> > >        ----------------------------------------------------
> > >        SMMU               |  Normal Ping   | Flood Ping
> > >        -----------------------------------------------------
> > >        Default Operation  |  100 times     | 10 times
> > >        -----------------------------------------------------
> > >        IOMMU bypass       |  41 times      | 10 times
> > >        -----------------------------------------------------
> > >
> > >  2)  with Linear stream table.
> > >        -----------------------------------------------------
> > >        SMMU               |  Normal Ping   | Flood Ping
> > >        ------------------------------------------------------
> > >        Default Operation  |  100 times     | 10 times
> > >        ------------------------------------------------------
> > >        IOMMU bypass       |  55 times      | 10 times
> > >        -------------------------------------------------------
> > >
> > > B) This patch is also tested with Micron Technology Inc 9200 PRO NVMe
> > > SSD card with 2 level stream table using "fio" in mixed read/write and
> > > only read configurations. It is tested for both Default Operation and
> > > IOMMU bypass mode for minimum 10 iterations across Centos kdump rfs and
> > > default Centos ditstro rfs.
> > >
> > > This patch is not full proof solution. Issue can still come
> > > from the point device is discovered and driver probe called.
> > > This patch has reduced window of scenario from "SMMU Stream table
> > > creation - device-driver" to "device discovery - device-driver".
> > > Usually, device discovery to device-driver is very small time. So
> > > the probability is very low.
> > >
> > > Note: device-discovery will overwrite existing stream table entries
> > > with both SMMU stage as by-pass.
> > >
> > >
> > >  drivers/iommu/arm-smmu-v3.c | 36 +++++++++++++++++++++++++++++++++++-
> > >  1 file changed, 35 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > > index 82508730feb7..d492d92c2dd7 100644
> > > --- a/drivers/iommu/arm-smmu-v3.c
> > > +++ b/drivers/iommu/arm-smmu-v3.c
> > > @@ -1847,7 +1847,13 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
> > >                       break;
> > >               case STRTAB_STE_0_CFG_S1_TRANS:
> > >               case STRTAB_STE_0_CFG_S2_TRANS:
> > > -                     ste_live = true;
> > > +                     /*
> > > +                      * As kdump kernel copy STE table from previous
> > > +                      * kernel. It still may have valid stream table entries.
> > > +                      * Forcing entry as false to allow overwrite.
> > > +                      */
> > > +                     if (!is_kdump_kernel())
> > > +                             ste_live = true;
> > >                       break;
> > >               case STRTAB_STE_0_CFG_ABORT:
> > >                       BUG_ON(!disable_bypass);
> > > @@ -3264,6 +3270,9 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> > >               return -ENOMEM;
> > >       }
> > >
> > > +     if (is_kdump_kernel())
> > > +             return 0;
> > > +
> > >       for (i = 0; i < cfg->num_l1_ents; ++i) {
> > >               arm_smmu_write_strtab_l1_desc(strtab, &cfg->l1_desc[i]);
> > >               strtab += STRTAB_L1_DESC_DWORDS << 3;
> > > @@ -3272,6 +3281,23 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> > >       return 0;
> > >  }
> > >
> > > +static void arm_smmu_copy_table(struct arm_smmu_device *smmu,
> > > +                            struct arm_smmu_strtab_cfg *cfg, u32 size)
> > > +{
> > > +     struct arm_smmu_strtab_cfg rdcfg;
> > > +
> > > +     rdcfg.strtab_dma = readq_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE);
> > > +     rdcfg.strtab_base_cfg = readq_relaxed(smmu->base
> > > +                                           + ARM_SMMU_STRTAB_BASE_CFG);
> > > +
> > > +     rdcfg.strtab_dma &= STRTAB_BASE_ADDR_MASK;
> > > +     rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);
> > > +
> > > +     memcpy_fromio(cfg->strtab, rdcfg.strtab, size);
> > > +
> > > +     cfg->strtab_base_cfg = rdcfg.strtab_base_cfg;
> > > +}
> > > +
> > >  static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
> > >  {
> > >       void *strtab;
> > > @@ -3307,6 +3333,9 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
> > >       reg |= FIELD_PREP(STRTAB_BASE_CFG_SPLIT, STRTAB_SPLIT);
> > >       cfg->strtab_base_cfg = reg;
> > >
> > > +     if (is_kdump_kernel())
> > > +             arm_smmu_copy_table(smmu, cfg, l1size);
> > > +
> > >       return arm_smmu_init_l1_strtab(smmu);
> > >  }
> > >
> > > @@ -3334,6 +3363,11 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
> > >       reg |= FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE, smmu->sid_bits);
> > >       cfg->strtab_base_cfg = reg;
> > >
> > > +     if (is_kdump_kernel()) {
> > > +             arm_smmu_copy_table(smmu, cfg, size);
> > > +             return 0;
> > > +     }
> > > +
> > >       arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents);
> > >       return 0;
> > >  }
> > > --
> > > 2.18.2
> > >

Prabhakar Kushwaha May 21, 2020, 3:58 a.m. UTC | #6

Hi Bjorn,

On Wed, May 20, 2020 at 4:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc Sathy, Vijay, Myron]
>
> On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
> > On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > An SMMU Stream table is created by the primary kernel. This table is
> > > > used by the SMMU to perform address translations for device-originated
> > > > transactions. Any crash (if happened) launches the kdump kernel which
> > > > re-creates the SMMU Stream table. New transactions will be translated
> > > > via this new table..
> > > >
> > > > There are scenarios, where devices are still having old pending
> > > > transactions (configured in the primary kernel). These transactions
> > > > come in-between Stream table creation and device-driver probe.
> > > > As new stream table does not have entry for older transactions,
> > > > it will be aborted by SMMU.
> > > >
> > > > Similar observations were found with PCIe-Intel 82576 Gigabit
> > > > Network card. It sends old Memory Read transaction in kdump kernel.
> > > > Transactions configured for older Stream table entries, that do not
> > > > exist any longer in the new table, will cause a PCIe Completion Abort.
> > >
> > > That sounds like exactly what we want, doesn't it?
> > >
> > > Or do you *want* DMA from the previous kernel to complete?  That will
> > > read or scribble on something, but maybe that's not terrible as long
> > > as it's not memory used by the kdump kernel.
> >
> > Yes, Abort should happen. But it should happen in context of driver.
> > But current abort is happening because of SMMU and no driver/pcie
> > setup present at this moment.
>
> I don't understand what you mean by "in context of driver."  The whole
> problem is that we can't control *when* the abort happens, so it may
> happen in *any* context.  It may happen when a NIC receives a packet
> or at some other unpredictable time.
>
> > Solution of this issue should be at 2 place
> > a) SMMU level: I still believe, this patch has potential to overcome
> > issue till finally driver's probe takeover.
> > b) Device level: Even if something goes wrong. Driver/device should
> > able to recover.
> >
> > > > Returned PCIe completion abort further leads to AER Errors from APEI
> > > > Generic Hardware Error Source (GHES) with completion timeout.
> > > > A network device hang is observed even after continuous
> > > > reset/recovery from driver, Hence device is no more usable.
> > >
> > > The fact that the device is no longer usable is definitely a problem.
> > > But in principle we *should* be able to recover from these errors.  If
> > > we could recover and reliably use the device after the error, that
> > > seems like it would be a more robust solution that having to add
> > > special cases in every IOMMU driver.
> > >
> > > If you have details about this sort of error, I'd like to try to fix
> > > it because we want to recover from that sort of error in normal
> > > (non-crash) situations as well.
> > >
> > Completion abort case should be gracefully handled.  And device should
> > always remain usable.
> >
> > There are 2 scenario which I am testing with Ethernet card PCIe-Intel
> > 82576 Gigabit Network card.
> >
> > I)  Crash testing using kdump root file system: De-facto scenario
> >     -  kdump file system does not have Ethernet driver
> >     -  A lot of AER prints [1], making it impossible to work on shell
> > of kdump root file system.
>
> In this case, I think report_error_detected() is deciding that because
> the device has no driver, we can't do anything.  The flow is like
> this:
>
>   aer_recover_work_func               # aer_recover_work
>     kfifo_get(aer_recover_ring, entry)
>     dev = pci_get_domain_bus_and_slot
>     cper_print_aer(dev, ...)
>       pci_err("AER: aer_status:")
>       pci_err("AER:   [14] CmpltTO")
>       pci_err("AER: aer_layer=")
>     if (AER_NONFATAL)
>       pcie_do_recovery(dev, pci_channel_io_normal)
>         status = CAN_RECOVER
>         pci_walk_bus(report_normal_detected)
>           report_error_detected
>             if (!dev->driver)
>               vote = NO_AER_DRIVER
>               pci_info("can't recover (no error_detected callback)")
>             *result = merge_result(*, NO_AER_DRIVER)
>             # always NO_AER_DRIVER
>         status is now NO_AER_DRIVER
>
> So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(),
> and status is not RECOVERED, so it skips .resume().
>
> I don't remember the history there, but if a device has no driver and
> the device generates errors, it seems like we ought to be able to
> reset it.
>

But how to reset the device considering there is no driver.
Hypothetically, this case should be taken care by PCIe subsystem to
perform reset at PCIe level.

> We should be able to field one (or a few) AER errors, reset the
> device, and you should be able to use the shell in the kdump kernel.
>
here kdump shell is usable only problem is a "lot of AER Errors". One
cannot see what they are typing.

> >     -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.
> >
> > II) Crash testing using default root file system: Specific case to
> > test Ethernet driver in second kernel
> >    -  Default root file system have Ethernet driver
> >    -  AER error comes even before the driver probe starts.
> >    -  Driver does reset Ethernet card as part of probe but no success.
> >    -  AER also tries to recover. but no success.  [2]
> >    -  I also tries to remove AER errors by using "pci=noaer" bootargs
> > and commenting ghes_handle_aer() from GHES driver..
> >           than different set of errors come which also never able to recover [3]
> >

Please suggest your view on this case. Here driver is preset.
(driver/net/ethernet/intel/igb/igb_main.c)
In this case AER errors starts even before driver probe starts.
After probe, driver does the device reset with no success and even AER
recovery does not work.

Problem mentioned in case I and II goes away if do pci_reset_function
during enumeration phase of kdump kernel.
can we thought of doing pci_reset_function for all devices in kdump
kernel or device specific quirk.

--pk


> > As per my understanding, possible solutions are
> >  - Copy SMMU table i.e. this patch
> > OR
> >  - Doing pci_reset_function() during enumeration phase.
> > I also tried clearing "M" bit using pci_clear_master during
> > enumeration but it did not help. Because driver re-set M bit causing
> > same AER error again.
> >
> >
> > -pk
> >
> > ---------------------------------------------------------------------------------------------------------------------------
> > [1] with bootargs having pci=noaer
> >
> > [   22.494648] {4}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 1
> > [   22.512773] {4}[Hardware Error]: event severity: recoverable
> > [   22.518419] {4}[Hardware Error]:  Error 0, type: recoverable
> > [   22.544804] {4}[Hardware Error]:   section_type: PCIe error
> > [   22.550363] {4}[Hardware Error]:   port_type: 0, PCIe end point
> > [   22.556268] {4}[Hardware Error]:   version: 3.0
> > [   22.560785] {4}[Hardware Error]:   command: 0x0507, status: 0x4010
> > [   22.576852] {4}[Hardware Error]:   device_id: 0000:09:00.1
> > [   22.582323] {4}[Hardware Error]:   slot: 0
> > [   22.586406] {4}[Hardware Error]:   secondary_bus: 0x00
> > [   22.591530] {4}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   22.608900] {4}[Hardware Error]:   class_code: 000002
> > [   22.613938] {4}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   22.803534] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > aer_mask: 0x00000000
> > [   22.810838] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > [   22.817613] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > aer_agent=Requester ID
> > [   22.847374] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > [   22.866161] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED,
> > total mem (8153768 kB)
> > [   22.946178] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> > [   22.995142] pci 0000:09:00.1: AER: can't recover (no error_detected callback)
> > [   23.002300] pcieport 0000:00:09.0: AER: device recovery failed
> > [   23.027607] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > aer_mask: 0x00000000
> > [   23.044109] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > [   23.060713] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > aer_agent=Requester ID
> > [   23.068616] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > [   23.122056] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> >
> >
> > ----------------------------------------------------------------------------------------------------------------------------
> > [2] Normal bootargs.
> >
> > [   54.252454] {6}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 1
> > [   54.265827] {6}[Hardware Error]: event severity: recoverable
> > [   54.271473] {6}[Hardware Error]:  Error 0, type: recoverable
> > [   54.281605] {6}[Hardware Error]:   section_type: PCIe error
> > [   54.287163] {6}[Hardware Error]:   port_type: 0, PCIe end point
> > [   54.296955] {6}[Hardware Error]:   version: 3.0
> > [   54.301471] {6}[Hardware Error]:   command: 0x0507, status: 0x4010
> > [   54.312520] {6}[Hardware Error]:   device_id: 0000:09:00.1
> > [   54.317991] {6}[Hardware Error]:   slot: 0
> > [   54.322074] {6}[Hardware Error]:   secondary_bus: 0x00
> > [   54.327197] {6}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   54.333797] {6}[Hardware Error]:   class_code: 000002
> > [   54.351312] {6}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   54.358001] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [   54.376852] pcieport 0000:00:09.0: AER: device recovery successful
> > [   54.383034] igb 0000:09:00.1: AER: aer_status: 0x00004000,
> > aer_mask: 0x00000000
> > [   54.390348] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > [   54.397144] igb 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > aer_agent=Requester ID
> > [   54.409555] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > [   54.551370] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [   54.705214] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [   54.758703] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [   54.865445] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [   54.888751] pcieport 0000:00:09.0: AER: device recovery successful
> > [   54.894933] igb 0000:09:00.1: AER: aer_status: 0x00004000,
> > aer_mask: 0x00000000
> > [   54.902228] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > [   54.916059] igb 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > aer_agent=Requester ID
> > [   54.923972] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > [   55.057272] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [  274.571401] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [  274.686138] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [  274.786134] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [  274.886141] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [  397.792897] Workqueue: events aer_recover_work_func
> > [  397.797760] Call trace:
> > [  397.800199]  __switch_to+0xcc/0x108
> > [  397.803675]  __schedule+0x2c0/0x700
> > [  397.807150]  schedule+0x58/0xe8
> > [  397.810283]  schedule_preempt_disabled+0x18/0x28
> > [  397.810788] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [  397.814887]  __mutex_lock.isra.9+0x288/0x5c8
> > [  397.814890]  __mutex_lock_slowpath+0x1c/0x28
> > [  397.830962]  mutex_lock+0x4c/0x68
> > [  397.834264]  report_slot_reset+0x30/0xa0
> > [  397.838178]  pci_walk_bus+0x68/0xc0
> > [  397.841653]  pcie_do_recovery+0xe8/0x248
> > [  397.845562]  aer_recover_work_func+0x100/0x138
> > [  397.849995]  process_one_work+0x1bc/0x458
> > [  397.853991]  worker_thread+0x150/0x500
> > [  397.857727]  kthread+0x114/0x118
> > [  397.860945]  ret_from_fork+0x10/0x18
> > [  397.864525] INFO: task kworker/223:2:2939 blocked for more than 122 seconds.
> > [  397.871564]       Not tainted 5.7.0-rc3+ #68
> > [  397.875819] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [  397.883638] kworker/223:2   D    0  2939      2 0x00000228
> > [  397.889121] Workqueue: ipv6_addrconf addrconf_verify_work
> > [  397.894505] Call trace:
> > [  397.896940]  __switch_to+0xcc/0x108
> > [  397.900419]  __schedule+0x2c0/0x700
> > [  397.903894]  schedule+0x58/0xe8
> > [  397.907023]  schedule_preempt_disabled+0x18/0x28
> > [  397.910798] AER: AER recover: Buffer overflow when recovering AER
> > for 0000:09:00:1
> > [  397.911630]  __mutex_lock.isra.9+0x288/0x5c8
> > [  397.923440]  __mutex_lock_slowpath+0x1c/0x28
> > [  397.927696]  mutex_lock+0x4c/0x68
> > [  397.931005]  rtnl_lock+0x24/0x30
> > [  397.934220]  addrconf_verify_work+0x18/0x30
> > [  397.938394]  process_one_work+0x1bc/0x458
> > [  397.942390]  worker_thread+0x150/0x500
> > [  397.946126]  kthread+0x114/0x118
> > [  397.949345]  ret_from_fork+0x10/0x18
> >
> > ---------------------------------------------------------------------------------------------------------------------------------
> > [3] with bootargs as pci=noaer and comment ghes_halder_aer() from AER driver
> >
> > [   69.037035] igb 0000:09:00.1 enp9s0f1: Reset adapter
> > [   69.348446] {9}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 0
> > [   69.356698] {9}[Hardware Error]: It has been corrected by h/w and
> > requires no further action
> > [   69.365121] {9}[Hardware Error]: event severity: corrected
> > [   69.370593] {9}[Hardware Error]:  Error 0, type: corrected
> > [   69.376064] {9}[Hardware Error]:   section_type: PCIe error
> > [   69.381623] {9}[Hardware Error]:   port_type: 4, root port
> > [   69.387094] {9}[Hardware Error]:   version: 3.0
> > [   69.391611] {9}[Hardware Error]:   command: 0x0106, status: 0x4010
> > [   69.397777] {9}[Hardware Error]:   device_id: 0000:00:09.0
> > [   69.403248] {9}[Hardware Error]:   slot: 0
> > [   69.407331] {9}[Hardware Error]:   secondary_bus: 0x09
> > [   69.412455] {9}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > [   69.419055] {9}[Hardware Error]:   class_code: 000406
> > [   69.424093] {9}[Hardware Error]:   bridge: secondary_status:
> > 0x6000, control: 0x0002
> > [   72.118132] igb 0000:09:00.1 enp9s0f1: igb: enp9s0f1 NIC Link is Up
> > 1000 Mbps Full Duplex, Flow Control: RX
> > [   73.995068] igb 0000:09:00.1: Detected Tx Unit Hang
> > [   73.995068]   Tx Queue             <2>
> > [   73.995068]   TDH                  <0>
> > [   73.995068]   TDT                  <1>
> > [   73.995068]   next_to_use          <1>
> > [   73.995068]   next_to_clean        <0>
> > [   73.995068] buffer_info[next_to_clean]
> > [   73.995068]   time_stamp           <ffff9c1a>
> > [   73.995068]   next_to_watch        <0000000097d42934>
> > [   73.995068]   jiffies              <ffff9cd0>
> > [   73.995068]   desc.status          <168000>
> > [   75.987323] igb 0000:09:00.1: Detected Tx Unit Hang
> > [   75.987323]   Tx Queue             <2>
> > [   75.987323]   TDH                  <0>
> > [   75.987323]   TDT                  <1>
> > [   75.987323]   next_to_use          <1>
> > [   75.987323]   next_to_clean        <0>
> > [   75.987323] buffer_info[next_to_clean]
> > [   75.987323]   time_stamp           <ffff9c1a>
> > [   75.987323]   next_to_watch        <0000000097d42934>
> > [   75.987323]   jiffies              <ffff9d98>
> > [   75.987323]   desc.status          <168000>
> > [   77.952661] {10}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 1
> > [   77.971790] {10}[Hardware Error]: event severity: recoverable
> > [   77.977522] {10}[Hardware Error]:  Error 0, type: recoverable
> > [   77.983254] {10}[Hardware Error]:   section_type: PCIe error
> > [   77.999930] {10}[Hardware Error]:   port_type: 0, PCIe end point
> > [   78.005922] {10}[Hardware Error]:   version: 3.0
> > [   78.010526] {10}[Hardware Error]:   command: 0x0507, status: 0x4010
> > [   78.016779] {10}[Hardware Error]:   device_id: 0000:09:00.1
> > [   78.033107] {10}[Hardware Error]:   slot: 0
> > [   78.037276] {10}[Hardware Error]:   secondary_bus: 0x00
> > [   78.066253] {10}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   78.072940] {10}[Hardware Error]:   class_code: 000002
> > [   78.078064] {10}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   78.096202] igb 0000:09:00.1: Detected Tx Unit Hang
> > [   78.096202]   Tx Queue             <2>
> > [   78.096202]   TDH                  <0>
> > [   78.096202]   TDT                  <1>
> > [   78.096202]   next_to_use          <1>
> > [   78.096202]   next_to_clean        <0>
> > [   78.096202] buffer_info[next_to_clean]
> > [   78.096202]   time_stamp           <ffff9c1a>
> > [   78.096202]   next_to_watch        <0000000097d42934>
> > [   78.096202]   jiffies              <ffff9e6a>
> > [   78.096202]   desc.status          <168000>
> > [   79.587406] {11}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 0
> > [   79.595744] {11}[Hardware Error]: It has been corrected by h/w and
> > requires no further action
> > [   79.604254] {11}[Hardware Error]: event severity: corrected
> > [   79.609813] {11}[Hardware Error]:  Error 0, type: corrected
> > [   79.615371] {11}[Hardware Error]:   section_type: PCIe error
> > [   79.621016] {11}[Hardware Error]:   port_type: 4, root port
> > [   79.626574] {11}[Hardware Error]:   version: 3.0
> > [   79.631177] {11}[Hardware Error]:   command: 0x0106, status: 0x4010
> > [   79.637430] {11}[Hardware Error]:   device_id: 0000:00:09.0
> > [   79.642988] {11}[Hardware Error]:   slot: 0
> > [   79.647157] {11}[Hardware Error]:   secondary_bus: 0x09
> > [   79.652368] {11}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > [   79.659055] {11}[Hardware Error]:   class_code: 000406
> > [   79.664180] {11}[Hardware Error]:   bridge: secondary_status:
> > 0x6000, control: 0x0002
> > [   79.987052] igb 0000:09:00.1: Detected Tx Unit Hang
> > [   79.987052]   Tx Queue             <2>
> > [   79.987052]   TDH                  <0>
> > [   79.987052]   TDT                  <1>
> > [   79.987052]   next_to_use          <1>
> > [   79.987052]   next_to_clean        <0>
> > [   79.987052] buffer_info[next_to_clean]
> > [   79.987052]   time_stamp           <ffff9c1a>
> > [   79.987052]   next_to_watch        <0000000097d42934>
> > [   79.987052]   jiffies              <ffff9f28>
> > [   79.987052]   desc.status          <168000>
> > [   79.987056] igb 0000:09:00.1: Detected Tx Unit Hang
> > [   79.987056]   Tx Queue             <3>
> > [   79.987056]   TDH                  <0>
> > [   79.987056]   TDT                  <1>
> > [   79.987056]   next_to_use          <1>
> > [   79.987056]   next_to_clean        <0>
> > [   79.987056] buffer_info[next_to_clean]
> > [   79.987056]   time_stamp           <ffff9e43>
> > [   79.987056]   next_to_watch        <000000008da33deb>
> > [   79.987056]   jiffies              <ffff9f28>
> > [   79.987056]   desc.status          <514000>
> > [   81.986688] igb 0000:09:00.1 enp9s0f1: Reset adapter
> > [   81.986842] igb 0000:09:00.1: Detected Tx Unit Hang
> > [   81.986842]   Tx Queue             <2>
> > [   81.986842]   TDH                  <0>
> > [   81.986842]   TDT                  <1>
> > [   81.986842]   next_to_use          <1>
> > [   81.986842]   next_to_clean        <0>
> > [   81.986842] buffer_info[next_to_clean]
> > [   81.986842]   time_stamp           <ffff9c1a>
> > [   81.986842]   next_to_watch        <0000000097d42934>
> > [   81.986842]   jiffies              <ffff9ff0>
> > [   81.986842]   desc.status          <168000>
> > [   81.986844] igb 0000:09:00.1: Detected Tx Unit Hang
> > [   81.986844]   Tx Queue             <3>
> > [   81.986844]   TDH                  <0>
> > [   81.986844]   TDT                  <1>
> > [   81.986844]   next_to_use          <1>
> > [   81.986844]   next_to_clean        <0>
> > [   81.986844] buffer_info[next_to_clean]
> > [   81.986844]   time_stamp           <ffff9e43>
> > [   81.986844]   next_to_watch        <000000008da33deb>
> > [   81.986844]   jiffies              <ffff9ff0>
> > [   81.986844]   desc.status          <514000>
> > [   85.346515] {12}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 0
> > [   85.354854] {12}[Hardware Error]: It has been corrected by h/w and
> > requires no further action
> > [   85.363365] {12}[Hardware Error]: event severity: corrected
> > [   85.368924] {12}[Hardware Error]:  Error 0, type: corrected
> > [   85.374483] {12}[Hardware Error]:   section_type: PCIe error
> > [   85.380129] {12}[Hardware Error]:   port_type: 0, PCIe end point
> > [   85.386121] {12}[Hardware Error]:   version: 3.0
> > [   85.390725] {12}[Hardware Error]:   command: 0x0507, status: 0x0010
> > [   85.396980] {12}[Hardware Error]:   device_id: 0000:09:00.0
> > [   85.402540] {12}[Hardware Error]:   slot: 0
> > [   85.406710] {12}[Hardware Error]:   secondary_bus: 0x00
> > [   85.411921] {12}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   85.418609] {12}[Hardware Error]:   class_code: 000002
> > [   85.423733] {12}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   85.826695] igb 0000:09:00.1 enp9s0f1: igb: enp9s0f1 NIC Link is Up
> > 1000 Mbps Full Duplex, Flow Control: RX
> >
> >
> >
> >
> >
> > > > So, If we are in a kdump kernel try to copy SMMU Stream table from
> > > > primary/old kernel to preserve the mappings until the device driver
> > > > takes over.
> > > >
> > > > Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
> > > > ---
> > > > Changes for v2: Used memremap in-place of ioremap
> > > >
> > > > V2 patch has been sanity tested.
> > > >
> > > > V1 patch has been tested with
> > > > A) PCIe-Intel 82576 Gigabit Network card in following
> > > > configurations with "no AER error". Each iteration has
> > > > been tested on both Suse kdump rfs And default Centos distro rfs.
> > > >
> > > >  1)  with 2 level stream table
> > > >        ----------------------------------------------------
> > > >        SMMU               |  Normal Ping   | Flood Ping
> > > >        -----------------------------------------------------
> > > >        Default Operation  |  100 times     | 10 times
> > > >        -----------------------------------------------------
> > > >        IOMMU bypass       |  41 times      | 10 times
> > > >        -----------------------------------------------------
> > > >
> > > >  2)  with Linear stream table.
> > > >        -----------------------------------------------------
> > > >        SMMU               |  Normal Ping   | Flood Ping
> > > >        ------------------------------------------------------
> > > >        Default Operation  |  100 times     | 10 times
> > > >        ------------------------------------------------------
> > > >        IOMMU bypass       |  55 times      | 10 times
> > > >        -------------------------------------------------------
> > > >
> > > > B) This patch is also tested with Micron Technology Inc 9200 PRO NVMe
> > > > SSD card with 2 level stream table using "fio" in mixed read/write and
> > > > only read configurations. It is tested for both Default Operation and
> > > > IOMMU bypass mode for minimum 10 iterations across Centos kdump rfs and
> > > > default Centos ditstro rfs.
> > > >
> > > > This patch is not full proof solution. Issue can still come
> > > > from the point device is discovered and driver probe called.
> > > > This patch has reduced window of scenario from "SMMU Stream table
> > > > creation - device-driver" to "device discovery - device-driver".
> > > > Usually, device discovery to device-driver is very small time. So
> > > > the probability is very low.
> > > >
> > > > Note: device-discovery will overwrite existing stream table entries
> > > > with both SMMU stage as by-pass.
> > > >
> > > >
> > > >  drivers/iommu/arm-smmu-v3.c | 36 +++++++++++++++++++++++++++++++++++-
> > > >  1 file changed, 35 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > > > index 82508730feb7..d492d92c2dd7 100644
> > > > --- a/drivers/iommu/arm-smmu-v3.c
> > > > +++ b/drivers/iommu/arm-smmu-v3.c
> > > > @@ -1847,7 +1847,13 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
> > > >                       break;
> > > >               case STRTAB_STE_0_CFG_S1_TRANS:
> > > >               case STRTAB_STE_0_CFG_S2_TRANS:
> > > > -                     ste_live = true;
> > > > +                     /*
> > > > +                      * As kdump kernel copy STE table from previous
> > > > +                      * kernel. It still may have valid stream table entries.
> > > > +                      * Forcing entry as false to allow overwrite.
> > > > +                      */
> > > > +                     if (!is_kdump_kernel())
> > > > +                             ste_live = true;
> > > >                       break;
> > > >               case STRTAB_STE_0_CFG_ABORT:
> > > >                       BUG_ON(!disable_bypass);
> > > > @@ -3264,6 +3270,9 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> > > >               return -ENOMEM;
> > > >       }
> > > >
> > > > +     if (is_kdump_kernel())
> > > > +             return 0;
> > > > +
> > > >       for (i = 0; i < cfg->num_l1_ents; ++i) {
> > > >               arm_smmu_write_strtab_l1_desc(strtab, &cfg->l1_desc[i]);
> > > >               strtab += STRTAB_L1_DESC_DWORDS << 3;
> > > > @@ -3272,6 +3281,23 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> > > >       return 0;
> > > >  }
> > > >
> > > > +static void arm_smmu_copy_table(struct arm_smmu_device *smmu,
> > > > +                            struct arm_smmu_strtab_cfg *cfg, u32 size)
> > > > +{
> > > > +     struct arm_smmu_strtab_cfg rdcfg;
> > > > +
> > > > +     rdcfg.strtab_dma = readq_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE);
> > > > +     rdcfg.strtab_base_cfg = readq_relaxed(smmu->base
> > > > +                                           + ARM_SMMU_STRTAB_BASE_CFG);
> > > > +
> > > > +     rdcfg.strtab_dma &= STRTAB_BASE_ADDR_MASK;
> > > > +     rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);
> > > > +
> > > > +     memcpy_fromio(cfg->strtab, rdcfg.strtab, size);
> > > > +
> > > > +     cfg->strtab_base_cfg = rdcfg.strtab_base_cfg;
> > > > +}
> > > > +
> > > >  static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
> > > >  {
> > > >       void *strtab;
> > > > @@ -3307,6 +3333,9 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
> > > >       reg |= FIELD_PREP(STRTAB_BASE_CFG_SPLIT, STRTAB_SPLIT);
> > > >       cfg->strtab_base_cfg = reg;
> > > >
> > > > +     if (is_kdump_kernel())
> > > > +             arm_smmu_copy_table(smmu, cfg, l1size);
> > > > +
> > > >       return arm_smmu_init_l1_strtab(smmu);
> > > >  }
> > > >
> > > > @@ -3334,6 +3363,11 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
> > > >       reg |= FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE, smmu->sid_bits);
> > > >       cfg->strtab_base_cfg = reg;
> > > >
> > > > +     if (is_kdump_kernel()) {
> > > > +             arm_smmu_copy_table(smmu, cfg, size);
> > > > +             return 0;
> > > > +     }
> > > > +
> > > >       arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents);
> > > >       return 0;
> > > >  }
> > > > --
> > > > 2.18.2
> > > >

Will Deacon May 21, 2020, 9:23 a.m. UTC | #7

On Tue, May 19, 2020 at 08:24:21AM +0530, Prabhakar Kushwaha wrote:
> On Mon, May 18, 2020 at 9:25 PM Will Deacon <will@kernel.org> wrote:
> > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > @@ -3272,6 +3281,23 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> > >       return 0;
> > >  }
> > >
> > > +static void arm_smmu_copy_table(struct arm_smmu_device *smmu,
> > > +                            struct arm_smmu_strtab_cfg *cfg, u32 size)
> > > +{
> > > +     struct arm_smmu_strtab_cfg rdcfg;
> > > +
> > > +     rdcfg.strtab_dma = readq_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE);
> > > +     rdcfg.strtab_base_cfg = readq_relaxed(smmu->base
> > > +                                           + ARM_SMMU_STRTAB_BASE_CFG);
> > > +
> > > +     rdcfg.strtab_dma &= STRTAB_BASE_ADDR_MASK;
> > > +     rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);
> > > +
> > > +     memcpy_fromio(cfg->strtab, rdcfg.strtab, size);
> > > +
> 
> this need a fix. It should be memcpy.
> 
> > > +     cfg->strtab_base_cfg = rdcfg.strtab_base_cfg;
> >
> > Sorry, but this is unacceptable. These things were allocated by the DMA API
> > so you can't just memcpy them around and hope for the best.
> >
> 
> I was referring copy_context_table() in drivers/iommu/intel-iommu.c.
> here i see usage of memremap and memcpy to copy older iommu table.
> did I take wrong reference?
> 
> What kind of issue you are foreseeing in using memcpy(). May be we can
> try to find a solution.

Well the thing might not be cache-coherent to start with...

Will

Prabhakar Kushwaha May 21, 2020, 11:22 a.m. UTC | #8

Hi Will,

On Thu, May 21, 2020 at 2:53 PM Will Deacon <will@kernel.org> wrote:
>
> On Tue, May 19, 2020 at 08:24:21AM +0530, Prabhakar Kushwaha wrote:
> > On Mon, May 18, 2020 at 9:25 PM Will Deacon <will@kernel.org> wrote:
> > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > @@ -3272,6 +3281,23 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> > > >       return 0;
> > > >  }
> > > >
> > > > +static void arm_smmu_copy_table(struct arm_smmu_device *smmu,
> > > > +                            struct arm_smmu_strtab_cfg *cfg, u32 size)
> > > > +{
> > > > +     struct arm_smmu_strtab_cfg rdcfg;
> > > > +
> > > > +     rdcfg.strtab_dma = readq_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE);
> > > > +     rdcfg.strtab_base_cfg = readq_relaxed(smmu->base
> > > > +                                           + ARM_SMMU_STRTAB_BASE_CFG);
> > > > +
> > > > +     rdcfg.strtab_dma &= STRTAB_BASE_ADDR_MASK;
> > > > +     rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);
> > > > +
> > > > +     memcpy_fromio(cfg->strtab, rdcfg.strtab, size);
> > > > +
> >
> > this need a fix. It should be memcpy.
> >
> > > > +     cfg->strtab_base_cfg = rdcfg.strtab_base_cfg;
> > >
> > > Sorry, but this is unacceptable. These things were allocated by the DMA API
> > > so you can't just memcpy them around and hope for the best.
> > >
> >
> > I was referring copy_context_table() in drivers/iommu/intel-iommu.c.
> > here i see usage of memremap and memcpy to copy older iommu table.
> > did I take wrong reference?
> >
> > What kind of issue you are foreseeing in using memcpy(). May be we can
> > try to find a solution.
>
> Well the thing might not be cache-coherent to start with...
>

Thanks for telling possible issue area.  Let me try to explain why
this should not be an issue.

kdump kernel runs from reserved memory space defined during the boot
of first kernel. kdump does not touch memory of the previous kernel.
So no page has been created in kdump kernel  and  there should not be
any data/attribute/coherency issue from MMU point of view .

During SMMU probe functions,  dmem_alloc_coherent() will be used
allocate new memory (part of existing flow).
This patch copy STE or first level descriptor to *this* memory, after
mapping physical address using memremap().
It just copy everything  so there should not be any issue related to
attribute/content.

Yes, copying  done after mapping it as MEMREMAP_WB. if you want I can
use it as MEMREMAP_WT

In both scenario and also considering intel driver is doing similar
things. I feel there should not be an issue.

Please let me know if you have any other view to solve this problem. I
will be more than happy to explore it.

thanks
--pk

Bjorn Helgaas May 21, 2020, 10:49 p.m. UTC | #9

On Thu, May 21, 2020 at 09:28:20AM +0530, Prabhakar Kushwaha wrote:
> On Wed, May 20, 2020 at 4:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
> > > On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > > An SMMU Stream table is created by the primary kernel. This table is
> > > > > used by the SMMU to perform address translations for device-originated
> > > > > transactions. Any crash (if happened) launches the kdump kernel which
> > > > > re-creates the SMMU Stream table. New transactions will be translated
> > > > > via this new table..
> > > > >
> > > > > There are scenarios, where devices are still having old pending
> > > > > transactions (configured in the primary kernel). These transactions
> > > > > come in-between Stream table creation and device-driver probe.
> > > > > As new stream table does not have entry for older transactions,
> > > > > it will be aborted by SMMU.
> > > > >
> > > > > Similar observations were found with PCIe-Intel 82576 Gigabit
> > > > > Network card. It sends old Memory Read transaction in kdump kernel.
> > > > > Transactions configured for older Stream table entries, that do not
> > > > > exist any longer in the new table, will cause a PCIe Completion Abort.
> > > >
> > > > That sounds like exactly what we want, doesn't it?
> > > >
> > > > Or do you *want* DMA from the previous kernel to complete?  That will
> > > > read or scribble on something, but maybe that's not terrible as long
> > > > as it's not memory used by the kdump kernel.
> > >
> > > Yes, Abort should happen. But it should happen in context of driver.
> > > But current abort is happening because of SMMU and no driver/pcie
> > > setup present at this moment.
> >
> > I don't understand what you mean by "in context of driver."  The whole
> > problem is that we can't control *when* the abort happens, so it may
> > happen in *any* context.  It may happen when a NIC receives a packet
> > or at some other unpredictable time.
> >
> > > Solution of this issue should be at 2 place
> > > a) SMMU level: I still believe, this patch has potential to overcome
> > > issue till finally driver's probe takeover.
> > > b) Device level: Even if something goes wrong. Driver/device should
> > > able to recover.
> > >
> > > > > Returned PCIe completion abort further leads to AER Errors from APEI
> > > > > Generic Hardware Error Source (GHES) with completion timeout.
> > > > > A network device hang is observed even after continuous
> > > > > reset/recovery from driver, Hence device is no more usable.
> > > >
> > > > The fact that the device is no longer usable is definitely a problem.
> > > > But in principle we *should* be able to recover from these errors.  If
> > > > we could recover and reliably use the device after the error, that
> > > > seems like it would be a more robust solution that having to add
> > > > special cases in every IOMMU driver.
> > > >
> > > > If you have details about this sort of error, I'd like to try to fix
> > > > it because we want to recover from that sort of error in normal
> > > > (non-crash) situations as well.
> > > >
> > > Completion abort case should be gracefully handled.  And device should
> > > always remain usable.
> > >
> > > There are 2 scenario which I am testing with Ethernet card PCIe-Intel
> > > 82576 Gigabit Network card.
> > >
> > > I)  Crash testing using kdump root file system: De-facto scenario
> > >     -  kdump file system does not have Ethernet driver
> > >     -  A lot of AER prints [1], making it impossible to work on shell
> > > of kdump root file system.
> >
> > In this case, I think report_error_detected() is deciding that because
> > the device has no driver, we can't do anything.  The flow is like
> > this:
> >
> >   aer_recover_work_func               # aer_recover_work
> >     kfifo_get(aer_recover_ring, entry)
> >     dev = pci_get_domain_bus_and_slot
> >     cper_print_aer(dev, ...)
> >       pci_err("AER: aer_status:")
> >       pci_err("AER:   [14] CmpltTO")
> >       pci_err("AER: aer_layer=")
> >     if (AER_NONFATAL)
> >       pcie_do_recovery(dev, pci_channel_io_normal)
> >         status = CAN_RECOVER
> >         pci_walk_bus(report_normal_detected)
> >           report_error_detected
> >             if (!dev->driver)
> >               vote = NO_AER_DRIVER
> >               pci_info("can't recover (no error_detected callback)")
> >             *result = merge_result(*, NO_AER_DRIVER)
> >             # always NO_AER_DRIVER
> >         status is now NO_AER_DRIVER
> >
> > So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(),
> > and status is not RECOVERED, so it skips .resume().
> >
> > I don't remember the history there, but if a device has no driver and
> > the device generates errors, it seems like we ought to be able to
> > reset it.
> 
> But how to reset the device considering there is no driver.
> Hypothetically, this case should be taken care by PCIe subsystem to
> perform reset at PCIe level.

I don't understand your question.  The PCI core (not the device
driver) already does the reset.  When pcie_do_recovery() calls
reset_link(), all devices on the other side of the link are reset.

> > We should be able to field one (or a few) AER errors, reset the
> > device, and you should be able to use the shell in the kdump kernel.
> >
> here kdump shell is usable only problem is a "lot of AER Errors". One
> cannot see what they are typing.

Right, that's what I expect.  If the PCI core resets the device, you
should get just a few AER errors, and they should stop after the
device is reset.

> > >     -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.
> > >
> > > II) Crash testing using default root file system: Specific case to
> > > test Ethernet driver in second kernel
> > >    -  Default root file system have Ethernet driver
> > >    -  AER error comes even before the driver probe starts.
> > >    -  Driver does reset Ethernet card as part of probe but no success.
> > >    -  AER also tries to recover. but no success.  [2]
> > >    -  I also tries to remove AER errors by using "pci=noaer" bootargs
> > > and commenting ghes_handle_aer() from GHES driver..
> > >           than different set of errors come which also never able to recover [3]
> > >
> 
> Please suggest your view on this case. Here driver is preset.
> (driver/net/ethernet/intel/igb/igb_main.c)
> In this case AER errors starts even before driver probe starts.
> After probe, driver does the device reset with no success and even AER
> recovery does not work.

This case should be the same as the one above.  If we can change the
PCI core so it can reset the device when there's no driver, that would
apply to case I (where there will never be a driver) and to case II
(where there is no driver now, but a driver will probe the device
later).

> Problem mentioned in case I and II goes away if do pci_reset_function
> during enumeration phase of kdump kernel.
> can we thought of doing pci_reset_function for all devices in kdump
> kernel or device specific quirk.
> 
> --pk
> 
> 
> > > As per my understanding, possible solutions are
> > >  - Copy SMMU table i.e. this patch
> > > OR
> > >  - Doing pci_reset_function() during enumeration phase.
> > > I also tried clearing "M" bit using pci_clear_master during
> > > enumeration but it did not help. Because driver re-set M bit causing
> > > same AER error again.
> > >
> > >
> > > -pk
> > >
> > > ---------------------------------------------------------------------------------------------------------------------------
> > > [1] with bootargs having pci=noaer
> > >
> > > [   22.494648] {4}[Hardware Error]: Hardware error from APEI Generic
> > > Hardware Error Source: 1
> > > [   22.512773] {4}[Hardware Error]: event severity: recoverable
> > > [   22.518419] {4}[Hardware Error]:  Error 0, type: recoverable
> > > [   22.544804] {4}[Hardware Error]:   section_type: PCIe error
> > > [   22.550363] {4}[Hardware Error]:   port_type: 0, PCIe end point
> > > [   22.556268] {4}[Hardware Error]:   version: 3.0
> > > [   22.560785] {4}[Hardware Error]:   command: 0x0507, status: 0x4010
> > > [   22.576852] {4}[Hardware Error]:   device_id: 0000:09:00.1
> > > [   22.582323] {4}[Hardware Error]:   slot: 0
> > > [   22.586406] {4}[Hardware Error]:   secondary_bus: 0x00
> > > [   22.591530] {4}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > [   22.608900] {4}[Hardware Error]:   class_code: 000002
> > > [   22.613938] {4}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > [   22.803534] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > aer_mask: 0x00000000
> > > [   22.810838] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > [   22.817613] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > aer_agent=Requester ID
> > > [   22.847374] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > [   22.866161] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED,
> > > total mem (8153768 kB)
> > > [   22.946178] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> > > [   22.995142] pci 0000:09:00.1: AER: can't recover (no error_detected callback)
> > > [   23.002300] pcieport 0000:00:09.0: AER: device recovery failed
> > > [   23.027607] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > aer_mask: 0x00000000
> > > [   23.044109] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > [   23.060713] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > aer_agent=Requester ID
> > > [   23.068616] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > [   23.122056] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> > >
> > >
> > > ----------------------------------------------------------------------------------------------------------------------------
> > > [2] Normal bootargs.
> > >
> > > [   54.252454] {6}[Hardware Error]: Hardware error from APEI Generic
> > > Hardware Error Source: 1
> > > [   54.265827] {6}[Hardware Error]: event severity: recoverable
> > > [   54.271473] {6}[Hardware Error]:  Error 0, type: recoverable
> > > [   54.281605] {6}[Hardware Error]:   section_type: PCIe error
> > > [   54.287163] {6}[Hardware Error]:   port_type: 0, PCIe end point
> > > [   54.296955] {6}[Hardware Error]:   version: 3.0
> > > [   54.301471] {6}[Hardware Error]:   command: 0x0507, status: 0x4010
> > > [   54.312520] {6}[Hardware Error]:   device_id: 0000:09:00.1
> > > [   54.317991] {6}[Hardware Error]:   slot: 0
> > > [   54.322074] {6}[Hardware Error]:   secondary_bus: 0x00
> > > [   54.327197] {6}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > [   54.333797] {6}[Hardware Error]:   class_code: 000002
> > > [   54.351312] {6}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > [   54.358001] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [   54.376852] pcieport 0000:00:09.0: AER: device recovery successful
> > > [   54.383034] igb 0000:09:00.1: AER: aer_status: 0x00004000,
> > > aer_mask: 0x00000000
> > > [   54.390348] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > [   54.397144] igb 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > aer_agent=Requester ID
> > > [   54.409555] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > [   54.551370] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [   54.705214] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [   54.758703] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [   54.865445] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [   54.888751] pcieport 0000:00:09.0: AER: device recovery successful
> > > [   54.894933] igb 0000:09:00.1: AER: aer_status: 0x00004000,
> > > aer_mask: 0x00000000
> > > [   54.902228] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > [   54.916059] igb 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > aer_agent=Requester ID
> > > [   54.923972] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > [   55.057272] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [  274.571401] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [  274.686138] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [  274.786134] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [  274.886141] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [  397.792897] Workqueue: events aer_recover_work_func
> > > [  397.797760] Call trace:
> > > [  397.800199]  __switch_to+0xcc/0x108
> > > [  397.803675]  __schedule+0x2c0/0x700
> > > [  397.807150]  schedule+0x58/0xe8
> > > [  397.810283]  schedule_preempt_disabled+0x18/0x28
> > > [  397.810788] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [  397.814887]  __mutex_lock.isra.9+0x288/0x5c8
> > > [  397.814890]  __mutex_lock_slowpath+0x1c/0x28
> > > [  397.830962]  mutex_lock+0x4c/0x68
> > > [  397.834264]  report_slot_reset+0x30/0xa0
> > > [  397.838178]  pci_walk_bus+0x68/0xc0
> > > [  397.841653]  pcie_do_recovery+0xe8/0x248
> > > [  397.845562]  aer_recover_work_func+0x100/0x138
> > > [  397.849995]  process_one_work+0x1bc/0x458
> > > [  397.853991]  worker_thread+0x150/0x500
> > > [  397.857727]  kthread+0x114/0x118
> > > [  397.860945]  ret_from_fork+0x10/0x18
> > > [  397.864525] INFO: task kworker/223:2:2939 blocked for more than 122 seconds.
> > > [  397.871564]       Not tainted 5.7.0-rc3+ #68
> > > [  397.875819] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [  397.883638] kworker/223:2   D    0  2939      2 0x00000228
> > > [  397.889121] Workqueue: ipv6_addrconf addrconf_verify_work
> > > [  397.894505] Call trace:
> > > [  397.896940]  __switch_to+0xcc/0x108
> > > [  397.900419]  __schedule+0x2c0/0x700
> > > [  397.903894]  schedule+0x58/0xe8
> > > [  397.907023]  schedule_preempt_disabled+0x18/0x28
> > > [  397.910798] AER: AER recover: Buffer overflow when recovering AER
> > > for 0000:09:00:1
> > > [  397.911630]  __mutex_lock.isra.9+0x288/0x5c8
> > > [  397.923440]  __mutex_lock_slowpath+0x1c/0x28
> > > [  397.927696]  mutex_lock+0x4c/0x68
> > > [  397.931005]  rtnl_lock+0x24/0x30
> > > [  397.934220]  addrconf_verify_work+0x18/0x30
> > > [  397.938394]  process_one_work+0x1bc/0x458
> > > [  397.942390]  worker_thread+0x150/0x500
> > > [  397.946126]  kthread+0x114/0x118
> > > [  397.949345]  ret_from_fork+0x10/0x18
> > >
> > > ---------------------------------------------------------------------------------------------------------------------------------
> > > [3] with bootargs as pci=noaer and comment ghes_halder_aer() from AER driver
> > >
> > > [   69.037035] igb 0000:09:00.1 enp9s0f1: Reset adapter
> > > [   69.348446] {9}[Hardware Error]: Hardware error from APEI Generic
> > > Hardware Error Source: 0
> > > [   69.356698] {9}[Hardware Error]: It has been corrected by h/w and
> > > requires no further action
> > > [   69.365121] {9}[Hardware Error]: event severity: corrected
> > > [   69.370593] {9}[Hardware Error]:  Error 0, type: corrected
> > > [   69.376064] {9}[Hardware Error]:   section_type: PCIe error
> > > [   69.381623] {9}[Hardware Error]:   port_type: 4, root port
> > > [   69.387094] {9}[Hardware Error]:   version: 3.0
> > > [   69.391611] {9}[Hardware Error]:   command: 0x0106, status: 0x4010
> > > [   69.397777] {9}[Hardware Error]:   device_id: 0000:00:09.0
> > > [   69.403248] {9}[Hardware Error]:   slot: 0
> > > [   69.407331] {9}[Hardware Error]:   secondary_bus: 0x09
> > > [   69.412455] {9}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > > [   69.419055] {9}[Hardware Error]:   class_code: 000406
> > > [   69.424093] {9}[Hardware Error]:   bridge: secondary_status:
> > > 0x6000, control: 0x0002
> > > [   72.118132] igb 0000:09:00.1 enp9s0f1: igb: enp9s0f1 NIC Link is Up
> > > 1000 Mbps Full Duplex, Flow Control: RX
> > > [   73.995068] igb 0000:09:00.1: Detected Tx Unit Hang
> > > [   73.995068]   Tx Queue             <2>
> > > [   73.995068]   TDH                  <0>
> > > [   73.995068]   TDT                  <1>
> > > [   73.995068]   next_to_use          <1>
> > > [   73.995068]   next_to_clean        <0>
> > > [   73.995068] buffer_info[next_to_clean]
> > > [   73.995068]   time_stamp           <ffff9c1a>
> > > [   73.995068]   next_to_watch        <0000000097d42934>
> > > [   73.995068]   jiffies              <ffff9cd0>
> > > [   73.995068]   desc.status          <168000>
> > > [   75.987323] igb 0000:09:00.1: Detected Tx Unit Hang
> > > [   75.987323]   Tx Queue             <2>
> > > [   75.987323]   TDH                  <0>
> > > [   75.987323]   TDT                  <1>
> > > [   75.987323]   next_to_use          <1>
> > > [   75.987323]   next_to_clean        <0>
> > > [   75.987323] buffer_info[next_to_clean]
> > > [   75.987323]   time_stamp           <ffff9c1a>
> > > [   75.987323]   next_to_watch        <0000000097d42934>
> > > [   75.987323]   jiffies              <ffff9d98>
> > > [   75.987323]   desc.status          <168000>
> > > [   77.952661] {10}[Hardware Error]: Hardware error from APEI Generic
> > > Hardware Error Source: 1
> > > [   77.971790] {10}[Hardware Error]: event severity: recoverable
> > > [   77.977522] {10}[Hardware Error]:  Error 0, type: recoverable
> > > [   77.983254] {10}[Hardware Error]:   section_type: PCIe error
> > > [   77.999930] {10}[Hardware Error]:   port_type: 0, PCIe end point
> > > [   78.005922] {10}[Hardware Error]:   version: 3.0
> > > [   78.010526] {10}[Hardware Error]:   command: 0x0507, status: 0x4010
> > > [   78.016779] {10}[Hardware Error]:   device_id: 0000:09:00.1
> > > [   78.033107] {10}[Hardware Error]:   slot: 0
> > > [   78.037276] {10}[Hardware Error]:   secondary_bus: 0x00
> > > [   78.066253] {10}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > [   78.072940] {10}[Hardware Error]:   class_code: 000002
> > > [   78.078064] {10}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > [   78.096202] igb 0000:09:00.1: Detected Tx Unit Hang
> > > [   78.096202]   Tx Queue             <2>
> > > [   78.096202]   TDH                  <0>
> > > [   78.096202]   TDT                  <1>
> > > [   78.096202]   next_to_use          <1>
> > > [   78.096202]   next_to_clean        <0>
> > > [   78.096202] buffer_info[next_to_clean]
> > > [   78.096202]   time_stamp           <ffff9c1a>
> > > [   78.096202]   next_to_watch        <0000000097d42934>
> > > [   78.096202]   jiffies              <ffff9e6a>
> > > [   78.096202]   desc.status          <168000>
> > > [   79.587406] {11}[Hardware Error]: Hardware error from APEI Generic
> > > Hardware Error Source: 0
> > > [   79.595744] {11}[Hardware Error]: It has been corrected by h/w and
> > > requires no further action
> > > [   79.604254] {11}[Hardware Error]: event severity: corrected
> > > [   79.609813] {11}[Hardware Error]:  Error 0, type: corrected
> > > [   79.615371] {11}[Hardware Error]:   section_type: PCIe error
> > > [   79.621016] {11}[Hardware Error]:   port_type: 4, root port
> > > [   79.626574] {11}[Hardware Error]:   version: 3.0
> > > [   79.631177] {11}[Hardware Error]:   command: 0x0106, status: 0x4010
> > > [   79.637430] {11}[Hardware Error]:   device_id: 0000:00:09.0
> > > [   79.642988] {11}[Hardware Error]:   slot: 0
> > > [   79.647157] {11}[Hardware Error]:   secondary_bus: 0x09
> > > [   79.652368] {11}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > > [   79.659055] {11}[Hardware Error]:   class_code: 000406
> > > [   79.664180] {11}[Hardware Error]:   bridge: secondary_status:
> > > 0x6000, control: 0x0002
> > > [   79.987052] igb 0000:09:00.1: Detected Tx Unit Hang
> > > [   79.987052]   Tx Queue             <2>
> > > [   79.987052]   TDH                  <0>
> > > [   79.987052]   TDT                  <1>
> > > [   79.987052]   next_to_use          <1>
> > > [   79.987052]   next_to_clean        <0>
> > > [   79.987052] buffer_info[next_to_clean]
> > > [   79.987052]   time_stamp           <ffff9c1a>
> > > [   79.987052]   next_to_watch        <0000000097d42934>
> > > [   79.987052]   jiffies              <ffff9f28>
> > > [   79.987052]   desc.status          <168000>
> > > [   79.987056] igb 0000:09:00.1: Detected Tx Unit Hang
> > > [   79.987056]   Tx Queue             <3>
> > > [   79.987056]   TDH                  <0>
> > > [   79.987056]   TDT                  <1>
> > > [   79.987056]   next_to_use          <1>
> > > [   79.987056]   next_to_clean        <0>
> > > [   79.987056] buffer_info[next_to_clean]
> > > [   79.987056]   time_stamp           <ffff9e43>
> > > [   79.987056]   next_to_watch        <000000008da33deb>
> > > [   79.987056]   jiffies              <ffff9f28>
> > > [   79.987056]   desc.status          <514000>
> > > [   81.986688] igb 0000:09:00.1 enp9s0f1: Reset adapter
> > > [   81.986842] igb 0000:09:00.1: Detected Tx Unit Hang
> > > [   81.986842]   Tx Queue             <2>
> > > [   81.986842]   TDH                  <0>
> > > [   81.986842]   TDT                  <1>
> > > [   81.986842]   next_to_use          <1>
> > > [   81.986842]   next_to_clean        <0>
> > > [   81.986842] buffer_info[next_to_clean]
> > > [   81.986842]   time_stamp           <ffff9c1a>
> > > [   81.986842]   next_to_watch        <0000000097d42934>
> > > [   81.986842]   jiffies              <ffff9ff0>
> > > [   81.986842]   desc.status          <168000>
> > > [   81.986844] igb 0000:09:00.1: Detected Tx Unit Hang
> > > [   81.986844]   Tx Queue             <3>
> > > [   81.986844]   TDH                  <0>
> > > [   81.986844]   TDT                  <1>
> > > [   81.986844]   next_to_use          <1>
> > > [   81.986844]   next_to_clean        <0>
> > > [   81.986844] buffer_info[next_to_clean]
> > > [   81.986844]   time_stamp           <ffff9e43>
> > > [   81.986844]   next_to_watch        <000000008da33deb>
> > > [   81.986844]   jiffies              <ffff9ff0>
> > > [   81.986844]   desc.status          <514000>
> > > [   85.346515] {12}[Hardware Error]: Hardware error from APEI Generic
> > > Hardware Error Source: 0
> > > [   85.354854] {12}[Hardware Error]: It has been corrected by h/w and
> > > requires no further action
> > > [   85.363365] {12}[Hardware Error]: event severity: corrected
> > > [   85.368924] {12}[Hardware Error]:  Error 0, type: corrected
> > > [   85.374483] {12}[Hardware Error]:   section_type: PCIe error
> > > [   85.380129] {12}[Hardware Error]:   port_type: 0, PCIe end point
> > > [   85.386121] {12}[Hardware Error]:   version: 3.0
> > > [   85.390725] {12}[Hardware Error]:   command: 0x0507, status: 0x0010
> > > [   85.396980] {12}[Hardware Error]:   device_id: 0000:09:00.0
> > > [   85.402540] {12}[Hardware Error]:   slot: 0
> > > [   85.406710] {12}[Hardware Error]:   secondary_bus: 0x00
> > > [   85.411921] {12}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > [   85.418609] {12}[Hardware Error]:   class_code: 000002
> > > [   85.423733] {12}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > [   85.826695] igb 0000:09:00.1 enp9s0f1: igb: enp9s0f1 NIC Link is Up
> > > 1000 Mbps Full Duplex, Flow Control: RX
> > >
> > >
> > >
> > >
> > >
> > > > > So, If we are in a kdump kernel try to copy SMMU Stream table from
> > > > > primary/old kernel to preserve the mappings until the device driver
> > > > > takes over.
> > > > >
> > > > > Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
> > > > > ---
> > > > > Changes for v2: Used memremap in-place of ioremap
> > > > >
> > > > > V2 patch has been sanity tested.
> > > > >
> > > > > V1 patch has been tested with
> > > > > A) PCIe-Intel 82576 Gigabit Network card in following
> > > > > configurations with "no AER error". Each iteration has
> > > > > been tested on both Suse kdump rfs And default Centos distro rfs.
> > > > >
> > > > >  1)  with 2 level stream table
> > > > >        ----------------------------------------------------
> > > > >        SMMU               |  Normal Ping   | Flood Ping
> > > > >        -----------------------------------------------------
> > > > >        Default Operation  |  100 times     | 10 times
> > > > >        -----------------------------------------------------
> > > > >        IOMMU bypass       |  41 times      | 10 times
> > > > >        -----------------------------------------------------
> > > > >
> > > > >  2)  with Linear stream table.
> > > > >        -----------------------------------------------------
> > > > >        SMMU               |  Normal Ping   | Flood Ping
> > > > >        ------------------------------------------------------
> > > > >        Default Operation  |  100 times     | 10 times
> > > > >        ------------------------------------------------------
> > > > >        IOMMU bypass       |  55 times      | 10 times
> > > > >        -------------------------------------------------------
> > > > >
> > > > > B) This patch is also tested with Micron Technology Inc 9200 PRO NVMe
> > > > > SSD card with 2 level stream table using "fio" in mixed read/write and
> > > > > only read configurations. It is tested for both Default Operation and
> > > > > IOMMU bypass mode for minimum 10 iterations across Centos kdump rfs and
> > > > > default Centos ditstro rfs.
> > > > >
> > > > > This patch is not full proof solution. Issue can still come
> > > > > from the point device is discovered and driver probe called.
> > > > > This patch has reduced window of scenario from "SMMU Stream table
> > > > > creation - device-driver" to "device discovery - device-driver".
> > > > > Usually, device discovery to device-driver is very small time. So
> > > > > the probability is very low.
> > > > >
> > > > > Note: device-discovery will overwrite existing stream table entries
> > > > > with both SMMU stage as by-pass.
> > > > >
> > > > >
> > > > >  drivers/iommu/arm-smmu-v3.c | 36 +++++++++++++++++++++++++++++++++++-
> > > > >  1 file changed, 35 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > > > > index 82508730feb7..d492d92c2dd7 100644
> > > > > --- a/drivers/iommu/arm-smmu-v3.c
> > > > > +++ b/drivers/iommu/arm-smmu-v3.c
> > > > > @@ -1847,7 +1847,13 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
> > > > >                       break;
> > > > >               case STRTAB_STE_0_CFG_S1_TRANS:
> > > > >               case STRTAB_STE_0_CFG_S2_TRANS:
> > > > > -                     ste_live = true;
> > > > > +                     /*
> > > > > +                      * As kdump kernel copy STE table from previous
> > > > > +                      * kernel. It still may have valid stream table entries.
> > > > > +                      * Forcing entry as false to allow overwrite.
> > > > > +                      */
> > > > > +                     if (!is_kdump_kernel())
> > > > > +                             ste_live = true;
> > > > >                       break;
> > > > >               case STRTAB_STE_0_CFG_ABORT:
> > > > >                       BUG_ON(!disable_bypass);
> > > > > @@ -3264,6 +3270,9 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> > > > >               return -ENOMEM;
> > > > >       }
> > > > >
> > > > > +     if (is_kdump_kernel())
> > > > > +             return 0;
> > > > > +
> > > > >       for (i = 0; i < cfg->num_l1_ents; ++i) {
> > > > >               arm_smmu_write_strtab_l1_desc(strtab, &cfg->l1_desc[i]);
> > > > >               strtab += STRTAB_L1_DESC_DWORDS << 3;
> > > > > @@ -3272,6 +3281,23 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> > > > >       return 0;
> > > > >  }
> > > > >
> > > > > +static void arm_smmu_copy_table(struct arm_smmu_device *smmu,
> > > > > +                            struct arm_smmu_strtab_cfg *cfg, u32 size)
> > > > > +{
> > > > > +     struct arm_smmu_strtab_cfg rdcfg;
> > > > > +
> > > > > +     rdcfg.strtab_dma = readq_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE);
> > > > > +     rdcfg.strtab_base_cfg = readq_relaxed(smmu->base
> > > > > +                                           + ARM_SMMU_STRTAB_BASE_CFG);
> > > > > +
> > > > > +     rdcfg.strtab_dma &= STRTAB_BASE_ADDR_MASK;
> > > > > +     rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);
> > > > > +
> > > > > +     memcpy_fromio(cfg->strtab, rdcfg.strtab, size);
> > > > > +
> > > > > +     cfg->strtab_base_cfg = rdcfg.strtab_base_cfg;
> > > > > +}
> > > > > +
> > > > >  static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
> > > > >  {
> > > > >       void *strtab;
> > > > > @@ -3307,6 +3333,9 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
> > > > >       reg |= FIELD_PREP(STRTAB_BASE_CFG_SPLIT, STRTAB_SPLIT);
> > > > >       cfg->strtab_base_cfg = reg;
> > > > >
> > > > > +     if (is_kdump_kernel())
> > > > > +             arm_smmu_copy_table(smmu, cfg, l1size);
> > > > > +
> > > > >       return arm_smmu_init_l1_strtab(smmu);
> > > > >  }
> > > > >
> > > > > @@ -3334,6 +3363,11 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
> > > > >       reg |= FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE, smmu->sid_bits);
> > > > >       cfg->strtab_base_cfg = reg;
> > > > >
> > > > > +     if (is_kdump_kernel()) {
> > > > > +             arm_smmu_copy_table(smmu, cfg, size);
> > > > > +             return 0;
> > > > > +     }
> > > > > +
> > > > >       arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents);
> > > > >       return 0;
> > > > >  }
> > > > > --
> > > > > 2.18.2
> > > > >

Prabhakar Kushwaha May 27, 2020, 11:44 a.m. UTC | #10

Hi Bjorn,

On Fri, May 22, 2020 at 4:19 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, May 21, 2020 at 09:28:20AM +0530, Prabhakar Kushwaha wrote:
> > On Wed, May 20, 2020 at 4:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
> > > > On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > > > An SMMU Stream table is created by the primary kernel. This table is
> > > > > > used by the SMMU to perform address translations for device-originated
> > > > > > transactions. Any crash (if happened) launches the kdump kernel which
> > > > > > re-creates the SMMU Stream table. New transactions will be translated
> > > > > > via this new table..
> > > > > >
> > > > > > There are scenarios, where devices are still having old pending
> > > > > > transactions (configured in the primary kernel). These transactions
> > > > > > come in-between Stream table creation and device-driver probe.
> > > > > > As new stream table does not have entry for older transactions,
> > > > > > it will be aborted by SMMU.
> > > > > >
> > > > > > Similar observations were found with PCIe-Intel 82576 Gigabit
> > > > > > Network card. It sends old Memory Read transaction in kdump kernel.
> > > > > > Transactions configured for older Stream table entries, that do not
> > > > > > exist any longer in the new table, will cause a PCIe Completion Abort.
> > > > >
> > > > > That sounds like exactly what we want, doesn't it?
> > > > >
> > > > > Or do you *want* DMA from the previous kernel to complete?  That will
> > > > > read or scribble on something, but maybe that's not terrible as long
> > > > > as it's not memory used by the kdump kernel.
> > > >
> > > > Yes, Abort should happen. But it should happen in context of driver.
> > > > But current abort is happening because of SMMU and no driver/pcie
> > > > setup present at this moment.
> > >
> > > I don't understand what you mean by "in context of driver."  The whole
> > > problem is that we can't control *when* the abort happens, so it may
> > > happen in *any* context.  It may happen when a NIC receives a packet
> > > or at some other unpredictable time.
> > >
> > > > Solution of this issue should be at 2 place
> > > > a) SMMU level: I still believe, this patch has potential to overcome
> > > > issue till finally driver's probe takeover.
> > > > b) Device level: Even if something goes wrong. Driver/device should
> > > > able to recover.
> > > >
> > > > > > Returned PCIe completion abort further leads to AER Errors from APEI
> > > > > > Generic Hardware Error Source (GHES) with completion timeout.
> > > > > > A network device hang is observed even after continuous
> > > > > > reset/recovery from driver, Hence device is no more usable.
> > > > >
> > > > > The fact that the device is no longer usable is definitely a problem.
> > > > > But in principle we *should* be able to recover from these errors.  If
> > > > > we could recover and reliably use the device after the error, that
> > > > > seems like it would be a more robust solution that having to add
> > > > > special cases in every IOMMU driver.
> > > > >
> > > > > If you have details about this sort of error, I'd like to try to fix
> > > > > it because we want to recover from that sort of error in normal
> > > > > (non-crash) situations as well.
> > > > >
> > > > Completion abort case should be gracefully handled.  And device should
> > > > always remain usable.
> > > >
> > > > There are 2 scenario which I am testing with Ethernet card PCIe-Intel
> > > > 82576 Gigabit Network card.
> > > >
> > > > I)  Crash testing using kdump root file system: De-facto scenario
> > > >     -  kdump file system does not have Ethernet driver
> > > >     -  A lot of AER prints [1], making it impossible to work on shell
> > > > of kdump root file system.
> > >
> > > In this case, I think report_error_detected() is deciding that because
> > > the device has no driver, we can't do anything.  The flow is like
> > > this:
> > >
> > >   aer_recover_work_func               # aer_recover_work
> > >     kfifo_get(aer_recover_ring, entry)
> > >     dev = pci_get_domain_bus_and_slot
> > >     cper_print_aer(dev, ...)
> > >       pci_err("AER: aer_status:")
> > >       pci_err("AER:   [14] CmpltTO")
> > >       pci_err("AER: aer_layer=")
> > >     if (AER_NONFATAL)
> > >       pcie_do_recovery(dev, pci_channel_io_normal)
> > >         status = CAN_RECOVER
> > >         pci_walk_bus(report_normal_detected)
> > >           report_error_detected
> > >             if (!dev->driver)
> > >               vote = NO_AER_DRIVER
> > >               pci_info("can't recover (no error_detected callback)")
> > >             *result = merge_result(*, NO_AER_DRIVER)
> > >             # always NO_AER_DRIVER
> > >         status is now NO_AER_DRIVER
> > >
> > > So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(),
> > > and status is not RECOVERED, so it skips .resume().
> > >
> > > I don't remember the history there, but if a device has no driver and
> > > the device generates errors, it seems like we ought to be able to
> > > reset it.
> >
> > But how to reset the device considering there is no driver.
> > Hypothetically, this case should be taken care by PCIe subsystem to
> > perform reset at PCIe level.
>
> I don't understand your question.  The PCI core (not the device
> driver) already does the reset.  When pcie_do_recovery() calls
> reset_link(), all devices on the other side of the link are reset.
>
> > > We should be able to field one (or a few) AER errors, reset the
> > > device, and you should be able to use the shell in the kdump kernel.
> > >
> > here kdump shell is usable only problem is a "lot of AER Errors". One
> > cannot see what they are typing.
>
> Right, that's what I expect.  If the PCI core resets the device, you
> should get just a few AER errors, and they should stop after the
> device is reset.
>
> > > >     -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.
> > > >
> > > > II) Crash testing using default root file system: Specific case to
> > > > test Ethernet driver in second kernel
> > > >    -  Default root file system have Ethernet driver
> > > >    -  AER error comes even before the driver probe starts.
> > > >    -  Driver does reset Ethernet card as part of probe but no success.
> > > >    -  AER also tries to recover. but no success.  [2]
> > > >    -  I also tries to remove AER errors by using "pci=noaer" bootargs
> > > > and commenting ghes_handle_aer() from GHES driver..
> > > >           than different set of errors come which also never able to recover [3]
> > > >
> >
> > Please suggest your view on this case. Here driver is preset.
> > (driver/net/ethernet/intel/igb/igb_main.c)
> > In this case AER errors starts even before driver probe starts.
> > After probe, driver does the device reset with no success and even AER
> > recovery does not work.
>
> This case should be the same as the one above.  If we can change the
> PCI core so it can reset the device when there's no driver,  that would
> apply to case I (where there will never be a driver) and to case II
> (where there is no driver now, but a driver will probe the device
> later).
>

Does this means change are required in PCI core.
I tried following changes in pcie_do_recovery() but it did not help.
Same error as before.

-- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
        pci_info(dev, "broadcast resume message\n");
        pci_walk_bus(bus, report_resume, &status);
@@ -203,7 +207,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
        return status;

 failed:
        pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
+       pci_reset_function(dev);
+       pci_aer_clear_device_status(dev);
+       pci_aer_clear_nonfatal_status(dev);

--pk

> > Problem mentioned in case I and II goes away if do pci_reset_function
> > during enumeration phase of kdump kernel.
> > can we thought of doing pci_reset_function for all devices in kdump
> > kernel or device specific quirk.
> >
> > --pk
> >
> >
> > > > As per my understanding, possible solutions are
> > > >  - Copy SMMU table i.e. this patch
> > > > OR
> > > >  - Doing pci_reset_function() during enumeration phase.
> > > > I also tried clearing "M" bit using pci_clear_master during
> > > > enumeration but it did not help. Because driver re-set M bit causing
> > > > same AER error again.
> > > >
> > > >
> > > > -pk
> > > >
> > > > ---------------------------------------------------------------------------------------------------------------------------
> > > > [1] with bootargs having pci=noaer
> > > >
> > > > [   22.494648] {4}[Hardware Error]: Hardware error from APEI Generic
> > > > Hardware Error Source: 1
> > > > [   22.512773] {4}[Hardware Error]: event severity: recoverable
> > > > [   22.518419] {4}[Hardware Error]:  Error 0, type: recoverable
> > > > [   22.544804] {4}[Hardware Error]:   section_type: PCIe error
> > > > [   22.550363] {4}[Hardware Error]:   port_type: 0, PCIe end point
> > > > [   22.556268] {4}[Hardware Error]:   version: 3.0
> > > > [   22.560785] {4}[Hardware Error]:   command: 0x0507, status: 0x4010
> > > > [   22.576852] {4}[Hardware Error]:   device_id: 0000:09:00.1
> > > > [   22.582323] {4}[Hardware Error]:   slot: 0
> > > > [   22.586406] {4}[Hardware Error]:   secondary_bus: 0x00
> > > > [   22.591530] {4}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > > [   22.608900] {4}[Hardware Error]:   class_code: 000002
> > > > [   22.613938] {4}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > > [   22.803534] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > aer_mask: 0x00000000
> > > > [   22.810838] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > [   22.817613] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > aer_agent=Requester ID
> > > > [   22.847374] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > [   22.866161] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED,
> > > > total mem (8153768 kB)
> > > > [   22.946178] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> > > > [   22.995142] pci 0000:09:00.1: AER: can't recover (no error_detected callback)
> > > > [   23.002300] pcieport 0000:00:09.0: AER: device recovery failed
> > > > [   23.027607] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > aer_mask: 0x00000000
> > > > [   23.044109] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > [   23.060713] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > aer_agent=Requester ID
> > > > [   23.068616] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > [   23.122056] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> > > >
> > > >
> > > > ----------------------------------------------------------------------------------------------------------------------------
> > > > [2] Normal bootargs.
> > > >
> > > > [   54.252454] {6}[Hardware Error]: Hardware error from APEI Generic
> > > > Hardware Error Source: 1
> > > > [   54.265827] {6}[Hardware Error]: event severity: recoverable
> > > > [   54.271473] {6}[Hardware Error]:  Error 0, type: recoverable
> > > > [   54.281605] {6}[Hardware Error]:   section_type: PCIe error
> > > > [   54.287163] {6}[Hardware Error]:   port_type: 0, PCIe end point
> > > > [   54.296955] {6}[Hardware Error]:   version: 3.0
> > > > [   54.301471] {6}[Hardware Error]:   command: 0x0507, status: 0x4010
> > > > [   54.312520] {6}[Hardware Error]:   device_id: 0000:09:00.1
> > > > [   54.317991] {6}[Hardware Error]:   slot: 0
> > > > [   54.322074] {6}[Hardware Error]:   secondary_bus: 0x00
> > > > [   54.327197] {6}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > > [   54.333797] {6}[Hardware Error]:   class_code: 000002
> > > > [   54.351312] {6}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > > [   54.358001] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [   54.376852] pcieport 0000:00:09.0: AER: device recovery successful
> > > > [   54.383034] igb 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > aer_mask: 0x00000000
> > > > [   54.390348] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > [   54.397144] igb 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > aer_agent=Requester ID
> > > > [   54.409555] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > [   54.551370] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [   54.705214] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [   54.758703] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [   54.865445] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [   54.888751] pcieport 0000:00:09.0: AER: device recovery successful
> > > > [   54.894933] igb 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > aer_mask: 0x00000000
> > > > [   54.902228] igb 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > [   54.916059] igb 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > aer_agent=Requester ID
> > > > [   54.923972] igb 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > [   55.057272] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [  274.571401] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [  274.686138] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [  274.786134] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [  274.886141] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [  397.792897] Workqueue: events aer_recover_work_func
> > > > [  397.797760] Call trace:
> > > > [  397.800199]  __switch_to+0xcc/0x108
> > > > [  397.803675]  __schedule+0x2c0/0x700
> > > > [  397.807150]  schedule+0x58/0xe8
> > > > [  397.810283]  schedule_preempt_disabled+0x18/0x28
> > > > [  397.810788] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [  397.814887]  __mutex_lock.isra.9+0x288/0x5c8
> > > > [  397.814890]  __mutex_lock_slowpath+0x1c/0x28
> > > > [  397.830962]  mutex_lock+0x4c/0x68
> > > > [  397.834264]  report_slot_reset+0x30/0xa0
> > > > [  397.838178]  pci_walk_bus+0x68/0xc0
> > > > [  397.841653]  pcie_do_recovery+0xe8/0x248
> > > > [  397.845562]  aer_recover_work_func+0x100/0x138
> > > > [  397.849995]  process_one_work+0x1bc/0x458
> > > > [  397.853991]  worker_thread+0x150/0x500
> > > > [  397.857727]  kthread+0x114/0x118
> > > > [  397.860945]  ret_from_fork+0x10/0x18
> > > > [  397.864525] INFO: task kworker/223:2:2939 blocked for more than 122 seconds.
> > > > [  397.871564]       Not tainted 5.7.0-rc3+ #68
> > > > [  397.875819] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > > disables this message.
> > > > [  397.883638] kworker/223:2   D    0  2939      2 0x00000228
> > > > [  397.889121] Workqueue: ipv6_addrconf addrconf_verify_work
> > > > [  397.894505] Call trace:
> > > > [  397.896940]  __switch_to+0xcc/0x108
> > > > [  397.900419]  __schedule+0x2c0/0x700
> > > > [  397.903894]  schedule+0x58/0xe8
> > > > [  397.907023]  schedule_preempt_disabled+0x18/0x28
> > > > [  397.910798] AER: AER recover: Buffer overflow when recovering AER
> > > > for 0000:09:00:1
> > > > [  397.911630]  __mutex_lock.isra.9+0x288/0x5c8
> > > > [  397.923440]  __mutex_lock_slowpath+0x1c/0x28
> > > > [  397.927696]  mutex_lock+0x4c/0x68
> > > > [  397.931005]  rtnl_lock+0x24/0x30
> > > > [  397.934220]  addrconf_verify_work+0x18/0x30
> > > > [  397.938394]  process_one_work+0x1bc/0x458
> > > > [  397.942390]  worker_thread+0x150/0x500
> > > > [  397.946126]  kthread+0x114/0x118
> > > > [  397.949345]  ret_from_fork+0x10/0x18
> > > >
> > > > ---------------------------------------------------------------------------------------------------------------------------------
> > > > [3] with bootargs as pci=noaer and comment ghes_halder_aer() from AER driver
> > > >
> > > > [   69.037035] igb 0000:09:00.1 enp9s0f1: Reset adapter
> > > > [   69.348446] {9}[Hardware Error]: Hardware error from APEI Generic
> > > > Hardware Error Source: 0
> > > > [   69.356698] {9}[Hardware Error]: It has been corrected by h/w and
> > > > requires no further action
> > > > [   69.365121] {9}[Hardware Error]: event severity: corrected
> > > > [   69.370593] {9}[Hardware Error]:  Error 0, type: corrected
> > > > [   69.376064] {9}[Hardware Error]:   section_type: PCIe error
> > > > [   69.381623] {9}[Hardware Error]:   port_type: 4, root port
> > > > [   69.387094] {9}[Hardware Error]:   version: 3.0
> > > > [   69.391611] {9}[Hardware Error]:   command: 0x0106, status: 0x4010
> > > > [   69.397777] {9}[Hardware Error]:   device_id: 0000:00:09.0
> > > > [   69.403248] {9}[Hardware Error]:   slot: 0
> > > > [   69.407331] {9}[Hardware Error]:   secondary_bus: 0x09
> > > > [   69.412455] {9}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > > > [   69.419055] {9}[Hardware Error]:   class_code: 000406
> > > > [   69.424093] {9}[Hardware Error]:   bridge: secondary_status:
> > > > 0x6000, control: 0x0002
> > > > [   72.118132] igb 0000:09:00.1 enp9s0f1: igb: enp9s0f1 NIC Link is Up
> > > > 1000 Mbps Full Duplex, Flow Control: RX
> > > > [   73.995068] igb 0000:09:00.1: Detected Tx Unit Hang
> > > > [   73.995068]   Tx Queue             <2>
> > > > [   73.995068]   TDH                  <0>
> > > > [   73.995068]   TDT                  <1>
> > > > [   73.995068]   next_to_use          <1>
> > > > [   73.995068]   next_to_clean        <0>
> > > > [   73.995068] buffer_info[next_to_clean]
> > > > [   73.995068]   time_stamp           <ffff9c1a>
> > > > [   73.995068]   next_to_watch        <0000000097d42934>
> > > > [   73.995068]   jiffies              <ffff9cd0>
> > > > [   73.995068]   desc.status          <168000>
> > > > [   75.987323] igb 0000:09:00.1: Detected Tx Unit Hang
> > > > [   75.987323]   Tx Queue             <2>
> > > > [   75.987323]   TDH                  <0>
> > > > [   75.987323]   TDT                  <1>
> > > > [   75.987323]   next_to_use          <1>
> > > > [   75.987323]   next_to_clean        <0>
> > > > [   75.987323] buffer_info[next_to_clean]
> > > > [   75.987323]   time_stamp           <ffff9c1a>
> > > > [   75.987323]   next_to_watch        <0000000097d42934>
> > > > [   75.987323]   jiffies              <ffff9d98>
> > > > [   75.987323]   desc.status          <168000>
> > > > [   77.952661] {10}[Hardware Error]: Hardware error from APEI Generic
> > > > Hardware Error Source: 1
> > > > [   77.971790] {10}[Hardware Error]: event severity: recoverable
> > > > [   77.977522] {10}[Hardware Error]:  Error 0, type: recoverable
> > > > [   77.983254] {10}[Hardware Error]:   section_type: PCIe error
> > > > [   77.999930] {10}[Hardware Error]:   port_type: 0, PCIe end point
> > > > [   78.005922] {10}[Hardware Error]:   version: 3.0
> > > > [   78.010526] {10}[Hardware Error]:   command: 0x0507, status: 0x4010
> > > > [   78.016779] {10}[Hardware Error]:   device_id: 0000:09:00.1
> > > > [   78.033107] {10}[Hardware Error]:   slot: 0
> > > > [   78.037276] {10}[Hardware Error]:   secondary_bus: 0x00
> > > > [   78.066253] {10}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > > [   78.072940] {10}[Hardware Error]:   class_code: 000002
> > > > [   78.078064] {10}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > > [   78.096202] igb 0000:09:00.1: Detected Tx Unit Hang
> > > > [   78.096202]   Tx Queue             <2>
> > > > [   78.096202]   TDH                  <0>
> > > > [   78.096202]   TDT                  <1>
> > > > [   78.096202]   next_to_use          <1>
> > > > [   78.096202]   next_to_clean        <0>
> > > > [   78.096202] buffer_info[next_to_clean]
> > > > [   78.096202]   time_stamp           <ffff9c1a>
> > > > [   78.096202]   next_to_watch        <0000000097d42934>
> > > > [   78.096202]   jiffies              <ffff9e6a>
> > > > [   78.096202]   desc.status          <168000>
> > > > [   79.587406] {11}[Hardware Error]: Hardware error from APEI Generic
> > > > Hardware Error Source: 0
> > > > [   79.595744] {11}[Hardware Error]: It has been corrected by h/w and
> > > > requires no further action
> > > > [   79.604254] {11}[Hardware Error]: event severity: corrected
> > > > [   79.609813] {11}[Hardware Error]:  Error 0, type: corrected
> > > > [   79.615371] {11}[Hardware Error]:   section_type: PCIe error
> > > > [   79.621016] {11}[Hardware Error]:   port_type: 4, root port
> > > > [   79.626574] {11}[Hardware Error]:   version: 3.0
> > > > [   79.631177] {11}[Hardware Error]:   command: 0x0106, status: 0x4010
> > > > [   79.637430] {11}[Hardware Error]:   device_id: 0000:00:09.0
> > > > [   79.642988] {11}[Hardware Error]:   slot: 0
> > > > [   79.647157] {11}[Hardware Error]:   secondary_bus: 0x09
> > > > [   79.652368] {11}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > > > [   79.659055] {11}[Hardware Error]:   class_code: 000406
> > > > [   79.664180] {11}[Hardware Error]:   bridge: secondary_status:
> > > > 0x6000, control: 0x0002
> > > > [   79.987052] igb 0000:09:00.1: Detected Tx Unit Hang
> > > > [   79.987052]   Tx Queue             <2>
> > > > [   79.987052]   TDH                  <0>
> > > > [   79.987052]   TDT                  <1>
> > > > [   79.987052]   next_to_use          <1>
> > > > [   79.987052]   next_to_clean        <0>
> > > > [   79.987052] buffer_info[next_to_clean]
> > > > [   79.987052]   time_stamp           <ffff9c1a>
> > > > [   79.987052]   next_to_watch        <0000000097d42934>
> > > > [   79.987052]   jiffies              <ffff9f28>
> > > > [   79.987052]   desc.status          <168000>
> > > > [   79.987056] igb 0000:09:00.1: Detected Tx Unit Hang
> > > > [   79.987056]   Tx Queue             <3>
> > > > [   79.987056]   TDH                  <0>
> > > > [   79.987056]   TDT                  <1>
> > > > [   79.987056]   next_to_use          <1>
> > > > [   79.987056]   next_to_clean        <0>
> > > > [   79.987056] buffer_info[next_to_clean]
> > > > [   79.987056]   time_stamp           <ffff9e43>
> > > > [   79.987056]   next_to_watch        <000000008da33deb>
> > > > [   79.987056]   jiffies              <ffff9f28>
> > > > [   79.987056]   desc.status          <514000>
> > > > [   81.986688] igb 0000:09:00.1 enp9s0f1: Reset adapter
> > > > [   81.986842] igb 0000:09:00.1: Detected Tx Unit Hang
> > > > [   81.986842]   Tx Queue             <2>
> > > > [   81.986842]   TDH                  <0>
> > > > [   81.986842]   TDT                  <1>
> > > > [   81.986842]   next_to_use          <1>
> > > > [   81.986842]   next_to_clean        <0>
> > > > [   81.986842] buffer_info[next_to_clean]
> > > > [   81.986842]   time_stamp           <ffff9c1a>
> > > > [   81.986842]   next_to_watch        <0000000097d42934>
> > > > [   81.986842]   jiffies              <ffff9ff0>
> > > > [   81.986842]   desc.status          <168000>
> > > > [   81.986844] igb 0000:09:00.1: Detected Tx Unit Hang
> > > > [   81.986844]   Tx Queue             <3>
> > > > [   81.986844]   TDH                  <0>
> > > > [   81.986844]   TDT                  <1>
> > > > [   81.986844]   next_to_use          <1>
> > > > [   81.986844]   next_to_clean        <0>
> > > > [   81.986844] buffer_info[next_to_clean]
> > > > [   81.986844]   time_stamp           <ffff9e43>
> > > > [   81.986844]   next_to_watch        <000000008da33deb>
> > > > [   81.986844]   jiffies              <ffff9ff0>
> > > > [   81.986844]   desc.status          <514000>
> > > > [   85.346515] {12}[Hardware Error]: Hardware error from APEI Generic
> > > > Hardware Error Source: 0
> > > > [   85.354854] {12}[Hardware Error]: It has been corrected by h/w and
> > > > requires no further action
> > > > [   85.363365] {12}[Hardware Error]: event severity: corrected
> > > > [   85.368924] {12}[Hardware Error]:  Error 0, type: corrected
> > > > [   85.374483] {12}[Hardware Error]:   section_type: PCIe error
> > > > [   85.380129] {12}[Hardware Error]:   port_type: 0, PCIe end point
> > > > [   85.386121] {12}[Hardware Error]:   version: 3.0
> > > > [   85.390725] {12}[Hardware Error]:   command: 0x0507, status: 0x0010
> > > > [   85.396980] {12}[Hardware Error]:   device_id: 0000:09:00.0
> > > > [   85.402540] {12}[Hardware Error]:   slot: 0
> > > > [   85.406710] {12}[Hardware Error]:   secondary_bus: 0x00
> > > > [   85.411921] {12}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > > [   85.418609] {12}[Hardware Error]:   class_code: 000002
> > > > [   85.423733] {12}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > > [   85.826695] igb 0000:09:00.1 enp9s0f1: igb: enp9s0f1 NIC Link is Up
> > > > 1000 Mbps Full Duplex, Flow Control: RX
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > > > So, If we are in a kdump kernel try to copy SMMU Stream table from
> > > > > > primary/old kernel to preserve the mappings until the device driver
> > > > > > takes over.
> > > > > >
> > > > > > Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com>
> > > > > > ---
> > > > > > Changes for v2: Used memremap in-place of ioremap
> > > > > >
> > > > > > V2 patch has been sanity tested.
> > > > > >
> > > > > > V1 patch has been tested with
> > > > > > A) PCIe-Intel 82576 Gigabit Network card in following
> > > > > > configurations with "no AER error". Each iteration has
> > > > > > been tested on both Suse kdump rfs And default Centos distro rfs.
> > > > > >
> > > > > >  1)  with 2 level stream table
> > > > > >        ----------------------------------------------------
> > > > > >        SMMU               |  Normal Ping   | Flood Ping
> > > > > >        -----------------------------------------------------
> > > > > >        Default Operation  |  100 times     | 10 times
> > > > > >        -----------------------------------------------------
> > > > > >        IOMMU bypass       |  41 times      | 10 times
> > > > > >        -----------------------------------------------------
> > > > > >
> > > > > >  2)  with Linear stream table.
> > > > > >        -----------------------------------------------------
> > > > > >        SMMU               |  Normal Ping   | Flood Ping
> > > > > >        ------------------------------------------------------
> > > > > >        Default Operation  |  100 times     | 10 times
> > > > > >        ------------------------------------------------------
> > > > > >        IOMMU bypass       |  55 times      | 10 times
> > > > > >        -------------------------------------------------------
> > > > > >
> > > > > > B) This patch is also tested with Micron Technology Inc 9200 PRO NVMe
> > > > > > SSD card with 2 level stream table using "fio" in mixed read/write and
> > > > > > only read configurations. It is tested for both Default Operation and
> > > > > > IOMMU bypass mode for minimum 10 iterations across Centos kdump rfs and
> > > > > > default Centos ditstro rfs.
> > > > > >
> > > > > > This patch is not full proof solution. Issue can still come
> > > > > > from the point device is discovered and driver probe called.
> > > > > > This patch has reduced window of scenario from "SMMU Stream table
> > > > > > creation - device-driver" to "device discovery - device-driver".
> > > > > > Usually, device discovery to device-driver is very small time. So
> > > > > > the probability is very low.
> > > > > >
> > > > > > Note: device-discovery will overwrite existing stream table entries
> > > > > > with both SMMU stage as by-pass.
> > > > > >
> > > > > >
> > > > > >  drivers/iommu/arm-smmu-v3.c | 36 +++++++++++++++++++++++++++++++++++-
> > > > > >  1 file changed, 35 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> > > > > > index 82508730feb7..d492d92c2dd7 100644
> > > > > > --- a/drivers/iommu/arm-smmu-v3.c
> > > > > > +++ b/drivers/iommu/arm-smmu-v3.c
> > > > > > @@ -1847,7 +1847,13 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
> > > > > >                       break;
> > > > > >               case STRTAB_STE_0_CFG_S1_TRANS:
> > > > > >               case STRTAB_STE_0_CFG_S2_TRANS:
> > > > > > -                     ste_live = true;
> > > > > > +                     /*
> > > > > > +                      * As kdump kernel copy STE table from previous
> > > > > > +                      * kernel. It still may have valid stream table entries.
> > > > > > +                      * Forcing entry as false to allow overwrite.
> > > > > > +                      */
> > > > > > +                     if (!is_kdump_kernel())
> > > > > > +                             ste_live = true;
> > > > > >                       break;
> > > > > >               case STRTAB_STE_0_CFG_ABORT:
> > > > > >                       BUG_ON(!disable_bypass);
> > > > > > @@ -3264,6 +3270,9 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> > > > > >               return -ENOMEM;
> > > > > >       }
> > > > > >
> > > > > > +     if (is_kdump_kernel())
> > > > > > +             return 0;
> > > > > > +
> > > > > >       for (i = 0; i < cfg->num_l1_ents; ++i) {
> > > > > >               arm_smmu_write_strtab_l1_desc(strtab, &cfg->l1_desc[i]);
> > > > > >               strtab += STRTAB_L1_DESC_DWORDS << 3;
> > > > > > @@ -3272,6 +3281,23 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> > > > > >       return 0;
> > > > > >  }
> > > > > >
> > > > > > +static void arm_smmu_copy_table(struct arm_smmu_device *smmu,
> > > > > > +                            struct arm_smmu_strtab_cfg *cfg, u32 size)
> > > > > > +{
> > > > > > +     struct arm_smmu_strtab_cfg rdcfg;
> > > > > > +
> > > > > > +     rdcfg.strtab_dma = readq_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE);
> > > > > > +     rdcfg.strtab_base_cfg = readq_relaxed(smmu->base
> > > > > > +                                           + ARM_SMMU_STRTAB_BASE_CFG);
> > > > > > +
> > > > > > +     rdcfg.strtab_dma &= STRTAB_BASE_ADDR_MASK;
> > > > > > +     rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);
> > > > > > +
> > > > > > +     memcpy_fromio(cfg->strtab, rdcfg.strtab, size);
> > > > > > +
> > > > > > +     cfg->strtab_base_cfg = rdcfg.strtab_base_cfg;
> > > > > > +}
> > > > > > +
> > > > > >  static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
> > > > > >  {
> > > > > >       void *strtab;
> > > > > > @@ -3307,6 +3333,9 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
> > > > > >       reg |= FIELD_PREP(STRTAB_BASE_CFG_SPLIT, STRTAB_SPLIT);
> > > > > >       cfg->strtab_base_cfg = reg;
> > > > > >
> > > > > > +     if (is_kdump_kernel())
> > > > > > +             arm_smmu_copy_table(smmu, cfg, l1size);
> > > > > > +
> > > > > >       return arm_smmu_init_l1_strtab(smmu);
> > > > > >  }
> > > > > >
> > > > > > @@ -3334,6 +3363,11 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
> > > > > >       reg |= FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE, smmu->sid_bits);
> > > > > >       cfg->strtab_base_cfg = reg;
> > > > > >
> > > > > > +     if (is_kdump_kernel()) {
> > > > > > +             arm_smmu_copy_table(smmu, cfg, size);
> > > > > > +             return 0;
> > > > > > +     }
> > > > > > +
> > > > > >       arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents);
> > > > > >       return 0;
> > > > > >  }
> > > > > > --
> > > > > > 2.18.2
> > > > > >

Bjorn Helgaas May 27, 2020, 8:18 p.m. UTC | #11

On Wed, May 27, 2020 at 05:14:39PM +0530, Prabhakar Kushwaha wrote:
> On Fri, May 22, 2020 at 4:19 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, May 21, 2020 at 09:28:20AM +0530, Prabhakar Kushwaha wrote:
> > > On Wed, May 20, 2020 at 4:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
> > > > > On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > > > > An SMMU Stream table is created by the primary kernel. This table is
> > > > > > > used by the SMMU to perform address translations for device-originated
> > > > > > > transactions. Any crash (if happened) launches the kdump kernel which
> > > > > > > re-creates the SMMU Stream table. New transactions will be translated
> > > > > > > via this new table..
> > > > > > >
> > > > > > > There are scenarios, where devices are still having old pending
> > > > > > > transactions (configured in the primary kernel). These transactions
> > > > > > > come in-between Stream table creation and device-driver probe.
> > > > > > > As new stream table does not have entry for older transactions,
> > > > > > > it will be aborted by SMMU.
> > > > > > >
> > > > > > > Similar observations were found with PCIe-Intel 82576 Gigabit
> > > > > > > Network card. It sends old Memory Read transaction in kdump kernel.
> > > > > > > Transactions configured for older Stream table entries, that do not
> > > > > > > exist any longer in the new table, will cause a PCIe Completion Abort.
> > > > > >
> > > > > > That sounds like exactly what we want, doesn't it?
> > > > > >
> > > > > > Or do you *want* DMA from the previous kernel to complete?  That will
> > > > > > read or scribble on something, but maybe that's not terrible as long
> > > > > > as it's not memory used by the kdump kernel.
> > > > >
> > > > > Yes, Abort should happen. But it should happen in context of driver.
> > > > > But current abort is happening because of SMMU and no driver/pcie
> > > > > setup present at this moment.
> > > >
> > > > I don't understand what you mean by "in context of driver."  The whole
> > > > problem is that we can't control *when* the abort happens, so it may
> > > > happen in *any* context.  It may happen when a NIC receives a packet
> > > > or at some other unpredictable time.
> > > >
> > > > > Solution of this issue should be at 2 place
> > > > > a) SMMU level: I still believe, this patch has potential to overcome
> > > > > issue till finally driver's probe takeover.
> > > > > b) Device level: Even if something goes wrong. Driver/device should
> > > > > able to recover.
> > > > >
> > > > > > > Returned PCIe completion abort further leads to AER Errors from APEI
> > > > > > > Generic Hardware Error Source (GHES) with completion timeout.
> > > > > > > A network device hang is observed even after continuous
> > > > > > > reset/recovery from driver, Hence device is no more usable.
> > > > > >
> > > > > > The fact that the device is no longer usable is definitely a problem.
> > > > > > But in principle we *should* be able to recover from these errors.  If
> > > > > > we could recover and reliably use the device after the error, that
> > > > > > seems like it would be a more robust solution that having to add
> > > > > > special cases in every IOMMU driver.
> > > > > >
> > > > > > If you have details about this sort of error, I'd like to try to fix
> > > > > > it because we want to recover from that sort of error in normal
> > > > > > (non-crash) situations as well.
> > > > > >
> > > > > Completion abort case should be gracefully handled.  And device should
> > > > > always remain usable.
> > > > >
> > > > > There are 2 scenario which I am testing with Ethernet card PCIe-Intel
> > > > > 82576 Gigabit Network card.
> > > > >
> > > > > I)  Crash testing using kdump root file system: De-facto scenario
> > > > >     -  kdump file system does not have Ethernet driver
> > > > >     -  A lot of AER prints [1], making it impossible to work on shell
> > > > > of kdump root file system.
> > > >
> > > > In this case, I think report_error_detected() is deciding that because
> > > > the device has no driver, we can't do anything.  The flow is like
> > > > this:
> > > >
> > > >   aer_recover_work_func               # aer_recover_work
> > > >     kfifo_get(aer_recover_ring, entry)
> > > >     dev = pci_get_domain_bus_and_slot
> > > >     cper_print_aer(dev, ...)
> > > >       pci_err("AER: aer_status:")
> > > >       pci_err("AER:   [14] CmpltTO")
> > > >       pci_err("AER: aer_layer=")
> > > >     if (AER_NONFATAL)
> > > >       pcie_do_recovery(dev, pci_channel_io_normal)
> > > >         status = CAN_RECOVER
> > > >         pci_walk_bus(report_normal_detected)
> > > >           report_error_detected
> > > >             if (!dev->driver)
> > > >               vote = NO_AER_DRIVER
> > > >               pci_info("can't recover (no error_detected callback)")
> > > >             *result = merge_result(*, NO_AER_DRIVER)
> > > >             # always NO_AER_DRIVER
> > > >         status is now NO_AER_DRIVER
> > > >
> > > > So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(),
> > > > and status is not RECOVERED, so it skips .resume().
> > > >
> > > > I don't remember the history there, but if a device has no driver and
> > > > the device generates errors, it seems like we ought to be able to
> > > > reset it.
> > >
> > > But how to reset the device considering there is no driver.
> > > Hypothetically, this case should be taken care by PCIe subsystem to
> > > perform reset at PCIe level.
> >
> > I don't understand your question.  The PCI core (not the device
> > driver) already does the reset.  When pcie_do_recovery() calls
> > reset_link(), all devices on the other side of the link are reset.
> >
> > > > We should be able to field one (or a few) AER errors, reset the
> > > > device, and you should be able to use the shell in the kdump kernel.
> > > >
> > > here kdump shell is usable only problem is a "lot of AER Errors". One
> > > cannot see what they are typing.
> >
> > Right, that's what I expect.  If the PCI core resets the device, you
> > should get just a few AER errors, and they should stop after the
> > device is reset.
> >
> > > > >     -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.
> > > > >
> > > > > II) Crash testing using default root file system: Specific case to
> > > > > test Ethernet driver in second kernel
> > > > >    -  Default root file system have Ethernet driver
> > > > >    -  AER error comes even before the driver probe starts.
> > > > >    -  Driver does reset Ethernet card as part of probe but no success.
> > > > >    -  AER also tries to recover. but no success.  [2]
> > > > >    -  I also tries to remove AER errors by using "pci=noaer" bootargs
> > > > > and commenting ghes_handle_aer() from GHES driver..
> > > > >           than different set of errors come which also never able to recover [3]
> > > > >
> > >
> > > Please suggest your view on this case. Here driver is preset.
> > > (driver/net/ethernet/intel/igb/igb_main.c)
> > > In this case AER errors starts even before driver probe starts.
> > > After probe, driver does the device reset with no success and even AER
> > > recovery does not work.
> >
> > This case should be the same as the one above.  If we can change the
> > PCI core so it can reset the device when there's no driver,  that would
> > apply to case I (where there will never be a driver) and to case II
> > (where there is no driver now, but a driver will probe the device
> > later).
> 
> Does this means change are required in PCI core.

Yes, I am suggesting that the PCI core does not do the right thing
here.

> I tried following changes in pcie_do_recovery() but it did not help.
> Same error as before.
> 
> -- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
>         pci_info(dev, "broadcast resume message\n");
>         pci_walk_bus(bus, report_resume, &status);
> @@ -203,7 +207,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>         return status;
> 
>  failed:
>         pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
> +       pci_reset_function(dev);
> +       pci_aer_clear_device_status(dev);
> +       pci_aer_clear_nonfatal_status(dev);

Did you confirm that this resets the devices in question (0000:09:00.0
and 0000:09:00.1, I think), and what reset mechanism this uses (FLR,
PM, etc)?

Case I is using APEI, and it looks like that can queue up 16 errors
(AER_RECOVER_RING_SIZE), so that queue could be completely full before
we even get a chance to reset the device.  But I would think that the
reset should *eventually* stop the errors, even though we might log
30+ of them first.

As an experiment, you could reduce AER_RECOVER_RING_SIZE to 1 or 2 and
see if it reduces the logging.

> > > Problem mentioned in case I and II goes away if do pci_reset_function
> > > during enumeration phase of kdump kernel.
> > > can we thought of doing pci_reset_function for all devices in kdump
> > > kernel or device specific quirk.
> > >
> > > --pk
> > >
> > >
> > > > > As per my understanding, possible solutions are
> > > > >  - Copy SMMU table i.e. this patch
> > > > > OR
> > > > >  - Doing pci_reset_function() during enumeration phase.
> > > > > I also tried clearing "M" bit using pci_clear_master during
> > > > > enumeration but it did not help. Because driver re-set M bit causing
> > > > > same AER error again.
> > > > >
> > > > >
> > > > > -pk
> > > > >
> > > > > ---------------------------------------------------------------------------------------------------------------------------
> > > > > [1] with bootargs having pci=noaer
> > > > >
> > > > > [   22.494648] {4}[Hardware Error]: Hardware error from APEI Generic
> > > > > Hardware Error Source: 1
> > > > > [   22.512773] {4}[Hardware Error]: event severity: recoverable
> > > > > [   22.518419] {4}[Hardware Error]:  Error 0, type: recoverable
> > > > > [   22.544804] {4}[Hardware Error]:   section_type: PCIe error
> > > > > [   22.550363] {4}[Hardware Error]:   port_type: 0, PCIe end point
> > > > > [   22.556268] {4}[Hardware Error]:   version: 3.0
> > > > > [   22.560785] {4}[Hardware Error]:   command: 0x0507, status: 0x4010
> > > > > [   22.576852] {4}[Hardware Error]:   device_id: 0000:09:00.1
> > > > > [   22.582323] {4}[Hardware Error]:   slot: 0
> > > > > [   22.586406] {4}[Hardware Error]:   secondary_bus: 0x00
> > > > > [   22.591530] {4}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > > > [   22.608900] {4}[Hardware Error]:   class_code: 000002
> > > > > [   22.613938] {4}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > > > [   22.803534] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > > aer_mask: 0x00000000
> > > > > [   22.810838] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > > [   22.817613] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > > aer_agent=Requester ID
> > > > > [   22.847374] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > > [   22.866161] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED,
> > > > > total mem (8153768 kB)
> > > > > [   22.946178] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> > > > > [   22.995142] pci 0000:09:00.1: AER: can't recover (no error_detected callback)
> > > > > [   23.002300] pcieport 0000:00:09.0: AER: device recovery failed
> > > > > [   23.027607] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > > aer_mask: 0x00000000
> > > > > [   23.044109] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > > [   23.060713] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > > aer_agent=Requester ID
> > > > > [   23.068616] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > > [   23.122056] pci 0000:09:00.0: AER: can't recover (no error_detected callback)

<snip>

Prabhakar Kushwaha May 29, 2020, 2:18 p.m. UTC | #12

Hi Bjorn,

On Thu, May 28, 2020 at 1:48 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, May 27, 2020 at 05:14:39PM +0530, Prabhakar Kushwaha wrote:
> > On Fri, May 22, 2020 at 4:19 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Thu, May 21, 2020 at 09:28:20AM +0530, Prabhakar Kushwaha wrote:
> > > > On Wed, May 20, 2020 at 4:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
> > > > > > On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > > > > > An SMMU Stream table is created by the primary kernel. This table is
> > > > > > > > used by the SMMU to perform address translations for device-originated
> > > > > > > > transactions. Any crash (if happened) launches the kdump kernel which
> > > > > > > > re-creates the SMMU Stream table. New transactions will be translated
> > > > > > > > via this new table..
> > > > > > > >
> > > > > > > > There are scenarios, where devices are still having old pending
> > > > > > > > transactions (configured in the primary kernel). These transactions
> > > > > > > > come in-between Stream table creation and device-driver probe.
> > > > > > > > As new stream table does not have entry for older transactions,
> > > > > > > > it will be aborted by SMMU.
> > > > > > > >
> > > > > > > > Similar observations were found with PCIe-Intel 82576 Gigabit
> > > > > > > > Network card. It sends old Memory Read transaction in kdump kernel.
> > > > > > > > Transactions configured for older Stream table entries, that do not
> > > > > > > > exist any longer in the new table, will cause a PCIe Completion Abort.
> > > > > > >
> > > > > > > That sounds like exactly what we want, doesn't it?
> > > > > > >
> > > > > > > Or do you *want* DMA from the previous kernel to complete?  That will
> > > > > > > read or scribble on something, but maybe that's not terrible as long
> > > > > > > as it's not memory used by the kdump kernel.
> > > > > >
> > > > > > Yes, Abort should happen. But it should happen in context of driver.
> > > > > > But current abort is happening because of SMMU and no driver/pcie
> > > > > > setup present at this moment.
> > > > >
> > > > > I don't understand what you mean by "in context of driver."  The whole
> > > > > problem is that we can't control *when* the abort happens, so it may
> > > > > happen in *any* context.  It may happen when a NIC receives a packet
> > > > > or at some other unpredictable time.
> > > > >
> > > > > > Solution of this issue should be at 2 place
> > > > > > a) SMMU level: I still believe, this patch has potential to overcome
> > > > > > issue till finally driver's probe takeover.
> > > > > > b) Device level: Even if something goes wrong. Driver/device should
> > > > > > able to recover.
> > > > > >
> > > > > > > > Returned PCIe completion abort further leads to AER Errors from APEI
> > > > > > > > Generic Hardware Error Source (GHES) with completion timeout.
> > > > > > > > A network device hang is observed even after continuous
> > > > > > > > reset/recovery from driver, Hence device is no more usable.
> > > > > > >
> > > > > > > The fact that the device is no longer usable is definitely a problem.
> > > > > > > But in principle we *should* be able to recover from these errors.  If
> > > > > > > we could recover and reliably use the device after the error, that
> > > > > > > seems like it would be a more robust solution that having to add
> > > > > > > special cases in every IOMMU driver.
> > > > > > >
> > > > > > > If you have details about this sort of error, I'd like to try to fix
> > > > > > > it because we want to recover from that sort of error in normal
> > > > > > > (non-crash) situations as well.
> > > > > > >
> > > > > > Completion abort case should be gracefully handled.  And device should
> > > > > > always remain usable.
> > > > > >
> > > > > > There are 2 scenario which I am testing with Ethernet card PCIe-Intel
> > > > > > 82576 Gigabit Network card.
> > > > > >
> > > > > > I)  Crash testing using kdump root file system: De-facto scenario
> > > > > >     -  kdump file system does not have Ethernet driver
> > > > > >     -  A lot of AER prints [1], making it impossible to work on shell
> > > > > > of kdump root file system.
> > > > >
> > > > > In this case, I think report_error_detected() is deciding that because
> > > > > the device has no driver, we can't do anything.  The flow is like
> > > > > this:
> > > > >
> > > > >   aer_recover_work_func               # aer_recover_work
> > > > >     kfifo_get(aer_recover_ring, entry)
> > > > >     dev = pci_get_domain_bus_and_slot
> > > > >     cper_print_aer(dev, ...)
> > > > >       pci_err("AER: aer_status:")
> > > > >       pci_err("AER:   [14] CmpltTO")
> > > > >       pci_err("AER: aer_layer=")
> > > > >     if (AER_NONFATAL)
> > > > >       pcie_do_recovery(dev, pci_channel_io_normal)
> > > > >         status = CAN_RECOVER
> > > > >         pci_walk_bus(report_normal_detected)
> > > > >           report_error_detected
> > > > >             if (!dev->driver)
> > > > >               vote = NO_AER_DRIVER
> > > > >               pci_info("can't recover (no error_detected callback)")
> > > > >             *result = merge_result(*, NO_AER_DRIVER)
> > > > >             # always NO_AER_DRIVER
> > > > >         status is now NO_AER_DRIVER
> > > > >
> > > > > So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(),
> > > > > and status is not RECOVERED, so it skips .resume().
> > > > >
> > > > > I don't remember the history there, but if a device has no driver and
> > > > > the device generates errors, it seems like we ought to be able to
> > > > > reset it.
> > > >
> > > > But how to reset the device considering there is no driver.
> > > > Hypothetically, this case should be taken care by PCIe subsystem to
> > > > perform reset at PCIe level.
> > >
> > > I don't understand your question.  The PCI core (not the device
> > > driver) already does the reset.  When pcie_do_recovery() calls
> > > reset_link(), all devices on the other side of the link are reset.
> > >
> > > > > We should be able to field one (or a few) AER errors, reset the
> > > > > device, and you should be able to use the shell in the kdump kernel.
> > > > >
> > > > here kdump shell is usable only problem is a "lot of AER Errors". One
> > > > cannot see what they are typing.
> > >
> > > Right, that's what I expect.  If the PCI core resets the device, you
> > > should get just a few AER errors, and they should stop after the
> > > device is reset.
> > >
> > > > > >     -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.
> > > > > >
> > > > > > II) Crash testing using default root file system: Specific case to
> > > > > > test Ethernet driver in second kernel
> > > > > >    -  Default root file system have Ethernet driver
> > > > > >    -  AER error comes even before the driver probe starts.
> > > > > >    -  Driver does reset Ethernet card as part of probe but no success.
> > > > > >    -  AER also tries to recover. but no success.  [2]
> > > > > >    -  I also tries to remove AER errors by using "pci=noaer" bootargs
> > > > > > and commenting ghes_handle_aer() from GHES driver..
> > > > > >           than different set of errors come which also never able to recover [3]
> > > > > >
> > > >
> > > > Please suggest your view on this case. Here driver is preset.
> > > > (driver/net/ethernet/intel/igb/igb_main.c)
> > > > In this case AER errors starts even before driver probe starts.
> > > > After probe, driver does the device reset with no success and even AER
> > > > recovery does not work.
> > >
> > > This case should be the same as the one above.  If we can change the
> > > PCI core so it can reset the device when there's no driver,  that would
> > > apply to case I (where there will never be a driver) and to case II
> > > (where there is no driver now, but a driver will probe the device
> > > later).
> >
> > Does this means change are required in PCI core.
>
> Yes, I am suggesting that the PCI core does not do the right thing
> here.
>
> > I tried following changes in pcie_do_recovery() but it did not help.
> > Same error as before.
> >
> > -- a/drivers/pci/pcie/err.c
> > +++ b/drivers/pci/pcie/err.c
> >         pci_info(dev, "broadcast resume message\n");
> >         pci_walk_bus(bus, report_resume, &status);
> > @@ -203,7 +207,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> >         return status;
> >
> >  failed:
> >         pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
> > +       pci_reset_function(dev);
> > +       pci_aer_clear_device_status(dev);
> > +       pci_aer_clear_nonfatal_status(dev);
>
> Did you confirm that this resets the devices in question (0000:09:00.0
> and 0000:09:00.1, I think), and what reset mechanism this uses (FLR,
> PM, etc)?
>

Earlier reset  was happening with P2P bridge(0000:00:09.0) this the
reason no effect. After making following changes,  both devices are
now getting reset.
Both devices are using FLR.

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 117c0a2b2ba4..26b908f55aef 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -66,6 +66,20 @@ static int report_error_detected(struct pci_dev *dev,
                if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
                        vote = PCI_ERS_RESULT_NO_AER_DRIVER;
                        pci_info(dev, "can't recover (no
error_detected callback)\n");
+
+                       pci_save_state(dev);
+                       pci_cfg_access_lock(dev);
+
+                       /* Quiesce the device completely */
+                       pci_write_config_word(dev, PCI_COMMAND,
+                             PCI_COMMAND_INTX_DISABLE);
+                       if (!__pci_reset_function_locked(dev)) {
+                               vote = PCI_ERS_RESULT_RECOVERED;
+                               pci_info(dev, "recovered via pci level
reset\n");
+                       }
+
+                       pci_cfg_access_unlock(dev);
+                       pci_restore_state(dev);
                } else {
                        vote = PCI_ERS_RESULT_NONE;
                }

in order to take care of case 2 (driver comes after sometime) ==>
following code needs to be added to avoid crash during igb_probe.  It
looks to be a race condition between AER and igb_probe().

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
b/drivers/net/ethernet/intel/igb/igb_main.c
index b46bff8fe056..c48f0a54bb95 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -3012,6 +3012,11 @@ static int igb_probe(struct pci_dev *pdev,
const struct pci_device_id *ent)
        /* Catch broken hardware that put the wrong VF device ID in
         * the PCIe SR-IOV capability.
         */
+       if (pci_dev_trylock(pdev)) {
+               mdelay(1000);
+               pci_info(pdev,"device is locked, try waiting 1 sec\n");
+       }
+

Here are the observation with all above changes
A) AER errors are less but they are still there for both case 1 (No
driver at all) and case 2 (driver comes after some time)
B) Each AER error(NON_FATAL) causes both devices to reset. It happens many times
C) After that AER errors [1] comes is only for device 0000:09:00.0.
This is strange as this pci device is not being used during test.
Ping/ssh are happening with 0000:09:01.0
D) If wait for some more time. No more AER errors from any device
E) Ping is working fine in case 2.

09:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network
Connection (rev 01)
09:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network
Connection (rev 01)

# lspci -t -v

 \-[0000:00]-+-00.0  Cavium, Inc. CN99xx [ThunderX2] Integrated PCI Host bridge
             +-01.0-[01]--
             +-02.0-[02]--
             +-03.0-[03]--
             +-04.0-[04]--
             +-05.0-[05]--+-00.0  Broadcom Inc. and subsidiaries
BCM57840 NetXtreme II 10 Gigabit Ethernet
             |            \-00.1  Broadcom Inc. and subsidiaries
BCM57840 NetXtreme II 10 Gigabit Ethernet
             +-06.0-[06]--
             +-07.0-[07]--
             +-08.0-[08]--
             +-09.0-[09-0a]--+-00.0  Intel Corporation 82576 Gigabit
Network Connection
             |               \-00.1  Intel Corporation 82576 Gigabit
Network Connection


[1] AER error which comes for 09:00.0:

[   81.659825] {7}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 0
[   81.668080] {7}[Hardware Error]: It has been corrected by h/w and
requires no further action
[   81.676503] {7}[Hardware Error]: event severity: corrected
[   81.681975] {7}[Hardware Error]:  Error 0, type: corrected
[   81.687447] {7}[Hardware Error]:   section_type: PCIe error
[   81.693004] {7}[Hardware Error]:   port_type: 0, PCIe end point
[   81.698908] {7}[Hardware Error]:   version: 3.0
[   81.703424] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
[   81.709589] {7}[Hardware Error]:   device_id: 0000:09:00.0
[   81.715059] {7}[Hardware Error]:   slot: 0
[   81.719141] {7}[Hardware Error]:   secondary_bus: 0x00
[   81.724265] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   81.730864] {7}[Hardware Error]:   class_code: 000002
[   81.735901] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   81.742587] {7}[Hardware Error]:  Error 1, type: corrected
[   81.748058] {7}[Hardware Error]:   section_type: PCIe error
[   81.753615] {7}[Hardware Error]:   port_type: 4, root port
[   81.759086] {7}[Hardware Error]:   version: 3.0
[   81.763602] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
[   81.769767] {7}[Hardware Error]:   device_id: 0000:00:09.0
[   81.775237] {7}[Hardware Error]:   slot: 0
[   81.779319] {7}[Hardware Error]:   secondary_bus: 0x09
[   81.784442] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
[   81.791041] {7}[Hardware Error]:   class_code: 000406
[   81.796078] {7}[Hardware Error]:   bridge: secondary_status:
0x6000, control: 0x0002
[   81.803806] {7}[Hardware Error]:  Error 2, type: corrected
[   81.809276] {7}[Hardware Error]:   section_type: PCIe error
[   81.814834] {7}[Hardware Error]:   port_type: 0, PCIe end point
[   81.820738] {7}[Hardware Error]:   version: 3.0
[   81.825254] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
[   81.831419] {7}[Hardware Error]:   device_id: 0000:09:00.0
[   81.836889] {7}[Hardware Error]:   slot: 0
[   81.840971] {7}[Hardware Error]:   secondary_bus: 0x00
[   81.846094] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   81.852693] {7}[Hardware Error]:   class_code: 000002
[   81.857730] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   81.864416] {7}[Hardware Error]:  Error 3, type: corrected
[   81.869886] {7}[Hardware Error]:   section_type: PCIe error
[   81.875444] {7}[Hardware Error]:   port_type: 4, root port
[   81.880914] {7}[Hardware Error]:   version: 3.0
[   81.885430] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
[   81.891595] {7}[Hardware Error]:   device_id: 0000:00:09.0
[   81.897066] {7}[Hardware Error]:   slot: 0
[   81.901147] {7}[Hardware Error]:   secondary_bus: 0x09
[   81.906271] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
[   81.912870] {7}[Hardware Error]:   class_code: 000406
[   81.917906] {7}[Hardware Error]:   bridge: secondary_status:
0x6000, control: 0x0002
[   81.925634] {7}[Hardware Error]:  Error 4, type: corrected
[   81.931104] {7}[Hardware Error]:   section_type: PCIe error
[   81.936662] {7}[Hardware Error]:   port_type: 0, PCIe end point
[   81.942566] {7}[Hardware Error]:   version: 3.0
[   81.947082] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
[   81.953247] {7}[Hardware Error]:   device_id: 0000:09:00.0
[   81.958717] {7}[Hardware Error]:   slot: 0
[   81.962799] {7}[Hardware Error]:   secondary_bus: 0x00
[   81.967923] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   81.974522] {7}[Hardware Error]:   class_code: 000002
[   81.979558] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   81.986244] {7}[Hardware Error]:  Error 5, type: corrected
[   81.991715] {7}[Hardware Error]:   section_type: PCIe error
[   81.997272] {7}[Hardware Error]:   port_type: 4, root port
[   82.002743] {7}[Hardware Error]:   version: 3.0
[   82.007259] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
[   82.013424] {7}[Hardware Error]:   device_id: 0000:00:09.0
[   82.018894] {7}[Hardware Error]:   slot: 0
[   82.022976] {7}[Hardware Error]:   secondary_bus: 0x09
[   82.028099] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
[   82.034698] {7}[Hardware Error]:   class_code: 000406
[   82.039735] {7}[Hardware Error]:   bridge: secondary_status:
0x6000, control: 0x0002
[   82.047463] {7}[Hardware Error]:  Error 6, type: corrected
[   82.052933] {7}[Hardware Error]:   section_type: PCIe error
[   82.058491] {7}[Hardware Error]:   port_type: 0, PCIe end point
[   82.064395] {7}[Hardware Error]:   version: 3.0
[   82.068911] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
[   82.075076] {7}[Hardware Error]:   device_id: 0000:09:00.0
[   82.080547] {7}[Hardware Error]:   slot: 0
[   82.084628] {7}[Hardware Error]:   secondary_bus: 0x00
[   82.089752] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   82.096351] {7}[Hardware Error]:   class_code: 000002
[   82.101387] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   82.108073] {7}[Hardware Error]:  Error 7, type: corrected
[   82.113544] {7}[Hardware Error]:   section_type: PCIe error
[   82.119101] {7}[Hardware Error]:   port_type: 4, root port
[   82.124572] {7}[Hardware Error]:   version: 3.0
[   82.129087] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
[   82.135252] {7}[Hardware Error]:   device_id: 0000:00:09.0
[   82.140723] {7}[Hardware Error]:   slot: 0
[   82.144805] {7}[Hardware Error]:   secondary_bus: 0x09
[   82.149928] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
[   82.156527] {7}[Hardware Error]:   class_code: 000406
[   82.161564] {7}[Hardware Error]:   bridge: secondary_status:
0x6000, control: 0x0002
[   82.169291] {7}[Hardware Error]:  Error 8, type: corrected
[   82.174762] {7}[Hardware Error]:   section_type: PCIe error
[   82.180319] {7}[Hardware Error]:   port_type: 0, PCIe end point
[   82.186224] {7}[Hardware Error]:   version: 3.0
[   82.190739] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
[   82.196904] {7}[Hardware Error]:   device_id: 0000:09:00.0
[   82.202375] {7}[Hardware Error]:   slot: 0
[   82.206456] {7}[Hardware Error]:   secondary_bus: 0x00
[   82.211580] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   82.218179] {7}[Hardware Error]:   class_code: 000002
[   82.223216] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   82.229901] {7}[Hardware Error]:  Error 9, type: corrected
[   82.235372] {7}[Hardware Error]:   section_type: PCIe error
[   82.240929] {7}[Hardware Error]:   port_type: 4, root port
[   82.246400] {7}[Hardware Error]:   version: 3.0
[   82.250916] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
[   82.257081] {7}[Hardware Error]:   device_id: 0000:00:09.0
[   82.262551] {7}[Hardware Error]:   slot: 0
[   82.266633] {7}[Hardware Error]:   secondary_bus: 0x09
[   82.271756] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
[   82.278355] {7}[Hardware Error]:   class_code: 000406
[   82.283392] {7}[Hardware Error]:   bridge: secondary_status:
0x6000, control: 0x0002
[   82.291119] {7}[Hardware Error]:  Error 10, type: corrected
[   82.296676] {7}[Hardware Error]:   section_type: PCIe error
[   82.302234] {7}[Hardware Error]:   port_type: 0, PCIe end point
[   82.308138] {7}[Hardware Error]:   version: 3.0
[   82.312654] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
[   82.318819] {7}[Hardware Error]:   device_id: 0000:09:00.0
[   82.324290] {7}[Hardware Error]:   slot: 0
[   82.328371] {7}[Hardware Error]:   secondary_bus: 0x00
[   82.333495] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   82.340094] {7}[Hardware Error]:   class_code: 000002
[   82.345131] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   82.351816] {7}[Hardware Error]:  Error 11, type: corrected
[   82.357374] {7}[Hardware Error]:   section_type: PCIe error
[   82.362931] {7}[Hardware Error]:   port_type: 4, root port
[   82.368402] {7}[Hardware Error]:   version: 3.0
[   82.372917] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
[   82.379082] {7}[Hardware Error]:   device_id: 0000:00:09.0
[   82.384553] {7}[Hardware Error]:   slot: 0
[   82.388635] {7}[Hardware Error]:   secondary_bus: 0x09
[   82.393758] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
[   82.400357] {7}[Hardware Error]:   class_code: 000406
[   82.405394] {7}[Hardware Error]:   bridge: secondary_status:
0x6000, control: 0x0002
[   82.413121] {7}[Hardware Error]:  Error 12, type: corrected
[   82.418678] {7}[Hardware Error]:   section_type: PCIe error
[   82.424236] {7}[Hardware Error]:   port_type: 0, PCIe end point
[   82.430140] {7}[Hardware Error]:   version: 3.0
[   82.434656] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
[   82.440821] {7}[Hardware Error]:   device_id: 0000:09:00.0
[   82.446291] {7}[Hardware Error]:   slot: 0
[   82.450373] {7}[Hardware Error]:   secondary_bus: 0x00
[   82.455497] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   82.462096] {7}[Hardware Error]:   class_code: 000002
[   82.467132] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   82.473818] {7}[Hardware Error]:  Error 13, type: corrected
[   82.479375] {7}[Hardware Error]:   section_type: PCIe error
[   82.484933] {7}[Hardware Error]:   port_type: 4, root port
[   82.490403] {7}[Hardware Error]:   version: 3.0
[   82.494919] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
[   82.501084] {7}[Hardware Error]:   device_id: 0000:00:09.0
[   82.506555] {7}[Hardware Error]:   slot: 0
[   82.510636] {7}[Hardware Error]:   secondary_bus: 0x09
[   82.515760] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
[   82.522359] {7}[Hardware Error]:   class_code: 000406
[   82.527395] {7}[Hardware Error]:   bridge: secondary_status:
0x6000, control: 0x0002
[   82.535171] igb 0000:09:00.0: AER: aer_status: 0x00002000,
aer_mask: 0x00002000
[   82.542476] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
aer_agent=Receiver ID
[   82.550301] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
aer_mask: 0x00002000
[   82.558032] pcieport 0000:00:09.0: AER: aer_layer=Transaction
Layer, aer_agent=Receiver ID
[   82.566296] igb 0000:09:00.0: AER: aer_status: 0x00002000,
aer_mask: 0x00002000
[   82.573597] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
aer_agent=Receiver ID
[   82.581421] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
aer_mask: 0x00002000
[   82.589151] pcieport 0000:00:09.0: AER: aer_layer=Transaction
Layer, aer_agent=Receiver ID
[   82.597411] igb 0000:09:00.0: AER: aer_status: 0x00002000,
aer_mask: 0x00002000
[   82.604711] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
aer_agent=Receiver ID
[   82.612535] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
aer_mask: 0x00002000
[   82.620271] pcieport 0000:00:09.0: AER: aer_layer=Transaction
Layer, aer_agent=Receiver ID
[   82.628525] igb 0000:09:00.0: AER: aer_status: 0x00002000,
aer_mask: 0x00002000
[   82.635826] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
aer_agent=Receiver ID
[   82.643649] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
aer_mask: 0x00002000
[   82.651385] pcieport 0000:00:09.0: AER: aer_layer=Transaction
Layer, aer_agent=Receiver ID
[   82.659645] igb 0000:09:00.0: AER: aer_status: 0x00002000,
aer_mask: 0x00002000
[   82.666940] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
aer_agent=Receiver ID
[   82.674763] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
aer_mask: 0x00002000
[   82.682498] pcieport 0000:00:09.0: AER: aer_layer=Transaction
Layer, aer_agent=Receiver ID
[   82.690759] igb 0000:09:00.0: AER: aer_status: 0x00002000,
aer_mask: 0x00002000
[   82.698053] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
aer_agent=Receiver ID
[   82.705876] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
aer_mask: 0x00002000
[   82.713612] pcieport 0000:00:09.0: AER: aer_layer=Transaction
Layer, aer_agent=Receiver ID
[   82.721872] igb 0000:09:00.0: AER: aer_status: 0x00002000,
aer_mask: 0x00002000
[   82.729167] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
aer_agent=Receiver ID
[   82.736990] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
aer_mask: 0x00002000
[   82.744725] pcieport 0000:00:09.0: AER: aer_layer=Transaction
Layer, aer_agent=Receiver ID
[   88.059225] {8}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 0
[   88.067478] {8}[Hardware Error]: It has been corrected by h/w and
requires no further action
[   88.075899] {8}[Hardware Error]: event severity: corrected
[   88.081370] {8}[Hardware Error]:  Error 0, type: corrected
[   88.086841] {8}[Hardware Error]:   section_type: PCIe error
[   88.092399] {8}[Hardware Error]:   port_type: 0, PCIe end point
[   88.098303] {8}[Hardware Error]:   version: 3.0
[   88.102819] {8}[Hardware Error]:   command: 0x0507, status: 0x0010
[   88.108984] {8}[Hardware Error]:   device_id: 0000:09:00.0
[   88.114455] {8}[Hardware Error]:   slot: 0
[   88.118536] {8}[Hardware Error]:   secondary_bus: 0x00
[   88.123660] {8}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   88.130259] {8}[Hardware Error]:   class_code: 000002
[   88.135296] {8}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   88.141981] {8}[Hardware Error]:  Error 1, type: corrected
[   88.147452] {8}[Hardware Error]:   section_type: PCIe error
[   88.153009] {8}[Hardware Error]:   port_type: 4, root port
[   88.158480] {8}[Hardware Error]:   version: 3.0
[   88.162995] {8}[Hardware Error]:   command: 0x0106, status: 0x4010
[   88.169161] {8}[Hardware Error]:   device_id: 0000:00:09.0
[   88.174633] {8}[Hardware Error]:   slot: 0
[   88.180018] {8}[Hardware Error]:   secondary_bus: 0x09
[   88.185142] {8}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
[   88.191914] {8}[Hardware Error]:   class_code: 000406
[   88.196951] {8}[Hardware Error]:   bridge: secondary_status:
0x6000, control: 0x0002
[   88.204852] {8}[Hardware Error]:  Error 2, type: corrected
[   88.210323] {8}[Hardware Error]:   section_type: PCIe error
[   88.215881] {8}[Hardware Error]:   port_type: 0, PCIe end point
[   88.221786] {8}[Hardware Error]:   version: 3.0
[   88.226301] {8}[Hardware Error]:   command: 0x0507, status: 0x0010
[   88.232466] {8}[Hardware Error]:   device_id: 0000:09:00.0
[   88.237937] {8}[Hardware Error]:   slot: 0
[   88.242019] {8}[Hardware Error]:   secondary_bus: 0x00
[   88.247142] {8}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
[   88.253741] {8}[Hardware Error]:   class_code: 000002
[   88.258778] {8}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
[   88.265509] igb 0000:09:00.0: AER: aer_status: 0x00002000,
aer_mask: 0x00002000
[   88.272812] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
aer_agent=Receiver ID
[   88.280635] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
aer_mask: 0x00002000
[   88.288363] pcieport 0000:00:09.0: AER: aer_layer=Transaction
Layer, aer_agent=Receiver ID
[   88.296622] igb 0000:09:00.0: AER: aer_status: 0x00002000,
aer_mask: 0x00002000
[   88.305391] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
aer_agent=Receiver ID

> Case I is using APEI, and it looks like that can queue up 16 errors
> (AER_RECOVER_RING_SIZE), so that queue could be completely full before
> we even get a chance to reset the device.  But I would think that the
> reset should *eventually* stop the errors, even though we might log
> 30+ of them first.
>
> As an experiment, you could reduce AER_RECOVER_RING_SIZE to 1 or 2 and
> see if it reduces the logging.

Did not tried this experiment. I believe it is not required now

--pk

>
> > > > Problem mentioned in case I and II goes away if do pci_reset_function
> > > > during enumeration phase of kdump kernel.
> > > > can we thought of doing pci_reset_function for all devices in kdump
> > > > kernel or device specific quirk.
> > > >
> > > > --pk
> > > >
> > > >
> > > > > > As per my understanding, possible solutions are
> > > > > >  - Copy SMMU table i.e. this patch
> > > > > > OR
> > > > > >  - Doing pci_reset_function() during enumeration phase.
> > > > > > I also tried clearing "M" bit using pci_clear_master during
> > > > > > enumeration but it did not help. Because driver re-set M bit causing
> > > > > > same AER error again.
> > > > > >
> > > > > >
> > > > > > -pk
> > > > > >
> > > > > > ---------------------------------------------------------------------------------------------------------------------------
> > > > > > [1] with bootargs having pci=noaer
> > > > > >
> > > > > > [   22.494648] {4}[Hardware Error]: Hardware error from APEI Generic
> > > > > > Hardware Error Source: 1
> > > > > > [   22.512773] {4}[Hardware Error]: event severity: recoverable
> > > > > > [   22.518419] {4}[Hardware Error]:  Error 0, type: recoverable
> > > > > > [   22.544804] {4}[Hardware Error]:   section_type: PCIe error
> > > > > > [   22.550363] {4}[Hardware Error]:   port_type: 0, PCIe end point
> > > > > > [   22.556268] {4}[Hardware Error]:   version: 3.0
> > > > > > [   22.560785] {4}[Hardware Error]:   command: 0x0507, status: 0x4010
> > > > > > [   22.576852] {4}[Hardware Error]:   device_id: 0000:09:00.1
> > > > > > [   22.582323] {4}[Hardware Error]:   slot: 0
> > > > > > [   22.586406] {4}[Hardware Error]:   secondary_bus: 0x00
> > > > > > [   22.591530] {4}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > > > > [   22.608900] {4}[Hardware Error]:   class_code: 000002
> > > > > > [   22.613938] {4}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > > > > [   22.803534] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > > > aer_mask: 0x00000000
> > > > > > [   22.810838] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > > > [   22.817613] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > > > aer_agent=Requester ID
> > > > > > [   22.847374] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > > > [   22.866161] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED,
> > > > > > total mem (8153768 kB)
> > > > > > [   22.946178] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> > > > > > [   22.995142] pci 0000:09:00.1: AER: can't recover (no error_detected callback)
> > > > > > [   23.002300] pcieport 0000:00:09.0: AER: device recovery failed
> > > > > > [   23.027607] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > > > aer_mask: 0x00000000
> > > > > > [   23.044109] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > > > [   23.060713] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > > > aer_agent=Requester ID
> > > > > > [   23.068616] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > > > [   23.122056] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
>
> <snip>

Bjorn Helgaas May 29, 2020, 7:33 p.m. UTC | #13

On Fri, May 29, 2020 at 07:48:10PM +0530, Prabhakar Kushwaha wrote:
> On Thu, May 28, 2020 at 1:48 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > On Wed, May 27, 2020 at 05:14:39PM +0530, Prabhakar Kushwaha wrote:
> > > On Fri, May 22, 2020 at 4:19 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Thu, May 21, 2020 at 09:28:20AM +0530, Prabhakar Kushwaha wrote:
> > > > > On Wed, May 20, 2020 at 4:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
> > > > > > > On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > > > > > > An SMMU Stream table is created by the primary kernel. This table is
> > > > > > > > > used by the SMMU to perform address translations for device-originated
> > > > > > > > > transactions. Any crash (if happened) launches the kdump kernel which
> > > > > > > > > re-creates the SMMU Stream table. New transactions will be translated
> > > > > > > > > via this new table..
> > > > > > > > >
> > > > > > > > > There are scenarios, where devices are still having old pending
> > > > > > > > > transactions (configured in the primary kernel). These transactions
> > > > > > > > > come in-between Stream table creation and device-driver probe.
> > > > > > > > > As new stream table does not have entry for older transactions,
> > > > > > > > > it will be aborted by SMMU.
> > > > > > > > >
> > > > > > > > > Similar observations were found with PCIe-Intel 82576 Gigabit
> > > > > > > > > Network card. It sends old Memory Read transaction in kdump kernel.
> > > > > > > > > Transactions configured for older Stream table entries, that do not
> > > > > > > > > exist any longer in the new table, will cause a PCIe Completion Abort.
> > > > > > > >
> > > > > > > > That sounds like exactly what we want, doesn't it?
> > > > > > > >
> > > > > > > > Or do you *want* DMA from the previous kernel to complete?  That will
> > > > > > > > read or scribble on something, but maybe that's not terrible as long
> > > > > > > > as it's not memory used by the kdump kernel.
> > > > > > >
> > > > > > > Yes, Abort should happen. But it should happen in context of driver.
> > > > > > > But current abort is happening because of SMMU and no driver/pcie
> > > > > > > setup present at this moment.
> > > > > >
> > > > > > I don't understand what you mean by "in context of driver."  The whole
> > > > > > problem is that we can't control *when* the abort happens, so it may
> > > > > > happen in *any* context.  It may happen when a NIC receives a packet
> > > > > > or at some other unpredictable time.
> > > > > >
> > > > > > > Solution of this issue should be at 2 place
> > > > > > > a) SMMU level: I still believe, this patch has potential to overcome
> > > > > > > issue till finally driver's probe takeover.
> > > > > > > b) Device level: Even if something goes wrong. Driver/device should
> > > > > > > able to recover.
> > > > > > >
> > > > > > > > > Returned PCIe completion abort further leads to AER Errors from APEI
> > > > > > > > > Generic Hardware Error Source (GHES) with completion timeout.
> > > > > > > > > A network device hang is observed even after continuous
> > > > > > > > > reset/recovery from driver, Hence device is no more usable.
> > > > > > > >
> > > > > > > > The fact that the device is no longer usable is definitely a problem.
> > > > > > > > But in principle we *should* be able to recover from these errors.  If
> > > > > > > > we could recover and reliably use the device after the error, that
> > > > > > > > seems like it would be a more robust solution that having to add
> > > > > > > > special cases in every IOMMU driver.
> > > > > > > >
> > > > > > > > If you have details about this sort of error, I'd like to try to fix
> > > > > > > > it because we want to recover from that sort of error in normal
> > > > > > > > (non-crash) situations as well.
> > > > > > > >
> > > > > > > Completion abort case should be gracefully handled.  And device should
> > > > > > > always remain usable.
> > > > > > >
> > > > > > > There are 2 scenario which I am testing with Ethernet card PCIe-Intel
> > > > > > > 82576 Gigabit Network card.
> > > > > > >
> > > > > > > I)  Crash testing using kdump root file system: De-facto scenario
> > > > > > >     -  kdump file system does not have Ethernet driver
> > > > > > >     -  A lot of AER prints [1], making it impossible to work on shell
> > > > > > > of kdump root file system.
> > > > > >
> > > > > > In this case, I think report_error_detected() is deciding that because
> > > > > > the device has no driver, we can't do anything.  The flow is like
> > > > > > this:
> > > > > >
> > > > > >   aer_recover_work_func               # aer_recover_work
> > > > > >     kfifo_get(aer_recover_ring, entry)
> > > > > >     dev = pci_get_domain_bus_and_slot
> > > > > >     cper_print_aer(dev, ...)
> > > > > >       pci_err("AER: aer_status:")
> > > > > >       pci_err("AER:   [14] CmpltTO")
> > > > > >       pci_err("AER: aer_layer=")
> > > > > >     if (AER_NONFATAL)
> > > > > >       pcie_do_recovery(dev, pci_channel_io_normal)
> > > > > >         status = CAN_RECOVER
> > > > > >         pci_walk_bus(report_normal_detected)
> > > > > >           report_error_detected
> > > > > >             if (!dev->driver)
> > > > > >               vote = NO_AER_DRIVER
> > > > > >               pci_info("can't recover (no error_detected callback)")
> > > > > >             *result = merge_result(*, NO_AER_DRIVER)
> > > > > >             # always NO_AER_DRIVER
> > > > > >         status is now NO_AER_DRIVER
> > > > > >
> > > > > > So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(),
> > > > > > and status is not RECOVERED, so it skips .resume().
> > > > > >
> > > > > > I don't remember the history there, but if a device has no driver and
> > > > > > the device generates errors, it seems like we ought to be able to
> > > > > > reset it.
> > > > >
> > > > > But how to reset the device considering there is no driver.
> > > > > Hypothetically, this case should be taken care by PCIe subsystem to
> > > > > perform reset at PCIe level.
> > > >
> > > > I don't understand your question.  The PCI core (not the device
> > > > driver) already does the reset.  When pcie_do_recovery() calls
> > > > reset_link(), all devices on the other side of the link are reset.
> > > >
> > > > > > We should be able to field one (or a few) AER errors, reset the
> > > > > > device, and you should be able to use the shell in the kdump kernel.
> > > > > >
> > > > > here kdump shell is usable only problem is a "lot of AER Errors". One
> > > > > cannot see what they are typing.
> > > >
> > > > Right, that's what I expect.  If the PCI core resets the device, you
> > > > should get just a few AER errors, and they should stop after the
> > > > device is reset.
> > > >
> > > > > > >     -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.
> > > > > > >
> > > > > > > II) Crash testing using default root file system: Specific case to
> > > > > > > test Ethernet driver in second kernel
> > > > > > >    -  Default root file system have Ethernet driver
> > > > > > >    -  AER error comes even before the driver probe starts.
> > > > > > >    -  Driver does reset Ethernet card as part of probe but no success.
> > > > > > >    -  AER also tries to recover. but no success.  [2]
> > > > > > >    -  I also tries to remove AER errors by using "pci=noaer" bootargs
> > > > > > > and commenting ghes_handle_aer() from GHES driver..
> > > > > > >           than different set of errors come which also never able to recover [3]
> > > > > > >
> > > > >
> > > > > Please suggest your view on this case. Here driver is preset.
> > > > > (driver/net/ethernet/intel/igb/igb_main.c)
> > > > > In this case AER errors starts even before driver probe starts.
> > > > > After probe, driver does the device reset with no success and even AER
> > > > > recovery does not work.
> > > >
> > > > This case should be the same as the one above.  If we can change the
> > > > PCI core so it can reset the device when there's no driver,  that would
> > > > apply to case I (where there will never be a driver) and to case II
> > > > (where there is no driver now, but a driver will probe the device
> > > > later).
> > >
> > > Does this means change are required in PCI core.
> >
> > Yes, I am suggesting that the PCI core does not do the right thing
> > here.
> >
> > > I tried following changes in pcie_do_recovery() but it did not help.
> > > Same error as before.
> > >
> > > -- a/drivers/pci/pcie/err.c
> > > +++ b/drivers/pci/pcie/err.c
> > >         pci_info(dev, "broadcast resume message\n");
> > >         pci_walk_bus(bus, report_resume, &status);
> > > @@ -203,7 +207,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> > >         return status;
> > >
> > >  failed:
> > >         pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
> > > +       pci_reset_function(dev);
> > > +       pci_aer_clear_device_status(dev);
> > > +       pci_aer_clear_nonfatal_status(dev);
> >
> > Did you confirm that this resets the devices in question (0000:09:00.0
> > and 0000:09:00.1, I think), and what reset mechanism this uses (FLR,
> > PM, etc)?
> 
> Earlier reset  was happening with P2P bridge(0000:00:09.0) this the
> reason no effect. After making following changes,  both devices are
> now getting reset.
> Both devices are using FLR.
> 
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 117c0a2b2ba4..26b908f55aef 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -66,6 +66,20 @@ static int report_error_detected(struct pci_dev *dev,
>                 if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
>                         vote = PCI_ERS_RESULT_NO_AER_DRIVER;
>                         pci_info(dev, "can't recover (no
> error_detected callback)\n");
> +
> +                       pci_save_state(dev);
> +                       pci_cfg_access_lock(dev);
> +
> +                       /* Quiesce the device completely */
> +                       pci_write_config_word(dev, PCI_COMMAND,
> +                             PCI_COMMAND_INTX_DISABLE);
> +                       if (!__pci_reset_function_locked(dev)) {
> +                               vote = PCI_ERS_RESULT_RECOVERED;
> +                               pci_info(dev, "recovered via pci level
> reset\n");
> +                       }

Why do we need to save the state and quiesce the device?  The reset
should disable interrupts anyway.  In this particular case where
there's no driver, I don't think we should have to restore the state.
We maybe should *remove* the device and re-enumerate it after the
reset, but the state from before the reset should be irrelevant.

> +                       pci_cfg_access_unlock(dev);
> +                       pci_restore_state(dev);
>                 } else {
>                         vote = PCI_ERS_RESULT_NONE;
>                 }
> 
> in order to take care of case 2 (driver comes after sometime) ==>
> following code needs to be added to avoid crash during igb_probe.  It
> looks to be a race condition between AER and igb_probe().
> 
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
> b/drivers/net/ethernet/intel/igb/igb_main.c
> index b46bff8fe056..c48f0a54bb95 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -3012,6 +3012,11 @@ static int igb_probe(struct pci_dev *pdev,
> const struct pci_device_id *ent)
>         /* Catch broken hardware that put the wrong VF device ID in
>          * the PCIe SR-IOV capability.
>          */
> +       if (pci_dev_trylock(pdev)) {
> +               mdelay(1000);
> +               pci_info(pdev,"device is locked, try waiting 1 sec\n");
> +       }

This is interesting to learn about the AER/driver interaction, but of
course, we wouldn't want to add code like this permanently.

> Here are the observation with all above changes
> A) AER errors are less but they are still there for both case 1 (No
> driver at all) and case 2 (driver comes after some time)

We'll certainly get *some* AER errors.  We have to get one before we
know to reset the device.

> B) Each AER error(NON_FATAL) causes both devices to reset. It happens many times

I'm not sure why we reset both devices.  Are we seeing errors from
both, or could we be more selective in the code?

> C) After that AER errors [1] comes is only for device 0000:09:00.0.
> This is strange as this pci device is not being used during test.
> Ping/ssh are happening with 0000:09:01.0
> D) If wait for some more time. No more AER errors from any device
> E) Ping is working fine in case 2.
> 
> 09:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network
> Connection (rev 01)
> 09:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network
> Connection (rev 01)
> 
> # lspci -t -v
> 
>  \-[0000:00]-+-00.0  Cavium, Inc. CN99xx [ThunderX2] Integrated PCI Host bridge
>              +-01.0-[01]--
>              +-02.0-[02]--
>              +-03.0-[03]--
>              +-04.0-[04]--
>              +-05.0-[05]--+-00.0  Broadcom Inc. and subsidiaries
> BCM57840 NetXtreme II 10 Gigabit Ethernet
>              |            \-00.1  Broadcom Inc. and subsidiaries
> BCM57840 NetXtreme II 10 Gigabit Ethernet
>              +-06.0-[06]--
>              +-07.0-[07]--
>              +-08.0-[08]--
>              +-09.0-[09-0a]--+-00.0  Intel Corporation 82576 Gigabit
> Network Connection
>              |               \-00.1  Intel Corporation 82576 Gigabit
> Network Connection
> 
> 
> [1] AER error which comes for 09:00.0:
> 
> [   81.659825] {7}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 0
> [   81.668080] {7}[Hardware Error]: It has been corrected by h/w and
> requires no further action
> [   81.676503] {7}[Hardware Error]: event severity: corrected
> [   81.681975] {7}[Hardware Error]:  Error 0, type: corrected
> [   81.687447] {7}[Hardware Error]:   section_type: PCIe error
> [   81.693004] {7}[Hardware Error]:   port_type: 0, PCIe end point
> [   81.698908] {7}[Hardware Error]:   version: 3.0
> [   81.703424] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> [   81.709589] {7}[Hardware Error]:   device_id: 0000:09:00.0
> [   81.715059] {7}[Hardware Error]:   slot: 0
> [   81.719141] {7}[Hardware Error]:   secondary_bus: 0x00
> [   81.724265] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> [   81.730864] {7}[Hardware Error]:   class_code: 000002
> [   81.735901] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> [   81.742587] {7}[Hardware Error]:  Error 1, type: corrected
> [   81.748058] {7}[Hardware Error]:   section_type: PCIe error
> [   81.753615] {7}[Hardware Error]:   port_type: 4, root port
> [   81.759086] {7}[Hardware Error]:   version: 3.0
> [   81.763602] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> [   81.769767] {7}[Hardware Error]:   device_id: 0000:00:09.0
> [   81.775237] {7}[Hardware Error]:   slot: 0
> [   81.779319] {7}[Hardware Error]:   secondary_bus: 0x09
> [   81.784442] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> [   81.791041] {7}[Hardware Error]:   class_code: 000406
> [   81.796078] {7}[Hardware Error]:   bridge: secondary_status:
> 0x6000, control: 0x0002
> [   81.803806] {7}[Hardware Error]:  Error 2, type: corrected
> [   81.809276] {7}[Hardware Error]:   section_type: PCIe error
> [   81.814834] {7}[Hardware Error]:   port_type: 0, PCIe end point
> [   81.820738] {7}[Hardware Error]:   version: 3.0
> [   81.825254] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> [   81.831419] {7}[Hardware Error]:   device_id: 0000:09:00.0
> [   81.836889] {7}[Hardware Error]:   slot: 0
> [   81.840971] {7}[Hardware Error]:   secondary_bus: 0x00
> [   81.846094] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> [   81.852693] {7}[Hardware Error]:   class_code: 000002
> [   81.857730] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> [   81.864416] {7}[Hardware Error]:  Error 3, type: corrected
> [   81.869886] {7}[Hardware Error]:   section_type: PCIe error
> [   81.875444] {7}[Hardware Error]:   port_type: 4, root port
> [   81.880914] {7}[Hardware Error]:   version: 3.0
> [   81.885430] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> [   81.891595] {7}[Hardware Error]:   device_id: 0000:00:09.0
> [   81.897066] {7}[Hardware Error]:   slot: 0
> [   81.901147] {7}[Hardware Error]:   secondary_bus: 0x09
> [   81.906271] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> [   81.912870] {7}[Hardware Error]:   class_code: 000406
> [   81.917906] {7}[Hardware Error]:   bridge: secondary_status:
> 0x6000, control: 0x0002
> [   81.925634] {7}[Hardware Error]:  Error 4, type: corrected
> [   81.931104] {7}[Hardware Error]:   section_type: PCIe error
> [   81.936662] {7}[Hardware Error]:   port_type: 0, PCIe end point
> [   81.942566] {7}[Hardware Error]:   version: 3.0
> [   81.947082] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> [   81.953247] {7}[Hardware Error]:   device_id: 0000:09:00.0
> [   81.958717] {7}[Hardware Error]:   slot: 0
> [   81.962799] {7}[Hardware Error]:   secondary_bus: 0x00
> [   81.967923] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> [   81.974522] {7}[Hardware Error]:   class_code: 000002
> [   81.979558] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> [   81.986244] {7}[Hardware Error]:  Error 5, type: corrected
> [   81.991715] {7}[Hardware Error]:   section_type: PCIe error
> [   81.997272] {7}[Hardware Error]:   port_type: 4, root port
> [   82.002743] {7}[Hardware Error]:   version: 3.0
> [   82.007259] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> [   82.013424] {7}[Hardware Error]:   device_id: 0000:00:09.0
> [   82.018894] {7}[Hardware Error]:   slot: 0
> [   82.022976] {7}[Hardware Error]:   secondary_bus: 0x09
> [   82.028099] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> [   82.034698] {7}[Hardware Error]:   class_code: 000406
> [   82.039735] {7}[Hardware Error]:   bridge: secondary_status:
> 0x6000, control: 0x0002
> [   82.047463] {7}[Hardware Error]:  Error 6, type: corrected
> [   82.052933] {7}[Hardware Error]:   section_type: PCIe error
> [   82.058491] {7}[Hardware Error]:   port_type: 0, PCIe end point
> [   82.064395] {7}[Hardware Error]:   version: 3.0
> [   82.068911] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> [   82.075076] {7}[Hardware Error]:   device_id: 0000:09:00.0
> [   82.080547] {7}[Hardware Error]:   slot: 0
> [   82.084628] {7}[Hardware Error]:   secondary_bus: 0x00
> [   82.089752] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> [   82.096351] {7}[Hardware Error]:   class_code: 000002
> [   82.101387] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> [   82.108073] {7}[Hardware Error]:  Error 7, type: corrected
> [   82.113544] {7}[Hardware Error]:   section_type: PCIe error
> [   82.119101] {7}[Hardware Error]:   port_type: 4, root port
> [   82.124572] {7}[Hardware Error]:   version: 3.0
> [   82.129087] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> [   82.135252] {7}[Hardware Error]:   device_id: 0000:00:09.0
> [   82.140723] {7}[Hardware Error]:   slot: 0
> [   82.144805] {7}[Hardware Error]:   secondary_bus: 0x09
> [   82.149928] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> [   82.156527] {7}[Hardware Error]:   class_code: 000406
> [   82.161564] {7}[Hardware Error]:   bridge: secondary_status:
> 0x6000, control: 0x0002
> [   82.169291] {7}[Hardware Error]:  Error 8, type: corrected
> [   82.174762] {7}[Hardware Error]:   section_type: PCIe error
> [   82.180319] {7}[Hardware Error]:   port_type: 0, PCIe end point
> [   82.186224] {7}[Hardware Error]:   version: 3.0
> [   82.190739] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> [   82.196904] {7}[Hardware Error]:   device_id: 0000:09:00.0
> [   82.202375] {7}[Hardware Error]:   slot: 0
> [   82.206456] {7}[Hardware Error]:   secondary_bus: 0x00
> [   82.211580] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> [   82.218179] {7}[Hardware Error]:   class_code: 000002
> [   82.223216] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> [   82.229901] {7}[Hardware Error]:  Error 9, type: corrected
> [   82.235372] {7}[Hardware Error]:   section_type: PCIe error
> [   82.240929] {7}[Hardware Error]:   port_type: 4, root port
> [   82.246400] {7}[Hardware Error]:   version: 3.0
> [   82.250916] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> [   82.257081] {7}[Hardware Error]:   device_id: 0000:00:09.0
> [   82.262551] {7}[Hardware Error]:   slot: 0
> [   82.266633] {7}[Hardware Error]:   secondary_bus: 0x09
> [   82.271756] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> [   82.278355] {7}[Hardware Error]:   class_code: 000406
> [   82.283392] {7}[Hardware Error]:   bridge: secondary_status:
> 0x6000, control: 0x0002
> [   82.291119] {7}[Hardware Error]:  Error 10, type: corrected
> [   82.296676] {7}[Hardware Error]:   section_type: PCIe error
> [   82.302234] {7}[Hardware Error]:   port_type: 0, PCIe end point
> [   82.308138] {7}[Hardware Error]:   version: 3.0
> [   82.312654] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> [   82.318819] {7}[Hardware Error]:   device_id: 0000:09:00.0
> [   82.324290] {7}[Hardware Error]:   slot: 0
> [   82.328371] {7}[Hardware Error]:   secondary_bus: 0x00
> [   82.333495] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> [   82.340094] {7}[Hardware Error]:   class_code: 000002
> [   82.345131] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> [   82.351816] {7}[Hardware Error]:  Error 11, type: corrected
> [   82.357374] {7}[Hardware Error]:   section_type: PCIe error
> [   82.362931] {7}[Hardware Error]:   port_type: 4, root port
> [   82.368402] {7}[Hardware Error]:   version: 3.0
> [   82.372917] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> [   82.379082] {7}[Hardware Error]:   device_id: 0000:00:09.0
> [   82.384553] {7}[Hardware Error]:   slot: 0
> [   82.388635] {7}[Hardware Error]:   secondary_bus: 0x09
> [   82.393758] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> [   82.400357] {7}[Hardware Error]:   class_code: 000406
> [   82.405394] {7}[Hardware Error]:   bridge: secondary_status:
> 0x6000, control: 0x0002
> [   82.413121] {7}[Hardware Error]:  Error 12, type: corrected
> [   82.418678] {7}[Hardware Error]:   section_type: PCIe error
> [   82.424236] {7}[Hardware Error]:   port_type: 0, PCIe end point
> [   82.430140] {7}[Hardware Error]:   version: 3.0
> [   82.434656] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> [   82.440821] {7}[Hardware Error]:   device_id: 0000:09:00.0
> [   82.446291] {7}[Hardware Error]:   slot: 0
> [   82.450373] {7}[Hardware Error]:   secondary_bus: 0x00
> [   82.455497] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> [   82.462096] {7}[Hardware Error]:   class_code: 000002
> [   82.467132] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> [   82.473818] {7}[Hardware Error]:  Error 13, type: corrected
> [   82.479375] {7}[Hardware Error]:   section_type: PCIe error
> [   82.484933] {7}[Hardware Error]:   port_type: 4, root port
> [   82.490403] {7}[Hardware Error]:   version: 3.0
> [   82.494919] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> [   82.501084] {7}[Hardware Error]:   device_id: 0000:00:09.0
> [   82.506555] {7}[Hardware Error]:   slot: 0
> [   82.510636] {7}[Hardware Error]:   secondary_bus: 0x09
> [   82.515760] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> [   82.522359] {7}[Hardware Error]:   class_code: 000406
> [   82.527395] {7}[Hardware Error]:   bridge: secondary_status:
> 0x6000, control: 0x0002
> [   82.535171] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> aer_mask: 0x00002000
> [   82.542476] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> aer_agent=Receiver ID
> [   82.550301] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> aer_mask: 0x00002000
> [   82.558032] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> Layer, aer_agent=Receiver ID
> [   82.566296] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> aer_mask: 0x00002000
> [   82.573597] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> aer_agent=Receiver ID
> [   82.581421] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> aer_mask: 0x00002000
> [   82.589151] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> Layer, aer_agent=Receiver ID
> [   82.597411] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> aer_mask: 0x00002000
> [   82.604711] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> aer_agent=Receiver ID
> [   82.612535] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> aer_mask: 0x00002000
> [   82.620271] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> Layer, aer_agent=Receiver ID
> [   82.628525] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> aer_mask: 0x00002000
> [   82.635826] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> aer_agent=Receiver ID
> [   82.643649] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> aer_mask: 0x00002000
> [   82.651385] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> Layer, aer_agent=Receiver ID
> [   82.659645] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> aer_mask: 0x00002000
> [   82.666940] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> aer_agent=Receiver ID
> [   82.674763] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> aer_mask: 0x00002000
> [   82.682498] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> Layer, aer_agent=Receiver ID
> [   82.690759] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> aer_mask: 0x00002000
> [   82.698053] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> aer_agent=Receiver ID
> [   82.705876] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> aer_mask: 0x00002000
> [   82.713612] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> Layer, aer_agent=Receiver ID
> [   82.721872] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> aer_mask: 0x00002000
> [   82.729167] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> aer_agent=Receiver ID
> [   82.736990] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> aer_mask: 0x00002000
> [   82.744725] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> Layer, aer_agent=Receiver ID
> [   88.059225] {8}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 0
> [   88.067478] {8}[Hardware Error]: It has been corrected by h/w and
> requires no further action
> [   88.075899] {8}[Hardware Error]: event severity: corrected
> [   88.081370] {8}[Hardware Error]:  Error 0, type: corrected
> [   88.086841] {8}[Hardware Error]:   section_type: PCIe error
> [   88.092399] {8}[Hardware Error]:   port_type: 0, PCIe end point
> [   88.098303] {8}[Hardware Error]:   version: 3.0
> [   88.102819] {8}[Hardware Error]:   command: 0x0507, status: 0x0010
> [   88.108984] {8}[Hardware Error]:   device_id: 0000:09:00.0
> [   88.114455] {8}[Hardware Error]:   slot: 0
> [   88.118536] {8}[Hardware Error]:   secondary_bus: 0x00
> [   88.123660] {8}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> [   88.130259] {8}[Hardware Error]:   class_code: 000002
> [   88.135296] {8}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> [   88.141981] {8}[Hardware Error]:  Error 1, type: corrected
> [   88.147452] {8}[Hardware Error]:   section_type: PCIe error
> [   88.153009] {8}[Hardware Error]:   port_type: 4, root port
> [   88.158480] {8}[Hardware Error]:   version: 3.0
> [   88.162995] {8}[Hardware Error]:   command: 0x0106, status: 0x4010
> [   88.169161] {8}[Hardware Error]:   device_id: 0000:00:09.0
> [   88.174633] {8}[Hardware Error]:   slot: 0
> [   88.180018] {8}[Hardware Error]:   secondary_bus: 0x09
> [   88.185142] {8}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> [   88.191914] {8}[Hardware Error]:   class_code: 000406
> [   88.196951] {8}[Hardware Error]:   bridge: secondary_status:
> 0x6000, control: 0x0002
> [   88.204852] {8}[Hardware Error]:  Error 2, type: corrected
> [   88.210323] {8}[Hardware Error]:   section_type: PCIe error
> [   88.215881] {8}[Hardware Error]:   port_type: 0, PCIe end point
> [   88.221786] {8}[Hardware Error]:   version: 3.0
> [   88.226301] {8}[Hardware Error]:   command: 0x0507, status: 0x0010
> [   88.232466] {8}[Hardware Error]:   device_id: 0000:09:00.0
> [   88.237937] {8}[Hardware Error]:   slot: 0
> [   88.242019] {8}[Hardware Error]:   secondary_bus: 0x00
> [   88.247142] {8}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> [   88.253741] {8}[Hardware Error]:   class_code: 000002
> [   88.258778] {8}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> [   88.265509] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> aer_mask: 0x00002000
> [   88.272812] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> aer_agent=Receiver ID
> [   88.280635] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> aer_mask: 0x00002000
> [   88.288363] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> Layer, aer_agent=Receiver ID
> [   88.296622] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> aer_mask: 0x00002000
> [   88.305391] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> aer_agent=Receiver ID
> 
> > Case I is using APEI, and it looks like that can queue up 16 errors
> > (AER_RECOVER_RING_SIZE), so that queue could be completely full before
> > we even get a chance to reset the device.  But I would think that the
> > reset should *eventually* stop the errors, even though we might log
> > 30+ of them first.
> >
> > As an experiment, you could reduce AER_RECOVER_RING_SIZE to 1 or 2 and
> > see if it reduces the logging.
> 
> Did not tried this experiment. I believe it is not required now
> 
> --pk
> 
> >
> > > > > Problem mentioned in case I and II goes away if do pci_reset_function
> > > > > during enumeration phase of kdump kernel.
> > > > > can we thought of doing pci_reset_function for all devices in kdump
> > > > > kernel or device specific quirk.
> > > > >
> > > > > --pk
> > > > >
> > > > >
> > > > > > > As per my understanding, possible solutions are
> > > > > > >  - Copy SMMU table i.e. this patch
> > > > > > > OR
> > > > > > >  - Doing pci_reset_function() during enumeration phase.
> > > > > > > I also tried clearing "M" bit using pci_clear_master during
> > > > > > > enumeration but it did not help. Because driver re-set M bit causing
> > > > > > > same AER error again.
> > > > > > >
> > > > > > >
> > > > > > > -pk
> > > > > > >
> > > > > > > ---------------------------------------------------------------------------------------------------------------------------
> > > > > > > [1] with bootargs having pci=noaer
> > > > > > >
> > > > > > > [   22.494648] {4}[Hardware Error]: Hardware error from APEI Generic
> > > > > > > Hardware Error Source: 1
> > > > > > > [   22.512773] {4}[Hardware Error]: event severity: recoverable
> > > > > > > [   22.518419] {4}[Hardware Error]:  Error 0, type: recoverable
> > > > > > > [   22.544804] {4}[Hardware Error]:   section_type: PCIe error
> > > > > > > [   22.550363] {4}[Hardware Error]:   port_type: 0, PCIe end point
> > > > > > > [   22.556268] {4}[Hardware Error]:   version: 3.0
> > > > > > > [   22.560785] {4}[Hardware Error]:   command: 0x0507, status: 0x4010
> > > > > > > [   22.576852] {4}[Hardware Error]:   device_id: 0000:09:00.1
> > > > > > > [   22.582323] {4}[Hardware Error]:   slot: 0
> > > > > > > [   22.586406] {4}[Hardware Error]:   secondary_bus: 0x00
> > > > > > > [   22.591530] {4}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > > > > > [   22.608900] {4}[Hardware Error]:   class_code: 000002
> > > > > > > [   22.613938] {4}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > > > > > [   22.803534] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > > > > aer_mask: 0x00000000
> > > > > > > [   22.810838] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > > > > [   22.817613] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > > > > aer_agent=Requester ID
> > > > > > > [   22.847374] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > > > > [   22.866161] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED,
> > > > > > > total mem (8153768 kB)
> > > > > > > [   22.946178] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> > > > > > > [   22.995142] pci 0000:09:00.1: AER: can't recover (no error_detected callback)
> > > > > > > [   23.002300] pcieport 0000:00:09.0: AER: device recovery failed
> > > > > > > [   23.027607] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > > > > aer_mask: 0x00000000
> > > > > > > [   23.044109] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > > > > [   23.060713] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > > > > aer_agent=Requester ID
> > > > > > > [   23.068616] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > > > > [   23.122056] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> >
> > <snip>

Will Deacon June 1, 2020, 7:39 a.m. UTC | #14

On Thu, May 21, 2020 at 04:52:02PM +0530, Prabhakar Kushwaha wrote:
> On Thu, May 21, 2020 at 2:53 PM Will Deacon <will@kernel.org> wrote:
> >
> > On Tue, May 19, 2020 at 08:24:21AM +0530, Prabhakar Kushwaha wrote:
> > > On Mon, May 18, 2020 at 9:25 PM Will Deacon <will@kernel.org> wrote:
> > > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > > @@ -3272,6 +3281,23 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> > > > >       return 0;
> > > > >  }
> > > > >
> > > > > +static void arm_smmu_copy_table(struct arm_smmu_device *smmu,
> > > > > +                            struct arm_smmu_strtab_cfg *cfg, u32 size)
> > > > > +{
> > > > > +     struct arm_smmu_strtab_cfg rdcfg;
> > > > > +
> > > > > +     rdcfg.strtab_dma = readq_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE);
> > > > > +     rdcfg.strtab_base_cfg = readq_relaxed(smmu->base
> > > > > +                                           + ARM_SMMU_STRTAB_BASE_CFG);
> > > > > +
> > > > > +     rdcfg.strtab_dma &= STRTAB_BASE_ADDR_MASK;
> > > > > +     rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);
> > > > > +
> > > > > +     memcpy_fromio(cfg->strtab, rdcfg.strtab, size);
> > > > > +
> > >
> > > this need a fix. It should be memcpy.
> > >
> > > > > +     cfg->strtab_base_cfg = rdcfg.strtab_base_cfg;
> > > >
> > > > Sorry, but this is unacceptable. These things were allocated by the DMA API
> > > > so you can't just memcpy them around and hope for the best.
> > > >
> > >
> > > I was referring copy_context_table() in drivers/iommu/intel-iommu.c.
> > > here i see usage of memremap and memcpy to copy older iommu table.
> > > did I take wrong reference?
> > >
> > > What kind of issue you are foreseeing in using memcpy(). May be we can
> > > try to find a solution.
> >
> > Well the thing might not be cache-coherent to start with...
> >
> 
> Thanks for telling possible issue area.  Let me try to explain why
> this should not be an issue.
> 
> kdump kernel runs from reserved memory space defined during the boot
> of first kernel. kdump does not touch memory of the previous kernel.
> So no page has been created in kdump kernel  and  there should not be
> any data/attribute/coherency issue from MMU point of view .

Then how does this work?:

	rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);

You're explicitly asking for a write-back mapping.

> During SMMU probe functions,  dmem_alloc_coherent() will be used
> allocate new memory (part of existing flow).
> This patch copy STE or first level descriptor to *this* memory, after
> mapping physical address using memremap().
> It just copy everything  so there should not be any issue related to
> attribute/content.
> 
> Yes, copying  done after mapping it as MEMREMAP_WB. if you want I can
> use it as MEMREMAP_WT

You need to take into account whether or not the device is coherent, and the
DMA API is designed to handle that for you. But even then, this is fragile
as hell because you end up having to infer the hardware configuration
from the device to understand the size and format of the data structures.
If the crashkernel isn't identical to the host kernel (in terms of kconfig,
driver version, firmware tables, cmdline etc) then this is very likely to
go wrong.

That's why I think that you need to reinitialise any devices that want to
do DMA.

Will

Prabhakar Kushwaha June 2, 2020, 2:04 p.m. UTC | #15

Hi Will,

Thanks for replying..

On Mon, Jun 1, 2020 at 1:10 PM Will Deacon <will@kernel.org> wrote:
>
> On Thu, May 21, 2020 at 04:52:02PM +0530, Prabhakar Kushwaha wrote:
> > On Thu, May 21, 2020 at 2:53 PM Will Deacon <will@kernel.org> wrote:
> > >
> > > On Tue, May 19, 2020 at 08:24:21AM +0530, Prabhakar Kushwaha wrote:
> > > > On Mon, May 18, 2020 at 9:25 PM Will Deacon <will@kernel.org> wrote:
> > > > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > > > @@ -3272,6 +3281,23 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
> > > > > >       return 0;
> > > > > >  }
> > > > > >
> > > > > > +static void arm_smmu_copy_table(struct arm_smmu_device *smmu,
> > > > > > +                            struct arm_smmu_strtab_cfg *cfg, u32 size)
> > > > > > +{
> > > > > > +     struct arm_smmu_strtab_cfg rdcfg;
> > > > > > +
> > > > > > +     rdcfg.strtab_dma = readq_relaxed(smmu->base + ARM_SMMU_STRTAB_BASE);
> > > > > > +     rdcfg.strtab_base_cfg = readq_relaxed(smmu->base
> > > > > > +                                           + ARM_SMMU_STRTAB_BASE_CFG);
> > > > > > +
> > > > > > +     rdcfg.strtab_dma &= STRTAB_BASE_ADDR_MASK;
> > > > > > +     rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);
> > > > > > +
> > > > > > +     memcpy_fromio(cfg->strtab, rdcfg.strtab, size);
> > > > > > +
> > > >
> > > > this need a fix. It should be memcpy.
> > > >
> > > > > > +     cfg->strtab_base_cfg = rdcfg.strtab_base_cfg;
> > > > >
> > > > > Sorry, but this is unacceptable. These things were allocated by the DMA API
> > > > > so you can't just memcpy them around and hope for the best.
> > > > >
> > > >
> > > > I was referring copy_context_table() in drivers/iommu/intel-iommu.c.
> > > > here i see usage of memremap and memcpy to copy older iommu table.
> > > > did I take wrong reference?
> > > >
> > > > What kind of issue you are foreseeing in using memcpy(). May be we can
> > > > try to find a solution.
> > >
> > > Well the thing might not be cache-coherent to start with...
> > >
> >
> > Thanks for telling possible issue area.  Let me try to explain why
> > this should not be an issue.
> >
> > kdump kernel runs from reserved memory space defined during the boot
> > of first kernel. kdump does not touch memory of the previous kernel.
> > So no page has been created in kdump kernel  and  there should not be
> > any data/attribute/coherency issue from MMU point of view .
>
> Then how does this work?:
>
>         rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);
>
> You're explicitly asking for a write-back mapping.
>

As i mentioned earlier, I will replace it with MEMREMAP_WT to make
sure data is written into the memory.

Please note, this memmap is temporary for copying older SMMU table to
cfg->strtab.
Here, cfg->strtab & cfg->strtab_dma allocated via dmam_alloc_coherent
during SMMU probe.


> > During SMMU probe functions,  dmem_alloc_coherent() will be used
> > allocate new memory (part of existing flow).
> > This patch copy STE or first level descriptor to *this* memory, after
> > mapping physical address using memremap().
> > It just copy everything  so there should not be any issue related to
> > attribute/content.
> >
> > Yes, copying  done after mapping it as MEMREMAP_WB. if you want I can
> > use it as MEMREMAP_WT
>
> You need to take into account whether or not the device is coherent, and the
> DMA API is designed to handle that for you. But even then, this is fragile
> as hell because you end up having to infer the hardware configuration
> from the device to understand the size and format of the data structures.
> If the crashkernel isn't identical to the host kernel (in terms of kconfig,
> driver version, firmware tables, cmdline etc) then this is very likely to
> go wrong.

There are two possible scenarios for mismatched kdump kernel
1.  kdump kernel does not have the devices' driver
2.  kdump kernel have the different variation/configuration of driver

This patch create temporary SMMU table entries which are overwritten
by driver-probe.
Driver's probe will overwrite SMMU entries based on its new
requirement (size, format, data structures etc).

for "1",  As  no device driver,  SMMU entry will remain there.
Means no-one looking for the copied content (even if device continued
to perform DMA).

About coherency between Cores and Memory(DMA).
At the time of crash:  Only one CPU is allowed to remain continue,
rest are stopped.
__crash_kexec --> machine_crash_shutdown --> crash_smp_send_stop()

The active CPU is used to boot kdump kernel. hence none of the CPUs is
looking for data copied by DMA.
Coherency issue should not be there.

please let me know your view.

--pk

Prabhakar Kushwaha June 3, 2020, 5:42 p.m. UTC | #16

Hi Bjorn,

On Sat, May 30, 2020 at 1:03 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Fri, May 29, 2020 at 07:48:10PM +0530, Prabhakar Kushwaha wrote:
> > On Thu, May 28, 2020 at 1:48 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >
> > > On Wed, May 27, 2020 at 05:14:39PM +0530, Prabhakar Kushwaha wrote:
> > > > On Fri, May 22, 2020 at 4:19 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Thu, May 21, 2020 at 09:28:20AM +0530, Prabhakar Kushwaha wrote:
> > > > > > On Wed, May 20, 2020 at 4:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
> > > > > > > > On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > > > > > > > An SMMU Stream table is created by the primary kernel. This table is
> > > > > > > > > > used by the SMMU to perform address translations for device-originated
> > > > > > > > > > transactions. Any crash (if happened) launches the kdump kernel which
> > > > > > > > > > re-creates the SMMU Stream table. New transactions will be translated
> > > > > > > > > > via this new table..
> > > > > > > > > >
> > > > > > > > > > There are scenarios, where devices are still having old pending
> > > > > > > > > > transactions (configured in the primary kernel). These transactions
> > > > > > > > > > come in-between Stream table creation and device-driver probe.
> > > > > > > > > > As new stream table does not have entry for older transactions,
> > > > > > > > > > it will be aborted by SMMU.
> > > > > > > > > >
> > > > > > > > > > Similar observations were found with PCIe-Intel 82576 Gigabit
> > > > > > > > > > Network card. It sends old Memory Read transaction in kdump kernel.
> > > > > > > > > > Transactions configured for older Stream table entries, that do not
> > > > > > > > > > exist any longer in the new table, will cause a PCIe Completion Abort.
> > > > > > > > >
> > > > > > > > > That sounds like exactly what we want, doesn't it?
> > > > > > > > >
> > > > > > > > > Or do you *want* DMA from the previous kernel to complete?  That will
> > > > > > > > > read or scribble on something, but maybe that's not terrible as long
> > > > > > > > > as it's not memory used by the kdump kernel.
> > > > > > > >
> > > > > > > > Yes, Abort should happen. But it should happen in context of driver.
> > > > > > > > But current abort is happening because of SMMU and no driver/pcie
> > > > > > > > setup present at this moment.
> > > > > > >
> > > > > > > I don't understand what you mean by "in context of driver."  The whole
> > > > > > > problem is that we can't control *when* the abort happens, so it may
> > > > > > > happen in *any* context.  It may happen when a NIC receives a packet
> > > > > > > or at some other unpredictable time.
> > > > > > >
> > > > > > > > Solution of this issue should be at 2 place
> > > > > > > > a) SMMU level: I still believe, this patch has potential to overcome
> > > > > > > > issue till finally driver's probe takeover.
> > > > > > > > b) Device level: Even if something goes wrong. Driver/device should
> > > > > > > > able to recover.
> > > > > > > >
> > > > > > > > > > Returned PCIe completion abort further leads to AER Errors from APEI
> > > > > > > > > > Generic Hardware Error Source (GHES) with completion timeout.
> > > > > > > > > > A network device hang is observed even after continuous
> > > > > > > > > > reset/recovery from driver, Hence device is no more usable.
> > > > > > > > >
> > > > > > > > > The fact that the device is no longer usable is definitely a problem.
> > > > > > > > > But in principle we *should* be able to recover from these errors.  If
> > > > > > > > > we could recover and reliably use the device after the error, that
> > > > > > > > > seems like it would be a more robust solution that having to add
> > > > > > > > > special cases in every IOMMU driver.
> > > > > > > > >
> > > > > > > > > If you have details about this sort of error, I'd like to try to fix
> > > > > > > > > it because we want to recover from that sort of error in normal
> > > > > > > > > (non-crash) situations as well.
> > > > > > > > >
> > > > > > > > Completion abort case should be gracefully handled.  And device should
> > > > > > > > always remain usable.
> > > > > > > >
> > > > > > > > There are 2 scenario which I am testing with Ethernet card PCIe-Intel
> > > > > > > > 82576 Gigabit Network card.
> > > > > > > >
> > > > > > > > I)  Crash testing using kdump root file system: De-facto scenario
> > > > > > > >     -  kdump file system does not have Ethernet driver
> > > > > > > >     -  A lot of AER prints [1], making it impossible to work on shell
> > > > > > > > of kdump root file system.
> > > > > > >
> > > > > > > In this case, I think report_error_detected() is deciding that because
> > > > > > > the device has no driver, we can't do anything.  The flow is like
> > > > > > > this:
> > > > > > >
> > > > > > >   aer_recover_work_func               # aer_recover_work
> > > > > > >     kfifo_get(aer_recover_ring, entry)
> > > > > > >     dev = pci_get_domain_bus_and_slot
> > > > > > >     cper_print_aer(dev, ...)
> > > > > > >       pci_err("AER: aer_status:")
> > > > > > >       pci_err("AER:   [14] CmpltTO")
> > > > > > >       pci_err("AER: aer_layer=")
> > > > > > >     if (AER_NONFATAL)
> > > > > > >       pcie_do_recovery(dev, pci_channel_io_normal)
> > > > > > >         status = CAN_RECOVER
> > > > > > >         pci_walk_bus(report_normal_detected)
> > > > > > >           report_error_detected
> > > > > > >             if (!dev->driver)
> > > > > > >               vote = NO_AER_DRIVER
> > > > > > >               pci_info("can't recover (no error_detected callback)")
> > > > > > >             *result = merge_result(*, NO_AER_DRIVER)
> > > > > > >             # always NO_AER_DRIVER
> > > > > > >         status is now NO_AER_DRIVER
> > > > > > >
> > > > > > > So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(),
> > > > > > > and status is not RECOVERED, so it skips .resume().
> > > > > > >
> > > > > > > I don't remember the history there, but if a device has no driver and
> > > > > > > the device generates errors, it seems like we ought to be able to
> > > > > > > reset it.
> > > > > >
> > > > > > But how to reset the device considering there is no driver.
> > > > > > Hypothetically, this case should be taken care by PCIe subsystem to
> > > > > > perform reset at PCIe level.
> > > > >
> > > > > I don't understand your question.  The PCI core (not the device
> > > > > driver) already does the reset.  When pcie_do_recovery() calls
> > > > > reset_link(), all devices on the other side of the link are reset.
> > > > >
> > > > > > > We should be able to field one (or a few) AER errors, reset the
> > > > > > > device, and you should be able to use the shell in the kdump kernel.
> > > > > > >
> > > > > > here kdump shell is usable only problem is a "lot of AER Errors". One
> > > > > > cannot see what they are typing.
> > > > >
> > > > > Right, that's what I expect.  If the PCI core resets the device, you
> > > > > should get just a few AER errors, and they should stop after the
> > > > > device is reset.
> > > > >
> > > > > > > >     -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.
> > > > > > > >
> > > > > > > > II) Crash testing using default root file system: Specific case to
> > > > > > > > test Ethernet driver in second kernel
> > > > > > > >    -  Default root file system have Ethernet driver
> > > > > > > >    -  AER error comes even before the driver probe starts.
> > > > > > > >    -  Driver does reset Ethernet card as part of probe but no success.
> > > > > > > >    -  AER also tries to recover. but no success.  [2]
> > > > > > > >    -  I also tries to remove AER errors by using "pci=noaer" bootargs
> > > > > > > > and commenting ghes_handle_aer() from GHES driver..
> > > > > > > >           than different set of errors come which also never able to recover [3]
> > > > > > > >
> > > > > >
> > > > > > Please suggest your view on this case. Here driver is preset.
> > > > > > (driver/net/ethernet/intel/igb/igb_main.c)
> > > > > > In this case AER errors starts even before driver probe starts.
> > > > > > After probe, driver does the device reset with no success and even AER
> > > > > > recovery does not work.
> > > > >
> > > > > This case should be the same as the one above.  If we can change the
> > > > > PCI core so it can reset the device when there's no driver,  that would
> > > > > apply to case I (where there will never be a driver) and to case II
> > > > > (where there is no driver now, but a driver will probe the device
> > > > > later).
> > > >
> > > > Does this means change are required in PCI core.
> > >
> > > Yes, I am suggesting that the PCI core does not do the right thing
> > > here.
> > >
> > > > I tried following changes in pcie_do_recovery() but it did not help.
> > > > Same error as before.
> > > >
> > > > -- a/drivers/pci/pcie/err.c
> > > > +++ b/drivers/pci/pcie/err.c
> > > >         pci_info(dev, "broadcast resume message\n");
> > > >         pci_walk_bus(bus, report_resume, &status);
> > > > @@ -203,7 +207,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> > > >         return status;
> > > >
> > > >  failed:
> > > >         pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
> > > > +       pci_reset_function(dev);
> > > > +       pci_aer_clear_device_status(dev);
> > > > +       pci_aer_clear_nonfatal_status(dev);
> > >
> > > Did you confirm that this resets the devices in question (0000:09:00.0
> > > and 0000:09:00.1, I think), and what reset mechanism this uses (FLR,
> > > PM, etc)?
> >
> > Earlier reset  was happening with P2P bridge(0000:00:09.0) this the
> > reason no effect. After making following changes,  both devices are
> > now getting reset.
> > Both devices are using FLR.
> >
> > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> > index 117c0a2b2ba4..26b908f55aef 100644
> > --- a/drivers/pci/pcie/err.c
> > +++ b/drivers/pci/pcie/err.c
> > @@ -66,6 +66,20 @@ static int report_error_detected(struct pci_dev *dev,
> >                 if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
> >                         vote = PCI_ERS_RESULT_NO_AER_DRIVER;
> >                         pci_info(dev, "can't recover (no
> > error_detected callback)\n");
> > +
> > +                       pci_save_state(dev);
> > +                       pci_cfg_access_lock(dev);
> > +
> > +                       /* Quiesce the device completely */
> > +                       pci_write_config_word(dev, PCI_COMMAND,
> > +                             PCI_COMMAND_INTX_DISABLE);
> > +                       if (!__pci_reset_function_locked(dev)) {
> > +                               vote = PCI_ERS_RESULT_RECOVERED;
> > +                               pci_info(dev, "recovered via pci level
> > reset\n");
> > +                       }
>
> Why do we need to save the state and quiesce the device?  The reset
> should disable interrupts anyway.  In this particular case where
> there's no driver, I don't think we should have to restore the state.
> We maybe should *remove* the device and re-enumerate it after the
> reset, but the state from before the reset should be irrelevant.
>

I tried pci_reset_fucntion_locked without save/restore then I got the
synchronous abort during igb_probe (case 2 i.e. with driver). This is
100% reproducible.
looks like pci_reset_function_locked is causing PCI configuration
space random. Same is mentioned here
https://www.kernel.org/doc/html/latest/driver-api/pci/pci.html

[   16.492586] Internal error: synchronous external abort: 96000610 [#1] SMP
[   16.499362] Modules linked in: mpt3sas(+) igb(+) nvme nvme_core
raid_class scsi_transport_sas i2c_algo_bit mdio libcrc32c gpio_xlp
i2c_xlp9xx(+) uas usb_storage
[   16.513696] CPU: 0 PID: 477 Comm: systemd-udevd Not tainted 5.7.0-rc3+ #132
[   16.520644] Hardware name: Cavium Inc. Saber/Saber, BIOS
TX2-FW-Release-3.1-build_01-2803-g74253a541a mm/dd/yyyy
[   16.530805] pstate: 60400009 (nZCv daif +PAN -UAO)
[   16.535598] pc : igb_rd32+0x24/0xe0 [igb]
[   16.539603] lr : igb_get_invariants_82575+0xb0/0xde8 [igb]
[   16.545074] sp : ffffffc012e2b7e0
[   16.548375] x29: ffffffc012e2b7e0 x28: ffffffc008baa4d8
[   16.553674] x27: 0000000000000001 x26: ffffffc008b99a70
[   16.558972] x25: ffffff8cdef60900 x24: ffffff8cdef60e48
[   16.564270] x23: ffffff8cf30b50b0 x22: ffffffc011359988
[   16.569568] x21: ffffff8cdef612e0 x20: ffffff8cdef60e68
[   16.574866] x19: ffffffc0140a0018 x18: 0000000000000000
[   16.580164] x17: 0000000000000000 x16: 0000000000000000
[   16.585463] x15: 0000000000000000 x14: 0000000000000000
[   16.590761] x13: 0000000000000000 x12: 0000000000000000
[   16.596059] x11: ffffffc008b86b08 x10: 0000000000000000
[   16.601357] x9 : ffffffc008b88888 x8 : ffffffc008b81050
[   16.606655] x7 : 0000000000000000 x6 : ffffff8cdef611a8
[   16.611952] x5 : ffffffc008b887d8 x4 : ffffffc008ba7a68
[   16.617250] x3 : 0000000000000000 x2 : ffffffc0140a0000
[   16.622548] x1 : 0000000000000018 x0 : ffffff8cdef60e48
[   16.627846] Call trace:
[   16.630288]  igb_rd32+0x24/0xe0 [igb]
[   16.633943]  igb_get_invariants_82575+0xb0/0xde8 [igb]
[   16.639073]  igb_probe+0x264/0xed8 [igb]
[   16.642989]  local_pci_probe+0x48/0xb8
[   16.646727]  pci_device_probe+0x120/0x1b8
[   16.650735]  really_probe+0xe4/0x448
[   16.654298]  driver_probe_device+0xe8/0x140
[   16.658469]  device_driver_attach+0x7c/0x88
[   16.662638]  __driver_attach+0xac/0x178
[   16.666462]  bus_for_each_dev+0x7c/0xd0
[   16.670284]  driver_attach+0x2c/0x38
[   16.673846]  bus_add_driver+0x1a8/0x240
[   16.677670]  driver_register+0x6c/0x128
[   16.681492]  __pci_register_driver+0x4c/0x58
[   16.685754]  igb_init_module+0x64/0x1000 [igb]
[   16.690189]  do_one_initcall+0x54/0x228
[   16.694021]  do_init_module+0x60/0x240
[   16.697757]  load_module+0x1614/0x1970
[   16.701493]  __do_sys_finit_module+0xb4/0x118
[   16.705837]  __arm64_sys_finit_module+0x28/0x38
[   16.710367]  do_el0_svc+0xf8/0x1b8
[   16.713761]  el0_sync_handler+0x12c/0x20c
[   16.717757]  el0_sync+0x158/0x180
[   16.721062] Code: a90153f3 f9400402 b4000482 8b214053 (b9400273)
[   16.727144] ---[ end trace 95523d7d37f1d883 ]---
[   16.731748] Kernel panic - not syncing: Fatal exception
[   16.736962] Kernel Offset: disabled
[   16.740438] CPU features: 0x084002,22000c38
[   16.744607] Memory Limit: none

> > +                       pci_cfg_access_unlock(dev);
> > +                       pci_restore_state(dev);
> >                 } else {
> >                         vote = PCI_ERS_RESULT_NONE;
> >                 }
> >
> > in order to take care of case 2 (driver comes after sometime) ==>
> > following code needs to be added to avoid crash during igb_probe.  It
> > looks to be a race condition between AER and igb_probe().
> >
> > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
> > b/drivers/net/ethernet/intel/igb/igb_main.c
> > index b46bff8fe056..c48f0a54bb95 100644
> > --- a/drivers/net/ethernet/intel/igb/igb_main.c
> > +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> > @@ -3012,6 +3012,11 @@ static int igb_probe(struct pci_dev *pdev,
> > const struct pci_device_id *ent)
> >         /* Catch broken hardware that put the wrong VF device ID in
> >          * the PCIe SR-IOV capability.
> >          */
> > +       if (pci_dev_trylock(pdev)) {
> > +               mdelay(1000);
> > +               pci_info(pdev,"device is locked, try waiting 1 sec\n");
> > +       }
>
> This is interesting to learn about the AER/driver interaction, but of
> course, we wouldn't want to add code like this permanently.
>
> > Here are the observation with all above changes
> > A) AER errors are less but they are still there for both case 1 (No
> > driver at all) and case 2 (driver comes after some time)
>
> We'll certainly get *some* AER errors.  We have to get one before we
> know to reset the device.
>
> > B) Each AER error(NON_FATAL) causes both devices to reset. It happens many times
>
> I'm not sure why we reset both devices.  Are we seeing errors from
> both, or could we be more selective in the code?
>

I tried even with a reset of 09.01.0 *only* but again AER errors were
found from 09.00.0 as mentioned in previous mail.
So either do a reset of one or both devices, AER error from 09.00.0 is
inevitable. So better to do rest for all devices connected to the bus.

Following changes looks to be working with these observations for case
1 (No  driver at all) & case 2 (driver comes after some time)
A) AER errors are less
B) For NON_FATAL AER errors both devices get reset.
C) Few AER errors(neither NON_FATAL nor FATAL) for 09.00.0 still
comes. (Note this device is never used for networking in the primary
kernel)
D) No action taking for "c" as below changes does not cover "c".
E)  No AER errors from any device after some time (At least 8-10 AER
errors, all from 09.00.0)
F) Ping/SSH is working fine in case 2 for kudmp kernel.

Please let me know your view. I can send a patch after detailed testing.

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 14bb8f54723e..585a43b9c0da 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -66,6 +66,19 @@ static int report_error_detected(struct pci_dev *dev,
                if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
                        vote = PCI_ERS_RESULT_NO_AER_DRIVER;
                        pci_info(dev, "can't recover (no
error_detected callback)\n");
+
+                       pci_save_state(dev);
+                       pci_cfg_access_lock(dev);
+
+                       pci_write_config_word(dev, PCI_COMMAND,
PCI_COMMAND_INTX_DISABLE);
+
+                       if (!__pci_reset_function_locked(dev)) {
+                               vote = PCI_ERS_RESULT_RECOVERED;
+                               pci_info(dev, "Recovered via pci level
reset\n");
+                       }
+
+                       pci_cfg_access_unlock(dev);
+                       pci_restore_state(dev);
                } else {
                        vote = PCI_ERS_RESULT_NONE;

--pk


> > C) After that AER errors [1] comes is only for device 0000:09:00.0.
> > This is strange as this pci device is not being used during test.
> > Ping/ssh are happening with 0000:09:01.0
> > D) If wait for some more time. No more AER errors from any device
> > E) Ping is working fine in case 2.
> >
> > 09:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network
> > Connection (rev 01)
> > 09:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network
> > Connection (rev 01)
> >
> > # lspci -t -v
> >
> >  \-[0000:00]-+-00.0  Cavium, Inc. CN99xx [ThunderX2] Integrated PCI Host bridge
> >              +-01.0-[01]--
> >              +-02.0-[02]--
> >              +-03.0-[03]--
> >              +-04.0-[04]--
> >              +-05.0-[05]--+-00.0  Broadcom Inc. and subsidiaries
> > BCM57840 NetXtreme II 10 Gigabit Ethernet
> >              |            \-00.1  Broadcom Inc. and subsidiaries
> > BCM57840 NetXtreme II 10 Gigabit Ethernet
> >              +-06.0-[06]--
> >              +-07.0-[07]--
> >              +-08.0-[08]--
> >              +-09.0-[09-0a]--+-00.0  Intel Corporation 82576 Gigabit
> > Network Connection
> >              |               \-00.1  Intel Corporation 82576 Gigabit
> > Network Connection
> >
> >
> > [1] AER error which comes for 09:00.0:
> >
> > [   81.659825] {7}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 0
> > [   81.668080] {7}[Hardware Error]: It has been corrected by h/w and
> > requires no further action
> > [   81.676503] {7}[Hardware Error]: event severity: corrected
> > [   81.681975] {7}[Hardware Error]:  Error 0, type: corrected
> > [   81.687447] {7}[Hardware Error]:   section_type: PCIe error
> > [   81.693004] {7}[Hardware Error]:   port_type: 0, PCIe end point
> > [   81.698908] {7}[Hardware Error]:   version: 3.0
> > [   81.703424] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> > [   81.709589] {7}[Hardware Error]:   device_id: 0000:09:00.0
> > [   81.715059] {7}[Hardware Error]:   slot: 0
> > [   81.719141] {7}[Hardware Error]:   secondary_bus: 0x00
> > [   81.724265] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   81.730864] {7}[Hardware Error]:   class_code: 000002
> > [   81.735901] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   81.742587] {7}[Hardware Error]:  Error 1, type: corrected
> > [   81.748058] {7}[Hardware Error]:   section_type: PCIe error
> > [   81.753615] {7}[Hardware Error]:   port_type: 4, root port
> > [   81.759086] {7}[Hardware Error]:   version: 3.0
> > [   81.763602] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> > [   81.769767] {7}[Hardware Error]:   device_id: 0000:00:09.0
> > [   81.775237] {7}[Hardware Error]:   slot: 0
> > [   81.779319] {7}[Hardware Error]:   secondary_bus: 0x09
> > [   81.784442] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > [   81.791041] {7}[Hardware Error]:   class_code: 000406
> > [   81.796078] {7}[Hardware Error]:   bridge: secondary_status:
> > 0x6000, control: 0x0002
> > [   81.803806] {7}[Hardware Error]:  Error 2, type: corrected
> > [   81.809276] {7}[Hardware Error]:   section_type: PCIe error
> > [   81.814834] {7}[Hardware Error]:   port_type: 0, PCIe end point
> > [   81.820738] {7}[Hardware Error]:   version: 3.0
> > [   81.825254] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> > [   81.831419] {7}[Hardware Error]:   device_id: 0000:09:00.0
> > [   81.836889] {7}[Hardware Error]:   slot: 0
> > [   81.840971] {7}[Hardware Error]:   secondary_bus: 0x00
> > [   81.846094] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   81.852693] {7}[Hardware Error]:   class_code: 000002
> > [   81.857730] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   81.864416] {7}[Hardware Error]:  Error 3, type: corrected
> > [   81.869886] {7}[Hardware Error]:   section_type: PCIe error
> > [   81.875444] {7}[Hardware Error]:   port_type: 4, root port
> > [   81.880914] {7}[Hardware Error]:   version: 3.0
> > [   81.885430] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> > [   81.891595] {7}[Hardware Error]:   device_id: 0000:00:09.0
> > [   81.897066] {7}[Hardware Error]:   slot: 0
> > [   81.901147] {7}[Hardware Error]:   secondary_bus: 0x09
> > [   81.906271] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > [   81.912870] {7}[Hardware Error]:   class_code: 000406
> > [   81.917906] {7}[Hardware Error]:   bridge: secondary_status:
> > 0x6000, control: 0x0002
> > [   81.925634] {7}[Hardware Error]:  Error 4, type: corrected
> > [   81.931104] {7}[Hardware Error]:   section_type: PCIe error
> > [   81.936662] {7}[Hardware Error]:   port_type: 0, PCIe end point
> > [   81.942566] {7}[Hardware Error]:   version: 3.0
> > [   81.947082] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> > [   81.953247] {7}[Hardware Error]:   device_id: 0000:09:00.0
> > [   81.958717] {7}[Hardware Error]:   slot: 0
> > [   81.962799] {7}[Hardware Error]:   secondary_bus: 0x00
> > [   81.967923] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   81.974522] {7}[Hardware Error]:   class_code: 000002
> > [   81.979558] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   81.986244] {7}[Hardware Error]:  Error 5, type: corrected
> > [   81.991715] {7}[Hardware Error]:   section_type: PCIe error
> > [   81.997272] {7}[Hardware Error]:   port_type: 4, root port
> > [   82.002743] {7}[Hardware Error]:   version: 3.0
> > [   82.007259] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> > [   82.013424] {7}[Hardware Error]:   device_id: 0000:00:09.0
> > [   82.018894] {7}[Hardware Error]:   slot: 0
> > [   82.022976] {7}[Hardware Error]:   secondary_bus: 0x09
> > [   82.028099] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > [   82.034698] {7}[Hardware Error]:   class_code: 000406
> > [   82.039735] {7}[Hardware Error]:   bridge: secondary_status:
> > 0x6000, control: 0x0002
> > [   82.047463] {7}[Hardware Error]:  Error 6, type: corrected
> > [   82.052933] {7}[Hardware Error]:   section_type: PCIe error
> > [   82.058491] {7}[Hardware Error]:   port_type: 0, PCIe end point
> > [   82.064395] {7}[Hardware Error]:   version: 3.0
> > [   82.068911] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> > [   82.075076] {7}[Hardware Error]:   device_id: 0000:09:00.0
> > [   82.080547] {7}[Hardware Error]:   slot: 0
> > [   82.084628] {7}[Hardware Error]:   secondary_bus: 0x00
> > [   82.089752] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   82.096351] {7}[Hardware Error]:   class_code: 000002
> > [   82.101387] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   82.108073] {7}[Hardware Error]:  Error 7, type: corrected
> > [   82.113544] {7}[Hardware Error]:   section_type: PCIe error
> > [   82.119101] {7}[Hardware Error]:   port_type: 4, root port
> > [   82.124572] {7}[Hardware Error]:   version: 3.0
> > [   82.129087] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> > [   82.135252] {7}[Hardware Error]:   device_id: 0000:00:09.0
> > [   82.140723] {7}[Hardware Error]:   slot: 0
> > [   82.144805] {7}[Hardware Error]:   secondary_bus: 0x09
> > [   82.149928] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > [   82.156527] {7}[Hardware Error]:   class_code: 000406
> > [   82.161564] {7}[Hardware Error]:   bridge: secondary_status:
> > 0x6000, control: 0x0002
> > [   82.169291] {7}[Hardware Error]:  Error 8, type: corrected
> > [   82.174762] {7}[Hardware Error]:   section_type: PCIe error
> > [   82.180319] {7}[Hardware Error]:   port_type: 0, PCIe end point
> > [   82.186224] {7}[Hardware Error]:   version: 3.0
> > [   82.190739] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> > [   82.196904] {7}[Hardware Error]:   device_id: 0000:09:00.0
> > [   82.202375] {7}[Hardware Error]:   slot: 0
> > [   82.206456] {7}[Hardware Error]:   secondary_bus: 0x00
> > [   82.211580] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   82.218179] {7}[Hardware Error]:   class_code: 000002
> > [   82.223216] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   82.229901] {7}[Hardware Error]:  Error 9, type: corrected
> > [   82.235372] {7}[Hardware Error]:   section_type: PCIe error
> > [   82.240929] {7}[Hardware Error]:   port_type: 4, root port
> > [   82.246400] {7}[Hardware Error]:   version: 3.0
> > [   82.250916] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> > [   82.257081] {7}[Hardware Error]:   device_id: 0000:00:09.0
> > [   82.262551] {7}[Hardware Error]:   slot: 0
> > [   82.266633] {7}[Hardware Error]:   secondary_bus: 0x09
> > [   82.271756] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > [   82.278355] {7}[Hardware Error]:   class_code: 000406
> > [   82.283392] {7}[Hardware Error]:   bridge: secondary_status:
> > 0x6000, control: 0x0002
> > [   82.291119] {7}[Hardware Error]:  Error 10, type: corrected
> > [   82.296676] {7}[Hardware Error]:   section_type: PCIe error
> > [   82.302234] {7}[Hardware Error]:   port_type: 0, PCIe end point
> > [   82.308138] {7}[Hardware Error]:   version: 3.0
> > [   82.312654] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> > [   82.318819] {7}[Hardware Error]:   device_id: 0000:09:00.0
> > [   82.324290] {7}[Hardware Error]:   slot: 0
> > [   82.328371] {7}[Hardware Error]:   secondary_bus: 0x00
> > [   82.333495] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   82.340094] {7}[Hardware Error]:   class_code: 000002
> > [   82.345131] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   82.351816] {7}[Hardware Error]:  Error 11, type: corrected
> > [   82.357374] {7}[Hardware Error]:   section_type: PCIe error
> > [   82.362931] {7}[Hardware Error]:   port_type: 4, root port
> > [   82.368402] {7}[Hardware Error]:   version: 3.0
> > [   82.372917] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> > [   82.379082] {7}[Hardware Error]:   device_id: 0000:00:09.0
> > [   82.384553] {7}[Hardware Error]:   slot: 0
> > [   82.388635] {7}[Hardware Error]:   secondary_bus: 0x09
> > [   82.393758] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > [   82.400357] {7}[Hardware Error]:   class_code: 000406
> > [   82.405394] {7}[Hardware Error]:   bridge: secondary_status:
> > 0x6000, control: 0x0002
> > [   82.413121] {7}[Hardware Error]:  Error 12, type: corrected
> > [   82.418678] {7}[Hardware Error]:   section_type: PCIe error
> > [   82.424236] {7}[Hardware Error]:   port_type: 0, PCIe end point
> > [   82.430140] {7}[Hardware Error]:   version: 3.0
> > [   82.434656] {7}[Hardware Error]:   command: 0x0507, status: 0x0010
> > [   82.440821] {7}[Hardware Error]:   device_id: 0000:09:00.0
> > [   82.446291] {7}[Hardware Error]:   slot: 0
> > [   82.450373] {7}[Hardware Error]:   secondary_bus: 0x00
> > [   82.455497] {7}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   82.462096] {7}[Hardware Error]:   class_code: 000002
> > [   82.467132] {7}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   82.473818] {7}[Hardware Error]:  Error 13, type: corrected
> > [   82.479375] {7}[Hardware Error]:   section_type: PCIe error
> > [   82.484933] {7}[Hardware Error]:   port_type: 4, root port
> > [   82.490403] {7}[Hardware Error]:   version: 3.0
> > [   82.494919] {7}[Hardware Error]:   command: 0x0106, status: 0x4010
> > [   82.501084] {7}[Hardware Error]:   device_id: 0000:00:09.0
> > [   82.506555] {7}[Hardware Error]:   slot: 0
> > [   82.510636] {7}[Hardware Error]:   secondary_bus: 0x09
> > [   82.515760] {7}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > [   82.522359] {7}[Hardware Error]:   class_code: 000406
> > [   82.527395] {7}[Hardware Error]:   bridge: secondary_status:
> > 0x6000, control: 0x0002
> > [   82.535171] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> > aer_mask: 0x00002000
> > [   82.542476] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> > aer_agent=Receiver ID
> > [   82.550301] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> > aer_mask: 0x00002000
> > [   82.558032] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> > Layer, aer_agent=Receiver ID
> > [   82.566296] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> > aer_mask: 0x00002000
> > [   82.573597] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> > aer_agent=Receiver ID
> > [   82.581421] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> > aer_mask: 0x00002000
> > [   82.589151] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> > Layer, aer_agent=Receiver ID
> > [   82.597411] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> > aer_mask: 0x00002000
> > [   82.604711] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> > aer_agent=Receiver ID
> > [   82.612535] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> > aer_mask: 0x00002000
> > [   82.620271] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> > Layer, aer_agent=Receiver ID
> > [   82.628525] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> > aer_mask: 0x00002000
> > [   82.635826] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> > aer_agent=Receiver ID
> > [   82.643649] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> > aer_mask: 0x00002000
> > [   82.651385] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> > Layer, aer_agent=Receiver ID
> > [   82.659645] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> > aer_mask: 0x00002000
> > [   82.666940] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> > aer_agent=Receiver ID
> > [   82.674763] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> > aer_mask: 0x00002000
> > [   82.682498] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> > Layer, aer_agent=Receiver ID
> > [   82.690759] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> > aer_mask: 0x00002000
> > [   82.698053] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> > aer_agent=Receiver ID
> > [   82.705876] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> > aer_mask: 0x00002000
> > [   82.713612] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> > Layer, aer_agent=Receiver ID
> > [   82.721872] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> > aer_mask: 0x00002000
> > [   82.729167] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> > aer_agent=Receiver ID
> > [   82.736990] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> > aer_mask: 0x00002000
> > [   82.744725] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> > Layer, aer_agent=Receiver ID
> > [   88.059225] {8}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 0
> > [   88.067478] {8}[Hardware Error]: It has been corrected by h/w and
> > requires no further action
> > [   88.075899] {8}[Hardware Error]: event severity: corrected
> > [   88.081370] {8}[Hardware Error]:  Error 0, type: corrected
> > [   88.086841] {8}[Hardware Error]:   section_type: PCIe error
> > [   88.092399] {8}[Hardware Error]:   port_type: 0, PCIe end point
> > [   88.098303] {8}[Hardware Error]:   version: 3.0
> > [   88.102819] {8}[Hardware Error]:   command: 0x0507, status: 0x0010
> > [   88.108984] {8}[Hardware Error]:   device_id: 0000:09:00.0
> > [   88.114455] {8}[Hardware Error]:   slot: 0
> > [   88.118536] {8}[Hardware Error]:   secondary_bus: 0x00
> > [   88.123660] {8}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   88.130259] {8}[Hardware Error]:   class_code: 000002
> > [   88.135296] {8}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   88.141981] {8}[Hardware Error]:  Error 1, type: corrected
> > [   88.147452] {8}[Hardware Error]:   section_type: PCIe error
> > [   88.153009] {8}[Hardware Error]:   port_type: 4, root port
> > [   88.158480] {8}[Hardware Error]:   version: 3.0
> > [   88.162995] {8}[Hardware Error]:   command: 0x0106, status: 0x4010
> > [   88.169161] {8}[Hardware Error]:   device_id: 0000:00:09.0
> > [   88.174633] {8}[Hardware Error]:   slot: 0
> > [   88.180018] {8}[Hardware Error]:   secondary_bus: 0x09
> > [   88.185142] {8}[Hardware Error]:   vendor_id: 0x177d, device_id: 0xaf84
> > [   88.191914] {8}[Hardware Error]:   class_code: 000406
> > [   88.196951] {8}[Hardware Error]:   bridge: secondary_status:
> > 0x6000, control: 0x0002
> > [   88.204852] {8}[Hardware Error]:  Error 2, type: corrected
> > [   88.210323] {8}[Hardware Error]:   section_type: PCIe error
> > [   88.215881] {8}[Hardware Error]:   port_type: 0, PCIe end point
> > [   88.221786] {8}[Hardware Error]:   version: 3.0
> > [   88.226301] {8}[Hardware Error]:   command: 0x0507, status: 0x0010
> > [   88.232466] {8}[Hardware Error]:   device_id: 0000:09:00.0
> > [   88.237937] {8}[Hardware Error]:   slot: 0
> > [   88.242019] {8}[Hardware Error]:   secondary_bus: 0x00
> > [   88.247142] {8}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > [   88.253741] {8}[Hardware Error]:   class_code: 000002
> > [   88.258778] {8}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > [   88.265509] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> > aer_mask: 0x00002000
> > [   88.272812] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> > aer_agent=Receiver ID
> > [   88.280635] pcieport 0000:00:09.0: AER: aer_status: 0x00000000,
> > aer_mask: 0x00002000
> > [   88.288363] pcieport 0000:00:09.0: AER: aer_layer=Transaction
> > Layer, aer_agent=Receiver ID
> > [   88.296622] igb 0000:09:00.0: AER: aer_status: 0x00002000,
> > aer_mask: 0x00002000
> > [   88.305391] igb 0000:09:00.0: AER: aer_layer=Transaction Layer,
> > aer_agent=Receiver ID
> >
> > > Case I is using APEI, and it looks like that can queue up 16 errors
> > > (AER_RECOVER_RING_SIZE), so that queue could be completely full before
> > > we even get a chance to reset the device.  But I would think that the
> > > reset should *eventually* stop the errors, even though we might log
> > > 30+ of them first.
> > >
> > > As an experiment, you could reduce AER_RECOVER_RING_SIZE to 1 or 2 and
> > > see if it reduces the logging.
> >
> > Did not tried this experiment. I believe it is not required now
> >
> > --pk
> >
> > >
> > > > > > Problem mentioned in case I and II goes away if do pci_reset_function
> > > > > > during enumeration phase of kdump kernel.
> > > > > > can we thought of doing pci_reset_function for all devices in kdump
> > > > > > kernel or device specific quirk.
> > > > > >
> > > > > > --pk
> > > > > >
> > > > > >
> > > > > > > > As per my understanding, possible solutions are
> > > > > > > >  - Copy SMMU table i.e. this patch
> > > > > > > > OR
> > > > > > > >  - Doing pci_reset_function() during enumeration phase.
> > > > > > > > I also tried clearing "M" bit using pci_clear_master during
> > > > > > > > enumeration but it did not help. Because driver re-set M bit causing
> > > > > > > > same AER error again.
> > > > > > > >
> > > > > > > >
> > > > > > > > -pk
> > > > > > > >
> > > > > > > > ---------------------------------------------------------------------------------------------------------------------------
> > > > > > > > [1] with bootargs having pci=noaer
> > > > > > > >
> > > > > > > > [   22.494648] {4}[Hardware Error]: Hardware error from APEI Generic
> > > > > > > > Hardware Error Source: 1
> > > > > > > > [   22.512773] {4}[Hardware Error]: event severity: recoverable
> > > > > > > > [   22.518419] {4}[Hardware Error]:  Error 0, type: recoverable
> > > > > > > > [   22.544804] {4}[Hardware Error]:   section_type: PCIe error
> > > > > > > > [   22.550363] {4}[Hardware Error]:   port_type: 0, PCIe end point
> > > > > > > > [   22.556268] {4}[Hardware Error]:   version: 3.0
> > > > > > > > [   22.560785] {4}[Hardware Error]:   command: 0x0507, status: 0x4010
> > > > > > > > [   22.576852] {4}[Hardware Error]:   device_id: 0000:09:00.1
> > > > > > > > [   22.582323] {4}[Hardware Error]:   slot: 0
> > > > > > > > [   22.586406] {4}[Hardware Error]:   secondary_bus: 0x00
> > > > > > > > [   22.591530] {4}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10c9
> > > > > > > > [   22.608900] {4}[Hardware Error]:   class_code: 000002
> > > > > > > > [   22.613938] {4}[Hardware Error]:   serial number: 0xff1b4580, 0x90e2baff
> > > > > > > > [   22.803534] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > > > > > aer_mask: 0x00000000
> > > > > > > > [   22.810838] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > > > > > [   22.817613] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > > > > > aer_agent=Requester ID
> > > > > > > > [   22.847374] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > > > > > [   22.866161] mpt3sas_cm0: 63 BIT PCI BUS DMA ADDRESSING SUPPORTED,
> > > > > > > > total mem (8153768 kB)
> > > > > > > > [   22.946178] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> > > > > > > > [   22.995142] pci 0000:09:00.1: AER: can't recover (no error_detected callback)
> > > > > > > > [   23.002300] pcieport 0000:00:09.0: AER: device recovery failed
> > > > > > > > [   23.027607] pci 0000:09:00.1: AER: aer_status: 0x00004000,
> > > > > > > > aer_mask: 0x00000000
> > > > > > > > [   23.044109] pci 0000:09:00.1: AER:    [14] CmpltTO                (First)
> > > > > > > > [   23.060713] pci 0000:09:00.1: AER: aer_layer=Transaction Layer,
> > > > > > > > aer_agent=Requester ID
> > > > > > > > [   23.068616] pci 0000:09:00.1: AER: aer_uncor_severity: 0x00062011
> > > > > > > > [   23.122056] pci 0000:09:00.0: AER: can't recover (no error_detected callback)
> > >
> > > <snip>

Bjorn Helgaas June 4, 2020, 12:02 a.m. UTC | #17

On Wed, Jun 03, 2020 at 11:12:48PM +0530, Prabhakar Kushwaha wrote:
> On Sat, May 30, 2020 at 1:03 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Fri, May 29, 2020 at 07:48:10PM +0530, Prabhakar Kushwaha wrote:
> > > On Thu, May 28, 2020 at 1:48 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > >
> > > > On Wed, May 27, 2020 at 05:14:39PM +0530, Prabhakar Kushwaha wrote:
> > > > > On Fri, May 22, 2020 at 4:19 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > On Thu, May 21, 2020 at 09:28:20AM +0530, Prabhakar Kushwaha wrote:
> > > > > > > On Wed, May 20, 2020 at 4:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > > On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
> > > > > > > > > On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > > > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > > > > > > > > An SMMU Stream table is created by the primary kernel. This table is
> > > > > > > > > > > used by the SMMU to perform address translations for device-originated
> > > > > > > > > > > transactions. Any crash (if happened) launches the kdump kernel which
> > > > > > > > > > > re-creates the SMMU Stream table. New transactions will be translated
> > > > > > > > > > > via this new table..
> > > > > > > > > > >
> > > > > > > > > > > There are scenarios, where devices are still having old pending
> > > > > > > > > > > transactions (configured in the primary kernel). These transactions
> > > > > > > > > > > come in-between Stream table creation and device-driver probe.
> > > > > > > > > > > As new stream table does not have entry for older transactions,
> > > > > > > > > > > it will be aborted by SMMU.
> > > > > > > > > > >
> > > > > > > > > > > Similar observations were found with PCIe-Intel 82576 Gigabit
> > > > > > > > > > > Network card. It sends old Memory Read transaction in kdump kernel.
> > > > > > > > > > > Transactions configured for older Stream table entries, that do not
> > > > > > > > > > > exist any longer in the new table, will cause a PCIe Completion Abort.
> > > > > > > > > >
> > > > > > > > > > That sounds like exactly what we want, doesn't it?
> > > > > > > > > >
> > > > > > > > > > Or do you *want* DMA from the previous kernel to complete?  That will
> > > > > > > > > > read or scribble on something, but maybe that's not terrible as long
> > > > > > > > > > as it's not memory used by the kdump kernel.
> > > > > > > > >
> > > > > > > > > Yes, Abort should happen. But it should happen in context of driver.
> > > > > > > > > But current abort is happening because of SMMU and no driver/pcie
> > > > > > > > > setup present at this moment.
> > > > > > > >
> > > > > > > > I don't understand what you mean by "in context of driver."  The whole
> > > > > > > > problem is that we can't control *when* the abort happens, so it may
> > > > > > > > happen in *any* context.  It may happen when a NIC receives a packet
> > > > > > > > or at some other unpredictable time.
> > > > > > > >
> > > > > > > > > Solution of this issue should be at 2 place
> > > > > > > > > a) SMMU level: I still believe, this patch has potential to overcome
> > > > > > > > > issue till finally driver's probe takeover.
> > > > > > > > > b) Device level: Even if something goes wrong. Driver/device should
> > > > > > > > > able to recover.
> > > > > > > > >
> > > > > > > > > > > Returned PCIe completion abort further leads to AER Errors from APEI
> > > > > > > > > > > Generic Hardware Error Source (GHES) with completion timeout.
> > > > > > > > > > > A network device hang is observed even after continuous
> > > > > > > > > > > reset/recovery from driver, Hence device is no more usable.
> > > > > > > > > >
> > > > > > > > > > The fact that the device is no longer usable is definitely a problem.
> > > > > > > > > > But in principle we *should* be able to recover from these errors.  If
> > > > > > > > > > we could recover and reliably use the device after the error, that
> > > > > > > > > > seems like it would be a more robust solution that having to add
> > > > > > > > > > special cases in every IOMMU driver.
> > > > > > > > > >
> > > > > > > > > > If you have details about this sort of error, I'd like to try to fix
> > > > > > > > > > it because we want to recover from that sort of error in normal
> > > > > > > > > > (non-crash) situations as well.
> > > > > > > > > >
> > > > > > > > > Completion abort case should be gracefully handled.  And device should
> > > > > > > > > always remain usable.
> > > > > > > > >
> > > > > > > > > There are 2 scenario which I am testing with Ethernet card PCIe-Intel
> > > > > > > > > 82576 Gigabit Network card.
> > > > > > > > >
> > > > > > > > > I)  Crash testing using kdump root file system: De-facto scenario
> > > > > > > > >     -  kdump file system does not have Ethernet driver
> > > > > > > > >     -  A lot of AER prints [1], making it impossible to work on shell
> > > > > > > > > of kdump root file system.
> > > > > > > >
> > > > > > > > In this case, I think report_error_detected() is deciding that because
> > > > > > > > the device has no driver, we can't do anything.  The flow is like
> > > > > > > > this:
> > > > > > > >
> > > > > > > >   aer_recover_work_func               # aer_recover_work
> > > > > > > >     kfifo_get(aer_recover_ring, entry)
> > > > > > > >     dev = pci_get_domain_bus_and_slot
> > > > > > > >     cper_print_aer(dev, ...)
> > > > > > > >       pci_err("AER: aer_status:")
> > > > > > > >       pci_err("AER:   [14] CmpltTO")
> > > > > > > >       pci_err("AER: aer_layer=")
> > > > > > > >     if (AER_NONFATAL)
> > > > > > > >       pcie_do_recovery(dev, pci_channel_io_normal)
> > > > > > > >         status = CAN_RECOVER
> > > > > > > >         pci_walk_bus(report_normal_detected)
> > > > > > > >           report_error_detected
> > > > > > > >             if (!dev->driver)
> > > > > > > >               vote = NO_AER_DRIVER
> > > > > > > >               pci_info("can't recover (no error_detected callback)")
> > > > > > > >             *result = merge_result(*, NO_AER_DRIVER)
> > > > > > > >             # always NO_AER_DRIVER
> > > > > > > >         status is now NO_AER_DRIVER
> > > > > > > >
> > > > > > > > So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(),
> > > > > > > > and status is not RECOVERED, so it skips .resume().
> > > > > > > >
> > > > > > > > I don't remember the history there, but if a device has no driver and
> > > > > > > > the device generates errors, it seems like we ought to be able to
> > > > > > > > reset it.
> > > > > > >
> > > > > > > But how to reset the device considering there is no driver.
> > > > > > > Hypothetically, this case should be taken care by PCIe subsystem to
> > > > > > > perform reset at PCIe level.
> > > > > >
> > > > > > I don't understand your question.  The PCI core (not the device
> > > > > > driver) already does the reset.  When pcie_do_recovery() calls
> > > > > > reset_link(), all devices on the other side of the link are reset.
> > > > > >
> > > > > > > > We should be able to field one (or a few) AER errors, reset the
> > > > > > > > device, and you should be able to use the shell in the kdump kernel.
> > > > > > > >
> > > > > > > here kdump shell is usable only problem is a "lot of AER Errors". One
> > > > > > > cannot see what they are typing.
> > > > > >
> > > > > > Right, that's what I expect.  If the PCI core resets the device, you
> > > > > > should get just a few AER errors, and they should stop after the
> > > > > > device is reset.
> > > > > >
> > > > > > > > >     -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.
> > > > > > > > >
> > > > > > > > > II) Crash testing using default root file system: Specific case to
> > > > > > > > > test Ethernet driver in second kernel
> > > > > > > > >    -  Default root file system have Ethernet driver
> > > > > > > > >    -  AER error comes even before the driver probe starts.
> > > > > > > > >    -  Driver does reset Ethernet card as part of probe but no success.
> > > > > > > > >    -  AER also tries to recover. but no success.  [2]
> > > > > > > > >    -  I also tries to remove AER errors by using "pci=noaer" bootargs
> > > > > > > > > and commenting ghes_handle_aer() from GHES driver..
> > > > > > > > >           than different set of errors come which also never able to recover [3]
> > > > > > > > >
> > > > > > >
> > > > > > > Please suggest your view on this case. Here driver is preset.
> > > > > > > (driver/net/ethernet/intel/igb/igb_main.c)
> > > > > > > In this case AER errors starts even before driver probe starts.
> > > > > > > After probe, driver does the device reset with no success and even AER
> > > > > > > recovery does not work.
> > > > > >
> > > > > > This case should be the same as the one above.  If we can change the
> > > > > > PCI core so it can reset the device when there's no driver,  that would
> > > > > > apply to case I (where there will never be a driver) and to case II
> > > > > > (where there is no driver now, but a driver will probe the device
> > > > > > later).
> > > > >
> > > > > Does this means change are required in PCI core.
> > > >
> > > > Yes, I am suggesting that the PCI core does not do the right thing
> > > > here.
> > > >
> > > > > I tried following changes in pcie_do_recovery() but it did not help.
> > > > > Same error as before.
> > > > >
> > > > > -- a/drivers/pci/pcie/err.c
> > > > > +++ b/drivers/pci/pcie/err.c
> > > > >         pci_info(dev, "broadcast resume message\n");
> > > > >         pci_walk_bus(bus, report_resume, &status);
> > > > > @@ -203,7 +207,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> > > > >         return status;
> > > > >
> > > > >  failed:
> > > > >         pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
> > > > > +       pci_reset_function(dev);
> > > > > +       pci_aer_clear_device_status(dev);
> > > > > +       pci_aer_clear_nonfatal_status(dev);
> > > >
> > > > Did you confirm that this resets the devices in question (0000:09:00.0
> > > > and 0000:09:00.1, I think), and what reset mechanism this uses (FLR,
> > > > PM, etc)?
> > >
> > > Earlier reset  was happening with P2P bridge(0000:00:09.0) this the
> > > reason no effect. After making following changes,  both devices are
> > > now getting reset.
> > > Both devices are using FLR.
> > >
> > > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> > > index 117c0a2b2ba4..26b908f55aef 100644
> > > --- a/drivers/pci/pcie/err.c
> > > +++ b/drivers/pci/pcie/err.c
> > > @@ -66,6 +66,20 @@ static int report_error_detected(struct pci_dev *dev,
> > >                 if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
> > >                         vote = PCI_ERS_RESULT_NO_AER_DRIVER;
> > >                         pci_info(dev, "can't recover (no
> > > error_detected callback)\n");
> > > +
> > > +                       pci_save_state(dev);
> > > +                       pci_cfg_access_lock(dev);
> > > +
> > > +                       /* Quiesce the device completely */
> > > +                       pci_write_config_word(dev, PCI_COMMAND,
> > > +                             PCI_COMMAND_INTX_DISABLE);
> > > +                       if (!__pci_reset_function_locked(dev)) {
> > > +                               vote = PCI_ERS_RESULT_RECOVERED;
> > > +                               pci_info(dev, "recovered via pci level
> > > reset\n");
> > > +                       }
> >
> > Why do we need to save the state and quiesce the device?  The reset
> > should disable interrupts anyway.  In this particular case where
> > there's no driver, I don't think we should have to restore the state.
> > We maybe should *remove* the device and re-enumerate it after the
> > reset, but the state from before the reset should be irrelevant.
> 
> I tried pci_reset_function_locked without save/restore then I got the
> synchronous abort during igb_probe (case 2 i.e. with driver). This is
> 100% reproducible.
> looks like pci_reset_function_locked is causing PCI configuration
> space random. Same is mentioned here
> https://www.kernel.org/doc/html/latest/driver-api/pci/pci.html

That documentation is poorly worded.  A reset doesn't make the
contents of config space "random," but of course it sets config space
registers to their initialization values, including things like the
device BARs.  After a reset, the device BARs are zero, so it won't
respond at the address we expect, and I'm sure that's what's causing
the external abort.

So I guess we *do* need to save the state before the reset and restore
it (either that or enumerate the device from scratch just like we
would if it had been hot-added).  I'm not really thrilled with trying
to save the state after the device has already reported an error.  I'd
rather do it earlier, maybe during enumeration, like in
pci_init_capabilities().  But I don't understand all the subtleties of
dev->state_saved, so that requires some legwork.

I don't think we should set INTX_DISABLE; the reset will make whatever
we do with it irrelevant anyway.

Remind me why the pci_cfg_access_lock()?

Bjorn

Prabhakar Kushwaha June 7, 2020, 8:30 a.m. UTC | #18

Hi Bjorn,

On Thu, Jun 4, 2020 at 5:32 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, Jun 03, 2020 at 11:12:48PM +0530, Prabhakar Kushwaha wrote:
> > On Sat, May 30, 2020 at 1:03 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Fri, May 29, 2020 at 07:48:10PM +0530, Prabhakar Kushwaha wrote:
> > > > On Thu, May 28, 2020 at 1:48 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > >
> > > > > On Wed, May 27, 2020 at 05:14:39PM +0530, Prabhakar Kushwaha wrote:
> > > > > > On Fri, May 22, 2020 at 4:19 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > On Thu, May 21, 2020 at 09:28:20AM +0530, Prabhakar Kushwaha wrote:
> > > > > > > > On Wed, May 20, 2020 at 4:52 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > > > On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
> > > > > > > > > > On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > > > > > > On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
> > > > > > > > > > > > An SMMU Stream table is created by the primary kernel. This table is
> > > > > > > > > > > > used by the SMMU to perform address translations for device-originated
> > > > > > > > > > > > transactions. Any crash (if happened) launches the kdump kernel which
> > > > > > > > > > > > re-creates the SMMU Stream table. New transactions will be translated
> > > > > > > > > > > > via this new table..
> > > > > > > > > > > >
> > > > > > > > > > > > There are scenarios, where devices are still having old pending
> > > > > > > > > > > > transactions (configured in the primary kernel). These transactions
> > > > > > > > > > > > come in-between Stream table creation and device-driver probe.
> > > > > > > > > > > > As new stream table does not have entry for older transactions,
> > > > > > > > > > > > it will be aborted by SMMU.
> > > > > > > > > > > >
> > > > > > > > > > > > Similar observations were found with PCIe-Intel 82576 Gigabit
> > > > > > > > > > > > Network card. It sends old Memory Read transaction in kdump kernel.
> > > > > > > > > > > > Transactions configured for older Stream table entries, that do not
> > > > > > > > > > > > exist any longer in the new table, will cause a PCIe Completion Abort.
> > > > > > > > > > >
> > > > > > > > > > > That sounds like exactly what we want, doesn't it?
> > > > > > > > > > >
> > > > > > > > > > > Or do you *want* DMA from the previous kernel to complete?  That will
> > > > > > > > > > > read or scribble on something, but maybe that's not terrible as long
> > > > > > > > > > > as it's not memory used by the kdump kernel.
> > > > > > > > > >
> > > > > > > > > > Yes, Abort should happen. But it should happen in context of driver.
> > > > > > > > > > But current abort is happening because of SMMU and no driver/pcie
> > > > > > > > > > setup present at this moment.
> > > > > > > > >
> > > > > > > > > I don't understand what you mean by "in context of driver."  The whole
> > > > > > > > > problem is that we can't control *when* the abort happens, so it may
> > > > > > > > > happen in *any* context.  It may happen when a NIC receives a packet
> > > > > > > > > or at some other unpredictable time.
> > > > > > > > >
> > > > > > > > > > Solution of this issue should be at 2 place
> > > > > > > > > > a) SMMU level: I still believe, this patch has potential to overcome
> > > > > > > > > > issue till finally driver's probe takeover.
> > > > > > > > > > b) Device level: Even if something goes wrong. Driver/device should
> > > > > > > > > > able to recover.
> > > > > > > > > >
> > > > > > > > > > > > Returned PCIe completion abort further leads to AER Errors from APEI
> > > > > > > > > > > > Generic Hardware Error Source (GHES) with completion timeout.
> > > > > > > > > > > > A network device hang is observed even after continuous
> > > > > > > > > > > > reset/recovery from driver, Hence device is no more usable.
> > > > > > > > > > >
> > > > > > > > > > > The fact that the device is no longer usable is definitely a problem.
> > > > > > > > > > > But in principle we *should* be able to recover from these errors.  If
> > > > > > > > > > > we could recover and reliably use the device after the error, that
> > > > > > > > > > > seems like it would be a more robust solution that having to add
> > > > > > > > > > > special cases in every IOMMU driver.
> > > > > > > > > > >
> > > > > > > > > > > If you have details about this sort of error, I'd like to try to fix
> > > > > > > > > > > it because we want to recover from that sort of error in normal
> > > > > > > > > > > (non-crash) situations as well.
> > > > > > > > > > >
> > > > > > > > > > Completion abort case should be gracefully handled.  And device should
> > > > > > > > > > always remain usable.
> > > > > > > > > >
> > > > > > > > > > There are 2 scenario which I am testing with Ethernet card PCIe-Intel
> > > > > > > > > > 82576 Gigabit Network card.
> > > > > > > > > >
> > > > > > > > > > I)  Crash testing using kdump root file system: De-facto scenario
> > > > > > > > > >     -  kdump file system does not have Ethernet driver
> > > > > > > > > >     -  A lot of AER prints [1], making it impossible to work on shell
> > > > > > > > > > of kdump root file system.
> > > > > > > > >
> > > > > > > > > In this case, I think report_error_detected() is deciding that because
> > > > > > > > > the device has no driver, we can't do anything.  The flow is like
> > > > > > > > > this:
> > > > > > > > >
> > > > > > > > >   aer_recover_work_func               # aer_recover_work
> > > > > > > > >     kfifo_get(aer_recover_ring, entry)
> > > > > > > > >     dev = pci_get_domain_bus_and_slot
> > > > > > > > >     cper_print_aer(dev, ...)
> > > > > > > > >       pci_err("AER: aer_status:")
> > > > > > > > >       pci_err("AER:   [14] CmpltTO")
> > > > > > > > >       pci_err("AER: aer_layer=")
> > > > > > > > >     if (AER_NONFATAL)
> > > > > > > > >       pcie_do_recovery(dev, pci_channel_io_normal)
> > > > > > > > >         status = CAN_RECOVER
> > > > > > > > >         pci_walk_bus(report_normal_detected)
> > > > > > > > >           report_error_detected
> > > > > > > > >             if (!dev->driver)
> > > > > > > > >               vote = NO_AER_DRIVER
> > > > > > > > >               pci_info("can't recover (no error_detected callback)")
> > > > > > > > >             *result = merge_result(*, NO_AER_DRIVER)
> > > > > > > > >             # always NO_AER_DRIVER
> > > > > > > > >         status is now NO_AER_DRIVER
> > > > > > > > >
> > > > > > > > > So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(),
> > > > > > > > > and status is not RECOVERED, so it skips .resume().
> > > > > > > > >
> > > > > > > > > I don't remember the history there, but if a device has no driver and
> > > > > > > > > the device generates errors, it seems like we ought to be able to
> > > > > > > > > reset it.
> > > > > > > >
> > > > > > > > But how to reset the device considering there is no driver.
> > > > > > > > Hypothetically, this case should be taken care by PCIe subsystem to
> > > > > > > > perform reset at PCIe level.
> > > > > > >
> > > > > > > I don't understand your question.  The PCI core (not the device
> > > > > > > driver) already does the reset.  When pcie_do_recovery() calls
> > > > > > > reset_link(), all devices on the other side of the link are reset.
> > > > > > >
> > > > > > > > > We should be able to field one (or a few) AER errors, reset the
> > > > > > > > > device, and you should be able to use the shell in the kdump kernel.
> > > > > > > > >
> > > > > > > > here kdump shell is usable only problem is a "lot of AER Errors". One
> > > > > > > > cannot see what they are typing.
> > > > > > >
> > > > > > > Right, that's what I expect.  If the PCI core resets the device, you
> > > > > > > should get just a few AER errors, and they should stop after the
> > > > > > > device is reset.
> > > > > > >
> > > > > > > > > >     -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.
> > > > > > > > > >
> > > > > > > > > > II) Crash testing using default root file system: Specific case to
> > > > > > > > > > test Ethernet driver in second kernel
> > > > > > > > > >    -  Default root file system have Ethernet driver
> > > > > > > > > >    -  AER error comes even before the driver probe starts.
> > > > > > > > > >    -  Driver does reset Ethernet card as part of probe but no success.
> > > > > > > > > >    -  AER also tries to recover. but no success.  [2]
> > > > > > > > > >    -  I also tries to remove AER errors by using "pci=noaer" bootargs
> > > > > > > > > > and commenting ghes_handle_aer() from GHES driver..
> > > > > > > > > >           than different set of errors come which also never able to recover [3]
> > > > > > > > > >
> > > > > > > >
> > > > > > > > Please suggest your view on this case. Here driver is preset.
> > > > > > > > (driver/net/ethernet/intel/igb/igb_main.c)
> > > > > > > > In this case AER errors starts even before driver probe starts.
> > > > > > > > After probe, driver does the device reset with no success and even AER
> > > > > > > > recovery does not work.
> > > > > > >
> > > > > > > This case should be the same as the one above.  If we can change the
> > > > > > > PCI core so it can reset the device when there's no driver,  that would
> > > > > > > apply to case I (where there will never be a driver) and to case II
> > > > > > > (where there is no driver now, but a driver will probe the device
> > > > > > > later).
> > > > > >
> > > > > > Does this means change are required in PCI core.
> > > > >
> > > > > Yes, I am suggesting that the PCI core does not do the right thing
> > > > > here.
> > > > >
> > > > > > I tried following changes in pcie_do_recovery() but it did not help.
> > > > > > Same error as before.
> > > > > >
> > > > > > -- a/drivers/pci/pcie/err.c
> > > > > > +++ b/drivers/pci/pcie/err.c
> > > > > >         pci_info(dev, "broadcast resume message\n");
> > > > > >         pci_walk_bus(bus, report_resume, &status);
> > > > > > @@ -203,7 +207,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> > > > > >         return status;
> > > > > >
> > > > > >  failed:
> > > > > >         pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
> > > > > > +       pci_reset_function(dev);
> > > > > > +       pci_aer_clear_device_status(dev);
> > > > > > +       pci_aer_clear_nonfatal_status(dev);
> > > > >
> > > > > Did you confirm that this resets the devices in question (0000:09:00.0
> > > > > and 0000:09:00.1, I think), and what reset mechanism this uses (FLR,
> > > > > PM, etc)?
> > > >
> > > > Earlier reset  was happening with P2P bridge(0000:00:09.0) this the
> > > > reason no effect. After making following changes,  both devices are
> > > > now getting reset.
> > > > Both devices are using FLR.
> > > >
> > > > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> > > > index 117c0a2b2ba4..26b908f55aef 100644
> > > > --- a/drivers/pci/pcie/err.c
> > > > +++ b/drivers/pci/pcie/err.c
> > > > @@ -66,6 +66,20 @@ static int report_error_detected(struct pci_dev *dev,
> > > >                 if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
> > > >                         vote = PCI_ERS_RESULT_NO_AER_DRIVER;
> > > >                         pci_info(dev, "can't recover (no
> > > > error_detected callback)\n");
> > > > +
> > > > +                       pci_save_state(dev);
> > > > +                       pci_cfg_access_lock(dev);
> > > > +
> > > > +                       /* Quiesce the device completely */
> > > > +                       pci_write_config_word(dev, PCI_COMMAND,
> > > > +                             PCI_COMMAND_INTX_DISABLE);
> > > > +                       if (!__pci_reset_function_locked(dev)) {
> > > > +                               vote = PCI_ERS_RESULT_RECOVERED;
> > > > +                               pci_info(dev, "recovered via pci level
> > > > reset\n");
> > > > +                       }
> > >
> > > Why do we need to save the state and quiesce the device?  The reset
> > > should disable interrupts anyway.  In this particular case where
> > > there's no driver, I don't think we should have to restore the state.
> > > We maybe should *remove* the device and re-enumerate it after the
> > > reset, but the state from before the reset should be irrelevant.
> >
> > I tried pci_reset_function_locked without save/restore then I got the
> > synchronous abort during igb_probe (case 2 i.e. with driver). This is
> > 100% reproducible.
> > looks like pci_reset_function_locked is causing PCI configuration
> > space random. Same is mentioned here
> > https://www.kernel.org/doc/html/latest/driver-api/pci/pci.html
>
> That documentation is poorly worded.  A reset doesn't make the
> contents of config space "random," but of course it sets config space
> registers to their initialization values, including things like the
> device BARs.  After a reset, the device BARs are zero, so it won't
> respond at the address we expect, and I'm sure that's what's causing
> the external abort.
>
> So I guess we *do* need to save the state before the reset and restore
> it (either that or enumerate the device from scratch just like we
> would if it had been hot-added).  I'm not really thrilled with trying
> to save the state after the device has already reported an error.  I'd
> rather do it earlier, maybe during enumeration, like in
> pci_init_capabilities().  But I don't understand all the subtleties of
> dev->state_saved, so that requires some legwork.
>

I tried moving pci_save_state earlier. All observations are the same
as mentioned in earlier discussions.

Some modifications are required in pci_restore_state() as by default
it makes dev->state_saved = false after restore. .
So the next AER causes the earlier mentioned
crash(igb_get_invariants_82575 --> igb_rd32).  It is because
pci_restore_state() returns without restoring any state.

Code changes are below [1]

> I don't think we should set INTX_DISABLE; the reset will make whatever
> we do with it irrelevant anyway.
>
Yes.. It is not required.

> Remind me why the pci_cfg_access_lock()?

I thought of the race conditions between AER (save/restore) and
igb_probe. So I added this.
It is not required as lock is inherently "taken care" in both AER (bus
walk) and igb_probe by the framework.

[1]
root@localhost$ git diff
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 595fcf59843f..35396eb4fd9e 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1537,11 +1537,7 @@ static void pci_restore_rebar_state(struct pci_dev *pdev)
        }
 }

-/**
- * pci_restore_state - Restore the saved state of a PCI device
- * @dev: PCI device that we're dealing with
- */
-void pci_restore_state(struct pci_dev *dev)
+void __pci_restore_state(struct pci_dev *dev, int retain_state)
 {
        if (!dev->state_saved)
                return;
@@ -1572,10 +1568,26 @@ void pci_restore_state(struct pci_dev *dev)
        pci_enable_acs(dev);
        pci_restore_iov_state(dev);

-       dev->state_saved = false;
+       if (!retain_state)
+               dev->state_saved = false;
+}
+
+/**
+ * pci_restore_state - Restore the saved state of a PCI device
+ * @dev: PCI device that we're dealing with
+ */
+void pci_restore_state(struct pci_dev *dev)
+{
+       __pci_restore_state(dev, 0);
 }
 EXPORT_SYMBOL(pci_restore_state);

+void pci_restore_retain_state(struct pci_dev *dev)
+{
+       __pci_restore_state(dev, 1);
+}
+EXPORT_SYMBOL(pci_restore_retain_state);
+
 struct pci_saved_state {
        u32 config_space[16];
        struct pci_cap_saved_data cap[0];
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 14bb8f54723e..621eaa34bf9f 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -66,6 +66,13 @@ static int report_error_detected(struct pci_dev *dev,
                if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
                        vote = PCI_ERS_RESULT_NO_AER_DRIVER;
                        pci_info(dev, "can't recover (no
error_detected callback)\n");
+
+                       if (!__pci_reset_function_locked(dev)) {
+                               vote = PCI_ERS_RESULT_RECOVERED;
+                               pci_info(dev, "Recovered via pci level
reset\n");
+                       }
+
+                       pci_restore_retain_state(dev);
                } else {
                        vote = PCI_ERS_RESULT_NONE;
                }
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 77b8a145c39b..af4e27c95421 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2448,6 +2448,8 @@ void pci_device_add(struct pci_dev *dev, struct
pci_bus *bus)

        pci_init_capabilities(dev);

+       pci_save_state(dev);
+
        /*
         * Add the device to our list of discovered devices
         * and the bus list for fixup functions, etc.
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 83ce1cdf5676..42ab7ef850b7 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1234,6 +1234,7 @@ void pci_unmap_rom(struct pci_dev *pdev, void
__iomem *rom);

 /* Power management related routines */
 int pci_save_state(struct pci_dev *dev);
+void pci_restore_retain_state(struct pci_dev *dev);
 void pci_restore_state(struct pci_dev *dev);
 struct pci_saved_state *pci_store_saved_state(struct pci_dev *dev);
 int pci_load_saved_state(struct pci_dev *dev,

--pk

Will Deacon June 8, 2020, 11:41 a.m. UTC | #19

On Tue, Jun 02, 2020 at 07:34:47PM +0530, Prabhakar Kushwaha wrote:
> On Mon, Jun 1, 2020 at 1:10 PM Will Deacon <will@kernel.org> wrote:
> > On Thu, May 21, 2020 at 04:52:02PM +0530, Prabhakar Kushwaha wrote:
> > > On Thu, May 21, 2020 at 2:53 PM Will Deacon <will@kernel.org> wrote:
> > > > On Tue, May 19, 2020 at 08:24:21AM +0530, Prabhakar Kushwaha wrote:
> > > > > What kind of issue you are foreseeing in using memcpy(). May be we can
> > > > > try to find a solution.
> > > >
> > > > Well the thing might not be cache-coherent to start with...
> > > >
> > >
> > > Thanks for telling possible issue area.  Let me try to explain why
> > > this should not be an issue.
> > >
> > > kdump kernel runs from reserved memory space defined during the boot
> > > of first kernel. kdump does not touch memory of the previous kernel.
> > > So no page has been created in kdump kernel  and  there should not be
> > > any data/attribute/coherency issue from MMU point of view .
> >
> > Then how does this work?:
> >
> >         rdcfg.strtab = memremap(rdcfg.strtab_dma, size, MEMREMAP_WB);
> >
> > You're explicitly asking for a write-back mapping.
> >
> 
> As i mentioned earlier, I will replace it with MEMREMAP_WT to make
> sure data is written into the memory.
> 
> Please note, this memmap is temporary for copying older SMMU table to
> cfg->strtab.
> Here, cfg->strtab & cfg->strtab_dma allocated via dmam_alloc_coherent
> during SMMU probe.
> 
> 
> > > During SMMU probe functions,  dmem_alloc_coherent() will be used
> > > allocate new memory (part of existing flow).
> > > This patch copy STE or first level descriptor to *this* memory, after
> > > mapping physical address using memremap().
> > > It just copy everything  so there should not be any issue related to
> > > attribute/content.
> > >
> > > Yes, copying  done after mapping it as MEMREMAP_WB. if you want I can
> > > use it as MEMREMAP_WT
> >
> > You need to take into account whether or not the device is coherent, and the
> > DMA API is designed to handle that for you. But even then, this is fragile
> > as hell because you end up having to infer the hardware configuration
> > from the device to understand the size and format of the data structures.
> > If the crashkernel isn't identical to the host kernel (in terms of kconfig,
> > driver version, firmware tables, cmdline etc) then this is very likely to
> > go wrong.
> 
> There are two possible scenarios for mismatched kdump kernel
> 1.  kdump kernel does not have the devices' driver
> 2.  kdump kernel have the different variation/configuration of driver
> 
> This patch create temporary SMMU table entries which are overwritten
> by driver-probe.

What exactly does this achieve, given that you don't copy the context
descriptors or the page tables?

> Driver's probe will overwrite SMMU entries based on its new
> requirement (size, format, data structures etc).
> 
> for "1",  As  no device driver,  SMMU entry will remain there.
> Means no-one looking for the copied content (even if device continued
> to perform DMA).
> 
> About coherency between Cores and Memory(DMA).
> At the time of crash:  Only one CPU is allowed to remain continue,
> rest are stopped.
> __crash_kexec --> machine_crash_shutdown --> crash_smp_send_stop()
> 
> The active CPU is used to boot kdump kernel. hence none of the CPUs is
> looking for data copied by DMA.
> Coherency issue should not be there.

I'm talking about coherency between the SMMU and the CPU, so I don't think
the number of CPUs is relevant.

> please let me know your view.

It still seems extremely fragile to me, so I continue to think that this
is the wrong approach.

Will

Bjorn Helgaas June 11, 2020, 11:03 p.m. UTC | #20

On Sun, Jun 07, 2020 at 02:00:35PM +0530, Prabhakar Kushwaha wrote:
> On Thu, Jun 4, 2020 at 5:32 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Wed, Jun 03, 2020 at 11:12:48PM +0530, Prabhakar Kushwaha wrote:
> > > On Sat, May 30, 2020 at 1:03 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Fri, May 29, 2020 at 07:48:10PM +0530, Prabhakar Kushwaha wrote:
<snip>

> > > > > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> > > > > index 117c0a2b2ba4..26b908f55aef 100644
> > > > > --- a/drivers/pci/pcie/err.c
> > > > > +++ b/drivers/pci/pcie/err.c
> > > > > @@ -66,6 +66,20 @@ static int report_error_detected(struct pci_dev *dev,
> > > > >                 if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
> > > > >                         vote = PCI_ERS_RESULT_NO_AER_DRIVER;
> > > > >                         pci_info(dev, "can't recover (no
> > > > > error_detected callback)\n");
> > > > > +
> > > > > +                       pci_save_state(dev);
> > > > > +                       pci_cfg_access_lock(dev);
> > > > > +
> > > > > +                       /* Quiesce the device completely */
> > > > > +                       pci_write_config_word(dev, PCI_COMMAND,
> > > > > +                             PCI_COMMAND_INTX_DISABLE);
> > > > > +                       if (!__pci_reset_function_locked(dev)) {
> > > > > +                               vote = PCI_ERS_RESULT_RECOVERED;
> > > > > +                               pci_info(dev, "recovered via pci level
> > > > > reset\n");
> > > > > +                       }
> >
> > So I guess we *do* need to save the state before the reset and restore
> > it (either that or enumerate the device from scratch just like we
> > would if it had been hot-added).  I'm not really thrilled with trying
> > to save the state after the device has already reported an error.  I'd
> > rather do it earlier, maybe during enumeration, like in
> > pci_init_capabilities().  But I don't understand all the subtleties of
> > dev->state_saved, so that requires some legwork.
> 
> I tried moving pci_save_state earlier. All observations are the same
> as mentioned in earlier discussions.

By "legwork", I didn't mean just trying things to see whether they
seem to work.  I meant researching the history to find out *why* it's
designed the way it is so that when we change it, we don't break
things.

For example, these commits are obviously important to understand:

  aa8c6c93747f ("PCI PM: Restore standard config registers of all devices early")
  c82f63e411f1 ("PCI: check saved state before restore")
  4b77b0a2ba27 ("PCI: Clear saved_state after the state has been restored")

I think we need to step back and separate this AER issue from the
whole SMMU table copying thing.  Then do the research and start a
new thread with a patch to fix just the AER issue.

The ARM guys would probably be grateful to be dropped from the AER
thread because it really has nothing to do with ARM.

Bjorn

[v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel

Commit Message

Comments

Patch