diff mbox series

[PATCHv2,2/2] PCI: err: ensure stable topology during handling

Message ID 20240612181625.3604512-3-kbusch@meta.com (mailing list archive)
State Superseded
Delegated to: Bjorn Helgaas
Headers show
Series pcie hotplug and error fixes | expand

Commit Message

Keith Busch June 12, 2024, 6:16 p.m. UTC
From: Keith Busch <kbusch@kernel.org>

DPC and AER handling access their subordinate bus devices. If pciehp should
happen to also trigger during this handling, it will remove all the subordinate
buses, then dereferecing any children may be a use-after-free. That may lead to
kernel panics like the below.

 BUG: unable to handle page fault for address: 00000000091400c0
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] SMP
 CPU: 15 PID: 2464 Comm: irq/53-pcie-dpc Kdump: loaded Tainted: G            E      6.9.0-0_fbk0_rc10_871_g4e98bf884071 #1
 RIP: 0010:pci_bus_read_config_dword+0x17/0x50
 Code: e9 0e 00 00 00 c7 01 ff ff ff ff b8 86 00 00 00 c3 cc cc 0f 1f 44 00 00 53 50 c7 44 24 04 00 00 00 00 f6 c2 03 75 27 48 89 cb <48> 8b 87 c0 00 00 00 4c 8d 44 24 04 b9 04 00 00 00 ff 50 18 85 c0
 RSP: 0018:ffffc90039113d60 EFLAGS: 00010246
 RAX: 0000000009140000 RBX: ffffc90039113d7c RCX: ffffc90039113d7c
 RDX: 0000000000000004 RSI: 0000000000000000 RDI: 0000000009140000
 RBP: 0000000000000100 R08: 0000000000000000 R09: 0000000000000001
 R10: 0000000000000000 R11: 0000001f975c6971 R12: 000000000000e9fc
 R13: ffff88811b5b4000 R14: ffffc90039113d7c R15: 0000000000000000
 FS:  0000000000000000(0000) GS:ffff899f7d3c0000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00000000091400c0 CR3: 00000243fb00f002 CR4: 0000000000770ef0
 PKRU: 55555554
 Call Trace:
  <TASK>
  ? __die+0x78/0xc0
  ? page_fault_oops+0x2a8/0x3a0
  ? sched_clock+0x5/0x10
  ? psi_task_switch+0x39/0xc90
  ? __switch_to+0x131/0x530
  ? exc_page_fault+0x63/0x130
  ? asm_exc_page_fault+0x22/0x30
  ? pci_bus_read_config_dword+0x17/0x50
  pci_dev_wait+0x107/0x190
  ? dpc_completed+0x50/0x50
  dpc_reset_link+0x4e/0xd0
  pcie_do_recovery+0xb2/0x2d0
  ? irq_forced_thread_fn+0x60/0x60
  dpc_handler+0x107/0x130
  irq_thread_fn+0x19/0x40
  irq_thread+0x120/0x1e0
  ? irq_thread_fn+0x40/0x40
  ? irq_forced_secondary_handler+0x20/0x20
  kthread+0xae/0xe0
  ? file_tty_write+0x360/0x360
  ret_from_fork+0x2f/0x40
  ? file_tty_write+0x360/0x360
  ret_from_fork_asm+0x11/0x20
  </TASK>

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/pci/pcie/err.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Lukas Wunner June 15, 2024, 2:09 p.m. UTC | #1
On Wed, Jun 12, 2024 at 11:16:25AM -0700, Keith Busch wrote:
> DPC and AER handling access their subordinate bus devices. If pciehp
> should happen to also trigger during this handling, it will remove
> all the subordinate buses, then dereferecing any children may be a
> use-after-free. That may lead to kernel panics like the below.

I assume the crash occurs because the struct pci_bus accessed by
pci_bus_read_config_dword() has already been freed?

Generally the solution to issues like this is to hold references
on the structs being accessed, not to acquire locks that happen
to prevent the structs from being freed.

Question is, which struct ref needs to be held and where?

Holding a ref on a struct pci_dev also holds the pci_bus it resides on
in place.  So I suspect we need to call pci_dev_get() somewhere.

The stack trace looks incomplete for some reason:

>   ? pci_bus_read_config_dword+0x17/0x50
>   pci_dev_wait+0x107/0x190
>   ? dpc_completed+0x50/0x50
>   dpc_reset_link+0x4e/0xd0
>   pcie_do_recovery+0xb2/0x2d0

I'd expect a call to pci_bridge_wait_for_secondary_bus() from
dpc_reset_link(), which in turn calls pci_dev_wait().

Indeed pci_bridge_wait_for_secondary_bus() does something fishy:
It takes the first entry from the devices list without acquiring
a ref:

	child = list_first_entry(&dev->subordinate->devices, struct pci_dev,
				 bus_list);

Below is a small patch which acquires a ref on child.  Maybe this
already does the trick?

-- >8 --

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 2a8063e..82db9a8 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4753,7 +4753,7 @@ static int pci_bus_max_d3cold_delay(const struct pci_bus *bus)
  */
 int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type)
 {
-	struct pci_dev *child;
+	struct pci_dev *child __free(pci_dev_put) = NULL;
 	int delay;
 
 	if (pci_dev_is_disconnected(dev))
@@ -4782,8 +4782,9 @@ int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type)
 		return 0;
 	}
 
-	child = list_first_entry(&dev->subordinate->devices, struct pci_dev,
-				 bus_list);
+
+	child = pci_dev_get(list_first_entry(&dev->subordinate->devices,
+					     struct pci_dev, bus_list));
 	up_read(&pci_bus_sem);
 
 	/*
Keith Busch June 17, 2024, 5:04 p.m. UTC | #2
On Sat, Jun 15, 2024 at 04:09:41PM +0200, Lukas Wunner wrote:
> Below is a small patch which acquires a ref on child.  Maybe this
> already does the trick?

Thanks, it appears your idea works! I've run it through 100 test
iterations successfully; previously it would reliably panic within 5 or
fewer iterations.

I also need the first patch from this series, though, otherwise it
inevitably deadlocks. I'd be interested to hear if you have any thoughts
on that one.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Tested-by: Keith Busch <kbusch@kernel.org>

> -- >8 --
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 2a8063e..82db9a8 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4753,7 +4753,7 @@ static int pci_bus_max_d3cold_delay(const struct pci_bus *bus)
>   */
>  int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type)
>  {
> -	struct pci_dev *child;
> +	struct pci_dev *child __free(pci_dev_put) = NULL;
>  	int delay;
>  
>  	if (pci_dev_is_disconnected(dev))
> @@ -4782,8 +4782,9 @@ int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type)
>  		return 0;
>  	}
>  
> -	child = list_first_entry(&dev->subordinate->devices, struct pci_dev,
> -				 bus_list);
> +
> +	child = pci_dev_get(list_first_entry(&dev->subordinate->devices,
> +					     struct pci_dev, bus_list));
>  	up_read(&pci_bus_sem);
>  
>  	/*
diff mbox series

Patch

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 31090770fffcc..5355fc0fbf910 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -192,7 +192,7 @@  pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 		pci_channel_state_t state,
 		pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
 {
-	int type = pci_pcie_type(dev);
+	int type = pci_pcie_type(dev), ret;
 	struct pci_dev *bridge;
 	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
 	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
@@ -214,6 +214,10 @@  pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	else
 		bridge = pci_upstream_bridge(dev);
 
+
+	ret = pci_trylock_rescan_remove(bridge);
+	if (!ret)
+		return PCI_ERS_RESULT_DISCONNECT;
 	pci_walk_bridge(bridge, pci_pm_runtime_get_sync, NULL);
 
 	pci_dbg(bridge, "broadcast error_detected message\n");
@@ -262,12 +266,14 @@  pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	}
 
 	pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);
+	pci_unlock_rescan_remove();
 
 	pci_info(bridge, "device recovery successful\n");
 	return status;
 
 failed:
 	pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);
+	pci_unlock_rescan_remove();
 
 	pci_uevent_ers(bridge, PCI_ERS_RESULT_DISCONNECT);