diff mbox series

[v3] PCI: Fix crash during pci_dev hot-unplug on pseries KVM guest

Message ID 20240802183327.1309020-1-amachhiw@linux.ibm.com (mailing list archive)
State New
Delegated to: Bjorn Helgaas
Headers show
Series [v3] PCI: Fix crash during pci_dev hot-unplug on pseries KVM guest | expand

Commit Message

Amit Machhiwal Aug. 2, 2024, 6:33 p.m. UTC
With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
of a PCI device attached to a PCI-bridge causes following kernel Oops on
a pseries KVM guest:

 RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
 Kernel attempted to read user page (10ec00000048) - exploit attempt? (uid: 0)
 BUG: Unable to handle kernel data access on read at 0x10ec00000048
 Faulting instruction address: 0xc0000000012d8728
 Oops: Kernel access of bad area, sig: 11 [#1]
 LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
<snip>
 NIP [c0000000012d8728] __of_changeset_entry_invert+0x10/0x1ac
 LR [c0000000012da7f0] __of_changeset_revert_entries+0x98/0x180
 Call Trace:
 [c00000000bcc3970] [c0000000012daa60] of_changeset_revert+0x58/0xd8
 [c00000000bcc39c0] [c000000000d0ed78] of_pci_remove_node+0x74/0xb0
 [c00000000bcc39f0] [c000000000cdcfe0] pci_stop_bus_device+0xf4/0x138
 [c00000000bcc3a30] [c000000000cdd140] pci_stop_and_remove_bus_device_locked+0x34/0x64
 [c00000000bcc3a60] [c000000000cf3780] remove_store+0xf0/0x108
 [c00000000bcc3ab0] [c000000000e89e04] dev_attr_store+0x34/0x78
 [c00000000bcc3ad0] [c0000000007f8dd4] sysfs_kf_write+0x70/0xa4
 [c00000000bcc3af0] [c0000000007f7248] kernfs_fop_write_iter+0x1d0/0x2e0
 [c00000000bcc3b40] [c0000000006c9b08] vfs_write+0x27c/0x558
 [c00000000bcc3bf0] [c0000000006ca168] ksys_write+0x90/0x170
 [c00000000bcc3c40] [c000000000033248] system_call_exception+0xf8/0x290
 [c00000000bcc3e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
<snip>

A git bisect pointed this regression to be introduced via [1] that added
a mechanism to create device tree nodes for parent PCI bridges when a
PCI device is hot-plugged.

The Oops is caused when `pci_stop_dev()` tries to remove a non-existing
device-tree node associated with the pci_dev that was earlier
hot-plugged and was attached under a pci-bridge. The PCI dev header
`dev->hdr_type` being 0, results a conditional check done with
`pci_is_bridge()` into false. Consequently, a call to
`of_pci_make_dev_node()` to create a device node is never made. When at
a later point in time, in the device node removal path, a memcpy is
attempted in `__of_changeset_entry_invert()`; since the device node was
never created, results in an Oops due to kernel read access to a bad
address.

To fix this issue, the patch introduces a new flag OF_CREATE_WITH_CSET
to keep track of device nodes created via `of_pci_make_dev_node()` and
later attempt to destroy only such device nodes which have this flag
set.

[1] commit 407d1a51921e ("PCI: Create device tree node for bridge")

Fixes: 407d1a51921e ("PCI: Create device tree node for bridge")
Reported-by: Kowshik Jois B S <kowsjois@linux.ibm.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
---
Changes since v2:
    * Drop v2 changes and introduce a different approach from Lizhi discussed
      over the v2 of this patch
    * V2: https://lore.kernel.org/all/20240715080726.2496198-1-amachhiw@linux.ibm.com/
Changes since v1:
    * Included Lizhi's suggested changes on V1
    * Fixed below two warnings from Lizhi's changes and rearranged the cleanup
      part a bit in `of_pci_make_dev_node`
	drivers/pci/of.c:611:6: warning: no previous prototype for ‘of_pci_free_node’ [-Wmissing-prototypes]
	  611 | void of_pci_free_node(struct device_node *np)
	      |      ^~~~~~~~~~~~~~~~               
	drivers/pci/of.c: In function ‘of_pci_make_dev_node’:
	drivers/pci/of.c:696:1: warning: label ‘out_destroy_cset’ defined but not used [-Wunused-label]
	  696 | out_destroy_cset:       
	      | ^~~~~~~~~~~~~~~~  
    * V1: https://lore.kernel.org/all/20240703141634.2974589-1-amachhiw@linux.ibm.com/

 drivers/pci/of.c   | 3 ++-
 include/linux/of.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)


base-commit: 948752d2e010e11b56a877975e7e9158d6d31823

Comments

Bjorn Helgaas Aug. 6, 2024, 8 p.m. UTC | #1
On Sat, Aug 03, 2024 at 12:03:25AM +0530, Amit Machhiwal wrote:
> With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
> of a PCI device attached to a PCI-bridge causes following kernel Oops on
> a pseries KVM guest:

What is unique about pseries here?  There's nothing specific to
pseries in the patch, so I would expect this to be a generic problem
on any arch.

>  RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
>  Kernel attempted to read user page (10ec00000048) - exploit attempt? (uid: 0)
>  BUG: Unable to handle kernel data access on read at 0x10ec00000048

Weird address.  I would expect NULL or something.  Where did this
non-NULL pointer come from?

>  Faulting instruction address: 0xc0000000012d8728
>  Oops: Kernel access of bad area, sig: 11 [#1]
>  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> <snip>
>  NIP [c0000000012d8728] __of_changeset_entry_invert+0x10/0x1ac
>  LR [c0000000012da7f0] __of_changeset_revert_entries+0x98/0x180
>  Call Trace:
>  [c00000000bcc3970] [c0000000012daa60] of_changeset_revert+0x58/0xd8
>  [c00000000bcc39c0] [c000000000d0ed78] of_pci_remove_node+0x74/0xb0
>  [c00000000bcc39f0] [c000000000cdcfe0] pci_stop_bus_device+0xf4/0x138
>  [c00000000bcc3a30] [c000000000cdd140] pci_stop_and_remove_bus_device_locked+0x34/0x64
>  [c00000000bcc3a60] [c000000000cf3780] remove_store+0xf0/0x108
>  [c00000000bcc3ab0] [c000000000e89e04] dev_attr_store+0x34/0x78
>  [c00000000bcc3ad0] [c0000000007f8dd4] sysfs_kf_write+0x70/0xa4
>  [c00000000bcc3af0] [c0000000007f7248] kernfs_fop_write_iter+0x1d0/0x2e0
>  [c00000000bcc3b40] [c0000000006c9b08] vfs_write+0x27c/0x558
>  [c00000000bcc3bf0] [c0000000006ca168] ksys_write+0x90/0x170
>  [c00000000bcc3c40] [c000000000033248] system_call_exception+0xf8/0x290
>  [c00000000bcc3e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
> <snip>
> 
> A git bisect pointed this regression to be introduced via [1] that added
> a mechanism to create device tree nodes for parent PCI bridges when a
> PCI device is hot-plugged.
> 
> The Oops is caused when `pci_stop_dev()` tries to remove a non-existing
> device-tree node associated with the pci_dev that was earlier
> hot-plugged and was attached under a pci-bridge. The PCI dev header
> `dev->hdr_type` being 0, results a conditional check done with
> `pci_is_bridge()` into false. Consequently, a call to
> `of_pci_make_dev_node()` to create a device node is never made. When at
> a later point in time, in the device node removal path, a memcpy is
> attempted in `__of_changeset_entry_invert()`; since the device node was
> never created, results in an Oops due to kernel read access to a bad
> address.

I'm sure this description is 100% correct, but it's at such a low
level that it doesn't really help understand the underlying design
problem.

Will need an ack from Rob.

> To fix this issue, the patch introduces a new flag OF_CREATE_WITH_CSET
> to keep track of device nodes created via `of_pci_make_dev_node()` and
> later attempt to destroy only such device nodes which have this flag
> set.
> 
> [1] commit 407d1a51921e ("PCI: Create device tree node for bridge")
> 
> Fixes: 407d1a51921e ("PCI: Create device tree node for bridge")
> Reported-by: Kowshik Jois B S <kowsjois@linux.ibm.com>
> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
> Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
> ---
> Changes since v2:
>     * Drop v2 changes and introduce a different approach from Lizhi discussed
>       over the v2 of this patch
>     * V2: https://lore.kernel.org/all/20240715080726.2496198-1-amachhiw@linux.ibm.com/
> Changes since v1:
>     * Included Lizhi's suggested changes on V1
>     * Fixed below two warnings from Lizhi's changes and rearranged the cleanup
>       part a bit in `of_pci_make_dev_node`
> 	drivers/pci/of.c:611:6: warning: no previous prototype for ‘of_pci_free_node’ [-Wmissing-prototypes]
> 	  611 | void of_pci_free_node(struct device_node *np)
> 	      |      ^~~~~~~~~~~~~~~~               
> 	drivers/pci/of.c: In function ‘of_pci_make_dev_node’:
> 	drivers/pci/of.c:696:1: warning: label ‘out_destroy_cset’ defined but not used [-Wunused-label]
> 	  696 | out_destroy_cset:       
> 	      | ^~~~~~~~~~~~~~~~  
>     * V1: https://lore.kernel.org/all/20240703141634.2974589-1-amachhiw@linux.ibm.com/
> 
>  drivers/pci/of.c   | 3 ++-
>  include/linux/of.h | 1 +
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> index dacea3fc5128..bc455370143e 100644
> --- a/drivers/pci/of.c
> +++ b/drivers/pci/of.c
> @@ -653,7 +653,7 @@ void of_pci_remove_node(struct pci_dev *pdev)
>  	struct device_node *np;
>  
>  	np = pci_device_to_OF_node(pdev);
> -	if (!np || !of_node_check_flag(np, OF_DYNAMIC))
> +	if (!np || !of_node_check_flag(np, OF_CREATE_WITH_CSET))
>  		return;
>  	pdev->dev.of_node = NULL;

of_pci_remove_node() goes on to call of_changeset_revert() and
of_changeset_destroy().  of_pci_remove_node() has nothing PCI-specific
in it.

There are a few other callers of of_changeset_destroy(), but they
don't look anything like of_pci_remove_node().  Seems like there
should be some similarity across the callers, so it makes me a little
nervous that there isn't.

> @@ -712,6 +712,7 @@ void of_pci_make_dev_node(struct pci_dev *pdev)
>  	if (ret)
>  		goto out_free_node;
>  
> +	of_node_set_flag(np, OF_CREATE_WITH_CSET);
>  	np->data = cset;
>  	pdev->dev.of_node = np;
>  	kfree(name);
> diff --git a/include/linux/of.h b/include/linux/of.h
> index 85b60ac9eec5..5faa5a1198c6 100644
> --- a/include/linux/of.h
> +++ b/include/linux/of.h
> @@ -153,6 +153,7 @@ extern struct device_node *of_stdout;
>  #define OF_POPULATED_BUS	4 /* platform bus created for children */
>  #define OF_OVERLAY		5 /* allocated for an overlay */
>  #define OF_OVERLAY_FREE_CSET	6 /* in overlay cset being freed */
> +#define OF_CREATE_WITH_CSET	7 /* Created by of_changeset_create_node */

Follow existing capitalization style.
Rob Herring (Arm) Aug. 13, 2024, 4:43 p.m. UTC | #2
On Tue, Aug 06, 2024 at 03:00:59PM -0500, Bjorn Helgaas wrote:
> On Sat, Aug 03, 2024 at 12:03:25AM +0530, Amit Machhiwal wrote:
> > With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
> > of a PCI device attached to a PCI-bridge causes following kernel Oops on
> > a pseries KVM guest:
> 
> What is unique about pseries here?  There's nothing specific to
> pseries in the patch, so I would expect this to be a generic problem
> on any arch.

Only pseries does PCI hotplug with DT I think. It would happen if 
another system did though, but I think documenting the exact system is 
good. No reason we can't do both though.

Rob
Rob Herring (Arm) Aug. 13, 2024, 4:44 p.m. UTC | #3
On Sat, 03 Aug 2024 00:03:25 +0530, Amit Machhiwal wrote:
> With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
> of a PCI device attached to a PCI-bridge causes following kernel Oops on
> a pseries KVM guest:
> 
>  RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
>  Kernel attempted to read user page (10ec00000048) - exploit attempt? (uid: 0)
>  BUG: Unable to handle kernel data access on read at 0x10ec00000048
>  Faulting instruction address: 0xc0000000012d8728
>  Oops: Kernel access of bad area, sig: 11 [#1]
>  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> <snip>
>  NIP [c0000000012d8728] __of_changeset_entry_invert+0x10/0x1ac
>  LR [c0000000012da7f0] __of_changeset_revert_entries+0x98/0x180
>  Call Trace:
>  [c00000000bcc3970] [c0000000012daa60] of_changeset_revert+0x58/0xd8
>  [c00000000bcc39c0] [c000000000d0ed78] of_pci_remove_node+0x74/0xb0
>  [c00000000bcc39f0] [c000000000cdcfe0] pci_stop_bus_device+0xf4/0x138
>  [c00000000bcc3a30] [c000000000cdd140] pci_stop_and_remove_bus_device_locked+0x34/0x64
>  [c00000000bcc3a60] [c000000000cf3780] remove_store+0xf0/0x108
>  [c00000000bcc3ab0] [c000000000e89e04] dev_attr_store+0x34/0x78
>  [c00000000bcc3ad0] [c0000000007f8dd4] sysfs_kf_write+0x70/0xa4
>  [c00000000bcc3af0] [c0000000007f7248] kernfs_fop_write_iter+0x1d0/0x2e0
>  [c00000000bcc3b40] [c0000000006c9b08] vfs_write+0x27c/0x558
>  [c00000000bcc3bf0] [c0000000006ca168] ksys_write+0x90/0x170
>  [c00000000bcc3c40] [c000000000033248] system_call_exception+0xf8/0x290
>  [c00000000bcc3e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
> <snip>
> 
> A git bisect pointed this regression to be introduced via [1] that added
> a mechanism to create device tree nodes for parent PCI bridges when a
> PCI device is hot-plugged.
> 
> The Oops is caused when `pci_stop_dev()` tries to remove a non-existing
> device-tree node associated with the pci_dev that was earlier
> hot-plugged and was attached under a pci-bridge. The PCI dev header
> `dev->hdr_type` being 0, results a conditional check done with
> `pci_is_bridge()` into false. Consequently, a call to
> `of_pci_make_dev_node()` to create a device node is never made. When at
> a later point in time, in the device node removal path, a memcpy is
> attempted in `__of_changeset_entry_invert()`; since the device node was
> never created, results in an Oops due to kernel read access to a bad
> address.
> 
> To fix this issue, the patch introduces a new flag OF_CREATE_WITH_CSET
> to keep track of device nodes created via `of_pci_make_dev_node()` and
> later attempt to destroy only such device nodes which have this flag
> set.
> 
> [1] commit 407d1a51921e ("PCI: Create device tree node for bridge")
> 
> Fixes: 407d1a51921e ("PCI: Create device tree node for bridge")
> Reported-by: Kowshik Jois B S <kowsjois@linux.ibm.com>
> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
> Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
> ---
> Changes since v2:
>     * Drop v2 changes and introduce a different approach from Lizhi discussed
>       over the v2 of this patch
>     * V2: https://lore.kernel.org/all/20240715080726.2496198-1-amachhiw@linux.ibm.com/
> Changes since v1:
>     * Included Lizhi's suggested changes on V1
>     * Fixed below two warnings from Lizhi's changes and rearranged the cleanup
>       part a bit in `of_pci_make_dev_node`
> 	drivers/pci/of.c:611:6: warning: no previous prototype for ‘of_pci_free_node’ [-Wmissing-prototypes]
> 	  611 | void of_pci_free_node(struct device_node *np)
> 	      |      ^~~~~~~~~~~~~~~~
> 	drivers/pci/of.c: In function ‘of_pci_make_dev_node’:
> 	drivers/pci/of.c:696:1: warning: label ‘out_destroy_cset’ defined but not used [-Wunused-label]
> 	  696 | out_destroy_cset:
> 	      | ^~~~~~~~~~~~~~~~
>     * V1: https://lore.kernel.org/all/20240703141634.2974589-1-amachhiw@linux.ibm.com/
> 
>  drivers/pci/of.c   | 3 ++-
>  include/linux/of.h | 1 +
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 

Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Michael Ellerman Aug. 15, 2024, 3:20 a.m. UTC | #4
Bjorn Helgaas <helgaas@kernel.org> writes:
> On Sat, Aug 03, 2024 at 12:03:25AM +0530, Amit Machhiwal wrote:
>> With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
>> of a PCI device attached to a PCI-bridge causes following kernel Oops on
>> a pseries KVM guest:
>
> What is unique about pseries here?  There's nothing specific to
> pseries in the patch, so I would expect this to be a generic problem
> on any arch.
>
>>  RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
>>  Kernel attempted to read user page (10ec00000048) - exploit attempt? (uid: 0)
>>  BUG: Unable to handle kernel data access on read at 0x10ec00000048
>
> Weird address.  I would expect NULL or something.  Where did this
> non-NULL pointer come from?

It originally comes from np->data, which is supposed to be an
of_changeset.

The powerpc code also uses np->data for the struct pci_dn pointer, see
pci_add_device_node_info().

I wonder if that's why it's non-NULL?

Amit, do we have exact steps to reproduce this? I poked around a bit but
couldn't get it to trigger.

cheers
Amit Machhiwal Aug. 16, 2024, 12:43 p.m. UTC | #5
Hi Michael,

On 2024/08/15 01:20 PM, Michael Ellerman wrote:
> Bjorn Helgaas <helgaas@kernel.org> writes:
> > On Sat, Aug 03, 2024 at 12:03:25AM +0530, Amit Machhiwal wrote:
> >> With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
> >> of a PCI device attached to a PCI-bridge causes following kernel Oops on
> >> a pseries KVM guest:
> >
> > What is unique about pseries here?  There's nothing specific to
> > pseries in the patch, so I would expect this to be a generic problem
> > on any arch.
> >
> >>  RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
> >>  Kernel attempted to read user page (10ec00000048) - exploit attempt? (uid: 0)
> >>  BUG: Unable to handle kernel data access on read at 0x10ec00000048
> >
> > Weird address.  I would expect NULL or something.  Where did this
> > non-NULL pointer come from?
> 
> It originally comes from np->data, which is supposed to be an
> of_changeset.
> 
> The powerpc code also uses np->data for the struct pci_dn pointer, see
> pci_add_device_node_info().
> 
> I wonder if that's why it's non-NULL?

I'm also looking into the code to figure out where's that value coming from. I
will update as soon as I get there.

> 
> Amit, do we have exact steps to reproduce this? I poked around a bit but
> couldn't get it to trigger.

Sure, below are the steps:

1. Set CONFIG_PCI_DYNAMIC_OF_NODES=y in the kernel config and compile (Fedora
   has it disabled in it's distro config, Ubuntu has it enabled but will have it
   disabled in the next update)

2. If you are using Fedora cloud images, make sure you've these packages
   installed:
    $ rpm -qa | grep -e 'ppc64-diag\|powerpc-utils'
    powerpc-utils-core-1.3.11-6.fc40.ppc64le
    powerpc-utils-1.3.11-6.fc40.ppc64le
    ppc64-diag-rtas-2.7.9-6.fc40.ppc64le
    ppc64-diag-2.7.9-6.fc40.ppc64le

3. Hotplug a pci device as follows:
    virsh attach-interface <domain_name> bridge --source virbr0

4. Check if the pci device was added by running `ip a s`

5. Try hot-unplug of that device by supplying the MAC, which should trigger the
   Oops
    virsh detach-interface <domain_name> bridge <mac_addr>

Thanks,
Amit

> cheers
Michael Ellerman Aug. 16, 2024, 10:59 p.m. UTC | #6
Amit Machhiwal <amachhiw@linux.ibm.com> writes:
> On 2024/08/15 01:20 PM, Michael Ellerman wrote:
>> Bjorn Helgaas <helgaas@kernel.org> writes:
>> > On Sat, Aug 03, 2024 at 12:03:25AM +0530, Amit Machhiwal wrote:
>> >> With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
>> >> of a PCI device attached to a PCI-bridge causes following kernel Oops on
>> >> a pseries KVM guest:
>> >
>> > What is unique about pseries here?  There's nothing specific to
>> > pseries in the patch, so I would expect this to be a generic problem
>> > on any arch.
>> >
>> >>  RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
>> >>  Kernel attempted to read user page (10ec00000048) - exploit attempt? (uid: 0)
>> >>  BUG: Unable to handle kernel data access on read at 0x10ec00000048
>> >
>> > Weird address.  I would expect NULL or something.  Where did this
>> > non-NULL pointer come from?
>> 
>> It originally comes from np->data, which is supposed to be an
>> of_changeset.
>> 
>> The powerpc code also uses np->data for the struct pci_dn pointer, see
>> pci_add_device_node_info().
>> 
>> I wonder if that's why it's non-NULL?
>
> I'm also looking into the code to figure out where's that value coming from. I
> will update as soon as I get there.

Thanks.
 
>> Amit, do we have exact steps to reproduce this? I poked around a bit but
>> couldn't get it to trigger.
>
> Sure, below are the steps:
>
> 1. Set CONFIG_PCI_DYNAMIC_OF_NODES=y in the kernel config and compile (Fedora
>    has it disabled in it's distro config, Ubuntu has it enabled but will have it
>    disabled in the next update)
>
> 2. If you are using Fedora cloud images, make sure you've these packages
>    installed:
>     $ rpm -qa | grep -e 'ppc64-diag\|powerpc-utils'
>     powerpc-utils-core-1.3.11-6.fc40.ppc64le
>     powerpc-utils-1.3.11-6.fc40.ppc64le
>     ppc64-diag-rtas-2.7.9-6.fc40.ppc64le
>     ppc64-diag-2.7.9-6.fc40.ppc64le
>
> 3. Hotplug a pci device as follows:
>     virsh attach-interface <domain_name> bridge --source virbr0

I don't use virsh :)

Any idea how to do it with just qemu monitor commands?

cheers
Amit Machhiwal Aug. 19, 2024, 11:33 a.m. UTC | #7
Hi Michael,

On 2024/08/17 08:59 AM, Michael Ellerman wrote:
> Amit Machhiwal <amachhiw@linux.ibm.com> writes:
> > On 2024/08/15 01:20 PM, Michael Ellerman wrote:
> >> Bjorn Helgaas <helgaas@kernel.org> writes:
> >> > On Sat, Aug 03, 2024 at 12:03:25AM +0530, Amit Machhiwal wrote:
> >> >> With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
> >> >> of a PCI device attached to a PCI-bridge causes following kernel Oops on
> >> >> a pseries KVM guest:
> >> >
> >> > What is unique about pseries here?  There's nothing specific to
> >> > pseries in the patch, so I would expect this to be a generic problem
> >> > on any arch.
> >> >
> >> >>  RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
> >> >>  Kernel attempted to read user page (10ec00000048) - exploit attempt? (uid: 0)
> >> >>  BUG: Unable to handle kernel data access on read at 0x10ec00000048
> >> >
> >> > Weird address.  I would expect NULL or something.  Where did this
> >> > non-NULL pointer come from?
> >> 
> >> It originally comes from np->data, which is supposed to be an
> >> of_changeset.
> >> 
> >> The powerpc code also uses np->data for the struct pci_dn pointer, see
> >> pci_add_device_node_info().
> >> 
> >> I wonder if that's why it's non-NULL?
> >
> > I'm also looking into the code to figure out where's that value coming from. I
> > will update as soon as I get there.
> 
> Thanks.
>  
> >> Amit, do we have exact steps to reproduce this? I poked around a bit but
> >> couldn't get it to trigger.
> >
> > Sure, below are the steps:
> >
> > 1. Set CONFIG_PCI_DYNAMIC_OF_NODES=y in the kernel config and compile (Fedora
> >    has it disabled in it's distro config, Ubuntu has it enabled but will have it
> >    disabled in the next update)
> >
> > 2. If you are using Fedora cloud images, make sure you've these packages
> >    installed:
> >     $ rpm -qa | grep -e 'ppc64-diag\|powerpc-utils'
> >     powerpc-utils-core-1.3.11-6.fc40.ppc64le
> >     powerpc-utils-1.3.11-6.fc40.ppc64le
> >     ppc64-diag-rtas-2.7.9-6.fc40.ppc64le
> >     ppc64-diag-2.7.9-6.fc40.ppc64le
> >
> > 3. Hotplug a pci device as follows:
> >     virsh attach-interface <domain_name> bridge --source virbr0
> 
> I don't use virsh :)

No worries. Fortunately, we do have a way to do it with qemu monitor.

> 
> Any idea how to do it with just qemu monitor commands?
> 

1. Boot the guest with the below included in the qemu cmdline:

    -netdev bridge,id=<net_name>,br=virbr0,helper=/usr/libexec/qemu-bridge-helper

2. Once the guest boots, run the below command on qemu monitor to hot-plug a pci
   device:

    device_add rtl8139,netdev=<net_name>,mac=52:54:00:88:31:28,id=<net_id>

    dmesg
    =====
    [  116.968210] pci 0000:00:01.0: [10ec:8139] type 00 class 0x020000 conventional PCI endpoint
    [  116.969260] pci 0000:00:01.0: BAR 0 [io  0x10000-0x100ff]
    [  116.969904] pci 0000:00:01.0: BAR 1 [mem 0x00000000-0x000000ff]
    [  116.970745] pci 0000:00:01.0: ROM [mem 0x00000000-0x0003ffff pref]
    [  116.971456] pci 0000:00:01.0: No hypervisor support for SR-IOV on this device, IOV BARs disabled.
    [  116.972583] pci 0000:00:01.0: Adding to iommu group 0
    [  116.978466] pci 0000:00:01.0: ROM [mem 0x200080080000-0x2000800bffff pref]: assigned
    [  116.979347] pci 0000:00:01.0: BAR 0 [io  0x10400-0x104ff]: assigned
    [  116.980063] pci 0000:00:01.0: BAR 1 [mem 0x200080001000-0x2000800010ff]: assigned
    [  117.017187] 8139cp: 8139cp: 10/100 PCI Ethernet driver v1.3 (Mar 22, 2004)
    [  117.018577] 8139cp 0000:00:01.0: enabling device (0000 -> 0003)
    [  117.025414] 8139cp 0000:00:01.0 eth1: RTL-8139C+ at 0x00000000fbf09e59, 52:54:00:88:31:28, IRQ 26
    [  117.051028] 8139too: 8139too Fast Ethernet driver 0.9.28
    [  117.076577] 8139cp 0000:00:01.0 eth1: link up, 100Mbps, full-duplex, lpa 0x05E1

3. Try hot-unplug of the device to recreate the kernel Oops.

    device_del <net_id>

Thanks,
Amit

> cheers
diff mbox series

Patch

diff --git a/drivers/pci/of.c b/drivers/pci/of.c
index dacea3fc5128..bc455370143e 100644
--- a/drivers/pci/of.c
+++ b/drivers/pci/of.c
@@ -653,7 +653,7 @@  void of_pci_remove_node(struct pci_dev *pdev)
 	struct device_node *np;
 
 	np = pci_device_to_OF_node(pdev);
-	if (!np || !of_node_check_flag(np, OF_DYNAMIC))
+	if (!np || !of_node_check_flag(np, OF_CREATE_WITH_CSET))
 		return;
 	pdev->dev.of_node = NULL;
 
@@ -712,6 +712,7 @@  void of_pci_make_dev_node(struct pci_dev *pdev)
 	if (ret)
 		goto out_free_node;
 
+	of_node_set_flag(np, OF_CREATE_WITH_CSET);
 	np->data = cset;
 	pdev->dev.of_node = np;
 	kfree(name);
diff --git a/include/linux/of.h b/include/linux/of.h
index 85b60ac9eec5..5faa5a1198c6 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -153,6 +153,7 @@  extern struct device_node *of_stdout;
 #define OF_POPULATED_BUS	4 /* platform bus created for children */
 #define OF_OVERLAY		5 /* allocated for an overlay */
 #define OF_OVERLAY_FREE_CSET	6 /* in overlay cset being freed */
+#define OF_CREATE_WITH_CSET	7 /* Created by of_changeset_create_node */
 
 #define OF_BAD_ADDR	((u64)-1)