diff mbox series

iommu/arm-smmu-v3: Allow stream table to have nodes with the same ID

Message ID 20250411044706.356395-1-nicolinc@nvidia.com (mailing list archive)
State New
Headers show
Series iommu/arm-smmu-v3: Allow stream table to have nodes with the same ID | expand

Commit Message

Nicolin Chen April 11, 2025, 4:47 a.m. UTC
From: Jason Gunthorpe <jgg@nvidia.com>

ASPEED VGA card has two built-in devices:
 0008:06:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 06)
 0008:07:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 52)

Its toplogy looks like this:
 +-[0008:00]---00.0-[01-09]--+-00.0-[02-09]--+-00.0-[03]----00.0  Sandisk Corp Device 5017
                             |               +-01.0-[04]--
                             |               +-02.0-[05]----00.0  NVIDIA Corporation Device
                             |               +-03.0-[06-07]----00.0-[07]----00.0  ASPEED Technology, Inc. ASPEED Graphics Family
                             |               +-04.0-[08]----00.0  Renesas Technology Corp. uPD720201 USB 3.0 Host Controller
                             |               \-05.0-[09]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
                             \-00.1  PMC-Sierra Inc. Device 4028

Being a legacy PCI device that does not have RID on the wire, the system
does not preserve a RID for that PCI bridge (0008:06), so the IORT code
has to dma alias for iort_pci_iommu_init() via pci_for_each_dma_alias(),
resulting in both of them getting the same Stream ID.

On a kernel prior to v6.15-rc1, there has been an overlooked warning:
  pci 0008:07:00.0: vgaarb: setting as boot VGA device
  pci 0008:07:00.0: vgaarb: bridge control possible
  pci 0008:07:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
  pcieport 0008:06:00.0: Adding to iommu group 14
  ast 0008:07:00.0: stream 67328 already in tree   <===== WARNING
  ast 0008:07:00.0: enabling device (0002 -> 0003)
  ast 0008:07:00.0: Using default configuration
  ast 0008:07:00.0: AST 2600 detected
  ast 0008:07:00.0: [drm] Using analog VGA
  ast 0008:07:00.0: [drm] dram MCLK=396 Mhz type=1 bus_width=16
  [drm] Initialized ast 0.1.0 for 0008:07:00.0 on minor 0
  ast 0008:07:00.0: [drm] fb0: astdrmfb frame buffer device

As such a legacy system might not use DMA at all, an iommu_probe_device()
failure didn't actually break it, except that warning that does not block
the system boot flow.

With v6.15-rc, since the commit bcb81ac6ae3c ("iommu: Get DT/ACPI parsing
into the proper probe path"), the error returned with the warning is moved
to the IOMMU device probe flow, e.g. call trace with SMMUv3:
  arm_smmu_probe_device+0x15c/0x4c0
  __iommu_probe_device+0x150/0x4f8
  probe_iommu_group+0x44/0x80
  bus_for_each_dev+0x7c/0x100
  bus_iommu_probe+0x48/0x1a8
  iommu_device_register+0xb8/0x178
  arm_smmu_device_probe+0x1350/0x1db0

This then fails the entire SMMU driver probe:
  arm-smmu-v3 arm-smmu-v3.9.auto: found companion CMDQV device: NVDA200C:04
  arm-smmu-v3 arm-smmu-v3.9.auto: option mask 0x10
  arm-smmu-v3 arm-smmu-v3.9.auto: ias 48-bit, oas 48-bit (features 0x001e1fbf)
  arm-smmu-v3 arm-smmu-v3.9.auto: allocated 524288 entries for cmdq
  arm-smmu-v3 arm-smmu-v3.9.auto: allocated 524288 entries for evtq
  arm-smmu-v3 arm-smmu-v3.9.auto: allocated 524288 entries for priq
  arm-smmu-v3 arm-smmu-v3.9.auto: allocated 524288 entries for vcmdq0
  arm-smmu-v3 arm-smmu-v3.9.auto: allocated 524288 entries for vcmdq1
  arm-smmu-v3 arm-smmu-v3.9.auto: msi_domain absent - falling back to wired irqs
  arm-smmu-v3 arm-smmu-v3.9.auto: no priq irq - PRI will be broken
  pci 0008:00:00.0: Adding to iommu group 10
  pci 0008:01:00.0: Adding to iommu group 11
  pci 0008:01:00.1: Adding to iommu group 12
  pci 0008:02:00.0: Adding to iommu group 13
  pci 0008:02:01.0: Adding to iommu group 14
  pci 0008:02:02.0: Adding to iommu group 15
  pci 0008:02:03.0: Adding to iommu group 16
  pci 0008:02:04.0: Adding to iommu group 17
  pci 0008:02:05.0: Adding to iommu group 18
  pci 0008:03:00.0: Adding to iommu group 19
  pci 0008:05:00.0: Adding to iommu group 20
  pci 0008:06:00.0: Adding to iommu group 21
  pci 0008:07:00.0: stream 67328 already in tree
  arm-smmu-v3 arm-smmu-v3.9.auto: Failed to register iommu
  arm-smmu-v3 arm-smmu-v3.9.auto: probe with driver arm-smmu-v3 failed with error -22

Given that a device bundled with a legacy PCI bridge could have duplicated
Stream IDs, the concept of a stream_id tree with unique node in the SMMUv3
driver doesn't work any more.

Change the arm_smmu_streams_cmp_node() to allow the stream table to hold
multiple nodes with the same Stream ID. Meanwhile, the reverse lookup from
the Stream ID to a device pointer will have to be broken, i.e. the eventq
handler will no longer find the device with a Stream ID in such cases.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 31 +++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

Comments

Jason Gunthorpe April 11, 2025, 11:45 a.m. UTC | #1
On Thu, Apr 10, 2025 at 09:47:06PM -0700, Nicolin Chen wrote:
> Change the arm_smmu_streams_cmp_node() to allow the stream table to hold
> multiple nodes with the same Stream ID. Meanwhile, the reverse lookup from
> the Stream ID to a device pointer will have to be broken, i.e. the eventq
> handler will no longer find the device with a Stream ID in such cases.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 31 +++++++++++++++++++--
>  1 file changed, 29 insertions(+), 2 deletions(-)

This should go to rc, also a fixes line is probably something like:

Fixes: cdf315f907d4 ("iommu/arm-smmu-v3: Maintain a SID->device structure")

Since it has been broken for a while, it just didn't become critical
until probe started failing in v6.15-rc1

Jason
Robin Murphy April 11, 2025, 12:10 p.m. UTC | #2
On 2025-04-11 5:47 am, Nicolin Chen wrote:
> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> ASPEED VGA card has two built-in devices:
>   0008:06:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 06)
>   0008:07:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 52)
> 
> Its toplogy looks like this:
>   +-[0008:00]---00.0-[01-09]--+-00.0-[02-09]--+-00.0-[03]----00.0  Sandisk Corp Device 5017
>                               |               +-01.0-[04]--
>                               |               +-02.0-[05]----00.0  NVIDIA Corporation Device
>                               |               +-03.0-[06-07]----00.0-[07]----00.0  ASPEED Technology, Inc. ASPEED Graphics Family
>                               |               +-04.0-[08]----00.0  Renesas Technology Corp. uPD720201 USB 3.0 Host Controller
>                               |               \-05.0-[09]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
>                               \-00.1  PMC-Sierra Inc. Device 4028
> 
> Being a legacy PCI device that does not have RID on the wire, the system
> does not preserve a RID for that PCI bridge (0008:06), so the IORT code
> has to dma alias for iort_pci_iommu_init() via pci_for_each_dma_alias(),
> resulting in both of them getting the same Stream ID.
> 
> On a kernel prior to v6.15-rc1, there has been an overlooked warning:
>    pci 0008:07:00.0: vgaarb: setting as boot VGA device
>    pci 0008:07:00.0: vgaarb: bridge control possible
>    pci 0008:07:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
>    pcieport 0008:06:00.0: Adding to iommu group 14
>    ast 0008:07:00.0: stream 67328 already in tree   <===== WARNING
>    ast 0008:07:00.0: enabling device (0002 -> 0003)
>    ast 0008:07:00.0: Using default configuration
>    ast 0008:07:00.0: AST 2600 detected
>    ast 0008:07:00.0: [drm] Using analog VGA
>    ast 0008:07:00.0: [drm] dram MCLK=396 Mhz type=1 bus_width=16
>    [drm] Initialized ast 0.1.0 for 0008:07:00.0 on minor 0
>    ast 0008:07:00.0: [drm] fb0: astdrmfb frame buffer device
> 
> As such a legacy system might not use DMA at all, an iommu_probe_device()
> failure didn't actually break it, except that warning that does not block
> the system boot flow.
> 
> With v6.15-rc, since the commit bcb81ac6ae3c ("iommu: Get DT/ACPI parsing
> into the proper probe path"), the error returned with the warning is moved
> to the IOMMU device probe flow, e.g. call trace with SMMUv3:
>    arm_smmu_probe_device+0x15c/0x4c0
>    __iommu_probe_device+0x150/0x4f8
>    probe_iommu_group+0x44/0x80
>    bus_for_each_dev+0x7c/0x100
>    bus_iommu_probe+0x48/0x1a8
>    iommu_device_register+0xb8/0x178
>    arm_smmu_device_probe+0x1350/0x1db0
> 
> This then fails the entire SMMU driver probe:
>    arm-smmu-v3 arm-smmu-v3.9.auto: found companion CMDQV device: NVDA200C:04
>    arm-smmu-v3 arm-smmu-v3.9.auto: option mask 0x10
>    arm-smmu-v3 arm-smmu-v3.9.auto: ias 48-bit, oas 48-bit (features 0x001e1fbf)
>    arm-smmu-v3 arm-smmu-v3.9.auto: allocated 524288 entries for cmdq
>    arm-smmu-v3 arm-smmu-v3.9.auto: allocated 524288 entries for evtq
>    arm-smmu-v3 arm-smmu-v3.9.auto: allocated 524288 entries for priq
>    arm-smmu-v3 arm-smmu-v3.9.auto: allocated 524288 entries for vcmdq0
>    arm-smmu-v3 arm-smmu-v3.9.auto: allocated 524288 entries for vcmdq1
>    arm-smmu-v3 arm-smmu-v3.9.auto: msi_domain absent - falling back to wired irqs
>    arm-smmu-v3 arm-smmu-v3.9.auto: no priq irq - PRI will be broken
>    pci 0008:00:00.0: Adding to iommu group 10
>    pci 0008:01:00.0: Adding to iommu group 11
>    pci 0008:01:00.1: Adding to iommu group 12
>    pci 0008:02:00.0: Adding to iommu group 13
>    pci 0008:02:01.0: Adding to iommu group 14
>    pci 0008:02:02.0: Adding to iommu group 15
>    pci 0008:02:03.0: Adding to iommu group 16
>    pci 0008:02:04.0: Adding to iommu group 17
>    pci 0008:02:05.0: Adding to iommu group 18
>    pci 0008:03:00.0: Adding to iommu group 19
>    pci 0008:05:00.0: Adding to iommu group 20
>    pci 0008:06:00.0: Adding to iommu group 21
>    pci 0008:07:00.0: stream 67328 already in tree
>    arm-smmu-v3 arm-smmu-v3.9.auto: Failed to register iommu
>    arm-smmu-v3 arm-smmu-v3.9.auto: probe with driver arm-smmu-v3 failed with error -22
> 
> Given that a device bundled with a legacy PCI bridge could have duplicated
> Stream IDs, the concept of a stream_id tree with unique node in the SMMUv3
> driver doesn't work any more.
> 
> Change the arm_smmu_streams_cmp_node() to allow the stream table to hold
> multiple nodes with the same Stream ID. Meanwhile, the reverse lookup from
> the Stream ID to a device pointer will have to be broken, i.e. the eventq
> handler will no longer find the device with a Stream ID in such cases.

Heh, I was about to start looking at this one myself once I finished 
making sense of the Juno issue...

This is adding support for StreamID aliasing between devices, and as 
such it is incomplete. It's not OK to just allow devices to arbitrarily 
rewrite each other's STEs, that's the whole reason this uniqueness check 
exists in the first place! Aliases can only be permitted within a group, 
which means arm_smmu_device_group() also has to check and account for 
them in the first place - note that that applies to PCI devices as well, 
because as soon as we allow StreamID aliasing at all then we're 
inherently allowing RID->SID mappings to alias outside the PCI hierarchy 
in ways that pci_device_group() can't know about. It should work out 
basically the same as SMMUv2, just with the streams tree in place of the 
S2CR array.

> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 31 +++++++++++++++++++--
>   1 file changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index b4c21aaed126..5ce64dc78e12 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1762,8 +1762,25 @@ static int arm_smmu_streams_cmp_key(const void *lhs, const struct rb_node *rhs)
>   static int arm_smmu_streams_cmp_node(struct rb_node *lhs,
>   				     const struct rb_node *rhs)
>   {
> -	return arm_smmu_streams_cmp_key(
> -		&rb_entry(lhs, struct arm_smmu_stream, node)->id, rhs);
> +	struct arm_smmu_stream *stream_lhs =
> +		rb_entry(lhs, struct arm_smmu_stream, node);
> +	struct arm_smmu_stream *stream_rhs =
> +		rb_entry(rhs, struct arm_smmu_stream, node);
> +
> +	if (stream_lhs->id < stream_rhs->id)
> +		return -1;
> +	if (stream_lhs->id > stream_rhs->id)
> +		return 1;
> +
> +	/*
> +	 * The stream table can have multiple nodes with the same ID if there
> +	 * are DMA aliases.
> +	 */
> +	if (stream_lhs < stream_rhs)
> +		return -1;
> +	if (stream_lhs > stream_rhs)
> +		return 1;
> +	return 0;
>   }
>   
>   static struct arm_smmu_master *
> @@ -1776,6 +1793,16 @@ arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
>   	node = rb_find(&sid, &smmu->streams, arm_smmu_streams_cmp_key);
>   	if (!node)
>   		return NULL;
> +	/*
> +	 * If there are DMA alises then there are multiple devices with the same
> +	 * stream ID and we cannot reliably convert from SID to master.
> +	 */
> +	if (node->rb_left &&
> +	    rb_entry(node->rb_left, struct arm_smmu_stream, node)->id == sid)
> +		return NULL;
> +	if (node->rb_right &&
> +	    rb_entry(node->rb_right, struct arm_smmu_stream, node)->id == sid)
> +		return NULL;

This doesn't really work - the whole mechanism needs to fundamentally 
change to mapping StreamIDs to groups rather than to devices. Then it's 
really up to individual callers what they want to do if the group has 
more than one device.

Thanks,
Robin.

>   	return rb_entry(node, struct arm_smmu_stream, node)->master;
>   }
>
Jason Gunthorpe April 11, 2025, 1:01 p.m. UTC | #3
On Fri, Apr 11, 2025 at 01:10:29PM +0100, Robin Murphy wrote:

> This is adding support for StreamID aliasing between devices, and as such it
> is incomplete. It's not OK to just allow devices to arbitrarily rewrite each
> other's STEs,

Okay, yes, we should be checking the iommu_group before permitting two
devices to share the STE. That is an easy fix, see below

> Aliases can only be permitted within a group, which means
> arm_smmu_device_group() also has to check and account for them in the first
> place - note that that applies to PCI devices as well, because as
> soon as we

On this system the alias come from the PCI DMA alias support and
pci_device_group() is already correctly grouping things.

Aliases from other places, like the IORT, never did work..

> allow StreamID aliasing at all then we're inherently allowing RID->SID
> mappings to alias outside the PCI hierarchy in ways that pci_device_group()
> can't know about. It should work out basically the same as SMMUv2, just with
> the streams tree in place of the S2CR array.

You mean the logic in v2's arm_smmu_device_group() to consult the
stream map to select the group if the IORT is creating aliases? Yes it
could be done..

However, this is a significant regression fix and I think we can be
confident there are no IORT tables in the wild that have aliases or
they would already be broken.

How about we add the missing validation that the group is the same,
that is easy to do and should be there anyhow:

static int arm_smmu_streams_cmp_node(struct rb_node *lhs,
				     const struct rb_node *rhs)
{
	struct arm_smmu_stream *stream_lhs =
		&rb_entry(lhs, struct arm_smmu_stream, node);
	struct arm_smmu_stream *stream_rhs =
		rb_entry(rhs, struct arm_smmu_stream, node);

	if (stream_lhs->id < stream_rhs->id)
		return -1;
	if (stream_lhs->id > stream_rhs->id)
		return 1;

	/*
	 * The stream table can have multiple nodes with the same ID if there
	 * are DMA aliases. If multiple masters share the same iommu group then
	 * they can use the overlapping STEs within the group.
	 */
	if (stream_lhs->master->dev->iommu_group ==
	    stream_rhs->master->dev->iommu_group) {
		if (stream_lhs < stream_rhs)
			return -1;
		if (stream_lhs > stream_rhs)
			return 1;
	}
	return 0;
}

That change will narrow this patch to only enable PCI DMA aliases that
already generate the correct iommu groupings. Other sources of alising
that don't generate the right groupings will continue to fail as they
do today.

Then I propose continuing to wait for a user before adding support for
more alias scenarios to arm_smmu_device_group()?

> > +	/*
> > +	 * If there are DMA alises then there are multiple devices with the same
> > +	 * stream ID and we cannot reliably convert from SID to master.
> > +	 */
> > +	if (node->rb_left &&
> > +	    rb_entry(node->rb_left, struct arm_smmu_stream, node)->id == sid)
> > +		return NULL;
> > +	if (node->rb_right &&
> > +	    rb_entry(node->rb_right, struct arm_smmu_stream, node)->id == sid)
> > +		return NULL;
> 
> This doesn't really work - the whole mechanism needs to fundamentally change
> to mapping StreamIDs to groups rather than to devices. Then it's really up
> to individual callers what they want to do if the group has more than one
> device.

There are only two callers. One is using it to print the log message,
in this case it will fall back to the unknown stream ID path and still
print a log message. This could perhaps print the group ID # instead
of the raw stream ID but I wouldn't do that in a regression fix rc
patch.

The other is doing stall/future PRI, and I don't think we should be
doing iommu_group based fault reporting at all. Returning NULL
effectively disables it.

Jason
Pranjal Shrivastava April 11, 2025, 1:35 p.m. UTC | #4
On Fri, Apr 11, 2025 at 10:01:55AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 11, 2025 at 01:10:29PM +0100, Robin Murphy wrote:
> 
> > This is adding support for StreamID aliasing between devices, and as such it
> > is incomplete. It's not OK to just allow devices to arbitrarily rewrite each
> > other's STEs,
> 
> Okay, yes, we should be checking the iommu_group before permitting two
> devices to share the STE. That is an easy fix, see below
> 
> > Aliases can only be permitted within a group, which means
> > arm_smmu_device_group() also has to check and account for them in the first
> > place - note that that applies to PCI devices as well, because as
> > soon as we
> 
> On this system the alias come from the PCI DMA alias support and
> pci_device_group() is already correctly grouping things.
> 
> Aliases from other places, like the IORT, never did work..
> 
> > allow StreamID aliasing at all then we're inherently allowing RID->SID
> > mappings to alias outside the PCI hierarchy in ways that pci_device_group()
> > can't know about. It should work out basically the same as SMMUv2, just with
> > the streams tree in place of the S2CR array.
> 
> You mean the logic in v2's arm_smmu_device_group() to consult the
> stream map to select the group if the IORT is creating aliases? Yes it
> could be done..
> 
> However, this is a significant regression fix and I think we can be
> confident there are no IORT tables in the wild that have aliases or
> they would already be broken.
> 
> How about we add the missing validation that the group is the same,
> that is easy to do and should be there anyhow:
> 
> static int arm_smmu_streams_cmp_node(struct rb_node *lhs,
> 				     const struct rb_node *rhs)
> {
> 	struct arm_smmu_stream *stream_lhs =
> 		&rb_entry(lhs, struct arm_smmu_stream, node);
> 	struct arm_smmu_stream *stream_rhs =
> 		rb_entry(rhs, struct arm_smmu_stream, node);
> 
> 	if (stream_lhs->id < stream_rhs->id)
> 		return -1;
> 	if (stream_lhs->id > stream_rhs->id)
> 		return 1;
> 
> 	/*
> 	 * The stream table can have multiple nodes with the same ID if there
> 	 * are DMA aliases. If multiple masters share the same iommu group then
> 	 * they can use the overlapping STEs within the group.
> 	 */
> 	if (stream_lhs->master->dev->iommu_group ==
> 	    stream_rhs->master->dev->iommu_group) {
> 		if (stream_lhs < stream_rhs)
> 			return -1;
> 		if (stream_lhs > stream_rhs)
> 			return 1;
> 	}
> 	return 0;
> }
> 
> That change will narrow this patch to only enable PCI DMA aliases that
> already generate the correct iommu groupings. Other sources of alising
> that don't generate the right groupings will continue to fail as they
> do today.
> 

Isn't the device grouped *after* the ops->probe_device call? I see that
the dev->iommu_group is assigned the ops->probe_device call in
iommu_init_device.. so I guess this would still fail?

Thanks,
Praan

> Then I propose continuing to wait for a user before adding support for
> more alias scenarios to arm_smmu_device_group()?
> 
> > > +	/*
> > > +	 * If there are DMA alises then there are multiple devices with the same
> > > +	 * stream ID and we cannot reliably convert from SID to master.
> > > +	 */
> > > +	if (node->rb_left &&
> > > +	    rb_entry(node->rb_left, struct arm_smmu_stream, node)->id == sid)
> > > +		return NULL;
> > > +	if (node->rb_right &&
> > > +	    rb_entry(node->rb_right, struct arm_smmu_stream, node)->id == sid)
> > > +		return NULL;
> > 
> > This doesn't really work - the whole mechanism needs to fundamentally change
> > to mapping StreamIDs to groups rather than to devices. Then it's really up
> > to individual callers what they want to do if the group has more than one
> > device.
> 
> There are only two callers. One is using it to print the log message,
> in this case it will fall back to the unknown stream ID path and still
> print a log message. This could perhaps print the group ID # instead
> of the raw stream ID but I wouldn't do that in a regression fix rc
> patch.
> 
> The other is doing stall/future PRI, and I don't think we should be
> doing iommu_group based fault reporting at all. Returning NULL
> effectively disables it.
> 
> Jason
Jason Gunthorpe April 11, 2025, 1:42 p.m. UTC | #5
On Fri, Apr 11, 2025 at 01:35:34PM +0000, Pranjal Shrivastava wrote:
> Isn't the device grouped *after* the ops->probe_device call? I see that
> the dev->iommu_group is assigned the ops->probe_device call in
> iommu_init_device.. so I guess this would still fail?

Yes you are right, we'd have to also change to setup the stream table
inside the device_group call back to validate stream table groups are
the same.

At that point it is probably trivial to just return the group already
in the stream table.

Jason
Robin Murphy April 11, 2025, 3:13 p.m. UTC | #6
On 11/04/2025 5:47 am, Nicolin Chen wrote:
> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> ASPEED VGA card has two built-in devices:
>   0008:06:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 06)
>   0008:07:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 52)
> 
> Its toplogy looks like this:
>   +-[0008:00]---00.0-[01-09]--+-00.0-[02-09]--+-00.0-[03]----00.0  Sandisk Corp Device 5017
>                               |               +-01.0-[04]--
>                               |               +-02.0-[05]----00.0  NVIDIA Corporation Device
>                               |               +-03.0-[06-07]----00.0-[07]----00.0  ASPEED Technology, Inc. ASPEED Graphics Family
>                               |               +-04.0-[08]----00.0  Renesas Technology Corp. uPD720201 USB 3.0 Host Controller
>                               |               \-05.0-[09]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
>                               \-00.1  PMC-Sierra Inc. Device 4028
> 
> Being a legacy PCI device that does not have RID on the wire, the system
> does not preserve a RID for that PCI bridge (0008:06), so the IORT code
> has to dma alias for iort_pci_iommu_init() via pci_for_each_dma_alias(),
> resulting in both of them getting the same Stream ID.

Hmm, actually, this doesn't even make a whole heap of sense, and it's
not what's happening here at all.

The bridge *does* claim its own RID, and per the aliasing rules the
devices behind it claim both their own RID and the alias to function
00.0 on the bridge's secondary bus, like so in action:

[    1.040905] pci 0000:04:00.0: arm_smmu_insert_master: SID 0x400
[    1.046988] pci 0000:04:00.0: Adding to iommu group 0
[    1.046995] pci 0000:05:04.0: arm_smmu_insert_master: SID 0x520
[    1.053750] pci 0000:05:04.0: arm_smmu_insert_master: SID 0x500
[    1.053759] pci 0000:05:04.0: Adding to iommu group 0
[    1.053765] pci 0000:05:05.0: arm_smmu_insert_master: SID 0x528
[    1.053767] pci 0000:05:05.0: arm_smmu_insert_master: SID 0x500
[    1.053768] pci 0000:05:05.0: Aliasing StreamID 0x500 unsupported, 
expect DMA to be broken

Which is a snippet from this topology:

-+-[0001:00]---00.0-[01]----00.0  Samsung Electronics Co Ltd NVMe SSD 
Controller SM981/PM981/PM983
  \-[0000:00]---00.0-[01-09]--+-00.0-[02-09]--+-08.0-[03]----00.0  VIA 
Technologies, Inc. VL805/806 xHCI USB 3.0 Controller
                              | 
+-10.0-[04-05]----00.0-[05]--+-04.0  Intel Corporation 82557/8/9/0/1 
Ethernet Pro 100
                              |               | 
   \-05.0  Broadcom Inc. and subsidiaries BCM4306 802.11b/g Wireless LAN 
Controller
                              |               +-11.0-[06]--
                              |               +-12.0-[07]----00.0 
Marvell Technology Group Ltd. Device 9170

(yes, I have a physical PCIe-to-PCI bridge board with random ancient
cards plugged into it specifically for griefing the SMMU drivers :D)

> On a kernel prior to v6.15-rc1, there has been an overlooked warning:
>    pci 0008:07:00.0: vgaarb: setting as boot VGA device
>    pci 0008:07:00.0: vgaarb: bridge control possible
>    pci 0008:07:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
>    pcieport 0008:06:00.0: Adding to iommu group 14
>    ast 0008:07:00.0: stream 67328 already in tree   <===== WARNING

However in the ASpeed VGA case, we get a single device aliasing with
*itself*, because the physical RID is already function 00.0 - this is
more obvious on our CI's Ampere Altra system where the SID mappings have
no additional upper bits:

[   22.953940] pci 0004:01:00.0: PCI bridge to [bus 02]
...
[   23.037980] pci 0004:02:00.0: [1a03:2000] type 00 class 0x030000 
conventional PCI endpoint
...
[   27.383602] pci 0004:01:00.0: Adding to iommu group 28
[   27.394562] pci 0004:02:00.0: stream 512 already in tree

512 == 0x200 == 02:00.0


Now that's supposed to have been supported since forever ago with
563b5cbe334e ("iommu/arm-smmu-v3: Cope with duplicated Stream IDs"),
however from a quick look I'm guessing cdf315f907d4 ("iommu/arm-smmu-v3:
Maintain a SID->device structure") might have broken it...

Thanks,
Robin.
Nicolin Chen April 11, 2025, 11:33 p.m. UTC | #7
On Fri, Apr 11, 2025 at 04:13:01PM +0100, Robin Murphy wrote:
> On 11/04/2025 5:47 am, Nicolin Chen wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > 
> > ASPEED VGA card has two built-in devices:
> >   0008:06:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 06)
> >   0008:07:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 52)
> > 
> > Its toplogy looks like this:
> >   +-[0008:00]---00.0-[01-09]--+-00.0-[02-09]--+-00.0-[03]----00.0  Sandisk Corp Device 5017
> >                               |               +-01.0-[04]--
> >                               |               +-02.0-[05]----00.0  NVIDIA Corporation Device
> >                               |               +-03.0-[06-07]----00.0-[07]----00.0  ASPEED Technology, Inc. ASPEED Graphics Family
> >                               |               +-04.0-[08]----00.0  Renesas Technology Corp. uPD720201 USB 3.0 Host Controller
> >                               |               \-05.0-[09]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
> >                               \-00.1  PMC-Sierra Inc. Device 4028
> > 
> > Being a legacy PCI device that does not have RID on the wire, the system
> > does not preserve a RID for that PCI bridge (0008:06), so the IORT code
> > has to dma alias for iort_pci_iommu_init() via pci_for_each_dma_alias(),
> > resulting in both of them getting the same Stream ID.
> 
> Hmm, actually, this doesn't even make a whole heap of sense, and it's
> not what's happening here at all.
> 
> The bridge *does* claim its own RID, and per the aliasing rules the
> devices behind it claim both their own RID and the alias to function
> 00.0 on the bridge's secondary bus, like so in action:

Yea, I just found out that the bridge does have a different SID.
It was actually the VGA controller itself having two fwspec->ids
populated by the IORT code. Then, the SMMU driver allocated two
separate streams with the same set of device pointer and SID:
  pci 0008:06:00.0: arm_smmu_insert_master: fwspec index=0, sid=0x10600
  pci 0008:06:00.0: Adding to iommu group 21
  pci 0008:07:00.0: arm_smmu_insert_master: fwspec index=0, sid=0x10700
  pci 0008:07:00.0: arm_smmu_insert_master: fwspec index=1, sid=0x10700
  pci 0008:07:00.0: Adding to iommu group 21

Perhaps the duplicated fwspec->id should be avoided in the IORT
code at the first place v.s. bypassing the fwspec->ids[1]?

Thanks
Nicolin
Jason Gunthorpe April 11, 2025, 11:44 p.m. UTC | #8
On Fri, Apr 11, 2025 at 04:33:44PM -0700, Nicolin Chen wrote:

> > The bridge *does* claim its own RID, and per the aliasing rules the
> > devices behind it claim both their own RID and the alias to function
> > 00.0 on the bridge's secondary bus, like so in action:
> 
> Yea, I just found out that the bridge does have a different SID.
> It was actually the VGA controller itself having two fwspec->ids
> populated by the IORT code. Then, the SMMU driver allocated two
> separate streams with the same set of device pointer and SID:
>   pci 0008:06:00.0: arm_smmu_insert_master: fwspec index=0, sid=0x10600
>   pci 0008:06:00.0: Adding to iommu group 21
>   pci 0008:07:00.0: arm_smmu_insert_master: fwspec index=0, sid=0x10700
>   pci 0008:07:00.0: arm_smmu_insert_master: fwspec index=1, sid=0x10700
>   pci 0008:07:00.0: Adding to iommu group 21
> 
> Perhaps the duplicated fwspec->id should be avoided in the IORT
> code at the first place v.s. bypassing the fwspec->ids[1]?

It is a much easier fix if all you have to do is ignore hits in the RB
tree that match to the same master, just don't fail on duplicates and
don't add the duplicate rb node at all?

Seem strange though. Where did the duplicate come from?

Jason
Nicolin Chen April 12, 2025, 12:07 a.m. UTC | #9
On Fri, Apr 11, 2025 at 08:44:19PM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 11, 2025 at 04:33:44PM -0700, Nicolin Chen wrote:
> 
> > > The bridge *does* claim its own RID, and per the aliasing rules the
> > > devices behind it claim both their own RID and the alias to function
> > > 00.0 on the bridge's secondary bus, like so in action:
> > 
> > Yea, I just found out that the bridge does have a different SID.
> > It was actually the VGA controller itself having two fwspec->ids
> > populated by the IORT code. Then, the SMMU driver allocated two
> > separate streams with the same set of device pointer and SID:
> >   pci 0008:06:00.0: arm_smmu_insert_master: fwspec index=0, sid=0x10600
> >   pci 0008:06:00.0: Adding to iommu group 21
> >   pci 0008:07:00.0: arm_smmu_insert_master: fwspec index=0, sid=0x10700
> >   pci 0008:07:00.0: arm_smmu_insert_master: fwspec index=1, sid=0x10700
> >   pci 0008:07:00.0: Adding to iommu group 21
> > 
> > Perhaps the duplicated fwspec->id should be avoided in the IORT
> > code at the first place v.s. bypassing the fwspec->ids[1]?
> 
> It is a much easier fix if all you have to do is ignore hits in the RB
> tree that match to the same master, just don't fail on duplicates and
> don't add the duplicate rb node at all?
>
> Seem strange though. Where did the duplicate come from?

iort_iommu_configure_id
 |=> pci_for_each_dma_alias
      |=> iort_pci_iommu_init
      |    |=> iort_iommu_xlate
      |         |=> acpi_iommu_fwspec_init
      |              |=> iommu_fwspec_add_id(sid=0x10700) //index=0
      |=> iort_pci_iommu_init
           |=> iort_iommu_xlate
                |=> acpi_iommu_fwspec_init
                     |=> iommu_fwspec_add_id(sid=0x10700) //index=1

We could add a piece of code in iommu_fwspec_add_ids() rejecting
a duplicated ID, which would be smaller than we do in SMMU driver
that requires a device_group change too?

Thanks
Nicolin
Nicolin Chen April 12, 2025, 3:39 a.m. UTC | #10
On Fri, Apr 11, 2025 at 05:07:03PM -0700, Nicolin Chen wrote:
> On Fri, Apr 11, 2025 at 08:44:19PM -0300, Jason Gunthorpe wrote:
> > On Fri, Apr 11, 2025 at 04:33:44PM -0700, Nicolin Chen wrote:
> > 
> > > > The bridge *does* claim its own RID, and per the aliasing rules the
> > > > devices behind it claim both their own RID and the alias to function
> > > > 00.0 on the bridge's secondary bus, like so in action:
> > > 
> > > Yea, I just found out that the bridge does have a different SID.
> > > It was actually the VGA controller itself having two fwspec->ids
> > > populated by the IORT code. Then, the SMMU driver allocated two
> > > separate streams with the same set of device pointer and SID:
> > >   pci 0008:06:00.0: arm_smmu_insert_master: fwspec index=0, sid=0x10600
> > >   pci 0008:06:00.0: Adding to iommu group 21
> > >   pci 0008:07:00.0: arm_smmu_insert_master: fwspec index=0, sid=0x10700
> > >   pci 0008:07:00.0: arm_smmu_insert_master: fwspec index=1, sid=0x10700
> > >   pci 0008:07:00.0: Adding to iommu group 21
> > > 
> > > Perhaps the duplicated fwspec->id should be avoided in the IORT
> > > code at the first place v.s. bypassing the fwspec->ids[1]?
> > 
> > It is a much easier fix if all you have to do is ignore hits in the RB
> > tree that match to the same master, just don't fail on duplicates and
> > don't add the duplicate rb node at all?
> >
> > Seem strange though. Where did the duplicate come from?
> 
> iort_iommu_configure_id
>  |=> pci_for_each_dma_alias
>       |=> iort_pci_iommu_init
>       |    |=> iort_iommu_xlate
>       |         |=> acpi_iommu_fwspec_init
>       |              |=> iommu_fwspec_add_id(sid=0x10700) //index=0
>       |=> iort_pci_iommu_init
>            |=> iort_iommu_xlate
>                 |=> acpi_iommu_fwspec_init
>                      |=> iommu_fwspec_add_id(sid=0x10700) //index=1
> 
> We could add a piece of code in iommu_fwspec_add_ids() rejecting
> a duplicated ID, which would be smaller than we do in SMMU driver
> that requires a device_group change too?

I figured that we could do something smaller too in SMMUv3 driver,
somewhat similar to what Robin did for STE in commit 563b5cbe334e
("iommu/arm-smmu-v3: Cope with duplicated Stream IDs"):
----------------------------------------------
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index b4c21aaed1266..4b66e8ebef539 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3393,6 +3393,10 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
                new_stream->id = sid;
                new_stream->master = master;

+               /* Bridged PCI devices may end up with duplicated IDs */
+               if (master == arm_smmu_find_master(smmu, sid))
+                       continue;
+
                ret = arm_smmu_init_sid_strtab(smmu, sid);
                if (ret)
                        break;
----------------------------------------------

This would bypass the duplicated ID. And here is the test result:
[    6.327800] pci 0008:06:00.0: Adding to iommu group 21
[    6.327847] pci 0008:07:00.0: Adding to iommu group 21

Thanks
Nicolin
Jason Gunthorpe April 12, 2025, 1:55 p.m. UTC | #11
On Fri, Apr 11, 2025 at 08:39:36PM -0700, Nicolin Chen wrote:
> 
> I figured that we could do something smaller too in SMMUv3 driver,
> somewhat similar to what Robin did for STE in commit 563b5cbe334e
> ("iommu/arm-smmu-v3: Cope with duplicated Stream IDs"):
> ----------------------------------------------
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index b4c21aaed1266..4b66e8ebef539 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -3393,6 +3393,10 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
>                 new_stream->id = sid;
>                 new_stream->master = master;
> 
> +               /* Bridged PCI devices may end up with duplicated IDs */
> +               if (master == arm_smmu_find_master(smmu, sid))
> +                       continue;
> +
>                 ret = arm_smmu_init_sid_strtab(smmu, sid);
>                 if (ret)
>                         break;
> ----------------------------------------------

I'd write the above like this though:

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index b4c21aaed1266a..97b9f8d7fb3340 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3389,6 +3389,7 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
 	for (i = 0; i < fwspec->num_ids; i++) {
 		struct arm_smmu_stream *new_stream = &master->streams[i];
 		u32 sid = fwspec->ids[i];
+		struct rb_node *existing;
 
 		new_stream->id = sid;
 		new_stream->master = master;
@@ -3398,10 +3399,20 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
 			break;
 
 		/* Insert into SID tree */
-		if (rb_find_add(&new_stream->node, &smmu->streams,
-				arm_smmu_streams_cmp_node)) {
-			dev_warn(master->dev, "stream %u already in tree\n",
-				 sid);
+		existing = rb_find_add(&new_stream->node, &smmu->streams,
+				       arm_smmu_streams_cmp_node);
+		if (existing) {
+			struct arm_smmu_master *existing_master =
+				rb_entry(existing, struct arm_smmu_stream, node)
+					->master;
+
+			/* Bridged PCI devices may end up with duplicated IDs */
+			if (existing_master == master)
+				continue;
+
+			dev_warn(master->dev,
+				 "stream %u already in tree from dev %s\n", sid,
+				 dev_name(existing_master->dev));
 			ret = -EINVAL;
 			break;
 		}

It seems like a reasonable rc fix, and if nobody has cared about DMA
Alias support since it was broken in 2021 maybe we just leave it.

Jason
Nicolin Chen April 12, 2025, 5:03 p.m. UTC | #12
On Sat, Apr 12, 2025 at 10:55:14AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 11, 2025 at 08:39:36PM -0700, Nicolin Chen wrote:
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index b4c21aaed1266a..97b9f8d7fb3340 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -3389,6 +3389,7 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
>  	for (i = 0; i < fwspec->num_ids; i++) {
>  		struct arm_smmu_stream *new_stream = &master->streams[i];
>  		u32 sid = fwspec->ids[i];
> +		struct rb_node *existing;
>  
>  		new_stream->id = sid;
>  		new_stream->master = master;
> @@ -3398,10 +3399,20 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
>  			break;
>  
>  		/* Insert into SID tree */
> -		if (rb_find_add(&new_stream->node, &smmu->streams,
> -				arm_smmu_streams_cmp_node)) {
> -			dev_warn(master->dev, "stream %u already in tree\n",
> -				 sid);
> +		existing = rb_find_add(&new_stream->node, &smmu->streams,
> +				       arm_smmu_streams_cmp_node);

Oh right, it does find() already.

> +		if (existing) {
> +			struct arm_smmu_master *existing_master =
> +				rb_entry(existing, struct arm_smmu_stream, node)
> +					->master;
> +
> +			/* Bridged PCI devices may end up with duplicated IDs */
> +			if (existing_master == master)
> +				continue;
> +
> +			dev_warn(master->dev,
> +				 "stream %u already in tree from dev %s\n", sid,
> +				 dev_name(existing_master->dev));
>  			ret = -EINVAL;
>  			break;
>  		}
> 
> It seems like a reasonable rc fix, and if nobody has cared about DMA
> Alias support since it was broken in 2021 maybe we just leave it.

I will wrap it up and make a note with that.

Thanks!
Nicolin
diff mbox series

Patch

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index b4c21aaed126..5ce64dc78e12 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1762,8 +1762,25 @@  static int arm_smmu_streams_cmp_key(const void *lhs, const struct rb_node *rhs)
 static int arm_smmu_streams_cmp_node(struct rb_node *lhs,
 				     const struct rb_node *rhs)
 {
-	return arm_smmu_streams_cmp_key(
-		&rb_entry(lhs, struct arm_smmu_stream, node)->id, rhs);
+	struct arm_smmu_stream *stream_lhs =
+		rb_entry(lhs, struct arm_smmu_stream, node);
+	struct arm_smmu_stream *stream_rhs =
+		rb_entry(rhs, struct arm_smmu_stream, node);
+
+	if (stream_lhs->id < stream_rhs->id)
+		return -1;
+	if (stream_lhs->id > stream_rhs->id)
+		return 1;
+
+	/*
+	 * The stream table can have multiple nodes with the same ID if there
+	 * are DMA aliases.
+	 */
+	if (stream_lhs < stream_rhs)
+		return -1;
+	if (stream_lhs > stream_rhs)
+		return 1;
+	return 0;
 }
 
 static struct arm_smmu_master *
@@ -1776,6 +1793,16 @@  arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
 	node = rb_find(&sid, &smmu->streams, arm_smmu_streams_cmp_key);
 	if (!node)
 		return NULL;
+	/*
+	 * If there are DMA alises then there are multiple devices with the same
+	 * stream ID and we cannot reliably convert from SID to master.
+	 */
+	if (node->rb_left &&
+	    rb_entry(node->rb_left, struct arm_smmu_stream, node)->id == sid)
+		return NULL;
+	if (node->rb_right &&
+	    rb_entry(node->rb_right, struct arm_smmu_stream, node)->id == sid)
+		return NULL;
 	return rb_entry(node, struct arm_smmu_stream, node)->master;
 }