diff mbox series

[v3,1/1] PCI: Add translated request only flag for pci_enable_pasid()

Message ID 20230114073420.759989-1-baolu.lu@linux.intel.com (mailing list archive)
State Not Applicable
Delegated to: Bjorn Helgaas
Headers show
Series [v3,1/1] PCI: Add translated request only flag for pci_enable_pasid() | expand

Commit Message

Baolu Lu Jan. 14, 2023, 7:34 a.m. UTC
The PCIe fabric routes Memory Requests based on the TLP address, ignoring
the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI:
Enable PASID only when ACS RR & UF enabled on upstream path") requires
some ACS features being supported on device's upstream path when enabling
PCI/PASID.

One alternative is ATS/PRI which lets the device resolve the PASID + addr
pair before a memory request is made into a routeable TLB address through
the translation agent. Those resolved addresses are then cached on the
device instead of in the IOMMU TLB and the device always sets translated
bit for PASID. One example of those devices are AMD graphic devices that
always have ACS or ATS/PRI enabled together with PASID.

This adds a flag parameter in the pci_enable_pasid() helper, with which
the device driver could opt-in the fact that device always sets the
translated bit for PASID.

It also applies this opt-in for AMD graphic devices. Without this change,
kernel boots to black screen on a system with below AMD graphic device:

00:01.0 VGA compatible controller: Advanced Micro Devices, Inc.
        [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca)
        (prog-if 00 [VGA controller])
	DeviceName: ATI EG BROADWAY
	Subsystem: Hewlett-Packard Company Device 8332

At present, it is a common practice to enable/disable PCI PASID in the
iommu drivers. Considering that the device driver knows more about the
specific device, we will follow up by moving pci_enable_pasid() into
the specific device drivers.

Fixes: 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on upstream path")
Reported-and-tested-by: Matt Fagnani <matt.fagnani@bell.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216865
Link: https://lore.kernel.org/r/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 include/linux/pci-ats.h                     | 6 ++++--
 drivers/iommu/amd/iommu.c                   | 2 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +-
 drivers/iommu/intel/iommu.c                 | 3 ++-
 drivers/pci/ats.c                           | 8 ++++++--
 5 files changed, 14 insertions(+), 7 deletions(-)

Change log:
v3:
 - Replace ATS with ATS/PRI in commit message;
 - Refine a code comment;
 - Patch tested by Matt Fagnani.
v2:https://lore.kernel.org/linux-iommu/20230113014409.752405-1-baolu.lu@linux.intel.com/
 - Convert the bool to a named flag;
 - Convert TRANSLED to XLATED.
v1:
 - https://lore.kernel.org/linux-iommu/20230112084629.737653-1-baolu.lu@linux.intel.com/

Comments

Jason Gunthorpe Jan. 16, 2023, 3:42 p.m. UTC | #1
On Sat, Jan 14, 2023 at 03:34:20PM +0800, Lu Baolu wrote:
> The PCIe fabric routes Memory Requests based on the TLP address, ignoring
> the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI:
> Enable PASID only when ACS RR & UF enabled on upstream path") requires
> some ACS features being supported on device's upstream path when enabling
> PCI/PASID.
> 
> One alternative is ATS/PRI which lets the device resolve the PASID + addr
> pair before a memory request is made into a routeable TLB address through
> the translation agent. Those resolved addresses are then cached on the
> device instead of in the IOMMU TLB and the device always sets translated
> bit for PASID. One example of those devices are AMD graphic devices that
> always have ACS or ATS/PRI enabled together with PASID.
> 
> This adds a flag parameter in the pci_enable_pasid() helper, with which
> the device driver could opt-in the fact that device always sets the
> translated bit for PASID.
> 
> It also applies this opt-in for AMD graphic devices. Without this change,
> kernel boots to black screen on a system with below AMD graphic device:
> 
> 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc.
>         [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca)
>         (prog-if 00 [VGA controller])
> 	DeviceName: ATI EG BROADWAY
> 	Subsystem: Hewlett-Packard Company Device 8332
> 
> At present, it is a common practice to enable/disable PCI PASID in the
> iommu drivers. Considering that the device driver knows more about the
> specific device, we will follow up by moving pci_enable_pasid() into
> the specific device drivers.
> 
> Fixes: 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on upstream path")
> Reported-and-tested-by: Matt Fagnani <matt.fagnani@bell.net>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216865
> Link: https://lore.kernel.org/r/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Suggested-by: Christian König <christian.koenig@amd.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>  include/linux/pci-ats.h                     | 6 ++++--
>  drivers/iommu/amd/iommu.c                   | 2 +-
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +-
>  drivers/iommu/intel/iommu.c                 | 3 ++-
>  drivers/pci/ats.c                           | 8 ++++++--
>  5 files changed, 14 insertions(+), 7 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason
Thorsten Leemhuis Jan. 27, 2023, 11:30 a.m. UTC | #2
Hi, this is your Linux kernel regression tracker. Top-posting for once,
to make this easily accessible to everyone.

What happened to below patch? It looks like there was no progress since
ten days now. Or did I miss something?

Reminder, the patch is fixing a regression, hence it would be good if
this could be fixed rather sooner than later.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

On 16.01.23 16:42, Jason Gunthorpe wrote:
> On Sat, Jan 14, 2023 at 03:34:20PM +0800, Lu Baolu wrote:
>> The PCIe fabric routes Memory Requests based on the TLP address, ignoring
>> the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI:
>> Enable PASID only when ACS RR & UF enabled on upstream path") requires
>> some ACS features being supported on device's upstream path when enabling
>> PCI/PASID.
>>
>> One alternative is ATS/PRI which lets the device resolve the PASID + addr
>> pair before a memory request is made into a routeable TLB address through
>> the translation agent. Those resolved addresses are then cached on the
>> device instead of in the IOMMU TLB and the device always sets translated
>> bit for PASID. One example of those devices are AMD graphic devices that
>> always have ACS or ATS/PRI enabled together with PASID.
>>
>> This adds a flag parameter in the pci_enable_pasid() helper, with which
>> the device driver could opt-in the fact that device always sets the
>> translated bit for PASID.
>>
>> It also applies this opt-in for AMD graphic devices. Without this change,
>> kernel boots to black screen on a system with below AMD graphic device:
>>
>> 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc.
>>         [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca)
>>         (prog-if 00 [VGA controller])
>> 	DeviceName: ATI EG BROADWAY
>> 	Subsystem: Hewlett-Packard Company Device 8332
>>
>> At present, it is a common practice to enable/disable PCI PASID in the
>> iommu drivers. Considering that the device driver knows more about the
>> specific device, we will follow up by moving pci_enable_pasid() into
>> the specific device drivers.
>>
>> Fixes: 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on upstream path")
>> Reported-and-tested-by: Matt Fagnani <matt.fagnani@bell.net>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216865
>> Link: https://lore.kernel.org/r/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/
>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
>> Suggested-by: Christian König <christian.koenig@amd.com>
>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
>> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
>> ---
>>  include/linux/pci-ats.h                     | 6 ++++--
>>  drivers/iommu/amd/iommu.c                   | 2 +-
>>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +-
>>  drivers/iommu/intel/iommu.c                 | 3 ++-
>>  drivers/pci/ats.c                           | 8 ++++++--
>>  5 files changed, 14 insertions(+), 7 deletions(-)
> 
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> Jason
Bjorn Helgaas Jan. 27, 2023, 5:30 p.m. UTC | #3
On Sat, Jan 14, 2023 at 03:34:20PM +0800, Lu Baolu wrote:
> The PCIe fabric routes Memory Requests based on the TLP address, ignoring
> the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI:
> Enable PASID only when ACS RR & UF enabled on upstream path") requires
> some ACS features being supported on device's upstream path when enabling
> PCI/PASID.
> 
> One alternative is ATS/PRI which lets the device resolve the PASID + addr
> pair before a memory request is made into a routeable TLB address through
> the translation agent.

This sounds like "ATS/PRI" is a solution to a problem, but we haven't
stated the problem yet.

> Those resolved addresses are then cached on the
> device instead of in the IOMMU TLB and the device always sets translated
> bit for PASID. One example of those devices are AMD graphic devices that
> always have ACS or ATS/PRI enabled together with PASID.
> 
> This adds a flag parameter in the pci_enable_pasid() helper, with which
> the device driver could opt-in the fact that device always sets the
> translated bit for PASID.

Nit: "Add a flag ..." and "Apply this opt-in ..." (below).

> It also applies this opt-in for AMD graphic devices. Without this change,
> kernel boots to black screen on a system with below AMD graphic device:
> 
> 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc.
>         [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca)
>         (prog-if 00 [VGA controller])
> 	DeviceName: ATI EG BROADWAY
> 	Subsystem: Hewlett-Packard Company Device 8332

What is the underlying failure here?  "Black screen" is useful but we
should say *why* that happens, e.g., transactions went the wrong place
or whatever.

> At present, it is a common practice to enable/disable PCI PASID in the
> iommu drivers. Considering that the device driver knows more about the
> specific device, we will follow up by moving pci_enable_pasid() into
> the specific device drivers.

> @@ -353,12 +353,15 @@ void pci_pasid_init(struct pci_dev *pdev)
>   * pci_enable_pasid - Enable the PASID capability
>   * @pdev: PCI device structure
>   * @features: Features to enable
> + * @flags: device-specific flags
> + *   - PCI_PASID_XLATED_REQ_ONLY: The PCI device always use translated type
> + *                                for all PASID memory requests.

s/use/uses/

I guess PCI_PASID_XLATED_REQ_ONLY is something only the driver knows,
right?  We can't deduce from architected config space that the device
will produce PASID prefixes for every Memory Request, can we?

Bjorn
Tian, Kevin Jan. 28, 2023, 7:52 a.m. UTC | #4
> From: Bjorn Helgaas <helgaas@kernel.org>
> Sent: Saturday, January 28, 2023 1:31 AM
> 
> On Sat, Jan 14, 2023 at 03:34:20PM +0800, Lu Baolu wrote:
> 
> > @@ -353,12 +353,15 @@ void pci_pasid_init(struct pci_dev *pdev)
> >   * pci_enable_pasid - Enable the PASID capability
> >   * @pdev: PCI device structure
> >   * @features: Features to enable
> > + * @flags: device-specific flags
> > + *   - PCI_PASID_XLATED_REQ_ONLY: The PCI device always use translated
> type
> > + *                                for all PASID memory requests.
> 
> s/use/uses/
> 
> I guess PCI_PASID_XLATED_REQ_ONLY is something only the driver knows,
> right?  We can't deduce from architected config space that the device

Yes

> will produce PASID prefixes for every Memory Request, can we?
> 

No, we can't. PASID cap only indicates that the device is capable of
using PASID prefix, not a mandatory requirement on every memory
request.

Similar case is PRI. Having PRI enabled only means the device is
capable of handling I/O page faults, not the indicator that it can
100% tolerate fault in every memory access. That is the main reason
why vfio/iommufd can't simply skip pinning guest memory when
seeing PRI enabled (well, though PRI is not supported yet). It has to
be opted-in via a driver specific way e.g. a vfio variant driver.

Thanks
Kevin
Baolu Lu Jan. 29, 2023, 8:42 a.m. UTC | #5
Hi Bjorn,

Thanks for your review comments.

On 2023/1/28 1:30, Bjorn Helgaas wrote:
> On Sat, Jan 14, 2023 at 03:34:20PM +0800, Lu Baolu wrote:
>> The PCIe fabric routes Memory Requests based on the TLP address, ignoring
>> the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI:
>> Enable PASID only when ACS RR & UF enabled on upstream path") requires
>> some ACS features being supported on device's upstream path when enabling
>> PCI/PASID.
>>
>> One alternative is ATS/PRI which lets the device resolve the PASID + addr
>> pair before a memory request is made into a routeable TLB address through
>> the translation agent.
> 
> This sounds like "ATS/PRI" is a solution to a problem, but we haven't
> stated the problem yet.
> 
>> Those resolved addresses are then cached on the
>> device instead of in the IOMMU TLB and the device always sets translated
>> bit for PASID. One example of those devices are AMD graphic devices that
>> always have ACS or ATS/PRI enabled together with PASID.
>>
>> This adds a flag parameter in the pci_enable_pasid() helper, with which
>> the device driver could opt-in the fact that device always sets the
>> translated bit for PASID.
> 
> Nit: "Add a flag ..." and "Apply this opt-in ..." (below).
> 
>> It also applies this opt-in for AMD graphic devices. Without this change,
>> kernel boots to black screen on a system with below AMD graphic device:
>>
>> 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc.
>>          [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca)
>>          (prog-if 00 [VGA controller])
>> 	DeviceName: ATI EG BROADWAY
>> 	Subsystem: Hewlett-Packard Company Device 8332
> 
> What is the underlying failure here?  "Black screen" is useful but we
> should say *why* that happens, e.g., transactions went the wrong place
> or whatever.

All above make sense to me. I post my new commit message at the end of
this reply.

> 
>> At present, it is a common practice to enable/disable PCI PASID in the
>> iommu drivers. Considering that the device driver knows more about the
>> specific device, we will follow up by moving pci_enable_pasid() into
>> the specific device drivers.
> 
>> @@ -353,12 +353,15 @@ void pci_pasid_init(struct pci_dev *pdev)
>>    * pci_enable_pasid - Enable the PASID capability
>>    * @pdev: PCI device structure
>>    * @features: Features to enable
>> + * @flags: device-specific flags
>> + *   - PCI_PASID_XLATED_REQ_ONLY: The PCI device always use translated type
>> + *                                for all PASID memory requests.
> 
> s/use/uses/

Yes.

> 
> I guess PCI_PASID_XLATED_REQ_ONLY is something only the driver knows,
> right?  We can't deduce from architected config space that the device
> will produce PASID prefixes for every Memory Request, can we?

No, we can't. That's the reason why we need a flag here.

[ Below is an updated commit message. Hope it can describe things
   clearly.]

PCI: Add translated request only flag for pci_enable_pasid()

The PCIe fabric routes Memory Requests based on the TLP address, ignoring
the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI:
Enable PASID only when ACS RR & UF enabled on upstream path") requires
some ACS features being supported on device's upstream path when enabling
PCI/PASID.

However, above change causes the Linux kernel boots to black screen on a
system with below graphic device:

00:01.0 VGA compatible controller: Advanced Micro Devices, Inc.
         [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca)
         (prog-if 00 [VGA controller])
         DeviceName: ATI EG BROADWAY
         Subsystem: Hewlett-Packard Company Device 8332

The kernel trace looks like below:

  Call Trace:
   <TASK>
   amd_iommu_attach_device+0x2e0/0x300
   __iommu_attach_device+0x1b/0x90
   iommu_attach_group+0x65/0xa0
   amd_iommu_init_device+0x16b/0x250 [iommu_v2]
   kfd_iommu_resume+0x4c/0x1a0 [amdgpu]
   kgd2kfd_resume_iommu+0x12/0x30 [amdgpu]
   kgd2kfd_device_init.cold+0x346/0x49a [amdgpu]
   amdgpu_amdkfd_device_init+0x142/0x1d0 [amdgpu]
   amdgpu_device_init.cold+0x19f5/0x1e21 [amdgpu]
   ? _raw_spin_lock_irqsave+0x23/0x50
   amdgpu_driver_load_kms+0x15/0x110 [amdgpu]
   amdgpu_pci_probe+0x161/0x370 [amdgpu]
   local_pci_probe+0x41/0x80
   pci_device_probe+0xb3/0x220
   really_probe+0xde/0x380
   ? pm_runtime_barrier+0x50/0x90
   __driver_probe_device+0x78/0x170
   driver_probe_device+0x1f/0x90
   __driver_attach+0xce/0x1c0
   ? __pfx___driver_attach+0x10/0x10
   bus_for_each_dev+0x73/0xa0
   bus_add_driver+0x1ae/0x200
   driver_register+0x89/0xe0
   ? __pfx_init_module+0x10/0x10 [amdgpu]
   do_one_initcall+0x59/0x230
   do_init_module+0x4a/0x200
   __do_sys_init_module+0x157/0x180
   do_syscall_64+0x5b/0x80
   ? handle_mm_fault+0xff/0x2f0
   ? do_user_addr_fault+0x1ef/0x690
   ? exc_page_fault+0x70/0x170
   entry_SYSCALL_64_after_hwframe+0x72/0xdc

The AMD iommu driver allocates a new domain (called v2 domain) for the
amdgpu device and enables its PCI PASID/ATS/PRI before attaching the
v2 domain to it. The failure of pci_enable_pasid() due to lack of ACS
causes the domain attaching device to fail. The amdgpu device is unable
to DMA normally, resulting in a black screen of the system.

However, this device is special as it relies on ATS/PRI to resolve the
PASID + addr pair before a memory request is made into a routeable TLB
address through the translation agent. Those resolved addresses are then
cached on the device instead of in the IOMMU TLB and the device always
uses translated memory request for PASID.

ACS is not necessary for the devices that always use translated memory
request for PASID. But this is device specific and only device driver
knows this. We can't deduce this from architected config space.

Add a flag for pci_enable_pasid(), with which the device drivers could
opt-in the fact that device always uses translated memory requests for
PASID hence the ACS is not a necessity. Apply this opt-in for above AMD
graphic device.

At present, it is a common practice to enable/disable PCI PASID in the
iommu drivers. Considering that the device driver knows more about the
specific device, it's better to move pci_enable_pasid() into the specific
device drivers.
[-- end --]

--
Best regards,
baolu
Bjorn Helgaas Jan. 30, 2023, 6:38 p.m. UTC | #6
On Sun, Jan 29, 2023 at 04:42:32PM +0800, Baolu Lu wrote:
> On 2023/1/28 1:30, Bjorn Helgaas wrote:
> > On Sat, Jan 14, 2023 at 03:34:20PM +0800, Lu Baolu wrote:
> > > The PCIe fabric routes Memory Requests based on the TLP address, ignoring
> > > the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI:
> > > Enable PASID only when ACS RR & UF enabled on upstream path") requires
> > > some ACS features being supported on device's upstream path when enabling
> > > PCI/PASID.
> > > 
> > > One alternative is ATS/PRI which lets the device resolve the PASID + addr
> > > pair before a memory request is made into a routeable TLB address through
> > > the translation agent.
> > 
> > This sounds like "ATS/PRI" is a solution to a problem, but we haven't
> > stated the problem yet.
> > 
> > > Those resolved addresses are then cached on the
> > > device instead of in the IOMMU TLB and the device always sets translated
> > > bit for PASID. One example of those devices are AMD graphic devices that
> > > always have ACS or ATS/PRI enabled together with PASID.
> > > 
> > > This adds a flag parameter in the pci_enable_pasid() helper, with which
> > > the device driver could opt-in the fact that device always sets the
> > > translated bit for PASID.
> > >
> > > It also applies this opt-in for AMD graphic devices. Without this change,
> > > kernel boots to black screen on a system with below AMD graphic device:
> > > 
> > > 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc.
> > >          [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca)
> > >          (prog-if 00 [VGA controller])
> > > 	DeviceName: ATI EG BROADWAY
> > > 	Subsystem: Hewlett-Packard Company Device 8332
> > 
> > What is the underlying failure here?  "Black screen" is useful but we
> > should say *why* that happens, e.g., transactions went the wrong place
> > or whatever.

> > I guess PCI_PASID_XLATED_REQ_ONLY is something only the driver knows,
> > right?  We can't deduce from architected config space that the device
> > will produce PASID prefixes for every Memory Request, can we?
> 
> No, we can't. That's the reason why we need a flag here.

Sorry, I'm still confused.  PCI_PASID_XLATED_REQ_ONLY is a
device-specific property, and you want to opt-in AMD graphics devices.
Where's the AMD graphics-specific change?  The current patch does
this:

  pdev_pri_ats_enable
    pci_enable_pasid(pdev, 0, PCI_PASID_XLATED_REQ_ONLY)

which looks like it does it for *all* devices below an AMD IOMMU,
without any device or driver input.

PCIe r6.0, sec 6.20.1:

  A Function is not permitted to generate Requests using Translated
  Addresses and a PASID unless both PASID Enable and Translated
  Requests with PASID Enable are Set.

You want AMD graphics devices to do DMA with translated addresses and
PASID, right?  pci_enable_pasid() sets PASID Enable
(PCI_PASID_CTRL_ENABLE), but I don't see where "Translated Requests
with PASID Enable" is set.  We don't even have a #define for it.

I would think we should check "Translated Requests with PASID
Supported" before setting "Translated Requests with PASID Enable",
too?

> PCI: Add translated request only flag for pci_enable_pasid()
> 
> The PCIe fabric routes Memory Requests based on the TLP address, ignoring
> the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI:
> Enable PASID only when ACS RR & UF enabled on upstream path") requires
> some ACS features being supported on device's upstream path when enabling
> PCI/PASID.
> 
> However, above change causes the Linux kernel boots to black screen on a
> system with below graphic device:

We need a PCIe concept-level description of the issue first, i.e., in
terms of DMA, PASID, ACS, etc.  Then we can mention the AMD GPU issue
as an instance.

> 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc.
>         [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca)
>         (prog-if 00 [VGA controller])
>         DeviceName: ATI EG BROADWAY
>         Subsystem: Hewlett-Packard Company Device 8332
> 
> The kernel trace looks like below:
> 
>  Call Trace:
>   <TASK>
>   amd_iommu_attach_device+0x2e0/0x300
>   __iommu_attach_device+0x1b/0x90
>   iommu_attach_group+0x65/0xa0
>   amd_iommu_init_device+0x16b/0x250 [iommu_v2]
>   kfd_iommu_resume+0x4c/0x1a0 [amdgpu]
>   kgd2kfd_resume_iommu+0x12/0x30 [amdgpu]
>   kgd2kfd_device_init.cold+0x346/0x49a [amdgpu]
>   amdgpu_amdkfd_device_init+0x142/0x1d0 [amdgpu]
>   amdgpu_device_init.cold+0x19f5/0x1e21 [amdgpu]
>   ? _raw_spin_lock_irqsave+0x23/0x50
>   amdgpu_driver_load_kms+0x15/0x110 [amdgpu]
>   amdgpu_pci_probe+0x161/0x370 [amdgpu]
>   local_pci_probe+0x41/0x80
>   pci_device_probe+0xb3/0x220
>   really_probe+0xde/0x380
>   ? pm_runtime_barrier+0x50/0x90
>   __driver_probe_device+0x78/0x170
>   driver_probe_device+0x1f/0x90
>   __driver_attach+0xce/0x1c0
>   ? __pfx___driver_attach+0x10/0x10
>   bus_for_each_dev+0x73/0xa0
>   bus_add_driver+0x1ae/0x200
>   driver_register+0x89/0xe0
>   ? __pfx_init_module+0x10/0x10 [amdgpu]
>   do_one_initcall+0x59/0x230
>   do_init_module+0x4a/0x200
>   __do_sys_init_module+0x157/0x180
>   do_syscall_64+0x5b/0x80
>   ? handle_mm_fault+0xff/0x2f0
>   ? do_user_addr_fault+0x1ef/0x690
>   ? exc_page_fault+0x70/0x170
>   entry_SYSCALL_64_after_hwframe+0x72/0xdc

The stack trace doesn't seem like it shows a failure, so I'm not sure
it's useful this time.  If it is, we can at least strip out the
irrelevant pieces.

> The AMD iommu driver allocates a new domain (called v2 domain) for the

"v2 domain" needs to be something greppable -- an identifier,
filename, etc.

> amdgpu device and enables its PCI PASID/ATS/PRI before attaching the
> v2 domain to it. The failure of pci_enable_pasid() due to lack of ACS
> causes the domain attaching device to fail. The amdgpu device is unable
> to DMA normally, resulting in a black screen of the system.
> 
> However, this device is special as it relies on ATS/PRI to resolve the
> PASID + addr pair before a memory request is made into a routeable TLB
> address through the translation agent. Those resolved addresses are then
> cached on the device instead of in the IOMMU TLB and the device always
> uses translated memory request for PASID.
> 
> ACS is not necessary for the devices that always use translated memory
> request for PASID. But this is device specific and only device driver
> knows this. We can't deduce this from architected config space.
> 
> Add a flag for pci_enable_pasid(), with which the device drivers could
> opt-in the fact that device always uses translated memory requests for
> PASID hence the ACS is not a necessity. Apply this opt-in for above AMD
> graphic device.
> 
> At present, it is a common practice to enable/disable PCI PASID in the
> iommu drivers. Considering that the device driver knows more about the
> specific device, it's better to move pci_enable_pasid() into the specific
> device drivers.
> [-- end --]
> 
> --
> Best regards,
> baolu
Jason Gunthorpe Jan. 30, 2023, 6:47 p.m. UTC | #7
On Mon, Jan 30, 2023 at 12:38:10PM -0600, Bjorn Helgaas wrote:

> Sorry, I'm still confused.  PCI_PASID_XLATED_REQ_ONLY is a
> device-specific property, and you want to opt-in AMD graphics devices.
> Where's the AMD graphics-specific change?  The current patch does
> this:
> 
>   pdev_pri_ats_enable
>     pci_enable_pasid(pdev, 0, PCI_PASID_XLATED_REQ_ONLY)
> 
> which looks like it does it for *all* devices below an AMD IOMMU,
> without any device or driver input.

AMD GPU has a private interface to AMD IOMMU to support PASID support
that only it uses.

When AMD's IOMMU driver is cleaned up to use the common PASID API this
will have to be moved into the GPU driver.

But we are not there yet..

Jason
Baolu Lu Jan. 31, 2023, 12:25 p.m. UTC | #8
On 2023/1/31 2:38, Bjorn Helgaas wrote:
> PCIe r6.0, sec 6.20.1:
> 
>    A Function is not permitted to generate Requests using Translated
>    Addresses and a PASID unless both PASID Enable and Translated
>    Requests with PASID Enable are Set.
> 
> You want AMD graphics devices to do DMA with translated addresses and
> PASID, right?  pci_enable_pasid() sets PASID Enable
> (PCI_PASID_CTRL_ENABLE), but I don't see where "Translated Requests
> with PASID Enable" is set.  We don't even have a #define for it.
> 
> I would think we should check "Translated Requests with PASID
> Supported" before setting "Translated Requests with PASID Enable",
> too?

This seems to be an ECN for PCIe 5.x:

https://members.pcisig.com/wg/PCI-SIG/document/14929

What I read from this ECN is that,

With this ECN, translated memory requests for PASIDs are not allowed to
carry a PASID prefix if "Translated Requests with PASID Enabled" is not
set. It does not mean whether the device can generate translated memory
requests for PASID, but whether the memory request can carry a PASID
prefix.

Best regards,
baolu
Baolu Lu Jan. 31, 2023, 12:56 p.m. UTC | #9
On 2023/1/31 2:38, Bjorn Helgaas wrote:
>> PCI: Add translated request only flag for pci_enable_pasid()
>>
>> The PCIe fabric routes Memory Requests based on the TLP address, ignoring
>> the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI:
>> Enable PASID only when ACS RR & UF enabled on upstream path") requires
>> some ACS features being supported on device's upstream path when enabling
>> PCI/PASID.
>>
>> However, above change causes the Linux kernel boots to black screen on a
>> system with below graphic device:
> We need a PCIe concept-level description of the issue first, i.e., in
> terms of DMA, PASID, ACS, etc.  Then we can mention the AMD GPU issue
> as an instance.

How about below description?

PCIe endpoints can use ATS to request DMA remapping hardware to
translate an IOVA to its mapped physical address. If the translation is
missing or the permissions are insufficient, the PRI is used to trigger
an I/O page fault. The IOMMU driver will fill the mapping with desired
permissions and return the translated address to the device.

The translated address is specified by the IOMMU driver. The IOMMU
driver ensures that the address is a DMA buffer address instead of any
P2P address in the PCI fabric. Therefore, any translated memory request
will eventually be routed to IOMMU regardless of whether there is ACS
control in the up-streaming path.

AMD GPU is one of those devices. Furthermore, it always uses translated
memory requests for PASID.

> 
>> 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc.
>>          [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca)
>>          (prog-if 00 [VGA controller])
>>          DeviceName: ATI EG BROADWAY
>>          Subsystem: Hewlett-Packard Company Device 8332
>>
>> The kernel trace looks like below:
>>
>>   Call Trace:
>>    <TASK>
>>    amd_iommu_attach_device+0x2e0/0x300
>>    __iommu_attach_device+0x1b/0x90
>>    iommu_attach_group+0x65/0xa0
>>    amd_iommu_init_device+0x16b/0x250 [iommu_v2]
>>    kfd_iommu_resume+0x4c/0x1a0 [amdgpu]
>>    kgd2kfd_resume_iommu+0x12/0x30 [amdgpu]
>>    kgd2kfd_device_init.cold+0x346/0x49a [amdgpu]
>>    amdgpu_amdkfd_device_init+0x142/0x1d0 [amdgpu]
>>    amdgpu_device_init.cold+0x19f5/0x1e21 [amdgpu]
>>    ? _raw_spin_lock_irqsave+0x23/0x50
>>    amdgpu_driver_load_kms+0x15/0x110 [amdgpu]
>>    amdgpu_pci_probe+0x161/0x370 [amdgpu]
>>    local_pci_probe+0x41/0x80
>>    pci_device_probe+0xb3/0x220
>>    really_probe+0xde/0x380
>>    ? pm_runtime_barrier+0x50/0x90
>>    __driver_probe_device+0x78/0x170
>>    driver_probe_device+0x1f/0x90
>>    __driver_attach+0xce/0x1c0
>>    ? __pfx___driver_attach+0x10/0x10
>>    bus_for_each_dev+0x73/0xa0
>>    bus_add_driver+0x1ae/0x200
>>    driver_register+0x89/0xe0
>>    ? __pfx_init_module+0x10/0x10 [amdgpu]
>>    do_one_initcall+0x59/0x230
>>    do_init_module+0x4a/0x200
>>    __do_sys_init_module+0x157/0x180
>>    do_syscall_64+0x5b/0x80
>>    ? handle_mm_fault+0xff/0x2f0
>>    ? do_user_addr_fault+0x1ef/0x690
>>    ? exc_page_fault+0x70/0x170
>>    entry_SYSCALL_64_after_hwframe+0x72/0xdc
> The stack trace doesn't seem like it shows a failure, so I'm not sure
> it's useful this time.  If it is, we can at least strip out the
> irrelevant pieces.

I will drop above from the commit message.

> 
>> The AMD iommu driver allocates a new domain (called v2 domain) for the
> "v2 domain" needs to be something greppable -- an identifier,
> filename, etc.
> 

The code reads,

2052         if (iommu_feature(iommu, FEATURE_GT) &&
2053             iommu_feature(iommu, FEATURE_PPR)) {
2054                 iommu->is_iommu_v2   = true;

So, how about

..The AMD GPU has a private interface to its own AMD IOMMU, which could
be detected by the FEATURE_GT && FEATURE_PPR features. The AMD iommu
driver allocates a special domain for the GPU device ..

?

Best regards,
baolu
Bjorn Helgaas Jan. 31, 2023, 11:50 p.m. UTC | #10
On Mon, Jan 30, 2023 at 02:47:32PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 30, 2023 at 12:38:10PM -0600, Bjorn Helgaas wrote:
> 
> > Sorry, I'm still confused.  PCI_PASID_XLATED_REQ_ONLY is a
> > device-specific property, and you want to opt-in AMD graphics devices.
> > Where's the AMD graphics-specific change?  The current patch does
> > this:
> > 
> >   pdev_pri_ats_enable
> >     pci_enable_pasid(pdev, 0, PCI_PASID_XLATED_REQ_ONLY)
> > 
> > which looks like it does it for *all* devices below an AMD IOMMU,
> > without any device or driver input.
> 
> AMD GPU has a private interface to AMD IOMMU to support PASID support
> that only it uses.

What is it that makes this a private interface?  It seems like we will
end up enabling PASID on any device below an AMD v2 IOMMU.

I do see the DRM amdgpu_amdkfd_device_init() path that ends up at
domain_enable_v2().  Maybe that's it?

But amd_iommu_domain_alloc() also leads to domain_enable_v2(), and
that's pretty generic, so it looks like we set PD_IOMMUV2_MASK
whenever the IOMMU supports it.

And it seems like we call pci_enable_pasid() for all endpoints below
a v2 IOMMU.  Of course, if the endpoint doesn't have a PASID cap,
pci_enable_pasid() fails (and pdev_pri_ats_enable() probably splats
useless warnings when we try to disable PRI and PASID which haven't
been enabled).

I guess I'm trying to convince myself that no harm in enabling PASID
for any device below an AMD v2 IOMMU.  But I don't think a device is
*required* to use translated addresses with PASID, and if it uses
untranslated addresses with PASID, don't we need ACS to avoid
misrouting?

Bjorn
Bjorn Helgaas Feb. 1, 2023, 12:14 a.m. UTC | #11
On Tue, Jan 31, 2023 at 08:56:13PM +0800, Baolu Lu wrote:
> On 2023/1/31 2:38, Bjorn Helgaas wrote:
> > > PCI: Add translated request only flag for pci_enable_pasid()
> > > 
> > > The PCIe fabric routes Memory Requests based on the TLP address, ignoring
> > > the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI:
> > > Enable PASID only when ACS RR & UF enabled on upstream path") requires
> > > some ACS features being supported on device's upstream path when enabling
> > > PCI/PASID.

Looking up 201007ef707a to see what ensuring system integrity means,
it prevents Memory Requests with PASID, which should always be routed
to the RC, from being mistakenly routed as peer-to-peer requests.

> > > However, above change causes the Linux kernel boots to black screen on a
> > > system with below graphic device:
> >
> > We need a PCIe concept-level description of the issue first, i.e., in
> > terms of DMA, PASID, ACS, etc.  Then we can mention the AMD GPU issue
> > as an instance.
> 
> How about below description?

Thanks, this is exactly the sort of thing I'm looking for.  But my
understanding of ATS/PRI/PASID is weak, so I'm still working through
this.  Tell me when I say something wrong below...

> PCIe endpoints can use ATS to request DMA remapping hardware to
> translate an IOVA to its mapped physical address. If the translation is
> missing or the permissions are insufficient, the PRI is used to trigger
> an I/O page fault. The IOMMU driver will fill the mapping with desired
> permissions and return the translated address to the device.

In PCIe spec language, I think you're saying that a PCIe Function may
contain an ATC.  If the ATC Capability Enable bit is set, the Function
can issue Translation Requests.

The TA (aka IOMMU) will respond with a Translation Completion.  If the
Completion is a CplD, it contains the translated address and the
Function can store the entry in its ATC.  I assume the I/O page fault
case corresponds to a Cpl (with no data) meaning that the TA could not
translate the address.

If the TA doesn't have a mapping with the desired permissions, and the
Function's Page Request Capability Enable bit is set, it may issue a
Page Request Message.  It's up to the TA/IOMMU to make this message
visible to the OS, which can make the page resident, create an IOMMU
mapping, and enable a PRG Response Message.  After the Function
receives the PRG Response Message, it would issue another Translation
Request.

> The translated address is specified by the IOMMU driver. The IOMMU
> driver ensures that the address is a DMA buffer address instead of any
> P2P address in the PCI fabric. Therefore, any translated memory request
> will eventually be routed to IOMMU regardless of whether there is ACS
> control in the up-streaming path.

A Memory Request with an address that is not a P2P address, i.e., it
is not contained in any bridge aperture, will *always* be routed
toward the RC, won't it?  Isn't that the case regardless of whether
the address is translated or untranslated, and even regardless of ACS?

IIUC, ACS basically causes peer-to-peer requests to be routed upstream
instead of directly to the peer.

OK, reading this again, I realize that I just restated exactly what
you had already written, sorry about that.

> AMD GPU is one of those devices.

I guess you mean the AMD GPU has ATS, PRI, and PASID Capabilities?
And furthermore, that the GPU *always* uses Translated addresses with
PASID?

So I guess what's going on here is that if:

  - A device only uses PASID with Translated addresses, and 
  - those Translated addresses are never P2P addresses, then
  - those transactions will always be routed to the RC.  

And this applies even if there is no ACS or ACS doesn't support
PCI_ACS_RR and PCI_ACS_UF.

The black screen happens because ... ?

What can we include in the commit log to help people find this fix?  I
see these in the bugzilla:

  WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:251 pci_disable_pri+0x75/0x80
  WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:419 pci_disable_pasid+0x45/0x50

(These look like defects in pdev_pri_ats_enable(), so really just
distractions)

  kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:9874
  kfd kfd: amdgpu: device 1002:9874 NOT added due to errors
  BUG: kernel NULL pointer dereference, address: 0000000000000058
  RIP: 0010:report_iommu_fault+0x11/0x90

I couldn't figure out the NULL pointer dereference.  I expected it to
be from a BUG() or similar in report_iommu_fault(), but I don't see
that.

> Furthermore, it always uses translated memory requests for PASID.
>
> > > 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc.
> > >          [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca)
> > >          (prog-if 00 [VGA controller])
> > >          DeviceName: ATI EG BROADWAY
> > >          Subsystem: Hewlett-Packard Company Device 8332

> > > The AMD iommu driver allocates a new domain (called v2 domain) for the
> > "v2 domain" needs to be something greppable -- an identifier,
> > filename, etc.
> 
> The code reads,
> 
> 2052         if (iommu_feature(iommu, FEATURE_GT) &&
> 2053             iommu_feature(iommu, FEATURE_PPR)) {
> 2054                 iommu->is_iommu_v2   = true;
> 
> So, how about
> 
> ..The AMD GPU has a private interface to its own AMD IOMMU, which could
> be detected by the FEATURE_GT && FEATURE_PPR features. The AMD iommu
> driver allocates a special domain for the GPU device ..

Where is this special domain allocated?  I think the above tests for
*IOMMU* features (I assume "GTSup: Guest translations supported" and
"PPRSup: Peripheral page request support" based on the AMD IOMMU
spec).  It doesn't test that this is a GPU.

This change doesn't feel safe for all possible devices that have a
PASID Capability because we don't know whether they *always* use
Translated addresses with PASID TLPs.

Bjorn
Jason Gunthorpe Feb. 1, 2023, 2:28 a.m. UTC | #12
On Tue, Jan 31, 2023 at 05:50:52PM -0600, Bjorn Helgaas wrote:
> On Mon, Jan 30, 2023 at 02:47:32PM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 30, 2023 at 12:38:10PM -0600, Bjorn Helgaas wrote:
> > 
> > > Sorry, I'm still confused.  PCI_PASID_XLATED_REQ_ONLY is a
> > > device-specific property, and you want to opt-in AMD graphics devices.
> > > Where's the AMD graphics-specific change?  The current patch does
> > > this:
> > > 
> > >   pdev_pri_ats_enable
> > >     pci_enable_pasid(pdev, 0, PCI_PASID_XLATED_REQ_ONLY)
> > > 
> > > which looks like it does it for *all* devices below an AMD IOMMU,
> > > without any device or driver input.
> > 
> > AMD GPU has a private interface to AMD IOMMU to support PASID support
> > that only it uses.
> 
> What is it that makes this a private interface? 

The symbol names start with "amd"

drivers/iommu/amd/init.c:EXPORT_SYMBOL(amd_iommu_snp_en);
drivers/iommu/amd/init.c:EXPORT_SYMBOL(amd_iommu_v2_supported);
drivers/iommu/amd/init.c:EXPORT_SYMBOL(amd_iommu_pc_get_max_banks);
drivers/iommu/amd/init.c:EXPORT_SYMBOL(amd_iommu_pc_supported);
drivers/iommu/amd/init.c:EXPORT_SYMBOL(amd_iommu_pc_get_max_counters);
drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_register_ga_log_notifier);
drivers/iommu/amd/iommu.c:EXPORT_SYMBOL_GPL(amd_iommu_is_attach_deferred);
drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_register_ppr_notifier);
drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_unregister_ppr_notifier);
drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_domain_direct_map);
drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_domain_enable_v2);
drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_flush_page);
drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_flush_tlb);
drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_domain_set_gcr3);
drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_domain_clear_gcr3);
drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_complete_ppr);
drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_device_info);
drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_activate_guest_mode);
drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_deactivate_guest_mode);
drivers/iommu/amd/iommu.c:EXPORT_SYMBOL(amd_iommu_update_ga);
drivers/iommu/amd/iommu_v2.c:EXPORT_SYMBOL(amd_iommu_bind_pasid);
drivers/iommu/amd/iommu_v2.c:EXPORT_SYMBOL(amd_iommu_unbind_pasid);
drivers/iommu/amd/iommu_v2.c:EXPORT_SYMBOL(amd_iommu_init_device);
drivers/iommu/amd/iommu_v2.c:EXPORT_SYMBOL(amd_iommu_free_device);
drivers/iommu/amd/iommu_v2.c:EXPORT_SYMBOL(amd_iommu_set_invalid_ppr_cb);
drivers/iommu/amd/iommu_v2.c:EXPORT_SYMBOL(amd_iommu_set_invalidate_ctx_cb);

A driver should not be using EXPORT_SYMBOL at all, this is all
superseded by the core code that has now been created, but this 
has not been cleaned up.

So the troublesome PASID bit is here:

drivers/gpu/drm/amd/amdkfd/kfd_iommu.c: err = amd_iommu_bind_pasid(dev->adev->pdev, p->pasid, p->lead_thread);
drivers/gpu/drm/amd/amdkfd/kfd_iommu.c:         err = amd_iommu_bind_pasid(kfd->adev->pdev, p->pasid,

And the logic AMD iommu uses to call pci_enable_pasid() is in the
wrong place, it should be in drm/amd someplace not in the iommu
drivers.

This is all more stuff to fix

> But amd_iommu_domain_alloc() also leads to domain_enable_v2(), and
> that's pretty generic, so it looks like we set PD_IOMMUV2_MASK
> whenever the IOMMU supports it.

Yes, it is all sort of messy still.

AMD and ARM have a requirement that the RID page table format be in a
certain way to be able to enable the PASID decoded in the iommu

So the iommu drivers are trying to guess what page table format to use
based on the PCI caps, and wrongly turning on PASID mode at the same
time.

> I guess I'm trying to convince myself that no harm in enabling PASID
> for any device below an AMD v2 IOMMU.  But I don't think a device is
> *required* to use translated addresses with PASID, and if it uses
> untranslated addresses with PASID, don't we need ACS to avoid
> misrouting?

PASID enabling via config space doesn't actually do much - it is
attaching a PASID at the iommu and attempting to operate the device
with a PASID that is the key item.

So right now, the only thing in the kernel which can do that is amdkfd
because of the private interface. AMD says amdkfd HW always issues ATS
with a PASID and never a MemRd/Wr, which is why it works at all.

Jason
Jason Gunthorpe Feb. 1, 2023, 2:36 a.m. UTC | #13
On Tue, Jan 31, 2023 at 06:14:19PM -0600, Bjorn Helgaas wrote:

> > AMD GPU is one of those devices.
> 
> I guess you mean the AMD GPU has ATS, PRI, and PASID Capabilities?
> And furthermore, that the GPU *always* uses Translated addresses with
> PASID?

I'm not versed in the spec lingo, but the GPU issues MemRd/Wrs with
the translated bit set and no PASID header - which is the correct form
for an address that was translated by ATS.

To get to that it issues ATS requests, and only the ATS related
requests will carry the PASID.

ATS related requests always route to the root port, which is why it is
functionally equivalent to ACS RR/UF in these cases.

Translated requests always route where they are supposed to go, even
with P2P and things.

> And this applies even if there is no ACS or ACS doesn't support
> PCI_ACS_RR and PCI_ACS_UF.
> 
> The black screen happens because ... ?

AMD GPU driver bugs blow up if it cannot setup PASID.

> I couldn't figure out the NULL pointer dereference.  I expected it to
> be from a BUG() or similar in report_iommu_fault(), but I don't see
> that.

IIRC it is a buggy error unwind handling in the AMD GPU driver.
 
Jason
Vasant Hegde Feb. 1, 2023, 5:18 a.m. UTC | #14
Bjorn,


On 2/1/2023 5:44 AM, Bjorn Helgaas wrote:
> On Tue, Jan 31, 2023 at 08:56:13PM +0800, Baolu Lu wrote:
>> On 2023/1/31 2:38, Bjorn Helgaas wrote:
>>>> PCI: Add translated request only flag for pci_enable_pasid()
>>>>
>>>> The PCIe fabric routes Memory Requests based on the TLP address, ignoring
>>>> the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI:
>>>> Enable PASID only when ACS RR & UF enabled on upstream path") requires
>>>> some ACS features being supported on device's upstream path when enabling
>>>> PCI/PASID.
> 
> Looking up 201007ef707a to see what ensuring system integrity means,
> it prevents Memory Requests with PASID, which should always be routed
> to the RC, from being mistakenly routed as peer-to-peer requests.
> 
>>>> However, above change causes the Linux kernel boots to black screen on a
>>>> system with below graphic device:
>>>
>>> We need a PCIe concept-level description of the issue first, i.e., in
>>> terms of DMA, PASID, ACS, etc.  Then we can mention the AMD GPU issue
>>> as an instance.
>>
>> How about below description?
> 
> Thanks, this is exactly the sort of thing I'm looking for.  But my
> understanding of ATS/PRI/PASID is weak, so I'm still working through
> this.  Tell me when I say something wrong below...
> 
>> PCIe endpoints can use ATS to request DMA remapping hardware to
>> translate an IOVA to its mapped physical address. If the translation is
>> missing or the permissions are insufficient, the PRI is used to trigger
>> an I/O page fault. The IOMMU driver will fill the mapping with desired
>> permissions and return the translated address to the device.
> 
> In PCIe spec language, I think you're saying that a PCIe Function may
> contain an ATC.  If the ATC Capability Enable bit is set, the Function
> can issue Translation Requests.
> 
> The TA (aka IOMMU) will respond with a Translation Completion.  If the
> Completion is a CplD, it contains the translated address and the
> Function can store the entry in its ATC.  I assume the I/O page fault
> case corresponds to a Cpl (with no data) meaning that the TA could not
> translate the address.
> 
> If the TA doesn't have a mapping with the desired permissions, and the
> Function's Page Request Capability Enable bit is set, it may issue a
> Page Request Message.  It's up to the TA/IOMMU to make this message
> visible to the OS, which can make the page resident, create an IOMMU
> mapping, and enable a PRG Response Message.  After the Function
> receives the PRG Response Message, it would issue another Translation
> Request.
> 
>> The translated address is specified by the IOMMU driver. The IOMMU
>> driver ensures that the address is a DMA buffer address instead of any
>> P2P address in the PCI fabric. Therefore, any translated memory request
>> will eventually be routed to IOMMU regardless of whether there is ACS
>> control in the up-streaming path.
> 
> A Memory Request with an address that is not a P2P address, i.e., it
> is not contained in any bridge aperture, will *always* be routed
> toward the RC, won't it?  Isn't that the case regardless of whether
> the address is translated or untranslated, and even regardless of ACS?
> 
> IIUC, ACS basically causes peer-to-peer requests to be routed upstream
> instead of directly to the peer.
> 
> OK, reading this again, I realize that I just restated exactly what
> you had already written, sorry about that.
> 
>> AMD GPU is one of those devices.
> 
> I guess you mean the AMD GPU has ATS, PRI, and PASID Capabilities?
> And furthermore, that the GPU *always* uses Translated addresses with
> PASID?
> 
> So I guess what's going on here is that if:
> 
>   - A device only uses PASID with Translated addresses, and 
>   - those Translated addresses are never P2P addresses, then
>   - those transactions will always be routed to the RC.  
> 
> And this applies even if there is no ACS or ACS doesn't support
> PCI_ACS_RR and PCI_ACS_UF.
> 
> The black screen happens because ... ?
> 
> What can we include in the commit log to help people find this fix?  I
> see these in the bugzilla:
> 
>   WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:251 pci_disable_pri+0x75/0x80
>   WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:419 pci_disable_pasid+0x45/0x50
> 
> (These look like defects in pdev_pri_ats_enable(), so really just
> distractions)

Right. We have fixed error handling path in this function. Joerg has queued the fix.

> 
>   kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:9874
>   kfd kfd: amdgpu: device 1002:9874 NOT added due to errors
>   BUG: kernel NULL pointer dereference, address: 0000000000000058
>   RIP: 0010:report_iommu_fault+0x11/0x90
> 
> I couldn't figure out the NULL pointer dereference.  I expected it to
> be from a BUG() or similar in report_iommu_fault(), but I don't see
> that.

Its coming from below path :
  - During system boot IOMMU allocates default domain
  - AMD IOMMU v2 module (iommu_v2) created another domain and tried to attach
devices to new domain.
  - In device attachment path (amd_iommu_attach_device()) it first detaches
device from current domain and tries to attach device to new domain. Here device
attachment failed as PASID enable check failed.
  - We didn't recover from above failure (I have proposed fix for this [1]).
  - So device to domain attachment is not in consistent state.
  - Device tried to do DMA and hit IO fault. Above NULL pointer derefence is
coming from that path as dev to domain setup is not proper.

[1]
https://lore.kernel.org/linux-iommu/20230113135956.5788-1-vasant.hegde@amd.com/T/#t

-Vasant
Baolu Lu Feb. 1, 2023, 5:51 a.m. UTC | #15
On 2023/2/1 8:14, Bjorn Helgaas wrote:
>> The translated address is specified by the IOMMU driver. The IOMMU
>> driver ensures that the address is a DMA buffer address instead of any
>> P2P address in the PCI fabric. Therefore, any translated memory request
>> will eventually be routed to IOMMU regardless of whether there is ACS
>> control in the up-streaming path.
> A Memory Request with an address that is not a P2P address, i.e., it
> is not contained in any bridge aperture, will*always*  be routed
> toward the RC, won't it? 

Yes.

> Isn't that the case regardless of whether
> the address is translated or untranslated, and even regardless of ACS?

They are different. The translated addresses are approved by the Linux
kernel. But untranslated addresses are not. Malicious or buggy userspace
applications could program the device to DMA to addresses locating in
the P2P aperture.

> IIUC, ACS basically causes peer-to-peer requests to be routed upstream
> instead of directly to the peer.

Yes.

Best regards,
baolu
Baolu Lu Feb. 1, 2023, 5:59 a.m. UTC | #16
On 2023/2/1 8:14, Bjorn Helgaas wrote:
> On Tue, Jan 31, 2023 at 08:56:13PM +0800, Baolu Lu wrote:
>> On 2023/1/31 2:38, Bjorn Helgaas wrote:
>>>> PCI: Add translated request only flag for pci_enable_pasid()
>>>>
>>>> The PCIe fabric routes Memory Requests based on the TLP address, ignoring
>>>> the PASID. In order to ensure system integrity, commit 201007ef707a ("PCI:
>>>> Enable PASID only when ACS RR & UF enabled on upstream path") requires
>>>> some ACS features being supported on device's upstream path when enabling
>>>> PCI/PASID.
> 
> Looking up 201007ef707a to see what ensuring system integrity means,
> it prevents Memory Requests with PASID, which should always be routed
> to the RC, from being mistakenly routed as peer-to-peer requests.

Yes, exactly. I will update above paragraph like below

The PCIe fabric routes Memory Requests based on the TLP address, ignoring
the PASID. In order to prevent Memory Requests with PASID, which should
always be routed to the RC, from being mistakenly routed as peer-to-peer
requests, commit 201007ef707a ("PCI: Enable PASID only when ACS RR & UF
enabled on upstream path") requires some ACS features being supported on
device's upstream path when enabling PCI/PASID.


> 
>>>> However, above change causes the Linux kernel boots to black screen on a
>>>> system with below graphic device:
>>>
>>> We need a PCIe concept-level description of the issue first, i.e., in
>>> terms of DMA, PASID, ACS, etc.  Then we can mention the AMD GPU issue
>>> as an instance.
>>
>> How about below description?
> 
> Thanks, this is exactly the sort of thing I'm looking for.  But my
> understanding of ATS/PRI/PASID is weak, so I'm still working through
> this.  Tell me when I say something wrong below...
> 
>> PCIe endpoints can use ATS to request DMA remapping hardware to
>> translate an IOVA to its mapped physical address. If the translation is
>> missing or the permissions are insufficient, the PRI is used to trigger
>> an I/O page fault. The IOMMU driver will fill the mapping with desired
>> permissions and return the translated address to the device.
> 
> In PCIe spec language, I think you're saying that a PCIe Function may
> contain an ATC.  If the ATC Capability Enable bit is set, the Function
> can issue Translation Requests.
> 
> The TA (aka IOMMU) will respond with a Translation Completion.  If the
> Completion is a CplD, it contains the translated address and the
> Function can store the entry in its ATC.  I assume the I/O page fault
> case corresponds to a Cpl (with no data) meaning that the TA could not
> translate the address.
> 
> If the TA doesn't have a mapping with the desired permissions, and the
> Function's Page Request Capability Enable bit is set, it may issue a
> Page Request Message.  It's up to the TA/IOMMU to make this message
> visible to the OS, which can make the page resident, create an IOMMU
> mapping, and enable a PRG Response Message.  After the Function
> receives the PRG Response Message, it would issue another Translation
> Request.
> 
>> The translated address is specified by the IOMMU driver. The IOMMU
>> driver ensures that the address is a DMA buffer address instead of any
>> P2P address in the PCI fabric. Therefore, any translated memory request
>> will eventually be routed to IOMMU regardless of whether there is ACS
>> control in the up-streaming path.
> 
> A Memory Request with an address that is not a P2P address, i.e., it
> is not contained in any bridge aperture, will *always* be routed
> toward the RC, won't it?  Isn't that the case regardless of whether
> the address is translated or untranslated, and even regardless of ACS?
> 
> IIUC, ACS basically causes peer-to-peer requests to be routed upstream
> instead of directly to the peer.
> 
> OK, reading this again, I realize that I just restated exactly what
> you had already written, sorry about that.

Never mind. It's really a good chance for me to learn how to describe
this from the perspective of the PCI subsystem.. :-)

> 
>> AMD GPU is one of those devices.
> 
> I guess you mean the AMD GPU has ATS, PRI, and PASID Capabilities?
> And furthermore, that the GPU *always* uses Translated addresses with
> PASID?
> 
> So I guess what's going on here is that if:
> 
>    - A device only uses PASID with Translated addresses, and
>    - those Translated addresses are never P2P addresses, then
>    - those transactions will always be routed to the RC.
> 
> And this applies even if there is no ACS or ACS doesn't support
> PCI_ACS_RR and PCI_ACS_UF.

Yes.

> 
> The black screen happens because ... ?
> 
> What can we include in the commit log to help people find this fix?  I
> see these in the bugzilla:
> 
>    WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:251 pci_disable_pri+0x75/0x80
>    WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:419 pci_disable_pasid+0x45/0x50
> 
> (These look like defects in pdev_pri_ats_enable(), so really just
> distractions)
> 
>    kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:9874
>    kfd kfd: amdgpu: device 1002:9874 NOT added due to errors
>    BUG: kernel NULL pointer dereference, address: 0000000000000058
>    RIP: 0010:report_iommu_fault+0x11/0x90
> 
> I couldn't figure out the NULL pointer dereference.  I expected it to
> be from a BUG() or similar in report_iommu_fault(), but I don't see
> that.

Above has been answered by Vasant Hegde in a separated reply.

Best regards,
baolu
Baolu Lu Feb. 1, 2023, 6:31 a.m. UTC | #17
On 2023/2/1 8:14, Bjorn Helgaas wrote:
>>>> The AMD iommu driver allocates a new domain (called v2 domain) for the
>>> "v2 domain" needs to be something greppable -- an identifier,
>>> filename, etc.
>> The code reads,
>>
>> 2052         if (iommu_feature(iommu, FEATURE_GT) &&
>> 2053             iommu_feature(iommu, FEATURE_PPR)) {
>> 2054                 iommu->is_iommu_v2   = true;
>>
>> So, how about
>>
>> ..The AMD GPU has a private interface to its own AMD IOMMU, which could
>> be detected by the FEATURE_GT && FEATURE_PPR features. The AMD iommu
>> driver allocates a special domain for the GPU device ..
> Where is this special domain allocated?  I think the above tests for
> *IOMMU*  features (I assume "GTSup: Guest translations supported" and
> "PPRSup: Peripheral page request support" based on the AMD IOMMU
> spec).  It doesn't test that this is a GPU.

 From the discussion, my understanding is that the IOMMU is a GPU
dedicated IOMMU. The translated-request-only feature is enumerated
through the IOMMU feature bits. However, I am not familiar with AMD
architecture. AMD guys may have a better explanation.

> This change doesn't feel safe for all possible devices that have a
> PASID Capability because we don't know whether they*always*  use
> Translated addresses with PASID TLPs.

I tend to think you are right. :-)

Best regards,
baolu
Jonathan Cameron Feb. 1, 2023, 2:09 p.m. UTC | #18
On Tue, 31 Jan 2023 22:36:27 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Jan 31, 2023 at 06:14:19PM -0600, Bjorn Helgaas wrote:
> 
> > > AMD GPU is one of those devices.  
> > 
> > I guess you mean the AMD GPU has ATS, PRI, and PASID Capabilities?
> > And furthermore, that the GPU *always* uses Translated addresses with
> > PASID?  
> 
> I'm not versed in the spec lingo, but the GPU issues MemRd/Wrs with
> the translated bit set and no PASID header - which is the correct form
> for an address that was translated by ATS.

FWIW there is a capability bit and enable bit in the PASID cap/control
registers that says whether a device can/should add a PASID to a
translated request or not.  I think the intent is that a host can
sanity check AT requests to make sure the device isn't making them
up. To do that it needs the PASID.  Not sure any hosts do this yet
though ;)

Not worth much, but I thought it always sent the PASID so dug out spec
to check (I was wrong as it is both optional and configurable).

> 
> To get to that it issues ATS requests, and only the ATS related
> requests will carry the PASID.
> 
> ATS related requests always route to the root port, which is why it is
> functionally equivalent to ACS RR/UF in these cases.
> 
> Translated requests always route where they are supposed to go, even
> with P2P and things.
> 
> > And this applies even if there is no ACS or ACS doesn't support
> > PCI_ACS_RR and PCI_ACS_UF.
> > 
> > The black screen happens because ... ?  
> 
> AMD GPU driver bugs blow up if it cannot setup PASID.
> 
> > I couldn't figure out the NULL pointer dereference.  I expected it to
> > be from a BUG() or similar in report_iommu_fault(), but I don't see
> > that.  
> 
> IIRC it is a buggy error unwind handling in the AMD GPU driver.
>  
> Jason
Jason Gunthorpe Feb. 1, 2023, 2:22 p.m. UTC | #19
On Wed, Feb 01, 2023 at 02:31:04PM +0800, Baolu Lu wrote:
> > This change doesn't feel safe for all possible devices that have a
> > PASID Capability because we don't know whether they*always*  use
> > Translated addresses with PASID TLPs.
> 
> I tend to think you are right. :-)

This is why we discussed moving the pci_enable_pasid() into drivers
and out of the core iommu code as a followup

Jason
Bjorn Helgaas Feb. 1, 2023, 4:58 p.m. UTC | #20
On Tue, Jan 31, 2023 at 08:25:05PM +0800, Baolu Lu wrote:
> On 2023/1/31 2:38, Bjorn Helgaas wrote:
> > PCIe r6.0, sec 6.20.1:
> > 
> >    A Function is not permitted to generate Requests using Translated
> >    Addresses and a PASID unless both PASID Enable and Translated
> >    Requests with PASID Enable are Set.
> > 
> > You want AMD graphics devices to do DMA with translated addresses and
> > PASID, right?  pci_enable_pasid() sets PASID Enable
> > (PCI_PASID_CTRL_ENABLE), but I don't see where "Translated Requests
> > with PASID Enable" is set.  We don't even have a #define for it.
> > 
> > I would think we should check "Translated Requests with PASID
> > Supported" before setting "Translated Requests with PASID Enable",
> > too?
> 
> This seems to be an ECN for PCIe 5.x:
> 
> https://members.pcisig.com/wg/PCI-SIG/document/14929
> 
> What I read from this ECN is that,
> 
> With this ECN, translated memory requests for PASIDs are not allowed to
> carry a PASID prefix if "Translated Requests with PASID Enabled" is not
> set. It does not mean whether the device can generate translated memory
> requests for PASID, but whether the memory request can carry a PASID
> prefix.

My assumption that "you want AMD graphics devices to do DMA with
translated addresses and PASID" was wrong.

Per Jason [1], it sounds like the AMD GPU generates Translation
Requests (sec 10.2.2) with a PASID.  The GPU will cache the translated
address from the Translation Completion in its local ATC, and will do
DMA (MemRd/Wr) with that translated address but *without* PASID
prefixes.

That makes sense because (PASID, IOVA) maps to a translated address,
e.g., a a CPU physical address, and the GPU can DMA to that address
directly without needing the PASID.

Bjorn

[1] https://lore.kernel.org/r/Y9nQK9P3HOxEeZ4U@nvidia.com
Baolu Lu Feb. 2, 2023, 3:08 a.m. UTC | #21
On 2023/2/2 0:58, Bjorn Helgaas wrote:
> On Tue, Jan 31, 2023 at 08:25:05PM +0800, Baolu Lu wrote:
>> On 2023/1/31 2:38, Bjorn Helgaas wrote:
>>> PCIe r6.0, sec 6.20.1:
>>>
>>>     A Function is not permitted to generate Requests using Translated
>>>     Addresses and a PASID unless both PASID Enable and Translated
>>>     Requests with PASID Enable are Set.
>>>
>>> You want AMD graphics devices to do DMA with translated addresses and
>>> PASID, right?  pci_enable_pasid() sets PASID Enable
>>> (PCI_PASID_CTRL_ENABLE), but I don't see where "Translated Requests
>>> with PASID Enable" is set.  We don't even have a #define for it.
>>>
>>> I would think we should check "Translated Requests with PASID
>>> Supported" before setting "Translated Requests with PASID Enable",
>>> too?
>> This seems to be an ECN for PCIe 5.x:
>>
>> https://members.pcisig.com/wg/PCI-SIG/document/14929
>>
>> What I read from this ECN is that,
>>
>> With this ECN, translated memory requests for PASIDs are not allowed to
>> carry a PASID prefix if "Translated Requests with PASID Enabled" is not
>> set. It does not mean whether the device can generate translated memory
>> requests for PASID, but whether the memory request can carry a PASID
>> prefix.
> My assumption that "you want AMD graphics devices to do DMA with
> translated addresses and PASID" was wrong.
> 
> Per Jason [1], it sounds like the AMD GPU generates Translation
> Requests (sec 10.2.2) with a PASID.  The GPU will cache the translated
> address from the Translation Completion in its local ATC, and will do
> DMA (MemRd/Wr) with that translated address but*without*  PASID
> prefixes.
> 
> That makes sense because (PASID, IOVA) maps to a translated address,
> e.g., a a CPU physical address, and the GPU can DMA to that address
> directly without needing the PASID.

Hi Bjorn and Jason,

According to the discussion so far, I refined the commit message like
below.  How does it look like?

PCI: Add translated request only flag for pci_enable_pasid()

The PCIe fabric routes Memory Requests based on the TLP address, ignoring
the PASID. In order to prevent Memory Requests for PASID, which should
always be routed to the RC, from being mistakenly routed as peer-to-peer
requests, commit 201007ef707a ("PCI: Enable PASID only when ACS RR & UF
enabled on upstream path") requires some ACS features being supported on
device's upstream path when enabling PCI/PASID.

However, above change causes the Linux kernel boots to black screen on a
system with below graphic device:

00:01.0 VGA compatible controller: Advanced Micro Devices, Inc.
         [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca)
         (prog-if 00 [VGA controller])
         DeviceName: ATI EG BROADWAY
         Subsystem: Hewlett-Packard Company Device 8332

The AMD iommu driver allocates a special domain for above device and
enables its PCI PASID/ATS/PRI before attaching the domain to it. The
failure of pci_enable_pasid() due to lack of ACS causes the domain
to fail to be attached to the device. The GPU device is unable to DMA
normally, resulting in a black screen of the system.

The PCIe spec defines Address Translation Service in its chapter 10.
A PCIe Function may contain an ATC. If the ATC Capability Enable bit
is set, the Function can issue Translation Requests. The TA (aka IOMMU)
will respond with a Translation Completion. If the Completion is a CplD,
it contains the translated address and the Function can store the entry
in its ATC. If the TA doesn't have a mapping with the desired permissions,
and the Function's Page Request Capability Enable bit is set, it may
issue a Page Request Message. It's up to the TA to make this message
visible to the OS, which can make the page resident, create an IOMMU
mapping, and respond with a PRG Response Message.  After the Function
receives the PRG Response Message, it would issue another Translation
Request. The Function can then obtain a translated address and store the
entry in its ATC.

The AMD integrated GPU together with its dedicated IOMMU implements above
functionality. It generates Translation Requests (sec 10.2.2) with a PASID
and caches the translated address from the Translation Completion in its
local ATC, and will do DMA (MemRd/Wr) with that translated address without
PASID prefixes. This capability (using translated address for PASID memory
requests) could be detected by software from AMD IOMMU feature bits (GTSup:
Guest translations supported" and PPRSup: Peripheral page request support).

ACS is unnecessary for the devices that only use translated memory request
for PASID. All translated addresses are granted by the Linux kernel which
ensures that such addresses will never be in a P2P address, i.e., it's not
contained in any bridge aperture, will *always* be routed toward the RC.

Add a flag for pci_enable_pasid(), with which the drivers could opt-in
the fact that device always uses translated memory requests for PASID
hence the ACS is not a necessity. Apply this opt-in for above AMD graphic
device.

At present, it is a common practice to enable/disable PCI PASID in the
iommu drivers. Considering that the device driver knows more about the
specific device, it is recommended to move pci_enable_pasid() into the
specific device drivers.

Fixes: 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on 
upstream path")
Reported-and-tested-by: Matt Fagnani <matt.fagnani@bell.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216865
Link: 
https://lore.kernel.org/r/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>

--
Best regards,
baolu
Bjorn Helgaas Feb. 2, 2023, 8:12 p.m. UTC | #22
[Joerg, you may be able to answer this.  Patch under discussion is:
https://lore.kernel.org/r/20230114073420.759989-1-baolu.lu@linux.intel.com]

On Thu, Feb 02, 2023 at 11:08:25AM +0800, Baolu Lu wrote:
> ...

> ACS is unnecessary for the devices that only use translated memory request
> for PASID. All translated addresses are granted by the Linux kernel which
> ensures that such addresses will never be in a P2P address, i.e., it's not
> contained in any bridge aperture, will *always* be routed toward the RC.

Re 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on
upstream path"), does that commit actually *fix* anything?  I wonder
whether we could revert it completely.

The intent of 201007ef707a is to use ACS to prevent misrouting, which
would happen if a TLP contained an address that *looked* like a PCI
bus address, i.e., it was inside a host bridge aperture, but was
*intended* to reach an IOMMU or main memory directly.

201007ef707a only affects pci_enable_pasid(), so I think we already
avoid this misrouting by restricting DMA address allocation for both
non-IOMMU scenarios and non-PASID IOMMU scenarios.

So what about PASID mappings, e.g., consider a mapping of (Requester
ID, PASID, Untranslated Address) -> Translated Address?  If either the
Untranslated Address or the Translated Address looks like a PCI bus
address, a Memory Request or Translation Request could be misrouted.

Does that actually happen?  I assume it does not happen for Translated
Addresses because that's basically the non-IOMMU case, and we don't
need ACS to prevent misrouting there.

Do IOMMUs allocate (PASID, Untranslated Addresses) that look like PCI
bus addresses?

Bjorn
Jason Gunthorpe Feb. 2, 2023, 8:45 p.m. UTC | #23
On Thu, Feb 02, 2023 at 02:12:49PM -0600, Bjorn Helgaas wrote:
> [Joerg, you may be able to answer this.  Patch under discussion is:
> https://lore.kernel.org/r/20230114073420.759989-1-baolu.lu@linux.intel.com]
> 
> On Thu, Feb 02, 2023 at 11:08:25AM +0800, Baolu Lu wrote:
> > ...
> 
> > ACS is unnecessary for the devices that only use translated memory request
> > for PASID. All translated addresses are granted by the Linux kernel which
> > ensures that such addresses will never be in a P2P address, i.e., it's not
> > contained in any bridge aperture, will *always* be routed toward the RC.
> 
> Re 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled on
> upstream path"), does that commit actually *fix* anything?  I wonder
> whether we could revert it completely.
> 
> The intent of 201007ef707a is to use ACS to prevent misrouting, which
> would happen if a TLP contained an address that *looked* like a PCI
> bus address, i.e., it was inside a host bridge aperture, but was
> *intended* to reach an IOMMU or main memory directly.

Yes.

> 201007ef707a only affects pci_enable_pasid(), so I think we already
> avoid this misrouting by restricting DMA address allocation for both
> non-IOMMU scenarios and non-PASID IOMMU scenarios.

There is no restriction on DMA address allocation with PASID.

The typical PASID use case is to point the PASID at the CPU page table
and then all VA's are fair game by userspace. There is no carve out
like the DMA API has to protect from bus address confusion.
 
> So what about PASID mappings, e.g., consider a mapping of (Requester
> ID, PASID, Untranslated Address) -> Translated Address?  If either the
> Untranslated Address or the Translated Address looks like a PCI bus
> address, a Memory Request or Translation Request could be misrouted.

The PCI rules are a bit complicated:
 - A simple MemRd/Wr with a PASID will be routed according to the
   address. This can be mis-routed
 - A ATS translation request with a PASID is always routed to the host
   bridge
 - A MemRd/Wr with Translated set and no PASID is always routed to the
   correct destination, even if that is not the host bridge

> Do IOMMUs allocate (PASID, Untranslated Addresses) that look like PCI
> bus addresses?

Yes, because it is mapped to a mm_struct userspace can use any mmap to
access any valid address as an IOVA and thus PASID tagged translation
must never become confused with bus addresses.

Further, and worse, the common use model for PASID SVA is for
userspace to directly submit IOVA to the device for operation. If
userspace can submit a hostile IOVA and cause DMA to reach something
that is not the host bridge then system security is completely
wrecked.

So, as an API in Linux we felt it was best to only enable PASID if
PASID is secure and truely isolated, otherwise leave PASID off. The
use cases for insecure PASID seem small.

Jason
Bjorn Helgaas Feb. 3, 2023, 6:20 p.m. UTC | #24
[+cc Alex in case you're interested in the ACS angle]

On Thu, Feb 02, 2023 at 04:45:05PM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 02, 2023 at 02:12:49PM -0600, Bjorn Helgaas wrote:
> > On Thu, Feb 02, 2023 at 11:08:25AM +0800, Baolu Lu wrote:
> > > ...
> > 
> > > ACS is unnecessary for the devices that only use translated
> > > memory request for PASID. All translated addresses are granted
> > > by the Linux kernel which ensures that such addresses will never
> > > be in a P2P address, i.e., it's not contained in any bridge
> > > aperture, will *always* be routed toward the RC.
> > 
> > Re 201007ef707a ("PCI: Enable PASID only when ACS RR & UF enabled
> > on upstream path"), does that commit actually *fix* anything?  I
> > wonder whether we could revert it completely.
> > 
> > The intent of 201007ef707a is to use ACS to prevent misrouting,
> > which would happen if a TLP contained an address that *looked*
> > like a PCI bus address, i.e., it was inside a host bridge
> > aperture, but was *intended* to reach an IOMMU or main memory
> > directly.
> 
> Yes.
> 
> > 201007ef707a only affects pci_enable_pasid(), so I think we
> > already avoid this misrouting by restricting DMA address
> > allocation for both non-IOMMU scenarios and non-PASID IOMMU
> > scenarios.
> 
> There is no restriction on DMA address allocation with PASID.
> 
> The typical PASID use case is to point the PASID at the CPU page
> table and then all VA's are fair game by userspace. There is no
> carve out like the DMA API has to protect from bus address
> confusion.

I think you're saying that for (Requester ID, PASID, Untranslated
Address), the Untranslated Address is not restricted at all, and it
may look like a PCI bus address.

> > So what about PASID mappings, e.g., consider a mapping of
> > (Requester ID, PASID, Untranslated Address) -> Translated Address?
> > If either the Untranslated Address or the Translated Address looks
> > like a PCI bus address, a Memory Request or Translation Request
> > could be misrouted.
> 
> The PCI rules are a bit complicated:
>  - A simple MemRd/Wr with a PASID will be routed according to the
>    address. This can be mis-routed
>  - A ATS translation request with a PASID is always routed to the host
>    bridge

From a PCIe point of view, I think these cases are equivalent because
a PASID prefix doesn't affect routing (sec 2.2.10.4).  A Translation
Request includes an Untranslated Address, and if that happens to look
like a PCI bus address, I think it will be mis-routed just like a
MemRd/Wr would be.

>  - A MemRd/Wr with Translated set and no PASID is always routed to the
>    correct destination, even if that is not the host bridge

I don't think Address Type 10b ("Translated") affects routing.  A
MemRd/Wr should be routed to a PCI peer if the Translated Address is
inside a host bridge aperture, or to the host bridge otherwise.

> > Do IOMMUs allocate (PASID, Untranslated Addresses) that look like
> > PCI bus addresses?
> 
> Yes, because it is mapped to a mm_struct userspace can use any mmap
> to access any valid address as an IOVA and thus PASID tagged
> translation must never become confused with bus addresses.

If PCI bus addresses are carved out of the Translated Address arena,
the MemRd/Wr TLPs should be fine, but I think the Translation Requests
that include Untranslated Addresses are still a problem.

> Further, and worse, the common use model for PASID SVA is for
> userspace to directly submit IOVA to the device for operation. If
> userspace can submit a hostile IOVA and cause DMA to reach something
> that is not the host bridge then system security is completely
> wrecked.
> 
> So, as an API in Linux we felt it was best to only enable PASID if
> PASID is secure and truely isolated, otherwise leave PASID off. The
> use cases for insecure PASID seem small.

The patch under discussion is intended to fix a v6.2-rc1 regression
added by 201007ef707a ("PCI: Enable PASID only when ACS RR & UF
enabled on upstream path").

Are we on track to fix this before v6.2?  I don't have a clear
understanding of how we know this change is safe and it only affects
AMD GPU and not other devices below the same IOMMU.

Bjorn
Jason Gunthorpe Feb. 3, 2023, 6:52 p.m. UTC | #25
On Fri, Feb 03, 2023 at 12:20:45PM -0600, Bjorn Helgaas wrote:
> > The PCI rules are a bit complicated:
> >  - A simple MemRd/Wr with a PASID will be routed according to the
> >    address. This can be mis-routed
> >  - A ATS translation request with a PASID is always routed to the host
> >    bridge
> 
> From a PCIe point of view, I think these cases are equivalent because
> a PASID prefix doesn't affect routing (sec 2.2.10.4).  A Translation
> Request includes an Untranslated Address, and if that happens to look
> like a PCI bus address, I think it will be mis-routed just like a
> MemRd/Wr would be.

So, I trusted AMD when they said this is why their platform worked, I
didn't check carefully the ATS spec language.

But looking now, I agree. I don't see any language that says ATS is
routed differently, if anything it says it is routed the same way as
MemRd.

AMD folks?

If yes then this has all been the wrong direction.

> > > Do IOMMUs allocate (PASID, Untranslated Addresses) that look like
> > > PCI bus addresses?
> > 
> > Yes, because it is mapped to a mm_struct userspace can use any mmap
> > to access any valid address as an IOVA and thus PASID tagged
> > translation must never become confused with bus addresses.
> 
> If PCI bus addresses are carved out of the Translated Address arena,
> the MemRd/Wr TLPs should be fine,

Yes, that is my point that Translated requests are fine because they
already carry a Translated address, and by its nature a Translated
address cannot be mis-routed.

The spec actually talks about this whole issue see the NOTE at the end
of 10.1.1

Per that note, for Linux, the "implementation constraint on the TA" is
that the TA requires ACS so that Untranslated always routes to the TA.

The AMD defect is some BIOS's on some of their SOC's don't report ACS
for the MFD, even though it does presumably implement ACS.

> Are we on track to fix this before v6.2?  

AMD? Did the patches you were talking about to fix the oops in the AMD
driver land? Can we rely on that to fix the black screen for v6.2?

Otherwise I think the PCI core is correct here and I'm loath to change
it to work around a GPU driver bug. The GPU driver should understand
that PASID is not supported because the PCI core says so. It shouldn't
crash and blackscreen.

Keep in mind, from a PCI perspective, this protection is a security
senstive check, described in the spec. If we drop it we expose
platforms using SVA to a potential security issue upon
mis-configuration.

So, I'd much rather leave the PCI alone and see that the AMD driver is
fixed.

Jason
Tian, Kevin Feb. 6, 2023, 4:28 a.m. UTC | #26
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, February 4, 2023 2:53 AM
>
> The AMD defect is some BIOS's on some of their SOC's don't report ACS
> for the MFD, even though it does presumably implement ACS.
> 

If that is the case looks AMD driver should implement a pci quirk so
pci_dev_specific_acs_enabled() can correctly report to meet pci
core requirement.
diff mbox series

Patch

diff --git a/include/linux/pci-ats.h b/include/linux/pci-ats.h
index df54cd5b15db..750fca736ef4 100644
--- a/include/linux/pci-ats.h
+++ b/include/linux/pci-ats.h
@@ -4,6 +4,8 @@ 
 
 #include <linux/pci.h>
 
+#define PCI_PASID_XLATED_REQ_ONLY	BIT(0)
+
 #ifdef CONFIG_PCI_ATS
 /* Address Translation Service */
 bool pci_ats_supported(struct pci_dev *dev);
@@ -35,12 +37,12 @@  static inline bool pci_pri_supported(struct pci_dev *pdev)
 #endif /* CONFIG_PCI_PRI */
 
 #ifdef CONFIG_PCI_PASID
-int pci_enable_pasid(struct pci_dev *pdev, int features);
+int pci_enable_pasid(struct pci_dev *pdev, int features, int flags);
 void pci_disable_pasid(struct pci_dev *pdev);
 int pci_pasid_features(struct pci_dev *pdev);
 int pci_max_pasids(struct pci_dev *pdev);
 #else /* CONFIG_PCI_PASID */
-static inline int pci_enable_pasid(struct pci_dev *pdev, int features)
+static inline int pci_enable_pasid(struct pci_dev *pdev, int features, int flags)
 { return -EINVAL; }
 static inline void pci_disable_pasid(struct pci_dev *pdev) { }
 static inline int pci_pasid_features(struct pci_dev *pdev)
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index cbeaab55c0db..64a8c03d7dfa 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1700,7 +1700,7 @@  static int pdev_pri_ats_enable(struct pci_dev *pdev)
 	int ret;
 
 	/* Only allow access to user-accessible pages */
-	ret = pci_enable_pasid(pdev, 0);
+	ret = pci_enable_pasid(pdev, 0, PCI_PASID_XLATED_REQ_ONLY);
 	if (ret)
 		goto out_err;
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index ab160198edd6..891bf53c45dc 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2350,7 +2350,7 @@  static int arm_smmu_enable_pasid(struct arm_smmu_master *master)
 	if (num_pasids <= 0)
 		return num_pasids;
 
-	ret = pci_enable_pasid(pdev, features);
+	ret = pci_enable_pasid(pdev, features, 0);
 	if (ret) {
 		dev_err(&pdev->dev, "Failed to enable PASID\n");
 		return ret;
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 59df7e42fd53..5cc13f02a5ac 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1425,7 +1425,8 @@  static void iommu_enable_pci_caps(struct device_domain_info *info)
 	   undefined. So always enable PASID support on devices which
 	   have it, even if we can't yet know if we're ever going to
 	   use it. */
-	if (info->pasid_supported && !pci_enable_pasid(pdev, info->pasid_supported & ~1))
+	if (info->pasid_supported &&
+	    !pci_enable_pasid(pdev, info->pasid_supported & ~1, 0))
 		info->pasid_enabled = 1;
 
 	if (info->pri_supported &&
diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
index f9cc2e10b676..06168415d6d7 100644
--- a/drivers/pci/ats.c
+++ b/drivers/pci/ats.c
@@ -353,12 +353,15 @@  void pci_pasid_init(struct pci_dev *pdev)
  * pci_enable_pasid - Enable the PASID capability
  * @pdev: PCI device structure
  * @features: Features to enable
+ * @flags: device-specific flags
+ *   - PCI_PASID_XLATED_REQ_ONLY: The PCI device always use translated type
+ *                                for all PASID memory requests.
  *
  * Returns 0 on success, negative value on error. This function checks
  * whether the features are actually supported by the device and returns
  * an error if not.
  */
-int pci_enable_pasid(struct pci_dev *pdev, int features)
+int pci_enable_pasid(struct pci_dev *pdev, int features, int flags)
 {
 	u16 control, supported;
 	int pasid = pdev->pasid_cap;
@@ -382,7 +385,8 @@  int pci_enable_pasid(struct pci_dev *pdev, int features)
 	if (!pasid)
 		return -EINVAL;
 
-	if (!pci_acs_path_enabled(pdev, NULL, PCI_ACS_RR | PCI_ACS_UF))
+	if (!(flags & PCI_PASID_XLATED_REQ_ONLY) &&
+	    !pci_acs_path_enabled(pdev, NULL, PCI_ACS_RR | PCI_ACS_UF))
 		return -EINVAL;
 
 	pci_read_config_word(pdev, pasid + PCI_PASID_CAP, &supported);