
[v4,04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

Message ID 20180423233046.21476-5-logang@deltatee.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas

Commit Message

Logan Gunthorpe April 23, 2018, 11:30 p.m. UTC
For peer-to-peer transactions to work the downstream ports in each
switch must not have the ACS flags set. At this time there is no way
to dynamically change the flags and update the corresponding IOMMU
groups so this is done at enumeration time before the groups are
assigned.

This effectively means that if CONFIG_PCI_P2PDMA is selected then
all devices behind any PCIe switch heirarchy will be in the same IOMMU
group. Which implies that individual devices behind any switch
heirarchy will not be able to be assigned to separate VMs because
there is no isolation between them. Additionally, any malicious PCIe
devices will be able to DMA to memory exposed by other EPs in the same
domain as TLPs will not be checked by the IOMMU.

Given that the intended use case of P2P Memory is for users with
custom hardware designed for purpose, we do not expect distributors
to ever need to enable this option. Users that want to use P2P
must have compiled a custom kernel with this configuration option
and understand the implications regarding ACS. They will either
not require ACS or will have design the system in such a way that
devices that require isolation will be separate from those using P2P
transactions.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/Kconfig        |  9 +++++++++
 drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
 drivers/pci/pci.c          |  6 ++++++
 include/linux/pci-p2pdma.h |  5 +++++
 4 files changed, 50 insertions(+), 15 deletions(-)

Comments

Randy Dunlap April 24, 2018, 3:33 a.m. UTC | #1
On 04/23/2018 04:30 PM, Logan Gunthorpe wrote:
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/pci/Kconfig        |  9 +++++++++
>  drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
>  drivers/pci/pci.c          |  6 ++++++
>  include/linux/pci-p2pdma.h |  5 +++++
>  4 files changed, 50 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>  	  transations must be between devices behind the same root port.
>  	  (Typically behind a network of PCIe switches).
>  
> +	  Enabling this option will also disable ACS on all ports behind
> +	  any PCIe switch. This effectively puts all devices behind any
> +	  switch heirarchy into the same IOMMU group. Which implies that

	         hierarchy                     group, which

and the same fixes in the commit description...

> +	  individual devices behind any switch will not be able to be
> +	  assigned to separate VMs because there is no isolation between
> +	  them. Additionally, any malicious PCIe devices will be able to
> +	  DMA to memory exposed by other EPs in the same domain as TLPs
> +	  will not be checked by the IOMMU.
> +
>  	  If unsure, say N.
>  
>  config PCI_LABEL
Bjorn Helgaas May 7, 2018, 11:13 p.m. UTC | #2
[+to Alex]

Alex,

Are you happy with this strategy of turning off ACS based on
CONFIG_PCI_P2PDMA?  We only check this at enumeration-time and 
I don't know if there are other places we would care?

On Mon, Apr 23, 2018 at 05:30:36PM -0600, Logan Gunthorpe wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the groups are
> assigned.
> 
> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any PCIe switch heirarchy will be in the same IOMMU
> group. Which implies that individual devices behind any switch
> heirarchy will not be able to be assigned to separate VMs because
> there is no isolation between them. Additionally, any malicious PCIe
> devices will be able to DMA to memory exposed by other EPs in the same
> domain as TLPs will not be checked by the IOMMU.
> 
> Given that the intended use case of P2P Memory is for users with
> custom hardware designed for purpose, we do not expect distributors
> to ever need to enable this option. Users that want to use P2P
> must have compiled a custom kernel with this configuration option
> and understand the implications regarding ACS. They will either
> not require ACS or will have design the system in such a way that
> devices that require isolation will be separate from those using P2P
> transactions.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/pci/Kconfig        |  9 +++++++++
>  drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
>  drivers/pci/pci.c          |  6 ++++++
>  include/linux/pci-p2pdma.h |  5 +++++
>  4 files changed, 50 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>  	  transations must be between devices behind the same root port.
>  	  (Typically behind a network of PCIe switches).
>  
> +	  Enabling this option will also disable ACS on all ports behind
> +	  any PCIe switch. This effectively puts all devices behind any
> +	  switch heirarchy into the same IOMMU group. Which implies that

s/heirarchy/hierarchy/ (also above in changelog)

> +	  individual devices behind any switch will not be able to be
> +	  assigned to separate VMs because there is no isolation between
> +	  them. Additionally, any malicious PCIe devices will be able to
> +	  DMA to memory exposed by other EPs in the same domain as TLPs
> +	  will not be checked by the IOMMU.
> +
>  	  If unsure, say N.
>  
>  config PCI_LABEL
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index ed9dce8552a2..e9f43b43acac 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct device *dev)
>  }
>  
>  /*
> - * If a device is behind a switch, we try to find the upstream bridge
> - * port of the switch. This requires two calls to pci_upstream_bridge():
> - * one for the upstream port on the switch, one on the upstream port
> - * for the next level in the hierarchy. Because of this, devices connected
> - * to the root port will be rejected.
> + * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
> + * @pdev: device to disable ACS flags for
> + *
> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
> + * to be disabled on any PCI bridge in order for the TLPS to not be forwarded
> + * up to the RC which is not what we want for P2P.

s/PCI bridge/PCIe switch/ (ACS doesn't apply to conventional PCI)

> + *
> + * This function is called when the devices are first enumerated and
> + * will result in all devices behind any bridge to be in the same IOMMU
> + * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
> + * on this largish hammer. If you need the devices to be in separate groups
> + * don't enable CONFIG_PCI_P2PDMA.
> + *
> + * Returns 1 if the ACS bits for this device was cleared, otherwise 0.
>   */
> -static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>  {
> -	struct pci_dev *up1, *up2;
> +	int pos;
> +	u16 ctrl;
>  
> -	if (!pdev)
> -		return NULL;
> +	if (!pci_is_bridge(pdev))
> +		return 0;
>  
> -	up1 = pci_dev_get(pci_upstream_bridge(pdev));
> -	if (!up1)
> -		return NULL;
> +	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
> +	if (!pos)
> +		return 0;
> +
> +	pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
> +
> +	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
> +
> +	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
>  
> -	up2 = pci_dev_get(pci_upstream_bridge(up1));
> -	pci_dev_put(up1);
> +	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
>  
> -	return up2;
> +	return 1;
>  }
>  
>  /*
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e597655a5643..7e2f5724ba22 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -16,6 +16,7 @@
>  #include <linux/of.h>
>  #include <linux/of_pci.h>
>  #include <linux/pci.h>
> +#include <linux/pci-p2pdma.h>
>  #include <linux/pm.h>
>  #include <linux/slab.h>
>  #include <linux/module.h>
> @@ -2835,6 +2836,11 @@ static void pci_std_enable_acs(struct pci_dev *dev)
>   */
>  void pci_enable_acs(struct pci_dev *dev)
>  {
> +#ifdef CONFIG_PCI_P2PDMA
> +	if (pci_p2pdma_disable_acs(dev))
> +		return;
> +#endif
> +
>  	if (!pci_acs_enable)
>  		return;
>  
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index 0cde88341eeb..fcb3437a2f3c 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -18,6 +18,7 @@ struct block_device;
>  struct scatterlist;
>  
>  #ifdef CONFIG_PCI_P2PDMA
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev);
>  int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>  		u64 offset);
>  int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
> @@ -40,6 +41,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
>  void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
>  			 enum dma_data_direction dir);
>  #else /* CONFIG_PCI_P2PDMA */
> +static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
> +{
> +	return 0;
> +}
>  static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
>  		size_t size, u64 offset)
>  {
> -- 
> 2.11.0
>
Christian König May 8, 2018, 7:17 a.m. UTC | #3
Hi Bjorn,

Am 08.05.2018 um 01:13 schrieb Bjorn Helgaas:
> [+to Alex]
>
> Alex,
>
> Are you happy with this strategy of turning off ACS based on
> CONFIG_PCI_P2PDMA?  We only check this at enumeration-time and
> I don't know if there are other places we would care?

thanks for pointing this out, I totally missed this hack.

AMD APUs mandatory need the ACS flag set for the GPU integrated in the 
CPU when IOMMU is enabled or otherwise you will break SVM.

Similar problems arise when you do this for dedicated GPU, but we 
haven't upstreamed the support for this yet.

So that is a clear NAK from my side for the approach.

And what exactly is the problem here? I'm currently testing P2P with 
GPUs in different IOMMU domains and at least with AMD IOMMUs that works 
perfectly fine.

Regards,
Christian.

>
> On Mon, Apr 23, 2018 at 05:30:36PM -0600, Logan Gunthorpe wrote:
>> For peer-to-peer transactions to work the downstream ports in each
>> switch must not have the ACS flags set. At this time there is no way
>> to dynamically change the flags and update the corresponding IOMMU
>> groups so this is done at enumeration time before the groups are
>> assigned.
>>
>> This effectively means that if CONFIG_PCI_P2PDMA is selected then
>> all devices behind any PCIe switch heirarchy will be in the same IOMMU
>> group. Which implies that individual devices behind any switch
>> heirarchy will not be able to be assigned to separate VMs because
>> there is no isolation between them. Additionally, any malicious PCIe
>> devices will be able to DMA to memory exposed by other EPs in the same
>> domain as TLPs will not be checked by the IOMMU.
>>
>> Given that the intended use case of P2P Memory is for users with
>> custom hardware designed for purpose, we do not expect distributors
>> to ever need to enable this option. Users that want to use P2P
>> must have compiled a custom kernel with this configuration option
>> and understand the implications regarding ACS. They will either
>> not require ACS or will have design the system in such a way that
>> devices that require isolation will be separate from those using P2P
>> transactions.
>>
>> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>> ---
>>   drivers/pci/Kconfig        |  9 +++++++++
>>   drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
>>   drivers/pci/pci.c          |  6 ++++++
>>   include/linux/pci-p2pdma.h |  5 +++++
>>   4 files changed, 50 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
>> index b2396c22b53e..b6db41d4b708 100644
>> --- a/drivers/pci/Kconfig
>> +++ b/drivers/pci/Kconfig
>> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>>   	  transations must be between devices behind the same root port.
>>   	  (Typically behind a network of PCIe switches).
>>   
>> +	  Enabling this option will also disable ACS on all ports behind
>> +	  any PCIe switch. This effectively puts all devices behind any
>> +	  switch heirarchy into the same IOMMU group. Which implies that
> s/heirarchy/hierarchy/ (also above in changelog)
>
>> +	  individual devices behind any switch will not be able to be
>> +	  assigned to separate VMs because there is no isolation between
>> +	  them. Additionally, any malicious PCIe devices will be able to
>> +	  DMA to memory exposed by other EPs in the same domain as TLPs
>> +	  will not be checked by the IOMMU.
>> +
>>   	  If unsure, say N.
>>   
>>   config PCI_LABEL
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index ed9dce8552a2..e9f43b43acac 100644
>> --- a/drivers/pci/p2pdma.c
>> +++ b/drivers/pci/p2pdma.c
>> @@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct device *dev)
>>   }
>>   
>>   /*
>> - * If a device is behind a switch, we try to find the upstream bridge
>> - * port of the switch. This requires two calls to pci_upstream_bridge():
>> - * one for the upstream port on the switch, one on the upstream port
>> - * for the next level in the hierarchy. Because of this, devices connected
>> - * to the root port will be rejected.
>> + * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
>> + * @pdev: device to disable ACS flags for
>> + *
>> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
>> + * to be disabled on any PCI bridge in order for the TLPS to not be forwarded
>> + * up to the RC which is not what we want for P2P.
> s/PCI bridge/PCIe switch/ (ACS doesn't apply to conventional PCI)
>
>> + *
>> + * This function is called when the devices are first enumerated and
>> + * will result in all devices behind any bridge to be in the same IOMMU
>> + * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
>> + * on this largish hammer. If you need the devices to be in separate groups
>> + * don't enable CONFIG_PCI_P2PDMA.
>> + *
>> + * Returns 1 if the ACS bits for this device was cleared, otherwise 0.
>>    */
>> -static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
>> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>>   {
>> -	struct pci_dev *up1, *up2;
>> +	int pos;
>> +	u16 ctrl;
>>   
>> -	if (!pdev)
>> -		return NULL;
>> +	if (!pci_is_bridge(pdev))
>> +		return 0;
>>   
>> -	up1 = pci_dev_get(pci_upstream_bridge(pdev));
>> -	if (!up1)
>> -		return NULL;
>> +	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
>> +	if (!pos)
>> +		return 0;
>> +
>> +	pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
>> +
>> +	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
>> +
>> +	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
>>   
>> -	up2 = pci_dev_get(pci_upstream_bridge(up1));
>> -	pci_dev_put(up1);
>> +	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
>>   
>> -	return up2;
>> +	return 1;
>>   }
>>   
>>   /*
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index e597655a5643..7e2f5724ba22 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -16,6 +16,7 @@
>>   #include <linux/of.h>
>>   #include <linux/of_pci.h>
>>   #include <linux/pci.h>
>> +#include <linux/pci-p2pdma.h>
>>   #include <linux/pm.h>
>>   #include <linux/slab.h>
>>   #include <linux/module.h>
>> @@ -2835,6 +2836,11 @@ static void pci_std_enable_acs(struct pci_dev *dev)
>>    */
>>   void pci_enable_acs(struct pci_dev *dev)
>>   {
>> +#ifdef CONFIG_PCI_P2PDMA
>> +	if (pci_p2pdma_disable_acs(dev))
>> +		return;
>> +#endif
>> +
>>   	if (!pci_acs_enable)
>>   		return;
>>   
>> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
>> index 0cde88341eeb..fcb3437a2f3c 100644
>> --- a/include/linux/pci-p2pdma.h
>> +++ b/include/linux/pci-p2pdma.h
>> @@ -18,6 +18,7 @@ struct block_device;
>>   struct scatterlist;
>>   
>>   #ifdef CONFIG_PCI_P2PDMA
>> +int pci_p2pdma_disable_acs(struct pci_dev *pdev);
>>   int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>>   		u64 offset);
>>   int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
>> @@ -40,6 +41,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
>>   void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
>>   			 enum dma_data_direction dir);
>>   #else /* CONFIG_PCI_P2PDMA */
>> +static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>> +{
>> +	return 0;
>> +}
>>   static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
>>   		size_t size, u64 offset)
>>   {
>> -- 
>> 2.11.0
>>
Stephen Bates May 8, 2018, 2:25 p.m. UTC | #4
Hi Christian

> AMD APUs mandatory need the ACS flag set for the GPU integrated in the 
> CPU when IOMMU is enabled or otherwise you will break SVM.


OK but in this case aren't you losing (many of) the benefits of P2P since all DMAs will now get routed up to the IOMMU before being passed down to the destination PCIe EP?

> Similar problems arise when you do this for dedicated GPU, but we 
> haven't upstreamed the support for this yet.


Hmm, as above. With ACS enabled on all downstream ports any P2P enabled DMA will be routed to the IOMMU which removes a lot of the benefit. 
    
> So that is a clear NAK from my side for the approach.


Do you have an alternative? This is the approach we arrived at after a reasonably lengthy discussion on the mailing lists. Alex, are you still comfortable with this approach?
    
> And what exactly is the problem here?

 
We had a pretty lengthy discussion on this topic on one of the previous revisions. The issue is that currently there is no mechanism in the IOMMU code to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change its topology (due to PCI hotplug) we had to be cognizant of the fact that ACS settings could change. Since there is no way to currently handle changing ACS settings and hence IOMMU groupings the consensus was to simply disable ACS on all ports in a p2pdma domain. This effectively makes all the devices in the p2pdma domain part of the same IOMMU grouping. The plan will be to address this in time and add a mechanism for IOMMU grouping changes and notification to VMs but that's not part of this series. Note you are still allowed to have ACS functioning on other PCI domains so if you do need a plurality of IOMMU groupings you can still achieve it (but you can't do p2pdma across IOMMU groupings, which is safe).

> I'm currently testing P2P with  GPUs in different IOMMU domains and at least with AMD IOMMUs that works perfectly fine.


Yup that should work though again I have to ask are you disabling ACS on the ports between the two peer devices to get the p2p benefit? If not you are not getting all the performance benefit (due to IOMMU routing), if you are then there are obviously security implications between those IOMMU domains if they are assigned to different VMs. And now the issue is if new devices are added and the p2p topology needed to change there would be no way to inform the VMs of any IOMMU group change. 

Cheers

Stephen
Dan Williams May 8, 2018, 2:31 p.m. UTC | #5
On Mon, Apr 23, 2018 at 4:30 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the groups are
> assigned.
>
> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any PCIe switch heirarchy will be in the same IOMMU
> group. Which implies that individual devices behind any switch
> heirarchy will not be able to be assigned to separate VMs because
> there is no isolation between them. Additionally, any malicious PCIe
> devices will be able to DMA to memory exposed by other EPs in the same
> domain as TLPs will not be checked by the IOMMU.
>
> Given that the intended use case of P2P Memory is for users with
> custom hardware designed for purpose, we do not expect distributors
> to ever need to enable this option. Users that want to use P2P
> must have compiled a custom kernel with this configuration option
> and understand the implications regarding ACS. They will either
> not require ACS or will have design the system in such a way that
> devices that require isolation will be separate from those using P2P
> transactions.

>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/pci/Kconfig        |  9 +++++++++
>  drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
>  drivers/pci/pci.c          |  6 ++++++
>  include/linux/pci-p2pdma.h |  5 +++++
>  4 files changed, 50 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>           transations must be between devices behind the same root port.
>           (Typically behind a network of PCIe switches).
>
> +         Enabling this option will also disable ACS on all ports behind
> +         any PCIe switch. This effectively puts all devices behind any
> +         switch heirarchy into the same IOMMU group. Which implies that
> +         individual devices behind any switch will not be able to be
> +         assigned to separate VMs because there is no isolation between
> +         them. Additionally, any malicious PCIe devices will be able to
> +         DMA to memory exposed by other EPs in the same domain as TLPs
> +         will not be checked by the IOMMU.
> +
>           If unsure, say N.

It seems unwieldy that this is a compile time option and not a runtime
option. Can't we have a kernel command line option to opt-in to this
behavior rather than require a wholly separate kernel image?

Why is this text added in a follow on patch and not the patch that
introduced the config option?

I'm also wondering if that command line option can take a 'bus device
function' address of a switch to limit the scope of where ACS is
disabled.
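
For illustration, a minimal sketch of how such an opt-in boot parameter might be parsed and matched against a device at enumeration time. The parameter name and both helpers below are hypothetical and not part of this series:

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/pci.h>
#include <linux/string.h>

/*
 * Hypothetical opt-in parameter, e.g.
 *   p2pdma_disable_acs=0000:03:04.0,0000:03:05.0
 * naming the switch downstream ports whose ACS redirection should be cleared.
 */
static char p2pdma_acs_param[256];

static int __init p2pdma_acs_setup(char *str)
{
	strlcpy(p2pdma_acs_param, str, sizeof(p2pdma_acs_param));
	return 1;
}
__setup("p2pdma_disable_acs=", p2pdma_acs_setup);

/* Was @pdev named on the command line? Could be called from pci_enable_acs(). */
static bool p2pdma_acs_listed(struct pci_dev *pdev)
{
	char buf[sizeof(p2pdma_acs_param)];
	char *str = buf, *tok;
	unsigned int domain, bus, slot, fn;

	strlcpy(buf, p2pdma_acs_param, sizeof(buf));

	while ((tok = strsep(&str, ",")) != NULL) {
		if (sscanf(tok, "%x:%x:%x.%x", &domain, &bus, &slot, &fn) != 4)
			continue;
		if (pci_domain_nr(pdev->bus) == domain &&
		    pdev->bus->number == bus &&
		    pdev->devfn == PCI_DEVFN(slot, fn))
			return true;
	}

	return false;
}

With something like this, pci_enable_acs() could skip clearing ACS unless the port was explicitly listed, instead of the blanket compile-time behaviour in this patch.
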
Stephen Bates May 8, 2018, 2:44 p.m. UTC | #6
Hi Dan

>    It seems unwieldy that this is a compile time option and not a runtime
>    option. Can't we have a kernel command line option to opt-in to this
>    behavior rather than require a wholly separate kernel image?

  
I think because of the security implications associated with p2pdma and ACS we wanted to make it very clear people were choosing one (p2pdma) or the other (IOMMU groupings and isolation). However personally I would prefer including the option of a run-time kernel parameter too. In fact a few months ago I proposed a small patch that did just that [1]. It never really went anywhere but if people were open to the idea we could look at adding it to the series.
  
> Why is this text added in a follow on patch and not the patch that
>  introduced the config option?


Because the ACS section was added later in the series and this information is associated with that additional functionality.
    
> I'm also wondering if that command line option can take a 'bus device
> function' address of a switch to limit the scope of where ACS is
> disabled.


By this you mean the address for either a RP, DSP, USP or MF EP below which we disable ACS? We could do that but I don't think it avoids the issue of changes in IOMMU groupings as devices are added/removed. It simply changes the problem from affecting an entire PCI domain to a sub-set of the domain. We can already handle this by doing p2pdma on one RP and normal IOMMU isolation on the other RPs in the system.

Stephen

[1] https://marc.info/?l=linux-doc&m=150907188310838&w=2
Logan Gunthorpe May 8, 2018, 4:27 p.m. UTC | #7
On 08/05/18 01:17 AM, Christian König wrote:
> AMD APUs mandatory need the ACS flag set for the GPU integrated in the 
> CPU when IOMMU is enabled or otherwise you will break SVM.

Well, given that the current set only disables ACS bits on bridges
(previous versions were only on switches) this shouldn't be an issue for
integrated devices. We do not disable ACS flags globally.

> And what exactly is the problem here? I'm currently testing P2P with 
> GPUs in different IOMMU domains and at least with AMD IOMMUs that works 
> perfectly fine.

In addition to Stephen's comments, seeing we've established a general
need to avoid the root complex (until we have a whitelist at least) we
must have ACS disabled along the path between the devices. Otherwise,
all TLPs will go through the root complex and if there is no support it
will fail.

If the consensus is we want a command line option, then so be it. But
we'll have to deny pretty much all P2P transactions unless the user
correctly disables ACS along the path using the command line option and
this is really annoying for users of this functionality to understand
how to do that correctly.

Logan
Christian König May 8, 2018, 4:37 p.m. UTC | #8
Am 08.05.2018 um 16:25 schrieb Stephen Bates:
>      
> Hi Christian
>
>> AMD APUs mandatory need the ACS flag set for the GPU integrated in the
>> CPU when IOMMU is enabled or otherwise you will break SVM.
> OK but in this case aren't you losing (many of) the benefits of P2P since all DMAs will now get routed up to the IOMMU before being passed down to the destination PCIe EP?

Well I'm not an expert on this, but I think that is an incorrect 
assumption you guys use here.

At least in the default configuration, even with the IOMMU enabled, P2P 
transactions do NOT necessarily travel up to the root complex for 
translation.

It's already late here, but if nobody beats me I'm going to dig up the 
necessary documentation tomorrow.

Regards,
Christian.

>
>> Similar problems arise when you do this for dedicated GPU, but we
>> haven't upstreamed the support for this yet.
> Hmm, as above. With ACS enabled on all downstream ports any P2P enabled DMA will be routed to the IOMMU which removes a lot of the benefit.
>      
>> So that is a clear NAK from my side for the approach.
> Do you have an alternative? This is the approach we arrived it after a reasonably lengthy discussion on the mailing lists. Alex, are you still comfortable with this approach?
>      
>> And what exactly is the problem here?
>   
> We had a pretty lengthy discussion on this topic on one of the previous revisions. The issue is that currently there is no mechanism in the IOMMU code to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change its topology (due to PCI hotplug) we had to be cognizant of the fact that ACS settings could change. Since there is no way to currently handle changing ACS settings and hence IOMMU groupings the consensus was to simply disable ACS on all ports in a p2pdma domain. This effectively makes all the devices in the p2pdma domain part of the same IOMMU grouping. The plan will be to address this in time and add a mechanism for IOMMU grouping changes and notification to VMs but that's not part of this series. Note you are still allowed to have ACS functioning on other PCI domains so if you do not a plurality of IOMMU groupings you can still achieve it (but you can't do p2pdma across IOMMU groupings, which is safe).
>
>> I'm currently testing P2P with  GPUs in different IOMMU domains and at least with AMD IOMMUs that works perfectly fine.
> Yup that should work though again I have to ask are you disabling ACS on the ports between the two peer devices to get the p2p benefit? If not you are not getting all the performance benefit (due to IOMMU routing), if you are then there are obviously security implications between those IOMMU domains if they are assigned to different VMs. And now the issue is if new devices are added and the p2p topology needed to change there would be no way to inform the VMs of any IOMMU group change.
>
> Cheers
>
> Stephen
>      
>
Christian König May 8, 2018, 4:50 p.m. UTC | #9
Am 08.05.2018 um 18:27 schrieb Logan Gunthorpe:
>
> On 08/05/18 01:17 AM, Christian König wrote:
>> AMD APUs mandatory need the ACS flag set for the GPU integrated in the
>> CPU when IOMMU is enabled or otherwise you will break SVM.
> Well, given that the current set only disables ACS bits on bridges
> (previous versions were only on switches) this shouldn't be an issue for
> integrated devices. We do not disable ACS flags globally.

Ok, that is at least a step in the right direction. But I think we 
seriously need to test that for side effects.

>
>> And what exactly is the problem here? I'm currently testing P2P with
>> GPUs in different IOMMU domains and at least with AMD IOMMUs that works
>> perfectly fine.
> In addition to Stephen's comments, seeing we've established a general
> need to avoid the root complex (until we have a whitelist at least) we
> must have ACS disabled along the path between the devices. Otherwise,
> all TLPs will go through the root complex and if there is no support it
> will fail.

Well I'm not an expert on this, but if I'm not completely mistaken that 
is not correct.

E.g. transactions are initially sent to the root complex for 
translation, that's for sure. But at least for AMD GPUs the root complex 
answers with the translated address which is then cached in the device.

So further transactions for the same address range then go directly to 
the destination.

What you don't want is device isolation, because in this case the root 
complex handles the transactions itself. IIRC there were also 
parameters like "force_isolation" and "nobypass" for the IOMMU 
to control that behavior.

It's already late here, but I'm going to dig up the documentation for that 
tomorrow and/or contact a hardware engineer involved in the ACS spec.

Regards,
Christian.

>
> If the consensus is we want a command line option, then so be it. But
> we'll have to deny pretty much all P2P transactions unless the user
> correctly disables ACS along the path using the command line option and
> this is really annoying for users of this functionality to understand
> how to do that correctly.
>
> Logan
Logan Gunthorpe May 8, 2018, 7:13 p.m. UTC | #10
On 08/05/18 10:50 AM, Christian König wrote:
> E.g. transactions are initially send to the root complex for 
> translation, that's for sure. But at least for AMD GPUs the root complex 
> answers with the translated address which is then cached in the device.
> 
> So further transactions for the same address range then go directly to 
> the destination.

Sounds like you are referring to Address Translation Services (ATS).
This is quite separate from ACS and, to my knowledge, isn't widely
supported by switch hardware.

Logan
Alex Williamson May 8, 2018, 7:34 p.m. UTC | #11
On Tue, 8 May 2018 13:13:40 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:

> On 08/05/18 10:50 AM, Christian König wrote:
> > E.g. transactions are initially send to the root complex for 
> > translation, that's for sure. But at least for AMD GPUs the root complex 
> > answers with the translated address which is then cached in the device.
> > 
> > So further transactions for the same address range then go directly to 
> > the destination.  
> 
> Sounds like you are referring to Address Translation Services (ATS).
> This is quite separate from ACS and, to my knowledge, isn't widely
> supported by switch hardware.

They are not so unrelated, see the ACS Direct Translated P2P
capability, which in fact must be implemented by switch downstream
ports implementing ACS and works specifically with ATS.  This appears to
be the way the PCI SIG would intend for P2P to occur within an IOMMU
managed topology, routing pre-translated DMA directly between peer
devices while requiring non-translated requests to bounce through the
IOMMU.  Really, what's the value of having an I/O virtual address space
provided by an IOMMU if we're going to allow physical DMA between
downstream devices, couldn't we just turn off the IOMMU altogether?  Of
course ATS is not without holes itself, basically that we trust the
endpoint's implementation of ATS implicitly.  Thanks,

Alex
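
For reference, a small sketch of how the two capability bits Alex mentions could be probed using the standard config-space accessors. These helpers are illustrative only and not part of the series:

#include <linux/pci.h>

/* Does this endpoint advertise an ATS capability at all? */
static bool ep_has_ats(struct pci_dev *ep)
{
	return pci_find_ext_capability(ep, PCI_EXT_CAP_ID_ATS) != 0;
}

/* Is ACS Direct Translated P2P enabled on this switch downstream port? */
static bool port_has_acs_dt(struct pci_dev *port)
{
	int pos = pci_find_ext_capability(port, PCI_EXT_CAP_ID_ACS);
	u16 ctrl;

	if (!pos)
		return false;

	pci_read_config_word(port, pos + PCI_ACS_CTRL, &ctrl);

	return ctrl & PCI_ACS_DT;
}

With both in place, pre-translated (AT) requests can be routed directly by the switch while untranslated requests still bounce through the IOMMU, which is the routing behaviour described above.
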
Logan Gunthorpe May 8, 2018, 7:45 p.m. UTC | #12
On 08/05/18 01:34 PM, Alex Williamson wrote:
> They are not so unrelated, see the ACS Direct Translated P2P
> capability, which in fact must be implemented by switch downstream
> ports implementing ACS and works specifically with ATS.  This appears to
> be the way the PCI SIG would intend for P2P to occur within an IOMMU
> managed topology, routing pre-translated DMA directly between peer
> devices while requiring non-translated requests to bounce through the
> IOMMU.  Really, what's the value of having an I/O virtual address space
> provided by an IOMMU if we're going to allow physical DMA between
> downstream devices, couldn't we just turn off the IOMMU altogether?  Of
> course ATS is not without holes itself, basically that we trust the
> endpoint's implementation of ATS implicitly.  Thanks,

I agree that this is what the SIG intends, but I don't think hardware
fully supports this methodology yet. The Direct Translated capability
just requires switches to forward packets that have the AT request type
set. It does not require them to do the translation or to support ATS
such that P2P requests can be translated by the IOMMU. I expect this is
so that a downstream device can implement ATS and not get messed up by
an upstream switch that doesn't support it.

Logan
Alex Williamson May 8, 2018, 8:13 p.m. UTC | #13
On Tue, 8 May 2018 13:45:50 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:

> On 08/05/18 01:34 PM, Alex Williamson wrote:
> > They are not so unrelated, see the ACS Direct Translated P2P
> > capability, which in fact must be implemented by switch downstream
> > ports implementing ACS and works specifically with ATS.  This appears to
> > be the way the PCI SIG would intend for P2P to occur within an IOMMU
> > managed topology, routing pre-translated DMA directly between peer
> > devices while requiring non-translated requests to bounce through the
> > IOMMU.  Really, what's the value of having an I/O virtual address space
> > provided by an IOMMU if we're going to allow physical DMA between
> > downstream devices, couldn't we just turn off the IOMMU altogether?  Of
> > course ATS is not without holes itself, basically that we trust the
> > endpoint's implementation of ATS implicitly.  Thanks,  
> 
> I agree that this is what the SIG intends, but I don't think hardware
> fully supports this methodology yet. The Direct Translated capability
> just requires switches to forward packets that have the AT request type
> set. It does not require them to do the translation or to support ATS
> such that P2P requests can be translated by the IOMMU. I expect this is
> so that an downstream device can implement ATS and not get messed up by
> an upstream switch that doesn't support it.

Well, I'm a bit confused, this patch series is specifically disabling
ACS on switches, but per the spec downstream switch ports implementing
ACS MUST implement direct translated P2P.  So it seems the only
potential gap here is the endpoint, which must support ATS or else
there's nothing for direct translated P2P to do.  The switch port plays
no part in the actual translation of the request, ATS on the endpoint
has already cached the translation and is now attempting to use it.
For the switch port, this only becomes a routing decision, the request
is already translated, therefore ACS RR and EC can be ignored to
perform "normal" (direct) routing, as if ACS were not present.  It would
be a shame to go to all the trouble of creating this no-ACS mode to find
out the target hardware supports ATS and should have simply used it, or
we should have disabled the IOMMU altogether, which leaves ACS disabled.
Thanks,

Alex
Logan Gunthorpe May 8, 2018, 8:19 p.m. UTC | #14
On 08/05/18 02:13 PM, Alex Williamson wrote:
> Well, I'm a bit confused, this patch series is specifically disabling
> ACS on switches, but per the spec downstream switch ports implementing
> ACS MUST implement direct translated P2P.  So it seems the only
> potential gap here is the endpoint, which must support ATS or else
> there's nothing for direct translated P2P to do.  The switch port plays
> no part in the actual translation of the request, ATS on the endpoint
> has already cached the translation and is now attempting to use it.
> For the switch port, this only becomes a routing decision, the request
> is already translated, therefore ACS RR and EC can be ignored to
> perform "normal" (direct) routing, as if ACS were not present.  It would
> be a shame to go to all the trouble of creating this no-ACS mode to find
> out the target hardware supports ATS and should have simply used it, or
> we should have disabled the IOMMU altogether, which leaves ACS disabled.

Ah, ok, I didn't think it was the endpoint that had to implement ATS.
But in that case, for our application, we need NVMe cards and RDMA NICs
to all have ATS support and I expect that is just as unlikely. At least
none of the endpoints on my system support it. Maybe only certain GPUs
have this support.

Logan
Alex Williamson May 8, 2018, 8:43 p.m. UTC | #15
On Tue, 8 May 2018 14:19:05 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:

> On 08/05/18 02:13 PM, Alex Williamson wrote:
> > Well, I'm a bit confused, this patch series is specifically disabling
> > ACS on switches, but per the spec downstream switch ports implementing
> > ACS MUST implement direct translated P2P.  So it seems the only
> > potential gap here is the endpoint, which must support ATS or else
> > there's nothing for direct translated P2P to do.  The switch port plays
> > no part in the actual translation of the request, ATS on the endpoint
> > has already cached the translation and is now attempting to use it.
> > For the switch port, this only becomes a routing decision, the request
> > is already translated, therefore ACS RR and EC can be ignored to
> > perform "normal" (direct) routing, as if ACS were not present.  It would
> > be a shame to go to all the trouble of creating this no-ACS mode to find
> > out the target hardware supports ATS and should have simply used it, or
> > we should have disabled the IOMMU altogether, which leaves ACS disabled.  
> 
> Ah, ok, I didn't think it was the endpoint that had to implement ATS.
> But in that case, for our application, we need NVMe cards and RDMA NICs
> to all have ATS support and I expect that is just as unlikely. At least
> none of the endpoints on my system support it. Maybe only certain GPUs
> have this support.

Yes, GPUs seem to be leading the pack in implementing ATS.  So now the
dumb question, why not simply turn off the IOMMU and thus ACS?  The
argument of using the IOMMU for security is rather diminished if we're
specifically enabling devices to poke one another directly and clearly
this isn't favorable for device assignment either.  Are there target
systems where this is not a simple kernel commandline option?  Thanks,

Alex
Logan Gunthorpe May 8, 2018, 8:49 p.m. UTC | #16
On 08/05/18 02:43 PM, Alex Williamson wrote:
> Yes, GPUs seem to be leading the pack in implementing ATS.  So now the
> dumb question, why not simply turn off the IOMMU and thus ACS?  The
> argument of using the IOMMU for security is rather diminished if we're
> specifically enabling devices to poke one another directly and clearly
> this isn't favorable for device assignment either.  Are there target
> systems where this is not a simple kernel commandline option?  Thanks,

Well, turning off the IOMMU doesn't necessarily turn off ACS. We've run
into some bios's that set the bits on boot (which is annoying).

I also don't expect people will respond well to making the IOMMU and P2P
exclusive. The IOMMU is often used for more than just security and on
many platforms it's enabled by default. I'd much rather allow IOMMU use
but have fewer isolation groups in much the same way as if you had PCI
bridges that didn't support ACS.

Logan
Jerome Glisse May 8, 2018, 8:50 p.m. UTC | #17
On Tue, May 08, 2018 at 02:19:05PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 08/05/18 02:13 PM, Alex Williamson wrote:
> > Well, I'm a bit confused, this patch series is specifically disabling
> > ACS on switches, but per the spec downstream switch ports implementing
> > ACS MUST implement direct translated P2P.  So it seems the only
> > potential gap here is the endpoint, which must support ATS or else
> > there's nothing for direct translated P2P to do.  The switch port plays
> > no part in the actual translation of the request, ATS on the endpoint
> > has already cached the translation and is now attempting to use it.
> > For the switch port, this only becomes a routing decision, the request
> > is already translated, therefore ACS RR and EC can be ignored to
> > perform "normal" (direct) routing, as if ACS were not present.  It would
> > be a shame to go to all the trouble of creating this no-ACS mode to find
> > out the target hardware supports ATS and should have simply used it, or
> > we should have disabled the IOMMU altogether, which leaves ACS disabled.
> 
> Ah, ok, I didn't think it was the endpoint that had to implement ATS.
> But in that case, for our application, we need NVMe cards and RDMA NICs
> to all have ATS support and I expect that is just as unlikely. At least
> none of the endpoints on my system support it. Maybe only certain GPUs
> have this support.

I think there is confusion here, Alex properly explained the scheme
PCIE-device do a ATS request to the IOMMU which returns a valid
translation for a virtual address. Device can then use that address
directly without going through IOMMU for translation.

ATS is implemented by the IOMMU not by the device (well device implement
the client side of it). Also ATS is meaningless without something like
PASID as far as i know.

Cheers,
Jérôme
Donald Dutile May 8, 2018, 9:04 p.m. UTC | #18
On 05/08/2018 10:44 AM, Stephen  Bates wrote:
> Hi Dan
> 
>>     It seems unwieldy that this is a compile time option and not a runtime
>>     option. Can't we have a kernel command line option to opt-in to this
>>     behavior rather than require a wholly separate kernel image?
>    
> I think because of the security implications associated with p2pdma and ACS we wanted to make it very clear people were choosing one (p2pdma) or the other (IOMMU groupings and isolation). However personally I would prefer including the option of a run-time kernel parameter too. In fact a few months ago I proposed a small patch that did just that [1]. It never really went anywhere but if people were open to the idea we could look at adding it to the series.
> 
It is clear if it is a kernel command-line option or a CONFIG option.
One does not have access to the kernel command-line w/o a few privs.
A CONFIG option prevents a distribution from having a default, locked-down kernel _and_ the ability to be 'unlocked' if the customer/site is 'secure' via other means.
A run/boot-time option is more flexible and achieves the best of both.
    
>> Why is this text added in a follow on patch and not the patch that
>>   introduced the config option?
> 
> Because the ACS section was added later in the series and this information is associated with that additional functionality.
>      
>> I'm also wondering if that command line option can take a 'bus device
>> function' address of a switch to limit the scope of where ACS is
>> disabled.
> 
Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two devices.
That agent should 'request' to the kernel that ACS be removed/circumvented (p2p enabled) btwn two endpoints.
I recommend doing so via a sysfs method.

That way, the system can limit the 'unsecure' space btwn two devices, likely configured on a separate switch, from the rest of the still-secured/ACS-enabled PCIe tree.
PCIe is pt-to-pt, effectively; maybe one would have multiple nics/fabrics p2p to/from NVME, but one could look at it as a list of pairs (nic1<->nvme1; nic2<->nvme2; ....).
A pair-listing would be optimal, allowing the kernel to figure out the ACS path, and not making it endpoint-switch-switch...-switch-endpt error-entry prone.
Additionally, systems that can/prefer to do so via a RP's IOMMU, albeit not optimal, but better then all the way to/from memory, and a security/iova-check possible,
can modify the pt-to-pt ACS algorithm to accommodate over time (e.g., cap bits be they hw or device-driver/extension/quirk defined for each bridge/RP in a PCI domain).
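
As a rough illustration of the pair-wise sysfs request suggested here, a minimal sketch follows; the attribute name and the pci_p2pdma_enable_pair() helper are hypothetical, and the hard part (reworking IOMMU groups at runtime) is deliberately stubbed out:

#include <linux/device.h>
#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/pci.h>

/*
 * Hypothetical helper: would clear ACS P2P redirection on every bridge
 * between @a and @b and rework the affected IOMMU groups. Stubbed here.
 */
static int pci_p2pdma_enable_pair(struct pci_dev *a, struct pci_dev *b)
{
	return -EOPNOTSUPP;
}

/* Write a peer BDF, e.g. "0000:05:00.0", to request p2p with that device. */
static ssize_t p2pdma_peer_store(struct device *dev,
				 struct device_attribute *attr,
				 const char *buf, size_t count)
{
	struct pci_dev *pdev = to_pci_dev(dev);
	unsigned int domain, bus, slot, fn;
	struct pci_dev *peer;
	int ret;

	if (sscanf(buf, "%x:%x:%x.%x", &domain, &bus, &slot, &fn) != 4)
		return -EINVAL;

	peer = pci_get_domain_bus_and_slot(domain, bus, PCI_DEVFN(slot, fn));
	if (!peer)
		return -ENODEV;

	ret = pci_p2pdma_enable_pair(pdev, peer);
	pci_dev_put(peer);

	return ret ? ret : count;
}
static DEVICE_ATTR_WO(p2pdma_peer);

Whether the resulting IOMMU group change can be communicated to already-running VMs is the open problem discussed earlier in the thread.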

Kernels that never want to support P2P could build w/o it enabled.... cmdline option is moot.
Kernels built with it on, *still* need cmdline option, to be blunt that the kernel is enabling a feature that could render the entire (IO sub)system unsecure.

> By this you mean the address for either a RP, DSP, USP or MF EP below which we disable ACS? We could do that but I don't think it avoids the issue of changes in IOMMU groupings as devices are added/removed. It simply changes the problem from affecting and entire PCI domain to a sub-set of the domain. We can already handle this by doing p2pdma on one RP and normal IOMMU isolation on the other RPs in the system.
> 
as devices are added, they start in ACS-enabled, secured mode.
As sysfs entry modifies p2p ability, IOMMU group is modified as well.


btw -- IOMMU grouping is a host/HV control issue, not a VM control/knowledge issue.
        So I don't understand the comments why VMs should need to know.
        -- configure p2p _before_ assigning devices to VMs. ... iommu groups are checked at assignment time.
           -- so even if hot-add, separate iommu group, then enable p2p, becomes same IOMMU group, then can only assign to same VM.
        -- VMs don't know IOMMU's & ACS are involved now, and won't later, even if device's dynamically added/removed

Is there a thread I need to read up to explain /clear-up the thoughts above?

> Stephen
> 
> [1] https://marc.info/?l=linux-doc&m=150907188310838&w=2
>      
>
Alex Williamson May 8, 2018, 9:26 p.m. UTC | #19
On Tue, 8 May 2018 14:49:23 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:

> On 08/05/18 02:43 PM, Alex Williamson wrote:
> > Yes, GPUs seem to be leading the pack in implementing ATS.  So now the
> > dumb question, why not simply turn off the IOMMU and thus ACS?  The
> > argument of using the IOMMU for security is rather diminished if we're
> > specifically enabling devices to poke one another directly and clearly
> > this isn't favorable for device assignment either.  Are there target
> > systems where this is not a simple kernel commandline option?  Thanks,  
> 
> Well, turning off the IOMMU doesn't necessarily turn off ACS. We've run
> into some bios's that set the bits on boot (which is annoying).

But it would be a much easier proposal to disable ACS when the IOMMU is
not enabled, ACS has no real purpose in that case.
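
A one-line version of that check, for illustration only (not from this series), would be to gate ACS setup on an IOMMU actually being in use for PCI:

#include <linux/iommu.h>
#include <linux/pci.h>

/* Sketch: leave ACS alone unless an IOMMU is registered for the PCI bus. */
static bool acs_worth_enabling(void)
{
	return iommu_present(&pci_bus_type);
}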

> I also don't expect people will respond well to making the IOMMU and P2P
> exclusive. The IOMMU is often used for more than just security and on
> many platforms it's enabled by default. I'd much rather allow IOMMU use
> but have fewer isolation groups in much the same way as if you had PCI
> bridges that didn't support ACS.

The IOMMU and P2P are already not exclusive, we can bounce off the
IOMMU or make use of ATS as we've previously discussed.  We were
previously talking about a build time config option that you didn't
expect distros to use, so I don't think intervention for the user to
disable the IOMMU if it's enabled by default is a serious concern
either.

What you're trying to do is enabled direct peer-to-peer for endpoints
which do not support ATS when the IOMMU is enabled, which is not
something that necessarily makes sense to me.  As I mentioned in a
previous reply, the IOMMU provides us with an I/O virtual address space
for devices, ACS is meant to fill the topology based gaps in that
virtual address space, making transactions follow IOMMU compliant
routing rules to avoid aliases between the IOVA and physical address
spaces.  But this series specifically wants to leave those gaps open
for direct P2P access.

So we compromise the P2P aspect of security, still protecting RAM, but
potentially only to the extent that a device cannot hop through or
interfere with other devices to do its bidding.  Device assignment is
mostly tossed out the window because not only are bigger groups more
difficult to deal with, the IOVA space is riddled with gaps, which is
not really a solved problem.  So that leaves avoiding bounce buffers as
the remaining IOMMU feature, but we're dealing with native express
devices and relatively high end devices that are probably installed in
modern systems, so that seems like a non-issue.

Are there other uses I'm forgetting?  We can enable interrupt remapping
separate from DMA translation, so we can exclude that one.  I'm still
not seeing why it's terribly undesirable to require devices to support
ATS if they want to do direct P2P with an IOMMU enabled.  Thanks,

Alex
Stephen Bates May 8, 2018, 9:27 p.m. UTC | #20
Hi Don

>Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two devices.
>    That agent should 'request' to the kernel that ACS be removed/circumvented (p2p enabled) btwn two endpoints.
>    I recommend doing so via a sysfs method.


Yes we looked at something like this in the past but it does hit the IOMMU grouping issue I discussed earlier today which is not acceptable right now. In the long term, once we get IOMMU grouping callbacks to VMs we can look at extending p2pdma in this way. But I don't think this is viable for the initial series. 

    
>            So I don't understand the comments why VMs should need to know.


As I understand it VMs need to know because VFIO passes IOMMU grouping up into the VMs. So if a IOMMU grouping changes the VM's view of its PCIe topology changes. I think we even have to be cognizant of the fact the OS running on the VM may not even support hot-plug of PCI devices.
    
> Is there a thread I need to read up to explain /clear-up the thoughts above?


If you search for p2pdma you should find the previous discussions. Thanks for the input!

Stephen
Stephen Bates May 8, 2018, 9:35 p.m. UTC | #21
Hi Jerome

>    I think there is confusion here, Alex properly explained the scheme
>   PCIE-device do a ATS request to the IOMMU which returns a valid
>    translation for a virtual address. Device can then use that address
>    directly without going through IOMMU for translation.


This makes sense and to be honest I now understand ATS and its interaction with ACS a lot better than I did 24 hours ago ;-).

>    ATS is implemented by the IOMMU not by the device (well device implement
>    the client side of it). Also ATS is meaningless without something like
>    PASID as far as i know.

    
I think it's the client side that is what is important to us. Not many EPs support ATS today and it's not clear if many will in the future.  So assuming we want to do p2pdma between devices (some of) which do NOT support ATS how best do we handle the ACS issue? Disabling the IOMMU seems a bit strong to me given this impacts all the PCI domains in the system and not just the domain we wish to do P2P on.

Stephen
Stephen Bates May 8, 2018, 9:42 p.m. UTC | #22
Hi Alex

>    But it would be a much easier proposal to disable ACS when the IOMMU is
>    not enabled, ACS has no real purpose in that case.


I guess one issue I have with this is that it disables IOMMU groups for all Root Ports and not just the one(s) we wish to do p2pdma on. 
    
>    The IOMMU and P2P are already not exclusive, we can bounce off the
>    IOMMU or make use of ATS as we've previously discussed.  We were
>    previously talking about a build time config option that you didn't
>    expect distros to use, so I don't think intervention for the user to
>    disable the IOMMU if it's enabled by default is a serious concern
>    either.


ATS definitely makes things more interesting for the cases where the EPs support it. However I don't really have a handle on how common ATS support is going to be in the kinds of devices we have been focused on (NVMe SSDs and RDMA NICs mostly). 
    
> What you're trying to do is enabled direct peer-to-peer for endpoints
>  which do not support ATS when the IOMMU is enabled, which is not
>  something that necessarily makes sense to me. 


As above the advantage of leaving the IOMMU on is that it allows for both p2pdma PCI domains and IOMMU groupings PCI domains in the same system. It is just that these domains will be separate to each other.

>  So that leaves avoiding bounce buffers as the remaining IOMMU feature


I agree with you here that the devices we will want to use for p2p will probably not require a bounce buffer and will support 64 bit DMA addressing.
    
> I'm still not seeing why it's terribly undesirable to require devices to support
> ATS if they want to do direct P2P with an IOMMU enabled.


I think the one reason is for the use-case above. Allowing IOMMU groupings on one domain and p2pdma on another domain....
    
Stephen
Alex Williamson May 8, 2018, 10:03 p.m. UTC | #23
On Tue, 8 May 2018 21:42:27 +0000
"Stephen  Bates" <sbates@raithlin.com> wrote:

> Hi Alex
> 
> >    But it would be a much easier proposal to disable ACS when the
> > IOMMU is not enabled, ACS has no real purpose in that case.  
> 
> I guess one issue I have with this is that it disables IOMMU groups
> for all Root Ports and not just the one(s) we wish to do p2pdma on. 

But as I understand this series, we're not really targeting specific
sets of devices either.  It's more of a shotgun approach that we
disable ACS on downstream switch ports and hope that we get the right
set of devices, but with the indecisiveness that we might later
white-list select root ports to further increase the blast radius.

> >    The IOMMU and P2P are already not exclusive, we can bounce off
> > the IOMMU or make use of ATS as we've previously discussed.  We were
> >    previously talking about a build time config option that you
> > didn't expect distros to use, so I don't think intervention for the
> > user to disable the IOMMU if it's enabled by default is a serious
> > concern either.  
> 
> ATS definitely makes things more interesting for the cases where the
> EPs support it. However I don't really have a handle on how common
> ATS support is going to be in the kinds of devices we have been
> focused on (NVMe SSDs and RDMA NICs mostly). 
>
> > What you're trying to do is enabled direct peer-to-peer for
> > endpoints which do not support ATS when the IOMMU is enabled, which
> > is not something that necessarily makes sense to me.   
> 
> As above the advantage of leaving the IOMMU on is that it allows for
> both p2pdma PCI domains and IOMMU groupings PCI domains in the same
> system. It is just that these domains will be separate to each other.

That argument makes sense if we had the ability to select specific sets
of devices, but that's not the case here, right?  With the shotgun
approach, we're clearly favoring one at the expense of the other and
it's not clear why we don't simply force the needle all the way in that
direction such that the results are at least predictable.

> >  So that leaves avoiding bounce buffers as the remaining IOMMU
> > feature  
> 
> I agree with you here that the devices we will want to use for p2p
> will probably not require a bounce buffer and will support 64 bit DMA
> addressing. 
>
> > I'm still not seeing why it's terribly undesirable to require
> > devices to support ATS if they want to do direct P2P with an IOMMU
> > enabled.  
> 
> I think the one reason is for the use-case above. Allowing IOMMU
> groupings on one domain and p2pdma on another domain.... 

If IOMMU grouping implies device assignment (because nobody else uses
it to the same extent as device assignment) then the build-time option
falls to pieces, we need a single kernel that can do both.  I think we
need to get more clever about allowing the user to specify exactly at
which points in the topology they want to disable isolation.  Thanks,

Alex
Logan Gunthorpe May 8, 2018, 10:10 p.m. UTC | #24
On 08/05/18 04:03 PM, Alex Williamson wrote:
> If IOMMU grouping implies device assignment (because nobody else uses
> it to the same extent as device assignment) then the build-time option
> falls to pieces, we need a single kernel that can do both.  I think we
> need to get more clever about allowing the user to specify exactly at
> which points in the topology they want to disable isolation.  Thanks,


Yeah, so based on the discussion I'm leaning toward just having a
command line option that takes a list of BDFs and disables ACS for them.
(Essentially as Dan has suggested.) This avoids the shotgun.

Then, the pci_p2pdma_distance command needs to check that ACS is
disabled for all bridges between the two devices. If this is not the
case, it returns -1. Future work can check if the EP has ATS support, in
which case it has to check for the ACS direct translated bit.

A user then needs to either disable the IOMMU and/or add the command
line option to disable ACS for the specific downstream ports in the PCI
hierarchy. This means the IOMMU groups will be less granular but
presumably the person adding the command line argument understands this.

We may also want to do some work so that there's informative dmesgs on
which BDFs need to be specified on the command line so it's not so
difficult for the user to figure out.
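
A rough sketch of the kind of path check pci_p2pdma_distance could do is below. The helper name and structure are hypothetical, not from this series; it simply walks the bridges upstream of a device and refuses P2P if any of them still has the ACS Request/Completion Redirect bits set. A real implementation would stop at the common upstream bridge of the two peers rather than walking all the way to the root.

#include <linux/pci.h>

/* Hypothetical helper, not part of this series. */
static bool pci_p2pdma_upstream_acs_clear(struct pci_dev *dev)
{
	struct pci_dev *bridge = pci_upstream_bridge(dev);

	while (bridge) {
		int pos = pci_find_ext_capability(bridge, PCI_EXT_CAP_ID_ACS);
		u16 ctrl;

		if (pos) {
			pci_read_config_word(bridge, pos + PCI_ACS_CTRL, &ctrl);
			/* redirect still on: peer TLPs get forced to the RC */
			if (ctrl & (PCI_ACS_RR | PCI_ACS_CR))
				return false;
		}
		bridge = pci_upstream_bridge(bridge);
	}

	return true;
}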

Logan
Donald Dutile May 8, 2018, 10:21 p.m. UTC | #25
On 05/08/2018 06:03 PM, Alex Williamson wrote:
> On Tue, 8 May 2018 21:42:27 +0000
> "Stephen  Bates" <sbates@raithlin.com> wrote:
> 
>> Hi Alex
>>
>>>     But it would be a much easier proposal to disable ACS when the
>>> IOMMU is not enabled, ACS has no real purpose in that case.
>>
>> I guess one issue I have with this is that it disables IOMMU groups
>> for all Root Ports and not just the one(s) we wish to do p2pdma on.
> 
> But as I understand this series, we're not really targeting specific
> sets of devices either.  It's more of a shotgun approach that we
> disable ACS on downstream switch ports and hope that we get the right
> set of devices, but with the indecisiveness that we might later
> white-list select root ports to further increase the blast radius.
> 
>>>     The IOMMU and P2P are already not exclusive, we can bounce off
>>> the IOMMU or make use of ATS as we've previously discussed.  We were
>>>     previously talking about a build time config option that you
>>> didn't expect distros to use, so I don't think intervention for the
>>> user to disable the IOMMU if it's enabled by default is a serious
>>> concern either.
>>
>> ATS definitely makes things more interesting for the cases where the
>> EPs support it. However I don't really have a handle on how common
>> ATS support is going to be in the kinds of devices we have been
>> focused on (NVMe SSDs and RDMA NICs mostly).
>>
>>> What you're trying to do is enabled direct peer-to-peer for
>>> endpoints which do not support ATS when the IOMMU is enabled, which
>>> is not something that necessarily makes sense to me.
>>
>> As above the advantage of leaving the IOMMU on is that it allows for
>> both p2pdma PCI domains and IOMMU groupings PCI domains in the same
>> system. It is just that these domains will be separate to each other.
> 
> That argument makes sense if we had the ability to select specific sets
> of devices, but that's not the case here, right?  With the shotgun
> approach, we're clearly favoring one at the expense of the other and
> it's not clear why we don't simple force the needle all the way in that
> direction such that the results are at least predictable.
> 
>>>   So that leaves avoiding bounce buffers as the remaining IOMMU
>>> feature
>>
>> I agree with you here that the devices we will want to use for p2p
>> will probably not require a bounce buffer and will support 64 bit DMA
>> addressing.
>>
>>> I'm still not seeing why it's terribly undesirable to require
>>> devices to support ATS if they want to do direct P2P with an IOMMU
>>> enabled.
>>
>> I think the one reason is for the use-case above. Allowing IOMMU
>> groupings on one domain and p2pdma on another domain....
> 
> If IOMMU grouping implies device assignment (because nobody else uses
> it to the same extent as device assignment) then the build-time option
> falls to pieces, we need a single kernel that can do both.  I think we
> need to get more clever about allowing the user to specify exactly at
> which points in the topology they want to disable isolation.  Thanks,
> 
> Alex

+1/ack

RDMA VFs lend themselves to NVMEoF w/device-assignment.... need a way to
put NVME 'resources' into an assignable/manageable object for 'IOMMU-grouping',
which is really a 'DMA security domain' and less an 'IOMMU grouping domain'.


Stephen Bates May 8, 2018, 10:25 p.m. UTC | #26
>    Yeah, so based on the discussion I'm leaning toward just having a
>    command line option that takes a list of BDFs and disables ACS for them.
>    (Essentially as Dan has suggested.) This avoids the shotgun.

I concur that this seems to be where the conversation is taking us.

@Alex - Before we go do this can you provide input on the approach? I don't want to re-spin only to find we are still not converging on the ACS issue....

Thanks

Stephen
Alex Williamson May 8, 2018, 10:32 p.m. UTC | #27
On Tue, 8 May 2018 16:10:19 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:

> On 08/05/18 04:03 PM, Alex Williamson wrote:
> > If IOMMU grouping implies device assignment (because nobody else uses
> > it to the same extent as device assignment) then the build-time option
> > falls to pieces, we need a single kernel that can do both.  I think we
> > need to get more clever about allowing the user to specify exactly at
> > which points in the topology they want to disable isolation.  Thanks,  
> 
> 
> Yeah, so based on the discussion I'm leaning toward just having a
> command line option that takes a list of BDFs and disables ACS for them.
> (Essentially as Dan has suggested.) This avoids the shotgun.
> 
> Then, the pci_p2pdma_distance command needs to check that ACS is
> disabled for all bridges between the two devices. If this is not the
> case, it returns -1. Future work can check if the EP has ATS support, in
> which case it has to check for the ACS direct translated bit.
> 
> A user then needs to either disable the IOMMU and/or add the command
> line option to disable ACS for the specific downstream ports in the PCI
> hierarchy. This means the IOMMU groups will be less granular but
> presumably the person adding the command line argument understands this.
> 
> We may also want to do some work so that there's informative dmesgs on
> which BDFs need to be specified on the command line so it's not so
> difficult for the user to figure out.

I'd advise caution with a user supplied BDF approach, we have no
guaranteed persistence for a device's PCI address.  Adding a device
might renumber the buses, replacing a device with one that consumes
more/less bus numbers can renumber the buses, motherboard firmware
updates could renumber the buses, pci=assign-buses can renumber the
buses, etc.  This is why the VT-d spec makes use of device paths when
describing PCI hierarchies, firmware can't know what bus number will be
assigned to a device, but it does know the base bus number and the path
of devfns needed to get to it.  I don't know how we come up with an
option that's easy enough for a user to understand, but reasonably
robust against hardware changes.  Thanks,

Alex
Dan Williams May 8, 2018, 11 p.m. UTC | #28
On Tue, May 8, 2018 at 3:32 PM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> On Tue, 8 May 2018 16:10:19 -0600
> Logan Gunthorpe <logang@deltatee.com> wrote:
>
>> On 08/05/18 04:03 PM, Alex Williamson wrote:
>> > If IOMMU grouping implies device assignment (because nobody else uses
>> > it to the same extent as device assignment) then the build-time option
>> > falls to pieces, we need a single kernel that can do both.  I think we
>> > need to get more clever about allowing the user to specify exactly at
>> > which points in the topology they want to disable isolation.  Thanks,
>>
>>
>> Yeah, so based on the discussion I'm leaning toward just having a
>> command line option that takes a list of BDFs and disables ACS for them.
>> (Essentially as Dan has suggested.) This avoids the shotgun.
>>
>> Then, the pci_p2pdma_distance command needs to check that ACS is
>> disabled for all bridges between the two devices. If this is not the
>> case, it returns -1. Future work can check if the EP has ATS support, in
>> which case it has to check for the ACS direct translated bit.
>>
>> A user then needs to either disable the IOMMU and/or add the command
>> line option to disable ACS for the specific downstream ports in the PCI
>> hierarchy. This means the IOMMU groups will be less granular but
>> presumably the person adding the command line argument understands this.
>>
>> We may also want to do some work so that there's informative dmesgs on
>> which BDFs need to be specified on the command line so it's not so
>> difficult for the user to figure out.
>
> I'd advise caution with a user supplied BDF approach, we have no
> guaranteed persistence for a device's PCI address.  Adding a device
> might renumber the buses, replacing a device with one that consumes
> more/less bus numbers can renumber the buses, motherboard firmware
> updates could renumber the buses, pci=assign-buses can renumber the
> buses, etc.  This is why the VT-d spec makes use of device paths when
> describing PCI hierarchies, firmware can't know what bus number will be
> assigned to a device, but it does know the base bus number and the path
> of devfns needed to get to it.  I don't know how we come up with an
> option that's easy enough for a user to understand, but reasonably
> robust against hardware changes.  Thanks,

True, but at the same time this feature is for "users with custom
hardware designed for purpose", I assume they would be willing to take
on the bus renumbering risk. It's already the case that
/sys/bus/pci/drivers/<x>/bind takes BDF, which is why it seemed to
make a similar interface for the command line. Ideally we could later
get something into ACPI or other platform firmware to arrange for
bridges to disable ACS by default if we see p2p becoming a
common-off-the-shelf feature. I.e. a BIOS switch to enable p2p in a
given PCI-E sub-domain.
Donald Dutile May 8, 2018, 11:06 p.m. UTC | #29
On 05/08/2018 05:27 PM, Stephen  Bates wrote:
> Hi Don
> 
>> Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two devices.
>>     That agent should 'request' to the kernel that ACS be removed/circumvented (p2p enabled) btwn two endpoints.
>>     I recommend doing so via a sysfs method.
> 
> Yes we looked at something like this in the past but it does hit the IOMMU grouping issue I discussed earlier today which is not acceptable right now. In the long term, once we get IOMMU grouping callbacks to VMs we can look at extending p2pdma in this way. But I don't think this is viable for the initial series.
> 
>      
>>             So I don't understand the comments why VMs should need to know.
> 
> As I understand it VMs need to know because VFIO passes IOMMU grouping up into the VMs. So if a IOMMU grouping changes the VM's view of its PCIe topology changes. I think we even have to be cognizant of the fact the OS running on the VM may not even support hot-plug of PCI devices.
Alex:
Really? IOMMU groups are created by the kernel, so I don't know how they would be passed into the VMs, unless indirectly via PCI(e) layout.
At best, twiddling w/ACS enablement (emulation) would cause VMs to see different IOMMU groups, but again, VMs are not the security point/level, the host/HV's are.

>      
>> Is there a thread I need to read up to explain /clear-up the thoughts above?
> 
> If you search for p2pdma you should find the previous discussions. Thanks for the input!
> 
> Stephen
>      
>      
>
Alex Williamson May 8, 2018, 11:11 p.m. UTC | #30
On Tue, 8 May 2018 22:25:06 +0000
"Stephen  Bates" <sbates@raithlin.com> wrote:

> >    Yeah, so based on the discussion I'm leaning toward just having a
> >    command line option that takes a list of BDFs and disables ACS
> > for them. (Essentially as Dan has suggested.) This avoids the
> > shotgun.  
> 
> I concur that this seems to be where the conversation is taking us.
> 
> @Alex - Before we go do this can you provide input on the approach? I
> don't want to re-spin only to find we are still not converging on the
> ACS issue....

I can envision numerous implementation details that makes this less
trivial than it sounds, but it seems like the thing we need to decide
first is if intentionally leaving windows between devices with the
intention of exploiting them for direct P2P DMA in an otherwise IOMMU
managed address space is something we want to do.  From a security
perspective, we already handle this with IOMMU groups because many
devices do not support ACS, the new thing is embracing this rather than
working around it.  It makes me a little twitchy, but so long as the
IOMMU groups match the expected worst case routing between devices,
it's really no different than if we could wipe the ACS capability from
the device.

On to the implementation details... I already mentioned the BDF issue
in my other reply.  If we had a way to persistently identify a device,
would we specify the downstream points at which we want to disable ACS
or the endpoints that we want to connect?  The latter has a problem
that the grouping upstream of an endpoint is already set by the time we
discover the endpoint, so we might need to unwind to get the grouping
correct.  The former might be more difficult for users to find the
necessary nodes, but easier for the kernel to deal with during
discovery.  A runtime, sysfs approach has some benefits here,
especially in identifying the device assuming we're ok with leaving
the persistence problem to userspace tools.  I'm still a little fond of
the idea of exposing an acs_flags attribute for devices in sysfs where
a write would do a soft unplug and re-add of all affected devices to
automatically recreate the proper grouping.  Any dynamic change in
routing and grouping would require all DMA be re-established anyway and
a soft hotplug seems like an elegant way of handling it.  Thanks,

Alex
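
To make the acs_flags idea above concrete, here is a rough sketch of what such a sysfs attribute could look like. The attribute name, its placement on a downstream port, and the exact semantics are assumptions, not something proposed in this thread: writing a new ACS control value soft-unplugs everything below the port, updates ACS, and rescans so the IOMMU groups are recomputed against the new routing.

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/pci.h>

static ssize_t acs_flags_store(struct device *dev,
			       struct device_attribute *attr,
			       const char *buf, size_t count)
{
	struct pci_dev *pdev = to_pci_dev(dev);
	struct pci_dev *child, *tmp;
	u16 flags;
	int pos;

	if (!pdev->subordinate)
		return -EINVAL;		/* only makes sense on a bridge */

	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
	if (!pos)
		return -ENODEV;

	if (kstrtou16(buf, 0, &flags))
		return -EINVAL;

	pci_lock_rescan_remove();

	/* soft-unplug everything below this downstream port */
	list_for_each_entry_safe(child, tmp,
				 &pdev->subordinate->devices, bus_list)
		pci_stop_and_remove_bus_device(child);

	/* update ACS, then re-enumerate so IOMMU groups are recreated */
	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, flags);
	pci_rescan_bus(pdev->subordinate);

	pci_unlock_rescan_remove();
	return count;
}
static DEVICE_ATTR_WO(acs_flags);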
Logan Gunthorpe May 8, 2018, 11:15 p.m. UTC | #31
On 08/05/18 05:00 PM, Dan Williams wrote:
>> I'd advise caution with a user supplied BDF approach, we have no
>> guaranteed persistence for a device's PCI address.  Adding a device
>> might renumber the buses, replacing a device with one that consumes
>> more/less bus numbers can renumber the buses, motherboard firmware
>> updates could renumber the buses, pci=assign-buses can renumber the
>> buses, etc.  This is why the VT-d spec makes use of device paths when
>> describing PCI hierarchies, firmware can't know what bus number will be
>> assigned to a device, but it does know the base bus number and the path
>> of devfns needed to get to it.  I don't know how we come up with an
>> option that's easy enough for a user to understand, but reasonably
>> robust against hardware changes.  Thanks,
> 
> True, but at the same time this feature is for "users with custom
> hardware designed for purpose", I assume they would be willing to take
> on the bus renumbering risk. It's already the case that
> /sys/bus/pci/drivers/<x>/bind takes BDF, which is why it seemed to
> make a similar interface for the command line. Ideally we could later
> get something into ACPI or other platform firmware to arrange for
> bridges to disable ACS by default if we see p2p becoming a
> common-off-the-shelf feature. I.e. a BIOS switch to enable p2p in a
> given PCI-E sub-domain.

Yeah, I'm having a hard time coming up with an easy enough solution for
the user. I agree with Dan though, the bus renumbering risk would be
fairly low in the custom hardware seeing the switches are likely going
to be directly soldered to the same board with the CPU.

That being said, I suppose we could allow the command line to take both
a BDF or a BaseBus/DF/DF/DF path. Though, implementing this sounds like
a bit of a challenge.
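
For what it's worth, a minimal sketch of resolving such a path (the function name and option syntax are assumptions, not part of this series): start from a root bus BDF, then descend one devfn at a time through the subordinate buses, so the option survives downstream bus renumbering.

#include <linux/pci.h>

/* Hypothetical: resolve "domain:bus:devfn" plus a chain of devfns. */
static struct pci_dev *p2pdma_resolve_path(int domain, unsigned int bus,
					   unsigned int devfn,
					   const u8 *path, int depth)
{
	struct pci_dev *dev = pci_get_domain_bus_and_slot(domain, bus, devfn);
	int i;

	for (i = 0; dev && i < depth; i++) {
		struct pci_dev *parent = dev;

		dev = parent->subordinate ?
			pci_get_slot(parent->subordinate, path[i]) : NULL;
		pci_dev_put(parent);
	}

	return dev;	/* caller releases with pci_dev_put() */
}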

Logan
Logan Gunthorpe May 8, 2018, 11:31 p.m. UTC | #32
On 08/05/18 05:11 PM, Alex Williamson wrote:
> On to the implementation details... I already mentioned the BDF issue
> in my other reply.  If we had a way to persistently identify a device,
> would we specify the downstream points at which we want to disable ACS
> or the endpoints that we want to connect?  The latter has a problem
> that the grouping upstream of an endpoint is already set by the time we
> discover the endpoint, so we might need to unwind to get the grouping
> correct.  The former might be more difficult for users to find the
> necessary nodes, but easier for the kernel to deal with during
> discovery.  

I was envisioning the former with kernel helping by printing a dmesg in
certain circumstances to help with figuring out which devices need to be
specified. Specifying a list of endpoints on the command line and having
the kernel try to figure out which downstream ports need to be adjusted
while we are in the middle of enumerating the bus is, like you said, a
nightmare.

> A runtime, sysfs approach has some benefits here,
> especially in identifying the device assuming we're ok with leaving
> the persistence problem to userspace tools.  I'm still a little fond of
> the idea of exposing an acs_flags attribute for devices in sysfs where
> a write would do a soft unplug and re-add of all affected devices to
> automatically recreate the proper grouping.  Any dynamic change in
> routing and grouping would require all DMA be re-established anyway and
> a soft hotplug seems like an elegant way of handling it.  Thanks,

This approach sounds like it has a lot more issues to contend with:

For starters, a soft unplug/re-add of all the devices behind a switch is
going to be difficult if a lot of those devices have had drivers
installed and their respective resources are now mounted or otherwise in
use.

Then, do we have to redo the soft-replace every time we change the ACS
bit for every downstream port? That could mean you have to do dozens of
soft-replaces before you have all the ACS bits set, which means you have
a storm of drivers being added and removed.

This would require some kind of fancy custom setup software that runs at
just the right time in the boot sequence or a lot of work on the user's
part to unbind all the resources, set up the ACS bits and then rebind
everything (assuming the soft re-add doesn't rebind it every time you
adjust one ACS bit). Ugly.

IMO, if we need to do the sysfs approach then we need to be able to
adjust the groups dynamically in a sensible way and not through the
large hammer that is soft-replaces. I think this would be great but I
don't think we will be tackling that with this patch set.

Logan
Alex Williamson May 9, 2018, 12:01 a.m. UTC | #33
On Tue, 8 May 2018 19:06:17 -0400
Don Dutile <ddutile@redhat.com> wrote:
> On 05/08/2018 05:27 PM, Stephen  Bates wrote:
> > As I understand it VMs need to know because VFIO passes IOMMU
> > grouping up into the VMs. So if a IOMMU grouping changes the VM's
> > view of its PCIe topology changes. I think we even have to be
> > cognizant of the fact the OS running on the VM may not even support
> > hot-plug of PCI devices.  
> Alex:
> Really? IOMMU groups are created by the kernel, so don't know how
> they would be passed into the VMs, unless indirectly via  PCI(e)
> layout. At best, twiddling w/ACS enablement (emulation) would cause
> VMs to see different IOMMU groups, but again, VMs are not the
> security point/level, the host/HV's are.

Correct, the VM has no concept of the host's IOMMU groups, only the
hypervisor knows about the groups, but really only to the extent of
which device belongs to which group and whether the group is viable.
Any runtime change to grouping though would require DMA mapping
updates, which I don't see how we can reasonably do with drivers,
vfio-pci or native host drivers, bound to the affected devices.  Thanks,

Alex
Alex Williamson May 9, 2018, 12:17 a.m. UTC | #34
On Tue, 8 May 2018 17:31:48 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:
> On 08/05/18 05:11 PM, Alex Williamson wrote:
> > A runtime, sysfs approach has some benefits here,
> > especially in identifying the device assuming we're ok with leaving
> > the persistence problem to userspace tools.  I'm still a little fond of
> > the idea of exposing an acs_flags attribute for devices in sysfs where
> > a write would do a soft unplug and re-add of all affected devices to
> > automatically recreate the proper grouping.  Any dynamic change in
> > routing and grouping would require all DMA be re-established anyway and
> > a soft hotplug seems like an elegant way of handling it.  Thanks,  
> 
> This approach sounds like it has a lot more issues to contend with:
> 
> For starters, a soft unplug/re-add of all the devices behind a switch is
> going to be difficult if a lot of those devices have had drivers
> installed and their respective resources are now mounted or otherwise in
> use.
> 
> Then, do we have to redo a the soft-replace every time we change the ACS
> bit for every downstream port? That could mean you have to do dozens
> soft-replaces before you have all the ACS bits set which means you have
> a storm of drivers being added and removed.

True, anything requiring tweaking multiple downstream ports would
induce a hot-unplug/replug for each.  A better sysfs interface would
allow multiple downstream ports to be updated in a single shot.

> This would require some kind of fancy custom setup software that runs at
> just the right time in the boot sequence or a lot of work on the users
> part to unbind all the resources, setup the ACS bits and then rebind
> everything (assuming the soft re-add doesn't rebind it every time you
> adjust one ACS bit). Ugly.
> 
> IMO, if we need to do the sysfs approach then we need to be able to
> adjust the groups dynamically in a sensible way and not through the
> large hammer that is soft-replaces. I think this would be great but I
> don't think we will be tackling that with this patch set.

OTOH, I think the only sensible way to dynamically adjust groups is
through hotplug, we cannot have running drivers attached to downstream
endpoints as we're adjusting the routing.  Thanks,

Alex
Stephen Bates May 9, 2018, 12:35 p.m. UTC | #35
Hi Alex and Don

>    Correct, the VM has no concept of the host's IOMMU groups, only the
>    hypervisor knows about the groups,

But as I understand it these groups are usually passed through to VMs on a per-group basis by the hypervisor? So IOMMU group 1 might be passed to VM A and IOMMU group 2 passed to VM B. So I agree the VM is not aware of IOMMU groupings but it is impacted by them in the sense that if the groupings change the PCI topology presented to the VM needs to change too.

Stephen
Stephen Bates May 9, 2018, 12:38 p.m. UTC | #36
Hi Logan

>    Yeah, I'm having a hard time coming up with an easy enough solution for
>    the user. I agree with Dan though, the bus renumbering risk would be
>    fairly low in the custom hardware seeing the switches are likely going
>    to be directly soldered to the same board with the CPU.

    
I am afraid that the soldered-down assumption may not be valid. More and more PCIe cards with PCIe switches on them are becoming available and people are using these to connect servers to arrays of NVMe SSDs, which may make the topology more dynamic.

Stephen
Stephen Bates May 9, 2018, 12:44 p.m. UTC | #37
Hi Don

>    RDMA VFs lend themselves to NVMEoF w/device-assignment.... need a way to
>    put NVME 'resources' into an assignable/manageable object for 'IOMMU-grouping',
>    which is really a 'DMA security domain' and less an 'IOMMU grouping domain'.

    
Ha, I like your term "DMA Security Domain" which sounds about right for what we are discussing with p2pdma and ACS disablement ;-). The problem is that ACS is, in some ways, too big of a hammer for what we want here in the sense that it is either on or off for the bridge or MF EP we enable/disable it for. ACS can't filter the TLPs by address or ID, though the PCI-SIG is having some discussions on extending ACS. That's a long term solution and won't be applicable to us for some time.

NVMe SSDs that support SR-IOV are coming to market but we can't assume all NVMe SSDs will support SR-IOV. That will probably be a pretty high-end feature...

Stephen
Stephen Bates May 9, 2018, 1:12 p.m. UTC | #38
Jerome and Christian
    
> I think there is confusion here, Alex properly explained the scheme

> PCIE-device do a ATS request to the IOMMU which returns a valid

> translation for a virtual address. Device can then use that address

> directly without going through IOMMU for translation.


So I went through ATS in version 4.0r1 of the PCI spec. It looks like even an ATS-translated TLP is still impacted by ACS though it has a separate control knob for translated address TLPs (see 7.7.7.2 of 4.0r1 of the spec). So even if your device supports ATS a P2P DMA will still be routed to the associated RP of the domain and down again unless we disable ACS DT P2P on all bridges between the two devices involved in the P2P DMA.

So we still don't get fine grained control with ATS and I guess we still have security issues because a rogue or malfunctioning EP could just as easily issue TLPs with TA set vs not set.

> Also ATS is meaningless without something like PASID as far as i know.

    
ATS is still somewhat valuable without PASID in the sense you can cache IOMMU address translations at the EP. This saves hammering on the IOMMU as much in certain workloads.

Interestingly Section 7.7.7.2 almost mentions that Root Ports that support ATS AND can implement P2P between root ports should advertise "ACS Direct Translated P2P (T)" capability. This ties into the discussion around P2P between root ports we had a few weeks ago...

Stephen
Christian König May 9, 2018, 1:40 p.m. UTC | #39
Am 09.05.2018 um 15:12 schrieb Stephen Bates:
> Jerome and Christian
>      
>> I think there is confusion here, Alex properly explained the scheme
>> PCIE-device do a ATS request to the IOMMU which returns a valid
>> translation for a virtual address. Device can then use that address
>> directly without going through IOMMU for translation.
> So I went through ATS in version 4.0r1 of the PCI spec. It looks like even a ATS translated TLP is still impacted by ACS though it has a separate control knob for translated address TLPs (see 7.7.7.2 of 4.0r1 of the spec). So even if your device supports ATS a P2P DMA will still be routed to the associated RP of the domain and down again unless we disable ACS DT P2P on all bridges between the two devices involved in the P2P DMA.
>
> So we still don't get fine grained control with ATS and I guess we still have security issues because a rogue or malfunctioning EP could just as easily issue TLPs with TA set vs not set.

Still need to double check the specification (had a busy morning today), 
but that sounds about correct.

The key takeaway is that when any device has ATS enabled you can't 
disable ACS without breaking it (even if you unplug and replug it).

>> Also ATS is meaningless without something like PASID as far as i know.
>      
> ATS is still somewhat valuable without PSAID in the sense you can cache IOMMU address translations at the EP. This saves hammering on the IOMMU as much in certain workloads.
>
> Interestingly Section 7.7.7.2 almost mentions that Root Ports that support ATS AND can implement P2P between root ports should advertise "ACS Direct Translated P2P (T)" capability. This ties into the discussion around P2P between route ports we had a few weeks ago...

Interesting point, give me a moment to check that. That finally makes 
all the hardware I have standing around here valuable :)

Christian.

>
> Stephen
>
Alex Williamson May 9, 2018, 2:44 p.m. UTC | #40
On Wed, 9 May 2018 12:35:56 +0000
"Stephen  Bates" <sbates@raithlin.com> wrote:

> Hi Alex and Don
> 
> >    Correct, the VM has no concept of the host's IOMMU groups, only
> > the hypervisor knows about the groups,   
> 
> But as I understand it these groups are usually passed through to VMs
> on a pre-group basis by the hypervisor? So IOMMU group 1 might be
> passed to VM A and IOMMU group 2 passed to VM B. So I agree the VM is
> not aware of IOMMU groupings but it is impacted by them in the sense
> that if the groupings change the PCI topology presented to the VM
> needs to change too.

Hypervisors don't currently expose any topology based on the grouping,
the only case where such a concept even makes sense is when a vIOMMU is
present as devices within the same group cannot have separate address
spaces.  Our options for exposing such information is also limited, our
only real option would seem to be placing devices within the same group
together on a conventional PCI bus to denote the address space
granularity.  Currently we strongly recommend singleton groups for this
case and leave any additional configuration constraints to the admin.

The case you note of a group passed to VM A and another passed to VM B
is exactly an example of why any sort of dynamic routing change needs to
have the groups fully released, such as via hot-unplug.  For instance,
a routing change at a shared node above groups 1 & 2 could result in
the merging of these groups and there is absolutely no way to handle
that with portions of the group being owned by two separate VMs after
the merge.  Thanks,

Alex
Stephen Bates May 9, 2018, 3:41 p.m. UTC | #41
Christian

>    Interesting point, give me a moment to check that. That finally makes
>    all the hardware I have standing around here valuable :)

    
Yes. At the very least it provides an initial standards based path for P2P DMAs across RPs which is something we have discussed on this list in the past as being desirable.

BTW I am trying to understand how an ATS capable EP function determines when to perform an ATS Translation Request (ATS TR). Is there an upstream example of the driver for your APU that uses ATS? If so, can you provide a pointer to it. Do you provide some type of entry in the submission queues for commands going to the APU to indicate if the address associated with a specific command should be translated using ATS or not? Or do you simply enable ATS and then all addresses passed to your APU that miss the local cache result in a ATS TR?

Your feedback would be useful as I initiate discussions within the NVMe community on where we might go with ATS...

Thanks

Stephen
Donald Dutile May 9, 2018, 3:47 p.m. UTC | #42
On 05/08/2018 08:01 PM, Alex Williamson wrote:
> On Tue, 8 May 2018 19:06:17 -0400
> Don Dutile <ddutile@redhat.com> wrote:
>> On 05/08/2018 05:27 PM, Stephen  Bates wrote:
>>> As I understand it VMs need to know because VFIO passes IOMMU
>>> grouping up into the VMs. So if a IOMMU grouping changes the VM's
>>> view of its PCIe topology changes. I think we even have to be
>>> cognizant of the fact the OS running on the VM may not even support
>>> hot-plug of PCI devices.
>> Alex:
>> Really? IOMMU groups are created by the kernel, so don't know how
>> they would be passed into the VMs, unless indirectly via  PCI(e)
>> layout. At best, twiddling w/ACS enablement (emulation) would cause
>> VMs to see different IOMMU groups, but again, VMs are not the
>> security point/level, the host/HV's are.
> 
> Correct, the VM has no concept of the host's IOMMU groups, only the
> hypervisor knows about the groups, but really only to the extent of
> which device belongs to which group and whether the group is viable.
> Any runtime change to grouping though would require DMA mapping
> updates, which I don't see how we can reasonably do with drivers,
> vfio-pci or native host drivers, bound to the affected devices.  Thanks,
> 
> Alex
> 
A change in iommu groups would/could require a device remove/add cycle to get an updated DMA-mapping (yet-another-overused-term: iommu 'domain').
Donald Dutile May 9, 2018, 3:52 p.m. UTC | #43
On 05/09/2018 10:44 AM, Alex Williamson wrote:
> On Wed, 9 May 2018 12:35:56 +0000
> "Stephen  Bates" <sbates@raithlin.com> wrote:
> 
>> Hi Alex and Don
>>
>>>     Correct, the VM has no concept of the host's IOMMU groups, only
>>> the hypervisor knows about the groups,
>>
>> But as I understand it these groups are usually passed through to VMs
>> on a pre-group basis by the hypervisor? So IOMMU group 1 might be
>> passed to VM A and IOMMU group 2 passed to VM B. So I agree the VM is
>> not aware of IOMMU groupings but it is impacted by them in the sense
>> that if the groupings change the PCI topology presented to the VM
>> needs to change too.
> 
> Hypervisors don't currently expose any topology based on the grouping,
> the only case where such a concept even makes sense is when a vIOMMU is
> present as devices within the same group cannot have separate address
> spaces.  Our options for exposing such information is also limited, our
> only real option would seem to be placing devices within the same group
> together on a conventional PCI bus to denote the address space
> granularity.  Currently we strongly recommend singleton groups for this
> case and leave any additional configuration constraints to the admin.
> 
> The case you note of a group passed to VM A and another passed to VM B
> is exactly an example of why any sort of dynamic routing change needs to
> have the groups fully released, such as via hot-unplug.  For instance,
> a routing change at a shared node above groups 1 & 2 could result in
> the merging of these groups and there is absolutely no way to handle
> that with portions of the group being owned by two separate VMs after
> the merge.  Thanks,
> 
> Alex
> 
The above is why I stated the host/HV has to do p2p setup *before* device-assignment
is done.
Now, that could be done at boot time (with a mod.conf-like config in host/HV, before VM startup)
as well.
Dynamically, if such a feature is needed, requires a hot-unplug/plug cycling as Alex states.
Donald Dutile May 9, 2018, 3:53 p.m. UTC | #44
On 05/08/2018 05:27 PM, Stephen  Bates wrote:
> Hi Don
> 
>> Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two devices.
>>     That agent should 'request' to the kernel that ACS be removed/circumvented (p2p enabled) btwn two endpoints.
>>     I recommend doing so via a sysfs method.
> 
> Yes we looked at something like this in the past but it does hit the IOMMU grouping issue I discussed earlier today which is not acceptable right now. In the long term, once we get IOMMU grouping callbacks to VMs we can look at extending p2pdma in this way. But I don't think this is viable for the initial series.
> 
>      
>>             So I don't understand the comments why VMs should need to know.
> 
> As I understand it VMs need to know because VFIO passes IOMMU grouping up into the VMs. So if a IOMMU grouping changes the VM's view of its PCIe topology changes. I think we even have to be cognizant of the fact the OS running on the VM may not even support hot-plug of PCI devices.
>      
>> Is there a thread I need to read up to explain /clear-up the thoughts above?
> 
> If you search for p2pdma you should find the previous discussions. Thanks for the input!
> 
under linux-pci I'm assuming...
you cc'd a number of upstream lists; I picked this thread up via rdma-list.

> Stephen
>      
>      
>
Donald Dutile May 9, 2018, 3:58 p.m. UTC | #45
On 05/09/2018 08:44 AM, Stephen  Bates wrote:
> Hi Don
> 
>>     RDMA VFs lend themselves to NVMEoF w/device-assignment.... need a way to
>>     put NVME 'resources' into an assignable/manageable object for 'IOMMU-grouping',
>>     which is really a 'DMA security domain' and less an 'IOMMU grouping domain'.
>      
> Ha, I like your term "DMA Security Domain" which sounds about right for what we are discussing with p2pdma and ACS disablement ;-). The problem is that ACS is, in some ways, too big of hammer for what we want here in the sense that it is either on or off for the bridge or MF EP we enable/disable it for. ACS can't filter the TLPs by address or ID though PCI-SIG are having some discussions on extending ACS. That's a long term solution and won't be applicable to us for some time.
> 
> NVMe SSDs that support SR-IOV are coming to market but we can't assume all NVMe SSDs with support SR-IOV. That will probably be a pretty high end-feature...
> 
> Stephen
>      
>      
> 
Sure, we could provide unsecure enablement for development and kick-the-tires deployment ..
device-assignment started that way (no ACS, no intr-remapping, etc.), but for secure setups,
VF's for both p2p EPs is the best security model.
So, we should have a design goal for the secure configuration.
workarounds/unsecure modes to deal with near-term what-we-have-to-work-with can be employed, but they shouldn't be
the only/defacto/final-solution.
Jerome Glisse May 9, 2018, 4:07 p.m. UTC | #46
On Wed, May 09, 2018 at 03:41:44PM +0000, Stephen  Bates wrote:
> Christian
> 
> >    Interesting point, give me a moment to check that. That finally makes 
> >    all the hardware I have standing around here valuable :)
>     
> Yes. At the very least it provides an initial standards based path
> for P2P DMAs across RPs which is something we have discussed on this
> list in the past as being desirable.
> 
> BTW I am trying to understand how an ATS capable EP function determines
> when to perform an ATS Translation Request (ATS TR). Is there an
> upstream example of the driver for your APU that uses ATS? If so, can
> you provide a pointer to it. Do you provide some type of entry in the
> submission queues for commands going to the APU to indicate if the
> address associated with a specific command should be translated using
> ATS or not? Or do you simply enable ATS and then all addresses passed
> to your APU that miss the local cache result in a ATS TR?

On GPUs ATS is always tied to a PASID. You do not do the former without
the latter (AFAICT this is not doable, maybe through some JTAG but not
in normal operation).

GPUs are like CPUs in that you have GPU threads that run against an address
space. This address space uses a page table (very much like the CPU page
table). Now inside that page table you can point a GPU virtual address
at GPU memory or at system memory. Those system memory entries can
also be marked as ATS against a given PASID.

On some GPUs you define a window of GPU virtual addresses that goes through
PASID & ATS (so accesses in that window do not go through the page table
but directly through PASID & ATS).

Jérôme
Stephen Bates May 9, 2018, 4:30 p.m. UTC | #47
Hi Jerome

> Now inside that page table you can point GPU virtual address
> to use GPU memory or use system memory. Those system memory entry can
> also be mark as ATS against a given PASID.

    
Thanks. This all makes sense. 

But do you have examples of this in a kernel driver (if so can you point me to it) or is this all done via user-space? Based on my grepping of the kernel code I see zero EP drivers using in-kernel ATS functionality right now...

Stephen
Logan Gunthorpe May 9, 2018, 4:45 p.m. UTC | #48
On 09/05/18 07:40 AM, Christian König wrote:
> The key takeaway is that when any device has ATS enabled you can't 
> disable ACS without breaking it (even if you unplug and replug it).

I don't follow how you came to this conclusion...
The ACS bits we'd be turning off are the ones that force TLPs addressed
at a peer to go to the RC. However, ATS translation packets will be
addressed to an untranslated address which a switch will not identify as
a peer address, so it should be sent upstream regardless of the state of
the ACS Req/Comp redirect bits.

Once the translation comes back, the ATS endpoint should send the TLP to
the peer address with the AT packet type and it will be directed to the
peer provided the Direct Translated bit is set (or the redirect bits are
unset).

I can't see how turning off the Req/Comp redirect bits could break
anything except for the isolation they provide.
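
Spelled out as a per-bridge check (helper name and structure are hypothetical, not from this series): a bridge is usable for P2P either when Request/Completion Redirect is off, or, for an ATS-capable endpoint issuing translated requests, when Direct Translated P2P lets those requests through.

#include <linux/pci.h>

/* Hypothetical sketch, not part of this series. */
static bool bridge_allows_p2p(struct pci_dev *bridge, bool ep_has_ats)
{
	int pos = pci_find_ext_capability(bridge, PCI_EXT_CAP_ID_ACS);
	u16 ctrl;

	if (!pos)
		return true;	/* no ACS capability, nothing redirects */

	pci_read_config_word(bridge, pos + PCI_ACS_CTRL, &ctrl);

	if (!(ctrl & (PCI_ACS_RR | PCI_ACS_CR)))
		return true;	/* redirect disabled, plain P2P works */

	/* redirect is on: only translated (AT) requests can still go
	 * direct, and only if Direct Translated P2P is enabled */
	return ep_has_ats && (ctrl & PCI_ACS_DT);
}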

Logan
Jerome Glisse May 9, 2018, 5:49 p.m. UTC | #49
On Wed, May 09, 2018 at 04:30:32PM +0000, Stephen  Bates wrote:
> Hi Jerome
> 
> > Now inside that page table you can point GPU virtual address
> > to use GPU memory or use system memory. Those system memory entry can
> > also be mark as ATS against a given PASID.
>     
> Thanks. This all makes sense. 
> 
> But do you have examples of this in a kernel driver (if so can you point me too it) or is this all done via user-space? Based on my grepping of the kernel code I see zero EP drivers using in-kernel ATS functionality right now...
> 

As it is tied to PASID this is done using the IOMMU, so look for callers
of amd_iommu_bind_pasid() or intel_svm_bind_mm(); in GPU land the existing
user is the AMD GPU driver, see:

drivers/gpu/drm/amd/
drivers/gpu/drm/amd/amdkfd/
drivers/gpu/drm/amd/amdgpu/

Lots of code there. The GPU code details do not really matter for
this discussion though. You do not need to do much to use PASID.
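
For readers unfamiliar with that API, a minimal sketch of the binding step using the AMD IOMMUv2 interface named above (error handling trimmed; the wrapper function name and the PASID count are made up for illustration):

#include <linux/amd-iommu.h>
#include <linux/pci.h>
#include <linux/sched.h>

static int bind_device_to_current_mm(struct pci_dev *pdev, int pasid)
{
	int ret;

	/* allocate IOMMUv2 state for up to 16 PASIDs on this device */
	ret = amd_iommu_init_device(pdev, 16);
	if (ret)
		return ret;

	/* bind @pasid to current's mm: DMA tagged with this PASID is
	 * now translated against the CPU page tables */
	ret = amd_iommu_bind_pasid(pdev, pasid, current);
	if (ret)
		amd_iommu_free_device(pdev);

	return ret;
}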

Cheers,
Jérôme
Christian König May 10, 2018, 12:52 p.m. UTC | #50
Am 09.05.2018 um 18:45 schrieb Logan Gunthorpe:
>
> On 09/05/18 07:40 AM, Christian König wrote:
>> The key takeaway is that when any device has ATS enabled you can't
>> disable ACS without breaking it (even if you unplug and replug it).
> I don't follow how you came to this conclusion...
>   The ACS bits we'd be turning off are the ones that force TLPs addressed
> at a peer to go to the RC. However, ATS translation packets will be
> addressed to an untranslated address which a switch will not identify as
> a peer address so it should send upstream regardless the state of the
> ACS Req/Comp redirect bits.

Why would a switch not identify that as a peer address? We use the PASID 
together with ATS to identify the address space which a transaction 
should use.

If I'm not completely mistaken when you disable ACS it is perfectly 
possible that a bridge identifies a transaction as belonging to a peer 
address, which isn't what we want here.

Christian.

>
> Once the translation comes back, the ATS endpoint should send the TLP to
> the peer address with the AT packet type and it will be directed to the
> peer provided the Direct Translated bit is set (or the redirect bits are
> unset).
>
> I can't see how turning off the Req/Comp redirect bits could break
> anything except for the isolation they provide.
>
> Logan
Stephen Bates May 10, 2018, 2:16 p.m. UTC | #51
Hi Christian

> Why would a switch not identify that as a peer address? We use the PASID
>    together with ATS to identify the address space which a transaction
>    should use.


I think you are conflating two types of TLPs here. If the device supports ATS then it will issue a TR TLP to obtain a translated address from the IOMMU. This TR TLP will be addressed to the RP and so regardless of ACS it is going up to the Root Port. When it gets the response it gets the physical address and can use that with the TA bit set for the p2pdma. In the case of ATS support we also have more control over ACS as we can disable it just for TA addresses (as per 7.7.7.2 of the spec).
    
>   If I'm not completely mistaken when you disable ACS it is perfectly
>   possible that a bridge identifies a transaction as belonging to a peer
>   address, which isn't what we want here.

   
You are right here and I think this illustrates a problem for using the IOMMU at all when P2PDMA devices do not support ATS. Let me explain:

If we want to do a P2PDMA and the DMA device does not support ATS then I think we have to disable the IOMMU (something Mike suggested earlier). The reason is that since ATS is not an option the EP must initiate the DMA using the addresses passed down to it. If the IOMMU is on then this is an IOVA that could (with some non-zero probability) point to an IO Memory address in the same PCI domain. So if we disable ACS we are in trouble as we might MemWr to the wrong place but if we enable ACS we lose much of the benefit of P2PDMA. Disabling the IOMMU removes the IOVA risk and ironically also resolves the IOMMU grouping issues.

So I think if we want to support performant P2PDMA for devices that don't have ATS (and no NVMe SSDs today support ATS) then we have to disable the IOMMU. I know this is problematic for AMD's use case so perhaps we also need to consider a mode for P2PDMA for devices that DO support ATS where we can enable the IOMMU (but in this case EPs without ATS cannot participate as P2PDMA DMA initiators).

Make sense?

Stephen
Stephen Bates May 10, 2018, 2:20 p.m. UTC | #52
Hi Jerome

> As it is tie to PASID this is done using IOMMU so looks for caller
> of amd_iommu_bind_pasid() or intel_svm_bind_mm() in GPU the existing
>  user is the AMD GPU driver see:

    
Ah thanks. This cleared things up for me. A quick search shows there are still no users of intel_svm_bind_mm() but I see the AMD version used in that GPU driver.

One thing I could not grok from the code is how the GPU driver indicates which DMA events require ATS translations and which do not. I am assuming the driver implements some way of indicating that and it's not just a global ON or OFF for all DMAs? The reason I ask is that I am looking at what would need to be added in the NVMe spec, above and beyond what we have in PCI ATS, if NVMe were to support efficient use of ATS (for example, would we need a flag in the submission queue entries to indicate that a particular IO's SGL/PRP should undergo ATS).

Cheers

Stephen
Christian König May 10, 2018, 2:29 p.m. UTC | #53
Am 10.05.2018 um 16:20 schrieb Stephen Bates:
> Hi Jerome
>
>> As it is tie to PASID this is done using IOMMU so looks for caller
>> of amd_iommu_bind_pasid() or intel_svm_bind_mm() in GPU the existing
>>   user is the AMD GPU driver see:
>      
> Ah thanks. This cleared things up for me. A quick search shows there are still no users of intel_svm_bind_mm() but I see the AMD version used in that GPU driver.

Just FYI: There is also another effort ongoing to give both the AMD, 
Intel as well as ARM IOMMUs a common interface so that drivers can use 
whatever the platform offers fro SVM support.

> One thing I could not grok from the code how the GPU driver indicates which DMA events require ATS translations and which do not. I am assuming the driver implements someway of indicating that and its not just a global ON or OFF for all DMAs? The reason I ask is that I looking at if NVMe was to support ATS what would need to be added in the NVMe spec above and beyond what we have in PCI ATS to support efficient use of ATS (for example would we need a flag in the submission queue entries to indicate a particular IO's SGL/PRP should undergo ATS).

Oh, well that is complicated at best.

On very old hardware it wasn't a window, but instead you had to use 
special commands in your shader which indicated that you want to use an 
ATS transaction instead of a normal PCIe transaction for your 
read/write/atomic.

As Jerome explained on most hardware we have a window inside the 
internal GPU address space which when accessed issues a ATS transaction 
with a configurable PASID.

But on newer hardware that window became a bit in the GPUVM page
tables, so in theory we can now control it at a 4K granularity for
the internal 48-bit GPU address space.

Christian.

>
> Cheers
>
> Stephen
>
Jerome Glisse May 10, 2018, 2:41 p.m. UTC | #54
On Thu, May 10, 2018 at 02:16:25PM +0000, Stephen  Bates wrote:
> Hi Christian
> 
> > Why would a switch not identify that as a peer address? We use the PASID 
> >    together with ATS to identify the address space which a transaction 
> >    should use.
> 
> I think you are conflating two types of TLPs here. If the device supports ATS then it will issue a TR TLP to obtain a translated address from the IOMMU. This TR TLP will be addressed to the RP and so regardless of ACS it is going up to the Root Port. When it gets the response it gets the physical address and can use that with the TA bit set for the p2pdma. In the case of ATS support we also have more control over ACS as we can disable it just for TA addresses (as per 7.7.7.7.2 of the spec).
>     
>  >   If I'm not completely mistaken when you disable ACS it is perfectly 
>  >   possible that a bridge identifies a transaction as belonging to a peer 
>  >   address, which isn't what we want here.
>    
> You are right here and I think this illustrates a problem for using the IOMMU at all when P2PDMA devices do not support ATS. Let me explain:
> 
> If we want to do a P2PDMA and the DMA device does not support ATS then I think we have to disable the IOMMU (something Mike suggested earlier). The reason is that since ATS is not an option the EP must initiate the DMA using the addresses passed down to it. If the IOMMU is on then this is an IOVA that could (with some non-zero probability) point to an IO Memory address in the same PCI domain. So if we disable ACS we are in trouble as we might MemWr to the wrong place but if we enable ACS we lose much of the benefit of P2PDMA. Disabling the IOMMU removes the IOVA risk and ironically also resolves the IOMMU grouping issues.
> 
> So I think if we want to support performant P2PDMA for devices that don't have ATS (and no NVMe SSDs today support ATS) then we have to disable the IOMMU. I know this is problematic for AMDs use case so perhaps we also need to consider a mode for P2PDMA for devices that DO support ATS where we can enable the IOMMU (but in this case EPs without ATS cannot participate as P2PDMA DMA iniators).
> 
> Make sense?
> 

Note that on GPUs we would not rely on ATS for peer to peer. Some parts
of the GPU (DMA engines) do not necessarily support ATS. Yet those
are the parts likely to be used in peer to peer.

However, there is a distinction in objectives here that I believe is lost.
We (aka GPU people, aka the good guys ;)) do not want to do peer to peer
for performance reasons, i.e. we do not mind having our transactions go
to the root complex and back down to the destination. At least in the use
case I am working on this is fine.

The reason is that GPUs are giving up on PCIe (see all the specialized links
like NVLink that are popping up in the GPU space). So for fast GPU
interconnect we have these new links. Yet for legacy and interoperability we
would like to do peer to peer with other devices like RDMA ... going through
the root complex would be fine from a performance point of view. Worst
case it is slower than the existing design where system memory is
used as a bounce buffer.

Also, IOMMU isolation does matter a lot to us. Think of someone using this
peer to peer to gain control of a server in the cloud.

Cheers,
Jérôme
Jerome Glisse May 10, 2018, 2:59 p.m. UTC | #55
On Thu, May 10, 2018 at 04:29:44PM +0200, Christian König wrote:
> Am 10.05.2018 um 16:20 schrieb Stephen Bates:
> > Hi Jerome
> > 
> > > As it is tie to PASID this is done using IOMMU so looks for caller
> > > of amd_iommu_bind_pasid() or intel_svm_bind_mm() in GPU the existing
> > >   user is the AMD GPU driver see:
> > Ah thanks. This cleared things up for me. A quick search shows there are still no users of intel_svm_bind_mm() but I see the AMD version used in that GPU driver.
> 
> Just FYI: There is also another effort ongoing to give both the AMD, Intel
> as well as ARM IOMMUs a common interface so that drivers can use whatever
> the platform offers fro SVM support.
> 
> > One thing I could not grok from the code how the GPU driver indicates which DMA events require ATS translations and which do not. I am assuming the driver implements someway of indicating that and its not just a global ON or OFF for all DMAs? The reason I ask is that I looking at if NVMe was to support ATS what would need to be added in the NVMe spec above and beyond what we have in PCI ATS to support efficient use of ATS (for example would we need a flag in the submission queue entries to indicate a particular IO's SGL/PRP should undergo ATS).
> 
> Oh, well that is complicated at best.
> 
> On very old hardware it wasn't a window, but instead you had to use special
> commands in your shader which indicated that you want to use an ATS
> transaction instead of a normal PCIe transaction for your read/write/atomic.
> 
> As Jerome explained on most hardware we have a window inside the internal
> GPU address space which when accessed issues a ATS transaction with a
> configurable PASID.
> 
> But on very newer hardware that window became a bit in the GPUVM page
> tables, so in theory we now can control it on a 4K granularity basis for the
> internal 48bit GPU address space.
> 

To complete this, a 50-line primer on GPUs:

GPUVA - GPU virtual address
GPUPA - GPU physical address

GPUs run programs very much like CPU programs except that a program will have
many thousands of threads running concurrently. There is a hierarchy of
groups for a given program, i.e. threads are grouped together; the lowest
hierarchy level has a group size of <= 64 threads on most GPUs.

Those programs (called shaders for graphics programs, think OpenGL or Vulkan,
or compute for GPGPU, think OpenCL or CUDA) are submitted by userspace
against a given address space. In the "old" days (a couple of years back
when dinosaurs were still roaming the earth) this address space was
specific to the GPU and each user space program could create multiple
GPU address spaces. All the memory operations done by the program were
against this address space. Hence all PCIe transactions are spawned from
a program + address space.

GPUs use a page table + window aperture (the window aperture is going away
so you can focus on the page table) to translate a GPU virtual address into
a physical address. The physical address can point to GPU local memory,
to system memory, or to another PCIe device's memory (i.e. some PCIe BAR).

So all PCIe transactions are spawned through this process of GPUVA to GPUPA;
the GPUPA is then handled by the GPU MMU unit, which either spawns a PCIe
transaction for a non-local GPUPA or accesses local memory otherwise.


So per se the kernel driver does not configure which transaction is
using ATS or peer to peer. A userspace program creates a GPU virtual address
space and binds objects into it. These objects can be system memory or some
other PCIe device's memory, in which case we would do a peer to peer.


So you won't find any such logic in the kernel. What you find is the
creation of virtual address spaces and the binding of objects into them.


Above I talk about the old days; nowadays we want the GPU virtual address
space to be exactly the same as the CPU virtual address space of the
process which initiates the GPU program. This is where we use the
PASID and ATS. So here userspace creates a special "GPU context" that says
that the GPU virtual address space will be the same as that of the program
that created the GPU context. A process ID is then allocated and the mm_struct
is bound to this process ID in the IOMMU driver. Then all programs executed
on the GPU use the process ID to identify the address space against which
they are running.


In all of the above I did not talk about the DMA engines which sit on the
"side" of the GPU to copy memory around. GPUs have multiple DMA engines with
different capabilities; some of those DMA engines use the same GPU address
space as described above, others use GPUPA directly.


Hope this helps in understanding the big picture. I oversimplify things and
the devil is in the details.

Cheers,
Jérôme
Logan Gunthorpe May 10, 2018, 4:32 p.m. UTC | #56
On 10/05/18 08:16 AM, Stephen  Bates wrote:
> Hi Christian
> 
>> Why would a switch not identify that as a peer address? We use the PASID 
>>    together with ATS to identify the address space which a transaction 
>>    should use.
> 
> I think you are conflating two types of TLPs here. If the device supports ATS then it will issue a TR TLP to obtain a translated address from the IOMMU. This TR TLP will be addressed to the RP and so regardless of ACS it is going up to the Root Port. When it gets the response it gets the physical address and can use that with the TA bit set for the p2pdma. In the case of ATS support we also have more control over ACS as we can disable it just for TA addresses (as per 7.7.7.7.2 of the spec).

Yes. Remember if we are using the IOMMU the EP is being programmed
(regardless of whether it's a DMA engine, NTB window or GPUVA) with an
IOVA address which is separate from the device's PCI bus address. Any
packet addressed to an IOVA address is going to go back to the root
complex no matter what the ACS bits say. Only once ATS translates the
address back into the PCI bus address will the EP send packets to the
peer; the switch will then attempt to route them to the peer, and only
then do the ACS bits apply. And the ACS Direct Translated bit allows
packets that have purportedly been translated through.

>  >   If I'm not completely mistaken when you disable ACS it is perfectly 
>  >   possible that a bridge identifies a transaction as belonging to a peer 
>  >   address, which isn't what we want here.
>    
> You are right here and I think this illustrates a problem for using the IOMMU at all when P2PDMA devices do not support ATS. Let me explain:
> 
> If we want to do a P2PDMA and the DMA device does not support ATS then I think we have to disable the IOMMU (something Mike suggested earlier). The reason is that since ATS is not an option the EP must initiate the DMA using the addresses passed down to it. If the IOMMU is on then this is an IOVA that could (with some non-zero probability) point to an IO Memory address in the same PCI domain. So if we disable ACS we are in trouble as we might MemWr to the wrong place but if we enable ACS we lose much of the benefit of P2PDMA. Disabling the IOMMU removes the IOVA risk and ironically also resolves the IOMMU grouping issues.
> So I think if we want to support performant P2PDMA for devices that don't have ATS (and no NVMe SSDs today support ATS) then we have to disable the IOMMU. I know this is problematic for AMDs use case so perhaps we also need to consider a mode for P2PDMA for devices that DO support ATS where we can enable the IOMMU (but in this case EPs without ATS cannot participate as P2PDMA DMA iniators).
> 
> Make sense?

Not to me. In the p2pdma code we specifically program DMA engines with
the PCI bus address. So regardless of whether we are using the IOMMU or
not, the packets will be forwarded directly to the peer. If the ACS
Redir bits are on they will be forced back to the RC by the switch and
the transaction will fail. If we clear the ACS bits, the TLPs will go
where we want and everything will work (but we lose the isolation of ACS).

For EPs that support ATS, we should (but don't necessarily have to)
program them with the IOVA address so they can go through the
translation process which will allow P2P without disabling the ACS Redir
bits -- provided the ACS direct translation bit is set. (And btw, if it
is, then we lose the benefit of ACS protecting against malicious EPs).
But, per above, the ATS transaction should involve only the IOVA address
so the ACS bits not being set should not break ATS.
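
A short sketch of how that split looks from a DMA engine driver's
perspective, using the mapping helper declared later in this series
(and assuming the driver already knows whether the scatterlist refers to
p2p memory):

#include <linux/dma-mapping.h>
#include <linux/pci-p2pdma.h>
#include <linux/scatterlist.h>

static int queue_dma(struct device *dma_dev, struct scatterlist *sgl,
		     int nents, bool is_p2p)
{
	int count;

	if (is_p2p)
		/* PCI bus addresses: TLPs are routed directly by the
		 * switch, so the ACS P2P Redirect bits must be clear. */
		count = pci_p2pdma_map_sg(dma_dev, sgl, nents,
					  DMA_BIDIRECTIONAL);
	else
		/* IOVAs when the IOMMU is on (or physical addresses when
		 * it is off): these TLPs go up to the root complex
		 * regardless of ACS. */
		count = dma_map_sg(dma_dev, sgl, nents, DMA_BIDIRECTIONAL);

	if (!count)
		return -EIO;

	/* Program the engine with sg_dma_address()/sg_dma_len() entries. */
	return 0;
}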

Logan
Stephen Bates May 10, 2018, 5:11 p.m. UTC | #57
> Not to me. In the p2pdma code we specifically program DMA engines with
> the PCI bus address.

Ah yes of course. Brain fart on my part. We are not programming the P2PDMA initiator with an IOVA but with the PCI bus address...

> So regardless of whether we are using the IOMMU or
> not, the packets will be forwarded directly to the peer. If the ACS
> Redir bits are on they will be forced back to the RC by the switch and
> the transaction will fail. If we clear the ACS bits, the TLPs will go
> where we want and everything will work (but we lose the isolation of ACS).

Agreed.

> For EPs that support ATS, we should (but don't necessarily have to)
> program them with the IOVA address so they can go through the
> translation process which will allow P2P without disabling the ACS Redir
> bits -- provided the ACS direct translation bit is set. (And btw, if it
> is, then we lose the benefit of ACS protecting against malicious EPs).
> But, per above, the ATS transaction should involve only the IOVA address
> so the ACS bits not being set should not break ATS.

Well we would still have to clear some ACS bits but now we can clear only for translated addresses.

Stephen
Logan Gunthorpe May 10, 2018, 5:15 p.m. UTC | #58
On 10/05/18 11:11 AM, Stephen  Bates wrote:
>> Not to me. In the p2pdma code we specifically program DMA engines with
>> the PCI bus address. 
> 
> Ah yes of course. Brain fart on my part. We are not programming the P2PDMA initiator with an IOVA but with the PCI bus address...
> 
>> So regardless of whether we are using the IOMMU or
>> not, the packets will be forwarded directly to the peer. If the ACS
>>  Redir bits are on they will be forced back to the RC by the switch and
>>  the transaction will fail. If we clear the ACS bits, the TLPs will go
>>  where we want and everything will work (but we lose the isolation of ACS).
> 
> Agreed.
>     
>>    For EPs that support ATS, we should (but don't necessarily have to)
>>    program them with the IOVA address so they can go through the
>>    translation process which will allow P2P without disabling the ACS Redir
>>    bits -- provided the ACS direct translation bit is set. (And btw, if it
>>    is, then we lose the benefit of ACS protecting against malicious EPs).
>>    But, per above, the ATS transaction should involve only the IOVA address
>>    so the ACS bits not being set should not break ATS.
>     
> Well we would still have to clear some ACS bits but now we can clear only for translated addresses.

We don't have to clear the ACS Redir bits as we did in the first case.
We just have to make sure the ACS Direct Translated bit is set.
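
For concreteness, a sketch of what that would look like, modelled on the
ACS register accesses in the patch at the bottom of this thread (this
helper is not part of the posted series):

#include <linux/pci.h>

/* Allow Translated (ATS) requests to be routed peer-to-peer while leaving
 * P2P Request/Completion Redirect enabled for untranslated requests. */
static int pci_acs_enable_direct_trans(struct pci_dev *pdev)
{
	u16 ctrl;
	int pos;

	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
	if (!pos)
		return -ENODEV;

	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
	ctrl |= PCI_ACS_DT;		/* ACS Direct Translated P2P */
	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);

	return 0;
}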

Logan
Stephen Bates May 10, 2018, 6:41 p.m. UTC | #59
Hi Jerome

>    Note on GPU we do would not rely on ATS for peer to peer. Some part
>    of the GPU (DMA engines) do not necessarily support ATS. Yet those
>    are the part likely to be use in peer to peer.

OK this is good to know. I agree the DMA engine is probably one of the GPU components most applicable to p2pdma.

>    We (ake GPU people aka the good guys ;)) do no want to do peer to peer
>    for performance reasons ie we do not care having our transaction going
>    to the root complex and back down the destination. At least in use case
>    i am working on this is fine.

If the GPU people are the good guys does that make the NVMe people the bad guys ;-). If so, what are the RDMA people??? Again good to know.

>    Reasons is that GPU are giving up on PCIe (see all specialize link like
>    NVlink that are popping up in GPU space). So for fast GPU inter-connect
>    we have this new links.

I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-). Or maybe we'll all just switch to OpenGenCCIX when the time comes.

>    Also the IOMMU isolation do matter a lot to us. Think someone using this
>    peer to peer to gain control of a server in the cloud.

I agree that IOMMU isolation is very desirable. Hence the desire to ensure we can keep the IOMMU on while doing p2pdma if at all possible whilst still delivering the desired performance to the user.

Stephen
Stephen Bates May 10, 2018, 6:44 p.m. UTC | #60
Hi Jerome

>    Hope this helps in understanding the big picture. I oversimplify things
>    and the devil is in the details.

This was a great primer, thanks for putting it together. An LWN.net article perhaps ;-)??

Stephen
Logan Gunthorpe May 10, 2018, 6:59 p.m. UTC | #61
On 10/05/18 12:41 PM, Stephen  Bates wrote:
> Hi Jerome
> 
>>    Note on GPU we do would not rely on ATS for peer to peer. Some part
>>    of the GPU (DMA engines) do not necessarily support ATS. Yet those
>>    are the part likely to be use in peer to peer.
> 
> OK this is good to know. I agree the DMA engine is probably one of the GPU components most applicable to p2pdma.
>     
>>    We (ake GPU people aka the good guys ;)) do no want to do peer to peer
>>    for performance reasons ie we do not care having our transaction going
>>    to the root complex and back down the destination. At least in use case
>>    i am working on this is fine.
> 
> If the GPU people are the good guys does that make the NVMe people the bad guys ;-). If so, what are the RDMA people??? Again good to know.

The NVMe people are the Nice Neighbors, the RDMA people are the
Righteous Romantics and the PCI people are the Pleasant Protagonists...

Obviously.

Logan
Alex Williamson May 10, 2018, 7:10 p.m. UTC | #62
On Thu, 10 May 2018 18:41:09 +0000
"Stephen  Bates" <sbates@raithlin.com> wrote:    
> >    Reasons is that GPU are giving up on PCIe (see all specialize link like
> >    NVlink that are popping up in GPU space). So for fast GPU inter-connect
> >    we have this new links.   
> 
> I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-).

No doubt, the marketing for it is quick to point out the mesh topology
of NVLink, but I haven't seen any technical documents that describe the
isolation capabilities or IOMMU interaction.  Whether this is included
or an afterthought, I have no idea.

> >    Also the IOMMU isolation do matter a lot to us. Think someone using this
> >    peer to peer to gain control of a server in the cloud.  

From that perspective, do we have any idea what NVLink means for
topology and IOMMU provided isolation and translation?  I've seen a
device assignment user report that seems to suggest it might pretend to
be PCIe compatible, but the assigned GPU ultimately doesn't work
correctly in a VM, so perhaps the software compatibility is only so
deep. Thanks,

Alex
Jerome Glisse May 10, 2018, 7:24 p.m. UTC | #63
On Thu, May 10, 2018 at 01:10:15PM -0600, Alex Williamson wrote:
> On Thu, 10 May 2018 18:41:09 +0000
> "Stephen  Bates" <sbates@raithlin.com> wrote:    
> > >    Reasons is that GPU are giving up on PCIe (see all specialize link like
> > >    NVlink that are popping up in GPU space). So for fast GPU inter-connect
> > >    we have this new links.   
> > 
> > I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-).
> 
> No doubt, the marketing for it is quick to point out the mesh topology
> of NVLink, but I haven't seen any technical documents that describe the
> isolation capabilities or IOMMU interaction.  Whether this is included
> or an afterthought, I have no idea.

AFAIK there is no IOMMU on NVLink between devices; walking a page table and
being able to sustain 80GB/s or 160GB/s is hard to achieve :) I think the
idea behind those interconnects is that devices in the mesh are inherently
secure, i.e. each single device is supposed to make sure that no one can
abuse it.

GPUs, with their virtual address spaces and contextualized program
execution units, are supposed to be secure (a Spectre-like bug might be
lurking in those but I doubt it).

So for those interconnects you program physical addresses directly into the
page tables of the devices, and those physical addresses are untranslated
from the hardware's perspective.

Note that the kernel driver that does the actual GPU page table programming
can do sanity checks on the values it is setting. So checks can also happen
at setup time. But after that the assumption is that the hardware is secure
and no one can abuse it, AFAICT.
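
Illustrative only (the types and helper below are invented for the
example): the flavour of setup-time sanity check meant here is simply
refusing to write an untranslated physical address into the device page
table unless it falls entirely inside a peer's exported window:

#include <stdbool.h>
#include <stdint.h>

struct peer_window {			/* a peer device's exported range */
	uint64_t base;
	uint64_t size;
};

/* Reject any mapping that is not fully contained in a known peer window. */
static bool gpu_pa_allowed(const struct peer_window *wins, int nr_wins,
			   uint64_t pa, uint64_t len)
{
	for (int i = 0; i < nr_wins; i++) {
		if (pa >= wins[i].base &&
		    pa + len <= wins[i].base + wins[i].size)
			return true;
	}
	return false;
}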

> 
> > >    Also the IOMMU isolation do matter a lot to us. Think someone using this
> > >    peer to peer to gain control of a server in the cloud.  
> 
> From that perspective, do we have any idea what NVLink means for
> topology and IOMMU provided isolation and translation?  I've seen a
> device assignment user report that seems to suggest it might pretend to
> be PCIe compatible, but the assigned GPU ultimately doesn't work
> correctly in a VM, so perhaps the software compatibility is only so
> deep. Thanks,

Note that each single GPU (in the configurations I am aware of) also has a
PCIe link to the CPU/main memory. So from that point of view they very much
behave like regular PCIe devices. It is just that the GPUs in the mesh can
access each other's memory through the high-bandwidth interconnect.

I am not sure how much is public beyond that; I will ask NVIDIA to try to
have someone chime in on this thread and shed light on this, if possible.

Cheers,
Jérôme
Christian König May 11, 2018, 8:52 a.m. UTC | #64
Am 10.05.2018 um 19:15 schrieb Logan Gunthorpe:
>
> On 10/05/18 11:11 AM, Stephen  Bates wrote:
>>> Not to me. In the p2pdma code we specifically program DMA engines with
>>> the PCI bus address.
>> Ah yes of course. Brain fart on my part. We are not programming the P2PDMA initiator with an IOVA but with the PCI bus address...

By disabling the ACS bits on the intermediate bridges you turn their 
address routing from IOVA addresses (which are to be resolved by the 
root complex) back to PCI bus addresses (which are resolved locally in 
the bridge).

This only works when the IOVA and the PCI bus addresses never overlap. 
I'm not sure how the IOVA allocation works but I don't think we 
guarantee that on Linux.

>>
>>> So regardless of whether we are using the IOMMU or
>>> not, the packets will be forwarded directly to the peer. If the ACS
>>>   Redir bits are on they will be forced back to the RC by the switch and
>>>   the transaction will fail. If we clear the ACS bits, the TLPs will go
>>>   where we want and everything will work (but we lose the isolation of ACS).
>> Agreed.

If we really want to enable P2P without ATS and IOMMU enabled I think we 
should probably approach it like this:

a) Make double sure that IOVA in an IOMMU group never overlap with PCI 
BARs in that group.

b) Add configuration options to put a whole PCI branch of devices (e.g. 
a bridge) into a single IOMMU group.

c) Add a configuration option to disable the ACS bit on bridges in the 
same IOMMU group.


I agree that we have a rather special case here, but I still find that 
approach rather brave and would vote for disabling P2P without ATS when 
IOMMU is enabled.


BTW: I can't say anything about other implementations, but at least for 
the AMD-IOMMU the transaction won't fail when it is sent to the root 
complex.

Instead the root complex would send it to the correct device. I already 
tested that on an AMD Ryzen with IOMMU enabled and P2P between two GPUs 
(but it could be that this only works because of ATS).

Regards,
Christian.

>>>     For EPs that support ATS, we should (but don't necessarily have to)
>>>     program them with the IOVA address so they can go through the
>>>     translation process which will allow P2P without disabling the ACS Redir
>>>     bits -- provided the ACS direct translation bit is set. (And btw, if it
>>>     is, then we lose the benefit of ACS protecting against malicious EPs).
>>>     But, per above, the ATS transaction should involve only the IOVA address
>>>     so the ACS bits not being set should not break ATS.
>>      
>> Well we would still have to clear some ACS bits but now we can clear only for translated addresses.
> We don't have to clear the ACS Redir bits as we did in the first case.
> We just have to make sure the ACS Direct Translated bit is set.
>
> Logan
Logan Gunthorpe May 11, 2018, 3:48 p.m. UTC | #65
On 5/11/2018 2:52 AM, Christian König wrote:
> This only works when the IOVA and the PCI bus addresses never overlap. 
> I'm not sure how the IOVA allocation works but I don't think we 
> guarantee that on Linux.

I find this hard to believe. There's always the possibility that some 
part of the system doesn't support ACS so if the PCI bus addresses and 
IOVA overlap there's a good chance that P2P and ATS won't work at all on 
some hardware.


> If we really want to enable P2P without ATS and IOMMU enabled I think we 
> should probably approach it like this:
> 
> a) Make double sure that IOVA in an IOMMU group never overlap with PCI 
> BARs in that group.
> 
> b) Add configuration options to put a whole PCI branch of devices (e.g. 
> a bridge) into a single IOMMU group.
> 
> c) Add a configuration option to disable the ACS bit on bridges in the 
> same IOMMU group.

I think a configuration option to manage IOMMU groups as you suggest 
would be a very complex interface and difficult to implement. I prefer 
the option to disable the ACS bit on boot and let the existing code put 
the devices into their own IOMMU group (as it should already do to 
support hardware that doesn't have ACS support).

Logan
Stephen Bates May 11, 2018, 9:50 p.m. UTC | #66
>    I find this hard to believe. There's always the possibility that some
>    part of the system doesn't support ACS so if the PCI bus addresses and
>    IOVA overlap there's a good chance that P2P and ATS won't work at all on
>    some hardware.

I tend to agree but this comes down to how IOVA addresses are generated in the kernel. Alex (or anyone else) can you point to where IOVA addresses are generated? As Logan stated earlier, p2pdma bypasses this and programs the PCI bus address directly but other IO going to the same PCI EP may flow through the IOMMU and be programmed with IOVA rather than PCI bus addresses.

>    I prefer
>    the option to disable the ACS bit on boot and let the existing code put
>    the devices into their own IOMMU group (as it should already do to
>    support hardware that doesn't have ACS support).

+1

Stephen
Stephen Bates May 11, 2018, 10:24 p.m. UTC | #67
All

> Alex (or anyone else) can you point to where IOVA addresses are generated?

A case of RTFM perhaps (though a pointer to the code would still be appreciated).

https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt

Some exceptions to IOVA
-----------------------
Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff).
The same is true for peer to peer transactions. Hence we reserve the
address from PCI MMIO ranges so they are not allocated for IOVA addresses.
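
For anyone chasing the code: that reservation is done with the IOVA
allocator at IOMMU init time. A simplified sketch (the real Intel driver
walks all PCI root buses and also reserves the IOAPIC range) looks
something like:

#include <linux/iova.h>
#include <linux/pci.h>

/* Carve every MMIO window on a bus out of the IOVA space so dma_map_*()
 * can never hand out an IOVA that collides with a PCI BAR. */
static void reserve_pci_windows(struct iova_domain *iovad, struct pci_bus *bus)
{
	struct resource *res;
	int i;

	pci_bus_for_each_resource(bus, res, i) {
		if (!res || !(res->flags & IORESOURCE_MEM))
			continue;
		reserve_iova(iovad, res->start >> PAGE_SHIFT,
			     res->end >> PAGE_SHIFT);
	}
}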

Cheers

Stephen
Logan Gunthorpe May 11, 2018, 10:55 p.m. UTC | #68
On 5/11/2018 4:24 PM, Stephen  Bates wrote:
> All
> 
>>   Alex (or anyone else) can you point to where IOVA addresses are generated?
> 
> A case of RTFM perhaps (though a pointer to the code would still be appreciated).
> 
> https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt
> 
> Some exceptions to IOVA
> -----------------------
> Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff).
> The same is true for peer to peer transactions. Hence we reserve the
> address from PCI MMIO ranges so they are not allocated for IOVA addresses.

Hmm, except I'm not sure how to interpret that. It sounds like there 
can't be an IOVA address that overlaps with the PCI MMIO range which is 
good and what I'd expect.

But for peer to peer they say they don't translate the address which 
implies to me that the intention is for a peer to peer address to not be 
mapped in the same way using the dma_map interface (of course though if 
you were using ATS you'd want this for sure). Unless the existing 
dma_map commands notice a PCI MMIO address and handle it differently, 
but I don't see how.

Logan
diff mbox

Patch

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index b2396c22b53e..b6db41d4b708 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -139,6 +139,15 @@  config PCI_P2PDMA
 	  transations must be between devices behind the same root port.
 	  (Typically behind a network of PCIe switches).
 
+	  Enabling this option will also disable ACS on all ports behind
+	  any PCIe switch. This effectively puts all devices behind any
+	  switch heirarchy into the same IOMMU group. Which implies that
+	  individual devices behind any switch will not be able to be
+	  assigned to separate VMs because there is no isolation between
+	  them. Additionally, any malicious PCIe devices will be able to
+	  DMA to memory exposed by other EPs in the same domain as TLPs
+	  will not be checked by the IOMMU.
+
 	  If unsure, say N.
 
 config PCI_LABEL
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index ed9dce8552a2..e9f43b43acac 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -240,27 +240,42 @@  static struct pci_dev *find_parent_pci_dev(struct device *dev)
 }
 
 /*
- * If a device is behind a switch, we try to find the upstream bridge
- * port of the switch. This requires two calls to pci_upstream_bridge():
- * one for the upstream port on the switch, one on the upstream port
- * for the next level in the hierarchy. Because of this, devices connected
- * to the root port will be rejected.
+ * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
+ * @pdev: device to disable ACS flags for
+ *
+ * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
+ * to be disabled on any PCI bridge in order for the TLPS to not be forwarded
+ * up to the RC which is not what we want for P2P.
+ *
+ * This function is called when the devices are first enumerated and
+ * will result in all devices behind any bridge to be in the same IOMMU
+ * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
+ * on this largish hammer. If you need the devices to be in separate groups
+ * don't enable CONFIG_PCI_P2PDMA.
+ *
+ * Returns 1 if the ACS bits for this device was cleared, otherwise 0.
  */
-static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
+int pci_p2pdma_disable_acs(struct pci_dev *pdev)
 {
-	struct pci_dev *up1, *up2;
+	int pos;
+	u16 ctrl;
 
-	if (!pdev)
-		return NULL;
+	if (!pci_is_bridge(pdev))
+		return 0;
 
-	up1 = pci_dev_get(pci_upstream_bridge(pdev));
-	if (!up1)
-		return NULL;
+	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
+	if (!pos)
+		return 0;
+
+	pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
+
+	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
+
+	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
 
-	up2 = pci_dev_get(pci_upstream_bridge(up1));
-	pci_dev_put(up1);
+	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
 
-	return up2;
+	return 1;
 }
 
 /*
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e597655a5643..7e2f5724ba22 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -16,6 +16,7 @@ 
 #include <linux/of.h>
 #include <linux/of_pci.h>
 #include <linux/pci.h>
+#include <linux/pci-p2pdma.h>
 #include <linux/pm.h>
 #include <linux/slab.h>
 #include <linux/module.h>
@@ -2835,6 +2836,11 @@  static void pci_std_enable_acs(struct pci_dev *dev)
  */
 void pci_enable_acs(struct pci_dev *dev)
 {
+#ifdef CONFIG_PCI_P2PDMA
+	if (pci_p2pdma_disable_acs(dev))
+		return;
+#endif
+
 	if (!pci_acs_enable)
 		return;
 
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 0cde88341eeb..fcb3437a2f3c 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -18,6 +18,7 @@  struct block_device;
 struct scatterlist;
 
 #ifdef CONFIG_PCI_P2PDMA
+int pci_p2pdma_disable_acs(struct pci_dev *pdev);
 int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 		u64 offset);
 int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
@@ -40,6 +41,10 @@  int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
 void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
 			 enum dma_data_direction dir);
 #else /* CONFIG_PCI_P2PDMA */
+static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
+{
+	return 0;
+}
 static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
 		size_t size, u64 offset)
 {