diff mbox

pci: Enable overrides for missing ACS capabilities

Message ID 20130530183955.14686.89444.stgit@bling.home (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show

Commit Message

Alex Williamson May 30, 2013, 6:40 p.m. UTC
PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that
allows us to control whether transactions are allowed to be redirected
in various subnodes of a PCIe topology.  For instance, if two
endpoints are below a root port or downsteam switch port, the
downstream port may optionally redirect transactions between the
devices, bypassing upstream devices.  The same can happen internally
on multifunction devices.  The transaction may never be visible to the
upstream devices.

One upstream device that we particularly care about is the IOMMU.  If
a redirection occurs in the topology below the IOMMU, then the IOMMU
cannot provide isolation between devices.  This is why the PCIe spec
encourages topologies to include ACS support.  Without it, we have to
assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.

Unfortunately, far too many topologies do not support ACS to make this
a steadfast requirement.  Even the latest chipsets from Intel are only
sporadically supporting ACS.  We have trouble getting interconnect
vendors to include the PCIe spec required PCIe capability, let alone
suggested features.

Therefore, we need to add some flexibility.  The pcie_acs_override=
boot option lets users opt-in specific devices or sets of devices to
assume ACS support.  The "downstream" option assumes full ACS support
on root ports and downstream switch ports.  The "multifunction"
option assumes the subset of ACS features available on multifunction
endpoints and upstream switch ports are supported.  The "id:nnnn:nnnn"
option enables ACS support on devices matching the provided vendor
and device IDs, allowing more strategic ACS overrides.  These options
may be combined in any order.  A maximum of 16 id specific overrides
are available.  It's suggested to use the most limited set of options
necessary to avoid completely disabling ACS across the topology.

Note to hardware vendors, we have facilities to permanently quirk
specific devices which enforce isolation but not provide an ACS
capability.  Please contact me to have your devices added and save
your customers the hassle of this boot option.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 Documentation/kernel-parameters.txt |   10 +++
 drivers/pci/quirks.c                |  102 +++++++++++++++++++++++++++++++++++
 2 files changed, 112 insertions(+)


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Alex Williamson June 14, 2013, 2:21 p.m. UTC | #1
Bump.  Comments?

On Thu, 2013-05-30 at 12:40 -0600, Alex Williamson wrote:
> PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that
> allows us to control whether transactions are allowed to be redirected
> in various subnodes of a PCIe topology.  For instance, if two
> endpoints are below a root port or downsteam switch port, the
> downstream port may optionally redirect transactions between the
> devices, bypassing upstream devices.  The same can happen internally
> on multifunction devices.  The transaction may never be visible to the
> upstream devices.
> 
> One upstream device that we particularly care about is the IOMMU.  If
> a redirection occurs in the topology below the IOMMU, then the IOMMU
> cannot provide isolation between devices.  This is why the PCIe spec
> encourages topologies to include ACS support.  Without it, we have to
> assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.
> 
> Unfortunately, far too many topologies do not support ACS to make this
> a steadfast requirement.  Even the latest chipsets from Intel are only
> sporadically supporting ACS.  We have trouble getting interconnect
> vendors to include the PCIe spec required PCIe capability, let alone
> suggested features.
> 
> Therefore, we need to add some flexibility.  The pcie_acs_override=
> boot option lets users opt-in specific devices or sets of devices to
> assume ACS support.  The "downstream" option assumes full ACS support
> on root ports and downstream switch ports.  The "multifunction"
> option assumes the subset of ACS features available on multifunction
> endpoints and upstream switch ports are supported.  The "id:nnnn:nnnn"
> option enables ACS support on devices matching the provided vendor
> and device IDs, allowing more strategic ACS overrides.  These options
> may be combined in any order.  A maximum of 16 id specific overrides
> are available.  It's suggested to use the most limited set of options
> necessary to avoid completely disabling ACS across the topology.
> 
> Note to hardware vendors, we have facilities to permanently quirk
> specific devices which enforce isolation but not provide an ACS
> capability.  Please contact me to have your devices added and save
> your customers the hassle of this boot option.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  Documentation/kernel-parameters.txt |   10 +++
>  drivers/pci/quirks.c                |  102 +++++++++++++++++++++++++++++++++++
>  2 files changed, 112 insertions(+)
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 47bb23c..a60e6ad 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2349,6 +2349,16 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  		nomsi	Do not use MSI for native PCIe PME signaling (this makes
>  			all PCIe root ports use INTx for all services).
>  
> +	pcie_acs_override =
> +			[PCIE] Override missing PCIe ACS support for:
> +		downstream
> +			All downstream ports - full ACS capabilties
> +		multifunction
> +			All multifunction devices - multifunction ACS subset
> +		id:nnnn:nnnn
> +			Specfic device - full ACS capabilities
> +			Specified as vid:did (vendor/device ID) in hex
> +
>  	pcmv=		[HW,PCMCIA] BadgePAD 4
>  
>  	pd.		[PARIDE]
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 0369fb6..c7609f6 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3292,11 +3292,113 @@ struct pci_dev *pci_get_dma_source(struct pci_dev *dev)
>  	return pci_dev_get(dev);
>  }
>  
> +static bool acs_on_downstream;
> +static bool acs_on_multifunction;
> +
> +#define NUM_ACS_IDS 16
> +struct acs_on_id {
> +	unsigned short vendor;
> +	unsigned short device;
> +};
> +static struct acs_on_id acs_on_ids[NUM_ACS_IDS];
> +static u8 max_acs_id;
> +
> +static __init int pcie_acs_override_setup(char *p)
> +{
> +	if (!p)
> +		return -EINVAL;
> +
> +	while (*p) {
> +		if (!strncmp(p, "downstream", 10))
> +			acs_on_downstream = true;
> +		if (!strncmp(p, "multifunction", 13))
> +			acs_on_multifunction = true;
> +		if (!strncmp(p, "id:", 3)) {
> +			char opt[5];
> +			int ret;
> +			long val;
> +
> +			if (max_acs_id >= NUM_ACS_IDS - 1) {
> +				pr_warn("Out of PCIe ACS override slots (%d)\n",
> +					NUM_ACS_IDS);
> +				goto next;
> +			}
> +
> +			p += 3;
> +			snprintf(opt, 5, "%s", p);
> +			ret = kstrtol(opt, 16, &val);
> +			if (ret) {
> +				pr_warn("PCIe ACS ID parse error %d\n", ret);
> +				goto next;
> +			}
> +			acs_on_ids[max_acs_id].vendor = val;
> +
> +			p += strcspn(p, ":");
> +			if (*p != ':') {
> +				pr_warn("PCIe ACS invalid ID\n");
> +				goto next;
> +			}
> +
> +			p++;
> +			snprintf(opt, 5, "%s", p);
> +			ret = kstrtol(opt, 16, &val);
> +			if (ret) {
> +				pr_warn("PCIe ACS ID parse error %d\n", ret);
> +				goto next;
> +			}
> +			acs_on_ids[max_acs_id].device = val;
> +			max_acs_id++;
> +		}
> +next:
> +		p += strcspn(p, ",");
> +		if (*p == ',')
> +			p++;
> +	}
> +
> +	if (acs_on_downstream || acs_on_multifunction || max_acs_id)
> +		pr_warn("Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA\n");
> +
> +	return 0;
> +}
> +early_param("pcie_acs_override", pcie_acs_override_setup);
> +
> +static int pcie_acs_overrides(struct pci_dev *dev, u16 acs_flags)
> +{
> +	int i;
> +
> +	/* Never override ACS for legacy devices or devices with ACS caps */
> +	if (!pci_is_pcie(dev) ||
> +	    pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS))
> +		return -ENOTTY;
> +
> +	for (i = 0; i < max_acs_id; i++)
> +		if (acs_on_ids[i].vendor == dev->vendor &&
> +		    acs_on_ids[i].device == dev->device)
> +			return 1;
> +
> +	switch (pci_pcie_type(dev)) {
> +	case PCI_EXP_TYPE_DOWNSTREAM:
> +	case PCI_EXP_TYPE_ROOT_PORT:
> +		if (acs_on_downstream)
> +			return 1;
> +		break;
> +	case PCI_EXP_TYPE_ENDPOINT:
> +	case PCI_EXP_TYPE_UPSTREAM:
> +	case PCI_EXP_TYPE_LEG_END:
> +	case PCI_EXP_TYPE_RC_END:
> +		if (acs_on_multifunction && dev->multifunction)
> +			return 1;
> +	}
> +
> +	return -ENOTTY;
> +}
> +
>  static const struct pci_dev_acs_enabled {
>  	u16 vendor;
>  	u16 device;
>  	int (*acs_enabled)(struct pci_dev *dev, u16 acs_flags);
>  } pci_dev_acs_enabled[] = {
> +	{ PCI_ANY_ID, PCI_ANY_ID, pcie_acs_overrides },
>  	{ 0 }
>  };
>  
> 



--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Donald Dutile June 17, 2013, 3:01 p.m. UTC | #2
On 05/30/2013 02:40 PM, Alex Williamson wrote:
> PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that
> allows us to control whether transactions are allowed to be redirected
> in various subnodes of a PCIe topology.  For instance, if two
> endpoints are below a root port or downsteam switch port, the
> downstream port may optionally redirect transactions between the
> devices, bypassing upstream devices.  The same can happen internally
> on multifunction devices.  The transaction may never be visible to the
> upstream devices.
>
> One upstream device that we particularly care about is the IOMMU.  If
> a redirection occurs in the topology below the IOMMU, then the IOMMU
> cannot provide isolation between devices.  This is why the PCIe spec
> encourages topologies to include ACS support.  Without it, we have to
> assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.
>
> Unfortunately, far too many topologies do not support ACS to make this
> a steadfast requirement.  Even the latest chipsets from Intel are only
> sporadically supporting ACS.  We have trouble getting interconnect
> vendors to include the PCIe spec required PCIe capability, let alone
> suggested features.
>
> Therefore, we need to add some flexibility.  The pcie_acs_override=
> boot option lets users opt-in specific devices or sets of devices to
> assume ACS support.  The "downstream" option assumes full ACS support
> on root ports and downstream switch ports.  The "multifunction"
> option assumes the subset of ACS features available on multifunction
> endpoints and upstream switch ports are supported.  The "id:nnnn:nnnn"
> option enables ACS support on devices matching the provided vendor
> and device IDs, allowing more strategic ACS overrides.  These options
> may be combined in any order.  A maximum of 16 id specific overrides
> are available.  It's suggested to use the most limited set of options
> necessary to avoid completely disabling ACS across the topology.
>
> Note to hardware vendors, we have facilities to permanently quirk
> specific devices which enforce isolation but not provide an ACS
> capability.  Please contact me to have your devices added and save
> your customers the hassle of this boot option.
>
> Signed-off-by: Alex Williamson<alex.williamson@redhat.com>
> ---
>   Documentation/kernel-parameters.txt |   10 +++
>   drivers/pci/quirks.c                |  102 +++++++++++++++++++++++++++++++++++
>   2 files changed, 112 insertions(+)
>

Feel free to add my ack.

I like the fact that all of this code is in quirks, and not sprinkled btwn pci core & quirks.
As we have discovered, even common, ACS-compatilbe x86 chipsets are out there w/o ACS caps.
Additionally, an unclear area of the spec has some vendors providing 'null ACS' caps,
which is suppose to be interpreted as no-peer-to-peer DMA; this latter 'feature' is being
brought up to PCI-SIG to get a note added to SRIOV spec in the ACS-related material to clarify.

> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 47bb23c..a60e6ad 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2349,6 +2349,16 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>   		nomsi	Do not use MSI for native PCIe PME signaling (this makes
>   			all PCIe root ports use INTx for all services).
>
> +	pcie_acs_override =
> +			[PCIE] Override missing PCIe ACS support for:
> +		downstream
> +			All downstream ports - full ACS capabilties
> +		multifunction
> +			All multifunction devices - multifunction ACS subset
> +		id:nnnn:nnnn
> +			Specfic device - full ACS capabilities
> +			Specified as vid:did (vendor/device ID) in hex
> +
>   	pcmv=		[HW,PCMCIA] BadgePAD 4
>
>   	pd.		[PARIDE]
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 0369fb6..c7609f6 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3292,11 +3292,113 @@ struct pci_dev *pci_get_dma_source(struct pci_dev *dev)
>   	return pci_dev_get(dev);
>   }
>
> +static bool acs_on_downstream;
> +static bool acs_on_multifunction;
> +
> +#define NUM_ACS_IDS 16
> +struct acs_on_id {
> +	unsigned short vendor;
> +	unsigned short device;
> +};
At first, I wasn't sure if vid/did would be sufficient, but convinced myself
that a sub-vid/did that differed amongst sub-vid's, even if they added/modified
ACS cap, the ACS-cap-existence test in pcie_acs_overrides() will not exercise
any override for that variant, and both bases -- newer/sub-vid modified devices
w/ACS, and vid/did w/o ACS will be handled properly.
So, another reason for the ack.

> +static struct acs_on_id acs_on_ids[NUM_ACS_IDS];
> +static u8 max_acs_id;
> +
> +static __init int pcie_acs_override_setup(char *p)
> +{
> +	if (!p)
> +		return -EINVAL;
> +
> +	while (*p) {
> +		if (!strncmp(p, "downstream", 10))
> +			acs_on_downstream = true;
> +		if (!strncmp(p, "multifunction", 13))
> +			acs_on_multifunction = true;
> +		if (!strncmp(p, "id:", 3)) {
> +			char opt[5];
> +			int ret;
> +			long val;
> +
> +			if (max_acs_id>= NUM_ACS_IDS - 1) {
> +				pr_warn("Out of PCIe ACS override slots (%d)\n",
> +					NUM_ACS_IDS);
> +				goto next;
> +			}
> +
> +			p += 3;
> +			snprintf(opt, 5, "%s", p);
> +			ret = kstrtol(opt, 16,&val);
> +			if (ret) {
> +				pr_warn("PCIe ACS ID parse error %d\n", ret);
> +				goto next;
> +			}
> +			acs_on_ids[max_acs_id].vendor = val;
> +
> +			p += strcspn(p, ":");
> +			if (*p != ':') {
> +				pr_warn("PCIe ACS invalid ID\n");
> +				goto next;
> +			}
> +
> +			p++;
> +			snprintf(opt, 5, "%s", p);
> +			ret = kstrtol(opt, 16,&val);
> +			if (ret) {
> +				pr_warn("PCIe ACS ID parse error %d\n", ret);
> +				goto next;
> +			}
> +			acs_on_ids[max_acs_id].device = val;
> +			max_acs_id++;
> +		}
> +next:
> +		p += strcspn(p, ",");
> +		if (*p == ',')
> +			p++;
> +	}
> +
> +	if (acs_on_downstream || acs_on_multifunction || max_acs_id)
> +		pr_warn("Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA\n");
> +
> +	return 0;
> +}
> +early_param("pcie_acs_override", pcie_acs_override_setup);
> +
> +static int pcie_acs_overrides(struct pci_dev *dev, u16 acs_flags)
> +{
> +	int i;
> +
> +	/* Never override ACS for legacy devices or devices with ACS caps */
> +	if (!pci_is_pcie(dev) ||
> +	    pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS))
> +		return -ENOTTY;
> +
> +	for (i = 0; i<  max_acs_id; i++)
> +		if (acs_on_ids[i].vendor == dev->vendor&&
> +		    acs_on_ids[i].device == dev->device)
> +			return 1;
> +
> +	switch (pci_pcie_type(dev)) {
> +	case PCI_EXP_TYPE_DOWNSTREAM:
> +	case PCI_EXP_TYPE_ROOT_PORT:
> +		if (acs_on_downstream)
> +			return 1;
> +		break;
> +	case PCI_EXP_TYPE_ENDPOINT:
> +	case PCI_EXP_TYPE_UPSTREAM:
> +	case PCI_EXP_TYPE_LEG_END:
> +	case PCI_EXP_TYPE_RC_END:
> +		if (acs_on_multifunction&&  dev->multifunction)
> +			return 1;
> +	}
> +
> +	return -ENOTTY;
> +}
> +
>   static const struct pci_dev_acs_enabled {
>   	u16 vendor;
>   	u16 device;
>   	int (*acs_enabled)(struct pci_dev *dev, u16 acs_flags);
>   } pci_dev_acs_enabled[] = {
> +	{ PCI_ANY_ID, PCI_ANY_ID, pcie_acs_overrides },
>   	{ 0 }
>   };
>
Would like to see this list filled out asap to avoid cmdline workarounds! ;-)
wait!  -- did I say that?  I want quirks?!?  ... ;-)
Honestly, I'd rather see the vid/did here then numerous kernel cmdline args being added.


>

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas June 18, 2013, 5:28 p.m. UTC | #3
On Thu, May 30, 2013 at 12:40:19PM -0600, Alex Williamson wrote:
> PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that
> allows us to control whether transactions are allowed to be redirected
> in various subnodes of a PCIe topology.  For instance, if two
> endpoints are below a root port or downsteam switch port, the
> downstream port may optionally redirect transactions between the
> devices, bypassing upstream devices.  The same can happen internally
> on multifunction devices.  The transaction may never be visible to the
> upstream devices.
> 
> One upstream device that we particularly care about is the IOMMU.  If
> a redirection occurs in the topology below the IOMMU, then the IOMMU
> cannot provide isolation between devices.  This is why the PCIe spec
> encourages topologies to include ACS support.  Without it, we have to
> assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.
> 
> Unfortunately, far too many topologies do not support ACS to make this
> a steadfast requirement.  Even the latest chipsets from Intel are only
> sporadically supporting ACS.  We have trouble getting interconnect
> vendors to include the PCIe spec required PCIe capability, let alone
> suggested features.
> 
> Therefore, we need to add some flexibility.  The pcie_acs_override=
> boot option lets users opt-in specific devices or sets of devices to
> assume ACS support.  

"ACS support" means the device provides an ACS capability and we
can change bits in the ACS Control Register to control how things
work.

You are really adding a way to "assume this device always routes
peer-to-peer DMA upstream," which ultimately translates into "assume
this device can be isolated from others via the IOMMU."  I think.

> The "downstream" option assumes full ACS support
> on root ports and downstream switch ports.  The "multifunction"
> option assumes the subset of ACS features available on multifunction
> endpoints and upstream switch ports are supported.  The "id:nnnn:nnnn"
> option enables ACS support on devices matching the provided vendor
> and device IDs, allowing more strategic ACS overrides.  These options
> may be combined in any order.  A maximum of 16 id specific overrides
> are available.  It's suggested to use the most limited set of options
> necessary to avoid completely disabling ACS across the topology.

Probably I'm confused by your use of "assume full ACS support," but I
don't understand the bit about "completely disabling ACS."  If we use
more options than necessary, it seems like we'd assume more isolation
than really exists.  That sounds like pretending ACS is *enabled* in
more places than it really is.

> Note to hardware vendors, we have facilities to permanently quirk
> specific devices which enforce isolation but not provide an ACS
> capability.  Please contact me to have your devices added and save
> your customers the hassle of this boot option.

Who do you expect to decide whether to use this option?  I think it
requires intimate knowledge of how the device works.

I think the benefit of using the option is that it makes assignment of
devices to guests more flexible, which will make it attractive to users.
But most users have no way of knowing whether it's actually *safe* to
use this.  So I worry that you're adding an easy way to pretend isolation
exists when there's no good way of being confident that it actually does.

I see the warning you added for this case; I just don't feel good about
it.  Maybe the idea is that, e.g., Red Hat will research certain devices
and recommend the option for those cases, and sign up to support that
config.  I assume you would not be willing to support its use unless
Red Hat specifically recommended it.

Bjorn

> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  Documentation/kernel-parameters.txt |   10 +++
>  drivers/pci/quirks.c                |  102 +++++++++++++++++++++++++++++++++++
>  2 files changed, 112 insertions(+)
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 47bb23c..a60e6ad 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2349,6 +2349,16 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  		nomsi	Do not use MSI for native PCIe PME signaling (this makes
>  			all PCIe root ports use INTx for all services).
>  
> +	pcie_acs_override =
> +			[PCIE] Override missing PCIe ACS support for:
> +		downstream
> +			All downstream ports - full ACS capabilties
> +		multifunction
> +			All multifunction devices - multifunction ACS subset
> +		id:nnnn:nnnn
> +			Specfic device - full ACS capabilities
> +			Specified as vid:did (vendor/device ID) in hex
> +
>  	pcmv=		[HW,PCMCIA] BadgePAD 4
>  
>  	pd.		[PARIDE]
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 0369fb6..c7609f6 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3292,11 +3292,113 @@ struct pci_dev *pci_get_dma_source(struct pci_dev *dev)
>  	return pci_dev_get(dev);
>  }
>  
> +static bool acs_on_downstream;
> +static bool acs_on_multifunction;
> +
> +#define NUM_ACS_IDS 16
> +struct acs_on_id {
> +	unsigned short vendor;
> +	unsigned short device;
> +};
> +static struct acs_on_id acs_on_ids[NUM_ACS_IDS];
> +static u8 max_acs_id;
> +
> +static __init int pcie_acs_override_setup(char *p)
> +{
> +	if (!p)
> +		return -EINVAL;
> +
> +	while (*p) {
> +		if (!strncmp(p, "downstream", 10))
> +			acs_on_downstream = true;
> +		if (!strncmp(p, "multifunction", 13))
> +			acs_on_multifunction = true;
> +		if (!strncmp(p, "id:", 3)) {
> +			char opt[5];
> +			int ret;
> +			long val;
> +
> +			if (max_acs_id >= NUM_ACS_IDS - 1) {
> +				pr_warn("Out of PCIe ACS override slots (%d)\n",
> +					NUM_ACS_IDS);
> +				goto next;
> +			}
> +
> +			p += 3;
> +			snprintf(opt, 5, "%s", p);
> +			ret = kstrtol(opt, 16, &val);
> +			if (ret) {
> +				pr_warn("PCIe ACS ID parse error %d\n", ret);
> +				goto next;
> +			}
> +			acs_on_ids[max_acs_id].vendor = val;
> +
> +			p += strcspn(p, ":");
> +			if (*p != ':') {
> +				pr_warn("PCIe ACS invalid ID\n");
> +				goto next;
> +			}
> +
> +			p++;
> +			snprintf(opt, 5, "%s", p);
> +			ret = kstrtol(opt, 16, &val);
> +			if (ret) {
> +				pr_warn("PCIe ACS ID parse error %d\n", ret);
> +				goto next;
> +			}
> +			acs_on_ids[max_acs_id].device = val;
> +			max_acs_id++;
> +		}
> +next:
> +		p += strcspn(p, ",");
> +		if (*p == ',')
> +			p++;
> +	}
> +
> +	if (acs_on_downstream || acs_on_multifunction || max_acs_id)
> +		pr_warn("Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA\n");
> +
> +	return 0;
> +}
> +early_param("pcie_acs_override", pcie_acs_override_setup);
> +
> +static int pcie_acs_overrides(struct pci_dev *dev, u16 acs_flags)
> +{
> +	int i;
> +
> +	/* Never override ACS for legacy devices or devices with ACS caps */
> +	if (!pci_is_pcie(dev) ||
> +	    pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS))
> +		return -ENOTTY;
> +
> +	for (i = 0; i < max_acs_id; i++)
> +		if (acs_on_ids[i].vendor == dev->vendor &&
> +		    acs_on_ids[i].device == dev->device)
> +			return 1;
> +
> +	switch (pci_pcie_type(dev)) {
> +	case PCI_EXP_TYPE_DOWNSTREAM:
> +	case PCI_EXP_TYPE_ROOT_PORT:
> +		if (acs_on_downstream)
> +			return 1;
> +		break;
> +	case PCI_EXP_TYPE_ENDPOINT:
> +	case PCI_EXP_TYPE_UPSTREAM:
> +	case PCI_EXP_TYPE_LEG_END:
> +	case PCI_EXP_TYPE_RC_END:
> +		if (acs_on_multifunction && dev->multifunction)
> +			return 1;
> +	}
> +
> +	return -ENOTTY;
> +}
> +
>  static const struct pci_dev_acs_enabled {
>  	u16 vendor;
>  	u16 device;
>  	int (*acs_enabled)(struct pci_dev *dev, u16 acs_flags);
>  } pci_dev_acs_enabled[] = {
> +	{ PCI_ANY_ID, PCI_ANY_ID, pcie_acs_overrides },
>  	{ 0 }
>  };
>  
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Williamson June 18, 2013, 6:20 p.m. UTC | #4
On Tue, 2013-06-18 at 11:28 -0600, Bjorn Helgaas wrote:
> On Thu, May 30, 2013 at 12:40:19PM -0600, Alex Williamson wrote:
> > PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that
> > allows us to control whether transactions are allowed to be redirected
> > in various subnodes of a PCIe topology.  For instance, if two
> > endpoints are below a root port or downsteam switch port, the
> > downstream port may optionally redirect transactions between the
> > devices, bypassing upstream devices.  The same can happen internally
> > on multifunction devices.  The transaction may never be visible to the
> > upstream devices.
> > 
> > One upstream device that we particularly care about is the IOMMU.  If
> > a redirection occurs in the topology below the IOMMU, then the IOMMU
> > cannot provide isolation between devices.  This is why the PCIe spec
> > encourages topologies to include ACS support.  Without it, we have to
> > assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.
> > 
> > Unfortunately, far too many topologies do not support ACS to make this
> > a steadfast requirement.  Even the latest chipsets from Intel are only
> > sporadically supporting ACS.  We have trouble getting interconnect
> > vendors to include the PCIe spec required PCIe capability, let alone
> > suggested features.
> > 
> > Therefore, we need to add some flexibility.  The pcie_acs_override=
> > boot option lets users opt-in specific devices or sets of devices to
> > assume ACS support.  
> 
> "ACS support" means the device provides an ACS capability and we
> can change bits in the ACS Control Register to control how things
> work.
> 
> You are really adding a way to "assume this device always routes
> peer-to-peer DMA upstream," which ultimately translates into "assume
> this device can be isolated from others via the IOMMU."  I think.

We've heard from one vendor that they support ACS with a NULL capability
structure, ie. the ACS PCIe capability exists, but reports no ACS
capabilities and allows no control.  This takes advantage of the "if
supported" style wording of the spec to imply that peer-to-peer is not
supported because the capability is not available.  This supported but
not controllable version is what we're trying to enable here.

> > The "downstream" option assumes full ACS support
> > on root ports and downstream switch ports.  The "multifunction"
> > option assumes the subset of ACS features available on multifunction
> > endpoints and upstream switch ports are supported.  The "id:nnnn:nnnn"
> > option enables ACS support on devices matching the provided vendor
> > and device IDs, allowing more strategic ACS overrides.  These options
> > may be combined in any order.  A maximum of 16 id specific overrides
> > are available.  It's suggested to use the most limited set of options
> > necessary to avoid completely disabling ACS across the topology.
> 
> Probably I'm confused by your use of "assume full ACS support,"

[on root ports and downstream ports]

>  but I
> don't understand the bit about "completely disabling ACS."

[across the topology]

>   If we use
> more options than necessary, it seems like we'd assume more isolation
> than really exists.  That sounds like pretending ACS is *enabled* in
> more places than it really is.

Exactly.  I'm just trying to suggest that booting with
pcie_acs_override=downstream,multifunction is not generally recommended
because it effectively disables ACS checking across the topology and
assumes isolation where there may be none.  In other words, if
everything is overriding ACS checks, then we've disabled any benefit of
doing the checking in the first place (so I really mean disable
checking, not disable ACS).  Instead, the recommendation is to be more
selective, possibly opting for just "downstream" or even better, using
the specific IDs for devices which are known or suspected to not allow
peer-to-peer.

> > Note to hardware vendors, we have facilities to permanently quirk
> > specific devices which enforce isolation but not provide an ACS
> > capability.  Please contact me to have your devices added and save
> > your customers the hassle of this boot option.
> 
> Who do you expect to decide whether to use this option?  I think it
> requires intimate knowledge of how the device works.
> 
> I think the benefit of using the option is that it makes assignment of
> devices to guests more flexible, which will make it attractive to users.
> But most users have no way of knowing whether it's actually *safe* to
> use this.  So I worry that you're adding an easy way to pretend isolation
> exists when there's no good way of being confident that it actually does.
> 
> I see the warning you added for this case; I just don't feel good about
> it.  Maybe the idea is that, e.g., Red Hat will research certain devices
> and recommend the option for those cases, and sign up to support that
> config.  I assume you would not be willing to support its use unless
> Red Hat specifically recommended it.

That pretty much sums it up.  We're working with vendors to try to
figure out which devices do not support ACS but don't allow
peer-to-peer, but it's difficult to get decisive statements to that
affect.  Legacy KVM device assignment relied on libvirt to do this ACS
checking and it does a poor job of it, allowing far more devices to be
assigned with ACS enforced than it really should.  It's also trivial to
disable libvirt's enforcement of ACS and people do it every day w/o
really thinking of the implications.  With VFIO-based device assignment
we move to a model where the kernel is enforcing device isolation rather
than it being at the whim of a userspace service.  Now we have not only
a more complete ACS test, but no way to make it more lenient.  Devices
that were previously allowed, no longer are and there's no way around it
without this patch.  However, this patch gives us more control than the
global disable switch in libvirt.  We can still be selective and fine
tune the shape of the groups, while hopefully adhering to the isolation
exposed by the hardware elsewhere.

I agree that it's difficult to determine whether it's safe to override,
but I don't see a way around it.  If it was obvious how to maintain
isolation, we'd do it automatically.  We need better ACS support from
vendors.  In the mean time, this allows people to complain to their
hardware vendors and opt-in to using the device anyway.  The warning is
there because if I'm debugging odd problems, I want to know that
isolation may be compromised, as the warning indicates.  Thanks,

Alex

> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  Documentation/kernel-parameters.txt |   10 +++
> >  drivers/pci/quirks.c                |  102 +++++++++++++++++++++++++++++++++++
> >  2 files changed, 112 insertions(+)
> > 
> > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> > index 47bb23c..a60e6ad 100644
> > --- a/Documentation/kernel-parameters.txt
> > +++ b/Documentation/kernel-parameters.txt
> > @@ -2349,6 +2349,16 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> >  		nomsi	Do not use MSI for native PCIe PME signaling (this makes
> >  			all PCIe root ports use INTx for all services).
> >  
> > +	pcie_acs_override =
> > +			[PCIE] Override missing PCIe ACS support for:
> > +		downstream
> > +			All downstream ports - full ACS capabilties
> > +		multifunction
> > +			All multifunction devices - multifunction ACS subset
> > +		id:nnnn:nnnn
> > +			Specfic device - full ACS capabilities
> > +			Specified as vid:did (vendor/device ID) in hex
> > +
> >  	pcmv=		[HW,PCMCIA] BadgePAD 4
> >  
> >  	pd.		[PARIDE]
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 0369fb6..c7609f6 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -3292,11 +3292,113 @@ struct pci_dev *pci_get_dma_source(struct pci_dev *dev)
> >  	return pci_dev_get(dev);
> >  }
> >  
> > +static bool acs_on_downstream;
> > +static bool acs_on_multifunction;
> > +
> > +#define NUM_ACS_IDS 16
> > +struct acs_on_id {
> > +	unsigned short vendor;
> > +	unsigned short device;
> > +};
> > +static struct acs_on_id acs_on_ids[NUM_ACS_IDS];
> > +static u8 max_acs_id;
> > +
> > +static __init int pcie_acs_override_setup(char *p)
> > +{
> > +	if (!p)
> > +		return -EINVAL;
> > +
> > +	while (*p) {
> > +		if (!strncmp(p, "downstream", 10))
> > +			acs_on_downstream = true;
> > +		if (!strncmp(p, "multifunction", 13))
> > +			acs_on_multifunction = true;
> > +		if (!strncmp(p, "id:", 3)) {
> > +			char opt[5];
> > +			int ret;
> > +			long val;
> > +
> > +			if (max_acs_id >= NUM_ACS_IDS - 1) {
> > +				pr_warn("Out of PCIe ACS override slots (%d)\n",
> > +					NUM_ACS_IDS);
> > +				goto next;
> > +			}
> > +
> > +			p += 3;
> > +			snprintf(opt, 5, "%s", p);
> > +			ret = kstrtol(opt, 16, &val);
> > +			if (ret) {
> > +				pr_warn("PCIe ACS ID parse error %d\n", ret);
> > +				goto next;
> > +			}
> > +			acs_on_ids[max_acs_id].vendor = val;
> > +
> > +			p += strcspn(p, ":");
> > +			if (*p != ':') {
> > +				pr_warn("PCIe ACS invalid ID\n");
> > +				goto next;
> > +			}
> > +
> > +			p++;
> > +			snprintf(opt, 5, "%s", p);
> > +			ret = kstrtol(opt, 16, &val);
> > +			if (ret) {
> > +				pr_warn("PCIe ACS ID parse error %d\n", ret);
> > +				goto next;
> > +			}
> > +			acs_on_ids[max_acs_id].device = val;
> > +			max_acs_id++;
> > +		}
> > +next:
> > +		p += strcspn(p, ",");
> > +		if (*p == ',')
> > +			p++;
> > +	}
> > +
> > +	if (acs_on_downstream || acs_on_multifunction || max_acs_id)
> > +		pr_warn("Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA\n");
> > +
> > +	return 0;
> > +}
> > +early_param("pcie_acs_override", pcie_acs_override_setup);
> > +
> > +static int pcie_acs_overrides(struct pci_dev *dev, u16 acs_flags)
> > +{
> > +	int i;
> > +
> > +	/* Never override ACS for legacy devices or devices with ACS caps */
> > +	if (!pci_is_pcie(dev) ||
> > +	    pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS))
> > +		return -ENOTTY;
> > +
> > +	for (i = 0; i < max_acs_id; i++)
> > +		if (acs_on_ids[i].vendor == dev->vendor &&
> > +		    acs_on_ids[i].device == dev->device)
> > +			return 1;
> > +
> > +	switch (pci_pcie_type(dev)) {
> > +	case PCI_EXP_TYPE_DOWNSTREAM:
> > +	case PCI_EXP_TYPE_ROOT_PORT:
> > +		if (acs_on_downstream)
> > +			return 1;
> > +		break;
> > +	case PCI_EXP_TYPE_ENDPOINT:
> > +	case PCI_EXP_TYPE_UPSTREAM:
> > +	case PCI_EXP_TYPE_LEG_END:
> > +	case PCI_EXP_TYPE_RC_END:
> > +		if (acs_on_multifunction && dev->multifunction)
> > +			return 1;
> > +	}
> > +
> > +	return -ENOTTY;
> > +}
> > +
> >  static const struct pci_dev_acs_enabled {
> >  	u16 vendor;
> >  	u16 device;
> >  	int (*acs_enabled)(struct pci_dev *dev, u16 acs_flags);
> >  } pci_dev_acs_enabled[] = {
> > +	{ PCI_ANY_ID, PCI_ANY_ID, pcie_acs_overrides },
> >  	{ 0 }
> >  };
> >  
> > 



--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas June 18, 2013, 9:31 p.m. UTC | #5
On Tue, Jun 18, 2013 at 12:20 PM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> On Tue, 2013-06-18 at 11:28 -0600, Bjorn Helgaas wrote:
>> On Thu, May 30, 2013 at 12:40:19PM -0600, Alex Williamson wrote:
>> > PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that
>> > allows us to control whether transactions are allowed to be redirected
>> > in various subnodes of a PCIe topology.  For instance, if two
>> > endpoints are below a root port or downsteam switch port, the
>> > downstream port may optionally redirect transactions between the
>> > devices, bypassing upstream devices.  The same can happen internally
>> > on multifunction devices.  The transaction may never be visible to the
>> > upstream devices.
>> >
>> > One upstream device that we particularly care about is the IOMMU.  If
>> > a redirection occurs in the topology below the IOMMU, then the IOMMU
>> > cannot provide isolation between devices.  This is why the PCIe spec
>> > encourages topologies to include ACS support.  Without it, we have to
>> > assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.
>> >
>> > Unfortunately, far too many topologies do not support ACS to make this
>> > a steadfast requirement.  Even the latest chipsets from Intel are only
>> > sporadically supporting ACS.  We have trouble getting interconnect
>> > vendors to include the PCIe spec required PCIe capability, let alone
>> > suggested features.
>> >
>> > Therefore, we need to add some flexibility.  The pcie_acs_override=
>> > boot option lets users opt-in specific devices or sets of devices to
>> > assume ACS support.
>>
>> "ACS support" means the device provides an ACS capability and we
>> can change bits in the ACS Control Register to control how things
>> work.
>>
>> You are really adding a way to "assume this device always routes
>> peer-to-peer DMA upstream," which ultimately translates into "assume
>> this device can be isolated from others via the IOMMU."  I think.
>
> We've heard from one vendor that they support ACS with a NULL capability
> structure, ie. the ACS PCIe capability exists, but reports no ACS
> capabilities and allows no control.  This takes advantage of the "if
> supported" style wording of the spec to imply that peer-to-peer is not
> supported because the capability is not available.  This supported but
> not controllable version is what we're trying to enable here.

I was wrong to say "we can change bits in the control register."  All
the control register bits are actually required to be hardwired to
zero when the relevant functionality is not implemented.

The example you mention (a device with an ACS capability structure
that reports no supported capabilities and allows no control) sounds
perfectly legal as a description of a device that doesn't support
peer-to-peer, and I hope it doesn't require any user intervention
(e.g., this patch) or quirks to make it work.

The ACS P2P Request Redirect "must be implemented by Functions that
support peer-to-peer traffic with other Functions."  This example
device doesn't support peer-to-peer traffic, so why would it implement
the ACS R bit?  In fact, it looks like the R bit (and all the other
bits) *must* be hardwired to zero in both capability and control
registers.

>> > The "downstream" option assumes full ACS support
>> > on root ports and downstream switch ports.  The "multifunction"
>> > option assumes the subset of ACS features available on multifunction
>> > endpoints and upstream switch ports are supported.  The "id:nnnn:nnnn"
>> > option enables ACS support on devices matching the provided vendor
>> > and device IDs, allowing more strategic ACS overrides.  These options
>> > may be combined in any order.  A maximum of 16 id specific overrides
>> > are available.  It's suggested to use the most limited set of options
>> > necessary to avoid completely disabling ACS across the topology.
>>
>> Probably I'm confused by your use of "assume full ACS support,"
>
> [on root ports and downstream ports]
>
>>  but I
>> don't understand the bit about "completely disabling ACS."
>
> [across the topology]
>
>>   If we use
>> more options than necessary, it seems like we'd assume more isolation
>> than really exists.  That sounds like pretending ACS is *enabled* in
>> more places than it really is.
>
> Exactly.  I'm just trying to suggest that booting with
> pcie_acs_override=downstream,multifunction is not generally recommended
> because it effectively disables ACS checking across the topology and
> assumes isolation where there may be none.  In other words, if
> everything is overriding ACS checks, then we've disabled any benefit of
> doing the checking in the first place (so I really mean disable
> checking, not disable ACS).

Yep, the missing "checking" is what is confusing.  Also, I think it
would be good to make the implication more explicit -- using this
option makes the kernel assume isolation where it may not actually
exist -- because the users of this option don't know about checking
anyway; that's an internal implementation detail.

> Instead, the recommendation is to be more
> selective, possibly opting for just "downstream" or even better, using
> the specific IDs for devices which are known or suspected to not allow
> peer-to-peer.
>
>> > Note to hardware vendors, we have facilities to permanently quirk
>> > specific devices which enforce isolation but not provide an ACS
>> > capability.  Please contact me to have your devices added and save
>> > your customers the hassle of this boot option.
>>
>> Who do you expect to decide whether to use this option?  I think it
>> requires intimate knowledge of how the device works.
>>
>> I think the benefit of using the option is that it makes assignment of
>> devices to guests more flexible, which will make it attractive to users.
>> But most users have no way of knowing whether it's actually *safe* to
>> use this.  So I worry that you're adding an easy way to pretend isolation
>> exists when there's no good way of being confident that it actually does.
>>
>> I see the warning you added for this case; I just don't feel good about
>> it.  Maybe the idea is that, e.g., Red Hat will research certain devices
>> and recommend the option for those cases, and sign up to support that
>> config.  I assume you would not be willing to support its use unless
>> Red Hat specifically recommended it.
>
> That pretty much sums it up.  We're working with vendors to try to
> figure out which devices do not support ACS but don't allow
> peer-to-peer, but it's difficult to get decisive statements to that
> affect.  Legacy KVM device assignment relied on libvirt to do this ACS
> checking and it does a poor job of it, allowing far more devices to be
> assigned with ACS enforced than it really should.  It's also trivial to
> disable libvirt's enforcement of ACS and people do it every day w/o
> really thinking of the implications.  With VFIO-based device assignment
> we move to a model where the kernel is enforcing device isolation rather
> than it being at the whim of a userspace service.  Now we have not only
> a more complete ACS test, but no way to make it more lenient.  Devices
> that were previously allowed, no longer are and there's no way around it
> without this patch.  However, this patch gives us more control than the
> global disable switch in libvirt.  We can still be selective and fine
> tune the shape of the groups, while hopefully adhering to the isolation
> exposed by the hardware elsewhere.
>
> I agree that it's difficult to determine whether it's safe to override,
> but I don't see a way around it.  If it was obvious how to maintain
> isolation, we'd do it automatically.  We need better ACS support from
> vendors.  In the mean time, this allows people to complain to their
> hardware vendors and opt-in to using the device anyway.  The warning is
> there because if I'm debugging odd problems, I want to know that
> isolation may be compromised, as the warning indicates.  Thanks,

I wonder if we should taint the kernel if this option is used (but not
for specific devices added to pci_dev_acs_enabled[]).  It would also
be nice if pci_dev_specific_acs_enabled() gave some indication in
dmesg for the specific devices you're hoping to add to
pci_dev_acs_enabled[].  It's not an enumeration-time quirk right now,
so I'm not sure how we'd limit it to one message per device.

I assume that if RH knows about certain devices that are safe in this
respect, you'd just add them to pci_dev_acs_enabled[] up front, and
the command-line option is really just a workaround until you can spin
a new kernel that has the table updated?

It might even make sense to simplify this option so it just assumes
*all* devices can be isolated, and get rid of the
"downstream/multifunction/vendor & device ID" stuff.  That would be
much easier to explain, and it would make it more obvious that we're
really doing something pretty scary here.

Bjorn (there is one doc comment below)

> Alex
>
>> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>> > ---
>> >  Documentation/kernel-parameters.txt |   10 +++
>> >  drivers/pci/quirks.c                |  102 +++++++++++++++++++++++++++++++++++
>> >  2 files changed, 112 insertions(+)
>> >
>> > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
>> > index 47bb23c..a60e6ad 100644
>> > --- a/Documentation/kernel-parameters.txt
>> > +++ b/Documentation/kernel-parameters.txt
>> > @@ -2349,6 +2349,16 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>> >             nomsi   Do not use MSI for native PCIe PME signaling (this makes
>> >                     all PCIe root ports use INTx for all services).
>> >
>> > +   pcie_acs_override =
>> > +                   [PCIE] Override missing PCIe ACS support for:
>> > +           downstream
>> > +                   All downstream ports - full ACS capabilties
>> > +           multifunction
>> > +                   All multifunction devices - multifunction ACS subset
>> > +           id:nnnn:nnnn
>> > +                   Specfic device - full ACS capabilities
>> > +                   Specified as vid:did (vendor/device ID) in hex

This should say something about "device isolation support," I think.
It's too big a leap from "missing ACS support" to expect users to make
it.

>> > +
>> >     pcmv=           [HW,PCMCIA] BadgePAD 4
>> >
>> >     pd.             [PARIDE]
>> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
>> > index 0369fb6..c7609f6 100644
>> > --- a/drivers/pci/quirks.c
>> > +++ b/drivers/pci/quirks.c
>> > @@ -3292,11 +3292,113 @@ struct pci_dev *pci_get_dma_source(struct pci_dev *dev)
>> >     return pci_dev_get(dev);
>> >  }
>> >
>> > +static bool acs_on_downstream;
>> > +static bool acs_on_multifunction;
>> > +
>> > +#define NUM_ACS_IDS 16
>> > +struct acs_on_id {
>> > +   unsigned short vendor;
>> > +   unsigned short device;
>> > +};
>> > +static struct acs_on_id acs_on_ids[NUM_ACS_IDS];
>> > +static u8 max_acs_id;
>> > +
>> > +static __init int pcie_acs_override_setup(char *p)
>> > +{
>> > +   if (!p)
>> > +           return -EINVAL;
>> > +
>> > +   while (*p) {
>> > +           if (!strncmp(p, "downstream", 10))
>> > +                   acs_on_downstream = true;
>> > +           if (!strncmp(p, "multifunction", 13))
>> > +                   acs_on_multifunction = true;
>> > +           if (!strncmp(p, "id:", 3)) {
>> > +                   char opt[5];
>> > +                   int ret;
>> > +                   long val;
>> > +
>> > +                   if (max_acs_id >= NUM_ACS_IDS - 1) {
>> > +                           pr_warn("Out of PCIe ACS override slots (%d)\n",
>> > +                                   NUM_ACS_IDS);
>> > +                           goto next;
>> > +                   }
>> > +
>> > +                   p += 3;
>> > +                   snprintf(opt, 5, "%s", p);
>> > +                   ret = kstrtol(opt, 16, &val);
>> > +                   if (ret) {
>> > +                           pr_warn("PCIe ACS ID parse error %d\n", ret);
>> > +                           goto next;
>> > +                   }
>> > +                   acs_on_ids[max_acs_id].vendor = val;
>> > +
>> > +                   p += strcspn(p, ":");
>> > +                   if (*p != ':') {
>> > +                           pr_warn("PCIe ACS invalid ID\n");
>> > +                           goto next;
>> > +                   }
>> > +
>> > +                   p++;
>> > +                   snprintf(opt, 5, "%s", p);
>> > +                   ret = kstrtol(opt, 16, &val);
>> > +                   if (ret) {
>> > +                           pr_warn("PCIe ACS ID parse error %d\n", ret);
>> > +                           goto next;
>> > +                   }
>> > +                   acs_on_ids[max_acs_id].device = val;
>> > +                   max_acs_id++;
>> > +           }
>> > +next:
>> > +           p += strcspn(p, ",");
>> > +           if (*p == ',')
>> > +                   p++;
>> > +   }
>> > +
>> > +   if (acs_on_downstream || acs_on_multifunction || max_acs_id)
>> > +           pr_warn("Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA\n");
>> > +
>> > +   return 0;
>> > +}
>> > +early_param("pcie_acs_override", pcie_acs_override_setup);
>> > +
>> > +static int pcie_acs_overrides(struct pci_dev *dev, u16 acs_flags)
>> > +{
>> > +   int i;
>> > +
>> > +   /* Never override ACS for legacy devices or devices with ACS caps */
>> > +   if (!pci_is_pcie(dev) ||
>> > +       pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS))
>> > +           return -ENOTTY;
>> > +
>> > +   for (i = 0; i < max_acs_id; i++)
>> > +           if (acs_on_ids[i].vendor == dev->vendor &&
>> > +               acs_on_ids[i].device == dev->device)
>> > +                   return 1;
>> > +
>> > +   switch (pci_pcie_type(dev)) {
>> > +   case PCI_EXP_TYPE_DOWNSTREAM:
>> > +   case PCI_EXP_TYPE_ROOT_PORT:
>> > +           if (acs_on_downstream)
>> > +                   return 1;
>> > +           break;
>> > +   case PCI_EXP_TYPE_ENDPOINT:
>> > +   case PCI_EXP_TYPE_UPSTREAM:
>> > +   case PCI_EXP_TYPE_LEG_END:
>> > +   case PCI_EXP_TYPE_RC_END:
>> > +           if (acs_on_multifunction && dev->multifunction)
>> > +                   return 1;
>> > +   }
>> > +
>> > +   return -ENOTTY;
>> > +}
>> > +
>> >  static const struct pci_dev_acs_enabled {
>> >     u16 vendor;
>> >     u16 device;
>> >     int (*acs_enabled)(struct pci_dev *dev, u16 acs_flags);
>> >  } pci_dev_acs_enabled[] = {
>> > +   { PCI_ANY_ID, PCI_ANY_ID, pcie_acs_overrides },
>> >     { 0 }
>> >  };
>> >
>> >
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Williamson June 18, 2013, 10:22 p.m. UTC | #6
On Tue, 2013-06-18 at 15:31 -0600, Bjorn Helgaas wrote:
> On Tue, Jun 18, 2013 at 12:20 PM, Alex Williamson
> <alex.williamson@redhat.com> wrote:
> > On Tue, 2013-06-18 at 11:28 -0600, Bjorn Helgaas wrote:
> >> On Thu, May 30, 2013 at 12:40:19PM -0600, Alex Williamson wrote:
> >> > PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that
> >> > allows us to control whether transactions are allowed to be redirected
> >> > in various subnodes of a PCIe topology.  For instance, if two
> >> > endpoints are below a root port or downsteam switch port, the
> >> > downstream port may optionally redirect transactions between the
> >> > devices, bypassing upstream devices.  The same can happen internally
> >> > on multifunction devices.  The transaction may never be visible to the
> >> > upstream devices.
> >> >
> >> > One upstream device that we particularly care about is the IOMMU.  If
> >> > a redirection occurs in the topology below the IOMMU, then the IOMMU
> >> > cannot provide isolation between devices.  This is why the PCIe spec
> >> > encourages topologies to include ACS support.  Without it, we have to
> >> > assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.
> >> >
> >> > Unfortunately, far too many topologies do not support ACS to make this
> >> > a steadfast requirement.  Even the latest chipsets from Intel are only
> >> > sporadically supporting ACS.  We have trouble getting interconnect
> >> > vendors to include the PCIe spec required PCIe capability, let alone
> >> > suggested features.
> >> >
> >> > Therefore, we need to add some flexibility.  The pcie_acs_override=
> >> > boot option lets users opt-in specific devices or sets of devices to
> >> > assume ACS support.
> >>
> >> "ACS support" means the device provides an ACS capability and we
> >> can change bits in the ACS Control Register to control how things
> >> work.
> >>
> >> You are really adding a way to "assume this device always routes
> >> peer-to-peer DMA upstream," which ultimately translates into "assume
> >> this device can be isolated from others via the IOMMU."  I think.
> >
> > We've heard from one vendor that they support ACS with a NULL capability
> > structure, ie. the ACS PCIe capability exists, but reports no ACS
> > capabilities and allows no control.  This takes advantage of the "if
> > supported" style wording of the spec to imply that peer-to-peer is not
> > supported because the capability is not available.  This supported but
> > not controllable version is what we're trying to enable here.
> 
> I was wrong to say "we can change bits in the control register."  All
> the control register bits are actually required to be hardwired to
> zero when the relevant functionality is not implemented.
> 
> The example you mention (a device with an ACS capability structure
> that reports no supported capabilities and allows no control) sounds
> perfectly legal as a description of a device that doesn't support
> peer-to-peer, and I hope it doesn't require any user intervention
> (e.g., this patch) or quirks to make it work.

It requires: 

Subject: [PATCH v2 2/2] pci: Differentiate ACS controllable from enabled

> The ACS P2P Request Redirect "must be implemented by Functions that
> support peer-to-peer traffic with other Functions."  This example
> device doesn't support peer-to-peer traffic, so why would it implement
> the ACS R bit?  In fact, it looks like the R bit (and all the other
> bits) *must* be hardwired to zero in both capability and control
> registers.

Right, if they don't support peer-to-peer then hardwiring capability and
control to zero should indicate that and is fixed by the patch
referenced above.

> >> > The "downstream" option assumes full ACS support
> >> > on root ports and downstream switch ports.  The "multifunction"
> >> > option assumes the subset of ACS features available on multifunction
> >> > endpoints and upstream switch ports are supported.  The "id:nnnn:nnnn"
> >> > option enables ACS support on devices matching the provided vendor
> >> > and device IDs, allowing more strategic ACS overrides.  These options
> >> > may be combined in any order.  A maximum of 16 id specific overrides
> >> > are available.  It's suggested to use the most limited set of options
> >> > necessary to avoid completely disabling ACS across the topology.
> >>
> >> Probably I'm confused by your use of "assume full ACS support,"
> >
> > [on root ports and downstream ports]
> >
> >>  but I
> >> don't understand the bit about "completely disabling ACS."
> >
> > [across the topology]
> >
> >>   If we use
> >> more options than necessary, it seems like we'd assume more isolation
> >> than really exists.  That sounds like pretending ACS is *enabled* in
> >> more places than it really is.
> >
> > Exactly.  I'm just trying to suggest that booting with
> > pcie_acs_override=downstream,multifunction is not generally recommended
> > because it effectively disables ACS checking across the topology and
> > assumes isolation where there may be none.  In other words, if
> > everything is overriding ACS checks, then we've disabled any benefit of
> > doing the checking in the first place (so I really mean disable
> > checking, not disable ACS).
> 
> Yep, the missing "checking" is what is confusing.  Also, I think it
> would be good to make the implication more explicit -- using this
> option makes the kernel assume isolation where it may not actually
> exist -- because the users of this option don't know about checking
> anyway; that's an internal implementation detail.

More explicit in Documentation/kernel-parameters.txt?

> > Instead, the recommendation is to be more
> > selective, possibly opting for just "downstream" or even better, using
> > the specific IDs for devices which are known or suspected to not allow
> > peer-to-peer.
> >
> >> > Note to hardware vendors, we have facilities to permanently quirk
> >> > specific devices which enforce isolation but not provide an ACS
> >> > capability.  Please contact me to have your devices added and save
> >> > your customers the hassle of this boot option.
> >>
> >> Who do you expect to decide whether to use this option?  I think it
> >> requires intimate knowledge of how the device works.
> >>
> >> I think the benefit of using the option is that it makes assignment of
> >> devices to guests more flexible, which will make it attractive to users.
> >> But most users have no way of knowing whether it's actually *safe* to
> >> use this.  So I worry that you're adding an easy way to pretend isolation
> >> exists when there's no good way of being confident that it actually does.
> >>
> >> I see the warning you added for this case; I just don't feel good about
> >> it.  Maybe the idea is that, e.g., Red Hat will research certain devices
> >> and recommend the option for those cases, and sign up to support that
> >> config.  I assume you would not be willing to support its use unless
> >> Red Hat specifically recommended it.
> >
> > That pretty much sums it up.  We're working with vendors to try to
> > figure out which devices do not support ACS but don't allow
> > peer-to-peer, but it's difficult to get decisive statements to that
> > affect.  Legacy KVM device assignment relied on libvirt to do this ACS
> > checking and it does a poor job of it, allowing far more devices to be
> > assigned with ACS enforced than it really should.  It's also trivial to
> > disable libvirt's enforcement of ACS and people do it every day w/o
> > really thinking of the implications.  With VFIO-based device assignment
> > we move to a model where the kernel is enforcing device isolation rather
> > than it being at the whim of a userspace service.  Now we have not only
> > a more complete ACS test, but no way to make it more lenient.  Devices
> > that were previously allowed, no longer are and there's no way around it
> > without this patch.  However, this patch gives us more control than the
> > global disable switch in libvirt.  We can still be selective and fine
> > tune the shape of the groups, while hopefully adhering to the isolation
> > exposed by the hardware elsewhere.
> >
> > I agree that it's difficult to determine whether it's safe to override,
> > but I don't see a way around it.  If it was obvious how to maintain
> > isolation, we'd do it automatically.  We need better ACS support from
> > vendors.  In the mean time, this allows people to complain to their
> > hardware vendors and opt-in to using the device anyway.  The warning is
> > there because if I'm debugging odd problems, I want to know that
> > isolation may be compromised, as the warning indicates.  Thanks,
> 
> I wonder if we should taint the kernel if this option is used (but not
> for specific devices added to pci_dev_acs_enabled[]).  It would also
> be nice if pci_dev_specific_acs_enabled() gave some indication in
> dmesg for the specific devices you're hoping to add to
> pci_dev_acs_enabled[].  It's not an enumeration-time quirk right now,
> so I'm not sure how we'd limit it to one message per device.

Right, setup vs use and getting single prints is a lot of extra code.
Tainting is troublesome for support, Don had some objections when I
suggested the same to him.

> I assume that if RH knows about certain devices that are safe in this
> respect, you'd just add them to pci_dev_acs_enabled[] up front, and
> the command-line option is really just a workaround until you can spin
> a new kernel that has the table updated?

Yep, needing to know about this option is not user friendly.  We want
things to "just work" whenever possible.  There are going to be cases
where there are obscure devices, even mainstream devices, that need
this.  Whether it's a temporary or long lived solution depends on what
kind of answers we can get from vendors.

> It might even make sense to simplify this option so it just assumes
> *all* devices can be isolated, and get rid of the
> "downstream/multifunction/vendor & device ID" stuff.  That would be
> much easier to explain, and it would make it more obvious that we're
> really doing something pretty scary here.

Seems like that makes it harder to isolate specific devices for
promotion to pci_dev_acs_enabled[] and more likely that the user will
turn the whole thing off for a single device and forget about isolation
of other devices.  It's a time bomb that legacy KVM assignment and
libvirt make this so easy to work around today, I'd like to be more
selective for vfio in the future.

> Bjorn (there is one doc comment below)
> 
> > Alex
> >
> >> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >> > ---
> >> >  Documentation/kernel-parameters.txt |   10 +++
> >> >  drivers/pci/quirks.c                |  102 +++++++++++++++++++++++++++++++++++
> >> >  2 files changed, 112 insertions(+)
> >> >
> >> > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> >> > index 47bb23c..a60e6ad 100644
> >> > --- a/Documentation/kernel-parameters.txt
> >> > +++ b/Documentation/kernel-parameters.txt
> >> > @@ -2349,6 +2349,16 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> >> >             nomsi   Do not use MSI for native PCIe PME signaling (this makes
> >> >                     all PCIe root ports use INTx for all services).
> >> >
> >> > +   pcie_acs_override =
> >> > +                   [PCIE] Override missing PCIe ACS support for:
> >> > +           downstream
> >> > +                   All downstream ports - full ACS capabilties
> >> > +           multifunction
> >> > +                   All multifunction devices - multifunction ACS subset
> >> > +           id:nnnn:nnnn
> >> > +                   Specfic device - full ACS capabilities
> >> > +                   Specified as vid:did (vendor/device ID) in hex
> 
> This should say something about "device isolation support," I think.
> It's too big a leap from "missing ACS support" to expect users to make
> it.

Ok.  Thanks,

Alex

> >> > +
> >> >     pcmv=           [HW,PCMCIA] BadgePAD 4
> >> >
> >> >     pd.             [PARIDE]
> >> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> >> > index 0369fb6..c7609f6 100644
> >> > --- a/drivers/pci/quirks.c
> >> > +++ b/drivers/pci/quirks.c
> >> > @@ -3292,11 +3292,113 @@ struct pci_dev *pci_get_dma_source(struct pci_dev *dev)
> >> >     return pci_dev_get(dev);
> >> >  }
> >> >
> >> > +static bool acs_on_downstream;
> >> > +static bool acs_on_multifunction;
> >> > +
> >> > +#define NUM_ACS_IDS 16
> >> > +struct acs_on_id {
> >> > +   unsigned short vendor;
> >> > +   unsigned short device;
> >> > +};
> >> > +static struct acs_on_id acs_on_ids[NUM_ACS_IDS];
> >> > +static u8 max_acs_id;
> >> > +
> >> > +static __init int pcie_acs_override_setup(char *p)
> >> > +{
> >> > +   if (!p)
> >> > +           return -EINVAL;
> >> > +
> >> > +   while (*p) {
> >> > +           if (!strncmp(p, "downstream", 10))
> >> > +                   acs_on_downstream = true;
> >> > +           if (!strncmp(p, "multifunction", 13))
> >> > +                   acs_on_multifunction = true;
> >> > +           if (!strncmp(p, "id:", 3)) {
> >> > +                   char opt[5];
> >> > +                   int ret;
> >> > +                   long val;
> >> > +
> >> > +                   if (max_acs_id >= NUM_ACS_IDS - 1) {
> >> > +                           pr_warn("Out of PCIe ACS override slots (%d)\n",
> >> > +                                   NUM_ACS_IDS);
> >> > +                           goto next;
> >> > +                   }
> >> > +
> >> > +                   p += 3;
> >> > +                   snprintf(opt, 5, "%s", p);
> >> > +                   ret = kstrtol(opt, 16, &val);
> >> > +                   if (ret) {
> >> > +                           pr_warn("PCIe ACS ID parse error %d\n", ret);
> >> > +                           goto next;
> >> > +                   }
> >> > +                   acs_on_ids[max_acs_id].vendor = val;
> >> > +
> >> > +                   p += strcspn(p, ":");
> >> > +                   if (*p != ':') {
> >> > +                           pr_warn("PCIe ACS invalid ID\n");
> >> > +                           goto next;
> >> > +                   }
> >> > +
> >> > +                   p++;
> >> > +                   snprintf(opt, 5, "%s", p);
> >> > +                   ret = kstrtol(opt, 16, &val);
> >> > +                   if (ret) {
> >> > +                           pr_warn("PCIe ACS ID parse error %d\n", ret);
> >> > +                           goto next;
> >> > +                   }
> >> > +                   acs_on_ids[max_acs_id].device = val;
> >> > +                   max_acs_id++;
> >> > +           }
> >> > +next:
> >> > +           p += strcspn(p, ",");
> >> > +           if (*p == ',')
> >> > +                   p++;
> >> > +   }
> >> > +
> >> > +   if (acs_on_downstream || acs_on_multifunction || max_acs_id)
> >> > +           pr_warn("Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA\n");
> >> > +
> >> > +   return 0;
> >> > +}
> >> > +early_param("pcie_acs_override", pcie_acs_override_setup);
> >> > +
> >> > +static int pcie_acs_overrides(struct pci_dev *dev, u16 acs_flags)
> >> > +{
> >> > +   int i;
> >> > +
> >> > +   /* Never override ACS for legacy devices or devices with ACS caps */
> >> > +   if (!pci_is_pcie(dev) ||
> >> > +       pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS))
> >> > +           return -ENOTTY;
> >> > +
> >> > +   for (i = 0; i < max_acs_id; i++)
> >> > +           if (acs_on_ids[i].vendor == dev->vendor &&
> >> > +               acs_on_ids[i].device == dev->device)
> >> > +                   return 1;
> >> > +
> >> > +   switch (pci_pcie_type(dev)) {
> >> > +   case PCI_EXP_TYPE_DOWNSTREAM:
> >> > +   case PCI_EXP_TYPE_ROOT_PORT:
> >> > +           if (acs_on_downstream)
> >> > +                   return 1;
> >> > +           break;
> >> > +   case PCI_EXP_TYPE_ENDPOINT:
> >> > +   case PCI_EXP_TYPE_UPSTREAM:
> >> > +   case PCI_EXP_TYPE_LEG_END:
> >> > +   case PCI_EXP_TYPE_RC_END:
> >> > +           if (acs_on_multifunction && dev->multifunction)
> >> > +                   return 1;
> >> > +   }
> >> > +
> >> > +   return -ENOTTY;
> >> > +}
> >> > +
> >> >  static const struct pci_dev_acs_enabled {
> >> >     u16 vendor;
> >> >     u16 device;
> >> >     int (*acs_enabled)(struct pci_dev *dev, u16 acs_flags);
> >> >  } pci_dev_acs_enabled[] = {
> >> > +   { PCI_ANY_ID, PCI_ANY_ID, pcie_acs_overrides },
> >> >     { 0 }
> >> >  };
> >> >
> >> >
> >
> >
> >



--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Donald Dutile June 18, 2013, 11:03 p.m. UTC | #7
On 06/18/2013 06:22 PM, Alex Williamson wrote:
> On Tue, 2013-06-18 at 15:31 -0600, Bjorn Helgaas wrote:
>> On Tue, Jun 18, 2013 at 12:20 PM, Alex Williamson
>> <alex.williamson@redhat.com>  wrote:
>>> On Tue, 2013-06-18 at 11:28 -0600, Bjorn Helgaas wrote:
>>>> On Thu, May 30, 2013 at 12:40:19PM -0600, Alex Williamson wrote:
>>>>> PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that
>>>>> allows us to control whether transactions are allowed to be redirected
>>>>> in various subnodes of a PCIe topology.  For instance, if two
>>>>> endpoints are below a root port or downsteam switch port, the
>>>>> downstream port may optionally redirect transactions between the
>>>>> devices, bypassing upstream devices.  The same can happen internally
>>>>> on multifunction devices.  The transaction may never be visible to the
>>>>> upstream devices.
>>>>>
>>>>> One upstream device that we particularly care about is the IOMMU.  If
>>>>> a redirection occurs in the topology below the IOMMU, then the IOMMU
>>>>> cannot provide isolation between devices.  This is why the PCIe spec
>>>>> encourages topologies to include ACS support.  Without it, we have to
>>>>> assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.
>>>>>
>>>>> Unfortunately, far too many topologies do not support ACS to make this
>>>>> a steadfast requirement.  Even the latest chipsets from Intel are only
>>>>> sporadically supporting ACS.  We have trouble getting interconnect
>>>>> vendors to include the PCIe spec required PCIe capability, let alone
>>>>> suggested features.
>>>>>
>>>>> Therefore, we need to add some flexibility.  The pcie_acs_override=
>>>>> boot option lets users opt-in specific devices or sets of devices to
>>>>> assume ACS support.
>>>>
>>>> "ACS support" means the device provides an ACS capability and we
>>>> can change bits in the ACS Control Register to control how things
>>>> work.
>>>>
>>>> You are really adding a way to "assume this device always routes
>>>> peer-to-peer DMA upstream," which ultimately translates into "assume
>>>> this device can be isolated from others via the IOMMU."  I think.
>>>
>>> We've heard from one vendor that they support ACS with a NULL capability
>>> structure, ie. the ACS PCIe capability exists, but reports no ACS
>>> capabilities and allows no control.  This takes advantage of the "if
>>> supported" style wording of the spec to imply that peer-to-peer is not
>>> supported because the capability is not available.  This supported but
>>> not controllable version is what we're trying to enable here.
>>
>> I was wrong to say "we can change bits in the control register."  All
>> the control register bits are actually required to be hardwired to
>> zero when the relevant functionality is not implemented.
>>
>> The example you mention (a device with an ACS capability structure
>> that reports no supported capabilities and allows no control) sounds
>> perfectly legal as a description of a device that doesn't support
>> peer-to-peer, and I hope it doesn't require any user intervention
>> (e.g., this patch) or quirks to make it work.
>
> It requires:
>
> Subject: [PATCH v2 2/2] pci: Differentiate ACS controllable from enabled
>
>> The ACS P2P Request Redirect "must be implemented by Functions that
>> support peer-to-peer traffic with other Functions."  This example
>> device doesn't support peer-to-peer traffic, so why would it implement
>> the ACS R bit?  In fact, it looks like the R bit (and all the other
>> bits) *must* be hardwired to zero in both capability and control
>> registers.
>
> Right, if they don't support peer-to-peer then hardwiring capability and
> control to zero should indicate that and is fixed by the patch
> referenced above.
>
>>>>> The "downstream" option assumes full ACS support
>>>>> on root ports and downstream switch ports.  The "multifunction"
>>>>> option assumes the subset of ACS features available on multifunction
>>>>> endpoints and upstream switch ports are supported.  The "id:nnnn:nnnn"
>>>>> option enables ACS support on devices matching the provided vendor
>>>>> and device IDs, allowing more strategic ACS overrides.  These options
>>>>> may be combined in any order.  A maximum of 16 id specific overrides
>>>>> are available.  It's suggested to use the most limited set of options
>>>>> necessary to avoid completely disabling ACS across the topology.
>>>>
>>>> Probably I'm confused by your use of "assume full ACS support,"
>>>
>>> [on root ports and downstream ports]
>>>
>>>>   but I
>>>> don't understand the bit about "completely disabling ACS."
>>>
>>> [across the topology]
>>>
>>>>    If we use
>>>> more options than necessary, it seems like we'd assume more isolation
>>>> than really exists.  That sounds like pretending ACS is *enabled* in
>>>> more places than it really is.
>>>
>>> Exactly.  I'm just trying to suggest that booting with
>>> pcie_acs_override=downstream,multifunction is not generally recommended
>>> because it effectively disables ACS checking across the topology and
>>> assumes isolation where there may be none.  In other words, if
>>> everything is overriding ACS checks, then we've disabled any benefit of
>>> doing the checking in the first place (so I really mean disable
>>> checking, not disable ACS).
>>
>> Yep, the missing "checking" is what is confusing.  Also, I think it
>> would be good to make the implication more explicit -- using this
>> option makes the kernel assume isolation where it may not actually
>> exist -- because the users of this option don't know about checking
>> anyway; that's an internal implementation detail.
>
> More explicit in Documentation/kernel-parameters.txt?
>
>>> Instead, the recommendation is to be more
>>> selective, possibly opting for just "downstream" or even better, using
>>> the specific IDs for devices which are known or suspected to not allow
>>> peer-to-peer.
>>>
>>>>> Note to hardware vendors, we have facilities to permanently quirk
>>>>> specific devices which enforce isolation but not provide an ACS
>>>>> capability.  Please contact me to have your devices added and save
>>>>> your customers the hassle of this boot option.
>>>>
>>>> Who do you expect to decide whether to use this option?  I think it
>>>> requires intimate knowledge of how the device works.
>>>>
>>>> I think the benefit of using the option is that it makes assignment of
>>>> devices to guests more flexible, which will make it attractive to users.
>>>> But most users have no way of knowing whether it's actually *safe* to
>>>> use this.  So I worry that you're adding an easy way to pretend isolation
>>>> exists when there's no good way of being confident that it actually does.
>>>>
>>>> I see the warning you added for this case; I just don't feel good about
>>>> it.  Maybe the idea is that, e.g., Red Hat will research certain devices
>>>> and recommend the option for those cases, and sign up to support that
>>>> config.  I assume you would not be willing to support its use unless
>>>> Red Hat specifically recommended it.
>>>
>>> That pretty much sums it up.  We're working with vendors to try to
>>> figure out which devices do not support ACS but don't allow
>>> peer-to-peer, but it's difficult to get decisive statements to that
>>> affect.  Legacy KVM device assignment relied on libvirt to do this ACS
>>> checking and it does a poor job of it, allowing far more devices to be
>>> assigned with ACS enforced than it really should.  It's also trivial to
>>> disable libvirt's enforcement of ACS and people do it every day w/o
>>> really thinking of the implications.  With VFIO-based device assignment
>>> we move to a model where the kernel is enforcing device isolation rather
>>> than it being at the whim of a userspace service.  Now we have not only
>>> a more complete ACS test, but no way to make it more lenient.  Devices
>>> that were previously allowed, no longer are and there's no way around it
>>> without this patch.  However, this patch gives us more control than the
>>> global disable switch in libvirt.  We can still be selective and fine
>>> tune the shape of the groups, while hopefully adhering to the isolation
>>> exposed by the hardware elsewhere.
>>>
>>> I agree that it's difficult to determine whether it's safe to override,
>>> but I don't see a way around it.  If it was obvious how to maintain
>>> isolation, we'd do it automatically.  We need better ACS support from
>>> vendors.  In the mean time, this allows people to complain to their
>>> hardware vendors and opt-in to using the device anyway.  The warning is
>>> there because if I'm debugging odd problems, I want to know that
>>> isolation may be compromised, as the warning indicates.  Thanks,
>>
>> I wonder if we should taint the kernel if this option is used (but not
>> for specific devices added to pci_dev_acs_enabled[]).  It would also
>> be nice if pci_dev_specific_acs_enabled() gave some indication in
>> dmesg for the specific devices you're hoping to add to
>> pci_dev_acs_enabled[].  It's not an enumeration-time quirk right now,
>> so I'm not sure how we'd limit it to one message per device.
>
> Right, setup vs use and getting single prints is a lot of extra code.
> Tainting is troublesome for support, Don had some objections when I
> suggested the same to him.
>
For RH GSS (Global Support Services), a 'taint' in the kernel printk means
RH doesn't support that system.  The 'non-support' due to 'taint' being printed
out in this case may be incorrect -- RH may support that use, at least until
a more sufficient patched kernel is provided.
Thus my dissension that 'taint' be output.  WARN is ok. 'driver beware',
'unleashed dog afoot'.... sure...

>> I assume that if RH knows about certain devices that are safe in this
>> respect, you'd just add them to pci_dev_acs_enabled[] up front, and
>> the command-line option is really just a workaround until you can spin
>> a new kernel that has the table updated?
>
> Yep, needing to know about this option is not user friendly.  We want
> things to "just work" whenever possible.  There are going to be cases
> where there are obscure devices, even mainstream devices, that need
> this.  Whether it's a temporary or long lived solution depends on what
> kind of answers we can get from vendors.
>
>> It might even make sense to simplify this option so it just assumes
>> *all* devices can be isolated, and get rid of the
>> "downstream/multifunction/vendor&  device ID" stuff.  That would be
>> much easier to explain, and it would make it more obvious that we're
>> really doing something pretty scary here.
>
> Seems like that makes it harder to isolate specific devices for
> promotion to pci_dev_acs_enabled[] and more likely that the user will
> turn the whole thing off for a single device and forget about isolation
> of other devices.  It's a time bomb that legacy KVM assignment and
> libvirt make this so easy to work around today, I'd like to be more
> selective for vfio in the future.
>
>> Bjorn (there is one doc comment below)
>>
>>> Alex
>>>
>>>>> Signed-off-by: Alex Williamson<alex.williamson@redhat.com>
>>>>> ---
>>>>>   Documentation/kernel-parameters.txt |   10 +++
>>>>>   drivers/pci/quirks.c                |  102 +++++++++++++++++++++++++++++++++++
>>>>>   2 files changed, 112 insertions(+)
>>>>>
>>>>> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
>>>>> index 47bb23c..a60e6ad 100644
>>>>> --- a/Documentation/kernel-parameters.txt
>>>>> +++ b/Documentation/kernel-parameters.txt
>>>>> @@ -2349,6 +2349,16 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>>>>>              nomsi   Do not use MSI for native PCIe PME signaling (this makes
>>>>>                      all PCIe root ports use INTx for all services).
>>>>>
>>>>> +   pcie_acs_override =
>>>>> +                   [PCIE] Override missing PCIe ACS support for:
>>>>> +           downstream
>>>>> +                   All downstream ports - full ACS capabilties
>>>>> +           multifunction
>>>>> +                   All multifunction devices - multifunction ACS subset
>>>>> +           id:nnnn:nnnn
>>>>> +                   Specfic device - full ACS capabilities
>>>>> +                   Specified as vid:did (vendor/device ID) in hex
>>
>> This should say something about "device isolation support," I think.
>> It's too big a leap from "missing ACS support" to expect users to make
>> it.
>
> Ok.  Thanks,
>
> Alex
>
>>>>> +
>>>>>      pcmv=           [HW,PCMCIA] BadgePAD 4
>>>>>
>>>>>      pd.             [PARIDE]
>>>>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
>>>>> index 0369fb6..c7609f6 100644
>>>>> --- a/drivers/pci/quirks.c
>>>>> +++ b/drivers/pci/quirks.c
>>>>> @@ -3292,11 +3292,113 @@ struct pci_dev *pci_get_dma_source(struct pci_dev *dev)
>>>>>      return pci_dev_get(dev);
>>>>>   }
>>>>>
>>>>> +static bool acs_on_downstream;
>>>>> +static bool acs_on_multifunction;
>>>>> +
>>>>> +#define NUM_ACS_IDS 16
>>>>> +struct acs_on_id {
>>>>> +   unsigned short vendor;
>>>>> +   unsigned short device;
>>>>> +};
>>>>> +static struct acs_on_id acs_on_ids[NUM_ACS_IDS];
>>>>> +static u8 max_acs_id;
>>>>> +
>>>>> +static __init int pcie_acs_override_setup(char *p)
>>>>> +{
>>>>> +   if (!p)
>>>>> +           return -EINVAL;
>>>>> +
>>>>> +   while (*p) {
>>>>> +           if (!strncmp(p, "downstream", 10))
>>>>> +                   acs_on_downstream = true;
>>>>> +           if (!strncmp(p, "multifunction", 13))
>>>>> +                   acs_on_multifunction = true;
>>>>> +           if (!strncmp(p, "id:", 3)) {
>>>>> +                   char opt[5];
>>>>> +                   int ret;
>>>>> +                   long val;
>>>>> +
>>>>> +                   if (max_acs_id>= NUM_ACS_IDS - 1) {
>>>>> +                           pr_warn("Out of PCIe ACS override slots (%d)\n",
>>>>> +                                   NUM_ACS_IDS);
>>>>> +                           goto next;
>>>>> +                   }
>>>>> +
>>>>> +                   p += 3;
>>>>> +                   snprintf(opt, 5, "%s", p);
>>>>> +                   ret = kstrtol(opt, 16,&val);
>>>>> +                   if (ret) {
>>>>> +                           pr_warn("PCIe ACS ID parse error %d\n", ret);
>>>>> +                           goto next;
>>>>> +                   }
>>>>> +                   acs_on_ids[max_acs_id].vendor = val;
>>>>> +
>>>>> +                   p += strcspn(p, ":");
>>>>> +                   if (*p != ':') {
>>>>> +                           pr_warn("PCIe ACS invalid ID\n");
>>>>> +                           goto next;
>>>>> +                   }
>>>>> +
>>>>> +                   p++;
>>>>> +                   snprintf(opt, 5, "%s", p);
>>>>> +                   ret = kstrtol(opt, 16,&val);
>>>>> +                   if (ret) {
>>>>> +                           pr_warn("PCIe ACS ID parse error %d\n", ret);
>>>>> +                           goto next;
>>>>> +                   }
>>>>> +                   acs_on_ids[max_acs_id].device = val;
>>>>> +                   max_acs_id++;
>>>>> +           }
>>>>> +next:
>>>>> +           p += strcspn(p, ",");
>>>>> +           if (*p == ',')
>>>>> +                   p++;
>>>>> +   }
>>>>> +
>>>>> +   if (acs_on_downstream || acs_on_multifunction || max_acs_id)
>>>>> +           pr_warn("Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA\n");
>>>>> +
>>>>> +   return 0;
>>>>> +}
>>>>> +early_param("pcie_acs_override", pcie_acs_override_setup);
>>>>> +
>>>>> +static int pcie_acs_overrides(struct pci_dev *dev, u16 acs_flags)
>>>>> +{
>>>>> +   int i;
>>>>> +
>>>>> +   /* Never override ACS for legacy devices or devices with ACS caps */
>>>>> +   if (!pci_is_pcie(dev) ||
>>>>> +       pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS))
>>>>> +           return -ENOTTY;
>>>>> +
>>>>> +   for (i = 0; i<  max_acs_id; i++)
>>>>> +           if (acs_on_ids[i].vendor == dev->vendor&&
>>>>> +               acs_on_ids[i].device == dev->device)
>>>>> +                   return 1;
>>>>> +
>>>>> +   switch (pci_pcie_type(dev)) {
>>>>> +   case PCI_EXP_TYPE_DOWNSTREAM:
>>>>> +   case PCI_EXP_TYPE_ROOT_PORT:
>>>>> +           if (acs_on_downstream)
>>>>> +                   return 1;
>>>>> +           break;
>>>>> +   case PCI_EXP_TYPE_ENDPOINT:
>>>>> +   case PCI_EXP_TYPE_UPSTREAM:
>>>>> +   case PCI_EXP_TYPE_LEG_END:
>>>>> +   case PCI_EXP_TYPE_RC_END:
>>>>> +           if (acs_on_multifunction&&  dev->multifunction)
>>>>> +                   return 1;
>>>>> +   }
>>>>> +
>>>>> +   return -ENOTTY;
>>>>> +}
>>>>> +
>>>>>   static const struct pci_dev_acs_enabled {
>>>>>      u16 vendor;
>>>>>      u16 device;
>>>>>      int (*acs_enabled)(struct pci_dev *dev, u16 acs_flags);
>>>>>   } pci_dev_acs_enabled[] = {
>>>>> +   { PCI_ANY_ID, PCI_ANY_ID, pcie_acs_overrides },
>>>>>      { 0 }
>>>>>   };
>>>>>
>>>>>
>>>
>>>
>>>
>
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas June 19, 2013, 2:52 a.m. UTC | #8
On Tue, Jun 18, 2013 at 5:03 PM, Don Dutile <ddutile@redhat.com> wrote:
> On 06/18/2013 06:22 PM, Alex Williamson wrote:
>>
>> On Tue, 2013-06-18 at 15:31 -0600, Bjorn Helgaas wrote:
>>>
>>> On Tue, Jun 18, 2013 at 12:20 PM, Alex Williamson
>>> <alex.williamson@redhat.com>  wrote:
>>>>
>>>> On Tue, 2013-06-18 at 11:28 -0600, Bjorn Helgaas wrote:
>>>>>
>>>>> On Thu, May 30, 2013 at 12:40:19PM -0600, Alex Williamson wrote:
> ...
>>>>> Who do you expect to decide whether to use this option?  I think it
>>>>> requires intimate knowledge of how the device works.
>>>>>
>>>>> I think the benefit of using the option is that it makes assignment of
>>>>> devices to guests more flexible, which will make it attractive to
>>>>> users.
>>>>> But most users have no way of knowing whether it's actually *safe* to
>>>>> use this.  So I worry that you're adding an easy way to pretend
>>>>> isolation
>>>>> exists when there's no good way of being confident that it actually
>>>>> does.

> ...

>>> I wonder if we should taint the kernel if this option is used (but not
>>> for specific devices added to pci_dev_acs_enabled[]).  It would also
>>> be nice if pci_dev_specific_acs_enabled() gave some indication in
>>> dmesg for the specific devices you're hoping to add to
>>> pci_dev_acs_enabled[].  It's not an enumeration-time quirk right now,
>>> so I'm not sure how we'd limit it to one message per device.
>>
>> Right, setup vs use and getting single prints is a lot of extra code.
>> Tainting is troublesome for support, Don had some objections when I
>> suggested the same to him.
>>
> For RH GSS (Global Support Services), a 'taint' in the kernel printk means
> RH doesn't support that system.  The 'non-support' due to 'taint' being
> printed
> out in this case may be incorrect -- RH may support that use, at least until
> a more sufficient patched kernel is provided.
> Thus my dissension that 'taint' be output.  WARN is ok. 'driver beware',
> 'unleashed dog afoot'.... sure...

So ...  that's really a RH-specific support issue, and easily worked
around by RH adding a patch that turns off tainting.

It still sounds like a good idea to me for upstream, where use of this
option can very possibly lead to corruption or information leakage
between devices the user claimed were isolated, but in fact were not.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Donald Dutile June 19, 2013, 12:43 p.m. UTC | #9
On 06/18/2013 10:52 PM, Bjorn Helgaas wrote:
> On Tue, Jun 18, 2013 at 5:03 PM, Don Dutile<ddutile@redhat.com>  wrote:
>> On 06/18/2013 06:22 PM, Alex Williamson wrote:
>>>
>>> On Tue, 2013-06-18 at 15:31 -0600, Bjorn Helgaas wrote:
>>>>
>>>> On Tue, Jun 18, 2013 at 12:20 PM, Alex Williamson
>>>> <alex.williamson@redhat.com>   wrote:
>>>>>
>>>>> On Tue, 2013-06-18 at 11:28 -0600, Bjorn Helgaas wrote:
>>>>>>
>>>>>> On Thu, May 30, 2013 at 12:40:19PM -0600, Alex Williamson wrote:
>> ...
>>>>>> Who do you expect to decide whether to use this option?  I think it
>>>>>> requires intimate knowledge of how the device works.
>>>>>>
>>>>>> I think the benefit of using the option is that it makes assignment of
>>>>>> devices to guests more flexible, which will make it attractive to
>>>>>> users.
>>>>>> But most users have no way of knowing whether it's actually *safe* to
>>>>>> use this.  So I worry that you're adding an easy way to pretend
>>>>>> isolation
>>>>>> exists when there's no good way of being confident that it actually
>>>>>> does.
>
>> ...
>
>>>> I wonder if we should taint the kernel if this option is used (but not
>>>> for specific devices added to pci_dev_acs_enabled[]).  It would also
>>>> be nice if pci_dev_specific_acs_enabled() gave some indication in
>>>> dmesg for the specific devices you're hoping to add to
>>>> pci_dev_acs_enabled[].  It's not an enumeration-time quirk right now,
>>>> so I'm not sure how we'd limit it to one message per device.
>>>
>>> Right, setup vs use and getting single prints is a lot of extra code.
>>> Tainting is troublesome for support, Don had some objections when I
>>> suggested the same to him.
>>>
>> For RH GSS (Global Support Services), a 'taint' in the kernel printk means
>> RH doesn't support that system.  The 'non-support' due to 'taint' being
>> printed
>> out in this case may be incorrect -- RH may support that use, at least until
>> a more sufficient patched kernel is provided.
>> Thus my dissension that 'taint' be output.  WARN is ok. 'driver beware',
>> 'unleashed dog afoot'.... sure...
>
> So ...  that's really a RH-specific support issue, and easily worked
> around by RH adding a patch that turns off tainting.
>
sure. what's another patch to the thousands... :-/

> It still sounds like a good idea to me for upstream, where use of this
> option can very possibly lead to corruption or information leakage
> between devices the user claimed were isolated, but in fact were not.
>
> Bjorn
Did I miss something?  This patch provides a user-level/chosen override;
like all other overrides, (pci=realloc, etc.), it can lead to a failing system.
IMO, this patch is no different.  If you want to tag this patch with taint,
then let's audit all the (PCI) overrides and taint them appropriately.
Taint should be reserved to changes to the kernel that were done outside
the development of the kernel, or with the explicit intent to circumvent
the normal operation of the kernel.  This patch provides a way to enable
ACS checking to succeed when the devices have not provided sufficiently complete
ACS information.  i.e., it's a growth path for PCIe-ACS and its need for
proper support.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas June 24, 2013, 5:43 p.m. UTC | #10
On Wed, Jun 19, 2013 at 6:43 AM, Don Dutile <ddutile@redhat.com> wrote:
> On 06/18/2013 10:52 PM, Bjorn Helgaas wrote:
>>
>> On Tue, Jun 18, 2013 at 5:03 PM, Don Dutile<ddutile@redhat.com>  wrote:
>>>
>>> On 06/18/2013 06:22 PM, Alex Williamson wrote:
>>>>
>>>>
>>>> On Tue, 2013-06-18 at 15:31 -0600, Bjorn Helgaas wrote:
>>>>>
>>>>>
>>>>> On Tue, Jun 18, 2013 at 12:20 PM, Alex Williamson
>>>>> <alex.williamson@redhat.com>   wrote:
>>>>>>
>>>>>>
>>>>>> On Tue, 2013-06-18 at 11:28 -0600, Bjorn Helgaas wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Thu, May 30, 2013 at 12:40:19PM -0600, Alex Williamson wrote:
>>>
>>> ...
>>>>>>>
>>>>>>> Who do you expect to decide whether to use this option?  I think it
>>>>>>> requires intimate knowledge of how the device works.
>>>>>>>
>>>>>>> I think the benefit of using the option is that it makes assignment
>>>>>>> of
>>>>>>> devices to guests more flexible, which will make it attractive to
>>>>>>> users.
>>>>>>> But most users have no way of knowing whether it's actually *safe* to
>>>>>>> use this.  So I worry that you're adding an easy way to pretend
>>>>>>> isolation
>>>>>>> exists when there's no good way of being confident that it actually
>>>>>>> does.
>>
>>
>>> ...
>>
>>
>>>>> I wonder if we should taint the kernel if this option is used (but not
>>>>> for specific devices added to pci_dev_acs_enabled[]).  It would also
>>>>> be nice if pci_dev_specific_acs_enabled() gave some indication in
>>>>> dmesg for the specific devices you're hoping to add to
>>>>> pci_dev_acs_enabled[].  It's not an enumeration-time quirk right now,
>>>>> so I'm not sure how we'd limit it to one message per device.
>>>>
>>>>
>>>> Right, setup vs use and getting single prints is a lot of extra code.
>>>> Tainting is troublesome for support, Don had some objections when I
>>>> suggested the same to him.
>>>>
>>> For RH GSS (Global Support Services), a 'taint' in the kernel printk
>>> means
>>> RH doesn't support that system.  The 'non-support' due to 'taint' being
>>> printed
>>> out in this case may be incorrect -- RH may support that use, at least
>>> until
>>> a more sufficient patched kernel is provided.
>>> Thus my dissension that 'taint' be output.  WARN is ok. 'driver beware',
>>> 'unleashed dog afoot'.... sure...
>>
>>
>> So ...  that's really a RH-specific support issue, and easily worked
>> around by RH adding a patch that turns off tainting.
>>
> sure. what's another patch to the thousands... :-/
>
>> It still sounds like a good idea to me for upstream, where use of this
>> option can very possibly lead to corruption or information leakage
>> between devices the user claimed were isolated, but in fact were not.
>
> Did I miss something?  This patch provides a user-level/chosen override;
> like all other overrides, (pci=realloc, etc.), it can lead to a failing
> system.
> IMO, this patch is no different.  If you want to tag this patch with taint,
> then let's audit all the (PCI) overrides and taint them appropriately.
> Taint should be reserved to changes to the kernel that were done outside
> the development of the kernel, or with the explicit intent to circumvent
> the normal operation of the kernel.  This patch provides a way to enable
> ACS checking to succeed when the devices have not provided sufficiently
> complete
> ACS information.  i.e., it's a growth path for PCIe-ACS and its need for
> proper support.

We're telling the kernel to assume something (the hardware provides
protection) that may not be true.  If that assumption turns out to be
false, the result is that a VM can be crashed or comprised by another
VM.

One difference I see is that this override can lead to a crash that
looks like random memory corruption and has no apparent connection to
the actual cause.  Most other overrides won't cause run-time crashes
(I think they're more likely to cause boot or device configuration
failures), and the dmesg log will probably have good clues as to the
reason.

But the possibility of compromise is probably even more serious,
because there would be no crash at all, and we'd have no indication
that VM A read or corrupted data in VM B.  I'm very concerned about
that, enough so that it's not clear to me that an override belongs in
the upstream kernel at all.

Yes, that would mean some hardware is not suitable for device
assignment.  That just sounds like "if hardware manufacturers do their
homework and support ACS properly, their hardware is more useful for
virtualization than other hardware."  I don't see the problem with
that.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Williamson June 26, 2013, 7:03 p.m. UTC | #11
On Mon, 2013-06-24 at 11:43 -0600, Bjorn Helgaas wrote:
> On Wed, Jun 19, 2013 at 6:43 AM, Don Dutile <ddutile@redhat.com> wrote:
> > On 06/18/2013 10:52 PM, Bjorn Helgaas wrote:
> >>
> >> On Tue, Jun 18, 2013 at 5:03 PM, Don Dutile<ddutile@redhat.com>  wrote:
> >>>
> >>> On 06/18/2013 06:22 PM, Alex Williamson wrote:
> >>>>
> >>>>
> >>>> On Tue, 2013-06-18 at 15:31 -0600, Bjorn Helgaas wrote:
> >>>>>
> >>>>>
> >>>>> On Tue, Jun 18, 2013 at 12:20 PM, Alex Williamson
> >>>>> <alex.williamson@redhat.com>   wrote:
> >>>>>>
> >>>>>>
> >>>>>> On Tue, 2013-06-18 at 11:28 -0600, Bjorn Helgaas wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, May 30, 2013 at 12:40:19PM -0600, Alex Williamson wrote:
> >>>
> >>> ...
> >>>>>>>
> >>>>>>> Who do you expect to decide whether to use this option?  I think it
> >>>>>>> requires intimate knowledge of how the device works.
> >>>>>>>
> >>>>>>> I think the benefit of using the option is that it makes assignment
> >>>>>>> of
> >>>>>>> devices to guests more flexible, which will make it attractive to
> >>>>>>> users.
> >>>>>>> But most users have no way of knowing whether it's actually *safe* to
> >>>>>>> use this.  So I worry that you're adding an easy way to pretend
> >>>>>>> isolation
> >>>>>>> exists when there's no good way of being confident that it actually
> >>>>>>> does.
> >>
> >>
> >>> ...
> >>
> >>
> >>>>> I wonder if we should taint the kernel if this option is used (but not
> >>>>> for specific devices added to pci_dev_acs_enabled[]).  It would also
> >>>>> be nice if pci_dev_specific_acs_enabled() gave some indication in
> >>>>> dmesg for the specific devices you're hoping to add to
> >>>>> pci_dev_acs_enabled[].  It's not an enumeration-time quirk right now,
> >>>>> so I'm not sure how we'd limit it to one message per device.
> >>>>
> >>>>
> >>>> Right, setup vs use and getting single prints is a lot of extra code.
> >>>> Tainting is troublesome for support, Don had some objections when I
> >>>> suggested the same to him.
> >>>>
> >>> For RH GSS (Global Support Services), a 'taint' in the kernel printk
> >>> means
> >>> RH doesn't support that system.  The 'non-support' due to 'taint' being
> >>> printed
> >>> out in this case may be incorrect -- RH may support that use, at least
> >>> until
> >>> a more sufficient patched kernel is provided.
> >>> Thus my dissension that 'taint' be output.  WARN is ok. 'driver beware',
> >>> 'unleashed dog afoot'.... sure...
> >>
> >>
> >> So ...  that's really a RH-specific support issue, and easily worked
> >> around by RH adding a patch that turns off tainting.
> >>
> > sure. what's another patch to the thousands... :-/
> >
> >> It still sounds like a good idea to me for upstream, where use of this
> >> option can very possibly lead to corruption or information leakage
> >> between devices the user claimed were isolated, but in fact were not.
> >
> > Did I miss something?  This patch provides a user-level/chosen override;
> > like all other overrides, (pci=realloc, etc.), it can lead to a failing
> > system.
> > IMO, this patch is no different.  If you want to tag this patch with taint,
> > then let's audit all the (PCI) overrides and taint them appropriately.
> > Taint should be reserved to changes to the kernel that were done outside
> > the development of the kernel, or with the explicit intent to circumvent
> > the normal operation of the kernel.  This patch provides a way to enable
> > ACS checking to succeed when the devices have not provided sufficiently
> > complete
> > ACS information.  i.e., it's a growth path for PCIe-ACS and its need for
> > proper support.
> 
> We're telling the kernel to assume something (the hardware provides
> protection) that may not be true.  If that assumption turns out to be
> false, the result is that a VM can be crashed or comprised by another
> VM.
> 
> One difference I see is that this override can lead to a crash that
> looks like random memory corruption and has no apparent connection to
> the actual cause.  Most other overrides won't cause run-time crashes
> (I think they're more likely to cause boot or device configuration
> failures), and the dmesg log will probably have good clues as to the
> reason.
> 
> But the possibility of compromise is probably even more serious,
> because there would be no crash at all, and we'd have no indication
> that VM A read or corrupted data in VM B.  I'm very concerned about
> that, enough so that it's not clear to me that an override belongs in
> the upstream kernel at all.
> 
> Yes, that would mean some hardware is not suitable for device
> assignment.  That just sounds like "if hardware manufacturers do their
> homework and support ACS properly, their hardware is more useful for
> virtualization than other hardware."  I don't see the problem with
> that.

That's easy to say for someone that doesn't get caught trying to explain
this to users over and over.  In many cases devices don't do
peer-to-peer and missing ACS is an oversight.  I imagine that quite a
few vendors also see the ACS capability as a means to allow control of
ACS and therefore see it as a much larger investment that just providing
an empty ACS structure in config space to indicate the lack of
peer-to-peer.

Even if we taint the kernel when this is enabled and add extremely
verbose warnings in kernel-parameters.txt, I think there's value to
providing an on-the-spot workaround to users.  In many cases we're not
going to get a response from vendors.  Removing the
downstream/multifunction catch-alls might be another way to raise the
bar for this kind of override.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Donald Dutile July 23, 2013, 6:38 p.m. UTC | #12
On 06/26/2013 03:03 PM, Alex Williamson wrote:
> On Mon, 2013-06-24 at 11:43 -0600, Bjorn Helgaas wrote:
>> On Wed, Jun 19, 2013 at 6:43 AM, Don Dutile<ddutile@redhat.com>  wrote:
>>> On 06/18/2013 10:52 PM, Bjorn Helgaas wrote:
>>>>
>>>> On Tue, Jun 18, 2013 at 5:03 PM, Don Dutile<ddutile@redhat.com>   wrote:
>>>>>
>>>>> On 06/18/2013 06:22 PM, Alex Williamson wrote:
>>>>>>
>>>>>>
>>>>>> On Tue, 2013-06-18 at 15:31 -0600, Bjorn Helgaas wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 18, 2013 at 12:20 PM, Alex Williamson
>>>>>>> <alex.williamson@redhat.com>    wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, 2013-06-18 at 11:28 -0600, Bjorn Helgaas wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, May 30, 2013 at 12:40:19PM -0600, Alex Williamson wrote:
>>>>>
>>>>> ...
>>>>>>>>>
>>>>>>>>> Who do you expect to decide whether to use this option?  I think it
>>>>>>>>> requires intimate knowledge of how the device works.
>>>>>>>>>
>>>>>>>>> I think the benefit of using the option is that it makes assignment
>>>>>>>>> of
>>>>>>>>> devices to guests more flexible, which will make it attractive to
>>>>>>>>> users.
>>>>>>>>> But most users have no way of knowing whether it's actually *safe* to
>>>>>>>>> use this.  So I worry that you're adding an easy way to pretend
>>>>>>>>> isolation
>>>>>>>>> exists when there's no good way of being confident that it actually
>>>>>>>>> does.
>>>>
>>>>
>>>>> ...
>>>>
>>>>
>>>>>>> I wonder if we should taint the kernel if this option is used (but not
>>>>>>> for specific devices added to pci_dev_acs_enabled[]).  It would also
>>>>>>> be nice if pci_dev_specific_acs_enabled() gave some indication in
>>>>>>> dmesg for the specific devices you're hoping to add to
>>>>>>> pci_dev_acs_enabled[].  It's not an enumeration-time quirk right now,
>>>>>>> so I'm not sure how we'd limit it to one message per device.
>>>>>>
>>>>>>
>>>>>> Right, setup vs use and getting single prints is a lot of extra code.
>>>>>> Tainting is troublesome for support, Don had some objections when I
>>>>>> suggested the same to him.
>>>>>>
>>>>> For RH GSS (Global Support Services), a 'taint' in the kernel printk
>>>>> means
>>>>> RH doesn't support that system.  The 'non-support' due to 'taint' being
>>>>> printed
>>>>> out in this case may be incorrect -- RH may support that use, at least
>>>>> until
>>>>> a more sufficient patched kernel is provided.
>>>>> Thus my dissension that 'taint' be output.  WARN is ok. 'driver beware',
>>>>> 'unleashed dog afoot'.... sure...
>>>>
>>>>
>>>> So ...  that's really a RH-specific support issue, and easily worked
>>>> around by RH adding a patch that turns off tainting.
>>>>
>>> sure. what's another patch to the thousands... :-/
>>>
>>>> It still sounds like a good idea to me for upstream, where use of this
>>>> option can very possibly lead to corruption or information leakage
>>>> between devices the user claimed were isolated, but in fact were not.
>>>
>>> Did I miss something?  This patch provides a user-level/chosen override;
>>> like all other overrides, (pci=realloc, etc.), it can lead to a failing
>>> system.
>>> IMO, this patch is no different.  If you want to tag this patch with taint,
>>> then let's audit all the (PCI) overrides and taint them appropriately.
>>> Taint should be reserved to changes to the kernel that were done outside
>>> the development of the kernel, or with the explicit intent to circumvent
>>> the normal operation of the kernel.  This patch provides a way to enable
>>> ACS checking to succeed when the devices have not provided sufficiently
>>> complete
>>> ACS information.  i.e., it's a growth path for PCIe-ACS and its need for
>>> proper support.
>>
>> We're telling the kernel to assume something (the hardware provides
>> protection) that may not be true.  If that assumption turns out to be
>> false, the result is that a VM can be crashed or comprised by another
>> VM.
>>
>> One difference I see is that this override can lead to a crash that
>> looks like random memory corruption and has no apparent connection to
>> the actual cause.  Most other overrides won't cause run-time crashes
>> (I think they're more likely to cause boot or device configuration
>> failures), and the dmesg log will probably have good clues as to the
>> reason.
>>
>> But the possibility of compromise is probably even more serious,
>> because there would be no crash at all, and we'd have no indication
>> that VM A read or corrupted data in VM B.  I'm very concerned about
>> that, enough so that it's not clear to me that an override belongs in
>> the upstream kernel at all.
>>
>> Yes, that would mean some hardware is not suitable for device
>> assignment.  That just sounds like "if hardware manufacturers do their
>> homework and support ACS properly, their hardware is more useful for
>> virtualization than other hardware."  I don't see the problem with
>> that.
>
> That's easy to say for someone that doesn't get caught trying to explain
> this to users over and over.  In many cases devices don't do
> peer-to-peer and missing ACS is an oversight.  I imagine that quite a
> few vendors also see the ACS capability as a means to allow control of
> ACS and therefore see it as a much larger investment that just providing
> an empty ACS structure in config space to indicate the lack of
> peer-to-peer.
>
+1 -- A good explanation why the workaround is needed.
In fact, until recently, we didn't understand that an empty ACS meant
'no p2p' and not just an empty ACS with no control to disable p2p. :-/

> Even if we taint the kernel when this is enabled and add extremely
> verbose warnings in kernel-parameters.txt, I think there's value to
> providing an on-the-spot workaround to users.  In many cases we're not
> going to get a response from vendors.  Removing the
> downstream/multifunction catch-alls might be another way to raise the
> bar for this kind of override.  Thanks,
I'll re-state what I said previously (maybe privately): there are numerous
ways to crash a system with kernel params.  Agreed, that in most instances,
the params cause the system not to boot at all if they don't work/help, but
arguing to taint the kernel in this case is like asking rm to detect 'rm -rf /'
and flag the user you're trashing your system .... and not temporarily, or in
a recoverable way.

>
> Alex
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 47bb23c..a60e6ad 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2349,6 +2349,16 @@  bytes respectively. Such letter suffixes can also be entirely omitted.
 		nomsi	Do not use MSI for native PCIe PME signaling (this makes
 			all PCIe root ports use INTx for all services).
 
+	pcie_acs_override =
+			[PCIE] Override missing PCIe ACS support for:
+		downstream
+			All downstream ports - full ACS capabilties
+		multifunction
+			All multifunction devices - multifunction ACS subset
+		id:nnnn:nnnn
+			Specfic device - full ACS capabilities
+			Specified as vid:did (vendor/device ID) in hex
+
 	pcmv=		[HW,PCMCIA] BadgePAD 4
 
 	pd.		[PARIDE]
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 0369fb6..c7609f6 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3292,11 +3292,113 @@  struct pci_dev *pci_get_dma_source(struct pci_dev *dev)
 	return pci_dev_get(dev);
 }
 
+static bool acs_on_downstream;
+static bool acs_on_multifunction;
+
+#define NUM_ACS_IDS 16
+struct acs_on_id {
+	unsigned short vendor;
+	unsigned short device;
+};
+static struct acs_on_id acs_on_ids[NUM_ACS_IDS];
+static u8 max_acs_id;
+
+static __init int pcie_acs_override_setup(char *p)
+{
+	if (!p)
+		return -EINVAL;
+
+	while (*p) {
+		if (!strncmp(p, "downstream", 10))
+			acs_on_downstream = true;
+		if (!strncmp(p, "multifunction", 13))
+			acs_on_multifunction = true;
+		if (!strncmp(p, "id:", 3)) {
+			char opt[5];
+			int ret;
+			long val;
+
+			if (max_acs_id >= NUM_ACS_IDS - 1) {
+				pr_warn("Out of PCIe ACS override slots (%d)\n",
+					NUM_ACS_IDS);
+				goto next;
+			}
+
+			p += 3;
+			snprintf(opt, 5, "%s", p);
+			ret = kstrtol(opt, 16, &val);
+			if (ret) {
+				pr_warn("PCIe ACS ID parse error %d\n", ret);
+				goto next;
+			}
+			acs_on_ids[max_acs_id].vendor = val;
+
+			p += strcspn(p, ":");
+			if (*p != ':') {
+				pr_warn("PCIe ACS invalid ID\n");
+				goto next;
+			}
+
+			p++;
+			snprintf(opt, 5, "%s", p);
+			ret = kstrtol(opt, 16, &val);
+			if (ret) {
+				pr_warn("PCIe ACS ID parse error %d\n", ret);
+				goto next;
+			}
+			acs_on_ids[max_acs_id].device = val;
+			max_acs_id++;
+		}
+next:
+		p += strcspn(p, ",");
+		if (*p == ',')
+			p++;
+	}
+
+	if (acs_on_downstream || acs_on_multifunction || max_acs_id)
+		pr_warn("Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA\n");
+
+	return 0;
+}
+early_param("pcie_acs_override", pcie_acs_override_setup);
+
+static int pcie_acs_overrides(struct pci_dev *dev, u16 acs_flags)
+{
+	int i;
+
+	/* Never override ACS for legacy devices or devices with ACS caps */
+	if (!pci_is_pcie(dev) ||
+	    pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS))
+		return -ENOTTY;
+
+	for (i = 0; i < max_acs_id; i++)
+		if (acs_on_ids[i].vendor == dev->vendor &&
+		    acs_on_ids[i].device == dev->device)
+			return 1;
+
+	switch (pci_pcie_type(dev)) {
+	case PCI_EXP_TYPE_DOWNSTREAM:
+	case PCI_EXP_TYPE_ROOT_PORT:
+		if (acs_on_downstream)
+			return 1;
+		break;
+	case PCI_EXP_TYPE_ENDPOINT:
+	case PCI_EXP_TYPE_UPSTREAM:
+	case PCI_EXP_TYPE_LEG_END:
+	case PCI_EXP_TYPE_RC_END:
+		if (acs_on_multifunction && dev->multifunction)
+			return 1;
+	}
+
+	return -ENOTTY;
+}
+
 static const struct pci_dev_acs_enabled {
 	u16 vendor;
 	u16 device;
 	int (*acs_enabled)(struct pci_dev *dev, u16 acs_flags);
 } pci_dev_acs_enabled[] = {
+	{ PCI_ANY_ID, PCI_ANY_ID, pcie_acs_overrides },
 	{ 0 }
 };