diff mbox series

[v3,3/7] PCI: OF: Allow endpoints to bypass the iommu

Message ID 20181012145917.6840-4-jean-philippe.brucker@arm.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show
Series Add virtio-iommu driver | expand

Commit Message

Jean-Philippe Brucker Oct. 12, 2018, 2:59 p.m. UTC
Using the iommu-map binding, endpoints in a given PCI domain can be
managed by different IOMMUs. Some virtual machines may allow a subset of
endpoints to bypass the IOMMU. In some case the IOMMU itself is presented
as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). Currently, when a
PCI root complex has an iommu-map property, the driver requires all
endpoints to be described by the property. Allow the iommu-map property to
have gaps.

Relaxing of_pci_map_rid also allows the msi-map property to have gaps,
which is invalid since MSIs always reach an MSI controller. Thankfully
Linux will error out later, when attempting to find an MSI domain for the
device.

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 drivers/pci/of.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Comments

Bjorn Helgaas Oct. 12, 2018, 7:41 p.m. UTC | #1
s/iommu/IOMMU/ in subject

On Fri, Oct 12, 2018 at 03:59:13PM +0100, Jean-Philippe Brucker wrote:
> Using the iommu-map binding, endpoints in a given PCI domain can be
> managed by different IOMMUs. Some virtual machines may allow a subset of
> endpoints to bypass the IOMMU. In some case the IOMMU itself is presented

s/case/cases/

> as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). Currently, when a
> PCI root complex has an iommu-map property, the driver requires all
> endpoints to be described by the property. Allow the iommu-map property to
> have gaps.

I'm not an IOMMU or virtio expert, so it's not obvious to me why it is
safe to allow devices to bypass the IOMMU.  Does this mean a typo in
iommu-map could inadvertently allow devices to bypass it?  Should we
indicate something in dmesg (and/or sysfs) about devices that bypass
it?

> Relaxing of_pci_map_rid also allows the msi-map property to have gaps,

s/of_pci_map_rid/of_pci_map_rid()/

> which is invalid since MSIs always reach an MSI controller. Thankfully
> Linux will error out later, when attempting to find an MSI domain for the
> device.

Not clear to me what "error out" means here.  In a userspace program,
I would infer that the program exits with an error message, but I
doubt you mean that Linux exits.

> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> ---
>  drivers/pci/of.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> index 1836b8ddf292..2f5015bdb256 100644
> --- a/drivers/pci/of.c
> +++ b/drivers/pci/of.c
> @@ -451,9 +451,10 @@ int of_pci_map_rid(struct device_node *np, u32 rid,
>  		return 0;
>  	}
>  
> -	pr_err("%pOF: Invalid %s translation - no match for rid 0x%x on %pOF\n",
> -		np, map_name, rid, target && *target ? *target : NULL);
> -	return -EFAULT;
> +	/* Bypasses translation */
> +	if (id_out)
> +		*id_out = rid;
> +	return 0;
>  }
>  
>  #if IS_ENABLED(CONFIG_OF_IRQ)
> -- 
> 2.19.1
>
Michael S. Tsirkin Oct. 15, 2018, 10:52 a.m. UTC | #2
On Fri, Oct 12, 2018 at 02:41:59PM -0500, Bjorn Helgaas wrote:
> s/iommu/IOMMU/ in subject
> 
> On Fri, Oct 12, 2018 at 03:59:13PM +0100, Jean-Philippe Brucker wrote:
> > Using the iommu-map binding, endpoints in a given PCI domain can be
> > managed by different IOMMUs. Some virtual machines may allow a subset of
> > endpoints to bypass the IOMMU. In some case the IOMMU itself is presented
> 
> s/case/cases/
> 
> > as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). Currently, when a
> > PCI root complex has an iommu-map property, the driver requires all
> > endpoints to be described by the property. Allow the iommu-map property to
> > have gaps.
> 
> I'm not an IOMMU or virtio expert, so it's not obvious to me why it is
> safe to allow devices to bypass the IOMMU.  Does this mean a typo in
> iommu-map could inadvertently allow devices to bypass it?


Thinking about this comment, I would like to ask: can't the
virtio device indicate the ranges in a portable way?
This would minimize the dependency on dt bindings and ACPI,
enabling support for systems that have neither but do
have virtio e.g. through pci.

>  Should we
> indicate something in dmesg (and/or sysfs) about devices that bypass
> it?
> 
> > Relaxing of_pci_map_rid also allows the msi-map property to have gaps,
> 
> s/of_pci_map_rid/of_pci_map_rid()/
> 
> > which is invalid since MSIs always reach an MSI controller. Thankfully
> > Linux will error out later, when attempting to find an MSI domain for the
> > device.
> 
> Not clear to me what "error out" means here.  In a userspace program,
> I would infer that the program exits with an error message, but I
> doubt you mean that Linux exits.
> 
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> > ---
> >  drivers/pci/of.c | 7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> > index 1836b8ddf292..2f5015bdb256 100644
> > --- a/drivers/pci/of.c
> > +++ b/drivers/pci/of.c
> > @@ -451,9 +451,10 @@ int of_pci_map_rid(struct device_node *np, u32 rid,
> >  		return 0;
> >  	}
> >  
> > -	pr_err("%pOF: Invalid %s translation - no match for rid 0x%x on %pOF\n",
> > -		np, map_name, rid, target && *target ? *target : NULL);
> > -	return -EFAULT;
> > +	/* Bypasses translation */
> > +	if (id_out)
> > +		*id_out = rid;
> > +	return 0;
> >  }
> >  
> >  #if IS_ENABLED(CONFIG_OF_IRQ)
> > -- 
> > 2.19.1
> >
Robin Murphy Oct. 15, 2018, 11:32 a.m. UTC | #3
On 12/10/18 20:41, Bjorn Helgaas wrote:
> s/iommu/IOMMU/ in subject
> 
> On Fri, Oct 12, 2018 at 03:59:13PM +0100, Jean-Philippe Brucker wrote:
>> Using the iommu-map binding, endpoints in a given PCI domain can be
>> managed by different IOMMUs. Some virtual machines may allow a subset of
>> endpoints to bypass the IOMMU. In some case the IOMMU itself is presented
> 
> s/case/cases/
> 
>> as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). Currently, when a
>> PCI root complex has an iommu-map property, the driver requires all
>> endpoints to be described by the property. Allow the iommu-map property to
>> have gaps.
> 
> I'm not an IOMMU or virtio expert, so it's not obvious to me why it is
> safe to allow devices to bypass the IOMMU.  Does this mean a typo in
> iommu-map could inadvertently allow devices to bypass it?  Should we
> indicate something in dmesg (and/or sysfs) about devices that bypass
> it?

It's not really "allow devices to bypass the IOMMU" so much as "allow DT 
to describe devices which the IOMMU doesn't translate". It's a bit of an 
edge case for not-really-PCI devices, but FWIW I can certainly think of 
several ways to build real hardware like that. As for inadvertent errors 
leaving out IDs which *should* be in the map, that really depends on the 
IOMMU/driver implementation - e.g. SMMUv2 with arm-smmu.disable_bypass=0 
would treat the device as untranslated, whereas SMMUv3 would always 
generate a fault upon any transaction due to no valid stream table entry 
being programmed (not even a bypass one).

I reckon it's a sufficiently unusual case that keeping some sort of 
message probably is worthwhile (at pr_info rather than pr_err) in case 
someone does hit it by mistake.

>> Relaxing of_pci_map_rid also allows the msi-map property to have gaps,

At worst, I suppose we could always add yet another parameter for each 
caller to choose whether a missing entry is considered an error or not.

Robin.

> s/of_pci_map_rid/of_pci_map_rid()/
> 
>> which is invalid since MSIs always reach an MSI controller. Thankfully
>> Linux will error out later, when attempting to find an MSI domain for the
>> device.
> 
> Not clear to me what "error out" means here.  In a userspace program,
> I would infer that the program exits with an error message, but I
> doubt you mean that Linux exits.
> 
>> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>> ---
>>   drivers/pci/of.c | 7 ++++---
>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/pci/of.c b/drivers/pci/of.c
>> index 1836b8ddf292..2f5015bdb256 100644
>> --- a/drivers/pci/of.c
>> +++ b/drivers/pci/of.c
>> @@ -451,9 +451,10 @@ int of_pci_map_rid(struct device_node *np, u32 rid,
>>   		return 0;
>>   	}
>>   
>> -	pr_err("%pOF: Invalid %s translation - no match for rid 0x%x on %pOF\n",
>> -		np, map_name, rid, target && *target ? *target : NULL);
>> -	return -EFAULT;
>> +	/* Bypasses translation */
>> +	if (id_out)
>> +		*id_out = rid;
>> +	return 0;
>>   }
>>   
>>   #if IS_ENABLED(CONFIG_OF_IRQ)
>> -- 
>> 2.19.1
>>
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>
Jean-philippe Brucker Oct. 15, 2018, 7:45 p.m. UTC | #4
[Replying with my personal address because we're having SMTP issues]

On 12/10/2018 20:41, Bjorn Helgaas wrote:
> s/iommu/IOMMU/ in subject
> 
> On Fri, Oct 12, 2018 at 03:59:13PM +0100, Jean-Philippe Brucker wrote:
>> Using the iommu-map binding, endpoints in a given PCI domain can be
>> managed by different IOMMUs. Some virtual machines may allow a subset of
>> endpoints to bypass the IOMMU. In some case the IOMMU itself is presented
> 
> s/case/cases/
> 
>> as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). Currently, when a
>> PCI root complex has an iommu-map property, the driver requires all
>> endpoints to be described by the property. Allow the iommu-map property to
>> have gaps.
> 
> I'm not an IOMMU or virtio expert, so it's not obvious to me why it is
> safe to allow devices to bypass the IOMMU.  Does this mean a typo in
> iommu-map could inadvertently allow devices to bypass it?

As Robin said, a device that is absent from iommu-map will be ignored by
the IOMMU layer, so it depends on the specific IOMMU implementation and
driver. By default the SMMU and virtio-iommu drivers disable bypass, but
I'm not sure about the others. I'll try to find a more accurate title for
this patch

>  Should we
> indicate something in dmesg (and/or sysfs) about devices that bypass
> it?

Good idea, I'll replace the pr_err() below with a pr_info(), instead of
simply removing it.

>> Relaxing of_pci_map_rid also allows the msi-map property to have gaps,
> 
> s/of_pci_map_rid/of_pci_map_rid()/
> 
>> which is invalid since MSIs always reach an MSI controller. Thankfully
>> Linux will error out later, when attempting to find an MSI domain for the
>> device.
> 
> Not clear to me what "error out" means here.  In a userspace program,
> I would infer that the program exits with an error message, but I
> doubt you mean that Linux exits.

Right, I'll clarify this. It's pci_msi_setup_msi_irqs() that returns an
error if the device is missing from msi-map, so the device driver won't be
able to request MSIs.

Thanks,
Jean

>> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>> ---
>>  drivers/pci/of.c | 7 ++++---
>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/pci/of.c b/drivers/pci/of.c
>> index 1836b8ddf292..2f5015bdb256 100644
>> --- a/drivers/pci/of.c
>> +++ b/drivers/pci/of.c
>> @@ -451,9 +451,10 @@ int of_pci_map_rid(struct device_node *np, u32 rid,
>>  		return 0;
>>  	}
>>  
>> -	pr_err("%pOF: Invalid %s translation - no match for rid 0x%x on %pOF\n",
>> -		np, map_name, rid, target && *target ? *target : NULL);
>> -	return -EFAULT;
>> +	/* Bypasses translation */
>> +	if (id_out)
>> +		*id_out = rid;
>> +	return 0;
>>  }
>>  
>>  #if IS_ENABLED(CONFIG_OF_IRQ)
>> -- 
>> 2.19.1
>>
>
Jean-philippe Brucker Oct. 15, 2018, 7:46 p.m. UTC | #5
[Replying with my personal address because we're having SMTP issues]

On 15/10/2018 11:52, Michael S. Tsirkin wrote:
> On Fri, Oct 12, 2018 at 02:41:59PM -0500, Bjorn Helgaas wrote:
>> s/iommu/IOMMU/ in subject
>>
>> On Fri, Oct 12, 2018 at 03:59:13PM +0100, Jean-Philippe Brucker wrote:
>>> Using the iommu-map binding, endpoints in a given PCI domain can be
>>> managed by different IOMMUs. Some virtual machines may allow a subset of
>>> endpoints to bypass the IOMMU. In some case the IOMMU itself is presented
>>
>> s/case/cases/
>>
>>> as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). Currently, when a
>>> PCI root complex has an iommu-map property, the driver requires all
>>> endpoints to be described by the property. Allow the iommu-map property to
>>> have gaps.
>>
>> I'm not an IOMMU or virtio expert, so it's not obvious to me why it is
>> safe to allow devices to bypass the IOMMU.  Does this mean a typo in
>> iommu-map could inadvertently allow devices to bypass it?
> 
> 
> Thinking about this comment, I would like to ask: can't the
> virtio device indicate the ranges in a portable way?
> This would minimize the dependency on dt bindings and ACPI,
> enabling support for systems that have neither but do
> have virtio e.g. through pci.

I thought about adding a PROBE request for this in virtio-iommu, but it
wouldn't be usable by a Linux guest because of a bootstrapping problem.

Early on, Linux needs a description of device dependencies, to determine
in which order to probe them. If the device dependency was described by
virtio-iommu itself, the guest could for example initialize a NIC,
allocate buffers and start DMA on the physical address space (which aborts
if the IOMMU implementation disallows DMA by default), only to find out
once the virtio-iommu module is loaded that it needs to cancel all DMA and
reconfigure the NIC. With a static description such as iommu-map in DT or
ACPI remapping tables, the guest can defer probing of the NIC until the
IOMMU is initialized.

Thanks,
Jean
Michael S. Tsirkin Oct. 17, 2018, 3:14 p.m. UTC | #6
On Mon, Oct 15, 2018 at 08:46:41PM +0100, Jean-philippe Brucker wrote:
> [Replying with my personal address because we're having SMTP issues]
> 
> On 15/10/2018 11:52, Michael S. Tsirkin wrote:
> > On Fri, Oct 12, 2018 at 02:41:59PM -0500, Bjorn Helgaas wrote:
> >> s/iommu/IOMMU/ in subject
> >>
> >> On Fri, Oct 12, 2018 at 03:59:13PM +0100, Jean-Philippe Brucker wrote:
> >>> Using the iommu-map binding, endpoints in a given PCI domain can be
> >>> managed by different IOMMUs. Some virtual machines may allow a subset of
> >>> endpoints to bypass the IOMMU. In some case the IOMMU itself is presented
> >>
> >> s/case/cases/
> >>
> >>> as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). Currently, when a
> >>> PCI root complex has an iommu-map property, the driver requires all
> >>> endpoints to be described by the property. Allow the iommu-map property to
> >>> have gaps.
> >>
> >> I'm not an IOMMU or virtio expert, so it's not obvious to me why it is
> >> safe to allow devices to bypass the IOMMU.  Does this mean a typo in
> >> iommu-map could inadvertently allow devices to bypass it?
> > 
> > 
> > Thinking about this comment, I would like to ask: can't the
> > virtio device indicate the ranges in a portable way?
> > This would minimize the dependency on dt bindings and ACPI,
> > enabling support for systems that have neither but do
> > have virtio e.g. through pci.
> 
> I thought about adding a PROBE request for this in virtio-iommu, but it
> wouldn't be usable by a Linux guest because of a bootstrapping problem.

Hmm. At some level it seems wrong to design hardware interfaces
around how Linux happens to probe things. That can change at any time
...

> Early on, Linux needs a description of device dependencies, to determine
> in which order to probe them. If the device dependency was described by
> virtio-iommu itself, the guest could for example initialize a NIC,
> allocate buffers and start DMA on the physical address space (which aborts
> if the IOMMU implementation disallows DMA by default), only to find out
> once the virtio-iommu module is loaded that it needs to cancel all DMA and
> reconfigure the NIC. With a static description such as iommu-map in DT or
> ACPI remapping tables, the guest can defer probing of the NIC until the
> IOMMU is initialized.
> 
> Thanks,
> Jean

Could you point me at the code you refer to here?
Robin Murphy Oct. 18, 2018, 10:47 a.m. UTC | #7
On 17/10/18 16:14, Michael S. Tsirkin wrote:
> On Mon, Oct 15, 2018 at 08:46:41PM +0100, Jean-philippe Brucker wrote:
>> [Replying with my personal address because we're having SMTP issues]
>>
>> On 15/10/2018 11:52, Michael S. Tsirkin wrote:
>>> On Fri, Oct 12, 2018 at 02:41:59PM -0500, Bjorn Helgaas wrote:
>>>> s/iommu/IOMMU/ in subject
>>>>
>>>> On Fri, Oct 12, 2018 at 03:59:13PM +0100, Jean-Philippe Brucker wrote:
>>>>> Using the iommu-map binding, endpoints in a given PCI domain can be
>>>>> managed by different IOMMUs. Some virtual machines may allow a subset of
>>>>> endpoints to bypass the IOMMU. In some case the IOMMU itself is presented
>>>>
>>>> s/case/cases/
>>>>
>>>>> as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). Currently, when a
>>>>> PCI root complex has an iommu-map property, the driver requires all
>>>>> endpoints to be described by the property. Allow the iommu-map property to
>>>>> have gaps.
>>>>
>>>> I'm not an IOMMU or virtio expert, so it's not obvious to me why it is
>>>> safe to allow devices to bypass the IOMMU.  Does this mean a typo in
>>>> iommu-map could inadvertently allow devices to bypass it?
>>>
>>>
>>> Thinking about this comment, I would like to ask: can't the
>>> virtio device indicate the ranges in a portable way?
>>> This would minimize the dependency on dt bindings and ACPI,
>>> enabling support for systems that have neither but do
>>> have virtio e.g. through pci.
>>
>> I thought about adding a PROBE request for this in virtio-iommu, but it
>> wouldn't be usable by a Linux guest because of a bootstrapping problem.
> 
> Hmm. At some level it seems wrong to design hardware interfaces
> around how Linux happens to probe things. That can change at any time
> ...

This isn't Linux-specific though. In general it's somewhere between 
difficult and impossible to pull in an IOMMU underneath a device after 
at device is active, so if any OS wants to use an IOMMU, it's going to 
want to know up-front that it's there and which devices it translates so 
that it can program said IOMMU appropriately *before* potentially 
starting DMA and/or interrupts from the relevant devices. Linux happens 
to do things in that order (either by firmware-driven probe-deferral or 
just perilous initcall ordering) because it is the only reasonable order 
in which to do them. AFAIK the platforms which don't rely on any 
firmware description of their IOMMU tend to have a fairly static system 
architecture (such that the OS simply makes hard-coded assumptions), so 
it's not necessarily entirely clear how they would cope with 
virtio-iommu either way.

Robin.

>> Early on, Linux needs a description of device dependencies, to determine
>> in which order to probe them. If the device dependency was described by
>> virtio-iommu itself, the guest could for example initialize a NIC,
>> allocate buffers and start DMA on the physical address space (which aborts
>> if the IOMMU implementation disallows DMA by default), only to find out
>> once the virtio-iommu module is loaded that it needs to cancel all DMA and
>> reconfigure the NIC. With a static description such as iommu-map in DT or
>> ACPI remapping tables, the guest can defer probing of the NIC until the
>> IOMMU is initialized.
>>
>> Thanks,
>> Jean
> 
> Could you point me at the code you refer to here?
>
Jean-Philippe Brucker Oct. 22, 2018, 11:27 a.m. UTC | #8
On 17/10/2018 16:14, Michael S. Tsirkin wrote:
>>> Thinking about this comment, I would like to ask: can't the
>>> virtio device indicate the ranges in a portable way?
>>> This would minimize the dependency on dt bindings and ACPI,
>>> enabling support for systems that have neither but do
>>> have virtio e.g. through pci.
>>
>> I thought about adding a PROBE request for this in virtio-iommu, but it
>> wouldn't be usable by a Linux guest because of a bootstrapping problem.
> 
> Hmm. At some level it seems wrong to design hardware interfaces
> around how Linux happens to probe things. That can change at any time
> ...

I suspect that most other OS will also solve this class of problem using
a standard such as DT or ACPI, because they also provide dependency for
clock, interrupts, power management, etc. We can add a self-contained
PROBE method if someone makes a case for it, but it's unlikely to get
used at all, and nearly impossible to implement in Linux. The host would
still need a method to tell the guest which device to probe first, for
example with kernel parameters.

>> Early on, Linux needs a description of device dependencies, to determine
>> in which order to probe them. If the device dependency was described by
>> virtio-iommu itself, the guest could for example initialize a NIC,
>> allocate buffers and start DMA on the physical address space (which aborts
>> if the IOMMU implementation disallows DMA by default), only to find out
>> once the virtio-iommu module is loaded that it needs to cancel all DMA and
>> reconfigure the NIC. With a static description such as iommu-map in DT or
>> ACPI remapping tables, the guest can defer probing of the NIC until the
>> IOMMU is initialized.
>>
>> Thanks,
>> Jean
> 
> Could you point me at the code you refer to here?

In drivers/base/dd.c, really_probe() calls dma_configure() before the
device driver's probe(). dma_configure() ends up calling either
of_dma_configure() or acpi_dma_configure(), which return -EPROBE_DEFER
if the device's IOMMU isn't yet available. In that case the device is
added to the deferred pending list.

After another device is successfully bound to a driver, all devices on
the pending list are retried (driver_deferred_probe_trigger()), and if
the dependency has been resolved, then dma_configure() succeeds.

Another method (used by Intel and AMD IOMMU drivers) is to initialize
the IOMMU as early as possible, after discovering it in the ACPI tables
and before probing other devices. This can't work for virtio-iommu
because the driver might be a module, in which case early init isn't
possible. We have to defer probe of all dependent devices until the
virtio and virtio-iommu modules are loaded.

Thanks,
Jean
diff mbox series

Patch

diff --git a/drivers/pci/of.c b/drivers/pci/of.c
index 1836b8ddf292..2f5015bdb256 100644
--- a/drivers/pci/of.c
+++ b/drivers/pci/of.c
@@ -451,9 +451,10 @@  int of_pci_map_rid(struct device_node *np, u32 rid,
 		return 0;
 	}
 
-	pr_err("%pOF: Invalid %s translation - no match for rid 0x%x on %pOF\n",
-		np, map_name, rid, target && *target ? *target : NULL);
-	return -EFAULT;
+	/* Bypasses translation */
+	if (id_out)
+		*id_out = rid;
+	return 0;
 }
 
 #if IS_ENABLED(CONFIG_OF_IRQ)