Message ID | 20181012145917.6840-4-jean-philippe.brucker@arm.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
Series | Add virtio-iommu driver | expand |
s/iommu/IOMMU/ in subject On Fri, Oct 12, 2018 at 03:59:13PM +0100, Jean-Philippe Brucker wrote: > Using the iommu-map binding, endpoints in a given PCI domain can be > managed by different IOMMUs. Some virtual machines may allow a subset of > endpoints to bypass the IOMMU. In some case the IOMMU itself is presented s/case/cases/ > as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). Currently, when a > PCI root complex has an iommu-map property, the driver requires all > endpoints to be described by the property. Allow the iommu-map property to > have gaps. I'm not an IOMMU or virtio expert, so it's not obvious to me why it is safe to allow devices to bypass the IOMMU. Does this mean a typo in iommu-map could inadvertently allow devices to bypass it? Should we indicate something in dmesg (and/or sysfs) about devices that bypass it? > Relaxing of_pci_map_rid also allows the msi-map property to have gaps, s/of_pci_map_rid/of_pci_map_rid()/ > which is invalid since MSIs always reach an MSI controller. Thankfully > Linux will error out later, when attempting to find an MSI domain for the > device. Not clear to me what "error out" means here. In a userspace program, I would infer that the program exits with an error message, but I doubt you mean that Linux exits. > Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> > --- > drivers/pci/of.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/drivers/pci/of.c b/drivers/pci/of.c > index 1836b8ddf292..2f5015bdb256 100644 > --- a/drivers/pci/of.c > +++ b/drivers/pci/of.c > @@ -451,9 +451,10 @@ int of_pci_map_rid(struct device_node *np, u32 rid, > return 0; > } > > - pr_err("%pOF: Invalid %s translation - no match for rid 0x%x on %pOF\n", > - np, map_name, rid, target && *target ? *target : NULL); > - return -EFAULT; > + /* Bypasses translation */ > + if (id_out) > + *id_out = rid; > + return 0; > } > > #if IS_ENABLED(CONFIG_OF_IRQ) > -- > 2.19.1 >
On Fri, Oct 12, 2018 at 02:41:59PM -0500, Bjorn Helgaas wrote: > s/iommu/IOMMU/ in subject > > On Fri, Oct 12, 2018 at 03:59:13PM +0100, Jean-Philippe Brucker wrote: > > Using the iommu-map binding, endpoints in a given PCI domain can be > > managed by different IOMMUs. Some virtual machines may allow a subset of > > endpoints to bypass the IOMMU. In some case the IOMMU itself is presented > > s/case/cases/ > > > as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). Currently, when a > > PCI root complex has an iommu-map property, the driver requires all > > endpoints to be described by the property. Allow the iommu-map property to > > have gaps. > > I'm not an IOMMU or virtio expert, so it's not obvious to me why it is > safe to allow devices to bypass the IOMMU. Does this mean a typo in > iommu-map could inadvertently allow devices to bypass it? Thinking about this comment, I would like to ask: can't the virtio device indicate the ranges in a portable way? This would minimize the dependency on dt bindings and ACPI, enabling support for systems that have neither but do have virtio e.g. through pci. > Should we > indicate something in dmesg (and/or sysfs) about devices that bypass > it? > > > Relaxing of_pci_map_rid also allows the msi-map property to have gaps, > > s/of_pci_map_rid/of_pci_map_rid()/ > > > which is invalid since MSIs always reach an MSI controller. Thankfully > > Linux will error out later, when attempting to find an MSI domain for the > > device. > > Not clear to me what "error out" means here. In a userspace program, > I would infer that the program exits with an error message, but I > doubt you mean that Linux exits. > > > Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> > > --- > > drivers/pci/of.c | 7 ++++--- > > 1 file changed, 4 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/pci/of.c b/drivers/pci/of.c > > index 1836b8ddf292..2f5015bdb256 100644 > > --- a/drivers/pci/of.c > > +++ b/drivers/pci/of.c > > @@ -451,9 +451,10 @@ int of_pci_map_rid(struct device_node *np, u32 rid, > > return 0; > > } > > > > - pr_err("%pOF: Invalid %s translation - no match for rid 0x%x on %pOF\n", > > - np, map_name, rid, target && *target ? *target : NULL); > > - return -EFAULT; > > + /* Bypasses translation */ > > + if (id_out) > > + *id_out = rid; > > + return 0; > > } > > > > #if IS_ENABLED(CONFIG_OF_IRQ) > > -- > > 2.19.1 > >
On 12/10/18 20:41, Bjorn Helgaas wrote: > s/iommu/IOMMU/ in subject > > On Fri, Oct 12, 2018 at 03:59:13PM +0100, Jean-Philippe Brucker wrote: >> Using the iommu-map binding, endpoints in a given PCI domain can be >> managed by different IOMMUs. Some virtual machines may allow a subset of >> endpoints to bypass the IOMMU. In some case the IOMMU itself is presented > > s/case/cases/ > >> as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). Currently, when a >> PCI root complex has an iommu-map property, the driver requires all >> endpoints to be described by the property. Allow the iommu-map property to >> have gaps. > > I'm not an IOMMU or virtio expert, so it's not obvious to me why it is > safe to allow devices to bypass the IOMMU. Does this mean a typo in > iommu-map could inadvertently allow devices to bypass it? Should we > indicate something in dmesg (and/or sysfs) about devices that bypass > it? It's not really "allow devices to bypass the IOMMU" so much as "allow DT to describe devices which the IOMMU doesn't translate". It's a bit of an edge case for not-really-PCI devices, but FWIW I can certainly think of several ways to build real hardware like that. As for inadvertent errors leaving out IDs which *should* be in the map, that really depends on the IOMMU/driver implementation - e.g. SMMUv2 with arm-smmu.disable_bypass=0 would treat the device as untranslated, whereas SMMUv3 would always generate a fault upon any transaction due to no valid stream table entry being programmed (not even a bypass one). I reckon it's a sufficiently unusual case that keeping some sort of message probably is worthwhile (at pr_info rather than pr_err) in case someone does hit it by mistake. >> Relaxing of_pci_map_rid also allows the msi-map property to have gaps, At worst, I suppose we could always add yet another parameter for each caller to choose whether a missing entry is considered an error or not. Robin. > s/of_pci_map_rid/of_pci_map_rid()/ > >> which is invalid since MSIs always reach an MSI controller. Thankfully >> Linux will error out later, when attempting to find an MSI domain for the >> device. > > Not clear to me what "error out" means here. In a userspace program, > I would infer that the program exits with an error message, but I > doubt you mean that Linux exits. > >> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> >> --- >> drivers/pci/of.c | 7 ++++--- >> 1 file changed, 4 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/pci/of.c b/drivers/pci/of.c >> index 1836b8ddf292..2f5015bdb256 100644 >> --- a/drivers/pci/of.c >> +++ b/drivers/pci/of.c >> @@ -451,9 +451,10 @@ int of_pci_map_rid(struct device_node *np, u32 rid, >> return 0; >> } >> >> - pr_err("%pOF: Invalid %s translation - no match for rid 0x%x on %pOF\n", >> - np, map_name, rid, target && *target ? *target : NULL); >> - return -EFAULT; >> + /* Bypasses translation */ >> + if (id_out) >> + *id_out = rid; >> + return 0; >> } >> >> #if IS_ENABLED(CONFIG_OF_IRQ) >> -- >> 2.19.1 >> > _______________________________________________ > iommu mailing list > iommu@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/iommu >
[Replying with my personal address because we're having SMTP issues] On 12/10/2018 20:41, Bjorn Helgaas wrote: > s/iommu/IOMMU/ in subject > > On Fri, Oct 12, 2018 at 03:59:13PM +0100, Jean-Philippe Brucker wrote: >> Using the iommu-map binding, endpoints in a given PCI domain can be >> managed by different IOMMUs. Some virtual machines may allow a subset of >> endpoints to bypass the IOMMU. In some case the IOMMU itself is presented > > s/case/cases/ > >> as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). Currently, when a >> PCI root complex has an iommu-map property, the driver requires all >> endpoints to be described by the property. Allow the iommu-map property to >> have gaps. > > I'm not an IOMMU or virtio expert, so it's not obvious to me why it is > safe to allow devices to bypass the IOMMU. Does this mean a typo in > iommu-map could inadvertently allow devices to bypass it? As Robin said, a device that is absent from iommu-map will be ignored by the IOMMU layer, so it depends on the specific IOMMU implementation and driver. By default the SMMU and virtio-iommu drivers disable bypass, but I'm not sure about the others. I'll try to find a more accurate title for this patch > Should we > indicate something in dmesg (and/or sysfs) about devices that bypass > it? Good idea, I'll replace the pr_err() below with a pr_info(), instead of simply removing it. >> Relaxing of_pci_map_rid also allows the msi-map property to have gaps, > > s/of_pci_map_rid/of_pci_map_rid()/ > >> which is invalid since MSIs always reach an MSI controller. Thankfully >> Linux will error out later, when attempting to find an MSI domain for the >> device. > > Not clear to me what "error out" means here. In a userspace program, > I would infer that the program exits with an error message, but I > doubt you mean that Linux exits. Right, I'll clarify this. It's pci_msi_setup_msi_irqs() that returns an error if the device is missing from msi-map, so the device driver won't be able to request MSIs. Thanks, Jean >> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> >> --- >> drivers/pci/of.c | 7 ++++--- >> 1 file changed, 4 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/pci/of.c b/drivers/pci/of.c >> index 1836b8ddf292..2f5015bdb256 100644 >> --- a/drivers/pci/of.c >> +++ b/drivers/pci/of.c >> @@ -451,9 +451,10 @@ int of_pci_map_rid(struct device_node *np, u32 rid, >> return 0; >> } >> >> - pr_err("%pOF: Invalid %s translation - no match for rid 0x%x on %pOF\n", >> - np, map_name, rid, target && *target ? *target : NULL); >> - return -EFAULT; >> + /* Bypasses translation */ >> + if (id_out) >> + *id_out = rid; >> + return 0; >> } >> >> #if IS_ENABLED(CONFIG_OF_IRQ) >> -- >> 2.19.1 >> >
[Replying with my personal address because we're having SMTP issues] On 15/10/2018 11:52, Michael S. Tsirkin wrote: > On Fri, Oct 12, 2018 at 02:41:59PM -0500, Bjorn Helgaas wrote: >> s/iommu/IOMMU/ in subject >> >> On Fri, Oct 12, 2018 at 03:59:13PM +0100, Jean-Philippe Brucker wrote: >>> Using the iommu-map binding, endpoints in a given PCI domain can be >>> managed by different IOMMUs. Some virtual machines may allow a subset of >>> endpoints to bypass the IOMMU. In some case the IOMMU itself is presented >> >> s/case/cases/ >> >>> as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). Currently, when a >>> PCI root complex has an iommu-map property, the driver requires all >>> endpoints to be described by the property. Allow the iommu-map property to >>> have gaps. >> >> I'm not an IOMMU or virtio expert, so it's not obvious to me why it is >> safe to allow devices to bypass the IOMMU. Does this mean a typo in >> iommu-map could inadvertently allow devices to bypass it? > > > Thinking about this comment, I would like to ask: can't the > virtio device indicate the ranges in a portable way? > This would minimize the dependency on dt bindings and ACPI, > enabling support for systems that have neither but do > have virtio e.g. through pci. I thought about adding a PROBE request for this in virtio-iommu, but it wouldn't be usable by a Linux guest because of a bootstrapping problem. Early on, Linux needs a description of device dependencies, to determine in which order to probe them. If the device dependency was described by virtio-iommu itself, the guest could for example initialize a NIC, allocate buffers and start DMA on the physical address space (which aborts if the IOMMU implementation disallows DMA by default), only to find out once the virtio-iommu module is loaded that it needs to cancel all DMA and reconfigure the NIC. With a static description such as iommu-map in DT or ACPI remapping tables, the guest can defer probing of the NIC until the IOMMU is initialized. Thanks, Jean
On Mon, Oct 15, 2018 at 08:46:41PM +0100, Jean-philippe Brucker wrote: > [Replying with my personal address because we're having SMTP issues] > > On 15/10/2018 11:52, Michael S. Tsirkin wrote: > > On Fri, Oct 12, 2018 at 02:41:59PM -0500, Bjorn Helgaas wrote: > >> s/iommu/IOMMU/ in subject > >> > >> On Fri, Oct 12, 2018 at 03:59:13PM +0100, Jean-Philippe Brucker wrote: > >>> Using the iommu-map binding, endpoints in a given PCI domain can be > >>> managed by different IOMMUs. Some virtual machines may allow a subset of > >>> endpoints to bypass the IOMMU. In some case the IOMMU itself is presented > >> > >> s/case/cases/ > >> > >>> as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). Currently, when a > >>> PCI root complex has an iommu-map property, the driver requires all > >>> endpoints to be described by the property. Allow the iommu-map property to > >>> have gaps. > >> > >> I'm not an IOMMU or virtio expert, so it's not obvious to me why it is > >> safe to allow devices to bypass the IOMMU. Does this mean a typo in > >> iommu-map could inadvertently allow devices to bypass it? > > > > > > Thinking about this comment, I would like to ask: can't the > > virtio device indicate the ranges in a portable way? > > This would minimize the dependency on dt bindings and ACPI, > > enabling support for systems that have neither but do > > have virtio e.g. through pci. > > I thought about adding a PROBE request for this in virtio-iommu, but it > wouldn't be usable by a Linux guest because of a bootstrapping problem. Hmm. At some level it seems wrong to design hardware interfaces around how Linux happens to probe things. That can change at any time ... > Early on, Linux needs a description of device dependencies, to determine > in which order to probe them. If the device dependency was described by > virtio-iommu itself, the guest could for example initialize a NIC, > allocate buffers and start DMA on the physical address space (which aborts > if the IOMMU implementation disallows DMA by default), only to find out > once the virtio-iommu module is loaded that it needs to cancel all DMA and > reconfigure the NIC. With a static description such as iommu-map in DT or > ACPI remapping tables, the guest can defer probing of the NIC until the > IOMMU is initialized. > > Thanks, > Jean Could you point me at the code you refer to here?
On 17/10/18 16:14, Michael S. Tsirkin wrote: > On Mon, Oct 15, 2018 at 08:46:41PM +0100, Jean-philippe Brucker wrote: >> [Replying with my personal address because we're having SMTP issues] >> >> On 15/10/2018 11:52, Michael S. Tsirkin wrote: >>> On Fri, Oct 12, 2018 at 02:41:59PM -0500, Bjorn Helgaas wrote: >>>> s/iommu/IOMMU/ in subject >>>> >>>> On Fri, Oct 12, 2018 at 03:59:13PM +0100, Jean-Philippe Brucker wrote: >>>>> Using the iommu-map binding, endpoints in a given PCI domain can be >>>>> managed by different IOMMUs. Some virtual machines may allow a subset of >>>>> endpoints to bypass the IOMMU. In some case the IOMMU itself is presented >>>> >>>> s/case/cases/ >>>> >>>>> as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). Currently, when a >>>>> PCI root complex has an iommu-map property, the driver requires all >>>>> endpoints to be described by the property. Allow the iommu-map property to >>>>> have gaps. >>>> >>>> I'm not an IOMMU or virtio expert, so it's not obvious to me why it is >>>> safe to allow devices to bypass the IOMMU. Does this mean a typo in >>>> iommu-map could inadvertently allow devices to bypass it? >>> >>> >>> Thinking about this comment, I would like to ask: can't the >>> virtio device indicate the ranges in a portable way? >>> This would minimize the dependency on dt bindings and ACPI, >>> enabling support for systems that have neither but do >>> have virtio e.g. through pci. >> >> I thought about adding a PROBE request for this in virtio-iommu, but it >> wouldn't be usable by a Linux guest because of a bootstrapping problem. > > Hmm. At some level it seems wrong to design hardware interfaces > around how Linux happens to probe things. That can change at any time > ... This isn't Linux-specific though. In general it's somewhere between difficult and impossible to pull in an IOMMU underneath a device after at device is active, so if any OS wants to use an IOMMU, it's going to want to know up-front that it's there and which devices it translates so that it can program said IOMMU appropriately *before* potentially starting DMA and/or interrupts from the relevant devices. Linux happens to do things in that order (either by firmware-driven probe-deferral or just perilous initcall ordering) because it is the only reasonable order in which to do them. AFAIK the platforms which don't rely on any firmware description of their IOMMU tend to have a fairly static system architecture (such that the OS simply makes hard-coded assumptions), so it's not necessarily entirely clear how they would cope with virtio-iommu either way. Robin. >> Early on, Linux needs a description of device dependencies, to determine >> in which order to probe them. If the device dependency was described by >> virtio-iommu itself, the guest could for example initialize a NIC, >> allocate buffers and start DMA on the physical address space (which aborts >> if the IOMMU implementation disallows DMA by default), only to find out >> once the virtio-iommu module is loaded that it needs to cancel all DMA and >> reconfigure the NIC. With a static description such as iommu-map in DT or >> ACPI remapping tables, the guest can defer probing of the NIC until the >> IOMMU is initialized. >> >> Thanks, >> Jean > > Could you point me at the code you refer to here? >
On 17/10/2018 16:14, Michael S. Tsirkin wrote: >>> Thinking about this comment, I would like to ask: can't the >>> virtio device indicate the ranges in a portable way? >>> This would minimize the dependency on dt bindings and ACPI, >>> enabling support for systems that have neither but do >>> have virtio e.g. through pci. >> >> I thought about adding a PROBE request for this in virtio-iommu, but it >> wouldn't be usable by a Linux guest because of a bootstrapping problem. > > Hmm. At some level it seems wrong to design hardware interfaces > around how Linux happens to probe things. That can change at any time > ... I suspect that most other OS will also solve this class of problem using a standard such as DT or ACPI, because they also provide dependency for clock, interrupts, power management, etc. We can add a self-contained PROBE method if someone makes a case for it, but it's unlikely to get used at all, and nearly impossible to implement in Linux. The host would still need a method to tell the guest which device to probe first, for example with kernel parameters. >> Early on, Linux needs a description of device dependencies, to determine >> in which order to probe them. If the device dependency was described by >> virtio-iommu itself, the guest could for example initialize a NIC, >> allocate buffers and start DMA on the physical address space (which aborts >> if the IOMMU implementation disallows DMA by default), only to find out >> once the virtio-iommu module is loaded that it needs to cancel all DMA and >> reconfigure the NIC. With a static description such as iommu-map in DT or >> ACPI remapping tables, the guest can defer probing of the NIC until the >> IOMMU is initialized. >> >> Thanks, >> Jean > > Could you point me at the code you refer to here? In drivers/base/dd.c, really_probe() calls dma_configure() before the device driver's probe(). dma_configure() ends up calling either of_dma_configure() or acpi_dma_configure(), which return -EPROBE_DEFER if the device's IOMMU isn't yet available. In that case the device is added to the deferred pending list. After another device is successfully bound to a driver, all devices on the pending list are retried (driver_deferred_probe_trigger()), and if the dependency has been resolved, then dma_configure() succeeds. Another method (used by Intel and AMD IOMMU drivers) is to initialize the IOMMU as early as possible, after discovering it in the ACPI tables and before probing other devices. This can't work for virtio-iommu because the driver might be a module, in which case early init isn't possible. We have to defer probe of all dependent devices until the virtio and virtio-iommu modules are loaded. Thanks, Jean
diff --git a/drivers/pci/of.c b/drivers/pci/of.c index 1836b8ddf292..2f5015bdb256 100644 --- a/drivers/pci/of.c +++ b/drivers/pci/of.c @@ -451,9 +451,10 @@ int of_pci_map_rid(struct device_node *np, u32 rid, return 0; } - pr_err("%pOF: Invalid %s translation - no match for rid 0x%x on %pOF\n", - np, map_name, rid, target && *target ? *target : NULL); - return -EFAULT; + /* Bypasses translation */ + if (id_out) + *id_out = rid; + return 0; } #if IS_ENABLED(CONFIG_OF_IRQ)
Using the iommu-map binding, endpoints in a given PCI domain can be managed by different IOMMUs. Some virtual machines may allow a subset of endpoints to bypass the IOMMU. In some case the IOMMU itself is presented as a PCI endpoint (e.g. AMD IOMMU and virtio-iommu). Currently, when a PCI root complex has an iommu-map property, the driver requires all endpoints to be described by the property. Allow the iommu-map property to have gaps. Relaxing of_pci_map_rid also allows the msi-map property to have gaps, which is invalid since MSIs always reach an MSI controller. Thankfully Linux will error out later, when attempting to find an MSI domain for the device. Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> --- drivers/pci/of.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)