
[RFC,XEN,v3,1/5] docs/designs: Add a design document for PV-IOMMU

Message ID cf749c46f9d3d91bc116c96ee2fd1869164fbe5b.1720703078.git.teddy.astie@vates.tech (mailing list archive)
State New
Series IOMMU subsystem redesign and PV-IOMMU interface

Commit Message

Teddy Astie July 11, 2024, 2:04 p.m. UTC
Some operating systems want to use the IOMMU to implement various features (e.g.
VFIO) or DMA protection.
This patch introduces a proposal for IOMMU paravirtualization for Dom0.

Signed-off-by: Teddy Astie <teddy.astie@vates.tech>
---
 docs/designs/pv-iommu.md | 105 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 105 insertions(+)
 create mode 100644 docs/designs/pv-iommu.md

Comments

Alejandro Vallejo July 11, 2024, 6:26 p.m. UTC | #1
Disclaimer: I haven't looked at the code yet.

On Thu Jul 11, 2024 at 3:04 PM BST, Teddy Astie wrote:
> Some operating systems want to use IOMMU to implement various features (e.g
> VFIO) or DMA protection.
> This patch introduce a proposal for IOMMU paravirtualization for Dom0.
>
> Signed-off-by Teddy Astie <teddy.astie@vates.tech>
> ---
>  docs/designs/pv-iommu.md | 105 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 105 insertions(+)
>  create mode 100644 docs/designs/pv-iommu.md
>
> diff --git a/docs/designs/pv-iommu.md b/docs/designs/pv-iommu.md
> new file mode 100644
> index 0000000000..c01062a3ad
> --- /dev/null
> +++ b/docs/designs/pv-iommu.md
> @@ -0,0 +1,105 @@
> +# IOMMU paravirtualization for Dom0
> +
> +Status: Experimental
> +
> +# Background
> +
> +By default, Xen only uses the IOMMU for itself, either to make device adress
> +space coherent with guest adress space (x86 HVM/PVH) or to prevent devices
> +from doing DMA outside it's expected memory regions including the hypervisor
> +(x86 PV).

"By default...": Do you mean "currently"?

> +
> +A limitation is that guests (especially privildged ones) may want to use
> +IOMMU hardware in order to implement features such as DMA protection and
> +VFIO [1] as IOMMU functionality is not available outside of the hypervisor
> +currently.

s/privildged/privileged/

> +
> +[1] VFIO - "Virtual Function I/O" - https://www.kernel.org/doc/html/latest/driver-api/vfio.html
> +
> +# Design
> +
> +The operating system may want to have access to various IOMMU features such as
> +context management and DMA remapping. We can create a new hypercall that allows
> +the guest to have access to a new paravirtualized IOMMU interface.
> +
> +This feature is only meant to be available for the Dom0, as DomU have some
> +emulated devices that can't be managed on Xen side and are not hardware, we
> +can't rely on the hardware IOMMU to enforce DMA remapping.

Is that the reason though? While it's true we can't mix emulated and real
devices under the same emulated PCI bus covered by an IOMMU, nothing prevents us
from stating "the IOMMU(s) configured via PV-IOMMU cover from busN to busM".

AFAIK, that already happens on systems with several IOMMUs, where they might
affect partially disjoint devices. But I admit I'm no expert on this.

I can definitely see a lot of interesting use cases for a PV-IOMMU interface
exposed to domUs (it'd be a subset of that of dom0, obviously); that'd
allow them to use the IOMMU without resorting to 2-stage translation, which has
terrible IOTLB miss costs.

> +
> +This interface is exposed under the `iommu_op` hypercall.
> +
> +In addition, Xen domains are modified in order to allow existence of several
> +IOMMU context including a default one that implement default behavior (e.g
> +hardware assisted paging) and can't be modified by guest. DomU cannot have
> +contexts, and therefore act as if they only have the default domain.
> +
> +Each IOMMU context within a Xen domain is identified using a domain-specific
> +context number that is used in the Xen IOMMU subsystem and the hypercall
> +interface.
> +
> +The number of IOMMU context a domain can use is predetermined at domain creation
> +and is configurable through `dom0-iommu=nb-ctx=N` xen cmdline.

nit: I think it's more typical within Xen to see "nr" rather than "nb"

> +
> +# IOMMU operations
> +
> +## Alloc context
> +
> +Create a new IOMMU context for the guest and return the context number to the
> +guest.
> +Fail if the IOMMU context limit of the guest is reached.

or -ENOMEM, I guess.

I'm guessing from this dom0 takes care of the contexts for guests? Or are these
contexts for use within dom0 exclusively?

> +
> +A flag can be specified to create a identity mapping.
> +
> +## Free context
> +
> +Destroy a IOMMU context created previously.
> +It is not possible to free the default context.
> +
> +Reattach context devices to default context if specified by the guest.
> +
> +Fail if there is a device in the context and reattach-to-default flag is not
> +specified.
> +
> +## Reattach device
> +
> +Reattach a device to another IOMMU context (including the default one).
> +The target IOMMU context number must be valid and the context allocated.
> +
> +The guest needs to specify a PCI SBDF of a device he has access to.
> +
> +## Map/unmap page
> +
> +Map/unmap a page on a context.
> +The guest needs to specify a gfn and target dfn to map.

And an "order", I hope; to enable superpages and hugepages without having to
find out after the fact that the mappings are in fact mergeable and the leaf PTs
can go away.
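
For illustration, a map sub-op along those lines might carry an order field. This is only a rough sketch; the struct name, fields and layout are hypothetical, not the proposed ABI:

```c
#include <stdint.h>

/* Hypothetical layout, for illustration only: a map sub-op whose "order"
 * field lets the guest map 2^order contiguous frames in one call, so the
 * hypervisor can install a superpage directly instead of discovering
 * mergeable mappings after the fact. */
struct pv_iommu_map_op {
    uint16_t ctx_no;   /* target IOMMU context number */
    uint32_t flags;    /* e.g. readable/writable permissions */
    uint64_t gfn;      /* first guest frame to map */
    uint64_t dfn;      /* first device frame it should appear at */
    uint8_t  order;    /* log2 of the number of contiguous frames */
};
```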

> +
> +Refuse to create the mapping if one already exist for the same dfn.
> +
> +## Lookup page
> +
> +Get the gfn mapped by a specific dfn.
> +
> +# Implementation considerations
> +
> +## Hypercall batching
> +
> +In order to prevent unneeded hypercalls and IOMMU flushing, it is advisable to
> +be able to batch some critical IOMMU operations (e.g map/unmap multiple pages).

See above for an additional way of reducing the load.

> +
> +## Hardware without IOMMU support
> +
> +Operating system needs to be aware on PV-IOMMU capability, and whether it is
> +able to make contexts. However, some operating system may critically fail in
> +case they are able to make a new IOMMU context. Which is supposed to happen
> +if no IOMMU hardware is available.
> +
> +The hypercall interface needs a interface to advertise the ability to create
> +and manage IOMMU contexts including the amount of context the guest is able
> +to use. Using these informations, the Dom0 may decide whether to use or not
> +the PV-IOMMU interface.

We could just return -ENOTSUPP when there's no IOMMU, then encapsulate a random
lookup with pv_iommu_is_present() and return true or false depending on rc.
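
A minimal sketch of that probing idea, on the dom0 (Linux) kernel side; pv_iommu_lookup_page() is a hypothetical wrapper around whatever lookup sub-op the final interface provides:

```c
#include <linux/errno.h>
#include <linux/types.h>

/* Sketch only: wrap a harmless lookup and treat -ENOTSUPP as
 * "no PV-IOMMU available". The wrapper below is hypothetical. */
static bool pv_iommu_is_present(void)
{
    u64 gfn;
    /* hypothetical wrapper around a lookup sub-op on the default context */
    int rc = pv_iommu_lookup_page(0 /* ctx_no */, 0 /* dfn */, &gfn);

    /* -ENOTSUPP means the hypercall (or PV-IOMMU) is not implemented. */
    return rc != -ENOTSUPP;
}
```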

> +
> +## Page pool for contexts
> +
> +In order to prevent unexpected starving on the hypervisor memory with a
> +buggy Dom0. We can preallocate the pages the contexts will use and make
> +map/unmap use these pages instead of allocating them dynamically.
> +

That seems dangerous should we need to shatter a superpage asynchronously (i.e:
due to HW misbehaving and requiring it) and have no more pages in the pool.

Cheers,
Alejandro
Teddy Astie July 12, 2024, 8:54 a.m. UTC | #2
Hello Alejandro, thanks for the reply!

On 11/07/2024 at 20:26, Alejandro Vallejo wrote:
> Disclaimer: I haven't looked at the code yet.
> 
> On Thu Jul 11, 2024 at 3:04 PM BST, Teddy Astie wrote:
>> Some operating systems want to use IOMMU to implement various features (e.g
>> VFIO) or DMA protection.
>> This patch introduce a proposal for IOMMU paravirtualization for Dom0.
>>
>> Signed-off-by Teddy Astie <teddy.astie@vates.tech>
>> ---
>>   docs/designs/pv-iommu.md | 105 +++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 105 insertions(+)
>>   create mode 100644 docs/designs/pv-iommu.md
>>
>> diff --git a/docs/designs/pv-iommu.md b/docs/designs/pv-iommu.md
>> new file mode 100644
>> index 0000000000..c01062a3ad
>> --- /dev/null
>> +++ b/docs/designs/pv-iommu.md
>> @@ -0,0 +1,105 @@
>> +# IOMMU paravirtualization for Dom0
>> +
>> +Status: Experimental
>> +
>> +# Background
>> +
>> +By default, Xen only uses the IOMMU for itself, either to make device adress
>> +space coherent with guest adress space (x86 HVM/PVH) or to prevent devices
>> +from doing DMA outside it's expected memory regions including the hypervisor
>> +(x86 PV).
> 
> "By default...": Do you mean "currently"?
> 

Yes, that's what I mean by "default" here.

>> +
>> +[1] VFIO - "Virtual Function I/O" - https://www.kernel.org/doc/html/latest/driver-api/vfio.html
>> +
>> +# Design
>> +
>> +The operating system may want to have access to various IOMMU features such as
>> +context management and DMA remapping. We can create a new hypercall that allows
>> +the guest to have access to a new paravirtualized IOMMU interface.
>> +
>> +This feature is only meant to be available for the Dom0, as DomU have some
>> +emulated devices that can't be managed on Xen side and are not hardware, we
>> +can't rely on the hardware IOMMU to enforce DMA remapping.
> 
> Is that the reason though? While it's true we can't mix emulated and real
> devices under the same emulated PCI bus covered by an IOMMU, nothing prevents us
> from stating "the IOMMU(s) configured via PV-IOMMU cover from busN to busM".
> 
> AFAIK, that already happens on systems with several IOMMUs, where they might
> affect partially disjoint devices. But I admit I'm no expert on this.
>
I am not an expert on how emulated devices are exposed, but the guest 
will definitely need a way to know whether a device is hardware or not.

But I think the situation differs depending on whether we do PV or HVM. In 
PV, there are no emulated devices AFAIK, so no identification is needed. In 
the case of HVM, there are, which we should consider.

There is still the question of how PV-IOMMU would interact with possible 
future IOMMU emulation (VT-d with QEMU) that may be allowed to act on real 
devices (e.g. by relying on the new IOMMU infrastructure).

> I can definitely see a lot of interesting use cases for a PV-IOMMU interface
> exposed to domUs (it'd be a subset of that of dom0, obviously); that'd
> allow them to use the IOMMU without resorting to 2-stage translation, which has
> terrible IOTLB miss costs.
> 

Makes sense; it could be useful e.g. for storage domains with SPDK 
support. Do note that 2-stage IOMMU translation is only supported by very 
modern hardware (e.g. 4th generation Xeon Scalable).

>> +
>> +This interface is exposed under the `iommu_op` hypercall.
>> +
>> +In addition, Xen domains are modified in order to allow existence of several
>> +IOMMU context including a default one that implement default behavior (e.g
>> +hardware assisted paging) and can't be modified by guest. DomU cannot have
>> +contexts, and therefore act as if they only have the default domain.
>> +
>> +Each IOMMU context within a Xen domain is identified using a domain-specific
>> +context number that is used in the Xen IOMMU subsystem and the hypercall
>> +interface.
>> +
>> +The number of IOMMU context a domain can use is predetermined at domain creation
>> +and is configurable through `dom0-iommu=nb-ctx=N` xen cmdline.
> 
> nit: I think it's more typical within Xen to see "nr" rather than "nb"
> 

yes

>> +
>> +# IOMMU operations
>> +
>> +## Alloc context
>> +
>> +Create a new IOMMU context for the guest and return the context number to the
>> +guest.
>> +Fail if the IOMMU context limit of the guest is reached.
> 
> or -ENOMEM, I guess.
> 
> I'm guessing from this dom0 takes care of the contexts for guests? Or are these
> contexts for use within dom0 exclusively?
>

Each domain has a set of "IOMMU contexts" that can be allocated and freed 
(up to a limit fixed at domain creation).
If no context is available (i.e. the context number limit is hit), I 
chose -ENOSPC as the error return (-ENOMEM is reserved for lack of memory, 
which can also happen).
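
A hypervisor-side sketch of that error split; the domain-state struct and helpers are hypothetical, not the actual implementation:

```c
#include <errno.h>
#include <stdint.h>

/* Sketch only (all names hypothetical): distinguish "context limit
 * reached" (-ENOSPC) from "cannot allocate backing memory" (-ENOMEM). */
static int pv_iommu_alloc_context(struct pv_iommu_dom *d, uint16_t *ctx_no)
{
    uint16_t ctx;

    /* Every context slot is in use: the fixed per-domain limit is hit. */
    if (d->nr_ctx_in_use >= d->max_ctx_no)
        return -ENOSPC;

    ctx = find_free_ctx_slot(d);            /* hypothetical helper */

    /* Allocating the new context's page tables can still fail. */
    if (init_ctx_pagetables(d, ctx) != 0)   /* hypothetical helper */
        return -ENOMEM;

    d->nr_ctx_in_use++;
    *ctx_no = ctx;
    return 0;
}
```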

>> +
>> +A flag can be specified to create a identity mapping.
>> +
>> +## Free context
>> +
>> +Destroy a IOMMU context created previously.
>> +It is not possible to free the default context.
>> +
>> +Reattach context devices to default context if specified by the guest.
>> +
>> +Fail if there is a device in the context and reattach-to-default flag is not
>> +specified.
>> +
>> +## Reattach device
>> +
>> +Reattach a device to another IOMMU context (including the default one).
>> +The target IOMMU context number must be valid and the context allocated.
>> +
>> +The guest needs to specify a PCI SBDF of a device he has access to.
>> +
>> +## Map/unmap page
>> +
>> +Map/unmap a page on a context.
>> +The guest needs to specify a gfn and target dfn to map.
> 
> And an "order", I hope; to enable superpages and hugepages without having to
> find out after the fact that the mappings are in fact mergeable and the leaf PTs
> can go away.
> 

In my implementation, I added a "nr_page" parameter to specify how many 
pages can be mapped at once (and the superpages can be derived from it). 
As you suggest, it can be useful to try to optimize the map operation by 
mapping superpages directly.
The biggest problem is that the superpage mapping we would like is only 
going to be valid if the target page of the domain is also a superpage 
(because the mapped region also needs to be contiguous in actual physical 
memory, not just from the guest's point of view).
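
For illustration, the condition under which a 2 MiB superpage could back part of such an nr_page-style mapping might look roughly like this; the constant and the contiguity helper are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

#define SUPERPAGE_PAGES (1u << 9)   /* 512 x 4 KiB = 2 MiB */

/* Sketch only: dfn and gfn must be superpage-aligned, enough pages must
 * remain in the request, and the backing host memory must itself be
 * physically contiguous (the helper below is hypothetical). */
static bool can_use_superpage(uint64_t dfn, uint64_t gfn, uint32_t pages_left)
{
    return !(dfn & (SUPERPAGE_PAGES - 1)) &&
           !(gfn & (SUPERPAGE_PAGES - 1)) &&
           pages_left >= SUPERPAGE_PAGES &&
           gfn_range_host_contiguous(gfn, SUPERPAGE_PAGES); /* hypothetical */
}
```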

>> +
>> +## Hardware without IOMMU support
>> +
>> +Operating system needs to be aware on PV-IOMMU capability, and whether it is
>> +able to make contexts. However, some operating system may critically fail in
>> +case they are able to make a new IOMMU context. Which is supposed to happen
>> +if no IOMMU hardware is available.
>> +
>> +The hypercall interface needs a interface to advertise the ability to create
>> +and manage IOMMU contexts including the amount of context the guest is able
>> +to use. Using these informations, the Dom0 may decide whether to use or not
>> +the PV-IOMMU interface.
> 
> We could just return -ENOTSUPP when there's no IOMMU, then encapsulate a random
> lookup with pv_iommu_is_present() and return true or false depending on rc.
> 

-ENOTSUPP makes sense; another way I use to report no PV-IOMMU support is 
to report limits that mean "no operation is actually possible" 
(e.g. max_ctx_no = 0).

>> +
>> +## Page pool for contexts
>> +
>> +In order to prevent unexpected starving on the hypervisor memory with a
>> +buggy Dom0. We can preallocate the pages the contexts will use and make
>> +map/unmap use these pages instead of allocating them dynamically.
>> +
> 
> That seems dangerous should we need to shatter a superpage asynchronously (i.e:
> due to HW misbehaving and requiring it) and have no more pages in the pool.
> 

Superpage shattering is actually recoverable (if you fail to allocate 
the new leaves, you just keep the superpage entry and act as if nothing 
happened) and report -ENOMEM. Nothing happened from the hardware's point of view.

The modification of the superpage entry into a regular one is only done 
once the leaves are actually valid. A similar story applies when 
collapsing leaves into superpages (you can free the leaves only once the 
hardware no longer uses them, e.g. after a relevant iotlb_flush).
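
A sketch of that ordering, with all names hypothetical; the point is only that the live PTE is rewritten last, so an allocation failure changes nothing:

```c
#include <errno.h>
#include <stdint.h>

/* Sketch only: shatter a superpage entry in a recoverable way. The leaf
 * table is allocated and fully populated first; only then is the
 * hardware-visible superpage PTE replaced. All names are hypothetical. */
static int shatter_superpage(struct iommu_ctx *ctx, uint64_t *spte)
{
    uint64_t *leaf = alloc_pagetable(ctx);  /* may fail under memory pressure */
    unsigned int i;

    if (!leaf)
        return -ENOMEM;   /* superpage entry kept as-is, nothing changed */

    /* Make the leaf table reproduce the superpage mapping exactly. */
    for (i = 0; i < PTES_PER_TABLE; i++)
        leaf[i] = make_leaf_pte(*spte, i);

    /* Only now switch the live entry, then flush the IOTLB. */
    *spte = make_table_pte(leaf);
    iommu_flush(ctx);

    return 0;
}
```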

> Cheers,
> Alejandro

Teddy


Teddy Astie | Vates XCP-ng Intern

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech

Patch

diff --git a/docs/designs/pv-iommu.md b/docs/designs/pv-iommu.md
new file mode 100644
index 0000000000..c01062a3ad
--- /dev/null
+++ b/docs/designs/pv-iommu.md
@@ -0,0 +1,105 @@ 
+# IOMMU paravirtualization for Dom0
+
+Status: Experimental
+
+# Background
+
+Currently, Xen only uses the IOMMU for itself, either to make device address
+space coherent with guest address space (x86 HVM/PVH) or to prevent devices
+from doing DMA outside their expected memory regions, including the hypervisor
+(x86 PV).
+
+A limitation is that guests (especially privileged ones) may want to use
+IOMMU hardware in order to implement features such as DMA protection and
+VFIO [1], as IOMMU functionality is currently not available outside of the
+hypervisor.
+
+[1] VFIO - "Virtual Function I/O" - https://www.kernel.org/doc/html/latest/driver-api/vfio.html
+
+# Design
+
+The operating system may want to have access to various IOMMU features such as
+context management and DMA remapping. We can create a new hypercall that allows
+the guest to have access to a new paravirtualized IOMMU interface.
+
+This feature is only meant to be available for Dom0: since DomUs have some
+emulated devices that aren't hardware and can't be managed on the Xen side,
+we can't rely on the hardware IOMMU to enforce DMA remapping for them.
+
+This interface is exposed under the `iommu_op` hypercall.
+
+In addition, Xen domains are modified in order to allow several IOMMU
+contexts to exist, including a default one that implements the default behavior
+(e.g. hardware-assisted paging) and can't be modified by the guest. DomUs cannot
+have contexts, and therefore act as if they only have the default context.
+
+Each IOMMU context within a Xen domain is identified using a domain-specific
+context number that is used in the Xen IOMMU subsystem and the hypercall
+interface.
+
+The number of IOMMU contexts a domain can use is predetermined at domain creation
+and is configurable through the `dom0-iommu=nb-ctx=N` Xen command line option.
+
+# IOMMU operations
+
+## Alloc context
+
+Create a new IOMMU context for the guest and return the context number to the
+guest.
+Fail if the IOMMU context limit of the guest is reached.
+
+A flag can be specified to create an identity mapping.
+
+## Free context
+
+Destroy an IOMMU context created previously.
+It is not possible to free the default context.
+
+Reattach the context's devices to the default context if specified by the guest.
+
+Fail if there is a device in the context and the reattach-to-default flag is not
+specified.
+
+## Reattach device
+
+Reattach a device to another IOMMU context (including the default one).
+The target IOMMU context number must be valid and the context allocated.
+
+The guest needs to specify the PCI SBDF of a device it has access to.
+
+## Map/unmap page
+
+Map/unmap a page on a context.
+The guest needs to specify a gfn and target dfn to map.
+
+Refuse to create the mapping if one already exists for the same dfn.
+
+## Lookup page
+
+Get the gfn mapped by a specific dfn.
+
+# Implementation considerations
+
+## Hypercall batching
+
+In order to prevent unneeded hypercalls and IOMMU flushing, it is advisable to
+be able to batch some critical IOMMU operations (e.g. mapping/unmapping multiple pages).
+
+## Hardware without IOMMU support
+
+The operating system needs to be aware of the PV-IOMMU capability, and whether
+it is able to create contexts. However, some operating systems may critically
+fail if they are unable to create a new IOMMU context, which is what happens
+if no IOMMU hardware is available.
+
+The hypercall interface needs a way to advertise the ability to create
+and manage IOMMU contexts, including the number of contexts the guest is able
+to use. Using this information, Dom0 may decide whether or not to use
+the PV-IOMMU interface.
+
+## Page pool for contexts
+
+In order to prevent a buggy Dom0 from unexpectedly starving the hypervisor
+of memory, we can preallocate the pages the contexts will use and make
+map/unmap use these pages instead of allocating them dynamically.
+