diff mbox

devicetree: Add generic IOMMU device tree bindings

Message ID 1400242998-437-1-git-send-email-thierry.reding@gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Thierry Reding May 16, 2014, 12:23 p.m. UTC
From: Thierry Reding <treding@nvidia.com>

This commit introduces a generic device tree binding for IOMMU devices.
Only a very minimal subset is described here, but it is enough to cover
the requirements of both the Exynos System MMU and Tegra SMMU as
discussed here:

    https://lkml.org/lkml/2014/4/27/346

More advanced functionality such as the dma-ranges property can easily
be added in a backwards-compatible way. In the absence of a dma-ranges
property it should be safe to default to the whole address space.

Signed-off-by: Thierry Reding <treding@nvidia.com>
---
 Documentation/devicetree/bindings/iommu/iommu.txt | 109 ++++++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/iommu/iommu.txt

Comments

Cho KyongHo May 17, 2014, 8:04 a.m. UTC | #1
On Fri, 16 May 2014 14:23:18 +0200, Thierry Reding wrote:
> From: Thierry Reding <treding@nvidia.com>
> 
> This commit introduces a generic device tree binding for IOMMU devices.
> Only a very minimal subset is described here, but it is enough to cover
> the requirements of both the Exynos System MMU and Tegra SMMU as
> discussed here:
> 
>     https://lkml.org/lkml/2014/4/27/346
> 
> More advanced functionality such as the dma-ranges property can easily
> be added in a backwards-compatible way. In the absence of a dma-ranges
> property it should be safe to default to the whole address space.
> 
> Signed-off-by: Thierry Reding <treding@nvidia.com>
> ---
>  Documentation/devicetree/bindings/iommu/iommu.txt | 109 ++++++++++++++++++++++
>  1 file changed, 109 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/iommu/iommu.txt
> 
> diff --git a/Documentation/devicetree/bindings/iommu/iommu.txt b/Documentation/devicetree/bindings/iommu/iommu.txt
> new file mode 100644
> index 000000000000..2d67b52b656e
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/iommu/iommu.txt
> @@ -0,0 +1,109 @@
> +This document describes the generic device tree binding for IOMMUs and their
> +master(s).
> +
> +
> +IOMMU device node:
> +==================
> +
> +An IOMMU can provide the following services:
> +
> +* Remap address space to allow devices to access physical memory ranges that
> +  they otherwise wouldn't be capable of accessing.
> +
> +  Example: 32-bit DMA to 64-bit physical addresses
> +
> +* Implement scatter-gather at page level granularity so that the device does
> +  not have to.
> +
> +* Provide system protection against "rogue" DMA by forcing all accesses to go
> +  through the IOMMU and faulting when encountering accesses to unmapped
> +  address regions.
> +
> +* Provide address space isolation between multiple contexts.
> +
> +  Example: Virtualization
> +
> +Device nodes compatible with this binding represent hardware with some of the
> +above capabilities.
> +
> +IOMMUs can be single-master or multiple-master. Single-master IOMMU devices
> +typically have a fixed association to the master device, whereas multiple-
> +master IOMMU devices can translate accesses from more than one master.
> +
> +Required properties:
> +--------------------
> +- #iommu-cells: The number of cells in an IOMMU specifier. The meaning of the
> +  cells is defined by the binding for the IOMMU device.
> +
> +  Typical values include:
> +  * 0: Single-master IOMMU devices are often not configurable, therefore the
> +    specifying doesn't need to encode any information and can be empty.
> +
> +  * 1: Multiple-master IOMMU devices need to know for which master they should
> +    enable translation. Typically the single cell in the specifier corresponds
> +    to the master device's ID.
> +
> +
> +IOMMU master node:
> +==================
> +
> +Devices that access memory through an IOMMU are called masters. A device can
> +have multiple master interfaces (to one or more IOMMU devices).
> +
> +Required properties:
> +--------------------
> +- iommus: A list of phandle and IOMMU specifier pairs that describe the IOMMU
> +  master interfaces of the device. One entry in the list describes one master
> +  interface of the device.
> +
> +Optional properties:
> +--------------------
> +- iommu-names: A list of names identifying each entry in the iommus property.
> +
> +
> +Examples:
> +=========
> +
> +Single-master IOMMU:
> +--------------------
> +
> +	iommu {
> +		#iommu-cells = <0>;
> +	};
> +
> +	master {
> +		iommu = <&/iommu>;
> +	};
> +

Great work, Thierry.

One simple comment.

This should be also applicable to multi-master IOMMUs that the masters
of an IOMMU is not configurable with ID or something.
I think the title needs to be changed to cover such IOMMUs which always
translate master's transactions and unable to change the configuration
of the relationship between the masters and IOMMUs by S/W.

Regards,

KyongHo

> +Multi-master IOMMU:
> +-------------------
> +
> +	iommu {
> +		/* the specifier represents the ID of the master */
> +		#iommu-cells = <1>;
> +	};
> +
> +	master {
> +		/* device has master ID 42 in the IOMMU */
> +		iommu = <&/iommu 42>;
> +	};
> +
> +Multi-master device:
> +--------------------
> +
> +	/* single-master IOMMU */
> +	iommu@1 {
> +		#iommu-cells = <0>;
> +	};
> +
> +	/* multi-master IOMMU */
> +	iommu@2 {
> +		/* the specifier represents the ID of the master */
> +		#iommu-cells = <1>;
> +	};
> +
> +	/* device with two master interfaces */
> +	master {
> +		iommus = <&/iommu@1>,    /* master of the single-master IOMMU */
> +			 <&/iommu@2 42>; /* ID 42 in multi-master IOMMU */
> +	};
> -- 
> 1.9.2
>
Thierry Reding May 17, 2014, 8:48 p.m. UTC | #2
On Sat, May 17, 2014 at 05:04:55PM +0900, Cho KyongHo wrote:
> On Fri, 16 May 2014 14:23:18 +0200, Thierry Reding wrote:
[...]
> > +Examples:
> > +=========
> > +
> > +Single-master IOMMU:
> > +--------------------
> > +
> > +	iommu {
> > +		#iommu-cells = <0>;
> > +	};
> > +
> > +	master {
> > +		iommu = <&/iommu>;
> > +	};
> > +
> 
> Great work, Thierry.
> 
> One simple comment.
> 
> This should be also applicable to multi-master IOMMUs that the masters
> of an IOMMU is not configurable with ID or something.
> I think the title needs to be changed to cover such IOMMUs which always
> translate master's transactions and unable to change the configuration
> of the relationship between the masters and IOMMUs by S/W.

Agreed, how about we add a separate example for that case:

Multiple-master IOMMU with fixed associations:
----------------------------------------------

	/* multiple-master IOMMU */
	iommu {
		/*
		 * Masters are statically associated with this IOMMU and
		 * address translation is always enabled.
		 */
		#iommu-cells = <0>;
	};

	/* static association with IOMMU */
	master@1 {
		reg = <1>;
		iommus = <&/iommu>;
	};

	/* static association with IOMMU */
	master@2 {
		reg = <2>;
		iommus = <&/iommu>;
	};

?

Thierry
Arnd Bergmann May 19, 2014, 10:26 a.m. UTC | #3
On Friday 16 May 2014 14:23:18 Thierry Reding wrote:
> From: Thierry Reding <treding@nvidia.com>
> 
> This commit introduces a generic device tree binding for IOMMU devices.
> Only a very minimal subset is described here, but it is enough to cover
> the requirements of both the Exynos System MMU and Tegra SMMU as
> discussed here:
> 
>     https://lkml.org/lkml/2014/4/27/346
> 
> More advanced functionality such as the dma-ranges property can easily
> be added in a backwards-compatible way. In the absence of a dma-ranges
> property it should be safe to default to the whole address space.
> 

The basic binding looks fine, but I'd like it to be more explicit
about dma-ranges. Most importantly, what does "the whole address space"
mean? A lot of IOMMUs have only 32-bit bus addresses when targetted
by a bus master, it would also be normal for some to be smaller and
some might even support 64-bit.

For the upstream side, I'd hope we always have access to the full
physical memory, but since this is a brand-new binding, it should
be straightforward to just ask for upstream dma-ranges properties
to be set all the way up to the root to confirm that.

For downstream, we don't actually have a good place to put the
dma-ranges property. We can't put it into the iommu node, because
that would imply translating to the iommu's parent bus, not the
iommu's own bus space. We also can't put it into the master, because
dma-ranges is supposed to be in the parent bus. Finally, it makes
no sense to use the dma-ranges property of the master's parent bus,
because that bus isn't actually involved in the translation.

My preferred option would be to always put the address range into
the iommu descriptor, using the iommu's #address-cells.

	Arnd
Thierry Reding May 19, 2014, 12:53 p.m. UTC | #4
On Mon, May 19, 2014 at 12:26:35PM +0200, Arnd Bergmann wrote:
> On Friday 16 May 2014 14:23:18 Thierry Reding wrote:
> > From: Thierry Reding <treding@nvidia.com>
> > 
> > This commit introduces a generic device tree binding for IOMMU devices.
> > Only a very minimal subset is described here, but it is enough to cover
> > the requirements of both the Exynos System MMU and Tegra SMMU as
> > discussed here:
> > 
> >     https://lkml.org/lkml/2014/4/27/346
> > 
> > More advanced functionality such as the dma-ranges property can easily
> > be added in a backwards-compatible way. In the absence of a dma-ranges
> > property it should be safe to default to the whole address space.
> > 
> 
> The basic binding looks fine, but I'd like it to be more explicit
> about dma-ranges. Most importantly, what does "the whole address space"
> mean?

The whole point was to leave out any mention of dma-ranges from the
binding until we've figured out more of the puzzle.

So what I was trying to avoid was another lengthy discussion on the
topic of dma-ranges. Oh well... =)

> A lot of IOMMUs have only 32-bit bus addresses when targetted
> by a bus master, it would also be normal for some to be smaller and
> some might even support 64-bit.
> 
> For the upstream side, I'd hope we always have access to the full
> physical memory, but since this is a brand-new binding, it should
> be straightforward to just ask for upstream dma-ranges properties
> to be set all the way up to the root to confirm that.
> 
> For downstream, we don't actually have a good place to put the
> dma-ranges property.

I'm not sure I understand what you mean by upstream and downstream in
this context.

> We can't put it into the iommu node, because  that would imply translating
> to the iommu's parent bus, not the iommu's own bus space.

My understanding was that the purpose of dma-ranges was to define a
mapping from one bus to another. So the general form of

	child-address  parent-address  child-size

Would be used to translate a region of size <child-size> from the
<child-address> (the I/O address space created by the IOMMU) to the
<parent-address> (physical address space).

> We also can't put it into the master, because dma-ranges is supposed to be
> in the parent bus.

I don't understand. From the above I would think that the master's node
is precisely where it belongs.

> Finally, it makes no sense to use the dma-ranges property of the master's
> parent bus, because that bus isn't actually involved in the translation.

My understanding here is mostly based on the OpenFirmware working group
proposal for the dma-ranges property[0]. I'll give another example to
try and clarify how I had imagined this to work:

	/ {
		#address-cells = <2>;
		#size-cells = <2>;

		iommu {
			/*
			 * This is somewhat unusual (or maybe not) in that we
			 * need 2 cells to represent the size of an address
			 * space that is 32 bits long.
			 */
			#address-cells = <1>;
			#size-cells = <2>;

			#iommu-cells = <1>;
		};

		master {
			iommus = <&/iommu 42>;
			/*
			 * Map I/O addresses 0 - 4 GiB to physical addresses
			 * 2 GiB - 6 GiB.
			 */
			dma-ranges = <0x00000000 0 0x80000000 1 0>;
		};
	};

This is somewhat incompatible with [0] in that #address-cells used to
parse the child address must be taken from the iommu node rather than
the child node. But that seems to me to be the only reasonable thing
to do, because after all the IOMMU creates a completely new address
space for the master.

[0]: http://www.openfirmware.org/ofwg/proposals/Closed/Accepted/410-it.txt

> My preferred option would be to always put the address range into
> the iommu descriptor, using the iommu's #address-cells.

That could become impossible to parse. I'm not sure if such hardware
actually exists, but if for some reason we have to split the address
range into two, then there's no longer any way to determine the size
needed for the specifier.

On the other hand what you propose makes it easy to represent multiple
master interfaces on a device. With a separate dma-ranges property how
can you define which ranges apply to which of the master interfaces?

Then again if address ranges can't be broken up in the first place, then
dma-ranges could be considered to be one entry per IOMMU in the iommus
property.

Thierry
Dave Martin May 19, 2014, 5:22 p.m. UTC | #5
On Mon, May 19, 2014 at 01:53:37PM +0100, Thierry Reding wrote:
> On Mon, May 19, 2014 at 12:26:35PM +0200, Arnd Bergmann wrote:
> > On Friday 16 May 2014 14:23:18 Thierry Reding wrote:
> > > From: Thierry Reding <treding@nvidia.com>
> > > 
> > > This commit introduces a generic device tree binding for IOMMU devices.
> > > Only a very minimal subset is described here, but it is enough to cover
> > > the requirements of both the Exynos System MMU and Tegra SMMU as
> > > discussed here:
> > > 
> > >     https://lkml.org/lkml/2014/4/27/346
> > > 
> > > More advanced functionality such as the dma-ranges property can easily
> > > be added in a backwards-compatible way. In the absence of a dma-ranges
> > > property it should be safe to default to the whole address space.
> > > 
> > 
> > The basic binding looks fine, but I'd like it to be more explicit
> > about dma-ranges. Most importantly, what does "the whole address space"
> > mean?
> 
> The whole point was to leave out any mention of dma-ranges from the
> binding until we've figured out more of the puzzle.
> 
> So what I was trying to avoid was another lengthy discussion on the
> topic of dma-ranges. Oh well... =)

Apologies for the silence on this topic...  I'm trying to do too many
things at the sime time as usual :/

> 
> > A lot of IOMMUs have only 32-bit bus addresses when targetted
> > by a bus master, it would also be normal for some to be smaller and
> > some might even support 64-bit.
> > 
> > For the upstream side, I'd hope we always have access to the full
> > physical memory, but since this is a brand-new binding, it should
> > be straightforward to just ask for upstream dma-ranges properties
> > to be set all the way up to the root to confirm that.
> > 
> > For downstream, we don't actually have a good place to put the
> > dma-ranges property.
> 
> I'm not sure I understand what you mean by upstream and downstream in
> this context.
> 
> > We can't put it into the iommu node, because  that would imply translating
> > to the iommu's parent bus, not the iommu's own bus space.
> 
> My understanding was that the purpose of dma-ranges was to define a
> mapping from one bus to another. So the general form of
> 
> 	child-address  parent-address  child-size
> 
> Would be used to translate a region of size <child-size> from the
> <child-address> (the I/O address space created by the IOMMU) to the
> <parent-address> (physical address space).
> 
> > We also can't put it into the master, because dma-ranges is supposed to be
> > in the parent bus.
> 
> I don't understand. From the above I would think that the master's node
> is precisely where it belongs.
> 
> > Finally, it makes no sense to use the dma-ranges property of the master's
> > parent bus, because that bus isn't actually involved in the translation.
> 
> My understanding here is mostly based on the OpenFirmware working group
> proposal for the dma-ranges property[0]. I'll give another example to
> try and clarify how I had imagined this to work:
> 
> 	/ {
> 		#address-cells = <2>;
> 		#size-cells = <2>;
> 
> 		iommu {
> 			/*
> 			 * This is somewhat unusual (or maybe not) in that we
> 			 * need 2 cells to represent the size of an address
> 			 * space that is 32 bits long.
> 			 */
> 			#address-cells = <1>;
> 			#size-cells = <2>;
> 
> 			#iommu-cells = <1>;
> 		};
> 
> 		master {
> 			iommus = <&/iommu 42>;

Comparing this with the other discussion thread, we have a similar
concept here, in that the iommu is made a logical parent, however...

Firstly, there's an implicit assumption here that the only kind of
thing the master could possibly be connected to is an IOMMU, with
no non-trivial interconnect in between.  I'm not sure this is going
to scale to complex SoCs.

If a range of Stream IDs may be issued (especially from something like
a PCIe RC where the stream ID may be a many-bit value), describing
the IDs individually may be impractical.

> 			/*
> 			 * Map I/O addresses 0 - 4 GiB to physical addresses
> 			 * 2 GiB - 6 GiB.
> 			 */
> 			dma-ranges = <0x00000000 0 0x80000000 1 0>;

... I concur with Arnd dma-ranges is now in the wrong place with
respect to the (overridden) child-parent relationship, and that if
this device masters to multiple destinations there's no possibility
of describing different ranges mappings for each.

> 		};
> 	};
> 
> This is somewhat incompatible with [0] in that #address-cells used to
> parse the child address must be taken from the iommu node rather than
> the child node. But that seems to me to be the only reasonable thing
> to do, because after all the IOMMU creates a completely new address
> space for the master.
> 
> [0]: http://www.openfirmware.org/ofwg/proposals/Closed/Accepted/410-it.txt
> 
> > My preferred option would be to always put the address range into
> > the iommu descriptor, using the iommu's #address-cells.
> 
> That could become impossible to parse. I'm not sure if such hardware
> actually exists, but if for some reason we have to split the address
> range into two, then there's no longer any way to determine the size
> needed for the specifier.
> 
> On the other hand what you propose makes it easy to represent multiple
> master interfaces on a device. With a separate dma-ranges property how
> can you define which ranges apply to which of the master interfaces?

In theory, you could describe split ranges simply by multiplying out:

	<&/iommu 42 sub-range-1>,
	<&/iommu 42 sub-range-2>,
	...

This is only manageable if the number of subranges for each
(iommu,streamID) mapping is small (and usually 1).

Of course, I hope the number of subranges normally _is_ small...


Cheers
---Dave
	
> Then again if address ranges can't be broken up in the first place, then
> dma-ranges could be considered to be one entry per IOMMU in the iommus
> property.
> 
> Thierry
Arnd Bergmann May 19, 2014, 6:34 p.m. UTC | #6
On Monday 19 May 2014 14:53:37 Thierry Reding wrote:
> On Mon, May 19, 2014 at 12:26:35PM +0200, Arnd Bergmann wrote:
> > On Friday 16 May 2014 14:23:18 Thierry Reding wrote:
> > > From: Thierry Reding <treding@nvidia.com>
> > > 
> > > This commit introduces a generic device tree binding for IOMMU devices.
> > > Only a very minimal subset is described here, but it is enough to cover
> > > the requirements of both the Exynos System MMU and Tegra SMMU as
> > > discussed here:
> > > 
> > >     https://lkml.org/lkml/2014/4/27/346
> > > 
> > > More advanced functionality such as the dma-ranges property can easily
> > > be added in a backwards-compatible way. In the absence of a dma-ranges
> > > property it should be safe to default to the whole address space.
> > > 
> > 
> > The basic binding looks fine, but I'd like it to be more explicit
> > about dma-ranges. Most importantly, what does "the whole address space"
> > mean?
> 
> The whole point was to leave out any mention of dma-ranges from the
> binding until we've figured out more of the puzzle.
> 
> So what I was trying to avoid was another lengthy discussion on the
> topic of dma-ranges. Oh well... =)

I think that can't work, because we need a way to specify the
ranges for some iommu drivers. We have to make sure we at least
don't prevent it from working.

> > A lot of IOMMUs have only 32-bit bus addresses when targetted
> > by a bus master, it would also be normal for some to be smaller and
> > some might even support 64-bit.
> > 
> > For the upstream side, I'd hope we always have access to the full
> > physical memory, but since this is a brand-new binding, it should
> > be straightforward to just ask for upstream dma-ranges properties
> > to be set all the way up to the root to confirm that.
> > 
> > For downstream, we don't actually have a good place to put the
> > dma-ranges property.
> 
> I'm not sure I understand what you mean by upstream and downstream in
> this context.

Upstream -> close to memory
Downstream -> close to DMA master

> > We can't put it into the iommu node, because  that would imply translating
> > to the iommu's parent bus, not the iommu's own bus space.
> 
> My understanding was that the purpose of dma-ranges was to define a
> mapping from one bus to another. So the general form of
> 
> 	child-address  parent-address  child-size
> 
> Would be used to translate a region of size <child-size> from the
> <child-address> (the I/O address space created by the IOMMU) to the
> <parent-address> (physical address space).

That's how it's defined for normal buses, we haven't defined it for
IOMMUs yet.

> > We also can't put it into the master, because dma-ranges is supposed to be
> > in the parent bus.
> 
> I don't understand. From the above I would think that the master's node
> is precisely where it belongs.

The way that I read the binding, the way that dma-ranges belongs into the parent
of the master, according. On normal buses (e.g. classic PCI), all DMA masters
see the same address space. The bus defines how a transaction started on one
of its children gets translated to the address space of its parent. This
is similar to 'ranges', which defines how an MMIO address initiated on the
parent of a bus gets translated to an address understood by its children.

My interpretation at least matches the defintion of of_dma_get_range(),
but then again Santosh added that according to my comments.

> > Finally, it makes no sense to use the dma-ranges property of the master's
> > parent bus, because that bus isn't actually involved in the translation.
> 
> My understanding here is mostly based on the OpenFirmware working group
> proposal for the dma-ranges property[0]. I'll give another example to
> try and clarify how I had imagined this to work:
> 
> 	/ {
> 		#address-cells = <2>;
> 		#size-cells = <2>;
> 
> 		iommu {
> 			/*
> 			 * This is somewhat unusual (or maybe not) in that we
> 			 * need 2 cells to represent the size of an address
> 			 * space that is 32 bits long.
> 			 */
> 			#address-cells = <1>;
> 			#size-cells = <2>;

You should never need #size-cells > #address-cells

> 			#iommu-cells = <1>;
> 		};
> 
> 		master {
> 			iommus = <&/iommu 42>;
> 			/*
> 			 * Map I/O addresses 0 - 4 GiB to physical addresses
> 			 * 2 GiB - 6 GiB.
> 			 */
> 			dma-ranges = <0x00000000 0 0x80000000 1 0>;
> 		};
> 	};
> 
> This is somewhat incompatible with [0] in that #address-cells used to
> parse the child address must be taken from the iommu node rather than
> the child node. But that seems to me to be the only reasonable thing
> to do, because after all the IOMMU creates a completely new address
> space for the master.
> 
> [0]: http://www.openfirmware.org/ofwg/proposals/Closed/Accepted/410-it.txt

I don't think you can have a dma-ranges without a #address-cells and
#size-cells property in the same node. In your example, I'd also expect
a child node below 'master' that then interprets the address space
made up by dma-ranges.

As a comment on the numbers in your example, I don't expect to ever
see a 4GB IOMMU address space that doesn't start at an offset. Instead
I'd expect either addresses that encode a device ID, or those that
are just a subset of the parent address space, with non-overlapping
bus addresses for each master.

> > My preferred option would be to always put the address range into
> > the iommu descriptor, using the iommu's #address-cells.
> 
> That could become impossible to parse. I'm not sure if such hardware
> actually exists, but if for some reason we have to split the address
> range into two, then there's no longer any way to determine the size
> needed for the specifier.
> 
> On the other hand what you propose makes it easy to represent multiple
> master interfaces on a device. With a separate dma-ranges property how
> can you define which ranges apply to which of the master interfaces?

Well, you could have multiple links to the same IOMMU if you want to 
do that, and define that there must be at least one dma-ranges entry
for each IOMMU entry (although not necessarily the other way round,
you could have direct ranges in addition to translated ones.

> Then again if address ranges can't be broken up in the first place, then
> dma-ranges could be considered to be one entry per IOMMU in the iommus
> property.

Let me do another example, with the address merged into the iommu
references:

/ {
	#address-cells = <2>; // 64-bit address
	#size-cells = <2>;

	iommu@a {
		#address-cells = <2>; // 1 cell ID, 1 cell address
		#size-cells = <1>;

		// no need for #iommu-cells
	};


	master@b {
		iommus = <&/iommu@a // iommu
			0x23  	     // ID
			0x40000000   // window start
			0x10000000>; //window size
	};
};

A disadvantage of this model would be that for all ARM SMMU users, we'd
have to always list a 4GB address space, which is kind of redundant.

	Arnd
Thierry Reding May 19, 2014, 8:32 p.m. UTC | #7
On Mon, May 19, 2014 at 06:22:31PM +0100, Dave Martin wrote:
> On Mon, May 19, 2014 at 01:53:37PM +0100, Thierry Reding wrote:
[...]
> > My understanding here is mostly based on the OpenFirmware working group
> > proposal for the dma-ranges property[0]. I'll give another example to
> > try and clarify how I had imagined this to work:
> > 
> > 	/ {
> > 		#address-cells = <2>;
> > 		#size-cells = <2>;
> > 
> > 		iommu {
> > 			/*
> > 			 * This is somewhat unusual (or maybe not) in that we
> > 			 * need 2 cells to represent the size of an address
> > 			 * space that is 32 bits long.
> > 			 */
> > 			#address-cells = <1>;
> > 			#size-cells = <2>;
> > 
> > 			#iommu-cells = <1>;
> > 		};
> > 
> > 		master {
> > 			iommus = <&/iommu 42>;
> 
> Comparing this with the other discussion thread, we have a similar
> concept here, in that the iommu is made a logical parent, however...
> 
> Firstly, there's an implicit assumption here that the only kind of
> thing the master could possibly be connected to is an IOMMU, with
> no non-trivial interconnect in between.  I'm not sure this is going
> to scale to complex SoCs.

Here we go again. We're now in the very same bad spot that we've been in
so many times before. There are at least two SoCs that we know do not
require anything fancy, yet we're blocking adding support for those use
cases because we think that at some point some IOMMU may require more
than that. But at the same time it seems like we don't have enough data
about what exactly that is, so we keep speculating. At this rate we're
making it impossible to get a reasonable feature set supported upstream.

That's very frustrating.

> If a range of Stream IDs may be issued (especially from something like
> a PCIe RC where the stream ID may be a many-bit value), describing
> the IDs individually may be impractical.

The IOMMU specifier is completely specific to the IOMMU. If an IOMMU has
a need to represent the stream IDs as a range there's nothing keeping it
from defining the specifier accordingly.

Thierry
Thierry Reding May 19, 2014, 8:59 p.m. UTC | #8
On Mon, May 19, 2014 at 08:34:07PM +0200, Arnd Bergmann wrote:
> On Monday 19 May 2014 14:53:37 Thierry Reding wrote:
> > On Mon, May 19, 2014 at 12:26:35PM +0200, Arnd Bergmann wrote:
> > > On Friday 16 May 2014 14:23:18 Thierry Reding wrote:
> > > > From: Thierry Reding <treding@nvidia.com>
> > > > 
> > > > This commit introduces a generic device tree binding for IOMMU devices.
> > > > Only a very minimal subset is described here, but it is enough to cover
> > > > the requirements of both the Exynos System MMU and Tegra SMMU as
> > > > discussed here:
> > > > 
> > > >     https://lkml.org/lkml/2014/4/27/346
> > > > 
> > > > More advanced functionality such as the dma-ranges property can easily
> > > > be added in a backwards-compatible way. In the absence of a dma-ranges
> > > > property it should be safe to default to the whole address space.
> > > > 
> > > 
> > > The basic binding looks fine, but I'd like it to be more explicit
> > > about dma-ranges. Most importantly, what does "the whole address space"
> > > mean?
> > 
> > The whole point was to leave out any mention of dma-ranges from the
> > binding until we've figured out more of the puzzle.
> > 
> > So what I was trying to avoid was another lengthy discussion on the
> > topic of dma-ranges. Oh well... =)
> 
> I think that can't work, because we need a way to specify the
> ranges for some iommu drivers. We have to make sure we at least
> don't prevent it from working.

That was precisely why I wanted to let this out of the binding for now
so that we can actually focus on making existing hardware work rather
than speculate about what we may or may not need.

If you think the current minimal binding will cause future use-cases to
break, then we should change it to avoid that. What you're proposing is
to make the binding more complex on the assumption that we'll get it
right.

Wouldn't it be safer to keep things to the bare minimum necessary to
represent hardware that we have access to and understand now, rather
than speculating about what may be necessary at some point. I'd much
prefer to add complexity on an as-needed basis and when we have a better
understand of where we're headed.

> > > Finally, it makes no sense to use the dma-ranges property of the master's
> > > parent bus, because that bus isn't actually involved in the translation.
> > 
> > My understanding here is mostly based on the OpenFirmware working group
> > proposal for the dma-ranges property[0]. I'll give another example to
> > try and clarify how I had imagined this to work:
> > 
> > 	/ {
> > 		#address-cells = <2>;
> > 		#size-cells = <2>;
> > 
> > 		iommu {
> > 			/*
> > 			 * This is somewhat unusual (or maybe not) in that we
> > 			 * need 2 cells to represent the size of an address
> > 			 * space that is 32 bits long.
> > 			 */
> > 			#address-cells = <1>;
> > 			#size-cells = <2>;
> 
> You should never need #size-cells > #address-cells

That was always my impression as well. But how then do you represent the
full 4 GiB address space in a 32-bit system? It starts at 0 and ends at
4 GiB - 1, which makes it 4 GiB large. That's:

	<0 1 0>

With #address-cells = <1> and #size-cells = <1> the best you can do is:

	<0 0xffffffff>

but that's not accurate.

> > 			#iommu-cells = <1>;
> > 		};
> > 
> > 		master {
> > 			iommus = <&/iommu 42>;
> > 			/*
> > 			 * Map I/O addresses 0 - 4 GiB to physical addresses
> > 			 * 2 GiB - 6 GiB.
> > 			 */
> > 			dma-ranges = <0x00000000 0 0x80000000 1 0>;
> > 		};
> > 	};
> > 
> > This is somewhat incompatible with [0] in that #address-cells used to
> > parse the child address must be taken from the iommu node rather than
> > the child node. But that seems to me to be the only reasonable thing
> > to do, because after all the IOMMU creates a completely new address
> > space for the master.
> > 
> > [0]: http://www.openfirmware.org/ofwg/proposals/Closed/Accepted/410-it.txt
> 
> I don't think you can have a dma-ranges without a #address-cells and
> #size-cells property in the same node. In your example, I'd also expect
> a child node below 'master' that then interprets the address space
> made up by dma-ranges.

Okay, so what Dave and you have been saying strongly indicates that
dma-ranges isn't the right thing to use here.

> As a comment on the numbers in your example, I don't expect to ever
> see a 4GB IOMMU address space that doesn't start at an offset. Instead
> I'd expect either addresses that encode a device ID, or those that
> are just a subset of the parent address space, with non-overlapping
> bus addresses for each master.

As I understand the Tegra SMMU allows each of the clients to be assigned
a separate address space (4 GiB on earlier generations and 16 GiB on new
generations) and all addresses in each address space can be mapped to
arbitrary physical addresses.

> > > My preferred option would be to always put the address range into
> > > the iommu descriptor, using the iommu's #address-cells.
> > 
> > That could become impossible to parse. I'm not sure if such hardware
> > actually exists, but if for some reason we have to split the address
> > range into two, then there's no longer any way to determine the size
> > needed for the specifier.
> > 
> > On the other hand what you propose makes it easy to represent multiple
> > master interfaces on a device. With a separate dma-ranges property how
> > can you define which ranges apply to which of the master interfaces?
> 
> Well, you could have multiple links to the same IOMMU if you want to 
> do that, and define that there must be at least one dma-ranges entry
> for each IOMMU entry (although not necessarily the other way round,
> you could have direct ranges in addition to translated ones.
> 
> > Then again if address ranges can't be broken up in the first place, then
> > dma-ranges could be considered to be one entry per IOMMU in the iommus
> > property.
> 
> Let me do another example, with the address merged into the iommu
> references:
> 
> / {
> 	#address-cells = <2>; // 64-bit address
> 	#size-cells = <2>;
> 
> 	iommu@a {
> 		#address-cells = <2>; // 1 cell ID, 1 cell address
> 		#size-cells = <1>;
> 
> 		// no need for #iommu-cells
> 	};
> 
> 
> 	master@b {
> 		iommus = <&/iommu@a // iommu
> 			0x23  	     // ID
> 			0x40000000   // window start
> 			0x10000000>; //window size
> 	};
> };
> 
> A disadvantage of this model would be that for all ARM SMMU users, we'd
> have to always list a 4GB address space, which is kind of redundant.

Isn't that the equivalent of the "whole address space" default that I
mentioned in the commit message? Could this be handled with
#address-cells = <1> and #size-cells = <0> in the iommu node? That way
the only cell that needs to be specified in iommus would be the ID and
the redundant address space could be simply omitted from DT since it
would be implied by the compatible string.

Thierry
Arnd Bergmann May 20, 2014, 10:04 a.m. UTC | #9
On Monday 19 May 2014 22:59:46 Thierry Reding wrote:
> On Mon, May 19, 2014 at 08:34:07PM +0200, Arnd Bergmann wrote:
> > On Monday 19 May 2014 14:53:37 Thierry Reding wrote:
> > > On Mon, May 19, 2014 at 12:26:35PM +0200, Arnd Bergmann wrote:
> > > > On Friday 16 May 2014 14:23:18 Thierry Reding wrote:
> > > > > From: Thierry Reding <treding@nvidia.com>
> > > > > 
> > > > > This commit introduces a generic device tree binding for IOMMU devices.
> > > > > Only a very minimal subset is described here, but it is enough to cover
> > > > > the requirements of both the Exynos System MMU and Tegra SMMU as
> > > > > discussed here:
> > > > > 
> > > > >     https://lkml.org/lkml/2014/4/27/346
> > > > > 
> > > > > More advanced functionality such as the dma-ranges property can easily
> > > > > be added in a backwards-compatible way. In the absence of a dma-ranges
> > > > > property it should be safe to default to the whole address space.
> > > > > 
> > > > 
> > > > The basic binding looks fine, but I'd like it to be more explicit
> > > > about dma-ranges. Most importantly, what does "the whole address space"
> > > > mean?
> > > 
> > > The whole point was to leave out any mention of dma-ranges from the
> > > binding until we've figured out more of the puzzle.
> > > 
> > > So what I was trying to avoid was another lengthy discussion on the
> > > topic of dma-ranges. Oh well... =)
> > 
> > I think that can't work, because we need a way to specify the
> > ranges for some iommu drivers. We have to make sure we at least
> > don't prevent it from working.
> 
> That was precisely why I wanted to let this out of the binding for now
> so that we can actually focus on making existing hardware work rather
> than speculate about what we may or may not need.
> 
> If you think the current minimal binding will cause future use-cases to
> break, then we should change it to avoid that. What you're proposing is
> to make the binding more complex on the assumption that we'll get it
> right.
> 
> Wouldn't it be safer to keep things to the bare minimum necessary to
> represent hardware that we have access to and understand now, rather
> than speculating about what may be necessary at some point. I'd much
> prefer to add complexity on an as-needed basis and when we have a better
> understand of where we're headed.

I just want to think it through for the cases we know about. We don't
have to implement it all at once, but I think there is a danger in making
an important binding too simple or too complicated, both of which are
equally bad.

After giving the ranges stuff some more thought, I have come to the
conclusion that using #iommu-cells should work fine for almost
all cases, including windowed iommus, because the window is not
actually needed in the device, but only in the iommu, wihch is of course
free to interpret the arguments as addresses.
 
> > > > Finally, it makes no sense to use the dma-ranges property of the master's
> > > > parent bus, because that bus isn't actually involved in the translation.
> > > 
> > > My understanding here is mostly based on the OpenFirmware working group
> > > proposal for the dma-ranges property[0]. I'll give another example to
> > > try and clarify how I had imagined this to work:
> > > 
> > > 	/ {
> > > 		#address-cells = <2>;
> > > 		#size-cells = <2>;
> > > 
> > > 		iommu {
> > > 			/*
> > > 			 * This is somewhat unusual (or maybe not) in that we
> > > 			 * need 2 cells to represent the size of an address
> > > 			 * space that is 32 bits long.
> > > 			 */
> > > 			#address-cells = <1>;
> > > 			#size-cells = <2>;
> > 
> > You should never need #size-cells > #address-cells
> 
> That was always my impression as well. But how then do you represent the
> full 4 GiB address space in a 32-bit system? It starts at 0 and ends at
> 4 GiB - 1, which makes it 4 GiB large. That's:
> 
> 	<0 1 0>
> 
> With #address-cells = <1> and #size-cells = <1> the best you can do is:
> 
> 	<0 0xffffffff>
> 
> but that's not accurate.

I think we've done both in the past, either extended #size-cells or
taken 0xffffffff as a special token. Note that in your example,
the iommu actually needs #address-cells = <2> anyway.

> > > 			#iommu-cells = <1>;
> > > 		};
> > > 
> > > 		master {
> > > 			iommus = <&/iommu 42>;
> > > 			/*
> > > 			 * Map I/O addresses 0 - 4 GiB to physical addresses
> > > 			 * 2 GiB - 6 GiB.
> > > 			 */
> > > 			dma-ranges = <0x00000000 0 0x80000000 1 0>;
> > > 		};
> > > 	};
> > > 
> > > This is somewhat incompatible with [0] in that #address-cells used to
> > > parse the child address must be taken from the iommu node rather than
> > > the child node. But that seems to me to be the only reasonable thing
> > > to do, because after all the IOMMU creates a completely new address
> > > space for the master.
> > > 
> > > [0]: http://www.openfirmware.org/ofwg/proposals/Closed/Accepted/410-it.txt
> > 
> > I don't think you can have a dma-ranges without a #address-cells and
> > #size-cells property in the same node. In your example, I'd also expect
> > a child node below 'master' that then interprets the address space
> > made up by dma-ranges.
> 
> Okay, so what Dave and you have been saying strongly indicates that
> dma-ranges isn't the right thing to use here.

There are cases where I think we could use dma-ranges, but not in the
way you have it above. I'll get to that below.

> > As a comment on the numbers in your example, I don't expect to ever
> > see a 4GB IOMMU address space that doesn't start at an offset. Instead
> > I'd expect either addresses that encode a device ID, or those that
> > are just a subset of the parent address space, with non-overlapping
> > bus addresses for each master.
> 
> As I understand the Tegra SMMU allows each of the clients to be assigned
> a separate address space (4 GiB on earlier generations and 16 GiB on new
> generations) and all addresses in each address space can be mapped to
> arbitrary physical addresses.

Right, so this is not a windowed IOMMU, and you wouldn't need to encode
the address at all.

> > > > My preferred option would be to always put the address range into
> > > > the iommu descriptor, using the iommu's #address-cells.
> > > 
> > > That could become impossible to parse. I'm not sure if such hardware
> > > actually exists, but if for some reason we have to split the address
> > > range into two, then there's no longer any way to determine the size
> > > needed for the specifier.
> > > 
> > > On the other hand what you propose makes it easy to represent multiple
> > > master interfaces on a device. With a separate dma-ranges property how
> > > can you define which ranges apply to which of the master interfaces?
> > 
> > Well, you could have multiple links to the same IOMMU if you want to 
> > do that, and define that there must be at least one dma-ranges entry
> > for each IOMMU entry (although not necessarily the other way round,
> > you could have direct ranges in addition to translated ones.
> > 
> > > Then again if address ranges can't be broken up in the first place, then
> > > dma-ranges could be considered to be one entry per IOMMU in the iommus
> > > property.
> > 
> > Let me do another example, with the address merged into the iommu
> > references:
> > 
> > / {
> > 	#address-cells = <2>; // 64-bit address
> > 	#size-cells = <2>;
> > 
> > 	iommu@a {
> > 		#address-cells = <2>; // 1 cell ID, 1 cell address
> > 		#size-cells = <1>;
> > 
> > 		// no need for #iommu-cells
> > 	};
> > 
> > 
> > 	master@b {
> > 		iommus = <&/iommu@a // iommu
> > 			0x23  	     // ID
> > 			0x40000000   // window start
> > 			0x10000000>; //window size
> > 	};
> > };
> > 
> > A disadvantage of this model would be that for all ARM SMMU users, we'd
> > have to always list a 4GB address space, which is kind of redundant.
> 
> Isn't that the equivalent of the "whole address space" default that I
> mentioned in the commit message? Could this be handled with
> #address-cells = <1> and #size-cells = <0> in the iommu node? That way
> the only cell that needs to be specified in iommus would be the ID and
> the redundant address space could be simply omitted from DT since it
> would be implied by the compatible string.

Yes, that's right. After giving it some more thought, I agree that the
#size-cells case makes sense as an indicator for cases where the master
doesn't have to care about the address at all and no further translation
with dma-ranges is possible.

Back to my example with dma-ranges, the simplest case would an IOMMU
that has an ID per bus, with multiple masters sharing that ID:

/ {
	#address-cells = <1>;
	#size-cells = <1>;

	iommu {
		#address-cells = <2>; // ID, address
		#size-cells = <2>;
	};

	master@a {
		iommus = <& {/iommu} 0xa 0x0  0x1 0x0>; // 4GB ID '0xa'
	}

	bus1 {
		#address-cells = <1>;
		#size-cells = <1>;
		ranges;
		iommus = <& {/iommu} 0 0   0x100 0>; // all IDs
		dma-ranges = <0  0xb 0  1 0>; // child devices use ID '0xb'

		anothermaster {
			// no iommus link, implied by dma-ranges above
		};
	};
};

If you set #size-cells=<0>, you can't really do that but instead would
require an iommus property in each master, which is not a big concern
either.

The main advantage I think would be for IOMMUs that use the PCI b/d/f
numbers as IDs. These can have #address-cells=<3>, #size-cells=<2>
and have an empty dma-ranges property in the PCI host bridge node,
and interpret this as using the same encoding as the PCI BARs in
the ranges property.

	Arnd
Arnd Bergmann May 20, 2014, 10:08 a.m. UTC | #10
On Monday 19 May 2014 22:32:33 Thierry Reding wrote:
> On Mon, May 19, 2014 at 06:22:31PM +0100, Dave Martin wrote:
> > On Mon, May 19, 2014 at 01:53:37PM +0100, Thierry Reding wrote:
> [...]
> > > My understanding here is mostly based on the OpenFirmware working group
> > > proposal for the dma-ranges property[0]. I'll give another example to
> > > try and clarify how I had imagined this to work:
> > > 
> > >     / {
> > >             #address-cells = <2>;
> > >             #size-cells = <2>;
> > > 
> > >             iommu {
> > >                     /*
> > >                      * This is somewhat unusual (or maybe not) in that we
> > >                      * need 2 cells to represent the size of an address
> > >                      * space that is 32 bits long.
> > >                      */
> > >                     #address-cells = <1>;
> > >                     #size-cells = <2>;
> > > 
> > >                     #iommu-cells = <1>;
> > >             };
> > > 
> > >             master {
> > >                     iommus = <&/iommu 42>;
> > 
> > Comparing this with the other discussion thread, we have a similar
> > concept here, in that the iommu is made a logical parent, however...
> > 
> > Firstly, there's an implicit assumption here that the only kind of
> > thing the master could possibly be connected to is an IOMMU, with
> > no non-trivial interconnect in between.  I'm not sure this is going
> > to scale to complex SoCs.
> 
> Here we go again. We're now in the very same bad spot that we've been in
> so many times before. There are at least two SoCs that we know do not
> require anything fancy, yet we're blocking adding support for those use
> cases because we think that at some point some IOMMU may require more
> than that. But at the same time it seems like we don't have enough data
> about what exactly that is, so we keep speculating. At this rate we're
> making it impossible to get a reasonable feature set supported upstream.
> 
> That's very frustrating.

I agree. While I just commented that I want to think through how this
would look for other IOMMUs, the generic case of all masters that Dave
wants is going overboard with the complexity and we'd be better off
deferring this to whenever someone needs it, which I assume is never.

> > If a range of Stream IDs may be issued (especially from something like
> > a PCIe RC where the stream ID may be a many-bit value), describing
> > the IDs individually may be impractical.
> 
> The IOMMU specifier is completely specific to the IOMMU. If an IOMMU has
> a need to represent the stream IDs as a range there's nothing keeping it
> from defining the specifier accordingly.

If we make the IOMMU address space look like the PCI address space, we
can have a simple representation for this in DT. For the code, I assume
that we will always have to treat PCI and platform devices differently.

	Arnd
Thierry Reding May 20, 2014, 11:05 a.m. UTC | #11
On Tue, May 20, 2014 at 12:04:54PM +0200, Arnd Bergmann wrote:
> On Monday 19 May 2014 22:59:46 Thierry Reding wrote:
> > On Mon, May 19, 2014 at 08:34:07PM +0200, Arnd Bergmann wrote:
> > > On Monday 19 May 2014 14:53:37 Thierry Reding wrote:
> > > > On Mon, May 19, 2014 at 12:26:35PM +0200, Arnd Bergmann wrote:
> > > > > On Friday 16 May 2014 14:23:18 Thierry Reding wrote:
> > > > > > From: Thierry Reding <treding@nvidia.com>
[...]
> > > > > Finally, it makes no sense to use the dma-ranges property of the master's
> > > > > parent bus, because that bus isn't actually involved in the translation.
> > > > 
> > > > My understanding here is mostly based on the OpenFirmware working group
> > > > proposal for the dma-ranges property[0]. I'll give another example to
> > > > try and clarify how I had imagined this to work:
> > > > 
> > > > 	/ {
> > > > 		#address-cells = <2>;
> > > > 		#size-cells = <2>;
> > > > 
> > > > 		iommu {
> > > > 			/*
> > > > 			 * This is somewhat unusual (or maybe not) in that we
> > > > 			 * need 2 cells to represent the size of an address
> > > > 			 * space that is 32 bits long.
> > > > 			 */
> > > > 			#address-cells = <1>;
> > > > 			#size-cells = <2>;
> > > 
> > > You should never need #size-cells > #address-cells
> > 
> > That was always my impression as well. But how then do you represent the
> > full 4 GiB address space in a 32-bit system? It starts at 0 and ends at
> > 4 GiB - 1, which makes it 4 GiB large. That's:
> > 
> > 	<0 1 0>
> > 
> > With #address-cells = <1> and #size-cells = <1> the best you can do is:
> > 
> > 	<0 0xffffffff>
> > 
> > but that's not accurate.
> 
> I think we've done both in the past, either extended #size-cells or
> taken 0xffffffff as a special token. Note that in your example,
> the iommu actually needs #address-cells = <2> anyway.

But it needs #address-cells = <2> only to encode an ID in addition to
the address. If this was a single-master IOMMU then there'd be no need
for the ID.

This really isn't specific to IOMMUs though. It really applies to all
cases where #address-cells and #size-cells are parsed. While it's way
too late to change the semantics of standard properties, perhaps for
properties that are introduced it would make more sense to encode this
as a <start limit> pair, both of length #address-cells, to avoid this
type of corner case.

On the other hand doing so would make it inconsistent with existing
bindings which may not be desirable either.

But since it seems like we're headed for something completely different
for IOMMUs, perhaps it would be worth considering to describe the IOMMU
range as <start limit>. Since it will likely use #iommu-cells rather
than #address-cells we have an opportunity to change the semantics.

> > > Let me do another example, with the address merged into the iommu
> > > references:
> > > 
> > > / {
> > > 	#address-cells = <2>; // 64-bit address
> > > 	#size-cells = <2>;
> > > 
> > > 	iommu@a {
> > > 		#address-cells = <2>; // 1 cell ID, 1 cell address
> > > 		#size-cells = <1>;
> > > 
> > > 		// no need for #iommu-cells
> > > 	};
> > > 
> > > 
> > > 	master@b {
> > > 		iommus = <&/iommu@a // iommu
> > > 			0x23  	     // ID
> > > 			0x40000000   // window start
> > > 			0x10000000>; //window size
> > > 	};
> > > };
> > > 
> > > A disadvantage of this model would be that for all ARM SMMU users, we'd
> > > have to always list a 4GB address space, which is kind of redundant.
> > 
> > Isn't that the equivalent of the "whole address space" default that I
> > mentioned in the commit message? Could this be handled with
> > #address-cells = <1> and #size-cells = <0> in the iommu node? That way
> > the only cell that needs to be specified in iommus would be the ID and
> > the redundant address space could be simply omitted from DT since it
> > would be implied by the compatible string.
> 
> Yes, that's right. After giving it some more thought, I agree that the
> #size-cells case makes sense as an indicator for cases where the master
> doesn't have to care about the address at all and no further translation
> with dma-ranges is possible.
> 
> Back to my example with dma-ranges, the simplest case would an IOMMU
> that has an ID per bus, with multiple masters sharing that ID:
> 
> / {
> 	#address-cells = <1>;
> 	#size-cells = <1>;
> 
> 	iommu {
> 		#address-cells = <2>; // ID, address
> 		#size-cells = <2>;
> 	};
> 
> 	master@a {
> 		iommus = <& {/iommu} 0xa 0x0  0x1 0x0>; // 4GB ID '0xa'
> 	}
> 
> 	bus1 {
> 		#address-cells = <1>;
> 		#size-cells = <1>;
> 		ranges;
> 		iommus = <& {/iommu} 0 0   0x100 0>; // all IDs
> 		dma-ranges = <0  0xb 0  1 0>; // child devices use ID '0xb'
> 
> 		anothermaster {
> 			// no iommus link, implied by dma-ranges above
> 		};
> 	};
> };
> 
> If you set #size-cells=<0>, you can't really do that but instead would
> require an iommus property in each master, which is not a big concern
> either.

I'm not sure I understand the need for 0x100 (all IDs) entry above. If
bus1's iommus property applies to all devices on the bus, why can't the
ID 0xb be listed in the iommus property?

	bus1 {
		#address-cells = <1>;
		#size-cells = <1>;
		ranges;
		iommus = <&{/iommu} 0xb 0  0x1 0x0>; // 4GB ID '0xb'
		dma-ranges = <0  0xb 0  0x1 0x0>;

		anothermaster {
			...
		};
	};

In which case I guess dma-ranges would be redundant.

> The main advantage I think would be for IOMMUs that use the PCI b/d/f
> numbers as IDs. These can have #address-cells=<3>, #size-cells=<2>
> and have an empty dma-ranges property in the PCI host bridge node,
> and interpret this as using the same encoding as the PCI BARs in
> the ranges property.

I'm somewhat confused here, since you said earlier:

> After giving the ranges stuff some more thought, I have come to the
> conclusion that using #iommu-cells should work fine for almost
> all cases, including windowed iommus, because the window is not
> actually needed in the device, but only in the iommu, wihch is of course
> free to interpret the arguments as addresses.

But now you seem to be saying that we should still be using the
#address-cells and #size-cells properties in the IOMMU node to determine
the length of the specifier.

Thierry
Arnd Bergmann May 20, 2014, 11:15 a.m. UTC | #12
On Tuesday 20 May 2014 13:05:37 Thierry Reding wrote:
> On Tue, May 20, 2014 at 12:04:54PM +0200, Arnd Bergmann wrote:
> > On Monday 19 May 2014 22:59:46 Thierry Reding wrote:
> > > On Mon, May 19, 2014 at 08:34:07PM +0200, Arnd Bergmann wrote:
> > > > On Monday 19 May 2014 14:53:37 Thierry Reding wrote:
> > > > > On Mon, May 19, 2014 at 12:26:35PM +0200, Arnd Bergmann wrote:
> > > > > > On Friday 16 May 2014 14:23:18 Thierry Reding wrote:
> > > > > > > From: Thierry Reding <treding@nvidia.com>
> [...]
> > > > > > Finally, it makes no sense to use the dma-ranges property of the master's
> > > > > > parent bus, because that bus isn't actually involved in the translation.
> > > > > 
> > > > > My understanding here is mostly based on the OpenFirmware working group
> > > > > proposal for the dma-ranges property[0]. I'll give another example to
> > > > > try and clarify how I had imagined this to work:
> > > > > 
> > > > > 	/ {
> > > > > 		#address-cells = <2>;
> > > > > 		#size-cells = <2>;
> > > > > 
> > > > > 		iommu {
> > > > > 			/*
> > > > > 			 * This is somewhat unusual (or maybe not) in that we
> > > > > 			 * need 2 cells to represent the size of an address
> > > > > 			 * space that is 32 bits long.
> > > > > 			 */
> > > > > 			#address-cells = <1>;
> > > > > 			#size-cells = <2>;
> > > > 
> > > > You should never need #size-cells > #address-cells
> > > 
> > > That was always my impression as well. But how then do you represent the
> > > full 4 GiB address space in a 32-bit system? It starts at 0 and ends at
> > > 4 GiB - 1, which makes it 4 GiB large. That's:
> > > 
> > > 	<0 1 0>
> > > 
> > > With #address-cells = <1> and #size-cells = <1> the best you can do is:
> > > 
> > > 	<0 0xffffffff>
> > > 
> > > but that's not accurate.
> > 
> > I think we've done both in the past, either extended #size-cells or
> > taken 0xffffffff as a special token. Note that in your example,
> > the iommu actually needs #address-cells = <2> anyway.
> 
> But it needs #address-cells = <2> only to encode an ID in addition to
> the address. If this was a single-master IOMMU then there'd be no need
> for the ID.

Right. But for a single-master IOMMU, there is no need to specify
any additional data, it could have #address-cells=<0> if we take the
optimization you suggested.

> This really isn't specific to IOMMUs though. It really applies to all
> cases where #address-cells and #size-cells are parsed. While it's way
> too late to change the semantics of standard properties, perhaps for
> properties that are introduced it would make more sense to encode this
> as a <start limit> pair, both of length #address-cells, to avoid this
> type of corner case.
> 
> On the other hand doing so would make it inconsistent with existing
> bindings which may not be desirable either.
> 
> But since it seems like we're headed for something completely different
> for IOMMUs, perhaps it would be worth considering to describe the IOMMU
> range as <start limit>. Since it will likely use #iommu-cells rather
> than #address-cells we have an opportunity to change the semantics.

I'd still prefer #address-cells/#size-cells over #iommu-cells.

> > / {
> > 	#address-cells = <1>;
> > 	#size-cells = <1>;
> > 
> > 	iommu {
> > 		#address-cells = <2>; // ID, address
> > 		#size-cells = <2>;
> > 	};
> > 
> > 	master@a {
> > 		iommus = <& {/iommu} 0xa 0x0  0x1 0x0>; // 4GB ID '0xa'
> > 	}
> > 
> > 	bus1 {
> > 		#address-cells = <1>;
> > 		#size-cells = <1>;
> > 		ranges;
> > 		iommus = <& {/iommu} 0 0   0x100 0>; // all IDs
> > 		dma-ranges = <0  0xb 0  1 0>; // child devices use ID '0xb'
> > 
> > 		anothermaster {
> > 			// no iommus link, implied by dma-ranges above
> > 		};
> > 	};
> > };
> > 
> > If you set #size-cells=<0>, you can't really do that but instead would
> > require an iommus property in each master, which is not a big concern
> > either.
> 
> I'm not sure I understand the need for 0x100 (all IDs) entry above. If
> bus1's iommus property applies to all devices on the bus, why can't the
> ID 0xb be listed in the iommus property?
> 
> 	bus1 {
> 		#address-cells = <1>;
> 		#size-cells = <1>;
> 		ranges;
> 		iommus = <&{/iommu} 0xb 0  0x1 0x0>; // 4GB ID '0xb'
> 		dma-ranges = <0  0xb 0  0x1 0x0>;
> 
> 		anothermaster {
> 			...
> 		};
> 	};

It depends on how the address is interpreted, but we could make this
a valid case too.

> In which case I guess dma-ranges would be redundant.

No, because the iommus property doesn't translate the address range, it
just creates a new address space. bus1 and iommu in the example have
different #address-cells, so you definitely need a non-empty ranges
property.

> > The main advantage I think would be for IOMMUs that use the PCI b/d/f
> > numbers as IDs. These can have #address-cells=<3>, #size-cells=<2>
> > and have an empty dma-ranges property in the PCI host bridge node,
> > and interpret this as using the same encoding as the PCI BARs in
> > the ranges property.
> 
> I'm somewhat confused here, since you said earlier:
> 
> > After giving the ranges stuff some more thought, I have come to the
> > conclusion that using #iommu-cells should work fine for almost
> > all cases, including windowed iommus, because the window is not
> > actually needed in the device, but only in the iommu, wihch is of course
> > free to interpret the arguments as addresses.
> 
> But now you seem to be saying that we should still be using the
> #address-cells and #size-cells properties in the IOMMU node to determine
> the length of the specifier.

I probably wasn't clear. I think we can make it work either way, but
my feeling is that using #address-cells/#size-cells gives us a nicer
syntax for the more complex cases.

	Arnd
Thierry Reding May 20, 2014, 12:02 p.m. UTC | #13
On Tue, May 20, 2014 at 01:15:48PM +0200, Arnd Bergmann wrote:
> On Tuesday 20 May 2014 13:05:37 Thierry Reding wrote:
> > On Tue, May 20, 2014 at 12:04:54PM +0200, Arnd Bergmann wrote:
> > > On Monday 19 May 2014 22:59:46 Thierry Reding wrote:
> > > > On Mon, May 19, 2014 at 08:34:07PM +0200, Arnd Bergmann wrote:
[...]
> > > > > You should never need #size-cells > #address-cells
> > > > 
> > > > That was always my impression as well. But how then do you represent the
> > > > full 4 GiB address space in a 32-bit system? It starts at 0 and ends at
> > > > 4 GiB - 1, which makes it 4 GiB large. That's:
> > > > 
> > > > 	<0 1 0>
> > > > 
> > > > With #address-cells = <1> and #size-cells = <1> the best you can do is:
> > > > 
> > > > 	<0 0xffffffff>
> > > > 
> > > > but that's not accurate.
> > > 
> > > I think we've done both in the past, either extended #size-cells or
> > > taken 0xffffffff as a special token. Note that in your example,
> > > the iommu actually needs #address-cells = <2> anyway.
> > 
> > But it needs #address-cells = <2> only to encode an ID in addition to
> > the address. If this was a single-master IOMMU then there'd be no need
> > for the ID.
> 
> Right. But for a single-master IOMMU, there is no need to specify
> any additional data, it could have #address-cells=<0> if we take the
> optimization you suggested.

Couldn't a single-master IOMMU be windowed?

> > I'm not sure I understand the need for 0x100 (all IDs) entry above. If
> > bus1's iommus property applies to all devices on the bus, why can't the
> > ID 0xb be listed in the iommus property?
> > 
> > 	bus1 {
> > 		#address-cells = <1>;
> > 		#size-cells = <1>;
> > 		ranges;
> > 		iommus = <&{/iommu} 0xb 0  0x1 0x0>; // 4GB ID '0xb'
> > 		dma-ranges = <0  0xb 0  0x1 0x0>;
> > 
> > 		anothermaster {
> > 			...
> > 		};
> > 	};
> 
> It depends on how the address is interpreted, but we could make this
> a valid case too.
> 
> > In which case I guess dma-ranges would be redundant.
> 
> No, because the iommus property doesn't translate the address range, it
> just creates a new address space. bus1 and iommu in the example have
> different #address-cells, so you definitely need a non-empty ranges
> property.

Ah, now I get it.

> > > The main advantage I think would be for IOMMUs that use the PCI b/d/f
> > > numbers as IDs. These can have #address-cells=<3>, #size-cells=<2>
> > > and have an empty dma-ranges property in the PCI host bridge node,
> > > and interpret this as using the same encoding as the PCI BARs in
> > > the ranges property.
> > 
> > I'm somewhat confused here, since you said earlier:
> > 
> > > After giving the ranges stuff some more thought, I have come to the
> > > conclusion that using #iommu-cells should work fine for almost
> > > all cases, including windowed iommus, because the window is not
> > > actually needed in the device, but only in the iommu, wihch is of course
> > > free to interpret the arguments as addresses.
> > 
> > But now you seem to be saying that we should still be using the
> > #address-cells and #size-cells properties in the IOMMU node to determine
> > the length of the specifier.
> 
> I probably wasn't clear. I think we can make it work either way, but
> my feeling is that using #address-cells/#size-cells gives us a nicer
> syntax for the more complex cases.

Okay, so in summary we'd have something like this for simple cases:

Required properties:
--------------------
- #address-cells: The number of cells in an IOMMU specifier needed to encode
  an address.
- #size-cells: The number of cells in an IOMMU specifier needed to represent
  the length of an address range.

Typical values for the above include:
- #address-cells = <0>, size-cells = <0>: Single master IOMMU devices are not
  configurable and therefore no additional information needs to be encoded in
  the specifier. This may also apply to multiple master IOMMU devices that do
  not allow the association of masters to be configured.
- #address-cells = <1>, size-cells = <0>: Multiple master IOMMU devices may
  need to be configured in order to enable translation for a given master. In
  such cases the single address cell corresponds to the master device's ID.
- #address-cells = <2>, size-cells = <2>: Some IOMMU devices allow the DMA
  window for masters to be configured. The first cell of the address in this
  may contain the master device's ID for example, while the second cell could
  contain the start of the DMA window for the given device. The length of the
  DMA window is specified by two additional cells.

Examples:
=========

Single-master IOMMU:
--------------------

	iommu {
		#address-cells = <0>;
		#size-cells = <0>;
	};

	master {
		iommus = <&/iommu>;
	};

Multiple-master IOMMU with fixed associations:
----------------------------------------------

	/* multiple-master IOMMU */
	iommu {
		/*
		 * Masters are statically associated with this IOMMU and
		 * address translation is always enabled.
		 */
		#iommu-cells = <0>;
	};

	/* static association with IOMMU */
	master@1 {
		reg = <1>;
		iommus = <&/iommu>;
	};

	/* static association with IOMMU */
	master@2 {
		reg = <2>;
		iommus = <&/iommu>;
	};

Multiple-master IOMMU:
----------------------

	iommu {
		/* the specifier represents the ID of the master */
		#address-cells = <1>;
		#size-cells = <0>;
	};

	master {
		/* device has master ID 42 in the IOMMU */
		iommus = <&/iommu 42>;
	};

Multiple-master device:
-----------------------

	/* single-master IOMMU */
	iommu@1 {
		reg = <1>;
		#address-cells = <0>;
		#size-cells = <0>;
	};

	/* multiple-master IOMMU */
	iommu@2 {
		reg = <2>;
		#address-cells = <1>;
		#size-cells = <0>;
	};

	/* device with two master interfaces */
	master {
		iommus = <&/iommu@1>,    /* master of the single-master IOMMU */
			 <&/iommu@2 42>; /* ID 42 in multiple-master IOMMU */
	};

Multiple-master IOMMU with configurable DMA window:
---------------------------------------------------

	/ {
		#address-cells = <1>;
		#size-cells = <1>;

		iommu {
			/* master ID, address of DMA window */
			#address-cells = <2>;
			#size-cells = <2>;
		};

		master {
			/* master ID 42, 4 GiB DMA window starting at 0 */
			iommus = <&/iommu  42 0  0x1 0x0>;
		};
	};

Does that sound about right?

Thierry
Arnd Bergmann May 20, 2014, 12:41 p.m. UTC | #14
On Tuesday 20 May 2014 14:02:43 Thierry Reding wrote:
> On Tue, May 20, 2014 at 01:15:48PM +0200, Arnd Bergmann wrote:
> > On Tuesday 20 May 2014 13:05:37 Thierry Reding wrote:
> > > On Tue, May 20, 2014 at 12:04:54PM +0200, Arnd Bergmann wrote:
> > > > On Monday 19 May 2014 22:59:46 Thierry Reding wrote:
> > > > > On Mon, May 19, 2014 at 08:34:07PM +0200, Arnd Bergmann wrote:
> [...]
> > > > > > You should never need #size-cells > #address-cells
> > > > > 
> > > > > That was always my impression as well. But how then do you represent the
> > > > > full 4 GiB address space in a 32-bit system? It starts at 0 and ends at
> > > > > 4 GiB - 1, which makes it 4 GiB large. That's:
> > > > > 
> > > > > 	<0 1 0>
> > > > > 
> > > > > With #address-cells = <1> and #size-cells = <1> the best you can do is:
> > > > > 
> > > > > 	<0 0xffffffff>
> > > > > 
> > > > > but that's not accurate.
> > > > 
> > > > I think we've done both in the past, either extended #size-cells or
> > > > taken 0xffffffff as a special token. Note that in your example,
> > > > the iommu actually needs #address-cells = <2> anyway.
> > > 
> > > But it needs #address-cells = <2> only to encode an ID in addition to
> > > the address. If this was a single-master IOMMU then there'd be no need
> > > for the ID.
> > 
> > Right. But for a single-master IOMMU, there is no need to specify
> > any additional data, it could have #address-cells=<0> if we take the
> > optimization you suggested.
> 
> Couldn't a single-master IOMMU be windowed?

Ah, yes. That would actually be like an IBM pSeries, which has a windowed
IOMMU but uses one window per virtual machine. In that case, the window could
be a property of the iommu node though, rather than part of the address
in the link.

> > > > The main advantage I think would be for IOMMUs that use the PCI b/d/f
> > > > numbers as IDs. These can have #address-cells=<3>, #size-cells=<2>
> > > > and have an empty dma-ranges property in the PCI host bridge node,
> > > > and interpret this as using the same encoding as the PCI BARs in
> > > > the ranges property.
> > > 
> > > I'm somewhat confused here, since you said earlier:
> > > 
> > > > After giving the ranges stuff some more thought, I have come to the
> > > > conclusion that using #iommu-cells should work fine for almost
> > > > all cases, including windowed iommus, because the window is not
> > > > actually needed in the device, but only in the iommu, wihch is of course
> > > > free to interpret the arguments as addresses.
> > > 
> > > But now you seem to be saying that we should still be using the
> > > #address-cells and #size-cells properties in the IOMMU node to determine
> > > the length of the specifier.
> > 
> > I probably wasn't clear. I think we can make it work either way, but
> > my feeling is that using #address-cells/#size-cells gives us a nicer
> > syntax for the more complex cases.
> 
> Okay, so in summary we'd have something like this for simple cases:
> 
> Required properties:
> --------------------
> - #address-cells: The number of cells in an IOMMU specifier needed to encode
>   an address.
> - #size-cells: The number of cells in an IOMMU specifier needed to represent
>   the length of an address range.
> 
> Typical values for the above include:
> - #address-cells = <0>, size-cells = <0>: Single master IOMMU devices are not
>   configurable and therefore no additional information needs to be encoded in
>   the specifier. This may also apply to multiple master IOMMU devices that do
>   not allow the association of masters to be configured.
> - #address-cells = <1>, size-cells = <0>: Multiple master IOMMU devices may
>   need to be configured in order to enable translation for a given master. In
>   such cases the single address cell corresponds to the master device's ID.
> - #address-cells = <2>, size-cells = <2>: Some IOMMU devices allow the DMA
>   window for masters to be configured. The first cell of the address in this
>   may contain the master device's ID for example, while the second cell could
>   contain the start of the DMA window for the given device. The length of the
>   DMA window is specified by two additional cells.
> 
> Examples:
> =========
> 
> Single-master IOMMU:
> --------------------
> 
> 	iommu {
> 		#address-cells = <0>;
> 		#size-cells = <0>;
> 	};
> 
> 	master {
> 		iommus = <&/iommu>;
> 	};
> 
> Multiple-master IOMMU with fixed associations:
> ----------------------------------------------
> 
> 	/* multiple-master IOMMU */
> 	iommu {
> 		/*
> 		 * Masters are statically associated with this IOMMU and
> 		 * address translation is always enabled.
> 		 */
> 		#iommu-cells = <0>;
> 	};

copied wrong? I guess you mean #address-cells=<0>/#size-cells=<0> here.

> 	/* static association with IOMMU */
> 	master@1 {
> 		reg = <1>;
> 		iommus = <&/iommu>;
> 	};
> 
> 	/* static association with IOMMU */
> 	master@2 {
> 		reg = <2>;
> 		iommus = <&/iommu>;
> 	};
> 
> Multiple-master IOMMU:
> ----------------------
> 
> 	iommu {
> 		/* the specifier represents the ID of the master */
> 		#address-cells = <1>;
> 		#size-cells = <0>;
> 	};
> 
> 	master {
> 		/* device has master ID 42 in the IOMMU */
> 		iommus = <&/iommu 42>;
> 	};
> 
> Multiple-master device:
> -----------------------
> 
> 	/* single-master IOMMU */
> 	iommu@1 {
> 		reg = <1>;
> 		#address-cells = <0>;
> 		#size-cells = <0>;
> 	};
> 
> 	/* multiple-master IOMMU */
> 	iommu@2 {
> 		reg = <2>;
> 		#address-cells = <1>;
> 		#size-cells = <0>;
> 	};
> 
> 	/* device with two master interfaces */
> 	master {
> 		iommus = <&/iommu@1>,    /* master of the single-master IOMMU */
> 			 <&/iommu@2 42>; /* ID 42 in multiple-master IOMMU */
> 	};
> 
> Multiple-master IOMMU with configurable DMA window:
> ---------------------------------------------------
> 
> 	/ {
> 		#address-cells = <1>;
> 		#size-cells = <1>;
> 
> 		iommu {
> 			/* master ID, address of DMA window */
> 			#address-cells = <2>;
> 			#size-cells = <2>;
> 		};
> 
> 		master {
> 			/* master ID 42, 4 GiB DMA window starting at 0 */
> 			iommus = <&/iommu  42 0  0x1 0x0>;
> 		};
> 	};
> 
> Does that sound about right?

Yes, sounds great. I would probably leave out the Multiple-master device
from the examples, since that seems to be a rather obscure case.

I would like to add an explanation about dma-ranges to the binding:

8<--------
The parent bus of the iommu must have a valid "dma-ranges" property
describing how the physical address space of the IOMMU maps into
memory.
A device with an "iommus" property will ignore the "dma-ranges" property
of the parent node and rely on the IOMMU for translation instead.
Using an "iommus" property in bus device nodes with "dma-ranges"
specifying how child devices relate to the IOMMU is a possible extension
but is not recommended until this binding gets extended.
----------->8

Does that make sense to you? We can change what we say about
dma-ranges, I mainly want to be clear with what is or is not
allowed at this point.

	Arnd
Dave Martin May 20, 2014, 1:07 p.m. UTC | #15
On Tue, May 20, 2014 at 12:08:44PM +0200, Arnd Bergmann wrote:
> On Monday 19 May 2014 22:32:33 Thierry Reding wrote:
> > On Mon, May 19, 2014 at 06:22:31PM +0100, Dave Martin wrote:
> > > On Mon, May 19, 2014 at 01:53:37PM +0100, Thierry Reding wrote:
> > [...]
> > > > My understanding here is mostly based on the OpenFirmware working group
> > > > proposal for the dma-ranges property[0]. I'll give another example to
> > > > try and clarify how I had imagined this to work:
> > > > 
> > > >     / {
> > > >             #address-cells = <2>;
> > > >             #size-cells = <2>;
> > > > 
> > > >             iommu {
> > > >                     /*
> > > >                      * This is somewhat unusual (or maybe not) in that we
> > > >                      * need 2 cells to represent the size of an address
> > > >                      * space that is 32 bits long.
> > > >                      */
> > > >                     #address-cells = <1>;
> > > >                     #size-cells = <2>;
> > > > 
> > > >                     #iommu-cells = <1>;
> > > >             };
> > > > 
> > > >             master {
> > > >                     iommus = <&/iommu 42>;
> > > 
> > > Comparing this with the other discussion thread, we have a similar
> > > concept here, in that the iommu is made a logical parent, however...
> > > 
> > > Firstly, there's an implicit assumption here that the only kind of
> > > thing the master could possibly be connected to is an IOMMU, with
> > > no non-trivial interconnect in between.  I'm not sure this is going
> > > to scale to complex SoCs.
> > 
> > Here we go again. We're now in the very same bad spot that we've been in
> > so many times before. There are at least two SoCs that we know do not
> > require anything fancy, yet we're blocking adding support for those use
> > cases because we think that at some point some IOMMU may require more
> > than that. But at the same time it seems like we don't have enough data
> > about what exactly that is, so we keep speculating. At this rate we're
> > making it impossible to get a reasonable feature set supported upstream.
> > 
> > That's very frustrating.
> 
> I agree. While I just commented that I want to think through how this
> would look for other IOMMUs, the generic case of all masters that Dave
> wants is going overboard with the complexity and we'd be better off

I don't want that, and actually I agree with the conclusion "overboard".

It was really about exploring the problem space, including things that
we can reasonably foresee.  Any bindings today should necessarily be
much simpler, and that's certainly the correct approach.

I've been neglecting this thread (apologies) -- but although I have to
think a bit about Thierry's most recent suggestions, I think there's
actually reasonable alignment now.

> deferring this to whenever someone needs it, which I assume is never.

Well, some of it might be.  But I'm not saying we should give in to
wild speculation (except to get a feel for what we're ruling out, and
the consequences of that).

> 
> > > If a range of Stream IDs may be issued (especially from something like
> > > a PCIe RC where the stream ID may be a many-bit value), describing
> > > the IDs individually may be impractical.
> > 
> > The IOMMU specifier is completely specific to the IOMMU. If an IOMMU has
> > a need to represent the stream IDs as a range there's nothing keeping it
> > from defining the specifier accordingly.
> 
> If we make the IOMMU address space look like the PCI address space, we
> can have a simple representation for this in DT. For the code, I assume
> that we will always have to treat PCI and platform devices differently.

Can you elaborate on that?

Master ID signals in ARM systems and how they are handled are rather
under-specified today.  Treating the master ID bits as some extra bits
in some kind of extended bus address could work for ARM IOMMUs etc.,
but I didn't have a complete answer yet.

Cheers
---Dave
Thierry Reding May 20, 2014, 1:17 p.m. UTC | #16
On Tue, May 20, 2014 at 02:41:18PM +0200, Arnd Bergmann wrote:
> On Tuesday 20 May 2014 14:02:43 Thierry Reding wrote:
[...]
> > Couldn't a single-master IOMMU be windowed?
> 
> Ah, yes. That would actually be like an IBM pSeries, which has a windowed
> IOMMU but uses one window per virtual machine. In that case, the window could
> be a property of the iommu node though, rather than part of the address
> in the link.

Does that mean that the IOMMU has one statically configured window which
is the same for each virtual machine? That would require some other
mechanism to assign separate address spaces to each virtual machine,
wouldn't it? But I suspect that if the IOMMU allows that it could be
allocated dynamically at runtime.

> > Multiple-master IOMMU with fixed associations:
> > ----------------------------------------------
> > 
> > 	/* multiple-master IOMMU */
> > 	iommu {
> > 		/*
> > 		 * Masters are statically associated with this IOMMU and
> > 		 * address translation is always enabled.
> > 		 */
> > 		#iommu-cells = <0>;
> > 	};
> 
> copied wrong? I guess you mean #address-cells=<0>/#size-cells=<0> here.

Yes, I obviously wasn't careful when I copied this over.

> > Multiple-master device:
> > -----------------------
> > 
> > 	/* single-master IOMMU */
> > 	iommu@1 {
> > 		reg = <1>;
> > 		#address-cells = <0>;
> > 		#size-cells = <0>;
> > 	};
> > 
> > 	/* multiple-master IOMMU */
> > 	iommu@2 {
> > 		reg = <2>;
> > 		#address-cells = <1>;
> > 		#size-cells = <0>;
> > 	};
> > 
> > 	/* device with two master interfaces */
> > 	master {
> > 		iommus = <&/iommu@1>,    /* master of the single-master IOMMU */
> > 			 <&/iommu@2 42>; /* ID 42 in multiple-master IOMMU */
> > 	};
[...]
> > Does that sound about right?
> 
> Yes, sounds great. I would probably leave out the Multiple-master device
> from the examples, since that seems to be a rather obscure case.

Agreed. We can easily add such examples if/when such device start to
appear.

> I would like to add an explanation about dma-ranges to the binding:
> 
> 8<--------
> The parent bus of the iommu must have a valid "dma-ranges" property
> describing how the physical address space of the IOMMU maps into
> memory.

With physical address space you mean the addresses after translation,
not the I/O virtual addresses, right? But even so, how will this work
when there are multiple IOMMU devices? What determines which IOMMU is
mapped via which entry?

Perhaps having multiple IOMMUs implies that there will have to be some
partitioning of the parent address space to make sure two IOMMUs don't
translate to the same ranges?

> A device with an "iommus" property will ignore the "dma-ranges" property
> of the parent node and rely on the IOMMU for translation instead.

Do we need to consider the case where an IOMMU listed in iommus isn't
enabled (status = "disabled")? In that case presumably the device would
either not function or may optionally continue to master onto the parent
untranslated.

> Using an "iommus" property in bus device nodes with "dma-ranges"
> specifying how child devices relate to the IOMMU is a possible extension
> but is not recommended until this binding gets extended.

Just for my understanding, bus device nodes with iommus and dma-ranges
properties could be equivalently written by explicitly moving the iommus
properties into the child device nodes, right? In which case they should
be the same as the other examples. So that concept is a convenience
notation to reduce duplication, but doesn't fundamentally introduce any
new concept.

Thierry
Arnd Bergmann May 20, 2014, 1:23 p.m. UTC | #17
On Tuesday 20 May 2014 14:07:09 Dave Martin wrote:
> On Tue, May 20, 2014 at 12:08:44PM +0200, Arnd Bergmann wrote:
> > On Monday 19 May 2014 22:32:33 Thierry Reding wrote:
> > > On Mon, May 19, 2014 at 06:22:31PM +0100, Dave Martin wrote:
> > > > On Mon, May 19, 2014 at 01:53:37PM +0100, Thierry Reding wrote:
> > > [...]
> > > > > My understanding here is mostly based on the OpenFirmware working group
> > > > > proposal for the dma-ranges property[0]. I'll give another example to
> > > > > try and clarify how I had imagined this to work:
> > > > > 
> > > > >     / {
> > > > >             #address-cells = <2>;
> > > > >             #size-cells = <2>;
> > > > > 
> > > > >             iommu {
> > > > >                     /*
> > > > >                      * This is somewhat unusual (or maybe not) in that we
> > > > >                      * need 2 cells to represent the size of an address
> > > > >                      * space that is 32 bits long.
> > > > >                      */
> > > > >                     #address-cells = <1>;
> > > > >                     #size-cells = <2>;
> > > > > 
> > > > >                     #iommu-cells = <1>;
> > > > >             };
> > > > > 
> > > > >             master {
> > > > >                     iommus = <&/iommu 42>;
> > > > 
> > > > Comparing this with the other discussion thread, we have a similar
> > > > concept here, in that the iommu is made a logical parent, however...
> > > > 
> > > > Firstly, there's an implicit assumption here that the only kind of
> > > > thing the master could possibly be connected to is an IOMMU, with
> > > > no non-trivial interconnect in between.  I'm not sure this is going
> > > > to scale to complex SoCs.
> > > 
> > > Here we go again. We're now in the very same bad spot that we've been in
> > > so many times before. There are at least two SoCs that we know do not
> > > require anything fancy, yet we're blocking adding support for those use
> > > cases because we think that at some point some IOMMU may require more
> > > than that. But at the same time it seems like we don't have enough data
> > > about what exactly that is, so we keep speculating. At this rate we're
> > > making it impossible to get a reasonable feature set supported upstream.
> > > 
> > > That's very frustrating.
> > 
> > I agree. While I just commented that I want to think through how this
> > would look for other IOMMUs, the generic case of all masters that Dave
> > wants is going overboard with the complexity and we'd be better off
> 
> I don't want that, and actually I agree with the conclusion "overboard".
> 
> It was really about exploring the problem space, including things that
> we can reasonably foresee.  Any bindings today should necessarily be
> much simpler, and that's certainly the correct approach.
> 
> I've been neglecting this thread (apologies) -- but although I have to
> think a bit about Thierry's most recent suggestions, I think there's
> actually reasonable alignment now.
> 
> > deferring this to whenever someone needs it, which I assume is never.
> 
> Well, some of it might be.  But I'm not saying we should give in to
> wild speculation (except to get a feel for what we're ruling out, and
> the consequences of that).


Ok, fair enough.

> > > > If a range of Stream IDs may be issued (especially from something like
> > > > a PCIe RC where the stream ID may be a many-bit value), describing
> > > > the IDs individually may be impractical.
> > > 
> > > The IOMMU specifier is completely specific to the IOMMU. If an IOMMU has
> > > a need to represent the stream IDs as a range there's nothing keeping it
> > > from defining the specifier accordingly.
> > 
> > If we make the IOMMU address space look like the PCI address space, we
> > can have a simple representation for this in DT. For the code, I assume
> > that we will always have to treat PCI and platform devices differently.
> 
> Can you elaborate on that?
> 
> Master ID signals in ARM systems and how they are handled are rather
> under-specified today.  Treating the master ID bits as some extra bits
> in some kind of extended bus address could work for ARM IOMMUs etc.,
> but I didn't have a complete answer yet.

The PCI binding at http://www.openfirmware.org/1275/bindings/pci/pci2_1.pdf
defines a 96-bit address space for MMIO registers and other things, so
it can uniquely identify not just an address on the bus, but also 
any standard resource as seen by the device, quoting from there:

2.2.1.1. Numerical Representation
(The Numerical Representation of an address is the format that Open
 Firmware uses for storing an address within a property value and on
 the stack, as an argument to a package method.) The numerical
 representation of a PCI address consists of three cells, encoded as
 follows. For this purpose, the least-significant 32 bits of a cell is used;
 if the cell size is larger than 32 bits, any additional high-order bits
 are zero. Bit# 0 refers to the least-significant bit.

	Bit#   33222222 22221111 11111100 00000000
	       10987654 32109876 54321098 76543210
phys.hi cell:  npt000ss bbbbbbbb dddddfff rrrrrrrr
phys.mid cell: hhhhhhhh hhhhhhhh hhhhhhhh hhhhhhhh
phys.lo cell:  llllllll llllllll llllllll llllllll

where:
n is 0 if the address is relocatable, 1 otherwise
p is 1 if the addressable region is "prefetchable", 0 otherwise
t is 1 if the address is aliased (for non-relocatable I/O),
 below 1 MB (for Memory), or below 64 KB (for relocatable I/O).
ss is the space code, denoting the address space
bbbbbbbb is the 8-bit Bus Number
ddddd is the 5-bit Device Number
fff is the 3-bit Function Number
rrrrrrrr is the 8-bit Register Number
hh...hh is a 32-bit unsigned number
ll...ll is a 32-bit unsigned number

We can ignore n, p, t and r here, and use the same format for a DMA
address, then define an empty "dma-ranges" property. That would
imply that using b/d/f is sufficient to identify each master at the
iommu. Any device outside of the PCI host but connected to the same
iommu can use the same notation to list the logical b/d/f that gets
sent to the IOMMU in bus master transactions.

Do you think this is sufficient for the ARM SMMU, or do we need
something beyond that?

	Arnd
Arnd Bergmann May 20, 2014, 1:34 p.m. UTC | #18
On Tuesday 20 May 2014 15:17:43 Thierry Reding wrote:
> On Tue, May 20, 2014 at 02:41:18PM +0200, Arnd Bergmann wrote:
> > On Tuesday 20 May 2014 14:02:43 Thierry Reding wrote:
> [...]
> > > Couldn't a single-master IOMMU be windowed?
> > 
> > Ah, yes. That would actually be like an IBM pSeries, which has a windowed
> > IOMMU but uses one window per virtual machine. In that case, the window could
> > be a property of the iommu node though, rather than part of the address
> > in the link.
> 
> Does that mean that the IOMMU has one statically configured window which
> is the same for each virtual machine? That would require some other
> mechanism to assign separate address spaces to each virtual machine,
> wouldn't it? But I suspect that if the IOMMU allows that it could be
> allocated dynamically at runtime.

The way it works on pSeries is that upon VM creation, the guest is assigned
one 256MB window for use by assigned DMA capable devices. When the guest
creates a mapping, it uses a hypercall to associate a bus address in that
range with a guest physical address. The hypervisor checks that the bus
address is within the allowed range, and translates the guest physical
address into a host physical address, then enters both into the I/O page
table or I/O TLB.

> > I would like to add an explanation about dma-ranges to the binding:
> > 
> > 8<--------
> > The parent bus of the iommu must have a valid "dma-ranges" property
> > describing how the physical address space of the IOMMU maps into
> > memory.
> 
> With physical address space you mean the addresses after translation,
> not the I/O virtual addresses, right? But even so, how will this work
> when there are multiple IOMMU devices? What determines which IOMMU is
> mapped via which entry?
> 
> Perhaps having multiple IOMMUs implies that there will have to be some
> partitioning of the parent address space to make sure two IOMMUs don't
> translate to the same ranges?

These dma-ranges properties would almost always be for the entire RAM,
and we can treat anything else as a bug.

The mapping between what goes into the IOMMU and what comes out of it
is not reflected in DT at all, since it only happens at runtime.
The dma-ranges property I mean above describes how what comes out of
the IOMMU maps into physical memory.

> > A device with an "iommus" property will ignore the "dma-ranges" property
> > of the parent node and rely on the IOMMU for translation instead.
> 
> Do we need to consider the case where an IOMMU listed in iommus isn't
> enabled (status = "disabled")? In that case presumably the device would
> either not function or may optionally continue to master onto the parent
> untranslated.

My reasoning was that the DT should specify whether we use the IOMMU
or not. Being able to just switch on or off the IOMMU sounds nice as
well, so we could change the text above to do that.

Another option would be to do this in the IOMMU code, basically
falling back to the IOMMU parent's dma-ranges property and using
linear dma_map_ops when that is disabled.

> > Using an "iommus" property in bus device nodes with "dma-ranges"
> > specifying how child devices relate to the IOMMU is a possible extension
> > but is not recommended until this binding gets extended.
> 
> Just for my understanding, bus device nodes with iommus and dma-ranges
> properties could be equivalently written by explicitly moving the iommus
> properties into the child device nodes, right? In which case they should
> be the same as the other examples. So that concept is a convenience
> notation to reduce duplication, but doesn't fundamentally introduce any
> new concept.

The one case  where that doesn't work is PCI, because we don't list the
PCI devices in DT normally, and the iommus property would only exist
at the PCI host controller node.

	Arnd
Thierry Reding May 20, 2014, 2 p.m. UTC | #19
On Tue, May 20, 2014 at 03:34:46PM +0200, Arnd Bergmann wrote:
> On Tuesday 20 May 2014 15:17:43 Thierry Reding wrote:
> > On Tue, May 20, 2014 at 02:41:18PM +0200, Arnd Bergmann wrote:
> > > On Tuesday 20 May 2014 14:02:43 Thierry Reding wrote:
> > [...]
> > > > Couldn't a single-master IOMMU be windowed?
> > > 
> > > Ah, yes. That would actually be like an IBM pSeries, which has a windowed
> > > IOMMU but uses one window per virtual machine. In that case, the window could
> > > be a property of the iommu node though, rather than part of the address
> > > in the link.
> > 
> > Does that mean that the IOMMU has one statically configured window which
> > is the same for each virtual machine? That would require some other
> > mechanism to assign separate address spaces to each virtual machine,
> > wouldn't it? But I suspect that if the IOMMU allows that it could be
> > allocated dynamically at runtime.
> 
> The way it works on pSeries is that upon VM creation, the guest is assigned
> one 256MB window for use by assigned DMA capable devices. When the guest
> creates a mapping, it uses a hypercall to associate a bus address in that
> range with a guest physical address. The hypervisor checks that the bus
> address is within the allowed range, and translates the guest physical
> address into a host physical address, then enters both into the I/O page
> table or I/O TLB.

So when a VM is booted it is passed a device tree with that DMA window?
Given what you describe above this seems to be more of a configuration
option to restrict the IOMMU to a subset of the physical memory for
purposes of virtualization. So I agree that this wouldn't be a good fit
for what we're trying to achieve with iommus or dma-ranges in this
binding.

> > > I would like to add an explanation about dma-ranges to the binding:
> > > 
> > > 8<--------
> > > The parent bus of the iommu must have a valid "dma-ranges" property
> > > describing how the physical address space of the IOMMU maps into
> > > memory.
> > 
> > With physical address space you mean the addresses after translation,
> > not the I/O virtual addresses, right? But even so, how will this work
> > when there are multiple IOMMU devices? What determines which IOMMU is
> > mapped via which entry?
> > 
> > Perhaps having multiple IOMMUs implies that there will have to be some
> > partitioning of the parent address space to make sure two IOMMUs don't
> > translate to the same ranges?
> 
> These dma-ranges properties would almost always be for the entire RAM,
> and we can treat anything else as a bug.

Would it typically be a 1:1 mapping? In that case could we define an
empty dma-ranges property to mean exactly that? That would make it
consistent with the ranges property.

> The mapping between what goes into the IOMMU and what comes out of it
> is not reflected in DT at all, since it only happens at runtime.
> The dma-ranges property I mean above describes how what comes out of
> the IOMMU maps into physical memory.

Understood. That makes sense.

> > > A device with an "iommus" property will ignore the "dma-ranges" property
> > > of the parent node and rely on the IOMMU for translation instead.
> > 
> > Do we need to consider the case where an IOMMU listed in iommus isn't
> > enabled (status = "disabled")? In that case presumably the device would
> > either not function or may optionally continue to master onto the parent
> > untranslated.
> 
> My reasoning was that the DT should specify whether we use the IOMMU
> or not. Being able to just switch on or off the IOMMU sounds nice as
> well, so we could change the text above to do that.
> 
> Another option would be to do this in the IOMMU code, basically
> falling back to the IOMMU parent's dma-ranges property and using
> linear dma_map_ops when that is disabled.

Yes, it should be trivial for the IOMMU core code to take care of this
special case. Still I think it's worth mentioning it in the binding so
that it's clearly specified.

> > > Using an "iommus" property in bus device nodes with "dma-ranges"
> > > specifying how child devices relate to the IOMMU is a possible extension
> > > but is not recommended until this binding gets extended.
> > 
> > Just for my understanding, bus device nodes with iommus and dma-ranges
> > properties could be equivalently written by explicitly moving the iommus
> > properties into the child device nodes, right? In which case they should
> > be the same as the other examples. So that concept is a convenience
> > notation to reduce duplication, but doesn't fundamentally introduce any
> > new concept.
> 
> The one case  where that doesn't work is PCI, because we don't list the
> PCI devices in DT normally, and the iommus property would only exist
> at the PCI host controller node.

But it could work in classic OpenFirmware where the device tree can be
populated with the tree of PCI devices enumerated by OF. There are also
devices that have a fixed configuration and where technically the PCI
devices can be listed in the device tree. This is somewhat important if
for example one PCI device is a GPIO controller and needs to be
referenced by phandle from some other device.

I'll make a note in the binding document about this possible future
extension.

Thierry
Dave Martin May 20, 2014, 3:24 p.m. UTC | #20
On Tue, May 20, 2014 at 02:41:18PM +0200, Arnd Bergmann wrote:
> On Tuesday 20 May 2014 14:02:43 Thierry Reding wrote:
> > On Tue, May 20, 2014 at 01:15:48PM +0200, Arnd Bergmann wrote:
> > > On Tuesday 20 May 2014 13:05:37 Thierry Reding wrote:
> > > > On Tue, May 20, 2014 at 12:04:54PM +0200, Arnd Bergmann wrote:
> > > > > On Monday 19 May 2014 22:59:46 Thierry Reding wrote:
> > > > > > On Mon, May 19, 2014 at 08:34:07PM +0200, Arnd Bergmann wrote:
> > [...]
> > > > > > > You should never need #size-cells > #address-cells
> > > > > > 
> > > > > > That was always my impression as well. But how then do you represent the
> > > > > > full 4 GiB address space in a 32-bit system? It starts at 0 and ends at
> > > > > > 4 GiB - 1, which makes it 4 GiB large. That's:
> > > > > > 
> > > > > > 	<0 1 0>
> > > > > > 
> > > > > > With #address-cells = <1> and #size-cells = <1> the best you can do is:
> > > > > > 
> > > > > > 	<0 0xffffffff>
> > > > > > 
> > > > > > but that's not accurate.
> > > > > 
> > > > > I think we've done both in the past, either extended #size-cells or
> > > > > taken 0xffffffff as a special token. Note that in your example,
> > > > > the iommu actually needs #address-cells = <2> anyway.
> > > > 
> > > > But it needs #address-cells = <2> only to encode an ID in addition to
> > > > the address. If this was a single-master IOMMU then there'd be no need
> > > > for the ID.
> > > 
> > > Right. But for a single-master IOMMU, there is no need to specify
> > > any additional data, it could have #address-cells=<0> if we take the
> > > optimization you suggested.
> > 
> > Couldn't a single-master IOMMU be windowed?
> 
> Ah, yes. That would actually be like an IBM pSeries, which has a windowed
> IOMMU but uses one window per virtual machine. In that case, the window could
> be a property of the iommu node though, rather than part of the address
> in the link.
> 
> > > > > The main advantage I think would be for IOMMUs that use the PCI b/d/f
> > > > > numbers as IDs. These can have #address-cells=<3>, #size-cells=<2>
> > > > > and have an empty dma-ranges property in the PCI host bridge node,
> > > > > and interpret this as using the same encoding as the PCI BARs in
> > > > > the ranges property.
> > > > 
> > > > I'm somewhat confused here, since you said earlier:
> > > > 
> > > > > After giving the ranges stuff some more thought, I have come to the
> > > > > conclusion that using #iommu-cells should work fine for almost
> > > > > all cases, including windowed iommus, because the window is not
> > > > > actually needed in the device, but only in the iommu, wihch is of course
> > > > > free to interpret the arguments as addresses.
> > > > 
> > > > But now you seem to be saying that we should still be using the
> > > > #address-cells and #size-cells properties in the IOMMU node to determine
> > > > the length of the specifier.
> > > 
> > > I probably wasn't clear. I think we can make it work either way, but
> > > my feeling is that using #address-cells/#size-cells gives us a nicer
> > > syntax for the more complex cases.
> > 
> > Okay, so in summary we'd have something like this for simple cases:
> > 
> > Required properties:
> > --------------------
> > - #address-cells: The number of cells in an IOMMU specifier needed to encode
> >   an address.
> > - #size-cells: The number of cells in an IOMMU specifier needed to represent
> >   the length of an address range.
> > 
> > Typical values for the above include:
> > - #address-cells = <0>, size-cells = <0>: Single master IOMMU devices are not
> >   configurable and therefore no additional information needs to be encoded in
> >   the specifier. This may also apply to multiple master IOMMU devices that do
> >   not allow the association of masters to be configured.
> > - #address-cells = <1>, size-cells = <0>: Multiple master IOMMU devices may
> >   need to be configured in order to enable translation for a given master. In
> >   such cases the single address cell corresponds to the master device's ID.
> > - #address-cells = <2>, size-cells = <2>: Some IOMMU devices allow the DMA
> >   window for masters to be configured. The first cell of the address in this
> >   may contain the master device's ID for example, while the second cell could
> >   contain the start of the DMA window for the given device. The length of the
> >   DMA window is specified by two additional cells.

I was trying to figure out how to describe the different kinds of
transformation we could have on the address/ID input to the IOMMU.
Treating the whole thing as opaque gets us off the hook there.

IDs are probably not propagated, not remapped, or we simply don't care
about them; whereas the address transformation is software-controlled,
so we don't describe that anyway.

Delegating grokking the mapping to the iommu driver makes sense --
it's what it's there for, after all.


I'm not sure whether the windowed IOMMU case is special actually.

Since the address to program into the master is found by calling the
IOMMU driver to create some mappings, does anything except the IOMMU
driver need to understand that there is windowing?

> > 
> > Examples:
> > =========
> > 
> > Single-master IOMMU:
> > --------------------
> > 
> > 	iommu {
> > 		#address-cells = <0>;
> > 		#size-cells = <0>;
> > 	};
> > 
> > 	master {
> > 		iommus = <&/iommu>;
> > 	};
> > 
> > Multiple-master IOMMU with fixed associations:
> > ----------------------------------------------
> > 
> > 	/* multiple-master IOMMU */
> > 	iommu {
> > 		/*
> > 		 * Masters are statically associated with this IOMMU and
> > 		 * address translation is always enabled.
> > 		 */
> > 		#iommu-cells = <0>;
> > 	};
> 
> copied wrong? I guess you mean #address-cells=<0>/#size-cells=<0> here.
> 
> > 	/* static association with IOMMU */
> > 	master@1 {
> > 		reg = <1>;

Just for clarification, "reg" just has its standard meaning here, and
is nothing to do with the IOMMU?

> > 		iommus = <&/iommu>;

In effect, "iommus" is doing the same thing as my "slaves" property.

The way #address-cells and #size-cells determine the address and range
size for mastering into the IOMMU is also similar.  The main difference
is that I didn't build the ID into the address.

> > 	};
> > 
> > 	/* static association with IOMMU */
> > 	master@2 {
> > 		reg = <2>;
> > 		iommus = <&/iommu>;
> > 	};
> > 
> > Multiple-master IOMMU:
> > ----------------------
> > 
> > 	iommu {
> > 		/* the specifier represents the ID of the master */
> > 		#address-cells = <1>;
> > 		#size-cells = <0>;

How do we know the size of the input address to the IOMMU?  Do we
get cases for example where the IOMMU only accepts a 32-bit input
address, but some 64-bit capable masters are connected through it?

The size of the output address from the IOMMU will be determined
by its own mastering destination, which by default in ePAPR is the
IOMMU node's parent.  I think that's what you intended, and what we
expect in this case.

For determining dma masks, it is the output address that it
important.  Santosh's code can probably be taught to handle this,
if given an additional traversal rule for following "iommus"
properties.  However, deploying an IOMMU whose output address size
is smaller than the 

> > 	};
> > 
> > 	master {
> > 		/* device has master ID 42 in the IOMMU */
> > 		iommus = <&/iommu 42>;
> > 	};
> > 
> > Multiple-master device:
> > -----------------------
> > 
> > 	/* single-master IOMMU */
> > 	iommu@1 {
> > 		reg = <1>;
> > 		#address-cells = <0>;
> > 		#size-cells = <0>;
> > 	};
> > 
> > 	/* multiple-master IOMMU */
> > 	iommu@2 {
> > 		reg = <2>;
> > 		#address-cells = <1>;
> > 		#size-cells = <0>;
> > 	};
> > 
> > 	/* device with two master interfaces */
> > 	master {
> > 		iommus = <&/iommu@1>,    /* master of the single-master IOMMU */
> > 			 <&/iommu@2 42>; /* ID 42 in multiple-master IOMMU */
> > 	};
> > 
> > Multiple-master IOMMU with configurable DMA window:
> > ---------------------------------------------------
> > 
> > 	/ {
> > 		#address-cells = <1>;
> > 		#size-cells = <1>;
> > 
> > 		iommu {
> > 			/* master ID, address of DMA window */
> > 			#address-cells = <2>;
> > 			#size-cells = <2>;
> > 		};
> > 
> > 		master {
> > 			/* master ID 42, 4 GiB DMA window starting at 0 */
> > 			iommus = <&/iommu  42 0  0x1 0x0>;
> > 		};
> > 	};
> > 
> > Does that sound about right?
> 
> Yes, sounds great. I would probably leave out the Multiple-master device
> from the examples, since that seems to be a rather obscure case.

I think multi-master is the common case.

> 
> I would like to add an explanation about dma-ranges to the binding:
> 
> 8<--------
> The parent bus of the iommu must have a valid "dma-ranges" property
> describing how the physical address space of the IOMMU maps into
> memory.
> A device with an "iommus" property will ignore the "dma-ranges" property
> of the parent node and rely on the IOMMU for translation instead.
> Using an "iommus" property in bus device nodes with "dma-ranges"
> specifying how child devices relate to the IOMMU is a possible extension
> but is not recommended until this binding gets extended.

This sounds just right.  The required semantics is that the presence of
"iommus" on some bus mastering device overrides the ePAPR default
destination so that transactions are delivered to the IOMMU for
translation instead of the master's DT parent node.

Where transactions flow out of the IOMMU, the iommu takes on the role
of the master, so the default destination would be the iommu node's
parent.

> ----------->8
> 
> Does that make sense to you? We can change what we say about
> dma-ranges, I mainly want to be clear with what is or is not
> allowed at this point.

I think it would be inconsistent and unnecessary to disallow it in the
binding.  The meaning you've proposed seems completely consistent with
ePAPR, so I suggest to keep it.  The IOMMU is just another bus master
from the ePAPR point of view -- no need to make special rules for it
unless they are useful.

The binding does not need to be (and generally shouldn't be) a
description of precisely what the kernel does and does not support.

However, if we don't need to support non-identity dma-ranges in Linux
yet, we have the option to barf if we see such a dma-ranges memorywards
of an IOMMU, if it simplifies the Linux implementation.  We could always
relax that later -- and it'll be obvious how to describe that situation
in DT.



What I would like to see is a recommandation, based on Thierry's binding
here, for describing how cross-mastering in general is described.  It's
not really a binding, but more of a template for bindings.

I'm happy to have a go at writing it, then we can decide whether it's
useful or not.


There are a few things from the discussion that are *not* solved by this
iommu binding, but they seem reasonable.  The binding also doesn't block
solving those things later if/when needed:

 1) Cross-mastering to things that are not IOMMUs

    We might need to solve this later if we encounter SoCs with
    problematic topologies, we shouldn't worry about it for the time
    being.

    We'll to revisit it for GICv3 but that's a separate topic.

 2) Describing address and ID remappings for cross-mastering.

    We can describe this in a way that is consistent with this IOMMU
    binding.  We will need to describe something for GICv3, but the
    common case will be that IDs are just passed through without
    remapping.

    We don't need to clarify how IDs are propagated until we have
    something in DT for IDs to propagate to.


Cheers
---Dave
Will Deacon May 20, 2014, 3:26 p.m. UTC | #21
On Tue, May 20, 2014 at 02:23:47PM +0100, Arnd Bergmann wrote:
> 	Bit#   33222222 22221111 11111100 00000000
> 	       10987654 32109876 54321098 76543210
> phys.hi cell:  npt000ss bbbbbbbb dddddfff rrrrrrrr
> phys.mid cell: hhhhhhhh hhhhhhhh hhhhhhhh hhhhhhhh
> phys.lo cell:  llllllll llllllll llllllll llllllll
> 
> where:
> n is 0 if the address is relocatable, 1 otherwise
> p is 1 if the addressable region is "prefetchable", 0 otherwise
> t is 1 if the address is aliased (for non-relocatable I/O),
>  below 1 MB (for Memory), or below 64 KB (for relocatable I/O).
> ss is the space code, denoting the address space
> bbbbbbbb is the 8-bit Bus Number
> ddddd is the 5-bit Device Number
> fff is the 3-bit Function Number
> rrrrrrrr is the 8-bit Register Number
> hh...hh is a 32-bit unsigned number
> ll...ll is a 32-bit unsigned number
> 
> We can ignore n, p, t and r here, and use the same format for a DMA
> address, then define an empty "dma-ranges" property. That would
> imply that using b/d/f is sufficient to identify each master at the
> iommu. Any device outside of the PCI host but connected to the same
> iommu can use the same notation to list the logical b/d/f that gets
> sent to the IOMMU in bus master transactions.
> 
> Do you think this is sufficient for the ARM SMMU, or do we need
> something beyond that?

I think it can define the common-cases for the existing implementations,
yes. I anticipate Stream-IDs becoming > 16-bit in the near future though,
so we'd need extra bits if we're describing other devices coming into the
SMMU.

Note that we already have a binding for the current SMMU driver, so I'm not
really in a position to shift over to a new binding until the next version of
the SMMU architecture comes along...

Will
Dave Martin May 20, 2014, 4:39 p.m. UTC | #22
On Tue, May 20, 2014 at 04:26:59PM +0100, Will Deacon wrote:
> On Tue, May 20, 2014 at 02:23:47PM +0100, Arnd Bergmann wrote:
> > 	Bit#   33222222 22221111 11111100 00000000
> > 	       10987654 32109876 54321098 76543210
> > phys.hi cell:  npt000ss bbbbbbbb dddddfff rrrrrrrr
> > phys.mid cell: hhhhhhhh hhhhhhhh hhhhhhhh hhhhhhhh
> > phys.lo cell:  llllllll llllllll llllllll llllllll
> > 
> > where:
> > n is 0 if the address is relocatable, 1 otherwise
> > p is 1 if the addressable region is "prefetchable", 0 otherwise
> > t is 1 if the address is aliased (for non-relocatable I/O),
> >  below 1 MB (for Memory), or below 64 KB (for relocatable I/O).
> > ss is the space code, denoting the address space
> > bbbbbbbb is the 8-bit Bus Number
> > ddddd is the 5-bit Device Number
> > fff is the 3-bit Function Number
> > rrrrrrrr is the 8-bit Register Number
> > hh...hh is a 32-bit unsigned number
> > ll...ll is a 32-bit unsigned number
> > 
> > We can ignore n, p, t and r here, and use the same format for a DMA
> > address, then define an empty "dma-ranges" property. That would
> > imply that using b/d/f is sufficient to identify each master at the
> > iommu. Any device outside of the PCI host but connected to the same
> > iommu can use the same notation to list the logical b/d/f that gets
> > sent to the IOMMU in bus master transactions.
> > 
> > Do you think this is sufficient for the ARM SMMU, or do we need
> > something beyond that?
> 
> I think it can define the common-cases for the existing implementations,
> yes. I anticipate Stream-IDs becoming > 16-bit in the near future though,
> so we'd need extra bits if we're describing other devices coming into the
> SMMU.
> 
> Note that we already have a binding for the current SMMU driver, so I'm not
> really in a position to shift over to a new binding until the next version of
> the SMMU architecture comes along...

How much code relies on the meaning of the nptsbdf bits?

Following the general model of concatenating the master ID with the
address bits and trasting that as a single input to the IOMMU seems
reasonable -- maybe we either don't care about the attribute bits and
can replace them with master ID bits, or instead we could extend with
more master ID bits at bit 96 and up.

npts looks like memory type information.  What memory type information
we need to describe for coherent masters on ARM is a bit of an open
question.  Perhaps we can come up with a good ARM-specific interpretation
of these bits, but I wonder whether they are enough.

If there is a real PCI bus in the system of course, then we are fine
once we reach it.  The only problem is for the stuff on AMBA buses etc.

In theory, we could have to describe memory types, cacheability
attributes and coherency domains, but we may be able to squeeze the
possibilities down to a smaller set of sane memory types.

Cheers
---Dave
Arnd Bergmann May 20, 2014, 8:26 p.m. UTC | #23
On Tuesday 20 May 2014 16:24:59 Dave Martin wrote:
> On Tue, May 20, 2014 at 02:41:18PM +0200, Arnd Bergmann wrote:
> > On Tuesday 20 May 2014 14:02:43 Thierry Reding wrote:
> > > On Tue, May 20, 2014 at 01:15:48PM +0200, Arnd Bergmann wrote:
> > > Typical values for the above include:
> > > - #address-cells = <0>, size-cells = <0>: Single master IOMMU devices are not
> > >   configurable and therefore no additional information needs to be encoded in
> > >   the specifier. This may also apply to multiple master IOMMU devices that do
> > >   not allow the association of masters to be configured.
> > > - #address-cells = <1>, size-cells = <0>: Multiple master IOMMU devices may
> > >   need to be configured in order to enable translation for a given master. In
> > >   such cases the single address cell corresponds to the master device's ID.
> > > - #address-cells = <2>, size-cells = <2>: Some IOMMU devices allow the DMA
> > >   window for masters to be configured. The first cell of the address in this
> > >   may contain the master device's ID for example, while the second cell could
> > >   contain the start of the DMA window for the given device. The length of the
> > >   DMA window is specified by two additional cells.
> 
> I was trying to figure out how to describe the different kinds of
> transformation we could have on the address/ID input to the IOMMU.
> Treating the whole thing as opaque gets us off the hook there.
> 
> IDs are probably not propagated, not remapped, or we simply don't care
> about them; whereas the address transformation is software-controlled,
> so we don't describe that anyway.
> 
> Delegating grokking the mapping to the iommu driver makes sense --
> it's what it's there for, after all.
> 
> 
> I'm not sure whether the windowed IOMMU case is special actually.
> 
> Since the address to program into the master is found by calling the
> IOMMU driver to create some mappings, does anything except the IOMMU
> driver need to understand that there is windowing?

No. I tried to explain that earlier today, and in my earlier mails
I hadn't thought that part through. Only the IOMMU driver needs to care
about the window.

> > > 
> > > Examples:
> > > =========
> > > 
> > > Single-master IOMMU:
> > > --------------------
> > > 
> > > 	iommu {
> > > 		#address-cells = <0>;
> > > 		#size-cells = <0>;
> > > 	};
> > > 
> > > 	master {
> > > 		iommus = <&/iommu>;
> > > 	};
> > > 
> > > Multiple-master IOMMU with fixed associations:
> > > ----------------------------------------------
> > > 
> > > 	/* multiple-master IOMMU */
> > > 	iommu {
> > > 		/*
> > > 		 * Masters are statically associated with this IOMMU and
> > > 		 * address translation is always enabled.
> > > 		 */
> > > 		#iommu-cells = <0>;
> > > 	};
> > 
> > copied wrong? I guess you mean #address-cells=<0>/#size-cells=<0> here.
> > 
> > > 	/* static association with IOMMU */
> > > 	master@1 {
> > > 		reg = <1>;
> 
> Just for clarification, "reg" just has its standard meaning here, and
> is nothing to do with the IOMMU?

correct

> > > 		iommus = <&/iommu>;
> 
> In effect, "iommus" is doing the same thing as my "slaves" property.
> 
> The way #address-cells and #size-cells determine the address and range
> size for mastering into the IOMMU is also similar.  The main difference
> is that I didn't build the ID into the address.

Right. I think the difference is more about what we want to call
things: Calling it iommu means we want to specifically describe
the case of iommus that needs to get handled by all OSs in a particular
way, while the more generic slave connection doesn't correspond to
a specific concept in the OS.

> > > 	};
> > > 
> > > 	/* static association with IOMMU */
> > > 	master@2 {
> > > 		reg = <2>;
> > > 		iommus = <&/iommu>;
> > > 	};
> > > 
> > > Multiple-master IOMMU:
> > > ----------------------
> > > 
> > > 	iommu {
> > > 		/* the specifier represents the ID of the master */
> > > 		#address-cells = <1>;
> > > 		#size-cells = <0>;
> 
> How do we know the size of the input address to the IOMMU?  Do we
> get cases for example where the IOMMU only accepts a 32-bit input
> address, but some 64-bit capable masters are connected through it?

I was stuck on this question for a while before, but then I realized
that it doesn't matter at all: It's the IOMMU driver itself that
manages the address space, and it doesn't matter if a slave can
address a larger range than the IOMMU can accept. If the IOMMU
needs to deal with the opposite case (64-bit input addresses
but a 32-bit master), that limitation can be put into the specifier.

> The size of the output address from the IOMMU will be determined
> by its own mastering destination, which by default in ePAPR is the
> IOMMU node's parent.  I think that's what you intended, and what we
> expect in this case.

Rihgt.

> For determining dma masks, it is the output address that it
> important.  Santosh's code can probably be taught to handle this,
> if given an additional traversal rule for following "iommus"
> properties.  However, deploying an IOMMU whose output address size
> is smaller than the 

Something seems to be missing here. I don't think we want to handle
the case where the IOMMU output cannot the entire memory address
space. If necessary, that would mean using both an IOMMU driver
and swiotlb, but I think it's a reasonable assumption that hardware
isn't /that/ crazy.

> > > Multiple-master device:
> > > -----------------------
> > > 
> > > 	/* single-master IOMMU */
> > > 	iommu@1 {
> > > 		reg = <1>;
> > > 		#address-cells = <0>;
> > > 		#size-cells = <0>;
> > > 	};
> > > 
> > > 	/* multiple-master IOMMU */
> > > 	iommu@2 {
> > > 		reg = <2>;
> > > 		#address-cells = <1>;
> > > 		#size-cells = <0>;
> > > 	};
> > > 
> > > 	/* device with two master interfaces */
> > > 	master {
> > > 		iommus = <&/iommu@1>,    /* master of the single-master IOMMU */
> > > 			 <&/iommu@2 42>; /* ID 42 in multiple-master IOMMU */
> > > 	};
> > > 
> > > Multiple-master IOMMU with configurable DMA window:
> > > ---------------------------------------------------
> > > 
> > > 	/ {
> > > 		#address-cells = <1>;
> > > 		#size-cells = <1>;
> > > 
> > > 		iommu {
> > > 			/* master ID, address of DMA window */
> > > 			#address-cells = <2>;
> > > 			#size-cells = <2>;
> > > 		};
> > > 
> > > 		master {
> > > 			/* master ID 42, 4 GiB DMA window starting at 0 */
> > > 			iommus = <&/iommu  42 0  0x1 0x0>;
> > > 		};
> > > 	};
> > > 
> > > Does that sound about right?
> > 
> > Yes, sounds great. I would probably leave out the Multiple-master device
> > from the examples, since that seems to be a rather obscure case.
> 
> I think multi-master is the common case.

Which of the two cases above do you mean? I was referring to the first
as being obscure, not the second.

I still haven't seen an example of the first, while the second one
is very common.

> > ----------->8
> > 
> > Does that make sense to you? We can change what we say about
> > dma-ranges, I mainly want to be clear with what is or is not
> > allowed at this point.
> 
> I think it would be inconsistent and unnecessary to disallow it in the
> binding.  The meaning you've proposed seems completely consistent with
> ePAPR, so I suggest to keep it.  The IOMMU is just another bus master
> from the ePAPR point of view -- no need to make special rules for it
> unless they are useful.
> 
> The binding does not need to be (and generally shouldn't be) a
> description of precisely what the kernel does and does not support.
> 
> However, if we don't need to support non-identity dma-ranges in Linux
> yet, we have the option to barf if we see such a dma-ranges memorywards
> of an IOMMU, if it simplifies the Linux implementation.  We could always
> relax that later -- and it'll be obvious how to describe that situation
> in DT.

Ok.

> What I would like to see is a recommandation, based on Thierry's binding
> here, for describing how cross-mastering in general is described.  It's
> not really a binding, but more of a template for bindings.
> 
> I'm happy to have a go at writing it, then we can decide whether it's
> useful or not.

I don't mind if you take this on, but I'm not sure if that should be
part of this binding or not. Let's see what you come up with.

> There are a few things from the discussion that are *not* solved by this
> iommu binding, but they seem reasonable.  The binding also doesn't block
> solving those things later if/when needed:
> 
>  1) Cross-mastering to things that are not IOMMUs
> 
>     We might need to solve this later if we encounter SoCs with
>     problematic topologies, we shouldn't worry about it for the time
>     being.
> 
>     We'll to revisit it for GICv3 but that's a separate topic.


>  2) Describing address and ID remappings for cross-mastering.
> 
>     We can describe this in a way that is consistent with this IOMMU
>     binding.  We will need to describe something for GICv3, but the
>     common case will be that IDs are just passed through without
>     remapping.
> 
>     We don't need to clarify how IDs are propagated until we have
>     something in DT for IDs to propagate to.

Ok, thanks for pointing these out. I had forgotten about the MSI
case, but it seems ok to defer that part for now.

	Arnd
Arnd Bergmann May 20, 2014, 8:31 p.m. UTC | #24
On Tuesday 20 May 2014 16:00:02 Thierry Reding wrote:
> On Tue, May 20, 2014 at 03:34:46PM +0200, Arnd Bergmann wrote:
> > On Tuesday 20 May 2014 15:17:43 Thierry Reding wrote:
> > > On Tue, May 20, 2014 at 02:41:18PM +0200, Arnd Bergmann wrote:
> > > > On Tuesday 20 May 2014 14:02:43 Thierry Reding wrote:
> > > [...]
> > > > > Couldn't a single-master IOMMU be windowed?
> > > > 
> > > > Ah, yes. That would actually be like an IBM pSeries, which has a windowed
> > > > IOMMU but uses one window per virtual machine. In that case, the window could
> > > > be a property of the iommu node though, rather than part of the address
> > > > in the link.
> > > 
> > > Does that mean that the IOMMU has one statically configured window which
> > > is the same for each virtual machine? That would require some other
> > > mechanism to assign separate address spaces to each virtual machine,
> > > wouldn't it? But I suspect that if the IOMMU allows that it could be
> > > allocated dynamically at runtime.
> > 
> > The way it works on pSeries is that upon VM creation, the guest is assigned
> > one 256MB window for use by assigned DMA capable devices. When the guest
> > creates a mapping, it uses a hypercall to associate a bus address in that
> > range with a guest physical address. The hypervisor checks that the bus
> > address is within the allowed range, and translates the guest physical
> > address into a host physical address, then enters both into the I/O page
> > table or I/O TLB.
> 
> So when a VM is booted it is passed a device tree with that DMA window?

Correct.

> Given what you describe above this seems to be more of a configuration
> option to restrict the IOMMU to a subset of the physical memory for
> purposes of virtualization. So I agree that this wouldn't be a good fit
> for what we're trying to achieve with iommus or dma-ranges in this
> binding.

Thinking about it again now, I wonder if there are any other use cases
for windowed IOMMUs. If this is the only one, there might be no use
in the #address-cells model I suggested instead of your original
#iommu-cells.

> > > > I would like to add an explanation about dma-ranges to the binding:
> > > > 
> > > > 8<--------
> > > > The parent bus of the iommu must have a valid "dma-ranges" property
> > > > describing how the physical address space of the IOMMU maps into
> > > > memory.
> > > 
> > > With physical address space you mean the addresses after translation,
> > > not the I/O virtual addresses, right? But even so, how will this work
> > > when there are multiple IOMMU devices? What determines which IOMMU is
> > > mapped via which entry?
> > > 
> > > Perhaps having multiple IOMMUs implies that there will have to be some
> > > partitioning of the parent address space to make sure two IOMMUs don't
> > > translate to the same ranges?
> > 
> > These dma-ranges properties would almost always be for the entire RAM,
> > and we can treat anything else as a bug.
> 
> Would it typically be a 1:1 mapping? In that case could we define an
> empty dma-ranges property to mean exactly that? That would make it
> consistent with the ranges property.

Yes, I believe that is how it's already defined.

> > > > A device with an "iommus" property will ignore the "dma-ranges" property
> > > > of the parent node and rely on the IOMMU for translation instead.
> > > 
> > > Do we need to consider the case where an IOMMU listed in iommus isn't
> > > enabled (status = "disabled")? In that case presumably the device would
> > > either not function or may optionally continue to master onto the parent
> > > untranslated.
> > 
> > My reasoning was that the DT should specify whether we use the IOMMU
> > or not. Being able to just switch on or off the IOMMU sounds nice as
> > well, so we could change the text above to do that.
> > 
> > Another option would be to do this in the IOMMU code, basically
> > falling back to the IOMMU parent's dma-ranges property and using
> > linear dma_map_ops when that is disabled.
> 
> Yes, it should be trivial for the IOMMU core code to take care of this
> special case. Still I think it's worth mentioning it in the binding so
> that it's clearly specified.

Agreed.

> > > > Using an "iommus" property in bus device nodes with "dma-ranges"
> > > > specifying how child devices relate to the IOMMU is a possible extension
> > > > but is not recommended until this binding gets extended.
> > > 
> > > Just for my understanding, bus device nodes with iommus and dma-ranges
> > > properties could be equivalently written by explicitly moving the iommus
> > > properties into the child device nodes, right? In which case they should
> > > be the same as the other examples. So that concept is a convenience
> > > notation to reduce duplication, but doesn't fundamentally introduce any
> > > new concept.
> > 
> > The one case  where that doesn't work is PCI, because we don't list the
> > PCI devices in DT normally, and the iommus property would only exist
> > at the PCI host controller node.
> 
> But it could work in classic OpenFirmware where the device tree can be
> populated with the tree of PCI devices enumerated by OF. There are also
> devices that have a fixed configuration and where technically the PCI
> devices can be listed in the device tree. This is somewhat important if
> for example one PCI device is a GPIO controller and needs to be
> referenced by phandle from some other device.

Correct. The flaw of classic Open Firmware here was that it cannot
handle PCIe hotplug though, so we can never rely on the DT to
describe all devices.

	Arnd
Arnd Bergmann May 20, 2014, 8:40 p.m. UTC | #25
On Tuesday 20 May 2014 17:39:12 Dave Martin wrote:
> On Tue, May 20, 2014 at 04:26:59PM +0100, Will Deacon wrote:
> > On Tue, May 20, 2014 at 02:23:47PM +0100, Arnd Bergmann wrote:
> > > 	Bit#   33222222 22221111 11111100 00000000
> > > 	       10987654 32109876 54321098 76543210
> > > phys.hi cell:  npt000ss bbbbbbbb dddddfff rrrrrrrr
> > > phys.mid cell: hhhhhhhh hhhhhhhh hhhhhhhh hhhhhhhh
> > > phys.lo cell:  llllllll llllllll llllllll llllllll
> > > 
> > > where:
> > > n is 0 if the address is relocatable, 1 otherwise
> > > p is 1 if the addressable region is "prefetchable", 0 otherwise
> > > t is 1 if the address is aliased (for non-relocatable I/O),
> > >  below 1 MB (for Memory), or below 64 KB (for relocatable I/O).
> > > ss is the space code, denoting the address space
> > > bbbbbbbb is the 8-bit Bus Number
> > > ddddd is the 5-bit Device Number
> > > fff is the 3-bit Function Number
> > > rrrrrrrr is the 8-bit Register Number
> > > hh...hh is a 32-bit unsigned number
> > > ll...ll is a 32-bit unsigned number
> > > 
> > > We can ignore n, p, t and r here, and use the same format for a DMA
> > > address, then define an empty "dma-ranges" property. That would
> > > imply that using b/d/f is sufficient to identify each master at the
> > > iommu. Any device outside of the PCI host but connected to the same
> > > iommu can use the same notation to list the logical b/d/f that gets
> > > sent to the IOMMU in bus master transactions.
> > > 
> > > Do you think this is sufficient for the ARM SMMU, or do we need
> > > something beyond that?
> > 
> > I think it can define the common-cases for the existing implementations,
> > yes. I anticipate Stream-IDs becoming > 16-bit in the near future though,
> > so we'd need extra bits if we're describing other devices coming into the
> > SMMU.
> > 
> > Note that we already have a binding for the current SMMU driver, so I'm not
> > really in a position to shift over to a new binding until the next version of
> > the SMMU architecture comes along...
> 
> How much code relies on the meaning of the nptsbdf bits?

Not much at all, I think this was defined mostly for open firmware
client interfaces, which we don't use with FDT.

n, t and r are probably only used in PCI functions that are listed
in DT, not in the host bridge.

b/d/f I think is mostly used for the interrupt-maps property, when
describing hardwired interrupts.

p and s are used when setting up the translation for inbound MMIO
and PIO.

	Arnd
Thierry Reding May 21, 2014, 8:16 a.m. UTC | #26
On Tue, May 20, 2014 at 10:31:29PM +0200, Arnd Bergmann wrote:
> On Tuesday 20 May 2014 16:00:02 Thierry Reding wrote:
> > On Tue, May 20, 2014 at 03:34:46PM +0200, Arnd Bergmann wrote:
> > > On Tuesday 20 May 2014 15:17:43 Thierry Reding wrote:
> > > > On Tue, May 20, 2014 at 02:41:18PM +0200, Arnd Bergmann wrote:
> > > > > On Tuesday 20 May 2014 14:02:43 Thierry Reding wrote:
> > > > [...]
> > > > > > Couldn't a single-master IOMMU be windowed?
> > > > > 
> > > > > Ah, yes. That would actually be like an IBM pSeries, which has a windowed
> > > > > IOMMU but uses one window per virtual machine. In that case, the window could
> > > > > be a property of the iommu node though, rather than part of the address
> > > > > in the link.
> > > > 
> > > > Does that mean that the IOMMU has one statically configured window which
> > > > is the same for each virtual machine? That would require some other
> > > > mechanism to assign separate address spaces to each virtual machine,
> > > > wouldn't it? But I suspect that if the IOMMU allows that it could be
> > > > allocated dynamically at runtime.
> > > 
> > > The way it works on pSeries is that upon VM creation, the guest is assigned
> > > one 256MB window for use by assigned DMA capable devices. When the guest
> > > creates a mapping, it uses a hypercall to associate a bus address in that
> > > range with a guest physical address. The hypervisor checks that the bus
> > > address is within the allowed range, and translates the guest physical
> > > address into a host physical address, then enters both into the I/O page
> > > table or I/O TLB.
> > 
> > So when a VM is booted it is passed a device tree with that DMA window?
> 
> Correct.
> 
> > Given what you describe above this seems to be more of a configuration
> > option to restrict the IOMMU to a subset of the physical memory for
> > purposes of virtualization. So I agree that this wouldn't be a good fit
> > for what we're trying to achieve with iommus or dma-ranges in this
> > binding.
> 
> Thinking about it again now, I wonder if there are any other use cases
> for windowed IOMMUs. If this is the only one, there might be no use
> in the #address-cells model I suggested instead of your original
> #iommu-cells.

So in this case virtualization is the reason why we need the DMA window.
The reason for that is that the guest has no other way of knowing what
other guests might be using, so it's essentially a mechanism for the
host to manage the DMA region and allocate subregions for each guest. If
virtualization isn't an issue then it seems to me that the need for DMA
windows goes away because the operating system will track DMA regions
anyway.

The only reason I can think of why a windowed IOMMU would be useful is
to prevent two or more devices from stepping on each others' toes. But
that's a problem that the OS should already be handling during DMA
buffer allocation, isn't it?

> > > > > I would like to add an explanation about dma-ranges to the binding:
> > > > > 
> > > > > 8<--------
> > > > > The parent bus of the iommu must have a valid "dma-ranges" property
> > > > > describing how the physical address space of the IOMMU maps into
> > > > > memory.
> > > > 
> > > > With physical address space you mean the addresses after translation,
> > > > not the I/O virtual addresses, right? But even so, how will this work
> > > > when there are multiple IOMMU devices? What determines which IOMMU is
> > > > mapped via which entry?
> > > > 
> > > > Perhaps having multiple IOMMUs implies that there will have to be some
> > > > partitioning of the parent address space to make sure two IOMMUs don't
> > > > translate to the same ranges?
> > > 
> > > These dma-ranges properties would almost always be for the entire RAM,
> > > and we can treat anything else as a bug.
> > 
> > Would it typically be a 1:1 mapping? In that case could we define an
> > empty dma-ranges property to mean exactly that? That would make it
> > consistent with the ranges property.
> 
> Yes, I believe that is how it's already defined.

I've gone through the proposal at [0] again, but couldn't find a mention
of an empty "dma-ranges" property. But regardless I think that a 1:1
mapping is the obvious meaning of an empty "dma-ranges" property.

[0]: http://www.openfirmware.org/ofwg/proposals/Closed/Accepted/410-it.txt

One thing I'm not sure about is whether dma-ranges should be documented
in this binding at all. Since there's an accepted standard proposal it
would seem that it doesn't need to be specifically mentioned. One other
option would be to link to the above proposal from the binding and then
complement that with what an empty "dma-ranges" property means.

Or we could possible document this in a file along with other standard
properties. I don't think we currently do that for any properties, but
my concern is that there will always be a limited number of people
knowing about how such properties are supposed to work. If all of a
sudden all these people would disappear, everybody else would be left
with references to these properties but nowhere to look for their
meaning.

> > > > > A device with an "iommus" property will ignore the "dma-ranges" property
> > > > > of the parent node and rely on the IOMMU for translation instead.
> > > > 
> > > > Do we need to consider the case where an IOMMU listed in iommus isn't
> > > > enabled (status = "disabled")? In that case presumably the device would
> > > > either not function or may optionally continue to master onto the parent
> > > > untranslated.
> > > 
> > > My reasoning was that the DT should specify whether we use the IOMMU
> > > or not. Being able to just switch on or off the IOMMU sounds nice as
> > > well, so we could change the text above to do that.
> > > 
> > > Another option would be to do this in the IOMMU code, basically
> > > falling back to the IOMMU parent's dma-ranges property and using
> > > linear dma_map_ops when that is disabled.
> > 
> > Yes, it should be trivial for the IOMMU core code to take care of this
> > special case. Still I think it's worth mentioning it in the binding so
> > that it's clearly specified.
> 
> Agreed.

Okay, I have a new version of the binding that I think incorporates all
the changes discussed so far. It uses #address-cells and #size-cells to
define the length of the specifier, but if we decide against that it can
easily be changed again.

Thierry
Thierry Reding May 21, 2014, 8:26 a.m. UTC | #27
On Tue, May 20, 2014 at 10:26:12PM +0200, Arnd Bergmann wrote:
> On Tuesday 20 May 2014 16:24:59 Dave Martin wrote:
> > On Tue, May 20, 2014 at 02:41:18PM +0200, Arnd Bergmann wrote:
> > > On Tuesday 20 May 2014 14:02:43 Thierry Reding wrote:
[...]
> > > > Multiple-master IOMMU:
> > > > ----------------------
> > > > 
> > > > 	iommu {
> > > > 		/* the specifier represents the ID of the master */
> > > > 		#address-cells = <1>;
> > > > 		#size-cells = <0>;
> > 
> > How do we know the size of the input address to the IOMMU?  Do we
> > get cases for example where the IOMMU only accepts a 32-bit input
> > address, but some 64-bit capable masters are connected through it?
> 
> I was stuck on this question for a while before, but then I realized
> that it doesn't matter at all: It's the IOMMU driver itself that
> manages the address space, and it doesn't matter if a slave can
> address a larger range than the IOMMU can accept. If the IOMMU
> needs to deal with the opposite case (64-bit input addresses
> but a 32-bit master), that limitation can be put into the specifier.

Isn't this what DMA masks are for? Couldn't the IOMMU simply use the
master device's DMA mask to do the right thing here?

As such I don't think this information needs to be in device tree at
all. The DMA mask should typically be set by the driver in the first
place because it has knowledge about the capabilities of the device.
A different way of saying that is that the DMA mask is implied by the
device's compatible string.

> > For determining dma masks, it is the output address that it
> > important.  Santosh's code can probably be taught to handle this,
> > if given an additional traversal rule for following "iommus"
> > properties.  However, deploying an IOMMU whose output address size
> > is smaller than the 
> 
> Something seems to be missing here. I don't think we want to handle
> the case where the IOMMU output cannot the entire memory address
> space. If necessary, that would mean using both an IOMMU driver
> and swiotlb, but I think it's a reasonable assumption that hardware
> isn't /that/ crazy.

Similarily, should the IOMMU not be treated like any other device here?
Its DMA mask should determine what address range it can access.

Thierry
Arnd Bergmann May 21, 2014, 8:50 a.m. UTC | #28
On Wednesday 21 May 2014 10:26:11 Thierry Reding wrote:
> On Tue, May 20, 2014 at 10:26:12PM +0200, Arnd Bergmann wrote:
> > On Tuesday 20 May 2014 16:24:59 Dave Martin wrote:
> > > On Tue, May 20, 2014 at 02:41:18PM +0200, Arnd Bergmann wrote:
> > > > On Tuesday 20 May 2014 14:02:43 Thierry Reding wrote:
> [...]
> > > > > Multiple-master IOMMU:
> > > > > ----------------------
> > > > > 
> > > > > 	iommu {
> > > > > 		/* the specifier represents the ID of the master */
> > > > > 		#address-cells = <1>;
> > > > > 		#size-cells = <0>;
> > > 
> > > How do we know the size of the input address to the IOMMU?  Do we
> > > get cases for example where the IOMMU only accepts a 32-bit input
> > > address, but some 64-bit capable masters are connected through it?
> > 
> > I was stuck on this question for a while before, but then I realized
> > that it doesn't matter at all: It's the IOMMU driver itself that
> > manages the address space, and it doesn't matter if a slave can
> > address a larger range than the IOMMU can accept. If the IOMMU
> > needs to deal with the opposite case (64-bit input addresses
> > but a 32-bit master), that limitation can be put into the specifier.
> 
> Isn't this what DMA masks are for? Couldn't the IOMMU simply use the
> master device's DMA mask to do the right thing here?

Ah, yes. I guess that's the right way to do it.

> > > For determining dma masks, it is the output address that it
> > > important.  Santosh's code can probably be taught to handle this,
> > > if given an additional traversal rule for following "iommus"
> > > properties.  However, deploying an IOMMU whose output address size
> > > is smaller than the 
> > 
> > Something seems to be missing here. I don't think we want to handle
> > the case where the IOMMU output cannot the entire memory address
> > space. If necessary, that would mean using both an IOMMU driver
> > and swiotlb, but I think it's a reasonable assumption that hardware
> > isn't /that/ crazy.
> 
> Similarily, should the IOMMU not be treated like any other device here?
> Its DMA mask should determine what address range it can access.

Right. But for that we need a dma-ranges property in the parent of the
iommu, just so the mask can be set correctly and we don't have to
rely on the 32-bit fallback case.

	Arnd
Arnd Bergmann May 21, 2014, 8:54 a.m. UTC | #29
On Wednesday 21 May 2014 10:16:09 Thierry Reding wrote:
> On Tue, May 20, 2014 at 10:31:29PM +0200, Arnd Bergmann wrote:
> > On Tuesday 20 May 2014 16:00:02 Thierry Reding wrote:
> > > On Tue, May 20, 2014 at 03:34:46PM +0200, Arnd Bergmann wrote:
> > > > On Tuesday 20 May 2014 15:17:43 Thierry Reding wrote:
> > > > > On Tue, May 20, 2014 at 02:41:18PM +0200, Arnd Bergmann wrote:
> > > > > > On Tuesday 20 May 2014 14:02:43 Thierry Reding wrote:
> > > > > [...]
> > > > > > > Couldn't a single-master IOMMU be windowed?
> > > > > > 
> > > > > > Ah, yes. That would actually be like an IBM pSeries, which has a windowed
> > > > > > IOMMU but uses one window per virtual machine. In that case, the window could
> > > > > > be a property of the iommu node though, rather than part of the address
> > > > > > in the link.
> > > > > 
> > > > > Does that mean that the IOMMU has one statically configured window which
> > > > > is the same for each virtual machine? That would require some other
> > > > > mechanism to assign separate address spaces to each virtual machine,
> > > > > wouldn't it? But I suspect that if the IOMMU allows that it could be
> > > > > allocated dynamically at runtime.
> > > > 
> > > > The way it works on pSeries is that upon VM creation, the guest is assigned
> > > > one 256MB window for use by assigned DMA capable devices. When the guest
> > > > creates a mapping, it uses a hypercall to associate a bus address in that
> > > > range with a guest physical address. The hypervisor checks that the bus
> > > > address is within the allowed range, and translates the guest physical
> > > > address into a host physical address, then enters both into the I/O page
> > > > table or I/O TLB.
> > > 
> > > So when a VM is booted it is passed a device tree with that DMA window?
> > 
> > Correct.
> > 
> > > Given what you describe above this seems to be more of a configuration
> > > option to restrict the IOMMU to a subset of the physical memory for
> > > purposes of virtualization. So I agree that this wouldn't be a good fit
> > > for what we're trying to achieve with iommus or dma-ranges in this
> > > binding.
> > 
> > Thinking about it again now, I wonder if there are any other use cases
> > for windowed IOMMUs. If this is the only one, there might be no use
> > in the #address-cells model I suggested instead of your original
> > #iommu-cells.
> 
> So in this case virtualization is the reason why we need the DMA window.
> The reason for that is that the guest has no other way of knowing what
> other guests might be using, so it's essentially a mechanism for the
> host to manage the DMA region and allocate subregions for each guest. If
> virtualization isn't an issue then it seems to me that the need for DMA
> windows goes away because the operating system will track DMA regions
> anyway.
> 
> The only reason I can think of why a windowed IOMMU would be useful is
> to prevent two or more devices from stepping on each others' toes. But
> that's a problem that the OS should already be handling during DMA
> buffer allocation, isn't it?

Right. As long as we always unmap the buffers from the IOMMU after they
have stopped being in use, it's very unlikely that even a broken device
driver causes a DMA into some bus address that happens to be mapped for
another device.

> > > > > > I would like to add an explanation about dma-ranges to the binding:
> > > > > > 
> > > > > > 8<--------
> > > > > > The parent bus of the iommu must have a valid "dma-ranges" property
> > > > > > describing how the physical address space of the IOMMU maps into
> > > > > > memory.
> > > > > 
> > > > > With physical address space you mean the addresses after translation,
> > > > > not the I/O virtual addresses, right? But even so, how will this work
> > > > > when there are multiple IOMMU devices? What determines which IOMMU is
> > > > > mapped via which entry?
> > > > > 
> > > > > Perhaps having multiple IOMMUs implies that there will have to be some
> > > > > partitioning of the parent address space to make sure two IOMMUs don't
> > > > > translate to the same ranges?
> > > > 
> > > > These dma-ranges properties would almost always be for the entire RAM,
> > > > and we can treat anything else as a bug.
> > > 
> > > Would it typically be a 1:1 mapping? In that case could we define an
> > > empty dma-ranges property to mean exactly that? That would make it
> > > consistent with the ranges property.
> > 
> > Yes, I believe that is how it's already defined.
> 
> I've gone through the proposal at [0] again, but couldn't find a mention
> of an empty "dma-ranges" property. But regardless I think that a 1:1
> mapping is the obvious meaning of an empty "dma-ranges" property.
> 
> [0]: http://www.openfirmware.org/ofwg/proposals/Closed/Accepted/410-it.txt
> 
> One thing I'm not sure about is whether dma-ranges should be documented
> in this binding at all. Since there's an accepted standard proposal it
> would seem that it doesn't need to be specifically mentioned. One other
> option would be to link to the above proposal from the binding and then
> complement that with what an empty "dma-ranges" property means.
> 
> Or we could possible document this in a file along with other standard
> properties. I don't think we currently do that for any properties, but
> my concern is that there will always be a limited number of people
> knowing about how such properties are supposed to work. If all of a
> sudden all these people would disappear, everybody else would be left
> with references to these properties but nowhere to look for their
> meaning.

I think it makes sense to document how the standard dma-ranges
interacts with the new iommu binding, because it's not obvious
what happens if you have both together, or iommu without a parent
dma-ranges.

> > > > > > A device with an "iommus" property will ignore the "dma-ranges" property
> > > > > > of the parent node and rely on the IOMMU for translation instead.
> > > > > 
> > > > > Do we need to consider the case where an IOMMU listed in iommus isn't
> > > > > enabled (status = "disabled")? In that case presumably the device would
> > > > > either not function or may optionally continue to master onto the parent
> > > > > untranslated.
> > > > 
> > > > My reasoning was that the DT should specify whether we use the IOMMU
> > > > or not. Being able to just switch on or off the IOMMU sounds nice as
> > > > well, so we could change the text above to do that.
> > > > 
> > > > Another option would be to do this in the IOMMU code, basically
> > > > falling back to the IOMMU parent's dma-ranges property and using
> > > > linear dma_map_ops when that is disabled.
> > > 
> > > Yes, it should be trivial for the IOMMU core code to take care of this
> > > special case. Still I think it's worth mentioning it in the binding so
> > > that it's clearly specified.
> > 
> > Agreed.
> 
> Okay, I have a new version of the binding that I think incorporates all
> the changes discussed so far. It uses #address-cells and #size-cells to
> define the length of the specifier, but if we decide against that it can
> easily be changed again.

Ok.	

	Arnd
Thierry Reding May 21, 2014, 9 a.m. UTC | #30
On Wed, May 21, 2014 at 10:50:38AM +0200, Arnd Bergmann wrote:
> On Wednesday 21 May 2014 10:26:11 Thierry Reding wrote:
> > On Tue, May 20, 2014 at 10:26:12PM +0200, Arnd Bergmann wrote:
> > > On Tuesday 20 May 2014 16:24:59 Dave Martin wrote:
> > > > On Tue, May 20, 2014 at 02:41:18PM +0200, Arnd Bergmann wrote:
> > > > > On Tuesday 20 May 2014 14:02:43 Thierry Reding wrote:
> > [...]
> > > > > > Multiple-master IOMMU:
> > > > > > ----------------------
> > > > > > 
> > > > > > 	iommu {
> > > > > > 		/* the specifier represents the ID of the master */
> > > > > > 		#address-cells = <1>;
> > > > > > 		#size-cells = <0>;
> > > > 
> > > > How do we know the size of the input address to the IOMMU?  Do we
> > > > get cases for example where the IOMMU only accepts a 32-bit input
> > > > address, but some 64-bit capable masters are connected through it?
> > > 
> > > I was stuck on this question for a while before, but then I realized
> > > that it doesn't matter at all: It's the IOMMU driver itself that
> > > manages the address space, and it doesn't matter if a slave can
> > > address a larger range than the IOMMU can accept. If the IOMMU
> > > needs to deal with the opposite case (64-bit input addresses
> > > but a 32-bit master), that limitation can be put into the specifier.
> > 
> > Isn't this what DMA masks are for? Couldn't the IOMMU simply use the
> > master device's DMA mask to do the right thing here?
> 
> Ah, yes. I guess that's the right way to do it.
> 
> > > > For determining dma masks, it is the output address that it
> > > > important.  Santosh's code can probably be taught to handle this,
> > > > if given an additional traversal rule for following "iommus"
> > > > properties.  However, deploying an IOMMU whose output address size
> > > > is smaller than the 
> > > 
> > > Something seems to be missing here. I don't think we want to handle
> > > the case where the IOMMU output cannot the entire memory address
> > > space. If necessary, that would mean using both an IOMMU driver
> > > and swiotlb, but I think it's a reasonable assumption that hardware
> > > isn't /that/ crazy.
> > 
> > Similarily, should the IOMMU not be treated like any other device here?
> > Its DMA mask should determine what address range it can access.
> 
> Right. But for that we need a dma-ranges property in the parent of the
> iommu, just so the mask can be set correctly and we don't have to
> rely on the 32-bit fallback case.

Shouldn't the IOMMU driver be the one to set the DMA mask for the device
in exactly the same way that other drivers override the 32-bit default?

Thierry
Thierry Reding May 21, 2014, 9:02 a.m. UTC | #31
On Wed, May 21, 2014 at 10:54:42AM +0200, Arnd Bergmann wrote:
> On Wednesday 21 May 2014 10:16:09 Thierry Reding wrote:
> > On Tue, May 20, 2014 at 10:31:29PM +0200, Arnd Bergmann wrote:
> > > On Tuesday 20 May 2014 16:00:02 Thierry Reding wrote:
> > > > On Tue, May 20, 2014 at 03:34:46PM +0200, Arnd Bergmann wrote:
> > > > > On Tuesday 20 May 2014 15:17:43 Thierry Reding wrote:
> > > > > > On Tue, May 20, 2014 at 02:41:18PM +0200, Arnd Bergmann wrote:
> > > > > > > On Tuesday 20 May 2014 14:02:43 Thierry Reding wrote:
> > > > > > [...]
> > > > > > > > Couldn't a single-master IOMMU be windowed?
> > > > > > > 
> > > > > > > Ah, yes. That would actually be like an IBM pSeries, which has a windowed
> > > > > > > IOMMU but uses one window per virtual machine. In that case, the window could
> > > > > > > be a property of the iommu node though, rather than part of the address
> > > > > > > in the link.
> > > > > > 
> > > > > > Does that mean that the IOMMU has one statically configured window which
> > > > > > is the same for each virtual machine? That would require some other
> > > > > > mechanism to assign separate address spaces to each virtual machine,
> > > > > > wouldn't it? But I suspect that if the IOMMU allows that it could be
> > > > > > allocated dynamically at runtime.
> > > > > 
> > > > > The way it works on pSeries is that upon VM creation, the guest is assigned
> > > > > one 256MB window for use by assigned DMA capable devices. When the guest
> > > > > creates a mapping, it uses a hypercall to associate a bus address in that
> > > > > range with a guest physical address. The hypervisor checks that the bus
> > > > > address is within the allowed range, and translates the guest physical
> > > > > address into a host physical address, then enters both into the I/O page
> > > > > table or I/O TLB.
> > > > 
> > > > So when a VM is booted it is passed a device tree with that DMA window?
> > > 
> > > Correct.
> > > 
> > > > Given what you describe above this seems to be more of a configuration
> > > > option to restrict the IOMMU to a subset of the physical memory for
> > > > purposes of virtualization. So I agree that this wouldn't be a good fit
> > > > for what we're trying to achieve with iommus or dma-ranges in this
> > > > binding.
> > > 
> > > Thinking about it again now, I wonder if there are any other use cases
> > > for windowed IOMMUs. If this is the only one, there might be no use
> > > in the #address-cells model I suggested instead of your original
> > > #iommu-cells.
> > 
> > So in this case virtualization is the reason why we need the DMA window.
> > The reason for that is that the guest has no other way of knowing what
> > other guests might be using, so it's essentially a mechanism for the
> > host to manage the DMA region and allocate subregions for each guest. If
> > virtualization isn't an issue then it seems to me that the need for DMA
> > windows goes away because the operating system will track DMA regions
> > anyway.
> > 
> > The only reason I can think of why a windowed IOMMU would be useful is
> > to prevent two or more devices from stepping on each others' toes. But
> > that's a problem that the OS should already be handling during DMA
> > buffer allocation, isn't it?
> 
> Right. As long as we always unmap the buffers from the IOMMU after they
> have stopped being in use, it's very unlikely that even a broken device
> driver causes a DMA into some bus address that happens to be mapped for
> another device.

I think that if buffers remain mapped in the IOMMU when they have been
deallocated that should be considered a bug.

Thierry
Arnd Bergmann May 21, 2014, 9:32 a.m. UTC | #32
On Wednesday 21 May 2014 11:02:45 Thierry Reding wrote:
> On Wed, May 21, 2014 at 10:54:42AM +0200, Arnd Bergmann wrote:
>
> > Right. As long as we always unmap the buffers from the IOMMU after they
> > have stopped being in use, it's very unlikely that even a broken device
> > driver causes a DMA into some bus address that happens to be mapped for
> > another device.
> 
> I think that if buffers remain mapped in the IOMMU when they have been
> deallocated that should be considered a bug.

You could see it as a performance optimization in some cases, e.g. when you
cannot flush individual IOTLBs or doing so is expensive, and you just keep
assigning new addresses until every one has been used, then you do a global
IOTLB flush once. Obviously you have to maintain the IO page tables correctly.

	Arnd
Arnd Bergmann May 21, 2014, 9:36 a.m. UTC | #33
On Wednesday 21 May 2014 11:00:38 Thierry Reding wrote:
> On Wed, May 21, 2014 at 10:50:38AM +0200, Arnd Bergmann wrote:
> > On Wednesday 21 May 2014 10:26:11 Thierry Reding wrote:

> > > > > For determining dma masks, it is the output address that it
> > > > > important.  Santosh's code can probably be taught to handle this,
> > > > > if given an additional traversal rule for following "iommus"
> > > > > properties.  However, deploying an IOMMU whose output address size
> > > > > is smaller than the 
> > > > 
> > > > Something seems to be missing here. I don't think we want to handle
> > > > the case where the IOMMU output cannot the entire memory address
> > > > space. If necessary, that would mean using both an IOMMU driver
> > > > and swiotlb, but I think it's a reasonable assumption that hardware
> > > > isn't /that/ crazy.
> > > 
> > > Similarily, should the IOMMU not be treated like any other device here?
> > > Its DMA mask should determine what address range it can access.
> > 
> > Right. But for that we need a dma-ranges property in the parent of the
> > iommu, just so the mask can be set correctly and we don't have to
> > rely on the 32-bit fallback case.
> 
> Shouldn't the IOMMU driver be the one to set the DMA mask for the device
> in exactly the same way that other drivers override the 32-bit default?

The IOMMU driver could /ask/ for an appropriate mask based on its internal
design, but if you have an IOMMU with a 64-bit output address connected
to a 32-bit bus, that should fail.

Note that it's not obvious what the IOMMU's DMA mask actually means.
It clearly has to be the mask that is used for allocating the IO page
tables, but it wouldn't normally be used in the path that allocates
pages on behalf of a DMA master attached to the IOMMU, because that
allocation is performed by the code that looks at the other device's
dma mask.

	Arnd
Thierry Reding May 21, 2014, 10:50 a.m. UTC | #34
On Wed, May 21, 2014 at 11:36:32AM +0200, Arnd Bergmann wrote:
> On Wednesday 21 May 2014 11:00:38 Thierry Reding wrote:
> > On Wed, May 21, 2014 at 10:50:38AM +0200, Arnd Bergmann wrote:
> > > On Wednesday 21 May 2014 10:26:11 Thierry Reding wrote:
> 
> > > > > > For determining dma masks, it is the output address that it
> > > > > > important.  Santosh's code can probably be taught to handle this,
> > > > > > if given an additional traversal rule for following "iommus"
> > > > > > properties.  However, deploying an IOMMU whose output address size
> > > > > > is smaller than the 
> > > > > 
> > > > > Something seems to be missing here. I don't think we want to handle
> > > > > the case where the IOMMU output cannot the entire memory address
> > > > > space. If necessary, that would mean using both an IOMMU driver
> > > > > and swiotlb, but I think it's a reasonable assumption that hardware
> > > > > isn't /that/ crazy.
> > > > 
> > > > Similarily, should the IOMMU not be treated like any other device here?
> > > > Its DMA mask should determine what address range it can access.
> > > 
> > > Right. But for that we need a dma-ranges property in the parent of the
> > > iommu, just so the mask can be set correctly and we don't have to
> > > rely on the 32-bit fallback case.
> > 
> > Shouldn't the IOMMU driver be the one to set the DMA mask for the device
> > in exactly the same way that other drivers override the 32-bit default?
> 
> The IOMMU driver could /ask/ for an appropriate mask based on its internal
> design, but if you have an IOMMU with a 64-bit output address connected
> to a 32-bit bus, that should fail.

Are there real use-cases where that really happens? I guess if we need
that the correct thing would be to bitwise AND both the DMA mask of the
IOMMU device (as set by the driver) with that derived from the IOMMU's
parent bus' dma-ranges property.

> Note that it's not obvious what the IOMMU's DMA mask actually means.
> It clearly has to be the mask that is used for allocating the IO page
> tables, but it wouldn't normally be used in the path that allocates
> pages on behalf of a DMA master attached to the IOMMU, because that
> allocation is performed by the code that looks at the other device's
> dma mask.

Interesting. If a DMA buffer is allocated using the master's DMA mask
wouldn't that cause breakage if the IOMMU and master's DMA masks don't
match. It seems to me like the right thing to do for buffer allocation
is to use the IOMMU's DMA mask if a device uses the IOMMU for
translation and use the device's DMA mask when determining to what I/O
virtual address to map that buffer.

Obviously if we always assume that IOMMU hardware is sane and can always
access at least the whole memory then this isn't an issue. But what if a
device can do DMA to a 64-bit address space, but the IOMMU can only
address 32 bits. If the device's DMA mask is used for allocations, then
buffers could reside beyond the 4 GiB boundary that the IOMMU can
address, so effectively the IOMMU wouldn't be able to write to those
buffers.

Thierry
Arnd Bergmann May 21, 2014, 2:01 p.m. UTC | #35
On Wednesday 21 May 2014 12:50:38 Thierry Reding wrote:
> On Wed, May 21, 2014 at 11:36:32AM +0200, Arnd Bergmann wrote:
> > On Wednesday 21 May 2014 11:00:38 Thierry Reding wrote:
> > > On Wed, May 21, 2014 at 10:50:38AM +0200, Arnd Bergmann wrote:
> > > > On Wednesday 21 May 2014 10:26:11 Thierry Reding wrote:
> > 
> > > > > > > For determining dma masks, it is the output address that it
> > > > > > > important.  Santosh's code can probably be taught to handle this,
> > > > > > > if given an additional traversal rule for following "iommus"
> > > > > > > properties.  However, deploying an IOMMU whose output address size
> > > > > > > is smaller than the 
> > > > > > 
> > > > > > Something seems to be missing here. I don't think we want to handle
> > > > > > the case where the IOMMU output cannot the entire memory address
> > > > > > space. If necessary, that would mean using both an IOMMU driver
> > > > > > and swiotlb, but I think it's a reasonable assumption that hardware
> > > > > > isn't /that/ crazy.
> > > > > 
> > > > > Similarily, should the IOMMU not be treated like any other device here?
> > > > > Its DMA mask should determine what address range it can access.
> > > > 
> > > > Right. But for that we need a dma-ranges property in the parent of the
> > > > iommu, just so the mask can be set correctly and we don't have to
> > > > rely on the 32-bit fallback case.
> > > 
> > > Shouldn't the IOMMU driver be the one to set the DMA mask for the device
> > > in exactly the same way that other drivers override the 32-bit default?
> > 
> > The IOMMU driver could /ask/ for an appropriate mask based on its internal
> > design, but if you have an IOMMU with a 64-bit output address connected
> > to a 32-bit bus, that should fail.
> 
> Are there real use-cases where that really happens? I guess if we need
> that the correct thing would be to bitwise AND both the DMA mask of the
> IOMMU device (as set by the driver) with that derived from the IOMMU's
> parent bus' dma-ranges property.

It would be unusual for an IOMMU to need this, but it's how the DMA
mask is supposed to work for normal devices. As mentioned before, I
would probably just error out if we ever encounter such an IOMMU.

> > Note that it's not obvious what the IOMMU's DMA mask actually means.
> > It clearly has to be the mask that is used for allocating the IO page
> > tables, but it wouldn't normally be used in the path that allocates
> > pages on behalf of a DMA master attached to the IOMMU, because that
> > allocation is performed by the code that looks at the other device's
> > dma mask.
> 
> Interesting. If a DMA buffer is allocated using the master's DMA mask
> wouldn't that cause breakage if the IOMMU and master's DMA masks don't
> match. It seems to me like the right thing to do for buffer allocation
> is to use the IOMMU's DMA mask if a device uses the IOMMU for
> translation and use the device's DMA mask when determining to what I/O
> virtual address to map that buffer.

Unfortunately not all code agrees regarding how dma mask is actually
interpreted. The most important use is within the dma_map_ops, and
that is aware of the IOMMU. The dma_map_ops use that to decide what
IOVA (bus address) to generate that is usable for the device, normally
this would be a 32-bit range.

When driver code looks at the dma mask of the device itself to make
an allocation decision without taking the IOMMU or swiotlb into
account, things can indeed go wrong.

Russell has recently done a good cleanup of various issues around
dma masks, and I can't find any drivers that get this wrong.
However, there is an issue with the two or three subsystems using
"PCI_DMA_BUS_IS_PHYS" to decide how they should treat high buffers
coming from user space that get passed to hardware.

If the SCSI layer or the network layer find the horribly misnamed
PCI_DMA_BUS_IS_PHYS (which is hardcoded to "1" on ARM32), they
will create copy in low memory for any data that is above
the device dma_mask (SCSI) or above max_low_pfn (network).

This is not normally a bug, and won't hurt for the swiotlb case,
but will give us worse performance for the IOMMU case, and
we should probably change this code to calculate the boundary
per device by calling a function from dma_map_ops.

We also really need to implement swiotlb support on ARM32 to deal
with any other device (besides SCSI and network) that does not
have an IOMMU but wants to use the streaming DMA API on pages
outside of the dma_mask. We already have this case on shmobile.

> Obviously if we always assume that IOMMU hardware is sane and can always
> access at least the whole memory then this isn't an issue. But what if a
> device can do DMA to a 64-bit address space, but the IOMMU can only
> address 32 bits. If the device's DMA mask is used for allocations, then
> buffers could reside beyond the 4 GiB boundary that the IOMMU can
> address, so effectively the IOMMU wouldn't be able to write to those
> buffers.

The mask of the device is not even an issue here, it's more the general
case of passing a buffer outside of the IOMMU's upstream bus DMA mask
into a driver connected to the IOMMU.

	Arnd
Grant Grundler May 21, 2014, 3:44 p.m. UTC | #36
On Wed, May 21, 2014 at 2:32 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Wednesday 21 May 2014 11:02:45 Thierry Reding wrote:
>> On Wed, May 21, 2014 at 10:54:42AM +0200, Arnd Bergmann wrote:
>>
>> > Right. As long as we always unmap the buffers from the IOMMU after they
>> > have stopped being in use, it's very unlikely that even a broken device
>> > driver causes a DMA into some bus address that happens to be mapped for
>> > another device.
>>
>> I think that if buffers remain mapped in the IOMMU when they have been
>> deallocated that should be considered a bug.

There is currently no general requirement to tear down mappings immediately.
An option to enforce immediate tear down might be useful since I agree
Virtual Guests sharing host memory through shared physical devices
will want that. ie there should be no opportunity for a shared device
to be able to access another guests memory through "stale" (but live)
DMA mappings. I don't think that's the case but I'd ask someone like
Alex Williamson for a more certain answer.

> You could see it as a performance optimization in some cases, e.g. when you
> cannot flush individual IOTLBs or doing so is expensive, and you just keep
> assigning new addresses until every one has been used, then you do a global
> IOTLB flush once.

This is in fact exactly what several IOMMUs do - including the Intel
IOMMU driver.

IIRC, the Intel IOMMU driver collects 256 or so  entries to flush and
then flushes them all at once and updates the metadata. I don't if it
does this for ranges assigned to different virtual guests though or
just within a given range.

> Obviously you have to maintain the IO page tables correctly.

To be clear, "Correctly" in this case just means until the IOTLB is
flushed, the given IOMMU Pdir entries are marked "in use" even though
the driver has handed "ownership" back to the IOMMU driver.

cheers,
grant
Arnd Bergmann May 21, 2014, 4:01 p.m. UTC | #37
On Wednesday 21 May 2014 08:44:42 Grant Grundler wrote:
> On Wed, May 21, 2014 at 2:32 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Wednesday 21 May 2014 11:02:45 Thierry Reding wrote:
> >> On Wed, May 21, 2014 at 10:54:42AM +0200, Arnd Bergmann wrote:
> >>
> >> > Right. As long as we always unmap the buffers from the IOMMU after they
> >> > have stopped being in use, it's very unlikely that even a broken device
> >> > driver causes a DMA into some bus address that happens to be mapped for
> >> > another device.
> >>
> >> I think that if buffers remain mapped in the IOMMU when they have been
> >> deallocated that should be considered a bug.
> 
> There is currently no general requirement to tear down mappings immediately.
> An option to enforce immediate tear down might be useful since I agree
> Virtual Guests sharing host memory through shared physical devices
> will want that. ie there should be no opportunity for a shared device
> to be able to access another guests memory through "stale" (but live)
> DMA mappings. I don't think that's the case but I'd ask someone like
> Alex Williamson for a more certain answer.

I believe powerpc has a boot-time option to enforce the immediate
IOTLB flush, which is very useful for device driver debugging when
something goes wrong with stale DMAs.

> > Obviously you have to maintain the IO page tables correctly.
> 
> To be clear, "Correctly" in this case just means until the IOTLB is
> flushed, the given IOMMU Pdir entries are marked "in use" even though
> the driver has handed "ownership" back to the IOMMU driver.

I don't know what a Pdir is, but I guess strictly speaking we have
to ensure that all IO page table entries that have been unmapped by
a driver are marked as invalid at least by the time the IOTLB is flushed,
plus we have to flush each entry before it gets reused for a different
page.

	Arnd
Dave Martin May 21, 2014, 5:09 p.m. UTC | #38
On Wed, May 21, 2014 at 11:00:38AM +0200, Thierry Reding wrote:
> On Wed, May 21, 2014 at 10:50:38AM +0200, Arnd Bergmann wrote:
> > On Wednesday 21 May 2014 10:26:11 Thierry Reding wrote:
> > > On Tue, May 20, 2014 at 10:26:12PM +0200, Arnd Bergmann wrote:
> > > > On Tuesday 20 May 2014 16:24:59 Dave Martin wrote:
> > > > > On Tue, May 20, 2014 at 02:41:18PM +0200, Arnd Bergmann wrote:
> > > > > > On Tuesday 20 May 2014 14:02:43 Thierry Reding wrote:
> > > [...]
> > > > > > > Multiple-master IOMMU:
> > > > > > > ----------------------
> > > > > > > 
> > > > > > > 	iommu {
> > > > > > > 		/* the specifier represents the ID of the master */
> > > > > > > 		#address-cells = <1>;
> > > > > > > 		#size-cells = <0>;
> > > > > 
> > > > > How do we know the size of the input address to the IOMMU?  Do we
> > > > > get cases for example where the IOMMU only accepts a 32-bit input
> > > > > address, but some 64-bit capable masters are connected through it?
> > > > 
> > > > I was stuck on this question for a while before, but then I realized
> > > > that it doesn't matter at all: It's the IOMMU driver itself that
> > > > manages the address space, and it doesn't matter if a slave can
> > > > address a larger range than the IOMMU can accept. If the IOMMU
> > > > needs to deal with the opposite case (64-bit input addresses
> > > > but a 32-bit master), that limitation can be put into the specifier.
> > > 
> > > Isn't this what DMA masks are for? Couldn't the IOMMU simply use the
> > > master device's DMA mask to do the right thing here?
> > 
> > Ah, yes. I guess that's the right way to do it.
> > 
> > > > > For determining dma masks, it is the output address that it
> > > > > important.  Santosh's code can probably be taught to handle this,
> > > > > if given an additional traversal rule for following "iommus"
> > > > > properties.  However, deploying an IOMMU whose output address size
> > > > > is smaller than the 
> > > > 
> > > > Something seems to be missing here. I don't think we want to handle
> > > > the case where the IOMMU output cannot the entire memory address
> > > > space. If necessary, that would mean using both an IOMMU driver
> > > > and swiotlb, but I think it's a reasonable assumption that hardware
> > > > isn't /that/ crazy.
> > > 
> > > Similarily, should the IOMMU not be treated like any other device here?
> > > Its DMA mask should determine what address range it can access.
> > 
> > Right. But for that we need a dma-ranges property in the parent of the
> > iommu, just so the mask can be set correctly and we don't have to
> > rely on the 32-bit fallback case.
> 
> Shouldn't the IOMMU driver be the one to set the DMA mask for the device
> in exactly the same way that other drivers override the 32-bit default?
> 

Are we confusing the "next-hop DMA mask" with the "end-to-end DMA mask"
here?  The device has a next-hop mask which may be non-trivial.  The
IOMMU also has a next-hop mask and/or remapping, but I think we agree
that in sensible systems that will be trivial.  There might be other
non-trivial remappings between the IOMMU and memory (but again, not
in the common case).

If we just use the same name for all these, we are liable to get
confused.

To answer the question "what memory can the kernel allocate for DMA
with this device", it is the end-to-end transformation that is
important.

Cheers
---Dave
Arnd Bergmann May 21, 2014, 6:11 p.m. UTC | #39
On Wednesday 21 May 2014 18:09:54 Dave Martin wrote:
> On Wed, May 21, 2014 at 11:00:38AM +0200, Thierry Reding wrote:
> > On Wed, May 21, 2014 at 10:50:38AM +0200, Arnd Bergmann wrote:
> > > On Wednesday 21 May 2014 10:26:11 Thierry Reding wrote:
> > > > 
> > > > Similarily, should the IOMMU not be treated like any other device here?
> > > > Its DMA mask should determine what address range it can access.
> > > 
> > > Right. But for that we need a dma-ranges property in the parent of the
> > > iommu, just so the mask can be set correctly and we don't have to
> > > rely on the 32-bit fallback case.
> > 
> > Shouldn't the IOMMU driver be the one to set the DMA mask for the device
> > in exactly the same way that other drivers override the 32-bit default?
> > 
> 
> Are we confusing the "next-hop DMA mask" with the "end-to-end DMA mask"
> here?  The device has a next-hop mask which may be non-trivial.  The
> IOMMU also has a next-hop mask and/or remapping, but I think we agree
> that in sensible systems that will be trivial.  There might be other
> non-trivial remappings between the IOMMU and memory (but again, not
> in the common case).
> 
> If we just use the same name for all these, we are liable to get
> confused.
> 
> To answer the question "what memory can the kernel allocate for DMA
> with this device", it is the end-to-end transformation that is
> important.

Yes, but that is not the same as the dma mask of the device. The DMA
mask gets set by the device according to its capabilities, and may
get limited by what the bus to either memory or to the iommu can do,
if one is in use.

Without an IOMMU, the mask is used for allocations, with an IOMMU,
The iommu code makes the decision what to allocate without taking
dev->dma_mask into account.

As mentioned, this is currently not handled well for SCSI and network,
where we use a smaller mask than necessary and can end up copying
data.

	Arnd
diff mbox

Patch

diff --git a/Documentation/devicetree/bindings/iommu/iommu.txt b/Documentation/devicetree/bindings/iommu/iommu.txt
new file mode 100644
index 000000000000..2d67b52b656e
--- /dev/null
+++ b/Documentation/devicetree/bindings/iommu/iommu.txt
@@ -0,0 +1,109 @@ 
+This document describes the generic device tree binding for IOMMUs and their
+master(s).
+
+
+IOMMU device node:
+==================
+
+An IOMMU can provide the following services:
+
+* Remap address space to allow devices to access physical memory ranges that
+  they otherwise wouldn't be capable of accessing.
+
+  Example: 32-bit DMA to 64-bit physical addresses
+
+* Implement scatter-gather at page level granularity so that the device does
+  not have to.
+
+* Provide system protection against "rogue" DMA by forcing all accesses to go
+  through the IOMMU and faulting when encountering accesses to unmapped
+  address regions.
+
+* Provide address space isolation between multiple contexts.
+
+  Example: Virtualization
+
+Device nodes compatible with this binding represent hardware with some of the
+above capabilities.
+
+IOMMUs can be single-master or multiple-master. Single-master IOMMU devices
+typically have a fixed association to the master device, whereas multiple-
+master IOMMU devices can translate accesses from more than one master.
+
+Required properties:
+--------------------
+- #iommu-cells: The number of cells in an IOMMU specifier. The meaning of the
+  cells is defined by the binding for the IOMMU device.
+
+  Typical values include:
+  * 0: Single-master IOMMU devices are often not configurable, therefore the
+    specifying doesn't need to encode any information and can be empty.
+
+  * 1: Multiple-master IOMMU devices need to know for which master they should
+    enable translation. Typically the single cell in the specifier corresponds
+    to the master device's ID.
+
+
+IOMMU master node:
+==================
+
+Devices that access memory through an IOMMU are called masters. A device can
+have multiple master interfaces (to one or more IOMMU devices).
+
+Required properties:
+--------------------
+- iommus: A list of phandle and IOMMU specifier pairs that describe the IOMMU
+  master interfaces of the device. One entry in the list describes one master
+  interface of the device.
+
+Optional properties:
+--------------------
+- iommu-names: A list of names identifying each entry in the iommus property.
+
+
+Examples:
+=========
+
+Single-master IOMMU:
+--------------------
+
+	iommu {
+		#iommu-cells = <0>;
+	};
+
+	master {
+		iommu = <&/iommu>;
+	};
+
+Multi-master IOMMU:
+-------------------
+
+	iommu {
+		/* the specifier represents the ID of the master */
+		#iommu-cells = <1>;
+	};
+
+	master {
+		/* device has master ID 42 in the IOMMU */
+		iommu = <&/iommu 42>;
+	};
+
+Multi-master device:
+--------------------
+
+	/* single-master IOMMU */
+	iommu@1 {
+		#iommu-cells = <0>;
+	};
+
+	/* multi-master IOMMU */
+	iommu@2 {
+		/* the specifier represents the ID of the master */
+		#iommu-cells = <1>;
+	};
+
+	/* device with two master interfaces */
+	master {
+		iommus = <&/iommu@1>,    /* master of the single-master IOMMU */
+			 <&/iommu@2 42>; /* ID 42 in multi-master IOMMU */
+	};