
[v5,2/3] arm64: Add IOMMU dma_ops

Message ID 8a5abd0a9929aae160ccb74d7a8d9c3698f61910.1438362603.git.robin.murphy@arm.com (mailing list archive)
State New, archived

Commit Message

Robin Murphy July 31, 2015, 5:18 p.m. UTC
Taking some inspiration from the arch/arm code, implement the
arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.

Unfortunately the device setup code has to start out as a big ugly mess
in order to work usefully right now, as 'proper' operation depends on
changes to device probe and DMA configuration ordering, IOMMU groups for
platform devices, and default domain support in arm/arm64 IOMMU drivers.
The workarounds here need only exist until that work is finished.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 arch/arm64/mm/dma-mapping.c | 425 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 425 insertions(+)

Comments

Catalin Marinas Aug. 3, 2015, 5:33 p.m. UTC | #1
On Fri, Jul 31, 2015 at 06:18:28PM +0100, Robin Murphy wrote:
> Taking some inspiration from the arch/arm code, implement the
> arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.
> 
> Unfortunately the device setup code has to start out as a big ugly mess
> in order to work usefully right now, as 'proper' operation depends on
> changes to device probe and DMA configuration ordering, IOMMU groups for
> platform devices, and default domain support in arm/arm64 IOMMU drivers.
> The workarounds here need only exist until that work is finished.
> 
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Joerg Roedel Aug. 7, 2015, 8:52 a.m. UTC | #2
On Fri, Jul 31, 2015 at 06:18:28PM +0100, Robin Murphy wrote:
> +/*
> + * TODO: Right now __iommu_setup_dma_ops() gets called too early to do
> + * everything it needs to - the device isn't yet fully created, and the
> + * IOMMU driver hasn't seen it yet, so we need this delayed attachment
> + * dance. Once IOMMU probe ordering is sorted to move the
> + * arch_setup_dma_ops() call later, all the notifier bits below become
> + * unnecessary, and will go away.
> + */
> +struct iommu_dma_notifier_data {
> +	struct list_head list;
> +	struct device *dev;
> +	struct iommu_domain *dma_domain;
> +};
> +static LIST_HEAD(iommu_dma_masters);
> +static DEFINE_MUTEX(iommu_dma_notifier_lock);

Ugh, that's incredibly ugly. Why can't you do the setup work when the
iommu driver sees the device? Just call the dma-api setup functions
there (like the x86 iommu drivers do) and be done without any
notifiers.

> +static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
> +				  const struct iommu_ops *ops)
> +{
> +	struct iommu_domain *domain;
> +	int err;
> +
> +	if (!ops)
> +		return;
> +	/*
> +	 * In a perfect world, everything happened in the right order up to
> +	 * here, and the IOMMU core has already attached the device to an
> +	 * appropriate default domain for us to set up...
> +	 */
> +	domain = iommu_get_domain_for_dev(dev);
> +	if (!domain) {
> +		/*
> +		 * Urgh. Reliable default domains for platform devices can't
> +		 * happen anyway without some sensible way of handling
> +		 * non-trivial groups. So until then, HORRIBLE HACKS!
> +		 */

I don't get this, what is preventing to rely on default domains here?

> +		domain = ops->domain_alloc(IOMMU_DOMAIN_DMA);

The IOMMU core should have already tried to allocate an IOMMU_DOMAIN_DMA
type domain. No need to try this again here.



	Joerg
Robin Murphy Aug. 7, 2015, 3:27 p.m. UTC | #3
On 07/08/15 09:52, Joerg Roedel wrote:
> On Fri, Jul 31, 2015 at 06:18:28PM +0100, Robin Murphy wrote:
>> +/*
>> + * TODO: Right now __iommu_setup_dma_ops() gets called too early to do
>> + * everything it needs to - the device isn't yet fully created, and the
>> + * IOMMU driver hasn't seen it yet, so we need this delayed attachment
>> + * dance. Once IOMMU probe ordering is sorted to move the
>> + * arch_setup_dma_ops() call later, all the notifier bits below become
>> + * unnecessary, and will go away.
>> + */
>> +struct iommu_dma_notifier_data {
>> +	struct list_head list;
>> +	struct device *dev;
>> +	struct iommu_domain *dma_domain;
>> +};
>> +static LIST_HEAD(iommu_dma_masters);
>> +static DEFINE_MUTEX(iommu_dma_notifier_lock);
>
> Ugh, that's incredibly ugly. Why can't you do the setup work when the
> iommu driver sees the device? Just call the dma-api setup functions
> there (like the x86 iommu drivers do) and be done without any
> notifiers.

As per the comments, the issue here lies in the order in which the 
OF/driver core code currently calls things for platform devices: as it 
stands we can't attach the device to a domain in arch_setup_dma_ops() 
because it doesn't have a group, and we can't even add it to a group 
ourselves because it isn't fully created and doesn't exist in sysfs yet. 
The only reason arch/arm is currently getting away without this 
workaround is that the few IOMMU drivers there hooked up to the generic 
infrastructure don't actually mind that they get an attach_device from 
arch_setup_dma_ops() before they've even seen an add_device (largely 
because they don't care about groups).

Laurent's probe-deferral series largely solves these problems in the 
right place - adding identical boilerplate code to every IOMMU driver to 
do something they shouldn't have to know about (and don't necessarily 
have all the right information for) is exactly what we don't want to do. 
As discussed over on another thread, I'm planning to pick that series up 
and polish it off after this, but my top priority is getting the basic 
dma_ops functionality into arm64 that people need right now. I will be 
only too happy when I can submit the patch removing this notifier 
workaround ;)

>> +static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
>> +				  const struct iommu_ops *ops)
>> +{
>> +	struct iommu_domain *domain;
>> +	int err;
>> +
>> +	if (!ops)
>> +		return;
>> +	/*
>> +	 * In a perfect world, everything happened in the right order up to
>> +	 * here, and the IOMMU core has already attached the device to an
>> +	 * appropriate default domain for us to set up...
>> +	 */
>> +	domain = iommu_get_domain_for_dev(dev);
>> +	if (!domain) {
>> +		/*
>> +		 * Urgh. Reliable default domains for platform devices can't
>> +		 * happen anyway without some sensible way of handling
>> +		 * non-trivial groups. So until then, HORRIBLE HACKS!
>> +		 */
>
> I don't get this, what is preventing to rely on default domains here?

No driver other than the AMD IOMMU has any support yet. Support for 
IOMMU_DOMAIN_DMA can easily be added to existing drivers based on patch 
1 of this series, but more generally it's not entirely clear how default 
domains are going to work beyond x86. On systems like Juno or Seattle 
with different sets of masters behind different IOMMU instances (with 
limited domain contexts each), the most sensible approach would be to 
have a single default domain per IOMMU (spanning domains across 
instances would need some hideous synchronisation logic for some 
implementations), but the current domain_alloc interface gives us no way 
to do that. On something like RK3288 with two different types of IOMMU 
on the platform "bus", it breaks down even further as there's no way to 
guarantee that iommu_domain_alloc() even gives you a domain from the 
right *driver* (hence bypassing it by going through ops directly here).
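
(For reference, the only allocation interface the core offers is per-bus,

	struct iommu_domain *iommu_domain_alloc(struct bus_type *bus);

so with two different IOMMU drivers both registered against
platform_bus_type there is no way to ask for a domain from a particular
instance, or even from a particular driver.)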

>
>> +		domain = ops->domain_alloc(IOMMU_DOMAIN_DMA);
>
> The IOMMU core should have already tried to allocate an IOMMU_DOMAIN_DMA
> type domain. No need to try this again here.

Only for PCI devices, via iommu_group_get_for_pci_dev(). The code here, 
however, only runs for platform devices - ops will be always null for a 
PCI device since of_iommu_configure() will have bailed out (see the 
silly warning removed by my patch you picked up the other day). Once 
iommu_group_get_for_dev() supports platform devices, this too can go 
away. In the meantime if someone adds PCI support to 
of_iommu_configure() and IOMMU_DOMAIN_DMA support to their IOMMU driver, 
then we'll get a domain back from iommu_get_domain_for_dev() and just 
use that.

Robin.
Joerg Roedel Aug. 11, 2015, 9:49 a.m. UTC | #4
On Fri, Aug 07, 2015 at 04:27:56PM +0100, Robin Murphy wrote:
> As per the comments, the issue here lies in the order in which the
> OF/driver core code currently calls things for platform devices: as
> it stands we can't attach the device to a domain in
> arch_setup_dma_ops() because it doesn't have a group, and we can't
> even add it to a group ourselves because it isn't fully created and
> doesn't exist in sysfs yet. The only reason arch/arm is currently
> getting away without this workaround is that the few IOMMU drivers
> there hooked up to the generic infrastructure don't actually mind
> that they get an attach_device from arch_setup_dma_ops() before
> they've even seen an add_device (largely because they don't care
> about groups).

Hmm, what about just registering the iommu-driver in arch_setup_dma_ops
with bus_set_iommu and not caring about devices? The code will iterate
over the devices already on the bus and try to attach them to the
iommu driver. But as you said, the devices are not yet on the bus, so
when a device is later added by the OF/driver core code you can do
everything needed for the device in the add_device call-back.

This might include initializing the hardware iommu needed for the
device and setting its per-device dma_ops.
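
For illustration, the add_device side of that might look roughly like
the sketch below. This is purely hypothetical driver code, not from this
series: the my_* names are invented, my_iommu_device_is_master() is a
made-up helper, iommu_dma_ops would have to be exported by the arch, and
the default domain is assumed to be allocated elsewhere by the driver.

#include <linux/iommu.h>
#include <linux/device.h>

/*
 * Hypothetical sketch only: per-device DMA setup done entirely from the
 * IOMMU driver's add_device callback, along the lines suggested above.
 */
static struct iommu_domain *my_default_domain;	/* allocated at probe time */

static int my_iommu_add_device(struct device *dev)
{
	int ret;

	/* Hypothetical check that this device masters through this IOMMU */
	if (!my_iommu_device_is_master(dev))
		return -ENODEV;

	ret = iommu_attach_device(my_default_domain, dev);
	if (ret)
		return ret;

	/* Per-device dma_ops, as the arm64 notifier below ends up doing */
	dev->archdata.dma_ops = &iommu_dma_ops;
	return 0;
}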

> Laurent's probe-deferral series largely solves these problems in the
> right place - adding identical boilerplate code to every IOMMU
> driver to do something they shouldn't have to know about (and don't
> necessarily have all the right information for) is exactly what we
> don't want to do. As discussed over on another thread, I'm planning
> to pick that series up and polish it off after this, but my top
> priority is getting the basic dma_ops functionality into arm64 that
> people need right now. I will be only too happy when I can submit
> the patch removing this notifier workaround ;)

I've often experienced that someone promises to fix things later but
then it doesn't happen, so please understand that I am pretty
cautious about adding such hacks ;)

> No driver other than the AMD IOMMU has any support yet. Support for
> IOMMU_DOMAIN_DMA can easily be added to existing drivers based on
> patch 1 of this series, but more generally it's not entirely clear
> how default domains are going to work beyond x86. On systems like
> Juno or Seattle with different sets of masters behind different
> IOMMU instances (with limited domain contexts each), the most
> sensible approach would be to have a single default domain per IOMMU
> (spanning domains across instances would need some hideous
> synchronisation logic for some implementations), but the current
> domain_alloc interface gives us no way to do that. On something like
> RK3288 with two different types of IOMMU on the platform "bus", it
> breaks down even further as there's no way to guarantee that
> iommu_domain_alloc() even gives you a domain from the right *driver*
> (hence bypassing it by going through ops directly here).

Default domain allocation comes down to how the bus organizes its
iommu-groups. For every group (at least in its current design) a default
domain is allocated. And a group is typically only behind a single iommu
instance.

> Only for PCI devices, via iommu_group_get_for_pci_dev(). The code
> here, however, only runs for platform devices - ops will be always
> null for a PCI device since of_iommu_configure() will have bailed
> out (see the silly warning removed by my patch you picked up the
> other day). Once iommu_group_get_for_dev() supports platform
> devices, this too can go away. In the meantime if someone adds PCI
> support to of_iommu_configure() and IOMMU_DOMAIN_DMA support to
> their IOMMU driver, then we'll get a domain back from
> iommu_get_domain_for_dev() and just use that.

What is the plan for getting an iommu-groups implementation for the
platform bus?


	Joerg
Robin Murphy Aug. 11, 2015, 8:15 p.m. UTC | #5
Hi Joerg,

On 11/08/15 10:49, Joerg Roedel wrote:
> On Fri, Aug 07, 2015 at 04:27:56PM +0100, Robin Murphy wrote:
>> As per the comments, the issue here lies in the order in which the
>> OF/driver core code currently calls things for platform devices: as
>> it stands we can't attach the device to a domain in
>> arch_setup_dma_ops() because it doesn't have a group, and we can't
>> even add it to a group ourselves because it isn't fully created and
>> doesn't exist in sysfs yet. The only reason arch/arm is currently
>> getting away without this workaround is that the few IOMMU drivers
>> there hooked up to the generic infrastructure don't actually mind
>> that they get an attach_device from arch_setup_dma_ops() before
>> they've even seen an add_device (largely because they don't care
>> about groups).
>
> Hmm, what about just registering the iommu-driver in arch_setup_dma_ops
> with bus_set_iommu and not caring about devices? The code will iterate
> over the devices already on the bus and try to attach them to the
> iommu driver. But as you said, the devices are not yet on the bus, so
> when a device is later added by the OF/driver core code you can do
> everything needed for the device in the add_device call-back.

The main problem here is that "the bus" turns out to be a nebulous and 
poorly-defined thing. "The PCI bus" breaks down as soon as you have 
multiple host controllers - you could quite easily build a SoC/chipset 
where not all the host controllers/root complexes are behind IOMMUs, or 
some are behind different types of IOMMU, and then what? The "platform 
bus" is just a glorified holding pen for device structures and utterly 
useless as any kind of abstraction.

Since I'm not concerning myself with PCI at the moment; considering the 
perspective of "the" IOMMU driver on the platform bus (assuming said 
driver is up and running first, either via OF_DECLARE or deferred 
probing of masters), the issues with relying on the add_device callback 
from the bus notifier alone are:

1) We get these callbacks for everything - devices behind one of our 
IOMMUs, devices behind different IOMMUs, devices with no IOMMU in their 
path at all - with nothing but a struct device pointer to try to 
distinguish one from the other.

2) Even for the devices actually relevant to us, we don't easily have 
the details we need to be able to set it up. In the PCI case, it's 
simple because the device has one bus ID which can trivially be looked 
up by common code; if you have driver-specific ACPI tables that identify 
platform devices and give them a single ID that you can kind of handle 
like a PCI device, fair enough; in the more general DT case, though, 
devices are going to have any number of arbitrary IDs plus who 
knows what other associated data.

Now, both of those could probably be handled by having a big mess of 
open-coded DT parsing in every IOMMU driver's add_device, but that's 
exactly the kind of thing that having a generic DT binding should avoid. 
The generic binding is good enough to express most such arbitrary data 
pretty well, and even crazy stuff like a single device mastering through 
multiple IOMMUs with different IDs, so there's every argument for 
keeping the DT parsing generic and in core code. That's what the 
of_xlate infrastructure does: the appropriate IOMMU driver gets one or 
more of_xlate calls for a master device, from which it can collate all 
the "bus" information it needs, then later it gets the add_device 
callback once the device exists, at which point it can retrieve that 
data and use it to set up the device, allocate a group, etc.
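
As a very rough sketch of that flow (hypothetical driver code, not from
this series - the my_* names, MY_MAX_IDS and the per-master structure
are all invented), assuming a master with e.g. "iommus = <&my_iommu 0x42>;"
in its DT node, which of_iommu_configure() turns into of_xlate calls on
the right driver:

#include <linux/err.h>
#include <linux/iommu.h>
#include <linux/of.h>
#include <linux/slab.h>

#define MY_MAX_IDS	8		/* hypothetical per-master limit */

struct my_master_data {
	u32	ids[MY_MAX_IDS];
	int	num_ids;
};

/* Called by of_iommu_configure() once per "iommus" specifier */
static int my_iommu_of_xlate(struct device *dev, struct of_phandle_args *args)
{
	struct my_master_data *data = dev->archdata.iommu;

	if (!data) {
		data = kzalloc(sizeof(*data), GFP_KERNEL);
		if (!data)
			return -ENOMEM;
		dev->archdata.iommu = data;
	}
	if (data->num_ids == MY_MAX_IDS)
		return -ENOSPC;
	/* Just stash the ID; the device doesn't fully exist yet */
	data->ids[data->num_ids++] = args->args[0];
	return 0;
}

/* Called via the bus notifier once the device has been added for real */
static int my_iommu_add_device(struct device *dev)
{
	struct my_master_data *data = dev->archdata.iommu;
	struct iommu_group *group;
	int ret;

	if (!data)
		return -ENODEV;		/* not one of ours */

	group = iommu_group_alloc();
	if (IS_ERR(group))
		return PTR_ERR(group);
	ret = iommu_group_add_device(group, dev);
	iommu_group_put(group);
	/* ...then program the stashed IDs into a hardware context... */
	return ret;
}

(These two would be wired up as the driver's .of_xlate and .add_device
callbacks respectively.)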

If allocating a group for a platform device automatically set up a DMA 
mapping domain and attached it, then we wouldn't need to do anything 
with domains in arch_setup_dma_ops and the ordering problem vanishes. 
However, the drivers can't support DMA domains without a dma_ops 
implementation to back them up, which needs the DMA mapping code, which 
needs DMA domains in order to work... Hence the initial need to 
bootstrap the process via fake DMA domains in the arch code, enabling 
steady incremental development instead of one massive wide-reaching 
patch dump that's virtually impossible for me to test well and for 
others to develop against.

> This might include initializing the hardware iommu needed for the
> device and setting its per-device dma_ops.

I'm not sure that feels appropriate for our situation - If your IOMMU is 
tightly integrated into the I/O hub which forms your CPU's only 
connection to the outside world, then having it do architectural things 
seems reasonable. The kind of IOMMUs I'm dealing with here, however, are 
either just some IP block that you stitch into your SoC's mess of 
interconnects somewhere like any other peripheral (to keep your legacy 
32-bit devices behind), or even just a simple TLB with a page table 
walker and some control registers which you integrate directly into your 
own bigger devices. The latter are not really even general-purpose 
enough for arbitrary IOMMU API use, but treating them as separate IOMMUs 
makes sense from a common-driver perspective.

Having drivers for those kinds of IOMMU do things like setting dma_ops 
directly when they aren't even necessarily tied to a particular CPU 
architecture, and may have to coexist with drivers for incompatible 
IOMMUs in the same system, sounds like a recipe for tons of duplicated 
code and horrible hacks. I think on DT-based systems we *have* to keep 
system topology and device infrastructure handling in the core OF code 
and not have IOMMU drivers reinventing funnily-shaped and conflicting 
wheels all over the place.

>> Laurent's probe-deferral series largely solves these problems in the
>> right place - adding identical boilerplate code to every IOMMU
>> driver to do something they shouldn't have to know about (and don't
>> necessarily have all the right information for) is exactly what we
>> don't want to do. As discussed over on another thread, I'm planning
>> to pick that series up and polish it off after this, but my top
>> priority is getting the basic dma_ops functionality into arm64 that
>> people need right now. I will be only too happy when I can submit
>> the patch removing this notifier workaround ;)
>
> I've often experienced that someone promises to fix things later but
> then it doesn't happen, so please understand that I am pretty
> cautious about adding such hacks ;)

Oh, I understand completely - that's why I've tried as much as possible 
to keep all the workarounds out of the core code and local to arm64 
where it's easier for us to remove them cleanly. If it's any 
consolation, said patch is already written:

http://www.linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=af0110ed070c13417023ca7a560dc43e9d1c46f7

(the aforementioned rough WIP branch, from which I'm hoping to pull 
together an RFC for -rc1)

>> No driver other than the AMD IOMMU has any support yet. Support for
>> IOMMU_DOMAIN_DMA can easily be added to existing drivers based on
>> patch 1 of this series, but more generally it's not entirely clear
>> how default domains are going to work beyond x86. On systems like
>> Juno or Seattle with different sets of masters behind different
>> IOMMU instances (with limited domain contexts each), the most
>> sensible approach would be to have a single default domain per IOMMU
>> (spanning domains across instances would need some hideous
>> synchronisation logic for some implementations), but the current
>> domain_alloc interface gives us no way to do that. On something like
>> RK3288 with two different types of IOMMU on the platform "bus", it
>> breaks down even further as there's no way to guarantee that
>> iommu_domain_alloc() even gives you a domain from the right *driver*
>> (hence bypassing it by going through ops directly here).
>
> Default domain allocation comes down to how the bus organizes its
> iommu-groups. For every group (at least in its current design) a default
> domain is allocated. And a group is typically only behind a single iommu
> instance.

Default-domain-per-group is great for device isolation, but the main 
thing people seem to want on ARM is the opposite of that - assembling 
big bits of virtually-contiguous memory that are visible to all the 
devices that need to share them. Per-IOMMU (or even per-system if 
appropriate) default domains provide a really neat solution for that 
use-case and mean we could mostly completely remove the IOMMU-specific 
code from GPU drivers, which is one of the things standing in the way of 
having arch/arm share the common implementation.

I'll note that limited contexts aren't a made-up consideration either. 
The PCIe SMMU in my Juno can support a maximum of 4 domains, yet the bus 
has onboard LAN and SATA plus 4 slots...

>> Only for PCI devices, via iommu_group_get_for_pci_dev(). The code
>> here, however, only runs for platform devices - ops will be always
>> null for a PCI device since of_iommu_configure() will have bailed
>> out (see the silly warning removed by my patch you picked up the
>> other day). Once iommu_group_get_for_dev() supports platform
>> devices, this too can go away. In the meantime if someone adds PCI
>> support to of_iommu_configure() and IOMMU_DOMAIN_DMA support to
>> their IOMMU driver, then we'll get a domain back from
>> iommu_get_domain_for_dev() and just use that.
>
> What is the plan for getting an iommu-groups implementation for the
> platform bus?

The realistic options are trying to handle it with logic in the drivers 
(engenders code duplication, has potentially unsolvable ordering issues) 
or defining groups statically in DT via a new binding (needs to avoid 
being too much of a Linux-specific implementation detail). I tried 
prototyping the former ages ago and it got overcomplicated very quickly. 
Having considered the static DT method more recently, I've sketched out 
some ideas for a binding and a rough design for the code, but nothing's 
actually written yet. It is another of the "make stuff work on arm64" 
balls I have in the air, though, if only because the USB SMMU on Juno is 
otherwise unusable (well, unless you hack out one or other of the 
EHCI/OHCI pair).

Robin.
Daniel Kurtz Sept. 22, 2015, 5:12 p.m. UTC | #6
Hi Robin,

On Sat, Aug 1, 2015 at 1:18 AM, Robin Murphy <robin.murphy@arm.com> wrote:
> Taking some inspiration from the arch/arm code, implement the
> arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.
>
> Unfortunately the device setup code has to start out as a big ugly mess
> in order to work usefully right now, as 'proper' operation depends on
> changes to device probe and DMA configuration ordering, IOMMU groups for
> platform devices, and default domain support in arm/arm64 IOMMU drivers.
> The workarounds here need only exist until that work is finished.
>
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---

[snip]

> +static void __iommu_sync_sg_for_cpu(struct device *dev,
> +                                   struct scatterlist *sgl, int nelems,
> +                                   enum dma_data_direction dir)
> +{
> +       struct scatterlist *sg;
> +       int i;
> +
> +       if (is_device_dma_coherent(dev))
> +               return;
> +
> +       for_each_sg(sgl, sg, nelems, i)
> +               __dma_unmap_area(sg_virt(sg), sg->length, dir);
> +}

In an earlier review [0], Marek asked you to change the loop in
__iommu_sync_sg_for_cpu() to loop over the virtual areas when
invalidating/cleaning memory ranges.

[0] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-March/328232.html

However, this changed the meaning of the 'nelems' argument from what
was for arm_iommu_sync_sg_for_cpu() in arch/arm/mm/dma-mapping.c:
 "number of buffers to sync (returned from dma_map_sg)"
to:
 "number of virtual areas to sync (same as was passed to dma_map_sg)"

This has caused some confusion for callers of dma_sync_sg_for_device()
that must work for both arm & arm64, as illustrated by [1].
[1] https://lkml.org/lkml/2015/9/21/250

Based on the implementation of debug_dma_sync_sg_for_cpu() in
lib/dma-debug.c, I think the arm interpretation of nelems (returned
from dma_map_sg) is correct.

Therefore, I think we need an outer iteration over dma chunks, and an
inner iteration that calls __dma_map_area() over the set of virtual areas
that correspond to that dma chunk, both here and for
__iommu_sync_sg_for_device().  This will be complicated by the fact
that iommu pages could actually be smaller than PAGE_SIZE, and offset
within a single physical page.  Also, as an optimization, we would
want to combine contiguous virtual areas into a single call to
__dma_unmap_area().

-Dan
Robin Murphy Sept. 22, 2015, 6:11 p.m. UTC | #7
Hi Dan,

On 22/09/15 18:12, Daniel Kurtz wrote:
> Hi Robin,
>
> On Sat, Aug 1, 2015 at 1:18 AM, Robin Murphy <robin.murphy@arm.com> wrote:
>> Taking some inspiration from the arch/arm code, implement the
>> arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.
>>
>> Unfortunately the device setup code has to start out as a big ugly mess
>> in order to work usefully right now, as 'proper' operation depends on
>> changes to device probe and DMA configuration ordering, IOMMU groups for
>> platform devices, and default domain support in arm/arm64 IOMMU drivers.
>> The workarounds here need only exist until that work is finished.
>>
>> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
>> ---
>
> [snip]
>
>> +static void __iommu_sync_sg_for_cpu(struct device *dev,
>> +                                   struct scatterlist *sgl, int nelems,
>> +                                   enum dma_data_direction dir)
>> +{
>> +       struct scatterlist *sg;
>> +       int i;
>> +
>> +       if (is_device_dma_coherent(dev))
>> +               return;
>> +
>> +       for_each_sg(sgl, sg, nelems, i)
>> +               __dma_unmap_area(sg_virt(sg), sg->length, dir);
>> +}
>
> In an earlier review [0], Marek asked you to change the loop in
> __iommu_sync_sg_for_cpu() to loop over the virtual areas when
> invalidating/cleaning memory ranges.
>
> [0] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-March/328232.html
>
> However, this changed the meaning of the 'nelems' argument from what
> was for arm_iommu_sync_sg_for_cpu() in arch/arm/mm/dma-mapping.c:
>   "number of buffers to sync (returned from dma_map_sg)"
> to:
>   "number of virtual areas to sync (same as was passed to dma_map_sg)"
>
> This has caused some confusion for callers of dma_sync_sg_for_device()
> that must work for both arm & arm64, as illustrated by [1].
> [1] https://lkml.org/lkml/2015/9/21/250

Funnily enough, I happened to stumble across that earlier of my own 
volition, and felt obliged to respond ;)

> Based on the implementation of debug_dma_sync_sg_for_cpu() in
> lib/dma-debug.c, I think the arm interpretation of nelems (returned
> from dma_map_sg) is correct.

As I explained over on the other thread, you can only do cache 
maintenance on CPU addresses, and those haven't changed regardless of 
what mapping you set up in the IOMMU for the device to see, therefore 
iterating over the mapped DMA chunks makes no sense if you have no way 
to infer a CPU address from a DMA address alone (indeed, I struggled a 
bit to get this initially, hence Marek's feedback). Note that the 
arm_iommu_sync_sg_* code is iterating over entries using the original 
CPU address, offset and length fields in exactly that way, not using the 
DMA address/length fields at all, therefore if you pass in less than the 
original number of entries you'll simply miss out part of the buffer; 
what that code _does_ is indeed correct, but it's not the same thing as 
the comments imply, and the comments are wrong.

AFAICS, debug_dma_sync_sg_* still expects to be called with the original 
nents as well; it just bails out early after mapped_ents entries since 
any further entries won't have DMA addresses to check anyway.

I suspect the offending comments were simply copied from the 
arm_dma_sync_sg_* implementations, which rather counterintuitively _do_ 
operate on the mapped DMA addresses, because they might be flushing a 
bounced copy of the buffer instead of the original pages (and can depend 
on the necessary 1:1 DMA:CPU relationship either way).
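
Either way, the driver-facing contract (as spelled out in
DMA-API-HOWTO.txt) is the same: pass the original nents to the sync and
unmap calls, and only use the count returned by dma_map_sg() to bound
the walk over DMA addresses. A minimal usage sketch, in which
program_hw_descriptor() is of course a made-up stand-in for whatever the
driver does with each mapped segment:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Minimal sketch of the nelems contract for a DMA_FROM_DEVICE transfer */
static int my_dev_receive(struct device *dev, struct scatterlist *sgl,
			  int nents)
{
	struct scatterlist *sg;
	int mapped, i;

	mapped = dma_map_sg(dev, sgl, nents, DMA_FROM_DEVICE);
	if (!mapped)
		return -ENOMEM;

	/* Walk DMA addresses using the *returned* count... */
	for_each_sg(sgl, sg, mapped, i)
		program_hw_descriptor(sg_dma_address(sg), sg_dma_len(sg));

	/* ...device does its DMA... */

	/* ...but sync and unmap take the nents originally passed in */
	dma_sync_sg_for_cpu(dev, sgl, nents, DMA_FROM_DEVICE);
	/* CPU can now read the data the device wrote */

	dma_unmap_sg(dev, sgl, nents, DMA_FROM_DEVICE);
	return 0;
}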

Robin.

[0]:http://article.gmane.org/gmane.linux.kernel/2044263

>
> Therefore, I think we need an outer iteration over dma chunks, and an
> inner iteration that calls __dma_map_area() over the set of virtual areas
> that correspond to that dma chunk, both here and for
> __iommu_sync_sg_for_device().  This will be complicated by the fact
> that iommu pages could actually be smaller than PAGE_SIZE, and offset
> within a single physical page.  Also, as an optimization, we would
> want to combine contiguous virtual areas into a single call to
> __dma_unmap_area().
>
> -Dan
>

Patch

diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index e5d74cd..c735f45 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -534,3 +534,428 @@  static int __init dma_debug_do_init(void)
 	return 0;
 }
 fs_initcall(dma_debug_do_init);
+
+
+#ifdef CONFIG_IOMMU_DMA
+#include <linux/dma-iommu.h>
+#include <linux/platform_device.h>
+#include <linux/amba/bus.h>
+
+/* Thankfully, all cache ops are by VA so we can ignore phys here */
+static void flush_page(struct device *dev, const void *virt, phys_addr_t phys)
+{
+	__dma_flush_range(virt, virt + PAGE_SIZE);
+}
+
+static void *__iommu_alloc_attrs(struct device *dev, size_t size,
+				 dma_addr_t *handle, gfp_t gfp,
+				 struct dma_attrs *attrs)
+{
+	bool coherent = is_device_dma_coherent(dev);
+	int ioprot = dma_direction_to_prot(DMA_BIDIRECTIONAL, coherent);
+	void *addr;
+
+	if (WARN(!dev, "cannot create IOMMU mapping for unknown device\n"))
+		return NULL;
+	/*
+	 * Some drivers rely on this, and we probably don't want the
+	 * possibility of stale kernel data being read by devices anyway.
+	 */
+	gfp |= __GFP_ZERO;
+
+	if (gfp & __GFP_WAIT) {
+		struct page **pages;
+		pgprot_t prot = __get_dma_pgprot(attrs, PAGE_KERNEL, coherent);
+
+		pages = iommu_dma_alloc(dev, size, gfp, ioprot,	handle,
+					flush_page);
+		if (!pages)
+			return NULL;
+
+		addr = dma_common_pages_remap(pages, size, VM_USERMAP, prot,
+					      __builtin_return_address(0));
+		if (!addr)
+			iommu_dma_free(dev, pages, size, handle);
+	} else {
+		struct page *page;
+		/*
+		 * In atomic context we can't remap anything, so we'll only
+		 * get the virtually contiguous buffer we need by way of a
+		 * physically contiguous allocation.
+		 */
+		if (coherent) {
+			page = alloc_pages(gfp, get_order(size));
+			addr = page ? page_address(page) : NULL;
+		} else {
+			addr = __alloc_from_pool(size, &page, gfp);
+		}
+		if (!addr)
+			return NULL;
+
+		*handle = iommu_dma_map_page(dev, page, 0, size, ioprot);
+		if (iommu_dma_mapping_error(dev, *handle)) {
+			if (coherent)
+				__free_pages(page, get_order(size));
+			else
+				__free_from_pool(addr, size);
+			addr = NULL;
+		}
+	}
+	return addr;
+}
+
+static void __iommu_free_attrs(struct device *dev, size_t size, void *cpu_addr,
+			       dma_addr_t handle, struct dma_attrs *attrs)
+{
+	/*
+	 * @cpu_addr will be one of 3 things depending on how it was allocated:
+	 * - A remapped array of pages from iommu_dma_alloc(), for all
+	 *   non-atomic allocations.
+	 * - A non-cacheable alias from the atomic pool, for atomic
+	 *   allocations by non-coherent devices.
+	 * - A normal lowmem address, for atomic allocations by
+	 *   coherent devices.
+	 * Hence how dodgy the below logic looks...
+	 */
+	if (__in_atomic_pool(cpu_addr, size)) {
+		iommu_dma_unmap_page(dev, handle, size, 0, NULL);
+		__free_from_pool(cpu_addr, size);
+	} else if (is_vmalloc_addr(cpu_addr)){
+		struct vm_struct *area = find_vm_area(cpu_addr);
+
+		if (WARN_ON(!area || !area->pages))
+			return;
+		iommu_dma_free(dev, area->pages, size, &handle);
+		dma_common_free_remap(cpu_addr, size, VM_USERMAP);
+	} else {
+		iommu_dma_unmap_page(dev, handle, size, 0, NULL);
+		__free_pages(virt_to_page(cpu_addr), get_order(size));
+	}
+}
+
+static int __iommu_mmap_attrs(struct device *dev, struct vm_area_struct *vma,
+			      void *cpu_addr, dma_addr_t dma_addr, size_t size,
+			      struct dma_attrs *attrs)
+{
+	struct vm_struct *area;
+	int ret;
+
+	vma->vm_page_prot = __get_dma_pgprot(attrs, vma->vm_page_prot,
+					     is_device_dma_coherent(dev));
+
+	if (dma_mmap_from_coherent(dev, vma, cpu_addr, size, &ret))
+		return ret;
+
+	area = find_vm_area(cpu_addr);
+	if (WARN_ON(!area || !area->pages))
+		return -ENXIO;
+
+	return iommu_dma_mmap(area->pages, size, vma);
+}
+
+static int __iommu_get_sgtable(struct device *dev, struct sg_table *sgt,
+			       void *cpu_addr, dma_addr_t dma_addr,
+			       size_t size, struct dma_attrs *attrs)
+{
+	unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
+	struct vm_struct *area = find_vm_area(cpu_addr);
+
+	if (WARN_ON(!area || !area->pages))
+		return -ENXIO;
+
+	return sg_alloc_table_from_pages(sgt, area->pages, count, 0, size,
+					 GFP_KERNEL);
+}
+
+static void __iommu_sync_single_for_cpu(struct device *dev,
+					dma_addr_t dev_addr, size_t size,
+					enum dma_data_direction dir)
+{
+	phys_addr_t phys;
+
+	if (is_device_dma_coherent(dev))
+		return;
+
+	phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev), dev_addr);
+	__dma_unmap_area(phys_to_virt(phys), size, dir);
+}
+
+static void __iommu_sync_single_for_device(struct device *dev,
+					   dma_addr_t dev_addr, size_t size,
+					   enum dma_data_direction dir)
+{
+	phys_addr_t phys;
+
+	if (is_device_dma_coherent(dev))
+		return;
+
+	phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev), dev_addr);
+	__dma_map_area(phys_to_virt(phys), size, dir);
+}
+
+static dma_addr_t __iommu_map_page(struct device *dev, struct page *page,
+				   unsigned long offset, size_t size,
+				   enum dma_data_direction dir,
+				   struct dma_attrs *attrs)
+{
+	bool coherent = is_device_dma_coherent(dev);
+	int prot = dma_direction_to_prot(dir, coherent);
+	dma_addr_t dev_addr = iommu_dma_map_page(dev, page, offset, size, prot);
+
+	if (!iommu_dma_mapping_error(dev, dev_addr) &&
+	    !dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
+		__iommu_sync_single_for_device(dev, dev_addr, size, dir);
+
+	return dev_addr;
+}
+
+static void __iommu_unmap_page(struct device *dev, dma_addr_t dev_addr,
+			       size_t size, enum dma_data_direction dir,
+			       struct dma_attrs *attrs)
+{
+	if (!dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
+		__iommu_sync_single_for_cpu(dev, dev_addr, size, dir);
+
+	iommu_dma_unmap_page(dev, dev_addr, size, dir, attrs);
+}
+
+static void __iommu_sync_sg_for_cpu(struct device *dev,
+				    struct scatterlist *sgl, int nelems,
+				    enum dma_data_direction dir)
+{
+	struct scatterlist *sg;
+	int i;
+
+	if (is_device_dma_coherent(dev))
+		return;
+
+	for_each_sg(sgl, sg, nelems, i)
+		__dma_unmap_area(sg_virt(sg), sg->length, dir);
+}
+
+static void __iommu_sync_sg_for_device(struct device *dev,
+				       struct scatterlist *sgl, int nelems,
+				       enum dma_data_direction dir)
+{
+	struct scatterlist *sg;
+	int i;
+
+	if (is_device_dma_coherent(dev))
+		return;
+
+	for_each_sg(sgl, sg, nelems, i)
+		__dma_map_area(sg_virt(sg), sg->length, dir);
+}
+
+static int __iommu_map_sg_attrs(struct device *dev, struct scatterlist *sgl,
+				int nelems, enum dma_data_direction dir,
+				struct dma_attrs *attrs)
+{
+	bool coherent = is_device_dma_coherent(dev);
+
+	if (!dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
+		__iommu_sync_sg_for_device(dev, sgl, nelems, dir);
+
+	return iommu_dma_map_sg(dev, sgl, nelems,
+			dma_direction_to_prot(dir, coherent));
+}
+
+static void __iommu_unmap_sg_attrs(struct device *dev,
+				   struct scatterlist *sgl, int nelems,
+				   enum dma_data_direction dir,
+				   struct dma_attrs *attrs)
+{
+	if (!dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs))
+		__iommu_sync_sg_for_cpu(dev, sgl, nelems, dir);
+
+	iommu_dma_unmap_sg(dev, sgl, nelems, dir, attrs);
+}
+
+static struct dma_map_ops iommu_dma_ops = {
+	.alloc = __iommu_alloc_attrs,
+	.free = __iommu_free_attrs,
+	.mmap = __iommu_mmap_attrs,
+	.get_sgtable = __iommu_get_sgtable,
+	.map_page = __iommu_map_page,
+	.unmap_page = __iommu_unmap_page,
+	.map_sg = __iommu_map_sg_attrs,
+	.unmap_sg = __iommu_unmap_sg_attrs,
+	.sync_single_for_cpu = __iommu_sync_single_for_cpu,
+	.sync_single_for_device = __iommu_sync_single_for_device,
+	.sync_sg_for_cpu = __iommu_sync_sg_for_cpu,
+	.sync_sg_for_device = __iommu_sync_sg_for_device,
+	.dma_supported = iommu_dma_supported,
+	.mapping_error = iommu_dma_mapping_error,
+};
+
+/*
+ * TODO: Right now __iommu_setup_dma_ops() gets called too early to do
+ * everything it needs to - the device isn't yet fully created, and the
+ * IOMMU driver hasn't seen it yet, so we need this delayed attachment
+ * dance. Once IOMMU probe ordering is sorted to move the
+ * arch_setup_dma_ops() call later, all the notifier bits below become
+ * unnecessary, and will go away.
+ */
+struct iommu_dma_notifier_data {
+	struct list_head list;
+	struct device *dev;
+	struct iommu_domain *dma_domain;
+};
+static LIST_HEAD(iommu_dma_masters);
+static DEFINE_MUTEX(iommu_dma_notifier_lock);
+
+static int __iommu_attach_notifier(struct notifier_block *nb,
+				   unsigned long action, void *data)
+{
+	struct iommu_dma_notifier_data *master, *tmp;
+
+	if (action != BUS_NOTIFY_ADD_DEVICE)
+		return 0;
+	/*
+	 * We expect the list to only contain the most recent addition
+	 * (which should be the same device as in @data) so just process
+	 * the whole thing blindly. If any previous attachments did happen
+	 * to fail, they get a free retry since the domains are still live.
+	 */
+	mutex_lock(&iommu_dma_notifier_lock);
+	list_for_each_entry_safe(master, tmp, &iommu_dma_masters, list) {
+		if (iommu_attach_device(master->dma_domain, master->dev)) {
+			pr_warn("Failed to attach device %s to IOMMU mapping; retaining platform DMA ops\n",
+				dev_name(master->dev));
+		} else {
+			master->dev->archdata.dma_ops = &iommu_dma_ops;
+			list_del(&master->list);
+			kfree(master);
+		}
+	}
+	mutex_unlock(&iommu_dma_notifier_lock);
+	return 0;
+}
+
+static int register_iommu_dma_ops_notifier(struct bus_type *bus)
+{
+	int ret;
+	struct notifier_block *nb = kzalloc(sizeof(*nb), GFP_KERNEL);
+
+	/*
+	 * The device must be attached to a domain before the driver probe
+	 * routine gets a chance to start allocating DMA buffers. However,
+	 * the IOMMU driver also needs a chance to configure the iommu_group
+	 * via its add_device callback first, so we need to make the attach
+	 * happen between those two points. Since the IOMMU core uses a bus
+	 * notifier with default priority for add_device, do the same but
+	 * with a lower priority to ensure the appropriate ordering.
+	 */
+	nb->notifier_call = __iommu_attach_notifier;
+	nb->priority = -100;
+
+	ret = bus_register_notifier(bus, nb);
+	if (ret) {
+		pr_warn("Failed to register DMA domain notifier; IOMMU DMA ops unavailable on bus '%s'\n",
+			bus->name);
+		kfree(nb);
+	}
+	return ret;
+}
+
+static int queue_iommu_attach(struct iommu_domain *domain, struct device *dev)
+{
+	struct iommu_dma_notifier_data *iommudata = NULL;
+
+	iommudata = kzalloc(sizeof(*iommudata), GFP_KERNEL);
+	if (!iommudata)
+		return -ENOMEM;
+
+	iommudata->dev = dev;
+	iommudata->dma_domain = domain;
+
+	mutex_lock(&iommu_dma_notifier_lock);
+	list_add(&iommudata->list, &iommu_dma_masters);
+	mutex_unlock(&iommu_dma_notifier_lock);
+	return 0;
+}
+
+static int __init arm64_iommu_dma_init(void)
+{
+	int ret;
+
+	ret = iommu_dma_init();
+	if (!ret)
+		ret = register_iommu_dma_ops_notifier(&platform_bus_type);
+	if (!ret)
+		ret = register_iommu_dma_ops_notifier(&amba_bustype);
+	return ret;
+}
+arch_initcall(arm64_iommu_dma_init);
+
+/* Hijack some domain feature flags for the stop-gap meddling below */
+#define __IOMMU_DOMAIN_ARM64		(1U << 31)
+#define __IOMMU_DOMAIN_ARM64_IOVA	(1U << 30)
+
+static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
+				  const struct iommu_ops *ops)
+{
+	struct iommu_domain *domain;
+	int err;
+
+	if (!ops)
+		return;
+	/*
+	 * In a perfect world, everything happened in the right order up to
+	 * here, and the IOMMU core has already attached the device to an
+	 * appropriate default domain for us to set up...
+	 */
+	domain = iommu_get_domain_for_dev(dev);
+	if (!domain) {
+		/*
+		 * Urgh. Reliable default domains for platform devices can't
+		 * happen anyway without some sensible way of handling
+		 * non-trivial groups. So until then, HORRIBLE HACKS!
+		 */
+		domain = ops->domain_alloc(IOMMU_DOMAIN_DMA);
+		if (!domain)
+			domain = ops->domain_alloc(IOMMU_DOMAIN_UNMANAGED);
+		if (!domain)
+			goto out_no_domain;
+
+		domain->ops = ops;
+		domain->type = IOMMU_DOMAIN_DMA | __IOMMU_DOMAIN_ARM64;
+		if (!domain->dma_api_cookie) {
+			domain->type |= __IOMMU_DOMAIN_ARM64_IOVA;
+			if (iommu_get_dma_cookie(domain))
+				goto out_put_domain;
+		}
+	}
+
+	if (iommu_dma_init_domain(domain, dma_base, size)) {
+		pr_warn("Failed to create %llu-byte IOMMU mapping for device %s\n",
+			size, dev_name(dev));
+		goto out_put_domain;
+	}
+
+	if (dev->iommu_group)
+		err = iommu_attach_device(domain, dev);
+	else
+		err = queue_iommu_attach(domain, dev);
+
+	if (!err) {
+		dev->archdata.dma_ops = &iommu_dma_ops;
+		return;
+	}
+
+out_put_domain:
+	if (domain->type & __IOMMU_DOMAIN_ARM64_IOVA)
+		iommu_put_dma_cookie(domain);
+	if (domain->type & __IOMMU_DOMAIN_ARM64)
+		ops->domain_free(domain);
+out_no_domain:
+	pr_warn("Failed to set up IOMMU domain for device %s\n", dev_name(dev));
+}
+
+#else
+
+static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
+				  struct iommu_ops *iommu)
+{ }
+
+#endif  /* CONFIG_IOMMU_DMA */