Message ID | 8a5abd0a9929aae160ccb74d7a8d9c3698f61910.1438362603.git.robin.murphy@arm.com (mailing list archive)
---|---
State | New, archived
On Fri, Jul 31, 2015 at 06:18:28PM +0100, Robin Murphy wrote:
> Taking some inspiration from the arch/arm code, implement the
> arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.
>
> Unfortunately the device setup code has to start out as a big ugly mess
> in order to work usefully right now, as 'proper' operation depends on
> changes to device probe and DMA configuration ordering, IOMMU groups for
> platform devices, and default domain support in arm/arm64 IOMMU drivers.
> The workarounds here need only exist until that work is finished.
>
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
On Fri, Jul 31, 2015 at 06:18:28PM +0100, Robin Murphy wrote:
> +/*
> + * TODO: Right now __iommu_setup_dma_ops() gets called too early to do
> + * everything it needs to - the device isn't yet fully created, and the
> + * IOMMU driver hasn't seen it yet, so we need this delayed attachment
> + * dance. Once IOMMU probe ordering is sorted to move the
> + * arch_setup_dma_ops() call later, all the notifier bits below become
> + * unnecessary, and will go away.
> + */
> +struct iommu_dma_notifier_data {
> +        struct list_head list;
> +        struct device *dev;
> +        struct iommu_domain *dma_domain;
> +};
> +static LIST_HEAD(iommu_dma_masters);
> +static DEFINE_MUTEX(iommu_dma_notifier_lock);

Ugh, that's incredibly ugly. Why can't you do the setup work when the iommu driver sees the device? Just call the dma-api setup functions there (as the x86 iommu drivers do) and be done without any notifiers.

> +static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
> +                                  const struct iommu_ops *ops)
> +{
> +        struct iommu_domain *domain;
> +        int err;
> +
> +        if (!ops)
> +                return;
> +        /*
> +         * In a perfect world, everything happened in the right order up to
> +         * here, and the IOMMU core has already attached the device to an
> +         * appropriate default domain for us to set up...
> +         */
> +        domain = iommu_get_domain_for_dev(dev);
> +        if (!domain) {
> +                /*
> +                 * Urgh. Reliable default domains for platform devices can't
> +                 * happen anyway without some sensible way of handling
> +                 * non-trivial groups. So until then, HORRIBLE HACKS!
> +                 */

I don't get this; what is preventing you from relying on default domains here?

> +                domain = ops->domain_alloc(IOMMU_DOMAIN_DMA);

The IOMMU core should have already tried to allocate an IOMMU_DOMAIN_DMA type domain. No need to try this again here.


	Joerg
On 07/08/15 09:52, Joerg Roedel wrote:
> On Fri, Jul 31, 2015 at 06:18:28PM +0100, Robin Murphy wrote:
>> +/*
>> + * TODO: Right now __iommu_setup_dma_ops() gets called too early to do
>> + * everything it needs to - the device isn't yet fully created, and the
>> + * IOMMU driver hasn't seen it yet, so we need this delayed attachment
>> + * dance. Once IOMMU probe ordering is sorted to move the
>> + * arch_setup_dma_ops() call later, all the notifier bits below become
>> + * unnecessary, and will go away.
>> + */
>> +struct iommu_dma_notifier_data {
>> +        struct list_head list;
>> +        struct device *dev;
>> +        struct iommu_domain *dma_domain;
>> +};
>> +static LIST_HEAD(iommu_dma_masters);
>> +static DEFINE_MUTEX(iommu_dma_notifier_lock);
>
> Ugh, that's incredibly ugly. Why can't you do the setup work when the iommu driver sees the device? Just call the dma-api setup functions there (as the x86 iommu drivers do) and be done without any notifiers.

As per the comments, the issue here lies in the order in which the OF/driver core code currently calls things for platform devices: as it stands we can't attach the device to a domain in arch_setup_dma_ops() because it doesn't have a group, and we can't even add it to a group ourselves because it isn't fully created and doesn't exist in sysfs yet. The only reason arch/arm is currently getting away without this workaround is that the few IOMMU drivers there hooked up to the generic infrastructure don't actually mind that they get an attach_device from arch_setup_dma_ops() before they've even seen an add_device (largely because they don't care about groups).

Laurent's probe-deferral series largely solves these problems in the right place - adding identical boilerplate code to every IOMMU driver to do something they shouldn't have to know about (and don't necessarily have all the right information for) is exactly what we don't want to do. As discussed over on another thread, I'm planning to pick that series up and polish it off after this, but my top priority is getting the basic dma_ops functionality into arm64 that people need right now. I will be only too happy when I can submit the patch removing this notifier workaround ;)

>> +static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
>> +                                  const struct iommu_ops *ops)
>> +{
>> +        struct iommu_domain *domain;
>> +        int err;
>> +
>> +        if (!ops)
>> +                return;
>> +        /*
>> +         * In a perfect world, everything happened in the right order up to
>> +         * here, and the IOMMU core has already attached the device to an
>> +         * appropriate default domain for us to set up...
>> +         */
>> +        domain = iommu_get_domain_for_dev(dev);
>> +        if (!domain) {
>> +                /*
>> +                 * Urgh. Reliable default domains for platform devices can't
>> +                 * happen anyway without some sensible way of handling
>> +                 * non-trivial groups. So until then, HORRIBLE HACKS!
>> +                 */
>
> I don't get this; what is preventing you from relying on default domains here?

No driver other than the AMD IOMMU has any support yet. Support for IOMMU_DOMAIN_DMA can easily be added to existing drivers based on patch 1 of this series, but more generally it's not entirely clear how default domains are going to work beyond x86.
On systems like Juno or Seattle with different sets of masters behind different IOMMU instances (with limited domain contexts each), the most sensible approach would be to have a single default domain per IOMMU (spanning domains across instances would need some hideous synchronisation logic for some implementations), but the current domain_alloc interface gives us no way to do that. On something like RK3288 with two different types of IOMMU on the platform "bus", it breaks down even further as there's no way to guarantee that iommu_domain_alloc() even gives you a domain from the right *driver* (hence bypassing it by going through ops directly here).

>> +                domain = ops->domain_alloc(IOMMU_DOMAIN_DMA);
>
> The IOMMU core should have already tried to allocate an IOMMU_DOMAIN_DMA type domain. No need to try this again here.

Only for PCI devices, via iommu_group_get_for_pci_dev(). The code here, however, only runs for platform devices - ops will always be null for a PCI device since of_iommu_configure() will have bailed out (see the silly warning removed by my patch you picked up the other day). Once iommu_group_get_for_dev() supports platform devices, this too can go away. In the meantime, if someone adds PCI support to of_iommu_configure() and IOMMU_DOMAIN_DMA support to their IOMMU driver, then we'll get a domain back from iommu_get_domain_for_dev() and just use that.

Robin.
On Fri, Aug 07, 2015 at 04:27:56PM +0100, Robin Murphy wrote:
> As per the comments, the issue here lies in the order in which the OF/driver core code currently calls things for platform devices: as it stands we can't attach the device to a domain in arch_setup_dma_ops() because it doesn't have a group, and we can't even add it to a group ourselves because it isn't fully created and doesn't exist in sysfs yet. The only reason arch/arm is currently getting away without this workaround is that the few IOMMU drivers there hooked up to the generic infrastructure don't actually mind that they get an attach_device from arch_setup_dma_ops() before they've even seen an add_device (largely because they don't care about groups).

Hmm, what about just registering the iommu driver in arch_setup_dma_ops with bus_set_iommu and not caring about devices? The code will iterate over the devices already on the bus and try to attach them to the iommu driver. But as you said, the devices are not yet on the bus, so when a device is later added by the OF/driver core code you can do everything needed for the device in the add_device callback. This might include initializing the hardware iommu needed for the device and setting its per-device dma_ops.

> Laurent's probe-deferral series largely solves these problems in the right place - adding identical boilerplate code to every IOMMU driver to do something they shouldn't have to know about (and don't necessarily have all the right information for) is exactly what we don't want to do. As discussed over on another thread, I'm planning to pick that series up and polish it off after this, but my top priority is getting the basic dma_ops functionality into arm64 that people need right now. I will be only too happy when I can submit the patch removing this notifier workaround ;)

I've often experienced someone promising to fix things later, only for it never to happen, so please understand that I am pretty cautious about adding such hacks ;)

> No driver other than the AMD IOMMU has any support yet. Support for IOMMU_DOMAIN_DMA can easily be added to existing drivers based on patch 1 of this series, but more generally it's not entirely clear how default domains are going to work beyond x86. On systems like Juno or Seattle with different sets of masters behind different IOMMU instances (with limited domain contexts each), the most sensible approach would be to have a single default domain per IOMMU (spanning domains across instances would need some hideous synchronisation logic for some implementations), but the current domain_alloc interface gives us no way to do that. On something like RK3288 with two different types of IOMMU on the platform "bus", it breaks down even further as there's no way to guarantee that iommu_domain_alloc() even gives you a domain from the right *driver* (hence bypassing it by going through ops directly here).

Default domain allocation comes down to how the bus organizes its iommu-groups. For every group (at least in its current design) a default domain is allocated. And a group is typically only behind a single iommu instance.

> Only for PCI devices, via iommu_group_get_for_pci_dev(). The code here, however, only runs for platform devices - ops will always be null for a PCI device since of_iommu_configure() will have bailed out (see the silly warning removed by my patch you picked up the other day).
> Once iommu_group_get_for_dev() supports platform devices, this too can go away. In the meantime, if someone adds PCI support to of_iommu_configure() and IOMMU_DOMAIN_DMA support to their IOMMU driver, then we'll get a domain back from iommu_get_domain_for_dev() and just use that.

What is the plan for getting an iommu-groups implementation for the platform bus?


	Joerg
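For illustration, here is a rough sketch of the flow Joerg suggests in both replies, assuming a hypothetical platform IOMMU driver; the mydrv_* names and the callback bodies are placeholders, not the posted patch or any real driver. The driver registers with the bus once, then finishes all per-device work from its own add_device callback.

#include <linux/err.h>
#include <linux/init.h>
#include <linux/iommu.h>
#include <linux/platform_device.h>

static int mydrv_add_device(struct device *dev)
{
        struct iommu_group *group;
        int ret;

        /* By this point the device exists on the bus and is visible in sysfs */
        group = iommu_group_alloc();
        if (IS_ERR(group))
                return PTR_ERR(group);

        ret = iommu_group_add_device(group, dev);
        iommu_group_put(group);
        if (ret)
                return ret;

        /*
         * x86-style: the driver (rather than the arch code) could now attach
         * the device to its DMA domain and install the IOMMU-backed
         * dma_map_ops, e.g. via dev->archdata.dma_ops on arm64.
         */
        return 0;
}

static const struct iommu_ops mydrv_iommu_ops = {
        .add_device     = mydrv_add_device,
        /* ...domain_alloc, attach_dev, map, unmap, etc... */
};

static int __init mydrv_iommu_init(void)
{
        /*
         * Devices already on the bus are walked immediately; devices added
         * later by the OF/driver core trigger the add_device callback above.
         */
        return bus_set_iommu(&platform_bus_type, &mydrv_iommu_ops);
}
arch_initcall(mydrv_iommu_init);

Robin's reply below explains why this alone isn't enough for DT platform devices: the bus notifier gives add_device nothing but a struct device pointer, so the driver can't easily tell which devices are its masters or what IDs they use.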
Hi Joerg,

On 11/08/15 10:49, Joerg Roedel wrote:
> On Fri, Aug 07, 2015 at 04:27:56PM +0100, Robin Murphy wrote:
>> As per the comments, the issue here lies in the order in which the OF/driver core code currently calls things for platform devices: as it stands we can't attach the device to a domain in arch_setup_dma_ops() because it doesn't have a group, and we can't even add it to a group ourselves because it isn't fully created and doesn't exist in sysfs yet. The only reason arch/arm is currently getting away without this workaround is that the few IOMMU drivers there hooked up to the generic infrastructure don't actually mind that they get an attach_device from arch_setup_dma_ops() before they've even seen an add_device (largely because they don't care about groups).
>
> Hmm, what about just registering the iommu driver in arch_setup_dma_ops with bus_set_iommu and not caring about devices? The code will iterate over the devices already on the bus and try to attach them to the iommu driver. But as you said, the devices are not yet on the bus, so when a device is later added by the OF/driver core code you can do everything needed for the device in the add_device callback.

The main problem here is that "the bus" turns out to be a nebulous and poorly-defined thing. "The PCI bus" breaks down as soon as you have multiple host controllers - you could quite easily build a SoC/chipset where not all the host controllers/root complexes are behind IOMMUs, or some are behind different types of IOMMU, and then what? The "platform bus" is just a glorified holding pen for device structures and utterly useless as any kind of abstraction.

Since I'm not concerning myself with PCI at the moment, consider the perspective of "the" IOMMU driver on the platform bus (assuming said driver is up and running first, either via OF_DECLARE or deferred probing of masters). The issues with relying on the add_device callback from the bus notifier alone are:

1) We get these callbacks for everything - devices behind one of our IOMMUs, devices behind different IOMMUs, devices with no IOMMU in their path at all - with nothing but a struct device pointer to try to distinguish one from the other.

2) Even for the devices actually relevant to us, we don't easily have the details we need to be able to set them up. In the PCI case, it's simple because the device has one bus ID which can trivially be looked up by common code; if you have driver-specific ACPI tables that identify platform devices and give them a single ID that you can kind of handle like a PCI device, fair enough; in the more general DT case, though, devices are going to have any number of arbitrary IDs plus who knows what other associated data.

Now, both of those could probably be handled by having a big mess of open-coded DT parsing in every IOMMU driver's add_device, but that's exactly the kind of thing that having a generic DT binding should avoid. The generic binding is good enough to express most such arbitrary data pretty well, and even crazy stuff like a single device mastering through multiple IOMMUs with different IDs, so there's every argument for keeping the DT parsing generic and in core code.
That's what the of_xlate infrastructure does: the appropriate IOMMU driver gets one or more of_xlate calls for a master device, from which it can collate all the "bus" information it needs, then later it gets the add_device callback once the device exists, at which point it can retrieve that data and use it to set up the device, allocate a group, etc. If allocating a group for a platform device automatically set up a DMA mapping domain and attached it, then we wouldn't need to do anything with domains in arch_setup_dma_ops and the ordering problem vanishes. However, the drivers can't support DMA domains without a dma_ops implementation to back them up, which needs the DMA mapping code, which needs DMA domains in order to work... Hence the initial need to bootstrap the process via fake DMA domains in the arch code, enabling steady incremental development instead of one massive wide-reaching patch dump that's virtually impossible for me to test well and for others to develop against.

> This might include initializing the hardware iommu needed for the device and setting its per-device dma_ops.

I'm not sure that feels appropriate for our situation - if your IOMMU is tightly integrated into the I/O hub which forms your CPU's only connection to the outside world, then having it do architectural things seems reasonable. The kind of IOMMUs I'm dealing with here, however, are either just some IP block that you stitch into your SoC's mess of interconnects somewhere like any other peripheral (to keep your legacy 32-bit devices behind), or even just a simple TLB with a page table walker and some control registers which you integrate directly into your own bigger devices. The latter are not really even general-purpose enough for arbitrary IOMMU API use, but treating them as separate IOMMUs makes sense from a common-driver perspective. Having drivers for those kinds of IOMMU do things like setting dma_ops directly, when they aren't even necessarily tied to a particular CPU architecture and may have to coexist with drivers for incompatible IOMMUs in the same system, sounds like a recipe for tons of duplicated code and horrible hacks. I think on DT-based systems we *have* to keep system topology and device infrastructure handling in the core OF code and not have IOMMU drivers reinventing funnily-shaped and conflicting wheels all over the place.

>> Laurent's probe-deferral series largely solves these problems in the right place - adding identical boilerplate code to every IOMMU driver to do something they shouldn't have to know about (and don't necessarily have all the right information for) is exactly what we don't want to do. As discussed over on another thread, I'm planning to pick that series up and polish it off after this, but my top priority is getting the basic dma_ops functionality into arm64 that people need right now. I will be only too happy when I can submit the patch removing this notifier workaround ;)
>
> I've often experienced someone promising to fix things later, only for it never to happen, so please understand that I am pretty cautious about adding such hacks ;)

Oh, I understand completely - that's why I've tried as much as possible to keep all the workarounds out of the core code and local to arm64 where it's easier for us to remove them cleanly.
If it's any consolation, said patch is already written: http://www.linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=af0110ed070c13417023ca7a560dc43e9d1c46f7 (the aforementioned rough WIP branch, from which I'm hoping to pull together an RFC for -rc1)

>> No driver other than the AMD IOMMU has any support yet. Support for IOMMU_DOMAIN_DMA can easily be added to existing drivers based on patch 1 of this series, but more generally it's not entirely clear how default domains are going to work beyond x86. On systems like Juno or Seattle with different sets of masters behind different IOMMU instances (with limited domain contexts each), the most sensible approach would be to have a single default domain per IOMMU (spanning domains across instances would need some hideous synchronisation logic for some implementations), but the current domain_alloc interface gives us no way to do that. On something like RK3288 with two different types of IOMMU on the platform "bus", it breaks down even further as there's no way to guarantee that iommu_domain_alloc() even gives you a domain from the right *driver* (hence bypassing it by going through ops directly here).
>
> Default domain allocation comes down to how the bus organizes its iommu-groups. For every group (at least in its current design) a default domain is allocated. And a group is typically only behind a single iommu instance.

Default-domain-per-group is great for device isolation, but the main thing people seem to want on ARM is the opposite of that - assembling big bits of virtually-contiguous memory that are visible to all the devices that need to share them. Per-IOMMU (or even per-system if appropriate) default domains provide a really neat solution for that use-case and mean we could mostly completely remove the IOMMU-specific code from GPU drivers, which is one of the things standing in the way of having arch/arm share the common implementation. I'll note that limited contexts aren't a made-up consideration either: the PCIe SMMU in my Juno can support a maximum of 4 domains, yet the bus has onboard LAN and SATA plus 4 slots...

>> Only for PCI devices, via iommu_group_get_for_pci_dev(). The code here, however, only runs for platform devices - ops will always be null for a PCI device since of_iommu_configure() will have bailed out (see the silly warning removed by my patch you picked up the other day). Once iommu_group_get_for_dev() supports platform devices, this too can go away. In the meantime, if someone adds PCI support to of_iommu_configure() and IOMMU_DOMAIN_DMA support to their IOMMU driver, then we'll get a domain back from iommu_get_domain_for_dev() and just use that.
>
> What is the plan for getting an iommu-groups implementation for the platform bus?

The realistic options are trying to handle it with logic in the drivers (which engenders code duplication and has potentially unsolvable ordering issues) or defining groups statically in DT via a new binding (which needs to avoid being too much of a Linux-specific implementation detail). I tried prototyping the former ages ago and it got overcomplicated very quickly. Having considered the static DT method more recently, I've sketched out some ideas for a binding and a rough design for the code, but nothing's actually written yet. It is another of the "make stuff work on arm64" balls I have in the air, though, if only because the USB SMMU on Juno is otherwise unusable (well, unless you hack out one or other of the EHCI/OHCI pair).

Robin.
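As a concrete illustration of the of_xlate/add_device split Robin describes above, a hedged sketch follows. The mydrv_* names, the streamid field and the use of dev->archdata.iommu as a per-device stash are illustrative assumptions (archdata.iommu is an arm/arm64-specific field), not taken from any particular driver.

#include <linux/device.h>
#include <linux/iommu.h>
#include <linux/of.h>
#include <linux/slab.h>

struct mydrv_master {
        struct device   *dev;
        u32             streamid;       /* ID from the generic "iommus" DT specifier */
};

/* Called by the core OF code while parsing the master's "iommus" property */
static int mydrv_of_xlate(struct device *dev, struct of_phandle_args *args)
{
        struct mydrv_master *master;

        master = kzalloc(sizeof(*master), GFP_KERNEL);
        if (!master)
                return -ENOMEM;

        master->dev = dev;
        master->streamid = args->args[0];
        /* Stash the "bus" information until the device actually shows up */
        dev->archdata.iommu = master;
        return 0;
}

/* Called later, once the OF/driver core has fully created the device */
static int mydrv_add_device(struct device *dev)
{
        struct mydrv_master *master = dev->archdata.iommu;

        if (!master)
                return -ENODEV; /* no "iommus" entry pointed at this IOMMU */

        /*
         * Now the device exists and has a sysfs presence, so the stashed
         * stream ID can be used to set up contexts, allocate or join an
         * iommu_group, and so on.
         */
        dev_info(dev, "IOMMU master, stream ID %#x\n", master->streamid);
        return 0;
}

static const struct iommu_ops mydrv_iommu_ops = {
        .add_device     = mydrv_add_device,
        .of_xlate       = mydrv_of_xlate,
        /* ...domain and map/unmap callbacks... */
};

With something along these lines, the DT parsing stays in the core of_iommu code and the driver only ever sees already-collated per-master data, which is the split Robin argues for.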
Hi Robin,

On Sat, Aug 1, 2015 at 1:18 AM, Robin Murphy <robin.murphy@arm.com> wrote:
> Taking some inspiration from the arch/arm code, implement the
> arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.
>
> Unfortunately the device setup code has to start out as a big ugly mess
> in order to work usefully right now, as 'proper' operation depends on
> changes to device probe and DMA configuration ordering, IOMMU groups for
> platform devices, and default domain support in arm/arm64 IOMMU drivers.
> The workarounds here need only exist until that work is finished.
>
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---

[snip]

> +static void __iommu_sync_sg_for_cpu(struct device *dev,
> +                                    struct scatterlist *sgl, int nelems,
> +                                    enum dma_data_direction dir)
> +{
> +        struct scatterlist *sg;
> +        int i;
> +
> +        if (is_device_dma_coherent(dev))
> +                return;
> +
> +        for_each_sg(sgl, sg, nelems, i)
> +                __dma_unmap_area(sg_virt(sg), sg->length, dir);
> +}

In an earlier review [0], Marek asked you to change the loop in __iommu_sync_sg_for_cpu() to loop over the virtual areas when invalidating/cleaning memory ranges.

[0] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-March/328232.html

However, this changed the meaning of the 'nelems' argument from what it was for arm_iommu_sync_sg_for_cpu() in arch/arm/mm/dma-mapping.c:
  "number of buffers to sync (returned from dma_map_sg)"
to:
  "number of virtual areas to sync (same as was passed to dma_map_sg)"

This has caused some confusion for callers of dma_sync_sg_for_device() that must work for both arm & arm64, as illustrated by [1].

[1] https://lkml.org/lkml/2015/9/21/250

Based on the implementation of debug_dma_sync_sg_for_cpu() in lib/dma-debug.c, I think the arm interpretation of nelems (returned from dma_map_sg) is correct. Therefore, I think we need an outer iteration over dma chunks, and an inner iteration that calls __dma_map_area() over the set of virtual areas that correspond to that dma chunk, both here and for __iommu_sync_sg_for_device(). This will be complicated by the fact that iommu pages could actually be smaller than PAGE_SIZE, and offset within a single physical page. Also, as an optimization, we would want to combine contiguous virtual areas into a single call to __dma_unmap_area().

-Dan
Hi Dan,

On 22/09/15 18:12, Daniel Kurtz wrote:
> Hi Robin,
>
> On Sat, Aug 1, 2015 at 1:18 AM, Robin Murphy <robin.murphy@arm.com> wrote:
>> Taking some inspiration from the arch/arm code, implement the
>> arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.
>>
>> Unfortunately the device setup code has to start out as a big ugly mess
>> in order to work usefully right now, as 'proper' operation depends on
>> changes to device probe and DMA configuration ordering, IOMMU groups for
>> platform devices, and default domain support in arm/arm64 IOMMU drivers.
>> The workarounds here need only exist until that work is finished.
>>
>> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
>> ---
>
> [snip]
>
>> +static void __iommu_sync_sg_for_cpu(struct device *dev,
>> +                                    struct scatterlist *sgl, int nelems,
>> +                                    enum dma_data_direction dir)
>> +{
>> +        struct scatterlist *sg;
>> +        int i;
>> +
>> +        if (is_device_dma_coherent(dev))
>> +                return;
>> +
>> +        for_each_sg(sgl, sg, nelems, i)
>> +                __dma_unmap_area(sg_virt(sg), sg->length, dir);
>> +}
>
> In an earlier review [0], Marek asked you to change the loop in __iommu_sync_sg_for_cpu() to loop over the virtual areas when invalidating/cleaning memory ranges.
>
> [0] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-March/328232.html
>
> However, this changed the meaning of the 'nelems' argument from what it was for arm_iommu_sync_sg_for_cpu() in arch/arm/mm/dma-mapping.c:
>   "number of buffers to sync (returned from dma_map_sg)"
> to:
>   "number of virtual areas to sync (same as was passed to dma_map_sg)"
>
> This has caused some confusion for callers of dma_sync_sg_for_device() that must work for both arm & arm64, as illustrated by [1].
> [1] https://lkml.org/lkml/2015/9/21/250

Funnily enough, I happened to stumble across that earlier of my own volition, and felt obliged to respond ;)

> Based on the implementation of debug_dma_sync_sg_for_cpu() in lib/dma-debug.c, I think the arm interpretation of nelems (returned from dma_map_sg) is correct.

As I explained over on the other thread, you can only do cache maintenance on CPU addresses, and those haven't changed regardless of what mapping you set up in the IOMMU for the device to see; therefore iterating over the mapped DMA chunks makes no sense if you have no way to infer a CPU address from a DMA address alone (indeed, I struggled a bit to get this initially, hence Marek's feedback).

Note that the arm_iommu_sync_sg_* code is iterating over entries using the original CPU address, offset and length fields in exactly that way, not using the DMA address/length fields at all, therefore if you pass in less than the original number of entries you'll simply miss out part of the buffer; what that code _does_ is indeed correct, but it's not the same thing as the comments imply, and the comments are wrong. AFAICS, debug_dma_sync_sg_* still expects to be called with the original nents as well, it just bails out early after mapped_ents entries since any further entries won't have DMA addresses to check anyway.

I suspect the offending comments were simply copied from the arm_dma_sync_sg_* implementations, which rather counterintuitively _do_ operate on the mapped DMA addresses, because they might be flushing a bounced copy of the buffer instead of the original pages (and can depend on the necessary 1:1 DMA:CPU relationship either way).

Robin.
[0]: http://article.gmane.org/gmane.linux.kernel/2044263

> Therefore, I think we need an outer iteration over dma chunks, and an inner iteration that calls __dma_map_area() over the set of virtual areas that correspond to that dma chunk, both here and for __iommu_sync_sg_for_device(). This will be complicated by the fact that iommu pages could actually be smaller than PAGE_SIZE, and offset within a single physical page. Also, as an optimization, we would want to combine contiguous virtual areas into a single call to __dma_unmap_area().
>
> -Dan
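As a toy illustration of the 'nelems' distinction being argued over here (simplified, not the actual arch/arm or arm64 implementation; cpu_cache_clean() is a hypothetical stand-in for an arch helper such as __dma_map_area()): cache maintenance walks the original entries by CPU address, while anything operating on DMA addresses walks only the count returned by dma_map_sg().

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Hypothetical arch cache-maintenance helper, standing in for __dma_map_area() */
extern void cpu_cache_clean(const void *vaddr, size_t size);

static void illustrate_sg_views(struct device *dev, struct scatterlist *sgl,
                                int nents, enum dma_data_direction dir)
{
        struct scatterlist *sg;
        int i, mapped;

        mapped = dma_map_sg(dev, sgl, nents, dir);

        /*
         * CPU view: cache maintenance can only use CPU addresses, which are
         * unchanged by the IOMMU mapping, so walk all 'nents' original
         * entries via sg_virt() - this is what the arm64 patch above does.
         */
        for_each_sg(sgl, sg, nents, i)
                cpu_cache_clean(sg_virt(sg), sg->length);

        /*
         * Device view: after mapping, only the first 'mapped' entries carry
         * valid DMA addresses/lengths (segments may have been merged), so
         * DMA-address-based walks use the returned count instead.
         */
        for_each_sg(sgl, sg, mapped, i)
                dev_dbg(dev, "segment %d: %pad + %u\n",
                        i, &sg_dma_address(sg), sg_dma_len(sg));

        dma_unmap_sg(dev, sgl, nents, dir);
}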
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c index e5d74cd..c735f45 100644 --- a/arch/arm64/mm/dma-mapping.c +++ b/arch/arm64/mm/dma-mapping.c @@ -534,3 +534,428 @@ static int __init dma_debug_do_init(void) return 0; } fs_initcall(dma_debug_do_init); + + +#ifdef CONFIG_IOMMU_DMA +#include <linux/dma-iommu.h> +#include <linux/platform_device.h> +#include <linux/amba/bus.h> + +/* Thankfully, all cache ops are by VA so we can ignore phys here */ +static void flush_page(struct device *dev, const void *virt, phys_addr_t phys) +{ + __dma_flush_range(virt, virt + PAGE_SIZE); +} + +static void *__iommu_alloc_attrs(struct device *dev, size_t size, + dma_addr_t *handle, gfp_t gfp, + struct dma_attrs *attrs) +{ + bool coherent = is_device_dma_coherent(dev); + int ioprot = dma_direction_to_prot(DMA_BIDIRECTIONAL, coherent); + void *addr; + + if (WARN(!dev, "cannot create IOMMU mapping for unknown device\n")) + return NULL; + /* + * Some drivers rely on this, and we probably don't want the + * possibility of stale kernel data being read by devices anyway. + */ + gfp |= __GFP_ZERO; + + if (gfp & __GFP_WAIT) { + struct page **pages; + pgprot_t prot = __get_dma_pgprot(attrs, PAGE_KERNEL, coherent); + + pages = iommu_dma_alloc(dev, size, gfp, ioprot, handle, + flush_page); + if (!pages) + return NULL; + + addr = dma_common_pages_remap(pages, size, VM_USERMAP, prot, + __builtin_return_address(0)); + if (!addr) + iommu_dma_free(dev, pages, size, handle); + } else { + struct page *page; + /* + * In atomic context we can't remap anything, so we'll only + * get the virtually contiguous buffer we need by way of a + * physically contiguous allocation. + */ + if (coherent) { + page = alloc_pages(gfp, get_order(size)); + addr = page ? page_address(page) : NULL; + } else { + addr = __alloc_from_pool(size, &page, gfp); + } + if (!addr) + return NULL; + + *handle = iommu_dma_map_page(dev, page, 0, size, ioprot); + if (iommu_dma_mapping_error(dev, *handle)) { + if (coherent) + __free_pages(page, get_order(size)); + else + __free_from_pool(addr, size); + addr = NULL; + } + } + return addr; +} + +static void __iommu_free_attrs(struct device *dev, size_t size, void *cpu_addr, + dma_addr_t handle, struct dma_attrs *attrs) +{ + /* + * @cpu_addr will be one of 3 things depending on how it was allocated: + * - A remapped array of pages from iommu_dma_alloc(), for all + * non-atomic allocations. + * - A non-cacheable alias from the atomic pool, for atomic + * allocations by non-coherent devices. + * - A normal lowmem address, for atomic allocations by + * coherent devices. + * Hence how dodgy the below logic looks... 
+ */ + if (__in_atomic_pool(cpu_addr, size)) { + iommu_dma_unmap_page(dev, handle, size, 0, NULL); + __free_from_pool(cpu_addr, size); + } else if (is_vmalloc_addr(cpu_addr)){ + struct vm_struct *area = find_vm_area(cpu_addr); + + if (WARN_ON(!area || !area->pages)) + return; + iommu_dma_free(dev, area->pages, size, &handle); + dma_common_free_remap(cpu_addr, size, VM_USERMAP); + } else { + iommu_dma_unmap_page(dev, handle, size, 0, NULL); + __free_pages(virt_to_page(cpu_addr), get_order(size)); + } +} + +static int __iommu_mmap_attrs(struct device *dev, struct vm_area_struct *vma, + void *cpu_addr, dma_addr_t dma_addr, size_t size, + struct dma_attrs *attrs) +{ + struct vm_struct *area; + int ret; + + vma->vm_page_prot = __get_dma_pgprot(attrs, vma->vm_page_prot, + is_device_dma_coherent(dev)); + + if (dma_mmap_from_coherent(dev, vma, cpu_addr, size, &ret)) + return ret; + + area = find_vm_area(cpu_addr); + if (WARN_ON(!area || !area->pages)) + return -ENXIO; + + return iommu_dma_mmap(area->pages, size, vma); +} + +static int __iommu_get_sgtable(struct device *dev, struct sg_table *sgt, + void *cpu_addr, dma_addr_t dma_addr, + size_t size, struct dma_attrs *attrs) +{ + unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT; + struct vm_struct *area = find_vm_area(cpu_addr); + + if (WARN_ON(!area || !area->pages)) + return -ENXIO; + + return sg_alloc_table_from_pages(sgt, area->pages, count, 0, size, + GFP_KERNEL); +} + +static void __iommu_sync_single_for_cpu(struct device *dev, + dma_addr_t dev_addr, size_t size, + enum dma_data_direction dir) +{ + phys_addr_t phys; + + if (is_device_dma_coherent(dev)) + return; + + phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev), dev_addr); + __dma_unmap_area(phys_to_virt(phys), size, dir); +} + +static void __iommu_sync_single_for_device(struct device *dev, + dma_addr_t dev_addr, size_t size, + enum dma_data_direction dir) +{ + phys_addr_t phys; + + if (is_device_dma_coherent(dev)) + return; + + phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev), dev_addr); + __dma_map_area(phys_to_virt(phys), size, dir); +} + +static dma_addr_t __iommu_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, + enum dma_data_direction dir, + struct dma_attrs *attrs) +{ + bool coherent = is_device_dma_coherent(dev); + int prot = dma_direction_to_prot(dir, coherent); + dma_addr_t dev_addr = iommu_dma_map_page(dev, page, offset, size, prot); + + if (!iommu_dma_mapping_error(dev, dev_addr) && + !dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs)) + __iommu_sync_single_for_device(dev, dev_addr, size, dir); + + return dev_addr; +} + +static void __iommu_unmap_page(struct device *dev, dma_addr_t dev_addr, + size_t size, enum dma_data_direction dir, + struct dma_attrs *attrs) +{ + if (!dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs)) + __iommu_sync_single_for_cpu(dev, dev_addr, size, dir); + + iommu_dma_unmap_page(dev, dev_addr, size, dir, attrs); +} + +static void __iommu_sync_sg_for_cpu(struct device *dev, + struct scatterlist *sgl, int nelems, + enum dma_data_direction dir) +{ + struct scatterlist *sg; + int i; + + if (is_device_dma_coherent(dev)) + return; + + for_each_sg(sgl, sg, nelems, i) + __dma_unmap_area(sg_virt(sg), sg->length, dir); +} + +static void __iommu_sync_sg_for_device(struct device *dev, + struct scatterlist *sgl, int nelems, + enum dma_data_direction dir) +{ + struct scatterlist *sg; + int i; + + if (is_device_dma_coherent(dev)) + return; + + for_each_sg(sgl, sg, nelems, i) + __dma_map_area(sg_virt(sg), sg->length, dir); +} + 
+static int __iommu_map_sg_attrs(struct device *dev, struct scatterlist *sgl, + int nelems, enum dma_data_direction dir, + struct dma_attrs *attrs) +{ + bool coherent = is_device_dma_coherent(dev); + + if (!dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs)) + __iommu_sync_sg_for_device(dev, sgl, nelems, dir); + + return iommu_dma_map_sg(dev, sgl, nelems, + dma_direction_to_prot(dir, coherent)); +} + +static void __iommu_unmap_sg_attrs(struct device *dev, + struct scatterlist *sgl, int nelems, + enum dma_data_direction dir, + struct dma_attrs *attrs) +{ + if (!dma_get_attr(DMA_ATTR_SKIP_CPU_SYNC, attrs)) + __iommu_sync_sg_for_cpu(dev, sgl, nelems, dir); + + iommu_dma_unmap_sg(dev, sgl, nelems, dir, attrs); +} + +static struct dma_map_ops iommu_dma_ops = { + .alloc = __iommu_alloc_attrs, + .free = __iommu_free_attrs, + .mmap = __iommu_mmap_attrs, + .get_sgtable = __iommu_get_sgtable, + .map_page = __iommu_map_page, + .unmap_page = __iommu_unmap_page, + .map_sg = __iommu_map_sg_attrs, + .unmap_sg = __iommu_unmap_sg_attrs, + .sync_single_for_cpu = __iommu_sync_single_for_cpu, + .sync_single_for_device = __iommu_sync_single_for_device, + .sync_sg_for_cpu = __iommu_sync_sg_for_cpu, + .sync_sg_for_device = __iommu_sync_sg_for_device, + .dma_supported = iommu_dma_supported, + .mapping_error = iommu_dma_mapping_error, +}; + +/* + * TODO: Right now __iommu_setup_dma_ops() gets called too early to do + * everything it needs to - the device isn't yet fully created, and the + * IOMMU driver hasn't seen it yet, so we need this delayed attachment + * dance. Once IOMMU probe ordering is sorted to move the + * arch_setup_dma_ops() call later, all the notifier bits below become + * unnecessary, and will go away. + */ +struct iommu_dma_notifier_data { + struct list_head list; + struct device *dev; + struct iommu_domain *dma_domain; +}; +static LIST_HEAD(iommu_dma_masters); +static DEFINE_MUTEX(iommu_dma_notifier_lock); + +static int __iommu_attach_notifier(struct notifier_block *nb, + unsigned long action, void *data) +{ + struct iommu_dma_notifier_data *master, *tmp; + + if (action != BUS_NOTIFY_ADD_DEVICE) + return 0; + /* + * We expect the list to only contain the most recent addition + * (which should be the same device as in @data) so just process + * the whole thing blindly. If any previous attachments did happen + * to fail, they get a free retry since the domains are still live. + */ + mutex_lock(&iommu_dma_notifier_lock); + list_for_each_entry_safe(master, tmp, &iommu_dma_masters, list) { + if (iommu_attach_device(master->dma_domain, master->dev)) { + pr_warn("Failed to attach device %s to IOMMU mapping; retaining platform DMA ops\n", + dev_name(master->dev)); + } else { + master->dev->archdata.dma_ops = &iommu_dma_ops; + list_del(&master->list); + kfree(master); + } + } + mutex_unlock(&iommu_dma_notifier_lock); + return 0; +} + +static int register_iommu_dma_ops_notifier(struct bus_type *bus) +{ + int ret; + struct notifier_block *nb = kzalloc(sizeof(*nb), GFP_KERNEL); + + /* + * The device must be attached to a domain before the driver probe + * routine gets a chance to start allocating DMA buffers. However, + * the IOMMU driver also needs a chance to configure the iommu_group + * via its add_device callback first, so we need to make the attach + * happen between those two points. Since the IOMMU core uses a bus + * notifier with default priority for add_device, do the same but + * with a lower priority to ensure the appropriate ordering. 
+ */ + nb->notifier_call = __iommu_attach_notifier; + nb->priority = -100; + + ret = bus_register_notifier(bus, nb); + if (ret) { + pr_warn("Failed to register DMA domain notifier; IOMMU DMA ops unavailable on bus '%s'\n", + bus->name); + kfree(nb); + } + return ret; +} + +static int queue_iommu_attach(struct iommu_domain *domain, struct device *dev) +{ + struct iommu_dma_notifier_data *iommudata = NULL; + + iommudata = kzalloc(sizeof(*iommudata), GFP_KERNEL); + if (!iommudata) + return -ENOMEM; + + iommudata->dev = dev; + iommudata->dma_domain = domain; + + mutex_lock(&iommu_dma_notifier_lock); + list_add(&iommudata->list, &iommu_dma_masters); + mutex_unlock(&iommu_dma_notifier_lock); + return 0; +} + +static int __init arm64_iommu_dma_init(void) +{ + int ret; + + ret = iommu_dma_init(); + if (!ret) + ret = register_iommu_dma_ops_notifier(&platform_bus_type); + if (!ret) + ret = register_iommu_dma_ops_notifier(&amba_bustype); + return ret; +} +arch_initcall(arm64_iommu_dma_init); + +/* Hijack some domain feature flags for the stop-gap meddling below */ +#define __IOMMU_DOMAIN_ARM64 (1U << 31) +#define __IOMMU_DOMAIN_ARM64_IOVA (1U << 30) + +static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, + const struct iommu_ops *ops) +{ + struct iommu_domain *domain; + int err; + + if (!ops) + return; + /* + * In a perfect world, everything happened in the right order up to + * here, and the IOMMU core has already attached the device to an + * appropriate default domain for us to set up... + */ + domain = iommu_get_domain_for_dev(dev); + if (!domain) { + /* + * Urgh. Reliable default domains for platform devices can't + * happen anyway without some sensible way of handling + * non-trivial groups. So until then, HORRIBLE HACKS! + */ + domain = ops->domain_alloc(IOMMU_DOMAIN_DMA); + if (!domain) + domain = ops->domain_alloc(IOMMU_DOMAIN_UNMANAGED); + if (!domain) + goto out_no_domain; + + domain->ops = ops; + domain->type = IOMMU_DOMAIN_DMA | __IOMMU_DOMAIN_ARM64; + if (!domain->dma_api_cookie) { + domain->type |= __IOMMU_DOMAIN_ARM64_IOVA; + if (iommu_get_dma_cookie(domain)) + goto out_put_domain; + } + } + + if (iommu_dma_init_domain(domain, dma_base, size)) { + pr_warn("Failed to create %llu-byte IOMMU mapping for device %s\n", + size, dev_name(dev)); + goto out_put_domain; + } + + if (dev->iommu_group) + err = iommu_attach_device(domain, dev); + else + err = queue_iommu_attach(domain, dev); + + if (!err) { + dev->archdata.dma_ops = &iommu_dma_ops; + return; + } + +out_put_domain: + if (domain->type & __IOMMU_DOMAIN_ARM64_IOVA) + iommu_put_dma_cookie(domain); + if (domain->type & __IOMMU_DOMAIN_ARM64) + ops->domain_free(domain); +out_no_domain: + pr_warn("Failed to set up IOMMU domain for device %s\n", dev_name(dev)); +} + +#else + +static void __iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, + struct iommu_ops *iommu) +{ } + +#endif /* CONFIG_IOMMU_DMA */
Taking some inspiration from the arch/arm code, implement the arch-specific side of the DMA mapping ops using the new IOMMU-DMA layer.

Unfortunately the device setup code has to start out as a big ugly mess in order to work usefully right now, as 'proper' operation depends on changes to device probe and DMA configuration ordering, IOMMU groups for platform devices, and default domain support in arm/arm64 IOMMU drivers. The workarounds here need only exist until that work is finished.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 arch/arm64/mm/dma-mapping.c | 425 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 425 insertions(+)