Message ID | 1483947002-16410-1-git-send-email-nikita.yoush@cogentembedded.com (mailing list archive)
---|---
State | New, archived |
On Mon, Jan 09, 2017 at 10:30:02AM +0300, Nikita Yushchenko wrote:
> It is possible that a device is capable of 64-bit DMA addresses and the
> device driver tries to set a wide DMA mask, but the bridge or bus used
> to connect the device to the system can't handle wide addresses.
>
> With swiotlb, memory above 4G can still be used by drivers for streaming
> DMA, but *dev->dma_mask and dev->coherent_dma_mask must still keep values
> that the hardware handles physically.
>
> This patch enforces that. Based on original version by
> Arnd Bergmann <arnd@arndb.de>, extended with coherent mask handling.
>
> Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
> CC: Arnd Bergmann <arnd@arndb.de>
> ---
> Changes since v1:
> - fixed issues noted by Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
>   - save mask, not size
>   - remove double empty line
>
>  arch/arm64/Kconfig              |  3 +++
>  arch/arm64/include/asm/device.h |  1 +
>  arch/arm64/mm/dma-mapping.c     | 51 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 55 insertions(+)

I still don't think this patch is general enough. The problem you're
seeing with swiotlb seems to be exactly the same problem reported by
Feng Kan over at:

  http://lkml.kernel.org/r/CAL85gmA_SSCwM80TKdkZqEe+S1beWzDEvdki1kpkmUTDRmSP7g@mail.gmail.com

[read on; it was initially thought to be a hardware erratum, but it's
actually the inability to restrict the DMA mask of the endpoint that's
the problem]

The point here is that an IOMMU doesn't solve your issue, and the
IOMMU-backed DMA ops need the same treatment. In light of that, it really
feels to me like the DMA masks should be restricted in of_dma_configure
so that the parent mask is taken into account there, rather than hooking
into each set of DMA ops to intercept set_dma_mask. We'd still need to
do something to stop dma_set_mask widening the mask if it was restricted
by of_dma_configure, but I think Robin (cc'd) was playing with that.

Will
Hi

> The point here is that an IOMMU doesn't solve your issue, and the
> IOMMU-backed DMA ops need the same treatment. In light of that, it really
> feels to me like the DMA masks should be restricted in of_dma_configure
> so that the parent mask is taken into account there, rather than hooking
> into each set of DMA ops to intercept set_dma_mask. We'd still need to
> do something to stop dma_set_mask widening the mask if it was restricted
> by of_dma_configure, but I think Robin (cc'd) was playing with that.

What issue is it that an IOMMU "doesn't solve"?

The issue I'm trying to address is an inconsistency within the swiotlb
dma_map_ops, where (1) any wide mask is silently accepted, but (2) the
mask is then used to decide whether bounce buffers are needed or not.
This inconsistency causes the NVMe + R-Car combo to not work (and to
corrupt memory instead).

I just can't see what similar issue an IOMMU can have.
Do you mean that in the IOMMU case, the mask also must not be set wider
than its initial value? Why? What is the use of the mask in the IOMMU
case? Is there any real case where an IOMMU can't address all memory
existing in the system?

The NVMe maintainer has just stated that they expect
set_dma_mask(DMA_BIT_MASK(64)) to always succeed, and are going to error
out of driver probe if that call fails. They claim that the architecture
must always be able to dma_map() whatever memory exists in the system -
via iommu or swiotlb or whatever. Their direction is to remove bounce
buffers from the block and other layers.

With this direction, the semantics of the DMA mask become even more
questionable. I'd say dma_mask is a candidate for removal (or for moving
into swiotlb's or the IOMMU's local area)

Nikita
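(For illustration, the two halves of the inconsistency Nikita describes look roughly like this in the 4.9-era code - a condensed sketch of arch/arm64/mm/dma-mapping.c and lib/swiotlb.c, not the exact source; map_single() is the existing swiotlb bounce helper:)

	/* (1) the arm64 swiotlb ops accept any mask the bounce buffer
	 * satisfies, however wide: */
	static int __swiotlb_dma_supported(struct device *hwdev, u64 mask)
	{
		if (swiotlb)
			return swiotlb_dma_supported(hwdev, mask);
		return 1;
	}

	/* (2) ... and the map path then trusts that mask when deciding
	 * whether to bounce (condensed from swiotlb_map_page()): */
	dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,
				    unsigned long offset, size_t size,
				    enum dma_data_direction dir,
				    unsigned long attrs)
	{
		phys_addr_t phys = page_to_phys(page) + offset;
		dma_addr_t dev_addr = phys_to_dma(dev, phys);

		/* with a blindly accepted 64-bit mask this is always true,
		 * so pages above the bus limit go to the device unbounced */
		if (dma_capable(dev, dev_addr, size) && !swiotlb_force)
			return dev_addr;

		/* only narrower masks ever reach the bounce buffer */
		return phys_to_dma(dev, map_single(dev, phys, size, dir, attrs));
	}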
On Tuesday, January 10, 2017 3:47:25 PM CET Nikita Yushchenko wrote:
>
>> The point here is that an IOMMU doesn't solve your issue, and the
>> IOMMU-backed DMA ops need the same treatment. In light of that, it really
>> feels to me like the DMA masks should be restricted in of_dma_configure
>> so that the parent mask is taken into account there, rather than hooking
>> into each set of DMA ops to intercept set_dma_mask.

of_dma_configure() sets up a 32-bit mask, which is assumed to always work
in the kernel. We can't change it to a larger mask because that would
break drivers that have to restrict devices to 32-bit. If the bus
addressing is narrower than 32 bits, however, the initial mask should
probably be limited to whatever the bus supports, but that is not the
problem we are trying to solve here.

>> We'd still need to
>> do something to stop dma_set_mask widening the mask if it was restricted
>> by of_dma_configure, but I think Robin (cc'd) was playing with that.
>
> What issue is it that an IOMMU "doesn't solve"?
>
> The issue I'm trying to address is an inconsistency within the swiotlb
> dma_map_ops, where (1) any wide mask is silently accepted, but (2) the
> mask is then used to decide whether bounce buffers are needed or not.
> This inconsistency causes the NVMe + R-Car combo to not work (and to
> corrupt memory instead).

It's not just an inconsistency, it's a known bug that we really need
to fix.

> I just can't see what similar issue an IOMMU can have.
> Do you mean that in the IOMMU case, the mask also must not be set wider
> than its initial value? Why? What is the use of the mask in the IOMMU
> case? Is there any real case where an IOMMU can't address all memory
> existing in the system?

I think the problem that Will is referring to is when the IOMMU has a
virtual address space that is wider than the DMA mask of the device: in
this case, dma_map_single() might return a dma_addr_t that is not
reachable by the device. I'd consider that a separate bug that needs to
be worked around in the IOMMU code.

> The NVMe maintainer has just stated that they expect
> set_dma_mask(DMA_BIT_MASK(64)) to always succeed, and are going to error
> out of driver probe if that call fails. They claim that the architecture
> must always be able to dma_map() whatever memory exists in the system -
> via iommu or swiotlb or whatever. Their direction is to remove bounce
> buffers from the block and other layers.
>
> With this direction, the semantics of the DMA mask become even more
> questionable. I'd say dma_mask is a candidate for removal (or for moving
> into swiotlb's or the IOMMU's local area)

Removing dma_mask is not realistic any time soon; there are too many
things in the kernel that use it for one thing or another, so any changes
here have to be done really carefully. We definitely need the mask to
support architectures without swiotlb.

	Arnd
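(For reference, the of_dma_configure() default Arnd describes looks roughly like this in the 4.9-era drivers/of/device.c - a simplified sketch with error handling and the IOMMU lookup omitted:)

	void of_dma_configure(struct device *dev, struct device_node *np)
	{
		u64 dma_addr = 0, paddr = 0, size = 0;

		/* the 32-bit default that is assumed to always work */
		dev->coherent_dma_mask = DMA_BIT_MASK(32);
		if (!dev->dma_mask)
			dev->dma_mask = &dev->coherent_dma_mask;

		/* dma-ranges narrows "size", but note that the masks above
		 * are not narrowed to match - the gap under discussion */
		if (of_dma_get_range(np, &dma_addr, &paddr, &size) < 0)
			size = dev->coherent_dma_mask + 1;

		arch_setup_dma_ops(dev, dma_addr, size, NULL,
				   of_dma_is_coherent(np));
	}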
On 10/01/17 12:47, Nikita Yushchenko wrote:
> Hi
>
>> The point here is that an IOMMU doesn't solve your issue, and the
>> IOMMU-backed DMA ops need the same treatment. In light of that, it really
>> feels to me like the DMA masks should be restricted in of_dma_configure
>> so that the parent mask is taken into account there, rather than hooking
>> into each set of DMA ops to intercept set_dma_mask. We'd still need to
>> do something to stop dma_set_mask widening the mask if it was restricted
>> by of_dma_configure, but I think Robin (cc'd) was playing with that.
>
> What issue is it that an IOMMU "doesn't solve"?
>
> The issue I'm trying to address is an inconsistency within the swiotlb
> dma_map_ops, where (1) any wide mask is silently accepted, but (2) the
> mask is then used to decide whether bounce buffers are needed or not.
> This inconsistency causes the NVMe + R-Car combo to not work (and to
> corrupt memory instead).

The fundamental underlying problem is the "any wide mask is silently
accepted" part, and that applies equally to the IOMMU ops as well.

> I just can't see what similar issue an IOMMU can have.
> Do you mean that in the IOMMU case, the mask also must not be set wider
> than its initial value? Why? What is the use of the mask in the IOMMU
> case? Is there any real case where an IOMMU can't address all memory
> existing in the system?

There's a very subtle misunderstanding there - the DMA mask does not
describe the memory a device can address, it describes the range of
addresses the device is capable of generating. Yes, in the non-IOMMU
case they are equivalent, but once you put an IOMMU in between, the
problem is merely shifted from "what range of physical addresses can
this device access" to "what range of IOVAs is valid to give to this
device" - the fact that those IOVAs can map to any underlying physical
address only obviates the need for any bouncing at the memory end; it
doesn't remove the fact that the device has a hardware addressing
limitation which needs to be accommodated.

The thread Will linked to describes the equivalent version of your
problem - the IOMMU gives the device 48-bit addresses which get
erroneously truncated because it doesn't know that only 42 bits are
actually wired up. That situation still requires the device's DMA mask
to correctly describe its addressing capability, just as yours does.

> The NVMe maintainer has just stated that they expect
> set_dma_mask(DMA_BIT_MASK(64)) to always succeed, and are going to error
> out of driver probe if that call fails. They claim that the architecture
> must always be able to dma_map() whatever memory exists in the system -
> via iommu or swiotlb or whatever. Their direction is to remove bounce
> buffers from the block and other layers.

I have to say I'm in full agreement with that - having subsystems do
their own bouncing is generally redundant, although there are some edge
cases like MMC_BLOCK_BOUNCE (which is primarily about aligning and
coalescing buffers rather than working around DMA limitations per se).

> With this direction, the semantics of the DMA mask become even more
> questionable. I'd say dma_mask is a candidate for removal (or for moving
> into swiotlb's or the IOMMU's local area)

We still need a way for drivers to communicate a device's probed
addressing capability to SWIOTLB, so there's always going to have to be
*some* sort of public interface.
Personally, the change in semantics I'd like to see is to make
dma_set_mask() only fail if DMA is entirely disallowed - in the normal
case it would always succeed, but the DMA API implementation would be
permitted to set a smaller mask than requested (this is effectively what
the x86 IOMMU ops do already). The significant work that would require,
though, is changing all the drivers currently using this sort of pattern:

	if (!dma_set_mask(dev, DMA_BIT_MASK(64)))
		/* put device into 64-bit mode */
	else if (!dma_set_mask(dev, DMA_BIT_MASK(32)))
		/* put device into 32-bit mode */
	else
		/* error */

to something like this:

	if (dma_set_mask(dev, DMA_BIT_MASK(64)))
		/* error */
	if (dma_get_mask(dev) > DMA_BIT_MASK(32))
		/* put device into 64-bit mode */
	else
		/* put device into 32-bit mode */

Which would be a pretty major job.

Robin.
On Tuesday, January 10, 2017 1:25:12 PM CET Robin Murphy wrote:
> On 10/01/17 12:47, Nikita Yushchenko wrote:
>>> The point here is that an IOMMU doesn't solve your issue, and the
>>> IOMMU-backed DMA ops need the same treatment. In light of that, it really
>>> feels to me like the DMA masks should be restricted in of_dma_configure
>>> so that the parent mask is taken into account there, rather than hooking
>>> into each set of DMA ops to intercept set_dma_mask. We'd still need to
>>> do something to stop dma_set_mask widening the mask if it was restricted
>>> by of_dma_configure, but I think Robin (cc'd) was playing with that.
>>
>> What issue is it that an IOMMU "doesn't solve"?
>>
>> The issue I'm trying to address is an inconsistency within the swiotlb
>> dma_map_ops, where (1) any wide mask is silently accepted, but (2) the
>> mask is then used to decide whether bounce buffers are needed or not.
>> This inconsistency causes the NVMe + R-Car combo to not work (and to
>> corrupt memory instead).
>
> The fundamental underlying problem is the "any wide mask is silently
> accepted" part, and that applies equally to the IOMMU ops as well.

It's a much rarer problem for the IOMMU case though, because it only
impacts devices that are restricted to addressing of less than 32 bits.

If you have an IOMMU enabled, the dma-mapping interface does not care
if the device can do wider than 32-bit addressing, as it will never
hand out IOVAs above 0xffffffff.

>> I just can't see what similar issue an IOMMU can have.
>> Do you mean that in the IOMMU case, the mask also must not be set wider
>> than its initial value? Why? What is the use of the mask in the IOMMU
>> case? Is there any real case where an IOMMU can't address all memory
>> existing in the system?
>
> There's a very subtle misunderstanding there - the DMA mask does not
> describe the memory a device can address, it describes the range of
> addresses the device is capable of generating. Yes, in the non-IOMMU
> case they are equivalent, but once you put an IOMMU in between, the
> problem is merely shifted from "what range of physical addresses can
> this device access" to "what range of IOVAs is valid to give to this
> device" - the fact that those IOVAs can map to any underlying physical
> address only obviates the need for any bouncing at the memory end; it
> doesn't remove the fact that the device has a hardware addressing
> limitation which needs to be accommodated.
>
> The thread Will linked to describes the equivalent version of your
> problem - the IOMMU gives the device 48-bit addresses which get
> erroneously truncated because it doesn't know that only 42 bits are
> actually wired up. That situation still requires the device's DMA mask
> to correctly describe its addressing capability, just as yours does.

That problem should only impact virtual machines which have a guest
bus address space covering more than 42 bits of physical RAM, whereas
the problem we have with swiotlb is for the dma-mapping interface.

>> With this direction, the semantics of the DMA mask become even more
>> questionable. I'd say dma_mask is a candidate for removal (or for moving
>> into swiotlb's or the IOMMU's local area)
>
> We still need a way for drivers to communicate a device's probed
> addressing capability to SWIOTLB, so there's always going to have to be
> *some* sort of public interface.
> Personally, the change in semantics I'd like to see is to make
> dma_set_mask() only fail if DMA is entirely disallowed - in the normal
> case it would always succeed, but the DMA API implementation would be
> permitted to set a smaller mask than requested (this is effectively what
> the x86 IOMMU ops do already).

With swiotlb enabled, it only needs to fail if the mask does not contain
the swiotlb bounce buffer area, either because the start of RAM is
outside of the mask, or because the bounce area has been allocated at the
end of ZONE_DMA and the mask is smaller than ZONE_DMA.

	Arnd
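(That is essentially the check the generic swiotlb already performs - the 4.9-era lib/swiotlb.c version, reproduced from memory, so treat it as a sketch:)

	int swiotlb_dma_supported(struct device *hwdev, u64 mask)
	{
		/* succeed iff the whole bounce buffer sits within the mask */
		return phys_to_dma(hwdev, io_tlb_end - 1) <= mask;
	}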
>> What issue is it that an IOMMU "doesn't solve"?
>>
>> The issue I'm trying to address is an inconsistency within the swiotlb
>> dma_map_ops, where (1) any wide mask is silently accepted, but (2) the
>> mask is then used to decide whether bounce buffers are needed or not.
>> This inconsistency causes the NVMe + R-Car combo to not work (and to
>> corrupt memory instead).
>
> The fundamental underlying problem is the "any wide mask is silently
> accepted" part, and that applies equally to the IOMMU ops as well.

Is the version I just posted better? It should cover the IOMMU case as
well.

Nikita
On 10/01/17 13:42, Arnd Bergmann wrote:
> On Tuesday, January 10, 2017 1:25:12 PM CET Robin Murphy wrote:
>> On 10/01/17 12:47, Nikita Yushchenko wrote:
>>>> The point here is that an IOMMU doesn't solve your issue, and the
>>>> IOMMU-backed DMA ops need the same treatment. In light of that, it really
>>>> feels to me like the DMA masks should be restricted in of_dma_configure
>>>> so that the parent mask is taken into account there, rather than hooking
>>>> into each set of DMA ops to intercept set_dma_mask. We'd still need to
>>>> do something to stop dma_set_mask widening the mask if it was restricted
>>>> by of_dma_configure, but I think Robin (cc'd) was playing with that.
>>>
>>> What issue is it that an IOMMU "doesn't solve"?
>>>
>>> The issue I'm trying to address is an inconsistency within the swiotlb
>>> dma_map_ops, where (1) any wide mask is silently accepted, but (2) the
>>> mask is then used to decide whether bounce buffers are needed or not.
>>> This inconsistency causes the NVMe + R-Car combo to not work (and to
>>> corrupt memory instead).
>>
>> The fundamental underlying problem is the "any wide mask is silently
>> accepted" part, and that applies equally to the IOMMU ops as well.
>
> It's a much rarer problem for the IOMMU case though, because it only
> impacts devices that are restricted to addressing of less than 32 bits.
>
> If you have an IOMMU enabled, the dma-mapping interface does not care
> if the device can do wider than 32-bit addressing, as it will never
> hand out IOVAs above 0xffffffff.

I can assure you that it will - we constrain allocations to the
intersection of the IOMMU domain aperture (normally the IOMMU's physical
input address width) and the given device's DMA mask. If both of those
are >32 bits then >32-bit IOVAs will fall out. For the arm64/common
implementation I have prototyped a copy of the x86 optimisation which
always first tries to get 32-bit IOVAs for PCI devices, but even then it
can start returning higher addresses if the 32-bit space fills up.

>>> I just can't see what similar issue an IOMMU can have.
>>> Do you mean that in the IOMMU case, the mask also must not be set wider
>>> than its initial value? Why? What is the use of the mask in the IOMMU
>>> case? Is there any real case where an IOMMU can't address all memory
>>> existing in the system?
>>
>> There's a very subtle misunderstanding there - the DMA mask does not
>> describe the memory a device can address, it describes the range of
>> addresses the device is capable of generating. Yes, in the non-IOMMU
>> case they are equivalent, but once you put an IOMMU in between, the
>> problem is merely shifted from "what range of physical addresses can
>> this device access" to "what range of IOVAs is valid to give to this
>> device" - the fact that those IOVAs can map to any underlying physical
>> address only obviates the need for any bouncing at the memory end; it
>> doesn't remove the fact that the device has a hardware addressing
>> limitation which needs to be accommodated.
>>
>> The thread Will linked to describes the equivalent version of your
>> problem - the IOMMU gives the device 48-bit addresses which get
>> erroneously truncated because it doesn't know that only 42 bits are
>> actually wired up. That situation still requires the device's DMA mask
>> to correctly describe its addressing capability, just as yours does.
>
> That problem should only impact virtual machines which have a guest
> bus address space covering more than 42 bits of physical RAM, whereas
> the problem we have with swiotlb is for the dma-mapping interface.
As above, it impacts DMA API use for anything whose addressing capability
is narrower than the IOMMU's reported input size and whose driver is able
to blindly set a too-big DMA mask. It just happens that the stars line up
on most systems, and for 32-bit devices that keep the default DMA mask.

I actually have a third variation of this problem involving a PCI root
complex which *could* drive full-width (40-bit) addresses, but won't,
due to the way its PCI<->AXI interface is programmed. That would require
even more complicated dma-ranges handling to describe the windows of
valid physical addresses which it *will* pass, so I'm not pressing the
issue - let's just get the basic DMA mask case fixed first.

>>> With this direction, the semantics of the DMA mask become even more
>>> questionable. I'd say dma_mask is a candidate for removal (or for moving
>>> into swiotlb's or the IOMMU's local area)
>>
>> We still need a way for drivers to communicate a device's probed
>> addressing capability to SWIOTLB, so there's always going to have to be
>> *some* sort of public interface. Personally, the change in semantics I'd
>> like to see is to make dma_set_mask() only fail if DMA is entirely
>> disallowed - in the normal case it would always succeed, but the DMA API
>> implementation would be permitted to set a smaller mask than requested
>> (this is effectively what the x86 IOMMU ops do already).
>
> With swiotlb enabled, it only needs to fail if the mask does not contain
> the swiotlb bounce buffer area, either because the start of RAM is
> outside of the mask, or because the bounce area has been allocated at the
> end of ZONE_DMA and the mask is smaller than ZONE_DMA.

Agreed, I'd managed to overlook that specific case, but I'd be inclined
to consider "impossible" a subset of "disallowed" still :)

Robin.
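(The allocation constraint Robin describes is visible in that era's drivers/iommu/dma-iommu.c - a condensed sketch of __alloc_iova(), with helper details elided and not guaranteed to match the source line for line:)

	static struct iova *__alloc_iova(struct iommu_domain *domain,
					 size_t size, u64 dma_limit)
	{
		struct iova_domain *iovad = domain->iova_cookie;
		unsigned long shift = iova_shift(iovad);
		unsigned long length = iova_align(iovad, size) >> shift;

		/* intersect the device's DMA mask with the IOMMU aperture */
		if (domain->geometry.force_aperture)
			dma_limit = min_t(u64, dma_limit,
					  domain->geometry.aperture_end);

		/* the final "true" requests the topmost available space, so
		 * a too-wide mask immediately yields IOVAs the device may
		 * not actually be able to drive */
		return alloc_iova(iovad, length, dma_limit >> shift, true);
	}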
On Tue, Jan 10, 2017 at 03:47:25PM +0300, Nikita Yushchenko wrote:
> With this direction, the semantics of the DMA mask become even more
> questionable. I'd say dma_mask is a candidate for removal (or for moving
> into swiotlb's or the IOMMU's local area)

We need the DMA mask so that the device can advertise what addresses it
supports. Many old devices only support 32-bit DMA addressing, and some
less common ones support just 24 bits or other weird widths.
On Tue, Jan 10, 2017 at 01:25:12PM +0000, Robin Murphy wrote:
> We still need a way for drivers to communicate a device's probed
> addressing capability to SWIOTLB, so there's always going to have to be
> *some* sort of public interface. Personally, the change in semantics I'd
> like to see is to make dma_set_mask() only fail if DMA is entirely
> disallowed - in the normal case it would always succeed, but the DMA API
> implementation would be permitted to set a smaller mask than requested
> (this is effectively what the x86 IOMMU ops do already).

Yes, this sounds reasonable.

> The significant
> work that would require, though, is changing all the drivers currently
> using this sort of pattern:
>
> 	if (!dma_set_mask(dev, DMA_BIT_MASK(64)))
> 		/* put device into 64-bit mode */
> 	else if (!dma_set_mask(dev, DMA_BIT_MASK(32)))
> 		/* put device into 32-bit mode */
> 	else
> 		/* error */

While we have this pattern in a lot of places, it's already rather
pointless on most architectures, as the first dma_set_mask call won't
ever fail for the most common dma_ops implementations.

> to something like this:
>
> 	if (dma_set_mask(dev, DMA_BIT_MASK(64)))
> 		/* error */
> 	if (dma_get_mask(dev) > DMA_BIT_MASK(32))
> 		/* put device into 64-bit mode */
> 	else
> 		/* put device into 32-bit mode */
>
> Which would be a pretty major job.

I don't think it's too bad. Also, for many modern devices there is no
need to put the device into a specific mode. It's mostly a historic
issue from the PCI/PCI-X days with the less efficient DAC addressing
scheme.
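(In other words, a driver for modern hardware could collapse the whole dance to something like this - a hypothetical probe fragment; dma_set_mask_and_coherent() is the existing helper that sets both masks in one call:)

	static int mydev_setup_dma(struct device *dev)
	{
		/* no DAC-style device mode to configure either way; under
		 * the proposed semantics this fails only if DMA is
		 * impossible altogether */
		return dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64));
	}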
On Tue, Jan 10, 2017 at 02:42:23PM +0100, Arnd Bergmann wrote:
> It's a much rarer problem for the IOMMU case though, because it only
> impacts devices that are restricted to addressing of less than 32 bits.
>
> If you have an IOMMU enabled, the dma-mapping interface does not care
> if the device can do wider than 32-bit addressing, as it will never
> hand out IOVAs above 0xffffffff.

That's absolutely not the case. IOMMUs can and do generate addresses
larger than 32 bits. Also, various platforms have modes where an IOMMU
can be used when <= 32-bit addresses are used and bypassed if full
64-bit addressing is supported and I/O isolation is not explicitly
requested.
On Tuesday, January 10, 2017 2:16:57 PM CET Robin Murphy wrote:
> On 10/01/17 13:42, Arnd Bergmann wrote:
>> On Tuesday, January 10, 2017 1:25:12 PM CET Robin Murphy wrote:
>>> On 10/01/17 12:47, Nikita Yushchenko wrote:
>>>>> The point here is that an IOMMU doesn't solve your issue, and the
>>>>> IOMMU-backed DMA ops need the same treatment. In light of that, it really
>>>>> feels to me like the DMA masks should be restricted in of_dma_configure
>>>>> so that the parent mask is taken into account there, rather than hooking
>>>>> into each set of DMA ops to intercept set_dma_mask. We'd still need to
>>>>> do something to stop dma_set_mask widening the mask if it was restricted
>>>>> by of_dma_configure, but I think Robin (cc'd) was playing with that.
>>>>
>>>> What issue is it that an IOMMU "doesn't solve"?
>>>>
>>>> The issue I'm trying to address is an inconsistency within the swiotlb
>>>> dma_map_ops, where (1) any wide mask is silently accepted, but (2) the
>>>> mask is then used to decide whether bounce buffers are needed or not.
>>>> This inconsistency causes the NVMe + R-Car combo to not work (and to
>>>> corrupt memory instead).
>>>
>>> The fundamental underlying problem is the "any wide mask is silently
>>> accepted" part, and that applies equally to the IOMMU ops as well.
>>
>> It's a much rarer problem for the IOMMU case though, because it only
>> impacts devices that are restricted to addressing of less than 32 bits.
>>
>> If you have an IOMMU enabled, the dma-mapping interface does not care
>> if the device can do wider than 32-bit addressing, as it will never
>> hand out IOVAs above 0xffffffff.
>
> I can assure you that it will - we constrain allocations to the
> intersection of the IOMMU domain aperture (normally the IOMMU's physical
> input address width) and the given device's DMA mask. If both of those
> are >32 bits then >32-bit IOVAs will fall out. For the arm64/common
> implementation I have prototyped a copy of the x86 optimisation which
> always first tries to get 32-bit IOVAs for PCI devices, but even then it
> can start returning higher addresses if the 32-bit space fills up.

Ok, got it. I have to admit that most of my knowledge about the
internals of IOMMUs is from PowerPC of a few years ago, which couldn't
do this at all. I agree that we need to do the same thing for swiotlb
and iommu then.

>>> The thread Will linked to describes the equivalent version of your
>>> problem - the IOMMU gives the device 48-bit addresses which get
>>> erroneously truncated because it doesn't know that only 42 bits are
>>> actually wired up. That situation still requires the device's DMA mask
>>> to correctly describe its addressing capability, just as yours does.
>>
>> That problem should only impact virtual machines which have a guest
>> bus address space covering more than 42 bits of physical RAM, whereas
>> the problem we have with swiotlb is for the dma-mapping interface.
>
> I actually have a third variation of this problem involving a PCI root
> complex which *could* drive full-width (40-bit) addresses, but won't,
> due to the way its PCI<->AXI interface is programmed. That would require
> even more complicated dma-ranges handling to describe the windows of
> valid physical addresses which it *will* pass, so I'm not pressing the
> issue - let's just get the basic DMA mask case fixed first.

Can you describe this a little more? We should at least try not to make
it harder to solve the next problem while solving this one, so I'd like
to understand the exact limitation you are hitting there.

	Arnd
> I actually have a third variation of this problem involving a PCI root
> complex which *could* drive full-width (40-bit) addresses, but won't,
> due to the way its PCI<->AXI interface is programmed. That would require
> even more complicated dma-ranges handling to describe the windows of
> valid physical addresses which it *will* pass, so I'm not pressing the
> issue - let's just get the basic DMA mask case fixed first.

R-Car + NVMe is actually not the "basic case".

It has a PCI<->AXI interface involved. PCI addresses are 64-bit, and the
controller does handle 64-bit addresses there. The mapping between PCI
addresses and AXI addresses is defined. But AXI is 32-bit.

The SoC has an IOMMU that could probably be used between the PCIe module
and RAM, although AFAIK nobody has made that work yet.

The board I work with has 4G of RAM in 4 banks, located at different
parts of the wide address space, and only one of them is below 4G. But if
the IOMMU is capable of translating addresses such that the 4 gigabyte
banks map to the first 4 gigabytes of address space, then all memory will
become available for DMA from the PCIe device.
On Wednesday, January 11, 2017 3:37:22 PM CET Nikita Yushchenko wrote:
>> I actually have a third variation of this problem involving a PCI root
>> complex which *could* drive full-width (40-bit) addresses, but won't,
>> due to the way its PCI<->AXI interface is programmed. That would require
>> even more complicated dma-ranges handling to describe the windows of
>> valid physical addresses which it *will* pass, so I'm not pressing the
>> issue - let's just get the basic DMA mask case fixed first.
>
> R-Car + NVMe is actually not the "basic case".
>
> It has a PCI<->AXI interface involved. PCI addresses are 64-bit, and the
> controller does handle 64-bit addresses there. The mapping between PCI
> addresses and AXI addresses is defined. But AXI is 32-bit.
>
> The SoC has an IOMMU that could probably be used between the PCIe module
> and RAM, although AFAIK nobody has made that work yet.
>
> The board I work with has 4G of RAM in 4 banks, located at different
> parts of the wide address space, and only one of them is below 4G. But if
> the IOMMU is capable of translating addresses such that the 4 gigabyte
> banks map to the first 4 gigabytes of address space, then all memory will
> become available for DMA from the PCIe device.

You can in theory handle this by defining your own platform-specific
dma_map_ops, as we used to do in the old days. Unfortunately, the modern
way of using the generic IOVA allocation can't really handle it, so it's
unclear whether the work that would be necessary to support it (and the
long-term maintenance cost) outweighs the benefits.

The more likely option here is to try harder to get the IOMMU working
(or show that it's impossible, but make sure the next chip gets it
right).

	Arnd
On 11/01/17 12:37, Nikita Yushchenko wrote:
>> I actually have a third variation of this problem involving a PCI root
>> complex which *could* drive full-width (40-bit) addresses, but won't,
>> due to the way its PCI<->AXI interface is programmed. That would require
>> even more complicated dma-ranges handling to describe the windows of
>> valid physical addresses which it *will* pass, so I'm not pressing the
>> issue - let's just get the basic DMA mask case fixed first.
>
> R-Car + NVMe is actually not the "basic case".

I meant "basic" in terms of what needs to be done in Linux - simply
preventing device drivers from overwriting the DT-configured DMA mask
will make everything work as well as it possibly can on R-Car, both with
and without the IOMMU, since apparently all you need is to ensure a PCI
device never gets given a DMA address above 4GB. The situation where PCI
devices *can* DMA to all of physical memory, but can't use arbitrary
addresses *outside* it - which only becomes a problem with an IOMMU - is
considerably trickier.

> It has a PCI<->AXI interface involved. PCI addresses are 64-bit, and the
> controller does handle 64-bit addresses there. The mapping between PCI
> addresses and AXI addresses is defined. But AXI is 32-bit.
>
> The SoC has an IOMMU that could probably be used between the PCIe module
> and RAM, although AFAIK nobody has made that work yet.
>
> The board I work with has 4G of RAM in 4 banks, located at different
> parts of the wide address space, and only one of them is below 4G. But if
> the IOMMU is capable of translating addresses such that the 4 gigabyte
> banks map to the first 4 gigabytes of address space, then all memory will
> become available for DMA from the PCIe device.

The aforementioned situation on Juno is similar yet different - the PLDA
XR3 root complex uses an address-based lookup table to translate outgoing
PCI memory space transactions to AXI bus addresses with the appropriate
attributes, in power-of-two-sized regions. The firmware configures 3 LUT
entries - 2GB at 0x8000_0000 and 8GB at 0x8_8000_0000 with cache-coherent
attributes to cover the DRAM areas, plus a small one with device
attributes covering the GICv2m MSI frame. The issue is that there is no
"no match" translation, so any transaction not within one of those
regions never makes it out of the root complex at all. That's fine in
normal operation, as there's nothing outside those regions in the
physical memory map a PCI device should be accessing anyway, but turning
on the SMMU is another matter - since the IOVA allocator runs top-down, a
PCI device with a 64-bit DMA mask will do a dma_map or dma_alloc, get the
physical page mapped to an IOVA up around 0xFF_FFFF_F000 (the SMMU will
constrain things to the system bus width of 40 bits), then try to access
that address and get a termination straight back from the RC. Similarly,
a KVM guest which wants to place its memory at arbitrary locations and
expects device passthrough to work is going to have a bad time.

I don't know if it's feasible to have the firmware set the LUT up
differently, as that might lead to other problems when not using the
SMMU, and/or just require far more than the 8 available LUT entries
(assuming they have to be non-overlapping - I'm not 100% sure, and
documentation is sparse).
Thus it seems appropriate to describe the currently valid PCI-AXI
translations with dma-ranges, but then we'd have multiple entries - last
time I looked, Linux simply ignores all but the last one in that case -
which can't be combined into a simple bitmask, so I'm not entirely sure
where to go from there. Especially as so far it seems to be a problem
exclusive to one not-widely-available ageing early-access development
platform...

It happens that limiting all PCI DMA masks to 32 bits would bodge around
this problem thanks to the current IOVA allocator behaviour, but that's
pretty yuck, and would force unnecessary bouncing for the non-SMMU case.
My other hack to carve up IOVA domains to reserve all addresses not
matching memblocks is hardly any more realistic, hence why the SMMU is in
the Juno DT in a change-it-at-your-own-peril "disabled" state ;)

Robin.
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1117421..afb2c08 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -216,6 +216,9 @@ config NEED_DMA_MAP_STATE
 config NEED_SG_DMA_LENGTH
 	def_bool y
 
+config ARCH_HAS_DMA_SET_COHERENT_MASK
+	def_bool y
+
 config SMP
 	def_bool y
 
diff --git a/arch/arm64/include/asm/device.h b/arch/arm64/include/asm/device.h
index 243ef25..a57e7bb 100644
--- a/arch/arm64/include/asm/device.h
+++ b/arch/arm64/include/asm/device.h
@@ -22,6 +22,7 @@ struct dev_archdata {
 	void *iommu;			/* private IOMMU data */
 #endif
 	bool dma_coherent;
+	u64 parent_dma_mask;
 };
 
 struct pdev_archdata {
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index e040827..5ab15ce 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -352,6 +352,30 @@ static int __swiotlb_dma_supported(struct device *hwdev, u64 mask)
 	return 1;
 }
 
+static int __swiotlb_set_dma_mask(struct device *dev, u64 mask)
+{
+	/* device is not DMA capable */
+	if (!dev->dma_mask)
+		return -EIO;
+
+	/* mask is below swiotlb bounce buffer, so fail */
+	if (!swiotlb_dma_supported(dev, mask))
+		return -EIO;
+
+	/*
+	 * because of the swiotlb, we can return success for
+	 * larger masks, but need to ensure that bounce buffers
+	 * are used above parent_dma_mask, so set that as
+	 * the effective mask.
+	 */
+	if (mask > dev->archdata.parent_dma_mask)
+		mask = dev->archdata.parent_dma_mask;
+
+	*dev->dma_mask = mask;
+
+	return 0;
+}
+
 static struct dma_map_ops swiotlb_dma_ops = {
 	.alloc = __dma_alloc,
 	.free = __dma_free,
@@ -367,8 +391,23 @@ static struct dma_map_ops swiotlb_dma_ops = {
 	.sync_sg_for_device = __swiotlb_sync_sg_for_device,
 	.dma_supported = __swiotlb_dma_supported,
 	.mapping_error = swiotlb_dma_mapping_error,
+	.set_dma_mask = __swiotlb_set_dma_mask,
 };
 
+int dma_set_coherent_mask(struct device *dev, u64 mask)
+{
+	if (!dma_supported(dev, mask))
+		return -EIO;
+
+	if (get_dma_ops(dev) == &swiotlb_dma_ops &&
+	    mask > dev->archdata.parent_dma_mask)
+		mask = dev->archdata.parent_dma_mask;
+
+	dev->coherent_dma_mask = mask;
+	return 0;
+}
+EXPORT_SYMBOL(dma_set_coherent_mask);
+
 static int __init atomic_pool_init(void)
 {
 	pgprot_t prot = __pgprot(PROT_NORMAL_NC);
@@ -958,6 +997,18 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
 	if (!dev->archdata.dma_ops)
 		dev->archdata.dma_ops = &swiotlb_dma_ops;
 
+	/*
+	 * we don't yet support buses that have a non-zero mapping.
+	 * Let's hope we won't need it
+	 */
+	WARN_ON(dma_base != 0);
+
+	/*
+	 * Whatever the parent bus can set. A device must not set
+	 * a DMA mask larger than this.
+	 */
+	dev->archdata.parent_dma_mask = size - 1;
+
 	dev->archdata.dma_coherent = coherent;
 	__iommu_setup_dma_ops(dev, dma_base, size, iommu);
 }
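(For what the patch means in practice, a hypothetical driver-side view on R-Car, where the parent bus is 32-bit - illustrative code, not from the patch itself:)

	static int mydev_probe(struct device *dev)
	{
		/* the wide request now succeeds instead of silently
		 * enabling unreachable addresses... */
		if (dma_set_mask(dev, DMA_BIT_MASK(64)))
			return -EIO;

		/* ...but the effective mask has been clamped to
		 * parent_dma_mask (32 bits on this board), so streaming
		 * DMA above 4G is bounced through swiotlb instead of
		 * hitting the bus raw */
		WARN_ON(*dev->dma_mask != DMA_BIT_MASK(32));
		return 0;
	}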
It is possible that a device is capable of 64-bit DMA addresses and the
device driver tries to set a wide DMA mask, but the bridge or bus used
to connect the device to the system can't handle wide addresses.

With swiotlb, memory above 4G can still be used by drivers for streaming
DMA, but *dev->dma_mask and dev->coherent_dma_mask must still keep values
that the hardware handles physically.

This patch enforces that. Based on original version by
Arnd Bergmann <arnd@arndb.de>, extended with coherent mask handling.

Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
CC: Arnd Bergmann <arnd@arndb.de>
---
Changes since v1:
- fixed issues noted by Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
  - save mask, not size
  - remove double empty line

 arch/arm64/Kconfig              |  3 +++
 arch/arm64/include/asm/device.h |  1 +
 arch/arm64/mm/dma-mapping.c     | 51 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+)