Message ID | bad125dff49f6e49c895e818c9d1abb346a46e8e.camel@synopsys.com (mailing list archive)
---|---
State | New, archived
On Fri, May 18, 2018 at 01:03:46PM +0000, Alexey Brodkin wrote:
> Note mmc_get_dma_dir() is just "data->flags & MMC_DATA_WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE".
> I.e. if we're preparing for sending data dma_noncoherent_map_sg() will have DMA_TO_DEVICE which
> is quite OK for passing to dma_noncoherent_sync_sg_for_device() but in case of reading we'll have
> DMA_FROM_DEVICE which we'll pass to dma_noncoherent_sync_sg_for_device() in dma_noncoherent_map_sg().
>
> I'd say this is not entirely correct because IMHO arch_sync_dma_for_cpu() is supposed to only be used
> in case of DMA_FROM_DEVICE and arch_sync_dma_for_device() only in case of DMA_TO_DEVICE.

arc overrides the dir parameter of the dma_sync_single_for_device/
dma_sync_single_for_cpu calls.  My patches dropped that, I have now
restored it, and an audit of the other architectures is pending.

That being said, the existing arc code still looks rather odd, as it
didn't do the same thing for the scatterlist versions of the calls.
I've thrown a few patches into my new tree to make the sg versions
make the normal calls, and to clean up the area a bit.

> You seem to have lost an offset in the page, so if we happen to have a buffer not aligned to
> a page boundary then we were obviously corrupting data outside our data :)

Oops!  Thank you for all the debugging!
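The override Christoph refers to looks roughly like the following - a simplified
sketch of the pre-series arc behaviour, not the exact arch/arc/mm/dma.c code, and
it assumes dma_addr_t equals phys_addr_t on arc (no IOMMU); _dma_cache_sync() is
arc's internal helper seen elsewhere in this thread.  The point is that the
single-range sync callbacks ignore the caller's direction and force the one that
matches the operation:

---------------------------------------->8------------------------------------
#include <linux/types.h>
#include <linux/dma-direction.h>

struct device;

/* arc's internal cache-maintenance helper (prototype only for this sketch). */
void _dma_cache_sync(struct device *dev, phys_addr_t paddr,
		     size_t size, enum dma_data_direction dir);

static void arc_dma_sync_single_for_device(struct device *dev,
		dma_addr_t handle, size_t size, enum dma_data_direction dir)
{
	/* Caller's dir is ignored: preparing for the device always means writeback. */
	_dma_cache_sync(dev, handle, size, DMA_TO_DEVICE);
}

static void arc_dma_sync_single_for_cpu(struct device *dev,
		dma_addr_t handle, size_t size, enum dma_data_direction dir)
{
	/* Handing the buffer back to the CPU always means invalidate. */
	_dma_cache_sync(dev, handle, size, DMA_FROM_DEVICE);
}
---------------------------------------->8------------------------------------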
Hi Christoph,

On Fri, 2018-05-18 at 15:27 +0200, hch@lst.de wrote:
> On Fri, May 18, 2018 at 01:03:46PM +0000, Alexey Brodkin wrote:
> > Note mmc_get_dma_dir() is just "data->flags & MMC_DATA_WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE".
> > I.e. if we're preparing for sending data dma_noncoherent_map_sg() will have DMA_TO_DEVICE which
> > is quite OK for passing to dma_noncoherent_sync_sg_for_device() but in case of reading we'll have
> > DMA_FROM_DEVICE which we'll pass to dma_noncoherent_sync_sg_for_device() in dma_noncoherent_map_sg().
> >
> > I'd say this is not entirely correct because IMHO arch_sync_dma_for_cpu() is supposed to only be used
> > in case of DMA_FROM_DEVICE and arch_sync_dma_for_device() only in case of DMA_TO_DEVICE.
>
> arc overrides the dir parameter of the dma_sync_single_for_device/
> dma_sync_single_for_cpu calls.  My patches dropped that, and I have
> restored that, and an audit of the other architectures is pending.

Well, at least to me it is confusing to pass a direction to a function whose
purpose is already known.  I'd say that XXX_sync_for_device() doesn't need a
_variable_ direction as an argument - otherwise, what does it mean if we pass
DMA_FROM_DEVICE to that function?

> That being said the existing arc code still looks rather odd as it
> didn't do the same thing for the scatterlist versions of the calls.

That might easily be the case, so it's good we caught it now and it will be
fixed :)

> I've thrown in a few patches into my new tree to make the sg versions
> make the normal calls, and to clean up the area a bit.

I'll try your newer series now, thanks!

-Alexey
On 05/18/2018 06:11 AM, Alexey Brodkin wrote:
>  void arch_sync_dma_for_device(struct device *dev, phys_addr_t paddr,
>                 size_t size, enum dma_data_direction dir)
>  {
> +       if (dir != DMA_TO_DEVICE) {
> +               dump_stack();
> +               printk(" *** %s@%d: DMA direction is %s instead of %s\n",
> +                      __func__, __LINE__, dir_to_str(dir), dir_to_str(DMA_TO_DEVICE));
> +       }
> +
>         return _dma_cache_sync(dev, paddr, size, dir);
>  }
>
>  void arch_sync_dma_for_cpu(struct device *dev, phys_addr_t paddr,
>                 size_t size, enum dma_data_direction dir)
>  {
> +       if (dir != DMA_FROM_DEVICE) {
> +               dump_stack();
> +               printk(" *** %s@%d: DMA direction is %s instead of %s\n",
> +                      __func__, __LINE__, dir_to_str(dir), dir_to_str(DMA_FROM_DEVICE));
> +       }
> +
>         return _dma_cache_sync(dev, paddr, size, dir);
>  }

...

> In case of MMC/DW_MCI (AKA DesignWare MobileStorage controller) that's the execution flow:
> 1) __dw_mci_start_request()
> 2) dw_mci_pre_dma_transfer()
> 3) dma_map_sg(..., mmc_get_dma_dir(data))
>
> Note mmc_get_dma_dir() is just "data->flags & MMC_DATA_WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE".
> I.e. if we're preparing for sending data dma_noncoherent_map_sg() will have DMA_TO_DEVICE which
> is quite OK for passing to dma_noncoherent_sync_sg_for_device() but in case of reading we'll have
> DMA_FROM_DEVICE which we'll pass to dma_noncoherent_sync_sg_for_device() in dma_noncoherent_map_sg().
>
> I'd say this is not entirely correct because IMHO arch_sync_dma_for_cpu() is supposed to only be used
> in case of DMA_FROM_DEVICE and arch_sync_dma_for_device() only in case of DMA_TO_DEVICE.

So roughly 10 years ago, some kernel rookie named Vineet Gupta asked the exact
same question :-)

http://kernelnewbies.kernelnewbies.narkive.com/aGW1QcDv/query-about-dma-sync-for-cpu-and-direction-to-device

I never understood the need for this direction.  And if memory serves me
right, at that time I was seeing twice the amount of cache flushing!

> But the real fix of my problem is:
> ---------------------------------------->8------------------------------------
> --- a/lib/dma-noncoherent.c
> +++ b/lib/dma-noncoherent.c
> @@ -35,7 +35,7 @@ static dma_addr_t dma_noncoherent_map_page(struct device *dev, struct page *page
>
>         addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
>         if (!dma_mapping_error(dev, addr) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> -               arch_sync_dma_for_device(dev, page_to_phys(page), size, dir);
> +               arch_sync_dma_for_device(dev, page_to_phys(page) + offset, size, dir);
>         return addr;
> }
> ---------------------------------------->8------------------------------------
>
> You seem to have lost an offset in the page, so if we happen to have a buffer not aligned to
> a page boundary then we were obviously corrupting data outside our data :)

Neat!
On 05/18/2018 06:23 AM, hch@lst.de wrote:
> On Fri, May 18, 2018 at 01:03:46PM +0000, Alexey Brodkin wrote:
>> Note mmc_get_dma_dir() is just "data->flags & MMC_DATA_WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE".
>> I.e. if we're preparing for sending data dma_noncoherent_map_sg() will have DMA_TO_DEVICE which
>> is quite OK for passing to dma_noncoherent_sync_sg_for_device() but in case of reading we'll have
>> DMA_FROM_DEVICE which we'll pass to dma_noncoherent_sync_sg_for_device() in dma_noncoherent_map_sg().
>>
>> I'd say this is not entirely correct because IMHO arch_sync_dma_for_cpu() is supposed to only be used
>> in case of DMA_FROM_DEVICE and arch_sync_dma_for_device() only in case of DMA_TO_DEVICE.
>
> arc overrides the dir parameter of the dma_sync_single_for_device/
> dma_sync_single_for_cpu calls.  My patches dropped that, and I have
> restored that, and an audit of the other architectures is pending.

Right, for now let's retain that and do a sweeping audit of @direction - to me
it seems extraneous (as it did 10 years ago), but I'm not an expert in this
area, so perhaps it is needed for some devices / arches and it would be good
to understand that finally.

> That being said the existing arc code still looks rather odd as it
> didn't do the same thing for the scatterlist versions of the calls.
> I've thrown in a few patches into my new tree to make the sg versions
> make the normal calls, and to clean up the area a bit.

Not calling names or anything here, but it doesn't exist for the sg variants
because I didn't write that code :-)  It was introduced by your commit:

2016-01-20 052c96dbe33b arc: convert to dma_map_ops
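For reference, routing the sg variants through the same single-range hooks could
look roughly like this.  This is only a sketch against the lib/dma-noncoherent.c
helper names discussed in this thread, not Christoph's actual patch; it simply
walks the scatterlist and hands each entry to the arch hooks:

---------------------------------------->8------------------------------------
#include <linux/dma-noncoherent.h>
#include <linux/scatterlist.h>

static void dma_noncoherent_sync_sg_for_device(struct device *dev,
		struct scatterlist *sgl, int nents, enum dma_data_direction dir)
{
	struct scatterlist *sg;
	int i;

	/* Each sg entry gets the same treatment as a single mapped range. */
	for_each_sg(sgl, sg, nents, i)
		arch_sync_dma_for_device(dev, sg_phys(sg), sg->length, dir);
}

static void dma_noncoherent_sync_sg_for_cpu(struct device *dev,
		struct scatterlist *sgl, int nents, enum dma_data_direction dir)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i)
		arch_sync_dma_for_cpu(dev, sg_phys(sg), sg->length, dir);
}
---------------------------------------->8------------------------------------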
On Fri, May 18, 2018 at 10:20:02AM -0700, Vineet Gupta wrote:
> I never understood the need for this direction. And if memory serves me
> right, at that time I was seeing twice the amount of cache flushing !

It's necessary.  Take a moment to think carefully about this:

	dma_map_single(, dir)

	dma_sync_single_for_cpu(, dir)

	dma_sync_single_for_device(, dir)

	dma_unmap_single(, dir)

In the case of a DMA-incoherent architecture, the operations done at each
stage depend on the direction argument:

           map             for_cpu         for_device      unmap
TO_DEV     writeback       none            writeback       none
TO_CPU     invalidate      invalidate*     invalidate      invalidate*
BIDIR      writeback       invalidate      writeback       invalidate

* - only necessary if the CPU speculatively prefetches.

The multiple invalidations for the TO_CPU case handle different
conditions that can result in data corruption, and for some CPUs, all
four are necessary.

This is what is implemented for 32-bit ARM, depending on the CPU
capabilities, as we have DMA-incoherent devices and we have CPUs that
speculatively prefetch data, and so may load data into the caches while
DMA is in operation.


Things get more interesting if the implementation behind the DMA API has
to copy data between the buffer supplied to the mapping and some DMA
accessible buffer:

           map             for_cpu         for_device      unmap
TO_DEV     copy to dma     none            copy to dma     none
TO_CPU     none            copy to cpu     none            copy to cpu
BIDIR      copy to dma     copy to cpu     copy to dma     copy to cpu

So, in both cases, the value of the direction argument defines what you
need to do in each call.
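To make the first table concrete, here is a minimal sketch of how a non-coherent
architecture might act on the direction in its sync hooks.  This is illustrative
only: dma_cache_wback()/dma_cache_inv() are placeholders for whatever low-level
cache primitives the architecture provides, and real implementations (32-bit ARM
included) handle more corner cases than this:

---------------------------------------->8------------------------------------
#include <linux/dma-noncoherent.h>	/* declares the arch_sync_dma_* hooks */

/* Placeholder prototypes for the arch's cache primitives (assumed names). */
void dma_cache_wback(phys_addr_t start, unsigned long sz);
void dma_cache_inv(phys_addr_t start, unsigned long sz);

void arch_sync_dma_for_device(struct device *dev, phys_addr_t paddr,
		size_t size, enum dma_data_direction dir)
{
	switch (dir) {
	case DMA_TO_DEVICE:		/* CPU filled the buffer, device will read it */
	case DMA_BIDIRECTIONAL:
		dma_cache_wback(paddr, size);
		break;
	case DMA_FROM_DEVICE:		/* drop dirty lines before the device writes */
		dma_cache_inv(paddr, size);
		break;
	default:
		break;
	}
}

void arch_sync_dma_for_cpu(struct device *dev, phys_addr_t paddr,
		size_t size, enum dma_data_direction dir)
{
	/* Only needed where the CPU may have speculatively refilled lines during DMA. */
	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
		dma_cache_inv(paddr, size);
}
---------------------------------------->8------------------------------------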
Hi Russel,

On Fri, 2018-05-18 at 18:50 +0100, Russell King - ARM Linux wrote:
> It's necessary.  Take a moment to think carefully about this:
>
> 	dma_map_single(, dir)
>
> 	dma_sync_single_for_cpu(, dir)
>
> 	dma_sync_single_for_device(, dir)
>
> 	dma_unmap_single(, dir)
>
> In the case of a DMA-incoherent architecture, the operations done at each
> stage depend on the direction argument:
>
>            map             for_cpu         for_device      unmap
> TO_DEV     writeback       none            writeback       none
> TO_CPU     invalidate      invalidate*     invalidate      invalidate*
> BIDIR      writeback       invalidate      writeback       invalidate
>
> * - only necessary if the CPU speculatively prefetches.

I think invalidation of the DMA buffer is required on for_cpu(TO_CPU) even
if the CPU doesn't prefetch - what if we reuse the same buffer for multiple
reads from the DMA device?

> The multiple invalidations for the TO_CPU case handle different
> conditions that can result in data corruption, and for some CPUs, all
> four are necessary.

I would agree that map()/unmap() are quite special cases, and so depending
on the direction we need to execute in them either the for_cpu() or the
for_device() call-backs.

As for invalidation in case of for_device(TO_CPU), I still don't see
a rationale behind it.  It would be interesting to see a real example where
we benefit from it.

> This is what is implemented for 32-bit ARM, depending on the CPU
> capabilities, as we have DMA incoherent devices and we have CPUs that
> speculatively prefetch data, and so may load data into the caches while
> DMA is in operation.
>
>
> Things get more interesting if the implementation behind the DMA API has
> to copy data between the buffer supplied to the mapping and some DMA
> accessible buffer:
>
>            map             for_cpu         for_device      unmap
> TO_DEV     copy to dma     none            copy to dma     none
> TO_CPU     none            copy to cpu     none            copy to cpu
> BIDIR      copy to dma     copy to cpu     copy to dma     copy to cpu
>
> So, in both cases, the value of the direction argument defines what you
> need to do in each call.

Interestingly enough, in your second table (which indeed describes the more
complicated case) you set "none" for for_device(TO_CPU), which looks logical
to me.

So IMHO this is what makes sense:
---------------------------->8-----------------------------
           map             for_cpu         for_device      unmap
TO_DEV     writeback       none            writeback       none
TO_CPU     none            invalidate      none            invalidate*
BIDIR      writeback       invalidate      writeback       invalidate*
---------------------------->8-----------------------------
* is the case for a prefetching CPU.

-Alexey
On 18.05.2018 15:03, Alexey Brodkin wrote:
> But the real fix of my problem is:
> ---------------------------------------->8------------------------------------
> --- a/lib/dma-noncoherent.c
> +++ b/lib/dma-noncoherent.c
> @@ -35,7 +35,7 @@ static dma_addr_t dma_noncoherent_map_page(struct device *dev, struct page *page
>
>         addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
>         if (!dma_mapping_error(dev, addr) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> -               arch_sync_dma_for_device(dev, page_to_phys(page), size, dir);
> +               arch_sync_dma_for_device(dev, page_to_phys(page) + offset, size, dir);
>         return addr;
> }
> ---------------------------------------->8------------------------------------
>
> You seem to have lost an offset in the page, so if we happen to have a buffer not aligned to
> a page boundary then we were obviously corrupting data outside our data :)

Good.
This patch seems to fix the dma issues I faced on my 32bit B160L parisc box.

So it leaves only one open issue on parisc:
Now every 32 bit parisc system is unnecessarily non-coherent.

Helge
On 05/18/2018 10:50 AM, Russell King - ARM Linux wrote:
> On Fri, May 18, 2018 at 10:20:02AM -0700, Vineet Gupta wrote:
>> I never understood the need for this direction. And if memory serves me
>> right, at that time I was seeing twice the amount of cache flushing !
>
> It's necessary. Take a moment to think carefully about this:
>
> 	dma_map_single(, dir)
>
> 	dma_sync_single_for_cpu(, dir)
>
> 	dma_sync_single_for_device(, dir)
>
> 	dma_unmap_single(, dir)

As an aside, do these imply a state machine of sorts - does a driver need
to always call map_single first?

My original point of contention/confusion is the specific combinations of
API and direction, specifically for_cpu(TO_DEV) and for_device(TO_CPU).

Semantically, what does dma_sync_single_for_cpu(TO_DEV) even imply for a
non-DMA-coherent arch?

Your tables below have "none" for both, implying it is unlikely to be a real
combination (for ARM and ARC at least).

The other case, actually @dir TO_CPU, independent of for_{cpu, device},
implies the driver intends to touch the buffer after the call, so it would
invalidate any stray lines, unconditionally (and not just for the speculative
prefetch case).

> In the case of a DMA-incoherent architecture, the operations done at each
> stage depend on the direction argument:
>
>            map             for_cpu         for_device      unmap
> TO_DEV     writeback       none            writeback       none
> TO_CPU     invalidate      invalidate*     invalidate      invalidate*
> BIDIR      writeback       invalidate      writeback       invalidate
>
> * - only necessary if the CPU speculatively prefetches.
>
> The multiple invalidations for the TO_CPU case handle different
> conditions that can result in data corruption, and for some CPUs, all
> four are necessary.

Can you please explain in some more detail, for the TO_CPU row, why the
invalidate is conditional sometimes.
On Fri, May 18, 2018 at 07:57:34PM +0000, Alexey Brodkin wrote:
> Hi Russel,

That's Russell.

> On Fri, 2018-05-18 at 18:50 +0100, Russell King - ARM Linux wrote:
> > It's necessary.  Take a moment to think carefully about this:
> >
> > 	dma_map_single(, dir)
> >
> > 	dma_sync_single_for_cpu(, dir)
> >
> > 	dma_sync_single_for_device(, dir)
> >
> > 	dma_unmap_single(, dir)
> >
> > In the case of a DMA-incoherent architecture, the operations done at each
> > stage depend on the direction argument:
> >
> >            map             for_cpu         for_device      unmap
> > TO_DEV     writeback       none            writeback       none
> > TO_CPU     invalidate      invalidate*     invalidate      invalidate*
> > BIDIR      writeback       invalidate      writeback       invalidate
> >
> > * - only necessary if the CPU speculatively prefetches.
>
> I think invalidation of DMA buffer is required on for_cpu(TO_CPU) even
> if CPU doesn't prefetch - what if we reuse the same buffer for multiple
> reads from DMA device?

That's fine - for non-coherent DMA, the CPU caches will only end up
containing data for that memory if:
- the CPU speculatively fetches data from that memory, or
- the CPU explicitly touches that memory

> > The multiple invalidations for the TO_CPU case handle different
> > conditions that can result in data corruption, and for some CPUs, all
> > four are necessary.
>
> I would agree that map()/unmap() are quite special cases and so depending
> on direction we need to execute in them either for_cpu() or for_device()
> call-backs.
>
> As for invalidation in case of for_device(TO_CPU) I still don't see
> a rationale behind it.  Would be interesting to see a real example where
> we benefit from this.

Yes, you could avoid that, but depending how you structure the
architecture implementation, it can turn out to be a corner case.

The above table is precisely how 32-bit ARM is implemented, because the
way we implement them is based on who owns the memory - the "map" and
"for_device" operations translate internally to a cpu-to-device ownership
transition of the buffer.  Similar for "unmap" and "for_cpu".  It
basically avoids having to add additional functions at the lower
implementation levels.

> > Things get more interesting if the implementation behind the DMA API has
> > to copy data between the buffer supplied to the mapping and some DMA
> > accessible buffer:
> >
> >            map             for_cpu         for_device      unmap
> > TO_DEV     copy to dma     none            copy to dma     none
> > TO_CPU     none            copy to cpu     none            copy to cpu
> > BIDIR      copy to dma     copy to cpu     copy to dma     copy to cpu
> >
> > So, in both cases, the value of the direction argument defines what you
> > need to do in each call.
>
> Interestingly enough, in your second table (which indeed describes the more
> complicated case) you set "none" for for_device(TO_CPU), which looks logical
> to me.
>
> So IMHO this is what makes sense:
> ---------------------------->8-----------------------------
>            map             for_cpu         for_device      unmap
> TO_DEV     writeback       none            writeback       none
> TO_CPU     none            invalidate      none            invalidate*
> BIDIR      writeback       invalidate      writeback       invalidate*
> ---------------------------->8-----------------------------

That doesn't make sense for the TO_CPU case.  If the caches contain dirty
cache lines, and you're DMAing from the device to the system RAM, other
system activity can cause the dirty cache lines to be evicted (written
back) to memory which the DMA has already overwritten.  The result is
data corruption.  So, you really can't have "none" in the "map" case
there.

Given that, the "for_cpu" case becomes dependent on whether the CPU
speculatively prefetches.
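The "who owns the memory" structuring described above can be sketched as two
ownership-transition helpers.  The names echo what 32-bit ARM uses, but the
bodies are a simplified illustration (reusing the placeholder cache primitives
from the earlier sketch), not the real arch/arm/mm/dma-mapping.c code; it also
shows how for_device(TO_CPU) ends up invalidating simply because it shares the
cpu-to-device path with map():

---------------------------------------->8------------------------------------
/* map() and for_device() both funnel into a cpu-to-device ownership transition ... */
static void __dma_page_cpu_to_dev(phys_addr_t paddr, size_t size,
		enum dma_data_direction dir)
{
	if (dir == DMA_FROM_DEVICE)
		dma_cache_inv(paddr, size);	/* device will overwrite: drop dirty lines now */
	else
		dma_cache_wback(paddr, size);	/* TO_DEVICE / BIDIRECTIONAL: push CPU writes out */
}

/* ... while unmap() and for_cpu() funnel into a device-to-cpu transition. */
static void __dma_page_dev_to_cpu(phys_addr_t paddr, size_t size,
		enum dma_data_direction dir)
{
	if (dir != DMA_TO_DEVICE)
		dma_cache_inv(paddr, size);	/* discard lines (re)fetched while the device owned it */
}
---------------------------------------->8------------------------------------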
On Fri, May 18, 2018 at 01:35:08PM -0700, Vineet Gupta wrote:
> On 05/18/2018 10:50 AM, Russell King - ARM Linux wrote:
> > On Fri, May 18, 2018 at 10:20:02AM -0700, Vineet Gupta wrote:
> >> I never understood the need for this direction. And if memory serves me
> >> right, at that time I was seeing twice the amount of cache flushing !
> >
> > It's necessary. Take a moment to think carefully about this:
> >
> > 	dma_map_single(, dir)
> >
> > 	dma_sync_single_for_cpu(, dir)
> >
> > 	dma_sync_single_for_device(, dir)
> >
> > 	dma_unmap_single(, dir)
>
> As an aside, do these imply a state machine of sorts - does a driver need
> to always call map_single first ?

Kind-of, but some drivers do omit some of the dma_sync_*() calls.
For example, if a buffer is written to, then mapped with TO_DEVICE, and
then the CPU wishes to write to it, it's fairly common that a driver
omits the dma_sync_single_for_cpu() call.  If you think about the cases
I gave and what cache operations happen, such a scenario practically
turns out to be safe.

> My original point of contention/confusion is the specific combinations of
> API and direction, specifically for_cpu(TO_DEV) and for_device(TO_CPU)

Remember that it is expected that all calls for a mapping use the same
direction argument while that mapping exists.  In other words, if you call
dma_map_single(TO_DEVICE) and then use any of the other functions, the
other functions will also use TO_DEVICE.  The DMA direction argument
describes the direction of the DMA operation being performed on the
buffer, not on the individual dma_* operation.

What isn't expected at arch level is for drivers to do:

	dma_map_single(TO_DEVICE)
	dma_sync_single_for_cpu(FROM_DEVICE)

or vice versa.

> Semantically what does dma_sync_single_for_cpu(TO_DEV) even imply for a non
> dma coherent arch.
>
> Your tables below have "none" for both, implying it is unlikely to be a real
> combination (for ARM and ARC at least).

Very little for the cases that I've stated (and as I mentioned above,
some drivers do omit the call in that case.)

> The other case, actually @dir TO_CPU, independent of for_{cpu, device}
> implies the driver intends to touch it after the call, so it would invalidate
> any stray lines, unconditionally (and not just for the speculative prefetch
> case).

If you don't have a CPU that speculatively prefetches, and you've already
had to invalidate the cache lines (to avoid write-backs corrupting DMA'd
data), then there's no need for the architecture to do any work in the
for_cpu(TO_CPU) case - the CPU shouldn't be touching cache lines that are
part of the buffer while it is mapped, which means a non-speculating CPU
won't pull in any cache lines without an explicit access.

Speculating CPUs are different.  The action of the speculation is to try
and guess what data the program wants to access ahead of the program
flow.  That causes the CPU to prefetch data into the cache.  The point in
the program flow at which this happens is not something the programmer can
really determine.

This means that if you try to read from the DMA buffer after the DMA
operation has completed, without invalidating the cache between the DMA
completing and the CPU reading, you have no guarantee that you're reading
the data that the DMA operation has written.  The cache may have loaded
itself with data before the DMA operation completed, and the CPU may see
that stale data.
The difference between non-speculating CPUs and speculating CPUs is that
for non-speculating CPUs, caches work according to explicit accesses by
the program, and the program is stalled while the data is fetched from
external memory.  Speculating CPUs try to predict ahead of time what data
the program will require in the future, and attempt to load that data
into the caches _before_ the program requires it - which means that the
program suffers fewer stalls.

> > In the case of a DMA-incoherent architecture, the operations done at each
> > stage depend on the direction argument:
> >
> >            map             for_cpu         for_device      unmap
> > TO_DEV     writeback       none            writeback       none
> > TO_CPU     invalidate      invalidate*     invalidate      invalidate*
> > BIDIR      writeback       invalidate      writeback       invalidate
> >
> > * - only necessary if the CPU speculatively prefetches.
> >
> > The multiple invalidations for the TO_CPU case handle different
> > conditions that can result in data corruption, and for some CPUs, all
> > four are necessary.
>
> Can you please explain in some more detail, for the TO_CPU row, why the
> invalidate is conditional sometimes.

See above - I hope my explanation above is sufficient.
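To tie the discussion together, the driver-side pattern Russell describes (the
same direction for every call on a mapping, the CPU only touching the buffer
while it owns it) looks roughly like this.  It is an illustrative receive path
using the standard DMA API, not taken from any real driver:

---------------------------------------->8------------------------------------
#include <linux/dma-mapping.h>

static int rx_one_buffer(struct device *dev, void *buf, size_t len)
{
	dma_addr_t handle;

	/* The whole lifetime of the mapping uses DMA_FROM_DEVICE. */
	handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, handle))
		return -ENOMEM;

	/* ... point the device at 'handle' and wait for the DMA to complete ... */

	/* Hand the buffer back to the CPU before looking at the data. */
	dma_sync_single_for_cpu(dev, handle, len, DMA_FROM_DEVICE);
	/* ... CPU inspects the received data here ... */

	/* Give it back to the device for the next transfer. */
	dma_sync_single_for_device(dev, handle, len, DMA_FROM_DEVICE);

	/* ... eventually tear the mapping down, still with the same direction. */
	dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
	return 0;
}
---------------------------------------->8------------------------------------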
On Fri, May 18, 2018 at 10:05:51PM +0200, Helge Deller wrote:
> This patch seems to fix the dma issues I faced on my 32bit B160L parisc box.
>
> So it leaves only one open issue on parisc:
> Now every 32 bit parisc system is unnecessarily non-coherent.

I disagree with those comments; let me resend the refactored patch to
make it more clear.
--- a/arch/arc/mm/dma.c
+++ b/arch/arc/mm/dma.c
@@ -152,14 +152,37 @@ static void _dma_cache_sync(struct device *dev, phys_addr_t paddr, size_t size,
 	}
 }
 
+static const char *dir_to_str(enum dma_data_direction dir)
+{
+	switch (dir) {
+	case DMA_BIDIRECTIONAL: return "DMA_BIDIRECTIONAL";
+	case DMA_TO_DEVICE: return "DMA_TO_DEVICE";
+	case DMA_FROM_DEVICE: return "DMA_FROM_DEVICE";
+	case DMA_NONE: return "DMA_NONE";
+	default: return "WRONG_VALUE!";
+	}
+}
+
 void arch_sync_dma_for_device(struct device *dev, phys_addr_t paddr,
 		size_t size, enum dma_data_direction dir)
 {
+	if (dir != DMA_TO_DEVICE) {
+		dump_stack();
+		printk(" *** %s@%d: DMA direction is %s instead of %s\n",
+		       __func__, __LINE__, dir_to_str(dir), dir_to_str(DMA_TO_DEVICE));
+	}
+
 	return _dma_cache_sync(dev, paddr, size, dir);
 }
 
 void arch_sync_dma_for_cpu(struct device *dev, phys_addr_t paddr,
 		size_t size, enum dma_data_direction dir)
 {
+	if (dir != DMA_FROM_DEVICE) {
+		dump_stack();
+		printk(" *** %s@%d: DMA direction is %s instead of %s\n",
+		       __func__, __LINE__, dir_to_str(dir), dir_to_str(DMA_FROM_DEVICE));
+	}
+
 	return _dma_cache_sync(dev, paddr, size, dir);
 }
---------------------------------------->8------------------------------------

And with that I noticed a bit unexpected output, see below:
---------------------------------------->8------------------------------------
Stack Trace:
  arc_unwind_core.constprop.1+0xd4/0xf8
  dump_stack+0x68/0x80
  arch_sync_dma_for_device+0x34/0xc4
  dma_noncoherent_map_sg+0x80/0x94
  __dw_mci_start_request+0x1ee/0x868
  dw_mci_request+0x17e/0x1c8
  mmc_wait_for_req+0x106/0x1ac
  mmc_app_sd_status+0x108/0x130
  mmc_sd_setup_card+0xc6/0x2e8
  mmc_attach_sd+0x1b6/0x394
  mmc_rescan+0x2f4/0x3bc
  process_one_work+0x194/0x348
  worker_thread+0xf2/0x478
  kthread+0x120/0x13c
  ret_from_fork+0x18/0x1c
 *** arch_sync_dma_for_device@172: DMA direction is DMA_FROM_DEVICE instead of DMA_TO_DEVICE

...

Stack Trace:
  arc_unwind_core.constprop.1+0xd4/0xf8
  dump_stack+0x68/0x80
  arch_sync_dma_for_device+0x34/0xc4
  dma_noncoherent_map_page+0x86/0x8c
  usb_hcd_map_urb_for_dma+0x49e/0x53c
  usb_hcd_submit_urb+0x43c/0x8c4
  usb_control_msg+0xbe/0x16c
  hub_port_init+0x5e0/0xb0c
  hub_event+0x4e6/0x1164
  process_one_work+0x194/0x348
  worker_thread+0xf2/0x478
  kthread+0x120/0x13c
  ret_from_fork+0x18/0x1c
mmcblk0: p1 p2
 *** arch_sync_dma_for_device@172: DMA direction is DMA_FROM_DEVICE instead of DMA_TO_DEVICE

... and quite some more of the similar ...
---------------------------------------->8------------------------------------

In case of MMC/DW_MCI (AKA DesignWare MobileStorage controller) that's the execution flow:
 1) __dw_mci_start_request()
 2) dw_mci_pre_dma_transfer()
 3) dma_map_sg(..., mmc_get_dma_dir(data))

Note mmc_get_dma_dir() is just "data->flags & MMC_DATA_WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE".
I.e. if we're preparing for sending data dma_noncoherent_map_sg() will have DMA_TO_DEVICE, which
is quite OK for passing to dma_noncoherent_sync_sg_for_device(), but in case of reading we'll have
DMA_FROM_DEVICE, which we'll pass to dma_noncoherent_sync_sg_for_device() in dma_noncoherent_map_sg().

I'd say this is not entirely correct because IMHO arch_sync_dma_for_cpu() is supposed to only be used
in case of DMA_FROM_DEVICE and arch_sync_dma_for_device() only in case of DMA_TO_DEVICE.
> +static dma_addr_t dma_noncoherent_map_page(struct device *dev, struct page *page,
> +		unsigned long offset, size_t size, enum dma_data_direction dir,
> +		unsigned long attrs)
> +{
> +	dma_addr_t addr;
> +
> +	addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
> +	if (!dma_mapping_error(dev, addr) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> +		arch_sync_dma_for_device(dev, page_to_phys(page), size, dir);
> +	return addr;
> +}
> +
> +static int dma_noncoherent_map_sg(struct device *dev, struct scatterlist *sgl,
> +		int nents, enum dma_data_direction dir, unsigned long attrs)
> +{
> +	nents = dma_direct_map_sg(dev, sgl, nents, dir, attrs);
> +	if (nents > 0 && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> +		dma_noncoherent_sync_sg_for_device(dev, sgl, nents, dir);
> +	return nents;
> +}

The same applies to the unmap functions.

My guess is we need to respect the direction in the map/unmap functions and use either
dma_noncoherent_sync_single_for_cpu(..., DMA_FROM_DEVICE) or
dma_noncoherent_sync_single_for_device(..., DMA_TO_DEVICE).

> +static void dma_noncoherent_unmap_page(struct device *dev, dma_addr_t addr,
> +		size_t size, enum dma_data_direction dir, unsigned long attrs)
> +{
> +	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> +		dma_noncoherent_sync_single_for_cpu(dev, addr, size, dir);
> +}
> +
> +static void dma_noncoherent_unmap_sg(struct device *dev, struct scatterlist *sgl,
> +		int nents, enum dma_data_direction dir, unsigned long attrs)
> +{
> +	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> +		dma_noncoherent_sync_sg_for_cpu(dev, sgl, nents, dir);
> +}
> +#endif

But the real fix of my problem is:
---------------------------------------->8------------------------------------
--- a/lib/dma-noncoherent.c
+++ b/lib/dma-noncoherent.c
@@ -35,7 +35,7 @@ static dma_addr_t dma_noncoherent_map_page(struct device *dev, struct page *page
 
 	addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
 	if (!dma_mapping_error(dev, addr) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
-		arch_sync_dma_for_device(dev, page_to_phys(page), size, dir);
+		arch_sync_dma_for_device(dev, page_to_phys(page) + offset, size, dir);
 	return addr;
 }
---------------------------------------->8------------------------------------
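For clarity on why that one-liner matters, here is a worked example with made-up
numbers (illustrative only, not taken from the failing system) showing which bytes
get cache maintenance with and without the "+ offset":

---------------------------------------->8------------------------------------
/*
 * Suppose a driver maps a sub-page buffer:
 *
 *   page_to_phys(page) = 0x80100000   (start of the page)
 *   offset             = 0x600        (buffer starts 0x600 bytes into the page)
 *   size               = 0x200
 *
 * Without "+ offset" the sync covers
 *   0x80100000 .. 0x801001ff          (0x200 bytes of unrelated data),
 * while the range the device actually DMAs to/from,
 *   0x80100600 .. 0x801007ff,
 * is never written back or invalidated - so the transfer sees stale data
 * and dirty lines over neighbouring memory get flushed at the wrong time,
 * i.e. "corrupting data outside our data".
 */
---------------------------------------->8------------------------------------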