diff mbox

[02/20] dma-mapping: provide a generic dma-noncoherent implementation

Message ID bad125dff49f6e49c895e818c9d1abb346a46e8e.camel@synopsys.com (mailing list archive)
State New, archived
Headers show

Commit Message

Alexey Brodkin May 18, 2018, 1:03 p.m. UTC
Hi Christoph,

On Fri, 2018-05-11 at 09:59 +0200, Christoph Hellwig wrote:

[snip]

There seems to be one subtle issue with map/unmap code.
While investigating problems on ARC I added instrumentation as below:
---------------------------------------->8------------------------------------
You seem to lost an offset in the page so if we happen to have a buffer not aligned to
a page boundary then we were obviously corrupting data outside our data :)

-Alexey

Comments

Christoph Hellwig May 18, 2018, 1:27 p.m. UTC | #1
On Fri, May 18, 2018 at 01:03:46PM +0000, Alexey Brodkin wrote:
> Note mmc_get_dma_dir() is just "data->flags & MMC_DATA_WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE".
> I.e. if we're preparing for sending data dma_noncoherent_map_sg() will have DMA_TO_DEVICE which
> is quite OK for passing to dma_noncoherent_sync_sg_for_device() but in case of reading we'll have
> DMA_FROM_DEVICE which we'll pass to dma_noncoherent_sync_sg_for_device() in dma_noncoherent_map_sg().
> 
> I'd say this is not entirely correct because IMHO arch_sync_dma_for_cpu() is supposed to only be used
> in case of DMA_FROM_DEVICE and arch_sync_dma_for_device() only in case of DMA_TO_DEVICE.

arc overrides the dir paramter of the dma_sync_single_for_device/
dma_sync_single_for_cpu calls.  My patches dropped that, and I have
restored that, and audit for the other architectures is pending.

That being said the existing arc code still looks rather odd as it
didn't do the same thing for the scatterlist versions of the calls.
I've thrown in a few patches into my new tree to make the sg versions
make the normal calls, and to clean up the area a bit.

> You seem to lost an offset in the page so if we happen to have a buffer not aligned to
> a page boundary then we were obviously corrupting data outside our data :)

Oops!  Thank you for all the debugging!
Alexey Brodkin May 18, 2018, 2:13 p.m. UTC | #2
Hi Christoph,

On Fri, 2018-05-18 at 15:27 +0200, hch@lst.de wrote:
> On Fri, May 18, 2018 at 01:03:46PM +0000, Alexey Brodkin wrote:
> > Note mmc_get_dma_dir() is just "data->flags & MMC_DATA_WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE".
> > I.e. if we're preparing for sending data dma_noncoherent_map_sg() will have DMA_TO_DEVICE which
> > is quite OK for passing to dma_noncoherent_sync_sg_for_device() but in case of reading we'll have
> > DMA_FROM_DEVICE which we'll pass to dma_noncoherent_sync_sg_for_device() in dma_noncoherent_map_sg().
> > 
> > I'd say this is not entirely correct because IMHO arch_sync_dma_for_cpu() is supposed to only be used
> > in case of DMA_FROM_DEVICE and arch_sync_dma_for_device() only in case of DMA_TO_DEVICE.
> 
> arc overrides the dir paramter of the dma_sync_single_for_device/
> dma_sync_single_for_cpu calls.  My patches dropped that, and I have
> restored that, and audit for the other architectures is pending.

Well at least for me that's a confusion what is a reason to pass direction
to function which purpose is already known.

I'd say that XXX_sync_for_device() doesn't need _variable_ direction as an argument,
otherwise what does that mean if we pass DMA_FROM_DEVICE to that function?

> That being said the existing arc code still looks rather odd as it
> didn't do the same thing for the scatterlist versions of the calls.

That might easily be the case so good we caught that now and it will be fixed :)

> I've thrown in a few patches into my new tree to make the sg versions
> make the normal calls, and to clean up the area a bit.

I'll try your newer series now, thanks!

-Alexey
Vineet Gupta May 18, 2018, 5:20 p.m. UTC | #3
On 05/18/2018 06:11 AM, Alexey Brodkin wrote:
>   void arch_sync_dma_for_device(struct device *dev, phys_addr_t paddr,
>                  size_t size, enum dma_data_direction dir)
>   {
> +       if (dir != DMA_TO_DEVICE){
> +               dump_stack();
> +               printk(" *** %s@%d: DMA direction is %s instead of %s\n",
> +                      __func__, __LINE__, dir_to_str(dir), dir_to_str(DMA_TO_DEVICE));
> +       }
> +
>          return _dma_cache_sync(dev, paddr, size, dir);
>   }
>   
>   void arch_sync_dma_for_cpu(struct device *dev, phys_addr_t paddr,
>                  size_t size, enum dma_data_direction dir)
>   {
> +       if (dir != DMA_FROM_DEVICE) {
> +               dump_stack();
> +               printk(" *** %s@%d: DMA direction is %s instead of %s\n",
> +                      __func__, __LINE__, dir_to_str(dir), dir_to_str(DMA_FROM_DEVICE));
> +       }
> +
>          return _dma_cache_sync(dev, paddr, size, dir);
>   }

...
> In case of MMC/DW_MCI (AKA DesignWare MobileStorage controller) that's an execution flow:
> 1) __dw_mci_start_request()
> 2) dw_mci_pre_dma_transfer()
> 3) dma_map_sg(..., mmc_get_dma_dir(data))
>
> Note mmc_get_dma_dir() is just "data->flags & MMC_DATA_WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE".
> I.e. if we're preparing for sending data dma_noncoherent_map_sg() will have DMA_TO_DEVICE which
> is quite OK for passing to dma_noncoherent_sync_sg_for_device() but in case of reading we'll have
> DMA_FROM_DEVICE which we'll pass to dma_noncoherent_sync_sg_for_device() in dma_noncoherent_map_sg().
>
> I'd say this is not entirely correct because IMHO arch_sync_dma_for_cpu() is supposed to only be used
> in case of DMA_FROM_DEVICE and arch_sync_dma_for_device() only in case of DMA_TO_DEVICE.

So roughly 10 years ago, some kernel rookie name Vineet Gupta, asked the exact 
same question :-)

http://kernelnewbies.kernelnewbies.narkive.com/aGW1QcDv/query-about-dma-sync-for-cpu-and-direction-to-device

I never understood the need for this direction. And if memory serves me right, at 
that time I was seeing twice the amount of cache flushing !

> But the real fix of my problem is:
> ---------------------------------------->8------------------------------------
> --- a/lib/dma-noncoherent.c
> +++ b/lib/dma-noncoherent.c
> @@ -35,7 +35,7 @@ static dma_addr_t dma_noncoherent_map_page(struct device *dev, struct page *page
>   
>          addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
>          if (!dma_mapping_error(dev, addr) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> -               arch_sync_dma_for_device(dev, page_to_phys(page), size, dir);
> +               arch_sync_dma_for_device(dev, page_to_phys(page) + offset, size, dir);
>          return addr;
>   }
> ---------------------------------------->8------------------------------------
>
> You seem to lost an offset in the page so if we happen to have a buffer not aligned to
> a page boundary then we were obviously corrupting data outside our data :)

Neat !
Vineet Gupta May 18, 2018, 5:28 p.m. UTC | #4
On 05/18/2018 06:23 AM, hch@lst.de wrote:
>   Fri, May 18, 2018 at 01:03:46PM +0000, Alexey Brodkin wrote:
>> Note mmc_get_dma_dir() is just "data->flags & MMC_DATA_WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE".
>> I.e. if we're preparing for sending data dma_noncoherent_map_sg() will have DMA_TO_DEVICE which
>> is quite OK for passing to dma_noncoherent_sync_sg_for_device() but in case of reading we'll have
>> DMA_FROM_DEVICE which we'll pass to dma_noncoherent_sync_sg_for_device() in dma_noncoherent_map_sg().
>>
>> I'd say this is not entirely correct because IMHO arch_sync_dma_for_cpu() is supposed to only be used
>> in case of DMA_FROM_DEVICE and arch_sync_dma_for_device() only in case of DMA_TO_DEVICE.
> arc overrides the dir paramter of the dma_sync_single_for_device/
> dma_sync_single_for_cpu calls.  My patches dropped that, and I have
> restored that, and audit for the other architectures is pending.

Right, for now lets retain that and do a sweeping audit of @direction - to me it 
seems extraneous (as it did 10 years ago), but I'm not an expert in this are so 
perhaps it is needed for some device / arches and it would be good to understand 
that finally.

> That being said the existing arc code still looks rather odd as it
> didn't do the same thing for the scatterlist versions of the calls.
> I've thrown in a few patches into my new tree to make the sg versions
> make the normal calls, and to clean up the area a bit.

Not calling names or anything here, but it doesn't exist for sg variants, because 
I didn't write that code :-)
It was introduced by your commi:

2016-01-20 052c96dbe33b arc: convert to dma_map_ops
Russell King (Oracle) May 18, 2018, 5:50 p.m. UTC | #5
On Fri, May 18, 2018 at 10:20:02AM -0700, Vineet Gupta wrote:
> I never understood the need for this direction. And if memory serves me
> right, at that time I was seeing twice the amount of cache flushing !

It's necessary.  Take a moment to think carefully about this:

	dma_map_single(, dir)

	dma_sync_single_for_cpu(, dir)

	dma_sync_single_for_device(, dir)

	dma_unmap_single(, dir)

In the case of a DMA-incoherent architecture, the operations done at each
stage depend on the direction argument:

	map		for_cpu		for_device	unmap
TO_DEV	writeback	none		writeback	none
TO_CPU	invalidate	invalidate*	invalidate	invalidate*
BIDIR	writeback	invalidate	writeback	invalidate

* - only necessary if the CPU speculatively prefetches.

The multiple invalidations for the TO_CPU case handles different
conditions that can result in data corruption, and for some CPUs, all
four are necessary.

This is what is implemented for 32-bit ARM, depending on the CPU
capabilities, as we have DMA incoherent devices and we have CPUs that
speculatively prefetch data, and so may load data into the caches while
DMA is in operation.


Things get more interesting if the implementation behind the DMA API has
to copy data between the buffer supplied to the mapping and some DMA
accessible buffer:

	map		for_cpu		for_device	unmap
TO_DEV	copy to dma	none		copy to dma	none
TO_CPU	none		copy to cpu	none		copy to cpu
BIDIR	copy to dma	copy to cpu	copy to dma	copy to cpu

So, in both cases, the value of the direction argument defines what you
need to do in each call.
Alexey Brodkin May 18, 2018, 7:57 p.m. UTC | #6
Hi Russel,

On Fri, 2018-05-18 at 18:50 +0100, Russell King - ARM Linux wrote:
> It's necessary.  Take a moment to think carefully about this:
> 
>         dma_map_single(, dir)
> 
>         dma_sync_single_for_cpu(, dir)
> 
>         dma_sync_single_for_device(, dir)
> 
>         dma_unmap_single(, dir)
> 
> In the case of a DMA-incoherent architecture, the operations done at each
> stage depend on the direction argument:
> 
>         map             for_cpu         for_device      unmap
> TO_DEV  writeback       none            writeback       none
> TO_CPU  invalidate      invalidate*     invalidate      invalidate*
> BIDIR   writeback       invalidate      writeback       invalidate
> 
> * - only necessary if the CPU speculatively prefetches.

I think invalidation of DMA buffer is required on for_cpu(TO_CPU) even
if CPU doesn't preferch - what if we reuse the same buffer for multiple
reads from DMA device?

> The multiple invalidations for the TO_CPU case handles different
> conditions that can result in data corruption, and for some CPUs, all
> four are necessary.

I would agree that map()/unmap() a quite a special cases and so depending
on direction we need to execute in them either for_cpu() or for_device()
call-backs depending on direction.

As for invalidation in case of for_device(TO_CPU) I still don't see
a rationale behind it. Would be interesting to see a real example where
we benefit from this.

> This is what is implemented for 32-bit ARM, depending on the CPU
> capabilities, as we have DMA incoherent devices and we have CPUs that
> speculatively prefetch data, and so may load data into the caches while
> DMA is in operation.
> 
> 
> Things get more interesting if the implementation behind the DMA API has
> to copy data between the buffer supplied to the mapping and some DMA
> accessible buffer:
> 
>         map             for_cpu         for_device      unmap
> TO_DEV  copy to dma     none            copy to dma     none
> TO_CPU  none            copy to cpu     none            copy to cpu
> BIDIR   copy to dma     copy to cpu     copy to dma     copy to cpu
> 
> So, in both cases, the value of the direction argument defines what you
> need to do in each call.

Interesting enough in your seond table (which describes more complicated
case indeed) you set "none" for for_device(TO_CPU) which looks logical
to me.

So IMHO that's what make sense:
---------------------------->8-----------------------------
        map             for_cpu         for_device      unmap
TO_DEV  writeback       none            writeback       none
TO_CPU  none            invalidate      none            invalidate*
BIDIR   writeback       invalidate      writeback       invalidate*
---------------------------->8-----------------------------

* is the case for prefetching CPU.

-Alexey
Helge Deller May 18, 2018, 8:05 p.m. UTC | #7
On 18.05.2018 15:03, Alexey Brodkin wrote:
> But the real fix of my problem is:
> ---------------------------------------->8------------------------------------
> --- a/lib/dma-noncoherent.c
> +++ b/lib/dma-noncoherent.c
> @@ -35,7 +35,7 @@ static dma_addr_t dma_noncoherent_map_page(struct device *dev, struct page *page
>  
>         addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
>         if (!dma_mapping_error(dev, addr) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> -               arch_sync_dma_for_device(dev, page_to_phys(page), size, dir);
> +               arch_sync_dma_for_device(dev, page_to_phys(page) + offset, size, dir);
>         return addr;
>  }
> ---------------------------------------->8------------------------------------
> 
> You seem to lost an offset in the page so if we happen to have a buffer not aligned to
> a page boundary then we were obviously corrupting data outside our data :)

Good.
This patch seems to fix the dma issues I faced on my 32bit B160L parisc box.

So it leaves only one open issue on parisc:
Now every 32 bit parisc system is unnecessarily non-coherent.

Helge
Vineet Gupta May 18, 2018, 8:35 p.m. UTC | #8
On 05/18/2018 10:50 AM, Russell King - ARM Linux wrote:
> On Fri, May 18, 2018 at 10:20:02AM -0700, Vineet Gupta wrote:
>> I never understood the need for this direction. And if memory serves me
>> right, at that time I was seeing twice the amount of cache flushing !
> It's necessary.  Take a moment to think carefully about this:
>
> 	dma_map_single(, dir)
>
> 	dma_sync_single_for_cpu(, dir)
>
> 	dma_sync_single_for_device(, dir)
>
> 	dma_unmap_single(, dir)

As an aside, do these imply a state machine of sorts - does a driver needs to 
always call map_single first ?

My original point of contention/confusion is the specific combinations of API and 
direction, specifically for_cpu(TO_DEV) and for_device(TO_CPU)

Semantically what does dma_sync_single_for_cpu(TO_DEV) even imply for a non dma 
coherent arch.
Your tables below have "none" for both, implying it is unlikely to be a real 
combination (for ARM and ARC atleast).

The other case, actually @dir TO_CPU, independent of for_{cpu, device}  implies 
driver intends to touch it after the call, so it would invalidate any stray lines, 
unconditionally (and not just for speculative prefetch case).


> In the case of a DMA-incoherent architecture, the operations done at each
> stage depend on the direction argument:
>
> 	map		for_cpu		for_device	unmap
> TO_DEV	writeback	none		writeback	none
> TO_CPU	invalidate	invalidate*	invalidate	invalidate*
> BIDIR	writeback	invalidate	writeback	invalidate
>
> * - only necessary if the CPU speculatively prefetches.
>
> The multiple invalidations for the TO_CPU case handles different
> conditions that can result in data corruption, and for some CPUs, all
> four are necessary.

Can you please explain in some more detail, TO_CPU row, why invalidate is 
conditional sometimes.
Russell King (Oracle) May 18, 2018, 9:33 p.m. UTC | #9
On Fri, May 18, 2018 at 07:57:34PM +0000, Alexey Brodkin wrote:
> Hi Russel,

That's Russell.

> On Fri, 2018-05-18 at 18:50 +0100, Russell King - ARM Linux wrote:
> > It's necessary.  Take a moment to think carefully about this:
> > 
> >         dma_map_single(, dir)
> > 
> >         dma_sync_single_for_cpu(, dir)
> > 
> >         dma_sync_single_for_device(, dir)
> > 
> >         dma_unmap_single(, dir)
> > 
> > In the case of a DMA-incoherent architecture, the operations done at each
> > stage depend on the direction argument:
> > 
> >         map             for_cpu         for_device      unmap
> > TO_DEV  writeback       none            writeback       none
> > TO_CPU  invalidate      invalidate*     invalidate      invalidate*
> > BIDIR   writeback       invalidate      writeback       invalidate
> > 
> > * - only necessary if the CPU speculatively prefetches.
> 
> I think invalidation of DMA buffer is required on for_cpu(TO_CPU) even
> if CPU doesn't preferch - what if we reuse the same buffer for multiple
> reads from DMA device?

That's fine - for non-coherent DMA, the CPU caches will only end up
containing data for that memory if:
- the CPU speculatively fetches data from that memory, or
- the CPU explicitly touches that memory

> > The multiple invalidations for the TO_CPU case handles different
> > conditions that can result in data corruption, and for some CPUs, all
> > four are necessary.
> 
> I would agree that map()/unmap() a quite a special cases and so depending
> on direction we need to execute in them either for_cpu() or for_device()
> call-backs depending on direction.
> 
> As for invalidation in case of for_device(TO_CPU) I still don't see
> a rationale behind it. Would be interesting to see a real example where
> we benefit from this.

Yes, you could avoid that, but depending how you structure the
architecture implementation, it can turn out to be a corner case.
The above table is precisely how 32-bit ARM is implemented, because
the way we implement them is based on who owns the memory - the "map"
and "for_device" operations translate internally to a cpu-to-device
ownership transition of the buffer.  Similar for "unmap" and "to_cpu".
It basically avoids having to add additional functions at the lower
implementation levels.

> > Things get more interesting if the implementation behind the DMA API has
> > to copy data between the buffer supplied to the mapping and some DMA
> > accessible buffer:
> > 
> >         map             for_cpu         for_device      unmap
> > TO_DEV  copy to dma     none            copy to dma     none
> > TO_CPU  none            copy to cpu     none            copy to cpu
> > BIDIR   copy to dma     copy to cpu     copy to dma     copy to cpu
> > 
> > So, in both cases, the value of the direction argument defines what you
> > need to do in each call.
> 
> Interesting enough in your seond table (which describes more complicated
> case indeed) you set "none" for for_device(TO_CPU) which looks logical
> to me.
> 
> So IMHO that's what make sense:
> ---------------------------->8-----------------------------
>         map             for_cpu         for_device      unmap
> TO_DEV  writeback       none            writeback       none
> TO_CPU  none            invalidate      none            invalidate*
> BIDIR   writeback       invalidate      writeback       invalidate*
> ---------------------------->8-----------------------------

That doesn't make sense for the TO_CPU case.  If the caches contain
dirty cache lines, and you're DMAing from the device to the system
RAM, other system activity can cause the dirty cache lines to be
evicted (written back) to memory which the DMA has already overwritten.
The result is data corruption.  So, you really can't have "none" in
the "map" case there.

Given that, the "for_cpu" case becomes dependent on whether the CPU
speculatively prefetches.
Russell King (Oracle) May 18, 2018, 9:55 p.m. UTC | #10
On Fri, May 18, 2018 at 01:35:08PM -0700, Vineet Gupta wrote:
> On 05/18/2018 10:50 AM, Russell King - ARM Linux wrote:
> >On Fri, May 18, 2018 at 10:20:02AM -0700, Vineet Gupta wrote:
> >>I never understood the need for this direction. And if memory serves me
> >>right, at that time I was seeing twice the amount of cache flushing !
> >It's necessary.  Take a moment to think carefully about this:
> >
> >	dma_map_single(, dir)
> >
> >	dma_sync_single_for_cpu(, dir)
> >
> >	dma_sync_single_for_device(, dir)
> >
> >	dma_unmap_single(, dir)
> 
> As an aside, do these imply a state machine of sorts - does a driver needs
> to always call map_single first ?

Kind-of, but some drivers do omit some of the dma_sync_*() calls.
For example, if a buffer is written to, then mapped with TO_DEVICE,
and then the CPU wishes to write to it, it's fairly common that a
driver omits the dma_sync_single_for_cpu() call.  If you think about
the cases I gave and what cache operations happen, such a scenario
practically turns out to be safe.

> My original point of contention/confusion is the specific combinations of
> API and direction, specifically for_cpu(TO_DEV) and for_device(TO_CPU)

Remember that it is expected that all calls for a mapping use the
same direction argument while that mapping exists.  In other words,
if you call dma_map_single(TO_DEVICE) and then use any of the other
functions, the other functions will also use TO_DEVICE.  The DMA
direction argument describes the direction of the DMA operation
being performed on the buffer, not on the individual dma_* operation.

What isn't expected at arch level is for drivers to do:

	dma_map_single(TO_DEVICE)
	dma_sync_single_for_cpu(FROM_DEVICE)

or vice versa.

> Semantically what does dma_sync_single_for_cpu(TO_DEV) even imply for a non
> dma coherent arch.
> 
> Your tables below have "none" for both, implying it is unlikely to be a real
> combination (for ARM and ARC atleast).

Very little for the cases that I've stated (and as I mentioned
above, some drivers do omit the call in that case.)

> The other case, actually @dir TO_CPU, independent of for_{cpu, device} 
> implies driver intends to touch it after the call, so it would invalidate
> any stray lines, unconditionally (and not just for speculative prefetch
> case).

If you don't have a CPU that speculatively prefetches, and you've
already had to invalidate the cache lines (to avoid write-backs
corrupting DMA'd data) then there's no need for the architecture
to do any work at the for_cpu(TO_CPU) case - the CPU shouldn't
be touching cache lines that are part of the buffer while it is
mapped, which means a non-speculating CPU won't pull in any
cache lines without an explicit access.

Speculating CPUs are different.  The action of the speculation is
to try and guess what data the program wants to access ahead of
the program flow.  That causes the CPU to prefetch data into the
cache.  The point in the program flow that this happens is not
really determinant to the programmer.  This means that if you try
to read from the DMA buffer after the DMA operation has complete
without invalidating the cache between the DMA completing and the
CPU reading, you have no guarantee that you're reading the data
that the DMA operation has been written.  The cache may have
loaded itself with data before the DMA operation completed, and
the CPU may see that stale data.

The difference between non-speculating CPUs and speculating CPUs
is that for non-speculating CPUs, caches work according to explicit
accesses by the program, and the program is stalled while the data
is fetched from external memory.  Speculating CPUs try to predict
ahead of time what data the program will require in the future,
and attempt to load that data into the caches _before_ the program
requires it - which means that the program suffers fewer stalls.

> >In the case of a DMA-incoherent architecture, the operations done at each
> >stage depend on the direction argument:
> >
> >	map		for_cpu		for_device	unmap
> >TO_DEV	writeback	none		writeback	none
> >TO_CPU	invalidate	invalidate*	invalidate	invalidate*
> >BIDIR	writeback	invalidate	writeback	invalidate
> >
> >* - only necessary if the CPU speculatively prefetches.
> >
> >The multiple invalidations for the TO_CPU case handles different
> >conditions that can result in data corruption, and for some CPUs, all
> >four are necessary.
> 
> Can you please explain in some more detail, TO_CPU row, why invalidate is
> conditional sometimes.

See above - I hope my explanation above is sufficient.
Christoph Hellwig May 19, 2018, 6:38 a.m. UTC | #11
On Fri, May 18, 2018 at 10:05:51PM +0200, Helge Deller wrote:
> This patch seems to fix the dma issues I faced on my 32bit B160L parisc box.
> 
> So it leaves only one open issue on parisc:
> Now every 32 bit parisc system is unnecessarily non-coherent.

I diagree with those comments, let me resend the refactored patch
to make it more clear.
diff mbox

Patch

--- a/arch/arc/mm/dma.c
+++ b/arch/arc/mm/dma.c
@@ -152,14 +152,37 @@  static void _dma_cache_sync(struct device *dev, phys_addr_t paddr, size_t size,
        }
 }
 
+static const char *dir_to_str(enum dma_data_direction dir)
+{
+       switch (dir) {
+       case DMA_BIDIRECTIONAL: return "DMA_BIDIRECTIONAL";
+       case DMA_TO_DEVICE: return "DMA_TO_DEVICE";
+       case DMA_FROM_DEVICE: return "DMA_FROM_DEVICE";
+       case DMA_NONE: return "DMA_NONE";
+       default: return "WRONG_VALUE!";
+       }
+}
+
 void arch_sync_dma_for_device(struct device *dev, phys_addr_t paddr,
                size_t size, enum dma_data_direction dir)
 {
+       if (dir != DMA_TO_DEVICE){
+               dump_stack();
+               printk(" *** %s@%d: DMA direction is %s instead of %s\n",
+                      __func__, __LINE__, dir_to_str(dir), dir_to_str(DMA_TO_DEVICE));
+       }
+
        return _dma_cache_sync(dev, paddr, size, dir);
 }
 
 void arch_sync_dma_for_cpu(struct device *dev, phys_addr_t paddr,
                size_t size, enum dma_data_direction dir)
 {
+       if (dir != DMA_FROM_DEVICE) {
+               dump_stack();
+               printk(" *** %s@%d: DMA direction is %s instead of %s\n",
+                      __func__, __LINE__, dir_to_str(dir), dir_to_str(DMA_FROM_DEVICE));
+       }
+
        return _dma_cache_sync(dev, paddr, size, dir);
 }
---------------------------------------->8------------------------------------

And with that I noticed a bit unexpected output, see below:
---------------------------------------->8------------------------------------
Stack Trace:
  arc_unwind_core.constprop.1+0xd4/0xf8
  dump_stack+0x68/0x80
  arch_sync_dma_for_device+0x34/0xc4
  dma_noncoherent_map_sg+0x80/0x94
  __dw_mci_start_request+0x1ee/0x868
  dw_mci_request+0x17e/0x1c8
  mmc_wait_for_req+0x106/0x1ac
  mmc_app_sd_status+0x108/0x130
  mmc_sd_setup_card+0xc6/0x2e8
  mmc_attach_sd+0x1b6/0x394
  mmc_rescan+0x2f4/0x3bc
  process_one_work+0x194/0x348
  worker_thread+0xf2/0x478
  kthread+0x120/0x13c
  ret_from_fork+0x18/0x1c
 *** arch_sync_dma_for_device@172: DMA direction is DMA_FROM_DEVICE instead of DMA_TO_DEVICE
...
Stack Trace:
  arc_unwind_core.constprop.1+0xd4/0xf8
  dump_stack+0x68/0x80
  arch_sync_dma_for_device+0x34/0xc4
  dma_noncoherent_map_page+0x86/0x8c
  usb_hcd_map_urb_for_dma+0x49e/0x53c
  usb_hcd_submit_urb+0x43c/0x8c4
  usb_control_msg+0xbe/0x16c
  hub_port_init+0x5e0/0xb0c
  hub_event+0x4e6/0x1164
  process_one_work+0x194/0x348
  worker_thread+0xf2/0x478
  kthread+0x120/0x13c
  ret_from_fork+0x18/0x1c
 mmcblk0: p1 p2
 *** arch_sync_dma_for_device@172: DMA direction is DMA_FROM_DEVICE instead of DMA_TO_DEVICE

...
and quite some more of the similar
...
---------------------------------------->8------------------------------------

In case of MMC/DW_MCI (AKA DesignWare MobileStorage controller) that's an execution flow:
1) __dw_mci_start_request()
2) dw_mci_pre_dma_transfer()
3) dma_map_sg(..., mmc_get_dma_dir(data))

Note mmc_get_dma_dir() is just "data->flags & MMC_DATA_WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE".
I.e. if we're preparing for sending data dma_noncoherent_map_sg() will have DMA_TO_DEVICE which
is quite OK for passing to dma_noncoherent_sync_sg_for_device() but in case of reading we'll have
DMA_FROM_DEVICE which we'll pass to dma_noncoherent_sync_sg_for_device() in dma_noncoherent_map_sg().

I'd say this is not entirely correct because IMHO arch_sync_dma_for_cpu() is supposed to only be used
in case of DMA_FROM_DEVICE and arch_sync_dma_for_device() only in case of DMA_TO_DEVICE.


> +static dma_addr_t dma_noncoherent_map_page(struct device *dev, struct page *page,
> +		unsigned long offset, size_t size, enum dma_data_direction dir,
> +		unsigned long attrs)
> +{
> +	dma_addr_t addr;
> +
> +	addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
> +	if (!dma_mapping_error(dev, addr) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> +		arch_sync_dma_for_device(dev, page_to_phys(page), size, dir);
> +	return addr;
> +}
> +
> +static int dma_noncoherent_map_sg(struct device *dev, struct scatterlist *sgl,
> +		int nents, enum dma_data_direction dir, unsigned long attrs)
> +{
> +	nents = dma_direct_map_sg(dev, sgl, nents, dir, attrs);
> +	if (nents > 0 && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> +		dma_noncoherent_sync_sg_for_device(dev, sgl, nents, dir);
> +	return nents;
> +}

The same is for unmap functions.
My guess is we need to respect direction in map/unmap functions and use
either dma_noncoherent_sync_single_for_cpu(..., DMA_FROM_DEVICE) or
dma_noncoherent_sync_single_for_device(...,DMA_TO_DEVICE).


> +static void dma_noncoherent_unmap_page(struct device *dev, dma_addr_t addr,
> +		size_t size, enum dma_data_direction dir, unsigned long attrs)
> +{
> +	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> +		dma_noncoherent_sync_single_for_cpu(dev, addr, size, dir);
> +}
> +
> +static void dma_noncoherent_unmap_sg(struct device *dev, struct scatterlist *sgl,
> +		int nents, enum dma_data_direction dir, unsigned long attrs)
> +{
> +	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> +		dma_noncoherent_sync_sg_for_cpu(dev, sgl, nents, dir);
> +}
> +#endif

But the real fix of my problem is:
---------------------------------------->8------------------------------------
--- a/lib/dma-noncoherent.c
+++ b/lib/dma-noncoherent.c
@@ -35,7 +35,7 @@  static dma_addr_t dma_noncoherent_map_page(struct device *dev, struct page *page
 
        addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
        if (!dma_mapping_error(dev, addr) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
-               arch_sync_dma_for_device(dev, page_to_phys(page), size, dir);
+               arch_sync_dma_for_device(dev, page_to_phys(page) + offset, size, dir);
        return addr;
 }
---------------------------------------->8------------------------------------