
arm64: dma: Drop cache invalidation from arch_dma_prep_coherent()

Message ID 20220823122111.17439-1-will@kernel.org (mailing list archive)
State New, archived
Series arm64: dma: Drop cache invalidation from arch_dma_prep_coherent()

Commit Message

Will Deacon Aug. 23, 2022, 12:21 p.m. UTC
arch_dma_prep_coherent() is called when preparing a non-cacheable region
for a consistent DMA buffer allocation. Since the buffer pages may
previously have been written via a cacheable mapping and consequently
allocated as dirty cachelines, the purpose of this function is to remove
these dirty lines from the cache, writing them back so that the
non-coherent device is able to see them.

On arm64, this operation can be achieved with a clean to the point of
coherency; a subsequent invalidation is not required and serves little
purpose in the presence of a cacheable alias (e.g. the linear map),
since clean lines can be speculatively fetched back into the cache after
the invalidation operation has completed.

Relax the cache maintenance in arch_dma_prep_coherent() so that only a
clean, and not a clean-and-invalidate operation is performed.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
---

I'm slightly wary about this change as other architectures seem to do
clean+invalidate here, but I'd like to hear what others think in any
case.

 arch/arm64/mm/dma-mapping.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
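[Editorial note: the argument in the commit message — that after a clean to the PoC, an extra invalidation buys nothing because speculation through the cacheable linear-map alias can immediately refill the line in a clean state — can be sketched with a toy single-line cache model. This is illustrative userspace C, not kernel code; all names (`toy_clean`, `toy_inval`, etc.) are hypothetical stand-ins for the real maintenance operations.]

```c
#include <assert.h>
#include <stdint.h>

/* One cache line shadowing one byte of memory at the PoC. */
struct cline { int valid, dirty; uint8_t data; };

static uint8_t poc_mem;      /* memory at the PoC, as seen by a non-coherent device */
static struct cline cache;   /* the CPU's cached copy via the cacheable alias */

/* A cacheable store allocates the line into the cache and marks it dirty. */
static void cpu_write_cacheable(uint8_t v)
{
	cache.valid = 1;
	cache.dirty = 1;
	cache.data  = v;
}

/* Models dcache_clean_poc(): write dirty data back, but keep the line. */
static void toy_clean(void)
{
	if (cache.valid && cache.dirty) {
		poc_mem = cache.data;
		cache.dirty = 0;
	}
}

/* Models the extra invalidation: drop the line without writing back. */
static void toy_inval(void)
{
	cache.valid = 0;
}

/* A speculative fetch via the cacheable linear-map alias refills a clean copy. */
static void toy_speculative_fill(void)
{
	if (!cache.valid) {
		cache.data  = poc_mem;
		cache.valid = 1;
		cache.dirty = 0;
	}
}
```

In this model, clean followed by invalidate followed by a speculative fill ends in exactly the same state as a clean alone: a valid, clean line whose data matches the PoC, which is the observation motivating the patch.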

Comments

Ard Biesheuvel Aug. 24, 2022, 9:58 a.m. UTC | #1
On Tue, 23 Aug 2022 at 14:21, Will Deacon <will@kernel.org> wrote:
>
> arch_dma_prep_coherent() is called when preparing a non-cacheable region
> for a consistent DMA buffer allocation. Since the buffer pages may
> previously have been written via a cacheable mapping and consequently
> allocated as dirty cachelines, the purpose of this function is to remove
> these dirty lines from the cache, writing them back so that the
> non-coherent device is able to see them.
>
> On arm64, this operation can be achieved with a clean to the point of
> coherency; a subsequent invalidation is not required and serves little
> purpose in the presence of a cacheable alias (e.g. the linear map),
> since clean lines can be speculatively fetched back into the cache after
> the invalidation operation has completed.
>
> Relax the cache maintenance in arch_dma_prep_coherent() so that only a
> clean, and not a clean-and-invalidate operation is performed.
>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>
> I'm slightly wary about this change as other architectures seem to do
> clean+invalidate here, but I'd like to hear what others think in any
> case.
>

Assuming that the architecture never permits non-cacheable accesses to
hit in the caches [i.e., if the behavior the ARM ARM describes in
'Behavior of caches at reset' is the only exception], then I agree
that invalidation should make no difference. However, given the fact
that we are dealing with mismatched attribute mappings here, retaining
the invalidate seems like the safer bet. (IIRC, the justification for
permitting the linear map's mismatched attributes is that we never
access the region via its linear address, and so we are unlikely to be
affected by any conflicts. Purging cachelines that cover this region
seems like a sensible thing to do in that respect.)

Why are you proposing to change this?

>  arch/arm64/mm/dma-mapping.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> index 599cf81f5685..83a512a6ff0d 100644
> --- a/arch/arm64/mm/dma-mapping.c
> +++ b/arch/arm64/mm/dma-mapping.c
> @@ -36,7 +36,7 @@ void arch_dma_prep_coherent(struct page *page, size_t size)
>  {
>         unsigned long start = (unsigned long)page_address(page);
>
> -       dcache_clean_inval_poc(start, start + size);
> +       dcache_clean_poc(start, start + size);
>  }
>
>  #ifdef CONFIG_IOMMU_DMA
> --
> 2.37.1.595.g718a3a8f04-goog
>
Will Deacon Aug. 24, 2022, 11:23 a.m. UTC | #2
Hi Ard,

Cheers for having a look.

On Wed, Aug 24, 2022 at 11:58:46AM +0200, Ard Biesheuvel wrote:
> On Tue, 23 Aug 2022 at 14:21, Will Deacon <will@kernel.org> wrote:
> >
> > arch_dma_prep_coherent() is called when preparing a non-cacheable region
> > for a consistent DMA buffer allocation. Since the buffer pages may
> > previously have been written via a cacheable mapping and consequently
> > allocated as dirty cachelines, the purpose of this function is to remove
> > these dirty lines from the cache, writing them back so that the
> > non-coherent device is able to see them.
> >
> > On arm64, this operation can be achieved with a clean to the point of
> > coherency; a subsequent invalidation is not required and serves little
> > purpose in the presence of a cacheable alias (e.g. the linear map),
> > since clean lines can be speculatively fetched back into the cache after
> > the invalidation operation has completed.
> >
> > Relax the cache maintenance in arch_dma_prep_coherent() so that only a
> > clean, and not a clean-and-invalidate operation is performed.
> >
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Mark Rutland <mark.rutland@arm.com>
> > Cc: Robin Murphy <robin.murphy@arm.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Ard Biesheuvel <ardb@kernel.org>
> > Signed-off-by: Will Deacon <will@kernel.org>
> > ---
> >
> > I'm slightly wary about this change as other architectures seem to do
> > clean+invalidate here, but I'd like to hear what others think in any
> > case.
> >
> 
> Assuming that the architecture never permits non-cacheable accesses to
> hit in the caches [i.e., if the behavior the ARM ARM describes in
> 'Behavior of caches at reset' is the only exception],

As long as the allocated lines are clean, then the non-cacheable access
won't hit.

> then I agree that invalidation should make no difference. However, given
> the fact that we are dealing with mismatched attribute mappings here,
> retaining the invalidate seems like the safer bet. (IIRC, the
> justification for permitting the linear map's mismatched attributes is
> that we never access the region via its linear address, and so we are
> unlikely to be affected by any conflicts. Purging cachelines that cover
> this region seems like a sensible thing to do in that respect.)

It's certainly harmless to keep the invalidation, but I also don't think
it's serving any real purpose either given that the clean lines can be
pulled straight back in by speculative accesses through the linear map.

> Why are you proposing to change this?

Some hardware folks were asking me why we had the invalidation here and
then I realised that it's not necessary. An alternative to removing it
would be to add a comment saying it's not needed, but it's a bit weird
writing that!

Will
Ard Biesheuvel Aug. 24, 2022, 11:49 a.m. UTC | #3
On Wed, 24 Aug 2022 at 13:23, Will Deacon <will@kernel.org> wrote:
>
> Hi Ard,
>
> Cheers for having a look.
>
> On Wed, Aug 24, 2022 at 11:58:46AM +0200, Ard Biesheuvel wrote:
> > On Tue, 23 Aug 2022 at 14:21, Will Deacon <will@kernel.org> wrote:
> > >
> > > arch_dma_prep_coherent() is called when preparing a non-cacheable region
> > > for a consistent DMA buffer allocation. Since the buffer pages may
> > > previously have been written via a cacheable mapping and consequently
> > > allocated as dirty cachelines, the purpose of this function is to remove
> > > these dirty lines from the cache, writing them back so that the
> > > non-coherent device is able to see them.
> > >
> > > On arm64, this operation can be achieved with a clean to the point of
> > > coherency; a subsequent invalidation is not required and serves little
> > > purpose in the presence of a cacheable alias (e.g. the linear map),
> > > since clean lines can be speculatively fetched back into the cache after
> > > the invalidation operation has completed.
> > >
> > > Relax the cache maintenance in arch_dma_prep_coherent() so that only a
> > > clean, and not a clean-and-invalidate operation is performed.
> > >
> > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > Cc: Mark Rutland <mark.rutland@arm.com>
> > > Cc: Robin Murphy <robin.murphy@arm.com>
> > > Cc: Christoph Hellwig <hch@lst.de>
> > > Cc: Ard Biesheuvel <ardb@kernel.org>
> > > Signed-off-by: Will Deacon <will@kernel.org>
> > > ---
> > >
> > > I'm slightly wary about this change as other architectures seem to do
> > > clean+invalidate here, but I'd like to hear what others think in any
> > > case.
> > >
> >
> > Assuming that the architecture never permits non-cacheable accesses to
> > hit in the caches [i.e., if the behavior the ARM ARM describes in
> > 'Behavior of caches at reset' is the only exception],
>
> As long as the allocated lines are clean, then the non-cacheable access
> won't hit.
>

If this is guaranteed to remain true even in the presence of
mismatched attribute mappings of the same physical region, then we
should be fine, I suppose.

But if not, shouldn't invalidating the lines substantially reduce the
likelihood of hitting this issue? And even if they /could/ be
speculated back in, how likely is that to happen if the linear alias
VA is not exposed anywhere?

> > then I agree that invalidation should make no difference. However, given
> > the fact that we are dealing with mismatched attribute mappings here,
> > retaining the invalidate seems like the safer bet. (IIRC, the
> > justification for permitting the linear map's mismatched attributes is
> > that we never access the region via its linear address, and so we are
> > unlikely to be affected by any conflicts. Purging cachelines that cover
> > this region seems like a sensible thing to do in that respect.)
>
> It's certainly harmless to keep the invalidation, but I also don't think
> it's serving any real purpose either given that the clean lines can be
> pulled straight back in by speculative accesses through the linear map.
>
> > Why are you proposing to change this?
>
> Some hardware folks were asking me why we had the invalidation here and
> then I realised that it's not necessary. An alternative to removing it
> would be to add a comment saying it's not needed, but it's a bit weird
> writing that!
>

I don't disagree with that. If we weren't violating the ARM ARM by
having two live mappings of the same physical region with mismatched
attributes, I wouldn't be as suspicious.
Catalin Marinas Aug. 24, 2022, 10 p.m. UTC | #4
On Tue, Aug 23, 2022 at 01:21:11PM +0100, Will Deacon wrote:
> arch_dma_prep_coherent() is called when preparing a non-cacheable region
> for a consistent DMA buffer allocation. Since the buffer pages may
> previously have been written via a cacheable mapping and consequently
> allocated as dirty cachelines, the purpose of this function is to remove
> these dirty lines from the cache, writing them back so that the
> non-coherent device is able to see them.
> 
> On arm64, this operation can be achieved with a clean to the point of
> coherency; a subsequent invalidation is not required and serves little
> purpose in the presence of a cacheable alias (e.g. the linear map),
> since clean lines can be speculatively fetched back into the cache after
> the invalidation operation has completed.
> 
> Relax the cache maintenance in arch_dma_prep_coherent() so that only a
> clean, and not a clean-and-invalidate operation is performed.
> 
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
> 
> I'm slightly wary about this change as other architectures seem to do
> clean+invalidate here, but I'd like to hear what others think in any
> case.

Given that we still have the cacheable alias around, I don't see much
point in the invalidation. So:

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

(I was wondering why not just invalidate without clean but it could be
that the allocated memory was zeroed and we want that to make it to the
PoC)
Christoph Hellwig Sept. 7, 2022, 9:03 a.m. UTC | #5
On Tue, Aug 23, 2022 at 01:21:11PM +0100, Will Deacon wrote:
> arch_dma_prep_coherent() is called when preparing a non-cacheable region
> for a consistent DMA buffer allocation. Since the buffer pages may
> previously have been written via a cacheable mapping and consequently
> allocated as dirty cachelines, the purpose of this function is to remove
> these dirty lines from the cache, writing them back so that the
> non-coherent device is able to see them.

Yes.

> I'm slightly wary about this change as other architectures seem to do
> clean+invalidate here, but I'd like to hear what others think in any
> case.

If arm64 is fine with having clean but present cachelines when creating
an uncached mapping for a cache line, the invalidate is not required.

But isn't it better for the cache if these by definition useless
cachelines get evicted?

My biggest concern here is that we're now moving from consolidating
these semantics in all the different architectures to different ones,
making a centralization of the policies even harder.
Christoph Hellwig Sept. 7, 2022, 9:04 a.m. UTC | #6
On Wed, Aug 24, 2022 at 11:00:02PM +0100, Catalin Marinas wrote:
> (I was wondering why not just invalidate without clean but it could be
> that the allocated memory was zeroed and we want that to make it to the
> PoC)

The memory is zeroed after arch_dma_prep_coherent().  So a pure invalidate
seems like the right thing to do here.
Robin Murphy Sept. 7, 2022, 9:27 a.m. UTC | #7
On 2022-09-07 10:03, Christoph Hellwig wrote:
> On Tue, Aug 23, 2022 at 01:21:11PM +0100, Will Deacon wrote:
>> arch_dma_prep_coherent() is called when preparing a non-cacheable region
>> for a consistent DMA buffer allocation. Since the buffer pages may
>> previously have been written via a cacheable mapping and consequently
>> allocated as dirty cachelines, the purpose of this function is to remove
>> these dirty lines from the cache, writing them back so that the
>> non-coherent device is able to see them.
> 
> Yes.
> 
>> I'm slightly wary about this change as other architectures seem to do
>> clean+invalidate here, but I'd like to hear what others think in any
>> case.
> 
> If arm64 is fine with having clean but present cachelines when creating
> an uncached mapping for a cache line, the invalidate is not required.
> 
> But isn't it better for the cache if these by definition useless
> cachelines get evicted?
> 
> My biggest concern here is that we're now moving from consolidating
> these semantics in all the different architectures to different ones,
> making a centralization of the policies even harder.

FWIW I agree with Ard in not being entirely confident with this change. 
The impression I had (which may be wrong) was that the architecture 
never actually ruled out unexpected cache hits in the case of mismatched 
attributes, it just quietly stopped mentioning it at all. And even if 
the architecture did rule them out, how confident are we about errata 
that might still allow them to happen?

It seems like we don't stand to gain much by removing the invalidation - 
since the overhead will still be in the clean - other than the potential 
for a slightly increased chance of rare and hard-to-debug memory 
corruption :/

Cheers,
Robin.

(who's spent the last few months elbow-deep in a hideous CPU cache 
erratum...)
Russell King (Oracle) Sept. 7, 2022, 2:10 p.m. UTC | #8
On Wed, Sep 07, 2022 at 11:04:34AM +0200, Christoph Hellwig wrote:
> On Wed, Aug 24, 2022 at 11:00:02PM +0100, Catalin Marinas wrote:
> > (I was wondering why not just invalidate without clean but it could be
> > that the allocated memory was zeroed and we want that to make it to the
> > PoC)
> 
> The memory is zeroed after arch_dma_prep_coherent().  So a pure invalidate
> seems like the right thing to do here.

My thoughts on Christoph's comment...

That seems backwards to me.

If we allocate memory, clean (and maybe invalidate) the cache, and then
zero the memory _after_ the clean, then couldn't we be creating dirty
cache lines? There are caches out there that are write-allocate, which
means that by writing zeros to memory, you drag cache lines back into
the cache. These dirty cache lines could then be written back at some
random time later, corrupting the data that has been DMA'd to the
buffer or placed there via the uncached mapping (if the architecture
doesn't hit those cache lines.)

It has to be:
- allocate
- zero
- deal with the cache

not:
- allocate
- deal with the cache
- zero

Moreover, if one does:
- allocate
- invalidate
- zero
- clean+(invalidate?)

then aren't we making the zero operation potentially more expensive than
if we had cache lines already there - I'm thinking there about a
no-write-allocate cache here. Those cache lines could have been used to
buffer the zeroing writes and then written out later. Depending on the
architecture, having cached lines may mean that write merging happens,
meaning writebacks to memory occur in larger chunks.
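[Editorial note: the ordering hazard described above — a write-allocate cache turning post-clean zeroing into dirty lines that can later overwrite DMA data — can be demonstrated with a toy model. Illustrative userspace C only; all names (`zero_cacheable`, `random_eviction`, etc.) are hypothetical.]

```c
#include <assert.h>
#include <stdint.h>

struct cline { int valid, dirty; uint8_t data; };

static uint8_t poc_mem;      /* memory at the PoC, where the device reads/writes */
static struct cline cache;

/* The buffer starts out dirty in the cache from a previous cacheable user. */
static void reset(void)
{
	poc_mem = 0x55;
	cache.valid = 1;
	cache.dirty = 1;
	cache.data  = 0x55;
}

/* Zeroing through a write-allocate cacheable mapping drags the line in, dirty. */
static void zero_cacheable(void)
{
	cache.valid = 1;
	cache.dirty = 1;
	cache.data  = 0x00;
}

/* Clean to the PoC: write back dirty data, keep the line. */
static void clean(void)
{
	if (cache.valid && cache.dirty) {
		poc_mem = cache.data;
		cache.dirty = 0;
	}
}

/* A non-coherent device DMAs data into the buffer. */
static void dma_write(uint8_t v)
{
	poc_mem = v;
}

/* A dirty line being written back "at some random time later". */
static void random_eviction(void)
{
	clean();
	cache.valid = 0;
}
```

With clean-then-zero, the zeroing leaves a dirty line of zeros behind, and its eventual writeback clobbers data the device has since DMA'd into the buffer; with zero-then-clean, nothing dirty remains to write back and the DMA data survives, which is the ordering Russell argues for.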
Christoph Hellwig Sept. 7, 2022, 2:14 p.m. UTC | #9
On Wed, Sep 07, 2022 at 03:10:35PM +0100, Russell King (Oracle) wrote:
> If we allocate memory, clean (and maybe invalidate) the cache, and then
> zero the memory _after_ the clean, then couldn't we be creating dirty
> cache lines. There are caches out there that are write-allocate, which
> means that by writing zeros to memory, you drag cache lines back into
> the cache.

When writing through an uncached mapping?
Russell King (Oracle) Sept. 7, 2022, 2:15 p.m. UTC | #10
On Wed, Sep 07, 2022 at 04:14:41PM +0200, Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 03:10:35PM +0100, Russell King (Oracle) wrote:
> > If we allocate memory, clean (and maybe invalidate) the cache, and then
> > zero the memory _after_ the clean, then couldn't we be creating dirty
> > cache lines. There are caches out there that are write-allocate, which
> > means that by writing zeros to memory, you drag cache lines back into
> > the cache.
> 
> When writing through an uncached mapping?

Eww. That's going to be slow.
Will Deacon Sept. 7, 2022, 4:25 p.m. UTC | #11
On Wed, Sep 07, 2022 at 10:27:45AM +0100, Robin Murphy wrote:
> On 2022-09-07 10:03, Christoph Hellwig wrote:
> > On Tue, Aug 23, 2022 at 01:21:11PM +0100, Will Deacon wrote:
> > > arch_dma_prep_coherent() is called when preparing a non-cacheable region
> > > for a consistent DMA buffer allocation. Since the buffer pages may
> > > previously have been written via a cacheable mapping and consequently
> > > allocated as dirty cachelines, the purpose of this function is to remove
> > > these dirty lines from the cache, writing them back so that the
> > > non-coherent device is able to see them.
> > 
> > Yes.
> > 
> > > I'm slightly wary about this change as other architectures seem to do
> > > clean+invalidate here, but I'd like to hear what others think in any
> > > case.
> > 
> > If arm64 is fine with having clean but present cachelines when creating
> > an uncached mapping for a cache line, the invalidate is not required.
> > 
> > But isn't it better for the cache if these by definition useless
> > cachelines get evicted?
> > 
> > My biggest concern here is that we're now moving from consolidating
> > these semantics in all the different architectures to different ones,
> > making a centralization of the policies even harder.
> 
> FWIW I agree with Ard in not being entirely confident with this change. The
> impression I had (which may be wrong) was that the architecture never
> actually ruled out unexpected cache hits in the case of mismatched
> attributes, it just quietly stopped mentioning it at all. And even if the
> architecture did rule them out, how confident are we about errata that might
> still allow them to happen?

The architecture does rule this out as long as the cacheable alias is not
dirty (i.e. has not been written to). If this was not the case, then our
non-cacheable DMA buffers would be unreliable because speculative fills
from the linear map could "mask" the actual DMA data. The presence of the
linear map also means that the lines could immediately be refetched by
speculation, rendering the invalidation pointless in the current code.

Regardless of this patch, a CPU erratum violating the architectural
guarantee would be broken. It is the presence of the linear alias that
would need to be resolved.

The architecture text is buried in the infamous "B2.8 Mismatched memory
attributes" section of the Arm ARM, but it's also worth talking to Mark R,
Catalin and the handful of folks in ATG who grok this to get a summary
which is easier to digest.

> It seems like we don't stand to gain much by removing the invalidation -
> since the overhead will still be in the clean - other than the potential for
> a slightly increased chance of rare and hard-to-debug memory corruption :/

I just find it odd that we rely on the CPU not hitting the cacheable alias
in other places, yet we're using an invalidation for this path. It's
inconsistent and difficult to explain to people. As I said, I'm happy to
add a comment to the existing code instead of the change here, but I don't
know what to say other than something like:

  /*
   * The architecture says we only need a clean here, but invalidate as
   * well just in case.
   */

Which feels deeply unsatisfactory.

Will
Will Deacon Sept. 7, 2022, 4:28 p.m. UTC | #12
On Wed, Sep 07, 2022 at 11:03:05AM +0200, Christoph Hellwig wrote:
> On Tue, Aug 23, 2022 at 01:21:11PM +0100, Will Deacon wrote:
> > arch_dma_prep_coherent() is called when preparing a non-cacheable region
> > for a consistent DMA buffer allocation. Since the buffer pages may
> > previously have been written via a cacheable mapping and consequently
> > allocated as dirty cachelines, the purpose of this function is to remove
> > these dirty lines from the cache, writing them back so that the
> > non-coherent device is able to see them.
> 
> Yes.
> 
> > I'm slightly wary about this change as other architectures seem to do
> > clean+invalidate here, but I'd like to hear what others think in any
> > case.
> 
> If arm64 is fine with having clean but present cachelines when creating
> an uncached mapping for a cache line, the invalidate is not required.
> 
> But isn't it better for the cache if these by definition useless
> cachelines get evicted?
> 
> My biggest concern here is that we're now moving from consolidating
> these semantics in all the different architectures to different ones,
> making a centralization of the policies even harder.

I definitely wouldn't object to a pointless invalidation if it helped with
consolidation, and we could add a comment to the arch code to that effect.
At the moment, however, I can't imagine what this consolidation would look
like -- at the end of the day we're going to need to dip into the arch
for the cache op.

Do you have a branch, or is it a longer-term goal?

Will
Robin Murphy Sept. 7, 2022, 5:50 p.m. UTC | #13
On 2022-09-07 17:25, Will Deacon wrote:
> On Wed, Sep 07, 2022 at 10:27:45AM +0100, Robin Murphy wrote:
>> On 2022-09-07 10:03, Christoph Hellwig wrote:
>>> On Tue, Aug 23, 2022 at 01:21:11PM +0100, Will Deacon wrote:
>>>> arch_dma_prep_coherent() is called when preparing a non-cacheable region
>>>> for a consistent DMA buffer allocation. Since the buffer pages may
>>>> previously have been written via a cacheable mapping and consequently
>>>> allocated as dirty cachelines, the purpose of this function is to remove
>>>> these dirty lines from the cache, writing them back so that the
>>>> non-coherent device is able to see them.
>>>
>>> Yes.
>>>
>>>> I'm slightly wary about this change as other architectures seem to do
>>>> clean+invalidate here, but I'd like to hear what others think in any
>>>> case.
>>>
>>> If arm64 is fine with having clean but present cachelines when creating
>>> an uncached mapping for a cache line, the invalidate is not required.
>>>
>>> But isn't it better for the cache if these by definition useless
>>> cachelines get evicted?
>>>
>>> My biggest concern here is that we're now moving from consolidating
>>> these semantics in all the different architectures to different ones,
>>> making a centralization of the policies even harder.
>>
>> FWIW I agree with Ard in not being entirely confident with this change. The
>> impression I had (which may be wrong) was that the architecture never
>> actually ruled out unexpected cache hits in the case of mismatched
>> attributes, it just quietly stopped mentioning it at all. And even if the
>> architecture did rule them out, how confident are we about errata that might
>> still allow them to happen?
> 
> The architecture does rule this out as long as the cacheable alias is not
> dirty (i.e. has not been written to). If this was not the case, then our
> non-cacheable DMA buffers would be unreliable because speculative fills
> from the linear map could "mask" the actual DMA data. The presence of the
> linear map also means that the lines could immediately be refetched by
> speculation, rendering the invalidation pointless in the current code.
> 
> Regardless of this patch, a CPU erratum violating the architectural
> guarantee would be broken. It is the presence of the linear alias that
> would need to be resolved.

Oh have I got a treat for you... but that's not relevant here ;)

> The architecture text is buried in the infamous "B2.8 Mismatched memory
> attributes" section of the Arm ARM, but it's also worth talking to Mark R,
> Catalin and the handful of folks in ATG who grok this to get a summary
> which is easier to digest.

I'm looking at DDI0487I.a and nothing stands out as different from 
previous readings. In particular,

"A read of the memory Location by one agent might not return the value 
most recently written to that memory Location by the same agent"

seems entirely consistent with non-cacheable reads being allowed to 
unexpectedly hit clean (but stale) cache entries even if non-cacheable 
writes don't. And frankly I don't see how

"There might be a loss of coherency when multiple agents attempt to 
access a memory Location."

helps rule *anything* out. The bit about writeback does seem to imply 
that non-cacheable writes can't dirty a cache entry, so I can at least 
accept that unexpected hits for writes are off the table. Furthermore, 
the bit about mismatched shareability - which I believe applies here 
since our cacheable mappings are ISh while non-cacheable is inherently 
OSh - explicitly mentions:

"Software running on a PE cleans and invalidates a Location from cache 
before and after each read or write to that Location by that PE."

If the architecture saying to clean and invalidate is supposed to be a 
guarantee that invalidation is unnecessary, I give up.

>> It seems like we don't stand to gain much by removing the invalidation -
>> since the overhead will still be in the clean - other than the potential for
>> a slightly increased chance of rare and hard-to-debug memory corruption :/
> 
> I just find it odd that we rely on the CPU not hitting the cacheable alias
> in other places, yet we're using an invalidation for this path. It's
> inconsistent and difficult to explain to people. As I said, I'm happy to
> add a comment to the existing code instead of the change here, but I don't
> know what to say other than something like:
> 
>    /*
>     * The architecture says we only need a clean here, but invalidate as
>     * well just in case.
>     */
> 
> Which feels deeply unsatisfactory.

Well, the chances of the alias being speculated back into caches by 
access to nearby linear map pages still seem a lot smaller than the 
chances of caches hanging on to lines we've just finished accessing with 
cacheable attributes. Christoph makes a good point that at worst it's a 
concise hint for caches with LRU replacement policy that might otherwise 
think we could be coming back for our freshly-written clean zeros.

Cheers,
Robin.
Catalin Marinas Sept. 8, 2022, 10:32 a.m. UTC | #14
On Wed, Sep 07, 2022 at 05:25:43PM +0100, Will Deacon wrote:
> On Wed, Sep 07, 2022 at 10:27:45AM +0100, Robin Murphy wrote:
> > On 2022-09-07 10:03, Christoph Hellwig wrote:
> > > On Tue, Aug 23, 2022 at 01:21:11PM +0100, Will Deacon wrote:
> > > > arch_dma_prep_coherent() is called when preparing a non-cacheable region
> > > > for a consistent DMA buffer allocation. Since the buffer pages may
> > > > previously have been written via a cacheable mapping and consequently
> > > > allocated as dirty cachelines, the purpose of this function is to remove
> > > > these dirty lines from the cache, writing them back so that the
> > > > non-coherent device is able to see them.
> > > 
> > > Yes.
> > > 
> > > > I'm slightly wary about this change as other architectures seem to do
> > > > clean+invalidate here, but I'd like to hear what others think in any
> > > > case.
> > > 
> > > If arm64 is fine with having clean but present cachelines when creating
> > > an uncached mapping for a cache line, the invalidate is not required.
> > > 
> > > But isn't it better for the cache if these by definition useless
> > > cachelines get evicted?
> > > 
> > > My biggest concern here is that we're now moving from consolidating
> > > these semantics in all the different architectures to different ones,
> > > making a centralization of the policies even harder.
> > 
> > FWIW I agree with Ard in not being entirely confident with this change. The
> > impression I had (which may be wrong) was that the architecture never
> > actually ruled out unexpected cache hits in the case of mismatched
> > attributes, it just quietly stopped mentioning it at all. And even if the
> > architecture did rule them out, how confident are we about errata that might
> > still allow them to happen?

For the erratum you have in mind, invalidation doesn't help either (as
you know already).

The architecture does rule out unexpected cache hits, though not by
stating this explicitly but rather only listing the cases where one
needs cache maintenance in the presence of mismatched aliases. Such
unexpected cache hits may happen micro-architecturally but are not
visible to software (in theory and ignoring timing side-channels).

> > It seems like we don't stand to gain much by removing the invalidation -
> > since the overhead will still be in the clean - other than the potential for
> > a slightly increased chance of rare and hard-to-debug memory corruption :/
> 
> I just find it odd that we rely on the CPU not hitting the cacheable alias
> in other places, yet we're using an invalidation for this path. It's
> inconsistent and difficult to explain to people. As I said, I'm happy to
> add a comment to the existing code instead of the change here, but I don't
> know what to say other than something like:
> 
>   /*
>    * The architecture says we only need a clean here, but invalidate as
>    * well just in case.
>    */

The architecture requires invalidate or clean while also stating that
clean+invalidate can be used instead, so I don't think there's much to
justify. From the mismatched attributes section:

1. If the mismatched attributes for a memory location all assign the
   same shareability attribute to a Location that has a cacheable
   attribute, any loss of uniprocessor semantics, ordering, or coherency
   within a shareability domain can be avoided by use of software cache
   management. To do so, software must use the techniques that are
   required for the software management of the ordering or coherency of
   cacheable Locations between agents in different shareability domains.
   This means:
   - Before writing to a cacheable Location not using the Write-Back
     attribute, software must invalidate, or clean, a Location from the
     caches if any agent might have written to the Location with the
     Write-Back attribute. This avoids the possibility of overwriting
     the Location with stale data.
   - After writing to a cacheable Location with the Write-Back
     attribute, software must clean the Location from the caches, to
     make the write visible to external memory.
   ...

When creating the non-cacheable alias for DMA, what we need is the first
bullet point above. A clean would be required if we wrote something via
the write-back mapping and we want that data to be visible (not needed
if the zeroing is done via the non-cacheable alias).

However, it just crossed my mind that we can't always drop the
CPU-written (meta)data with an invalidate even if the zeroing is done on
the non-cacheable alias. With MTE+kasan, the page allocator may colour
the memory (via the cacheable alias). Even if non-cacheable accesses are
not checked by MTE, when freeing it back it could potentially confuse
the allocator if the tags were lost. Similarly for the streaming DMA and
this could have been worse but luckily we had c50f11c6196f ("arm64: mm:
Don't invalidate FROM_DEVICE buffers at start of DMA transfer") for
other reasons.

So yeah, I think a clean is needed here or clean+invalidate but not
invalidate only due to the addition of MTE.
Robin Murphy Sept. 8, 2022, 11:32 a.m. UTC | #15
On 2022-09-08 11:32, Catalin Marinas wrote:
> On Wed, Sep 07, 2022 at 05:25:43PM +0100, Will Deacon wrote:
>> On Wed, Sep 07, 2022 at 10:27:45AM +0100, Robin Murphy wrote:
>>> On 2022-09-07 10:03, Christoph Hellwig wrote:
>>>> On Tue, Aug 23, 2022 at 01:21:11PM +0100, Will Deacon wrote:
>>>>> arch_dma_prep_coherent() is called when preparing a non-cacheable region
>>>>> for a consistent DMA buffer allocation. Since the buffer pages may
>>>>> previously have been written via a cacheable mapping and consequently
>>>>> allocated as dirty cachelines, the purpose of this function is to remove
>>>>> these dirty lines from the cache, writing them back so that the
>>>>> non-coherent device is able to see them.
>>>>
>>>> Yes.
>>>>
>>>>> I'm slightly wary about this change as other architectures seem to do
>>>>> clean+invalidate here, but I'd like to hear what others think in any
>>>>> case.
>>>>
>>>> If arm64 is fine with having clean but present cachelines when creating
>>>> an uncached mapping for a cache line, the invalidate is not required.
>>>>
>>>> But isn't it better for the cache if these by definition useless
>>>> cachelines get evicted?
>>>>
>>>> My biggest concern here is that we're now moving from consolidating
>>>> these semantics in all the different architectures to different ones,
>>>> making a centralization of the policies even harder.
>>>
>>> FWIW I agree with Ard in not being entirely confident with this change. The
>>> impression I had (which may be wrong) was that the architecture never
>>> actually ruled out unexpected cache hits in the case of mismatched
>>> attributes, it just quietly stopped mentioning it at all. And even if the
>>> architecture did rule them out, how confident are we about errata that might
>>> still allow them to happen?
> 
> For the erratum you have in mind, invalidation doesn't help either (as
> you know already).
> 
> The architecture does rule out unexpected cache hits, though not by
> stating this explicitly but rather by only listing the cases where one
> needs cache maintenance in the presence of mismatched aliases. Such
> unexpected cache hits may happen micro-architecturally but are not
> visible to software (in theory and ignoring timing side-channels).
> 
>>> It seems like we don't stand to gain much by removing the invalidation -
>>> since the overhead will still be in the clean - other than the potential for
>>> a slightly increased chance of rare and hard-to-debug memory corruption :/
>>
>> I just find it odd that we rely on the CPU not hitting the cacheable alias
>> in other places, yet we're using an invalidation for this path. It's
>> inconsistent and difficult to explain to people. As I said, I'm happy to
>> add a comment to the existing code instead of the change here, but I don't
>> know what to say other than something like:
>>
>>    /*
>>     * The architecture says we only need a clean here, but invalidate as
>>     * well just in case.
>>     */
> 
> The architecture requires invalidate or clean while also stating that
> clean+invalidate can be used instead, so I don't think there's much to
> justify. From the mismatched attributes section:
> 
> 1. If the mismatched attributes for a memory location all assign the
>     same shareability attribute to a Location that has a cacheable
>     attribute,

This is not describing our case, though. We do have a cacheable 
attribute in at least one alias, but the shareability is *not* the same 
across all aliases, therefore at face value it clearly does not apply.

If it's intended to mean that the shareability of all aliases with 
cacheable attributes is the same and the shareability of aliases without 
cacheable attributes doesn't matter, then it should bloody well say 
that, not something entirely contradictory :(

I'm basically just shaking my fists at the sky here, but it irks me no 
end to be writing patches based on what half a dozen people (myself 
included, in some cases) happen to know that an architecture is supposed 
to mean, rather than what it actually says.

Cheers,
Robin.

> any loss of uniprocessor semantics, ordering, or coherency
>     within a shareability domain can be avoided by use of software cache
>     management. To do so, software must use the techniques that are
>     required for the software management of the ordering or coherency of
>     cacheable Locations between agents in different shareability domains.
>     This means:
>     - Before writing to a cacheable Location not using the Write-Back
>       attribute, software must invalidate, or clean, a Location from the
>       caches if any agent might have written to the Location with the
>       Write-Back attribute. This avoids the possibility of overwriting
>       the Location with stale data.
>     - After writing to a cacheable Location with the Write-Back
>       attribute, software must clean the Location from the caches, to
>       make the write visible to external memory.
>     ...
> 
> When creating the non-cacheable alias for DMA, what we need is the first
> bullet point above. A clean would be required if we wrote something via
> the write-back mapping and we want that data to be visible (not needed
> if the zeroing is done via the non-cacheable alias).
> 
> However, it just crossed my mind that we can't always drop the
> CPU-written (meta)data with an invalidate even if the zeroing is done on
> the non-cacheable alias. With MTE+kasan, the page allocator may colour
> the memory (via the cacheable alias). Even if non-cacheable accesses are
> not checked by MTE, when freeing it back it could potentially confuse
> the allocator if the tags were lost. Similarly for the streaming DMA and
> this could have been worse but luckily we had c50f11c6196f ("arm64: mm:
> Don't invalidate FROM_DEVICE buffers at start of DMA transfer") for
> other reasons.
> 
> So yeah, I think a clean is needed here or clean+invalidate but not
> invalidate only due to the addition of MTE.
>
Catalin Marinas Sept. 8, 2022, 1:02 p.m. UTC | #16
On Thu, Sep 08, 2022 at 12:32:42PM +0100, Robin Murphy wrote:
> On 2022-09-08 11:32, Catalin Marinas wrote:
> > The architecture requires invalidate or clean while also stating that
> > clean+invalidate can be used instead, so I don't think there's much to
> > justify. From the mismatched attributes section:
> > 
> > 1. If the mismatched attributes for a memory location all assign the
> >     same shareability attribute to a Location that has a cacheable
> >     attribute,
> 
> This is not describing our case, though. We do have a cacheable attribute in
> at least one alias, but the shareability is *not* the same across all
> aliases, therefore at face value it clearly does not apply.

From the CPU perspective, both the cacheable and non-cacheable aliases
have the same inner shareable attribute. The device accessing the buffer
presumably should use the same shareability attributes. In the worst
case it's outer shareable but on most implementations the inner and
outer shareability domains are the same.
Catalin Marinas Sept. 8, 2022, 1:27 p.m. UTC | #17
On Thu, Sep 08, 2022 at 02:02:13PM +0100, Catalin Marinas wrote:
> On Thu, Sep 08, 2022 at 12:32:42PM +0100, Robin Murphy wrote:
> > On 2022-09-08 11:32, Catalin Marinas wrote:
> > > The architecture requires invalidate or clean while also stating that
> > > clean+invalidate can be used instead, so I don't think there's much to
> > > justify. From the mismatched attributes section:
> > > 
> > > 1. If the mismatched attributes for a memory location all assign the
> > >     same shareability attribute to a Location that has a cacheable
> > >     attribute,
> > 
> > This is not describing our case, though. We do have a cacheable attribute in
> > at least one alias, but the shareability is *not* the same across all
> > aliases, therefore at face value it clearly does not apply.
> 
> From the CPU perspective, both the cacheable and non-cacheable aliases
> have the same inner shareable attribute. The device accessing the buffer
> presumably should use the same shareability attributes. In the worst
> case it's outer shareable but on most implementations the inner and
> outer shareability domains are the same.

Ah, so you are referring to this paragraph:

  The shareability field is only relevant if the memory is a Normal
  Cacheable memory type. All Device and Normal Non-cacheable memory
  regions are always treated as Outer Shareable, regardless of the
  translation table shareability attributes.

I think so far we've ignored this case and I'm not aware of hardware that
would break coherency here. There's a theoretical risk that some system
level cache only resides in the outer shareable domain and any cache
maintenance on the inner shareable alias does not affect it. IIRC we
have another case where the architecture requires the inner and outer
shareable domains to be the same (a fairly recent update w.r.t. FWB). We
could ask the architects to clarify the mismatched aliases to include
the outer vs inner case here.

But if you want the same attributes, we can change the Linux linear
mapping to outer shareable.
Will Deacon Sept. 8, 2022, 1:32 p.m. UTC | #18
On Thu, Sep 08, 2022 at 02:27:19PM +0100, Catalin Marinas wrote:
> On Thu, Sep 08, 2022 at 02:02:13PM +0100, Catalin Marinas wrote:
> > On Thu, Sep 08, 2022 at 12:32:42PM +0100, Robin Murphy wrote:
> > > On 2022-09-08 11:32, Catalin Marinas wrote:
> > > > The architecture requires invalidate or clean while also stating that
> > > > clean+invalidate can be used instead, so I don't think there's much to
> > > > justify. From the mismatched attributes section:
> > > > 
> > > > 1. If the mismatched attributes for a memory location all assign the
> > > >     same shareability attribute to a Location that has a cacheable
> > > >     attribute,
> > > 
> > > This is not describing our case, though. We do have a cacheable attribute in
> > > at least one alias, but the shareability is *not* the same across all
> > > aliases, therefore at face value it clearly does not apply.
> > 
> > From the CPU perspective, both the cacheable and non-cacheable aliases
> > have the same inner shareable attribute. The device accessing the buffer
> > presumably should use the same shareability attributes. In the worst
> > case it's outer shareable but on most implementations the inner and
> > outer shareability domains are the same.
> 
> Ah, so you are referring to this paragraph:
> 
>   The shareability field is only relevant if the memory is a Normal
>   Cacheable memory type. All Device and Normal Non-cacheable memory
>   regions are always treated as Outer Shareable, regardless of the
>   translation table shareability attributes.

FWIW, I've never understood what this paragraph means for the shareability
of something like an inner-cacheable, outer-non-cacheable mapping.

*shrug*

Will
Catalin Marinas Sept. 8, 2022, 3:49 p.m. UTC | #19
On Thu, Sep 08, 2022 at 11:32:15AM +0100, Catalin Marinas wrote:
> On Wed, Sep 07, 2022 at 05:25:43PM +0100, Will Deacon wrote:
> > On Wed, Sep 07, 2022 at 10:27:45AM +0100, Robin Murphy wrote:
> > > It seems like we don't stand to gain much by removing the invalidation -
> > > since the overhead will still be in the clean - other than the potential for
> > > a slightly increased chance of rare and hard-to-debug memory corruption :/
> > 
> > I just find it odd that we rely on the CPU not hitting the cacheable alias
> > in other places, yet we're using an invalidation for this path. It's
> > inconsistent and difficult to explain to people. As I said, I'm happy to
> > add a comment to the existing code instead of the change here, but I don't
> > know what to say other than something like:
> > 
> >   /*
> >    * The architecture says we only need a clean here, but invalidate as
> >    * well just in case.
> >    */
[...]
> So yeah, I think a clean is needed here or clean+invalidate but not
> invalidate only due to the addition of MTE.

OK, I think invalidate works as well. The ARM ARM has a note on the DC
IVAC instruction:

  When FEAT_MTE is implemented, this instruction might invalidate
  Allocation Tags from caches. When it invalidates Allocation Tags from
  caches, it also cleans them.

(hopefully it cleans them first before invalidating)
Catalin Marinas Sept. 22, 2022, 8:02 p.m. UTC | #20
On Tue, 23 Aug 2022 13:21:11 +0100, Will Deacon wrote:
> arch_dma_prep_coherent() is called when preparing a non-cacheable region
> for a consistent DMA buffer allocation. Since the buffer pages may
> previously have been written via a cacheable mapping and consequently
> allocated as dirty cachelines, the purpose of this function is to remove
> these dirty lines from the cache, writing them back so that the
> non-coherent device is able to see them.
> 
> [...]

Applied to arm64 (for-next/misc), thanks!

[1/1] arm64: dma: Drop cache invalidation from arch_dma_prep_coherent()
      https://git.kernel.org/arm64/c/c44094eee32f

Patch

diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 599cf81f5685..83a512a6ff0d 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -36,7 +36,7 @@  void arch_dma_prep_coherent(struct page *page, size_t size)
 {
 	unsigned long start = (unsigned long)page_address(page);
 
-	dcache_clean_inval_poc(start, start + size);
+	dcache_clean_poc(start, start + size);
 }
 
 #ifdef CONFIG_IOMMU_DMA