Message ID | 20220823122111.17439-1-will@kernel.org (mailing list archive)
---|---
State | New, archived
Series | arm64: dma: Drop cache invalidation from arch_dma_prep_coherent()
On Tue, 23 Aug 2022 at 14:21, Will Deacon <will@kernel.org> wrote:
>
> arch_dma_prep_coherent() is called when preparing a non-cacheable region
> for a consistent DMA buffer allocation. Since the buffer pages may
> previously have been written via a cacheable mapping and consequently
> allocated as dirty cachelines, the purpose of this function is to remove
> these dirty lines from the cache, writing them back so that the
> non-coherent device is able to see them.
>
> On arm64, this operation can be achieved with a clean to the point of
> coherency; a subsequent invalidation is not required and serves little
> purpose in the presence of a cacheable alias (e.g. the linear map),
> since clean lines can be speculatively fetched back into the cache after
> the invalidation operation has completed.
>
> Relax the cache maintenance in arch_dma_prep_coherent() so that only a
> clean, and not a clean-and-invalidate operation is performed.
>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>
> I'm slightly wary about this change as other architectures seem to do
> clean+invalidate here, but I'd like to hear what others think in any
> case.
>

Assuming that the architecture never permits non-cacheable accesses to
hit in the caches [i.e., if the behavior the ARM ARM describes in
'Behavior of caches at reset' is the only exception], then I agree that
invalidation should make no difference. However, given the fact that we
are dealing with mismatched attribute mappings here, retaining the
invalidate seems like the safer bet. (IIRC, the justification for
permitting the linear map's mismatched attributes is that we never
access the region via its linear address, and so we are unlikely to be
affected by any conflicts. Purging cachelines that cover this region
seems like a sensible thing to do in that respect.)

Why are you proposing to change this?

>  arch/arm64/mm/dma-mapping.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> index 599cf81f5685..83a512a6ff0d 100644
> --- a/arch/arm64/mm/dma-mapping.c
> +++ b/arch/arm64/mm/dma-mapping.c
> @@ -36,7 +36,7 @@ void arch_dma_prep_coherent(struct page *page, size_t size)
>  {
>  	unsigned long start = (unsigned long)page_address(page);
>
> -	dcache_clean_inval_poc(start, start + size);
> +	dcache_clean_poc(start, start + size);
>  }
>
>  #ifdef CONFIG_IOMMU_DMA
> --
> 2.37.1.595.g718a3a8f04-goog
>
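For reference, the distinction between the two helpers named in the diff comes down to DC CVAC (clean to the Point of Coherency) versus DC CIVAC (clean and invalidate to the Point of Coherency). A minimal sketch of the semantics follows, assuming a fixed 64-byte line size for illustration; the real dcache_clean_poc()/dcache_clean_inval_poc() are assembly loops in arch/arm64/mm/cache.S that derive the line size from CTR_EL0.

```c
/*
 * Illustrative sketch only, not the kernel implementation: the real
 * helpers are assembly in arch/arm64/mm/cache.S and read the cache
 * line size from CTR_EL0; 64 bytes is assumed here.
 */
#define LINE_SZ	64UL

static inline void clean_poc_sketch(unsigned long start, unsigned long end)
{
	unsigned long addr;

	/* DC CVAC: write back each line to the Point of Coherency. */
	for (addr = start & ~(LINE_SZ - 1); addr < end; addr += LINE_SZ)
		asm volatile("dc cvac, %0" :: "r" (addr) : "memory");
	asm volatile("dsb sy" ::: "memory");
}

static inline void clean_inval_poc_sketch(unsigned long start, unsigned long end)
{
	unsigned long addr;

	/* DC CIVAC: write back each line and also invalidate it. */
	for (addr = start & ~(LINE_SZ - 1); addr < end; addr += LINE_SZ)
		asm volatile("dc civac, %0" :: "r" (addr) : "memory");
	asm volatile("dsb sy" ::: "memory");
}
```

The patch's argument is that the extra invalidate step buys nothing here, since speculative fetches through the cacheable linear map can repopulate the lines immediately afterwards.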
Hi Ard, Cheers for having a look. On Wed, Aug 24, 2022 at 11:58:46AM +0200, Ard Biesheuvel wrote: > On Tue, 23 Aug 2022 at 14:21, Will Deacon <will@kernel.org> wrote: > > > > arch_dma_prep_coherent() is called when preparing a non-cacheable region > > for a consistent DMA buffer allocation. Since the buffer pages may > > previously have been written via a cacheable mapping and consequently > > allocated as dirty cachelines, the purpose of this function is to remove > > these dirty lines from the cache, writing them back so that the > > non-coherent device is able to see them. > > > > On arm64, this operation can be achieved with a clean to the point of > > coherency; a subsequent invalidation is not required and serves little > > purpose in the presence of a cacheable alias (e.g. the linear map), > > since clean lines can be speculatively fetched back into the cache after > > the invalidation operation has completed. > > > > Relax the cache maintenance in arch_dma_prep_coherent() so that only a > > clean, and not a clean-and-invalidate operation is performed. > > > > Cc: Catalin Marinas <catalin.marinas@arm.com> > > Cc: Mark Rutland <mark.rutland@arm.com> > > Cc: Robin Murphy <robin.murphy@arm.com> > > Cc: Christoph Hellwig <hch@lst.de> > > Cc: Ard Biesheuvel <ardb@kernel.org> > > Signed-off-by: Will Deacon <will@kernel.org> > > --- > > > > I'm slightly wary about this change as other architectures seem to do > > clean+invalidate here, but I'd like to hear what others think in any > > case. > > > > Assuming that the architecture never permits non-cacheable accesses to > hit in the caches [i.e., if the behavior the ARM ARM describes in > 'Behavior of caches at reset' is the only exception], As long as the allocated lines are clean, then the non-cacheable access won't hit. > then I agree that invalidation should make no difference. However, given > the fact that we are dealing with mismatched attribute mappings here, > retaining the invalidate seems like the safer bet. (IIRC, the > justification for permitting the linear map's mismatched attributes is > that we never access the region via its linear address, and so we are > unlikely to be affected by any conflicts. Purging cachelines that cover > this region seems like a sensible thing to do in that respect.) It's certainly harmless to keep the invalidation, but I also don't think it's serving any real purpose either given that the clean lines can be pulled straight back in by speculative accesses through the linear map. > Why are you proposing to change this? Some hardware folks were asking me why we had the invalidation here and then I realised that it's not necessary. An alternative to removing it would be to add a comment saying it's not needed, but it's a bit weird writing that! Will
On Wed, 24 Aug 2022 at 13:23, Will Deacon <will@kernel.org> wrote: > > Hi Ard, > > Cheers for having a look. > > On Wed, Aug 24, 2022 at 11:58:46AM +0200, Ard Biesheuvel wrote: > > On Tue, 23 Aug 2022 at 14:21, Will Deacon <will@kernel.org> wrote: > > > > > > arch_dma_prep_coherent() is called when preparing a non-cacheable region > > > for a consistent DMA buffer allocation. Since the buffer pages may > > > previously have been written via a cacheable mapping and consequently > > > allocated as dirty cachelines, the purpose of this function is to remove > > > these dirty lines from the cache, writing them back so that the > > > non-coherent device is able to see them. > > > > > > On arm64, this operation can be achieved with a clean to the point of > > > coherency; a subsequent invalidation is not required and serves little > > > purpose in the presence of a cacheable alias (e.g. the linear map), > > > since clean lines can be speculatively fetched back into the cache after > > > the invalidation operation has completed. > > > > > > Relax the cache maintenance in arch_dma_prep_coherent() so that only a > > > clean, and not a clean-and-invalidate operation is performed. > > > > > > Cc: Catalin Marinas <catalin.marinas@arm.com> > > > Cc: Mark Rutland <mark.rutland@arm.com> > > > Cc: Robin Murphy <robin.murphy@arm.com> > > > Cc: Christoph Hellwig <hch@lst.de> > > > Cc: Ard Biesheuvel <ardb@kernel.org> > > > Signed-off-by: Will Deacon <will@kernel.org> > > > --- > > > > > > I'm slightly wary about this change as other architectures seem to do > > > clean+invalidate here, but I'd like to hear what others think in any > > > case. > > > > > > > Assuming that the architecture never permits non-cacheable accesses to > > hit in the caches [i.e., if the behavior the ARM ARM describes in > > 'Behavior of caches at reset' is the only exception], > > As long as the allocated lines are clean, then the non-cacheable access > won't hit. > If this is guaranteed to remain true even in the presence of mismatched attribute mappings of the same physical region, then we should be fine, I suppose. But if not, shouldn't invalidating the lines substantially reduce the likelihood of hitting this issue? And even if they /could/ be speculated back in, how likely is that to happen if the linear alias VA is not exposed anywhere? > > then I agree that invalidation should make no difference. However, given > > the fact that we are dealing with mismatched attribute mappings here, > > retaining the invalidate seems like the safer bet. (IIRC, the > > justification for permitting the linear map's mismatched attributes is > > that we never access the region via its linear address, and so we are > > unlikely to be affected by any conflicts. Purging cachelines that cover > > this region seems like a sensible thing to do in that respect.) > > It's certainly harmless to keep the invalidation, but I also don't think > it's serving any real purpose either given that the clean lines can be > pulled straight back in by speculative accesses through the linear map. > > > Why are you proposing to change this? > > Some hardware folks were asking me why we had the invalidation here and > then I realised that it's not necessary. An alternative to removing it > would be to add a comment saying it's not needed, but it's a bit weird > writing that! > I don't disagree with that. If we weren't violating the ARM ARM by having two live mappings of the same physical region with mismatched attributes, I wouldn't be as suspicious.
On Tue, Aug 23, 2022 at 01:21:11PM +0100, Will Deacon wrote: > arch_dma_prep_coherent() is called when preparing a non-cacheable region > for a consistent DMA buffer allocation. Since the buffer pages may > previously have been written via a cacheable mapping and consequently > allocated as dirty cachelines, the purpose of this function is to remove > these dirty lines from the cache, writing them back so that the > non-coherent device is able to see them. > > On arm64, this operation can be achieved with a clean to the point of > coherency; a subsequent invalidation is not required and serves little > purpose in the presence of a cacheable alias (e.g. the linear map), > since clean lines can be speculatively fetched back into the cache after > the invalidation operation has completed. > > Relax the cache maintenance in arch_dma_prep_coherent() so that only a > clean, and not a clean-and-invalidate operation is performed. > > Cc: Catalin Marinas <catalin.marinas@arm.com> > Cc: Mark Rutland <mark.rutland@arm.com> > Cc: Robin Murphy <robin.murphy@arm.com> > Cc: Christoph Hellwig <hch@lst.de> > Cc: Ard Biesheuvel <ardb@kernel.org> > Signed-off-by: Will Deacon <will@kernel.org> > --- > > I'm slightly wary about this change as other architectures seem to do > clean+invalidate here, but I'd like to hear what others think in any > case. Given that we still have the cacheable alias around, I don't see much point in the invalidation. So: Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> (I was wondering why not just invalidate without clean but it could be that the allocated memory was zeroed and we want that to make it to the PoC)
On Tue, Aug 23, 2022 at 01:21:11PM +0100, Will Deacon wrote: > arch_dma_prep_coherent() is called when preparing a non-cacheable region > for a consistent DMA buffer allocation. Since the buffer pages may > previously have been written via a cacheable mapping and consequently > allocated as dirty cachelines, the purpose of this function is to remove > these dirty lines from the cache, writing them back so that the > non-coherent device is able to see them. Yes. > I'm slightly wary about this change as other architectures seem to do > clean+invalidate here, but I'd like to hear what others think in any > case. If arm64 is fine with having clean but present cachelines when creating an uncached mapping for a cache line, the invalidate is not required. But isn't it better for the cache if these by definition useless cachelines get evicted? My biggest concern here is that we're now moving from consolidating these semantics in all the different architectures to different ones, making a centralization of the policies even harder.
On Wed, Aug 24, 2022 at 11:00:02PM +0100, Catalin Marinas wrote:
> (I was wondering why not just invalidate without clean but it could be
> that the allocated memory was zeroed and we want that to make it to the
> PoC)

The memory is zeroed after arch_dma_prep_coherent. So a pure invalidate
seems like the right thing to do here.
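For context, the ordering Christoph refers to looks roughly like the sketch below, loosely modelled on the dma-direct allocation path; the helpers marked as hypothetical are illustrative stand-ins, not the exact kernel functions.

```c
#include <stddef.h>
#include <string.h>

/*
 * Simplified sketch of the coherent-allocation ordering under
 * discussion (loosely modelled on kernel/dma/direct.c; the two
 * helpers below are hypothetical stand-ins).
 */
void *dma_coherent_alloc_sketch(struct device *dev, struct page *page,
				size_t size)
{
	void *ret;

	/* 1. Write back any dirty lines left by earlier cacheable use. */
	arch_dma_prep_coherent(page, size);

	/* 2. Create the non-cacheable alias of the buffer. */
	ret = remap_noncacheable(page, size);	/* hypothetical helper */

	/* 3. Zero via that alias, i.e. after the cache maintenance. */
	memset(ret, 0, size);

	return ret;
}
```

Because step 3 goes through the non-cacheable alias, the zeroing cannot re-dirty the caches, which is the premise behind the invalidate-only suggestion.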
On 2022-09-07 10:03, Christoph Hellwig wrote: > On Tue, Aug 23, 2022 at 01:21:11PM +0100, Will Deacon wrote: >> arch_dma_prep_coherent() is called when preparing a non-cacheable region >> for a consistent DMA buffer allocation. Since the buffer pages may >> previously have been written via a cacheable mapping and consequently >> allocated as dirty cachelines, the purpose of this function is to remove >> these dirty lines from the cache, writing them back so that the >> non-coherent device is able to see them. > > Yes. > >> I'm slightly wary about this change as other architectures seem to do >> clean+invalidate here, but I'd like to hear what others think in any >> case. > > If arm64 is fine with having clean but present cachelines when creating > an uncached mapping for a cache line, the invalidate is not required. > > But isn't it better for the cache if these by definition useless > cachelines get evicted? > > My biggest concern here is that we're now moving from consolidating > these semantics in all the different architectures to different ones, > making a centralization of the policies even harder. FWIW I agree with Ard in not being entirely confident with this change. The impression I had (which may be wrong) was that the architecture never actually ruled out unexpected cache hits in the case of mismatched attributes, it just quietly stopped mentioning it at all. And even if the architecture did rule them out, how confident are we about errata that might still allow them to happen? It seems like we don't stand to gain much by removing the invalidation - since the overhead will still be in the clean - other than the potential for a slightly increased chance of rare and hard-to-debug memory corruption :/ Cheers, Robin. (who's spent the last few months elbow-deep in a hideous CPU cache erratum...)
On Wed, Sep 07, 2022 at 11:04:34AM +0200, Christoph Hellwig wrote:
> On Wed, Aug 24, 2022 at 11:00:02PM +0100, Catalin Marinas wrote:
> > (I was wondering why not just invalidate without clean but it could be
> > that the allocated memory was zeroed and we want that to make it to the
> > PoC)
>
> The memory is zeroed after arch_dma_prep_coherent. So a pure invalidate
> seems like the right thing to do here.

My thoughts on Christoph's comment... That seems backwards to me.

If we allocate memory, clean (and maybe invalidate) the cache, and then
zero the memory _after_ the clean, then couldn't we be creating dirty
cache lines? There are caches out there that are write-allocate, which
means that by writing zeros to memory, you drag cache lines back into
the cache. These dirty cache lines could then be written back at some
random time later, corrupting the data that has been DMA'd to the
buffer or placed there via the uncached mapping (if the architecture
doesn't hit those cache lines.)

It has to be:
- allocate
- zero
- deal with the cache

not:
- allocate
- deal with the cache
- zero

Moreover, if one does:
- allocate
- invalidate
- zero
- clean+(invalidate?)

then aren't we making the zero operation potentially more expensive than
if we had cache lines already there - I'm thinking there about a
no-write-allocate cache here. Those cache lines could have been used to
buffer the zeroing writes and then written out later. Depending on the
architecture, having cached lines may mean that write merging happens,
meaning writebacks to memory occur in larger chunks.
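As a sketch of the two orderings Russell contrasts (assuming zeroing through a *cacheable* mapping; all helper names here are hypothetical stand-ins, not kernel APIs):

```c
#include <stddef.h>

/* Hypothetical helpers for illustration only. */
void *allocate(size_t size);
void zero(void *buf, size_t size);
void clean_dcache(void *buf, size_t size);

static void *good_order(size_t size)
{
	void *buf = allocate(size);

	zero(buf, size);		/* may allocate dirty lines... */
	clean_dcache(buf, size);	/* ...which are then written back */
	return buf;
}

static void *bad_order(size_t size)
{
	void *buf = allocate(size);

	clean_dcache(buf, size);	/* cache is clean at this point */
	zero(buf, size);		/* a write-allocate cache re-dirties
					 * lines; a later writeback could
					 * corrupt data the device has since
					 * DMA'd into the buffer */
	return buf;
}
```

As the next message points out, the concern evaporates if the zeroing goes through the uncached mapping instead.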
On Wed, Sep 07, 2022 at 03:10:35PM +0100, Russell King (Oracle) wrote:
> If we allocate memory, clean (and maybe invalidate) the cache, and then
> zero the memory _after_ the clean, then couldn't we be creating dirty
> cache lines? There are caches out there that are write-allocate, which
> means that by writing zeros to memory, you drag cache lines back into
> the cache.

When writing through an uncached mapping?
On Wed, Sep 07, 2022 at 04:14:41PM +0200, Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 03:10:35PM +0100, Russell King (Oracle) wrote:
> > If we allocate memory, clean (and maybe invalidate) the cache, and then
> > zero the memory _after_ the clean, then couldn't we be creating dirty
> > cache lines? There are caches out there that are write-allocate, which
> > means that by writing zeros to memory, you drag cache lines back into
> > the cache.
>
> When writing through an uncached mapping?

Eww. That's going to be slow.
On Wed, Sep 07, 2022 at 10:27:45AM +0100, Robin Murphy wrote:
> On 2022-09-07 10:03, Christoph Hellwig wrote:
> > On Tue, Aug 23, 2022 at 01:21:11PM +0100, Will Deacon wrote:
> > > arch_dma_prep_coherent() is called when preparing a non-cacheable region
> > > for a consistent DMA buffer allocation. Since the buffer pages may
> > > previously have been written via a cacheable mapping and consequently
> > > allocated as dirty cachelines, the purpose of this function is to remove
> > > these dirty lines from the cache, writing them back so that the
> > > non-coherent device is able to see them.
> >
> > Yes.
> >
> > > I'm slightly wary about this change as other architectures seem to do
> > > clean+invalidate here, but I'd like to hear what others think in any
> > > case.
> >
> > If arm64 is fine with having clean but present cachelines when creating
> > an uncached mapping for a cache line, the invalidate is not required.
> >
> > But isn't it better for the cache if these by definition useless
> > cachelines get evicted?
> >
> > My biggest concern here is that we're now moving from consolidating
> > these semantics in all the different architectures to different ones,
> > making a centralization of the policies even harder.
>
> FWIW I agree with Ard in not being entirely confident with this change.
> The impression I had (which may be wrong) was that the architecture
> never actually ruled out unexpected cache hits in the case of mismatched
> attributes, it just quietly stopped mentioning it at all. And even if
> the architecture did rule them out, how confident are we about errata
> that might still allow them to happen?

The architecture does rule this out as long as the cacheable alias is not
dirty (i.e. has not been written to). If this was not the case, then our
non-cacheable DMA buffers would be unreliable because speculative fills
from the linear map could "mask" the actual DMA data. The presence of the
linear map also means that the lines could immediately be refetched by
speculation, rendering the invalidation pointless in the current code.

Regardless of this patch, a CPU erratum violating the architectural
guarantee would be broken. It is the presence of the linear alias that
would need to be resolved.

The architecture text is buried in the infamous "B2.8 Mismatched memory
attributes" section of the Arm ARM, but it's also worth talking to Mark R,
Catalin and the handful of folks in ATG who grok this to get a summary
which is easier to digest.

> It seems like we don't stand to gain much by removing the invalidation -
> since the overhead will still be in the clean - other than the potential
> for a slightly increased chance of rare and hard-to-debug memory
> corruption :/

I just find it odd that we rely on the CPU not hitting the cacheable alias
in other places, yet we're using an invalidation for this path. It's
inconsistent and difficult to explain to people. As I said, I'm happy to
add a comment to the existing code instead of the change here, but I don't
know what to say other than something like:

	/*
	 * The architecture says we only need a clean here, but invalidate as
	 * well just in case.
	 */

Which feels deeply unsatisfactory.

Will
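For concreteness, the comment-instead-of-change alternative Will describes would presumably look something like the sketch below; the function body is the pre-patch arch/arm64/mm/dma-mapping.c code from the diff, with the comment wording taken from Will's message.

```c
/*
 * Sketch of the rejected alternative: keep the clean+invalidate and
 * merely document it, rather than relaxing it to a clean.
 */
void arch_dma_prep_coherent(struct page *page, size_t size)
{
	unsigned long start = (unsigned long)page_address(page);

	/*
	 * The architecture says we only need a clean here, but invalidate as
	 * well just in case.
	 */
	dcache_clean_inval_poc(start, start + size);
}
```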
On Wed, Sep 07, 2022 at 11:03:05AM +0200, Christoph Hellwig wrote: > On Tue, Aug 23, 2022 at 01:21:11PM +0100, Will Deacon wrote: > > arch_dma_prep_coherent() is called when preparing a non-cacheable region > > for a consistent DMA buffer allocation. Since the buffer pages may > > previously have been written via a cacheable mapping and consequently > > allocated as dirty cachelines, the purpose of this function is to remove > > these dirty lines from the cache, writing them back so that the > > non-coherent device is able to see them. > > Yes. > > > I'm slightly wary about this change as other architectures seem to do > > clean+invalidate here, but I'd like to hear what others think in any > > case. > > If arm64 is fine with having clean but present cachelines when creating > an uncached mapping for a cache line, the invalidate is not required. > > But isn't it better for the cache if these by definition useless > cachelines get evicted? > > My biggest concern here is that we're now moving from consolidating > these semantics in all the different architectures to different ones, > making a centralization of the policies even harder. I definitely wouldn't object to a pointless invalidation if it helped with consolidation, and we could add a comment to the arch code to that effect. At the moment, however, I can't imagine what this consolidation would look like -- at the end of the day we're going to need to dip into the arch for the cache op. Do you have a branch, or is it a longer-term goal? Will
On 2022-09-07 17:25, Will Deacon wrote: > On Wed, Sep 07, 2022 at 10:27:45AM +0100, Robin Murphy wrote: >> On 2022-09-07 10:03, Christoph Hellwig wrote: >>> On Tue, Aug 23, 2022 at 01:21:11PM +0100, Will Deacon wrote: >>>> arch_dma_prep_coherent() is called when preparing a non-cacheable region >>>> for a consistent DMA buffer allocation. Since the buffer pages may >>>> previously have been written via a cacheable mapping and consequently >>>> allocated as dirty cachelines, the purpose of this function is to remove >>>> these dirty lines from the cache, writing them back so that the >>>> non-coherent device is able to see them. >>> >>> Yes. >>> >>>> I'm slightly wary about this change as other architectures seem to do >>>> clean+invalidate here, but I'd like to hear what others think in any >>>> case. >>> >>> If arm64 is fine with having clean but present cachelines when creating >>> an uncached mapping for a cache line, the invalidate is not required. >>> >>> But isn't it better for the cache if these by definition useless >>> cachelines get evicted? >>> >>> My biggest concern here is that we're now moving from consolidating >>> these semantics in all the different architectures to different ones, >>> making a centralization of the policies even harder. >> >> FWIW I agree with Ard in not being entirely confident with this change. The >> impression I had (which may be wrong) was that the architecture never >> actually ruled out unexpected cache hits in the case of mismatched >> attributes, it just quietly stopped mentioning it at all. And even if the >> architecture did rule them out, how confident are we about errata that might >> still allow them to happen? > > The architecture does rule this out as long as the cacheable alias is not > dirty (i.e. has not been written to). If this was not the case, then our > non-cacheable DMA buffers would be unreliable because speculative fills > from the linear map could "mask" the actual DMA data. The presence of the > linear map also means that the lines could immediately be refetched by > speculation, rendering the invalidation pointless in the current code. > > Regardless of this patch, a CPU erratum violating the architectural > guarantee would be broken. It is the presence of the linear alias that > would need to be resolved. Oh have I got a treat for you... but that's not relevant here ;) > The architecture text is buried in the infamous "B2.8 Mismatched memory > attributes" section of the Arm ARM, but it's also worth talking to Mark R, > Catalin and the handful of folks in ATG who grok this to get a summary > which is easier to digest. I'm looking at DDI0487I.a and nothing stands out as different from previous readings. In particular, "A read of the memory Location by one agent might not return the value most recently written to that memory Location by the same agent" seems entirely consistent with non-cacheable reads being allowed to unexpectedly hit clean (but stale) cache entries even if non-cacheable writes don't. And frankly I don't see how "There might be a loss of coherency when multiple agents attempt to access a memory Location." helps rule *anything* out. The bit about writeback does seem to imply that non-cacheable writes can't dirty a cache entry, so I can at least accept that unexpected hits for writes are off the table. 
Furthermore, the bit about mismatched shareability - which I believe applies here since our cacheable mappings are ISh while non-cacheable is inherently OSh - explicitly mentions: "Software running on a PE cleans and invalidates a Location from cache before and after each read or write to that Location by that PE." If the architecture saying to clean and invalidate is supposed to be a guarantee that invalidation is unnecessary, I give up. >> It seems like we don't stand to gain much by removing the invalidation - >> since the overhead will still be in the clean - other than the potential for >> a slightly increased chance of rare and hard-to-debug memory corruption :/ > > I just find it odd that we rely on the CPU not hitting the cacheable alias > in other places, yet we're using an invalidation for this path. It's > inconsistent and difficult to explain to people. As I said, I'm happy to > add a comment to the existing code instead of the change here, but I don't > know what to say other than something like: > > /* > * The architecture says we only need a clean here, but invalidate as > * well just in case. > */ > > Which feels deeply unsatisfactory. Well, the chances of the alias being speculated back into caches by access to nearby linear map pages still seem a lot smaller than the chances of caches hanging on to lines we've just finished accessing with cacheable attributes. Christoph makes a good point that at worst it's a concise hint for caches with LRU replacement policy that might otherwise think we could be coming back for our freshly-written clean zeros. Cheers, Robin.
On Wed, Sep 07, 2022 at 05:25:43PM +0100, Will Deacon wrote: > On Wed, Sep 07, 2022 at 10:27:45AM +0100, Robin Murphy wrote: > > On 2022-09-07 10:03, Christoph Hellwig wrote: > > > On Tue, Aug 23, 2022 at 01:21:11PM +0100, Will Deacon wrote: > > > > arch_dma_prep_coherent() is called when preparing a non-cacheable region > > > > for a consistent DMA buffer allocation. Since the buffer pages may > > > > previously have been written via a cacheable mapping and consequently > > > > allocated as dirty cachelines, the purpose of this function is to remove > > > > these dirty lines from the cache, writing them back so that the > > > > non-coherent device is able to see them. > > > > > > Yes. > > > > > > > I'm slightly wary about this change as other architectures seem to do > > > > clean+invalidate here, but I'd like to hear what others think in any > > > > case. > > > > > > If arm64 is fine with having clean but present cachelines when creating > > > an uncached mapping for a cache line, the invalidate is not required. > > > > > > But isn't it better for the cache if these by definition useless > > > cachelines get evicted? > > > > > > My biggest concern here is that we're now moving from consolidating > > > these semantics in all the different architectures to different ones, > > > making a centralization of the policies even harder. > > > > FWIW I agree with Ard in not being entirely confident with this change. The > > impression I had (which may be wrong) was that the architecture never > > actually ruled out unexpected cache hits in the case of mismatched > > attributes, it just quietly stopped mentioning it at all. And even if the > > architecture did rule them out, how confident are we about errata that might > > still allow them to happen? For the erratum you have in mind, invalidation doesn't help either (as you know already). The architecture does rule out unexpected cache hits, though not by stating this explicitly but rather only listing the cases where one needs cache maintenance in the presence of mismatched aliases. Such unexpected cache hits may happen micro-architecturally but are not visible to software (in theory and ignoring timing side-channels). > > It seems like we don't stand to gain much by removing the invalidation - > > since the overhead will still be in the clean - other than the potential for > > a slightly increased chance of rare and hard-to-debug memory corruption :/ > > I just find it odd that we rely on the CPU not hitting the cacheable alias > in other places, yet we're using an invalidation for this path. It's > inconsistent and difficult to explain to people. As I said, I'm happy to > add a comment to the existing code instead of the change here, but I don't > know what to say other than something like: > > /* > * The architecture says we only need a clean here, but invalidate as > * well just in case. > */ The architecture requires invalidate or clean while also stating that clean+invalidate can be used instead, so I don't think there's much to justify. From the mismatched attributes section: 1. If the mismatched attributes for a memory location all assign the same shareability attribute to a Location that has a cacheable attribute, any loss of uniprocessor semantics, ordering, or coherency within a shareability domain can be avoided by use of software cache management. To do so, software must use the techniques that are required for the software management of the ordering or coherency of cacheable Locations between agents in different shareability domains. 
   This means:

   - Before writing to a cacheable Location not using the Write-Back
     attribute, software must invalidate, or clean, a Location from the
     caches if any agent might have written to the Location with the
     Write-Back attribute. This avoids the possibility of overwriting
     the Location with stale data.

   - After writing to a cacheable Location with the Write-Back
     attribute, software must clean the Location from the caches, to
     make the write visible to external memory.

   ...

When creating the non-cacheable alias for DMA, what we need is the first
bullet point above. A clean would be required if we wrote something via
the write-back mapping and we want that data to be visible (not needed
if the zeroing is done via the non-cacheable alias).

However, it just crossed my mind that we can't always drop the
CPU-written (meta)data with an invalidate even if the zeroing is done on
the non-cacheable alias. With MTE+kasan, the page allocator may colour
the memory (via the cacheable alias). Even if non-cacheable accesses are
not checked by MTE, when freeing it back it could potentially confuse
the allocator if the tags were lost. Similarly for the streaming DMA and
this could have been worse but luckily we had c50f11c6196f ("arm64: mm:
Don't invalidate FROM_DEVICE buffers at start of DMA transfer") for
other reasons.

So yeah, I think a clean is needed here or clean+invalidate but not
invalidate only due to the addition of MTE.
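Translated into the code being patched, Catalin's conclusion amounts to something like the following annotated sketch; the function is the post-patch one from the diff, while the comment text is an illustrative summary of this message, not from the kernel tree.

```c
/*
 * Sketch of the conclusion above, annotating the patched function
 * (comment wording is illustrative, not actual kernel source).
 */
void arch_dma_prep_coherent(struct page *page, size_t size)
{
	unsigned long start = (unsigned long)page_address(page);

	/*
	 * Clean to the PoC: pushes out data already written via the
	 * cacheable linear map and, with MTE+kasan, the allocation tags
	 * the page allocator may have coloured the memory with. An
	 * invalidate-only operation would risk discarding those tags.
	 */
	dcache_clean_poc(start, start + size);
}
```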
On 2022-09-08 11:32, Catalin Marinas wrote: > On Wed, Sep 07, 2022 at 05:25:43PM +0100, Will Deacon wrote: >> On Wed, Sep 07, 2022 at 10:27:45AM +0100, Robin Murphy wrote: >>> On 2022-09-07 10:03, Christoph Hellwig wrote: >>>> On Tue, Aug 23, 2022 at 01:21:11PM +0100, Will Deacon wrote: >>>>> arch_dma_prep_coherent() is called when preparing a non-cacheable region >>>>> for a consistent DMA buffer allocation. Since the buffer pages may >>>>> previously have been written via a cacheable mapping and consequently >>>>> allocated as dirty cachelines, the purpose of this function is to remove >>>>> these dirty lines from the cache, writing them back so that the >>>>> non-coherent device is able to see them. >>>> >>>> Yes. >>>> >>>>> I'm slightly wary about this change as other architectures seem to do >>>>> clean+invalidate here, but I'd like to hear what others think in any >>>>> case. >>>> >>>> If arm64 is fine with having clean but present cachelines when creating >>>> an uncached mapping for a cache line, the invalidate is not required. >>>> >>>> But isn't it better for the cache if these by definition useless >>>> cachelines get evicted? >>>> >>>> My biggest concern here is that we're now moving from consolidating >>>> these semantics in all the different architectures to different ones, >>>> making a centralization of the policies even harder. >>> >>> FWIW I agree with Ard in not being entirely confident with this change. The >>> impression I had (which may be wrong) was that the architecture never >>> actually ruled out unexpected cache hits in the case of mismatched >>> attributes, it just quietly stopped mentioning it at all. And even if the >>> architecture did rule them out, how confident are we about errata that might >>> still allow them to happen? > > For the erratum you have in mind, invalidation doesn't help either (as > you know already). > > The architecture does rule out unexpected cache hits, though not by > stating this explicitly but rather only listing the cases where one > needs cache maintenance in the presence of mismatched aliases. Such > unexpected cache hits may happen micro-architecturally but are not > visible to software (in theory and ignoring timing side-channels). > >>> It seems like we don't stand to gain much by removing the invalidation - >>> since the overhead will still be in the clean - other than the potential for >>> a slightly increased chance of rare and hard-to-debug memory corruption :/ >> >> I just find it odd that we rely on the CPU not hitting the cacheable alias >> in other places, yet we're using an invalidation for this path. It's >> inconsistent and difficult to explain to people. As I said, I'm happy to >> add a comment to the existing code instead of the change here, but I don't >> know what to say other than something like: >> >> /* >> * The architecture says we only need a clean here, but invalidate as >> * well just in case. >> */ > > The architecture requires invalidate or clean while also stating that > clean+invalidate can be used instead, so I don't think there's much to > justify. From the mismatched attributes section: > > 1. If the mismatched attributes for a memory location all assign the > same shareability attribute to a Location that has a cacheable > attribute, This is not describing our case, though. We do have a cacheable attribute in at least one alias, but the shareability is *not* the same across all aliases, therefore at face value it clearly does not apply. 
If it's intended to mean that the shareability of all aliases with
cacheable attributes is the same and the shareability of aliases without
cacheable attributes doesn't matter, then it should bloody well say
that, not something entirely contradictory :(

I'm basically just shaking my fists at the sky here, but it irks me no
end to be writing patches based on what half a dozen people (myself
included, in some cases) happen to know that an architecture is supposed
to mean, rather than what it actually says.

Cheers,
Robin.

> any loss of uniprocessor semantics, ordering, or coherency within a
> shareability domain can be avoided by use of software cache management.
> To do so, software must use the techniques that are required for the
> software management of the ordering or coherency of cacheable Locations
> between agents in different shareability domains.
>
> This means:
> - Before writing to a cacheable Location not using the Write-Back
>   attribute, software must invalidate, or clean, a Location from the
>   caches if any agent might have written to the Location with the
>   Write-Back attribute. This avoids the possibility of overwriting
>   the Location with stale data.
> - After writing to a cacheable Location with the Write-Back
>   attribute, software must clean the Location from the caches, to
>   make the write visible to external memory.
> ...
>
> When creating the non-cacheable alias for DMA, what we need is the first
> bullet point above. A clean would be required if we wrote something via
> the write-back mapping and we want that data to be visible (not needed
> if the zeroing is done via the non-cacheable alias).
>
> However, it just crossed my mind that we can't always drop the
> CPU-written (meta)data with an invalidate even if the zeroing is done on
> the non-cacheable alias. With MTE+kasan, the page allocator may colour
> the memory (via the cacheable alias). Even if non-cacheable accesses are
> not checked by MTE, when freeing it back it could potentially confuse
> the allocator if the tags were lost. Similarly for the streaming DMA and
> this could have been worse but luckily we had c50f11c6196f ("arm64: mm:
> Don't invalidate FROM_DEVICE buffers at start of DMA transfer") for
> other reasons.
>
> So yeah, I think a clean is needed here or clean+invalidate but not
> invalidate only due to the addition of MTE.
On Thu, Sep 08, 2022 at 12:32:42PM +0100, Robin Murphy wrote: > On 2022-09-08 11:32, Catalin Marinas wrote: > > The architecture requires invalidate or clean while also stating that > > clean+invalidate can be used instead, so I don't think there's much to > > justify. From the mismatched attributes section: > > > > 1. If the mismatched attributes for a memory location all assign the > > same shareability attribute to a Location that has a cacheable > > attribute, > > This is not describing our case, though. We do have a cacheable attribute in > at least one alias, but the shareability is *not* the same across all > aliases, therefore at face value it clearly does not apply. From the CPU perspective, both the cacheable and non-cacheable aliases have the same inner shareable attribute. The device accessing the buffer presumably should use the same shareability attributes. In the worst case it's outer shareable but on most implementations the inner and outer shareability domains are the same.
On Thu, Sep 08, 2022 at 02:02:13PM +0100, Catalin Marinas wrote: > On Thu, Sep 08, 2022 at 12:32:42PM +0100, Robin Murphy wrote: > > On 2022-09-08 11:32, Catalin Marinas wrote: > > > The architecture requires invalidate or clean while also stating that > > > clean+invalidate can be used instead, so I don't think there's much to > > > justify. From the mismatched attributes section: > > > > > > 1. If the mismatched attributes for a memory location all assign the > > > same shareability attribute to a Location that has a cacheable > > > attribute, > > > > This is not describing our case, though. We do have a cacheable attribute in > > at least one alias, but the shareability is *not* the same across all > > aliases, therefore at face value it clearly does not apply. > > From the CPU perspective, both the cacheable and non-cacheable aliases > have the same inner shareable attribute. The device accessing the buffer > presumably should use the same shareability attributes. In the worst > case it's outer shareable but on most implementations the inner and > outer shareability domains are the same. Ah, so you are referring to this paragraph: The shareability field is only relevant if the memory is a Normal Cacheable memory type. All Device and Normal Non-cacheable memory regions are always treated as Outer Shareable, regardless of the translation table shareability attributes. I think so far we've ignored this case and I'm not aware hardware that would break coherency here. There's a theoretical risk that some system level cache only resides in the outer shareable domain and any cache maintenance on the inner shareable alias does not affect it. IIRC we have another case where the architecture requires the inner and outer shareable domains to be the same (a fairly recent update w.r.t. FWB). We could ask the architects to clarify the mismatched aliases to include the outer vs inner case here. But if you want the same attributes, we can change the Linux linear mapping to outer shareable.
On Thu, Sep 08, 2022 at 02:27:19PM +0100, Catalin Marinas wrote: > On Thu, Sep 08, 2022 at 02:02:13PM +0100, Catalin Marinas wrote: > > On Thu, Sep 08, 2022 at 12:32:42PM +0100, Robin Murphy wrote: > > > On 2022-09-08 11:32, Catalin Marinas wrote: > > > > The architecture requires invalidate or clean while also stating that > > > > clean+invalidate can be used instead, so I don't think there's much to > > > > justify. From the mismatched attributes section: > > > > > > > > 1. If the mismatched attributes for a memory location all assign the > > > > same shareability attribute to a Location that has a cacheable > > > > attribute, > > > > > > This is not describing our case, though. We do have a cacheable attribute in > > > at least one alias, but the shareability is *not* the same across all > > > aliases, therefore at face value it clearly does not apply. > > > > From the CPU perspective, both the cacheable and non-cacheable aliases > > have the same inner shareable attribute. The device accessing the buffer > > presumably should use the same shareability attributes. In the worst > > case it's outer shareable but on most implementations the inner and > > outer shareability domains are the same. > > Ah, so you are referring to this paragraph: > > The shareability field is only relevant if the memory is a Normal > Cacheable memory type. All Device and Normal Non-cacheable memory > regions are always treated as Outer Shareable, regardless of the > translation table shareability attributes. FWIW, I've never understood what this paragraph means for the shareability of something like an inner-cacheable, outer-non-cacheable mapping. *shrug* Will
On Thu, Sep 08, 2022 at 11:32:15AM +0100, Catalin Marinas wrote: > On Wed, Sep 07, 2022 at 05:25:43PM +0100, Will Deacon wrote: > > On Wed, Sep 07, 2022 at 10:27:45AM +0100, Robin Murphy wrote: > > > It seems like we don't stand to gain much by removing the invalidation - > > > since the overhead will still be in the clean - other than the potential for > > > a slightly increased chance of rare and hard-to-debug memory corruption :/ > > > > I just find it odd that we rely on the CPU not hitting the cacheable alias > > in other places, yet we're using an invalidation for this path. It's > > inconsistent and difficult to explain to people. As I said, I'm happy to > > add a comment to the existing code instead of the change here, but I don't > > know what to say other than something like: > > > > /* > > * The architecture says we only need a clean here, but invalidate as > > * well just in case. > > */ [...] > So yeah, I think a clean is needed here or clean+invalidate but not > invalidate only due to the addition of MTE. OK, I think invalidate works as well. The ARM ARM has a note on the DC IVAC instruction: When FEAT_MTE is implemented, this instruction might invalidate Allocation Tags from caches. When it invalidates Allocation Tags from caches, it also cleans them. (hopefully it cleans them first before invalidating)
On Tue, 23 Aug 2022 13:21:11 +0100, Will Deacon wrote:
> arch_dma_prep_coherent() is called when preparing a non-cacheable region
> for a consistent DMA buffer allocation. Since the buffer pages may
> previously have been written via a cacheable mapping and consequently
> allocated as dirty cachelines, the purpose of this function is to remove
> these dirty lines from the cache, writing them back so that the
> non-coherent device is able to see them.
>
> [...]

Applied to arm64 (for-next/misc), thanks!

[1/1] arm64: dma: Drop cache invalidation from arch_dma_prep_coherent()
      https://git.kernel.org/arm64/c/c44094eee32f
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 599cf81f5685..83a512a6ff0d 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -36,7 +36,7 @@ void arch_dma_prep_coherent(struct page *page, size_t size)
 {
 	unsigned long start = (unsigned long)page_address(page);
 
-	dcache_clean_inval_poc(start, start + size);
+	dcache_clean_poc(start, start + size);
 }
 
 #ifdef CONFIG_IOMMU_DMA
arch_dma_prep_coherent() is called when preparing a non-cacheable region
for a consistent DMA buffer allocation. Since the buffer pages may
previously have been written via a cacheable mapping and consequently
allocated as dirty cachelines, the purpose of this function is to remove
these dirty lines from the cache, writing them back so that the
non-coherent device is able to see them.

On arm64, this operation can be achieved with a clean to the point of
coherency; a subsequent invalidation is not required and serves little
purpose in the presence of a cacheable alias (e.g. the linear map),
since clean lines can be speculatively fetched back into the cache after
the invalidation operation has completed.

Relax the cache maintenance in arch_dma_prep_coherent() so that only a
clean, and not a clean-and-invalidate operation is performed.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
---

I'm slightly wary about this change as other architectures seem to do
clean+invalidate here, but I'd like to hear what others think in any
case.

 arch/arm64/mm/dma-mapping.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)