
Unnecessary cache-line flush on page table updates?

Message ID 20110701101019.GA1723@e102109-lin.cambridge.arm.com (mailing list archive)
State New, archived

Commit Message

Catalin Marinas July 1, 2011, 10:10 a.m. UTC
On Fri, Jul 01, 2011 at 08:04:42AM +0100, heechul Yun wrote:
> Based on the Cortex A9 TRM, the MMU reads page table entries from the
> L1-D cache, not from memory. Then I think we do not need to flush the
> cache line in the following code, because the MMU will always see an
> up-to-date view of the page table in both UP and SMP systems.
> 
> linux/arch/arm/mm/proc-v7.S
> 
> ENTRY(cpu_v7_set_pte_ext)
> 	...
>         mcr     p15, 0, r0, c7, c10, 1          @ flush_pte from
> D-cache // why we need this in A9?
>         …
> 
> If this is a necessary one, could you please explain the reason? Thanks.

No, it's not necessary; it's only that this file is used by other
processors as well. The solution below checks the ID_MMFR3[23:20] bits
(coherent walk) and avoids flushing if the value is 1. The same could be
done for PMD entries, though that's less critical than the PTEs.

Please note that the patch is not fully tested.
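
For illustration only, the check being added is roughly equivalent to
the C sketch below; pte_needs_clean() and flush_pte() are made-up names
(the real change stays in assembly, as in the patch), while
read_cpuid_ext() and CPUID_EXT_MMFR3 are assumed to be the usual
accessors from asm/cputype.h:

#include <asm/cputype.h>
#include <asm/cacheflush.h>
#include <asm/page.h>

/* ID_MMFR3[23:20] != 0 means the hardware does coherent table walks */
static inline int pte_needs_clean(void)
{
	return ((read_cpuid_ext(CPUID_EXT_MMFR3) >> 20) & 0xf) == 0;
}

/* C equivalent of the conditional mcr in cpu_v7_set_pte_ext */
static inline void flush_pte(void *ptep)
{
	if (pte_needs_clean())
		clean_dcache_area(ptep, sizeof(pte_t));
}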

8<--------------------

From 67bd5ebdf622637f8293286146441e6292713c3d Mon Sep 17 00:00:00 2001
From: Catalin Marinas <catalin.marinas@arm.com>
Date: Fri, 1 Jul 2011 10:57:07 +0100
Subject: [PATCH] ARMv7: Do not clean the PTE if coherent page table walk is supported

This patch adds a check for the ID_MMFR3[23:20] bits (coherent walk) and
only cleans the D-cache corresponding to a PTE if coherent page table
walks are not supported.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
---
 arch/arm/mm/proc-v7.S |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

Comments

heechul Yun July 1, 2011, 9:42 p.m. UTC | #1
Great.

Removing the PTE flush seems to make a noticeable performance
difference in my test. The following are lmbench 3.0a performance
results measured on a Cortex A9 SMP platform. So far, I have not had
any problems while doing various tests.

=========
mm-patch:
=========
Pagefaults on /tmp/XXX: 3.0759 microseconds
Process fork+exit: 464.5414 microseconds
Process fork+execve: 785.4944 microseconds
Process fork+/bin/sh -c: 488.6204 microseconds

=========
original:
=========
Pagefaults on /tmp/XXX: 3.6209 microseconds
Process fork+exit: 485.5236 microseconds
Process fork+execve: 820.0613 microseconds
Process fork+/bin/sh -c: 2966.3828 microseconds

Heechul

On Fri, Jul 1, 2011 at 3:10 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Fri, Jul 01, 2011 at 08:04:42AM +0100, heechul Yun wrote:
>> Based on TRM of Cortex A9, the MMU reads page table entries from L1-D
>> cache not from memory. Then I think we do not need to flush the cache
>> line in the following code because MMU will always see up-to-date view
>> of page table in both UP and SMP systems.
>>
>> linux/arch/arm/mm/proc-v7.S
>>
>> ENTRY(cpu_v7_set_pte_ext)
>>       ...
>>         mcr     p15, 0, r0, c7, c10, 1          @ flush_pte from
>> D-cache // why we need this in A9?
>>         …
>>
>> If this is a necessary one, could you please explain the reason? Thanks.
>
> No, it's not necessary, only that this file is used by other processors
> as well. The solution below checks the ID_MMFR3[23:20] bits (coherent
> walk) and avoid flushing if the value is 1. The same could be done for
> PMD entries, though that's less critical than the PTEs.
>
> Please note that the patch is not fully tested.
>
> 8<--------------------
>
> From 67bd5ebdf622637f8293286146441e6292713c3d Mon Sep 17 00:00:00 2001
> From: Catalin Marinas <catalin.marinas@arm.com>
> Date: Fri, 1 Jul 2011 10:57:07 +0100
> Subject: [PATCH] ARMv7: Do not clean the PTE if coherent page table walk is supported
>
> This patch adds a check for the ID_MMFR3[23:20] bits (coherent walk) and
> only cleans the D-cache corresponding to a PTE if coherent page table
> walks are not supported.
>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> ---
>  arch/arm/mm/proc-v7.S |    4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/arch/arm/mm/proc-v7.S b/arch/arm/mm/proc-v7.S
> index 8013afc..fc5b36f 100644
> --- a/arch/arm/mm/proc-v7.S
> +++ b/arch/arm/mm/proc-v7.S
> @@ -166,7 +166,9 @@ ENTRY(cpu_v7_set_pte_ext)
>  ARM(  str     r3, [r0, #2048]! )
>  THUMB(        add     r0, r0, #2048 )
>  THUMB(        str     r3, [r0] )
> -       mcr     p15, 0, r0, c7, c10, 1          @ flush_pte
> +       mrc     p15, 0, r3, c0, c1, 7           @ read ID_MMFR3
> +       tst     r3, #0xf << 20                  @ check the coherent walk bits
> +       mcreq   p15, 0, r0, c7, c10, 1          @ flush_pte
>  #endif
>        mov     pc, lr
>  ENDPROC(cpu_v7_set_pte_ext)
>
> --
> Catalin
>
Catalin Marinas July 4, 2011, 9:45 a.m. UTC | #2
On Fri, Jul 01, 2011 at 10:42:00PM +0100, heechul Yun wrote:
> Great.
> 
> Removing the PTE flush seems to have a noticeable performance
> difference in my test. The followings are lmbench 3.0a performance
> result measured on a Cortex A9 SMP platform. So far, I did not have
> any problem while doing various test.
> 
> =========
> mm-patch:
> =========
> Pagefaults on /tmp/XXX: 3.0759 microseconds
> Process fork+exit: 464.5414 microseconds
> Process fork+execve: 785.4944 microseconds
> Process fork+/bin/sh -c: 488.6204 microseconds
> 
> =========
> original:
> =========
> Pagefaults on /tmp/XXX: 3.6209 microseconds
> Process fork+exit: 485.5236 microseconds
> Process fork+execve: 820.0613 microseconds
> Process fork+/bin/sh -c: 2966.3828 microseconds

Given these results, I think it's worth merging the patch. Can I add
your Tested-by?

I think there can be a few other optimisations in the TLB area but it
needs some digging.

Thanks.
heechul Yun July 4, 2011, 2:59 p.m. UTC | #3
On Mon, Jul 4, 2011 at 2:45 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Fri, Jul 01, 2011 at 10:42:00PM +0100, heechul Yun wrote:
>> Great.
>>
>> Removing the PTE flush seems to have a noticeable performance
>> difference in my test. The followings are lmbench 3.0a performance
>> result measured on a Cortex A9 SMP platform. So far, I did not have
>> any problem while doing various test.
>>
>> =========
>> mm-patch:
>> =========
>> Pagefaults on /tmp/XXX: 3.0759 microseconds
>> Process fork+exit: 464.5414 microseconds
>> Process fork+execve: 785.4944 microseconds
>> Process fork+/bin/sh -c: 488.6204 microseconds
>>
>> =========
>> original:
>> =========
>> Pagefaults on /tmp/XXX: 3.6209 microseconds
>> Process fork+exit: 485.5236 microseconds
>> Process fork+execve: 820.0613 microseconds
>> Process fork+/bin/sh -c: 2966.3828 microseconds
>
> Given these results, I think it's worth merging the patch. Can I add
> your Tested-by?

Yes you can.
I think this patch will be valuable for many applications (e.g.,
server applications).

Heechul

>
> I think there can be a few other optimisations in the TLB area but it
> needs some digging.
>
> Thanks.
>
> --
> Catalin
>
Russell King - ARM Linux July 4, 2011, 7:58 p.m. UTC | #4
On Mon, Jul 04, 2011 at 04:58:35PM +0100, Catalin Marinas wrote:
> On Mon, Jul 04, 2011 at 12:13:38PM +0100, Russell King - ARM Linux wrote:
> > If we are tearing down a mapping, then we don't need any barrier for
> > individual PTE entries, but we do if we remove a higher-level (PMD/PUD)
> > table.
> 
> It depends on the use, I don't think we can generally assume that any
> tearing down is fine since there are cases where we need a guaranteed
> fault (zap_vma_ptes may only tear down PTE entries but the driver using
> it expects a fault if something else tries to access that location). A
> DSB would be enough.

It depends whether we're unmapping due to munmap or for kernel internal
stuff, or whether we're doing it as a result of disposing of the
page tables.  In the latter case, avoiding the DSBs would be a good
trick to pull.

> > That "However" clause is an interesting one though - it seems to imply
> > that no barrier is required when we zero out a new page table, before
> > we link the new page table into the higher order table.  The memory we
> > allocated for the page table doesn't become a page table until it is
> > linked into the page table tree. 
> 
> That's my understanding of that clause as well.
> 
> > It also raises the question about how
> > the system knows that a particular store is to something that's a page
> > table and something that isn't...  Given that normal memory accesses are
> > unordered, I think this paragraph is misleading and wrong.
> 
> Reordering of accesses can happen because of load speculation, store
> reordering in the write buffer or delays on the bus outside the CPU. The
> ARM processors do not issue stores speculatively. When a memory access
> is issued, it checks the TLB and may perform a PTW (otherwise the
> external bus wouldn't know the address). For explicit accesses, if the
> PTW fails, it raises a fault (and we also need a precise abort).

I think you missed my point.  The order in which stores to normal memory
appear is not determinable.  In the case of writeback caches, it depends
on the cache replacement algorithm, the state of the cache and its
allocation behaviour.

It is entirely possible that the store to the page tables _could_ appear
in memory (and therefore visible to the MMU) before the stores to the
new page table initializing it to zeros.

For instance, let's say that the L1 writeback cache is read-allocate only,
and the MMU does not read entries from the L1 cache.

The new page table happens to contain some L1 cache lines, but the upper
table entry is not in the L1 cache.  This means that stores to the new
page table hit the L1 cache lines, which become dirty.  The store to the
upper table entry bypasses the L1 cache and is immediately placed into
the write buffer.

This means in all probability that the MMU will see the new page table
before it is visibly initialized.

> > As far as the BTB goes, I wonder if we can postpone that for user TLB
> > ops by setting a TIF_ flag and checking that before returning to userspace.
> > That would avoid having to needlessly destroy the cached branch information
> > for kernel space while looping over the page tables.  The only other place
> > that needs to worry about that is module_alloc() and vmap/vmalloc with
> > PROT_KERNEL_EXEC, all of which can be done in flush_cache_vmap().
> 
> With setting and checking the TIF_ flag we penalise newer hardware
> (Cortex-A8 onwards) where the BTB invalidation is a no-op. But I'll
> check with the people here if there are any implications with deferring
> the BTB invalidation.
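
A purely illustrative sketch of the deferral idea (TIF_BTB_FLUSH and
the two helpers are hypothetical names; only set_thread_flag() and
test_and_clear_thread_flag() are the standard thread-flag helpers):

#include <linux/thread_info.h>

/* while tearing down user mappings, just note that a flush is owed */
static inline void defer_user_btb_flush(void)
{
	set_thread_flag(TIF_BTB_FLUSH);
}

/* on the return-to-userspace path */
static inline void user_btb_flush(void)
{
	if (test_and_clear_thread_flag(TIF_BTB_FLUSH))
		/* BPIALL - a no-op from Cortex-A8 onwards */
		asm volatile("mcr p15, 0, %0, c7, c5, 6" : : "r" (0));
}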

Thanks.
heechul Yun July 4, 2011, 9:24 p.m. UTC | #5
On Fri, Jul 1, 2011 at 2:42 PM, heechul Yun <heechul@illinois.edu> wrote:
> Great.
>
> Removing the PTE flush seems to have a noticeable performance
> difference in my test. The followings are lmbench 3.0a performance
> result measured on a Cortex A9 SMP platform. So far, I did not have
> any problem while doing various test.
>
> =========
> mm-patch:
> =========
> Pagefaults on /tmp/XXX: 3.0759 microseconds
> Process fork+exit: 464.5414 microseconds
> Process fork+execve: 785.4944 microseconds
> Process fork+/bin/sh -c: 488.6204 microseconds

I realized that I made a big mistake on the fork+/bin/sh data (the
other numbers are fine). What happened was that there was no /bin/sh
(the Android platform has /system/bin/sh instead). When I did the first
test with the original kernel I made /bin a symlink to /system/bin,
which I forgot to redo after flashing the system with the new, patched
kernel.

The following is the result from the patched kernel after creating the correct symlink:

Pagefaults on /tmp/XXX: 3.0991 microseconds
Process fork+exit: 465.1620 microseconds
Process fork+execve: 781.5077 microseconds
Process fork+/bin/sh -c: 2804.2023 microseconds

It's still noticeably better (about 5-15%), but not as substantial as first reported.
Sorry for the confusion.

Heechul

>
> =========
> original:
> =========
> Pagefaults on /tmp/XXX: 3.6209 microseconds
> Process fork+exit: 485.5236 microseconds
> Process fork+execve: 820.0613 microseconds
> Process fork+/bin/sh -c: 2966.3828 microseconds
>
> Heechul
>
> On Fri, Jul 1, 2011 at 3:10 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
>> On Fri, Jul 01, 2011 at 08:04:42AM +0100, heechul Yun wrote:
>>> Based on TRM of Cortex A9, the MMU reads page table entries from L1-D
>>> cache not from memory. Then I think we do not need to flush the cache
>>> line in the following code because MMU will always see up-to-date view
>>> of page table in both UP and SMP systems.
>>>
>>> linux/arch/arm/mm/proc-v7.S
>>>
>>> ENTRY(cpu_v7_set_pte_ext)
>>>       ...
>>>         mcr     p15, 0, r0, c7, c10, 1          @ flush_pte from
>>> D-cache // why we need this in A9?
>>>         …
>>>
>>> If this is a necessary one, could you please explain the reason? Thanks.
>>
>> No, it's not necessary, only that this file is used by other processors
>> as well. The solution below checks the ID_MMFR3[23:20] bits (coherent
>> walk) and avoid flushing if the value is 1. The same could be done for
>> PMD entries, though that's less critical than the PTEs.
>>
>> Please note that the patch is not fully tested.
>>
>> 8<--------------------
>>
>> From 67bd5ebdf622637f8293286146441e6292713c3d Mon Sep 17 00:00:00 2001
>> From: Catalin Marinas <catalin.marinas@arm.com>
>> Date: Fri, 1 Jul 2011 10:57:07 +0100
>> Subject: [PATCH] ARMv7: Do not clean the PTE if coherent page table walk is supported
>>
>> This patch adds a check for the ID_MMFR3[23:20] bits (coherent walk) and
>> only cleans the D-cache corresponding to a PTE if coherent page table
>> walks are not supported.
>>
>> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
>> ---
>>  arch/arm/mm/proc-v7.S |    4 +++-
>>  1 files changed, 3 insertions(+), 1 deletions(-)
>>
>> diff --git a/arch/arm/mm/proc-v7.S b/arch/arm/mm/proc-v7.S
>> index 8013afc..fc5b36f 100644
>> --- a/arch/arm/mm/proc-v7.S
>> +++ b/arch/arm/mm/proc-v7.S
>> @@ -166,7 +166,9 @@ ENTRY(cpu_v7_set_pte_ext)
>>  ARM(  str     r3, [r0, #2048]! )
>>  THUMB(        add     r0, r0, #2048 )
>>  THUMB(        str     r3, [r0] )
>> -       mcr     p15, 0, r0, c7, c10, 1          @ flush_pte
>> +       mrc     p15, 0, r3, c0, c1, 7           @ read ID_MMFR3
>> +       tst     r3, #0xf << 20                  @ check the coherent walk bits
>> +       mcreq   p15, 0, r0, c7, c10, 1          @ flush_pte
>>  #endif
>>        mov     pc, lr
>>  ENDPROC(cpu_v7_set_pte_ext)
>>
>> --
>> Catalin
>>
>
Russell King - ARM Linux July 4, 2011, 11:20 p.m. UTC | #6
On Mon, Jul 04, 2011 at 08:58:19PM +0100, Russell King - ARM Linux wrote:
> On Mon, Jul 04, 2011 at 04:58:35PM +0100, Catalin Marinas wrote:
> > With setting and checking the TIF_ flag we penalise newer hardware
> > (Cortex-A8 onwards) where the BTB invalidation is a no-op. But I'll
> > check with the people here if there are any implications with deferring
> > the BTB invalidation.
> 
> Thanks.

Please can you also check whether BTC invalidation is required when a
page is removed from the page tables for the purpose of swapping it
out (so the PTE entry becomes zero).

Also, is BTC invalidation required if the very same page that was
there before is later restored without any modification to the
instruction content of that page.  (I'm thinking page is aged to old,
and so unmapped, then a prefetch abort which reinstates the same page
that was there previously.)

Finally, is BTC invalidation required if a different physical page
containing the same instruction content as before is placed at that
location?

I expect all but the last case requires BTC invalidation.

Lastly, please check whether population and removal of page table
entries for NX pages require BTC invalidation.  I expect not.
Catalin Marinas July 5, 2011, 10:07 a.m. UTC | #7
On Mon, Jul 04, 2011 at 08:58:19PM +0100, Russell King - ARM Linux wrote:
> On Mon, Jul 04, 2011 at 04:58:35PM +0100, Catalin Marinas wrote:
> > On Mon, Jul 04, 2011 at 12:13:38PM +0100, Russell King - ARM Linux wrote:
> > > That "However" clause is an interesting one though - it seems to imply
> > > that no barrier is required when we zero out a new page table, before
> > > we link the new page table into the higher order table.  The memory we
> > > allocated for the page table doesn't become a page table until it is
> > > linked into the page table tree. 
> > > 
> > > It also raises the question about how
> > > the system knows that a particular store is to something that's a page
> > > table and something that isn't...  Given that normal memory accesses are
> > > unordered, I think this paragraph is misleading and wrong.
> > 
> > Reordering of accesses can happen because of load speculation, store
> > reordering in the write buffer or delays on the bus outside the CPU. The
> > ARM processors do not issue stores speculatively. When a memory access
> > is issued, it checks the TLB and may perform a PTW (otherwise the
> > external bus wouldn't know the address). For explicit accesses, if the
> > PTW fails, it raises a fault (and we also need a precise abort).
> 
> I think you missed my point.  The order in which stores to normal memory
> appear is not determinable.  In the case of writeback caches, it depends
> on the cache replacement algorithm, the state of the cache and its
> allocation behaviour.
> 
> It is entirely possible that the store to the page tables _could_ appear
> in memory (and therefore visible to the MMU) before the stores to the
> new page table initializing it to zeros.

I think I got your point now. Do you mean:

1. allocate a page for pte
2. zero the page
3. write the pmd entry pointing to the pte page

In this case we need a DSB between 2 and 3 (and maybe a cache flush as
well, which we already do), otherwise we could speculatively load garbage
into the TLB.
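
As a rough C sketch of that sequence (not the actual pgalloc code;
clean_dcache_area(), dsb() and pmd_populate_kernel() are the usual
helpers, the function itself is made up):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <asm/cacheflush.h>
#include <asm/pgalloc.h>
#include <asm/system.h>

static pte_t *alloc_and_link_pte(struct mm_struct *mm, pmd_t *pmd)
{
	pte_t *new = (pte_t *)get_zeroed_page(GFP_KERNEL);	/* 1 + 2 */

	if (!new)
		return NULL;
	/* if the walker can't snoop the D-cache, a clean is also needed */
	clean_dcache_area(new, PTRS_PER_PTE * sizeof(pte_t));
	dsb();			/* make step 2 visible before step 3 */
	pmd_populate_kernel(mm, pmd, new);			/* 3 */
	return new;
}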

Going back to the "however" clause - "any writes to the translation
tables are not seen by any explicit memory access that occurs in program
order before the write to the translation tables" - the important issue
here is the understanding of "seen". An explicit memory access "sees" a
write to the translation table if it uses the new translation to
calculate the physical address. But in the case above, point 3 doesn't
need to see the page zeroing at point 2; they are independent, hence the
need for a DSB.

Maybe the clause could be clarified a bit - I'll ping RG.

> For instance, lets say that the L1 writeback cache is read allocate only,
> and does not read entries from the L1 cache.
> 
> The new page table happens to contain some L1 cache lines, but the upper
> table entry is not in the L1 cache.  This means that stores to the new
> page table hit the L1 cache lines, which become dirty.  The store to the
> upper table entry bypasses the L1 cache and is immediately placed into
> the write buffer.
> 
> This means in all probability that the MMU will see the new page table
> before it is visibly initialized.

If the MMU can read the L1 cache, then we need a DSB before the write to
the upper-level entry (I wonder whether a DMB would do, since the MMU is
just another observer). If the MMU can't read the L1 cache, then we need
a cache clean as well.
Catalin Marinas July 6, 2011, 4:05 p.m. UTC | #8
On Tue, Jul 05, 2011 at 12:20:19AM +0100, Russell King - ARM Linux wrote:
> Please can you also check whether BTC invalidation is required when a
> page is removed from the page tables for the purpose of swapping it
> out (so the PTE entry becomes zero).

Not required.

> Also, is BTC invalidation required if the very same page that was
> there before is later restored without any modification to the
> instruction content of that page.  (I'm thinking page is aged to old,
> and so unmapped, then a prefetch abort which reinstates the same page
> that was there previously.)

Not required.

> Finally, is BTC invalidation required if a different physical page
> containing the same instruction content as before is placed at that
> location?

Required (see the comment below).

> Lastly, please check whether population and removal of page table
> entries for NX pages require BTC invalidation.  I expect not.
 
Not required.

Basically the rules for BTC are similar to those for the I-cache, but it
can be either ASID-tagged VIVT or VIPT, independent of the type of the
I-cache. I think current implementations of the BTC are ASID-tagged VIVT
but the architecture doesn't mandate this.

So the mapping of a data page doesn't require BTC invalidation, even if
the page is not marked as XN.

Having a branch between the BTC invalidation and the ISB is OK as long
as we don't change the mapping of the branch.

Mapping a page in a previously unmapped location doesn't generally
require BTC invalidation if we guarantee that there are no BTC entries
for that location (IOW, we got a new ASID or unmapping the page
previously had a BTC invalidation).
Russell King - ARM Linux July 6, 2011, 6:08 p.m. UTC | #9
On Wed, Jul 06, 2011 at 05:05:51PM +0100, Catalin Marinas wrote:
> On Tue, Jul 05, 2011 at 12:20:19AM +0100, Russell King - ARM Linux wrote:
> > Please can you also check whether BTC invalidation is required when a
> > page is removed from the page tables for the purpose of swapping it
> > out (so the PTE entry becomes zero).
> 
> Not required.
> 
> > Also, is BTC invalidation required if the very same page that was
> > there before is later restored without any modification to the
> > instruction content of that page.  (I'm thinking page is aged to old,
> > and so unmapped, then a prefetch abort which reinstates the same page
> > that was there previously.)
> 
> Not required.
> 
> > Finally, is BTC invalidation required if a different physical page
> > containing the same instruction content as before is placed at that
> > location?
> 
> Required (see the comment below).
> 
> > Lastly, please check whether population and removal of page table
> > entries for NX pages require BTC invalidation.  I expect not.
>  
> Not required.
> 
> Basically the rules for BTC are similar to those for the I-cache, but it
> can be either ASID-tagged VIVT or VIPT, independent of the type of the
> I-cache. I think current implementations of the BTC are ASID-tagged VIVT
> but the architecture doesn't mandate this.
> 
> So the mapping of a data page doesn't require BTC invalidation, even if
> the page is not marked as XN.
> 
> Having a branch between BTC and ISB is OK as long as we don't change the
> mapping of the branch.
> 
> Mapping a page in a previously unmapped location doesn't generally
> require BTC invalidation if we guarantee that there are no BTC entries
> for that location (IOW, we got a new ASID or unmapping the page
> previously had a BTC invalidation).

Okay, so I can say with confidence that, as things stand in my tree,
limiting BTC invalidation to:

1. For kernel mappings, flush_icache_range() which must be called prior
   to code placed in them being executed.

2. For user mappings, __sync_icache_dcache's icache flush, which is
   called before a non-zero user PTE is inserted.

together covers all the necessary points.

The area which needs some further work is __sync_icache_dcache(),
which is fairly over-zealous about all the flushing.  In a boot of the
Versatile Express platform, it inserts 193k PTEs.  Of those, 192k are
for userspace.  For userspace:

87k are for new executable pages, 87 are for executable pages with
their attributes (e.g. young-ness/read-only-ness) changed.

54k are for new non-executable pages, 51k are for non-executable pages
with attributes changed.

__sync_icache_dcache() handles these userspace PTEs as follows:

0 satisfy the !pte_present_user() test.

105k are ignored due to the VIPT non-aliasing cache and the PTE being
non-executable.

0 have invalid PFNs.

964 result in a Dcache flush.  This number does not increment through
repeated monitoring of the stats via 'cat'.

87k result in an Icache invalidate with BTC invalidation.
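
In other words, the flow those counters correspond to is roughly the
following (a simplified sketch, not the exact arch/arm/mm/flush.c
source):

void __sync_icache_dcache(pte_t pteval)
{
	unsigned long pfn;
	struct page *page;

	if (!pte_present_user(pteval))
		return;				/* 0 hits above */
	if (cache_is_vipt_nonaliasing() && !pte_exec(pteval))
		return;				/* the ~105k ignored */
	pfn = pte_pfn(pteval);
	if (!pfn_valid(pfn))
		return;				/* 0 invalid PFNs */

	page = pfn_to_page(pfn);
	if (!test_and_set_bit(PG_dcache_clean, &page->flags))
		__flush_dcache_page(page_mapping(page), page);	/* the 964 */
	if (pte_exec(pteval))
		__flush_icache_all();		/* the 87k I-cache + BTC */
}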

Around 130 (of 193k) calls into this code can be saved by checking
whether the PFN has changed and the state of the young bit (no point
doing any flushing if the page is marked old as it won't be accessible
to the user in that state).

130 may not sound like many, but that's just from a boot.  Over time,
pages will be aged, which will push that number up.  We really ought not
flush the I-cache and BTC just because we marked the page old, and then
again when it is marked young.

That's also something which won't really show up in benchmarks, as it
depends on pushing virtual memory out of the system, which is not
something benchmarks are particularly good at measuring.  Its effect
on overall system behaviour depends on the system setup and memory
pressure.

Patch

diff --git a/arch/arm/mm/proc-v7.S b/arch/arm/mm/proc-v7.S
index 8013afc..fc5b36f 100644
--- a/arch/arm/mm/proc-v7.S
+++ b/arch/arm/mm/proc-v7.S
@@ -166,7 +166,9 @@  ENTRY(cpu_v7_set_pte_ext)
  ARM(	str	r3, [r0, #2048]! )
  THUMB(	add	r0, r0, #2048 )
  THUMB(	str	r3, [r0] )
-	mcr	p15, 0, r0, c7, c10, 1		@ flush_pte
+	mrc	p15, 0, r3, c0, c1, 7		@ read ID_MMFR3
+	tst	r3, #0xf << 20			@ check the coherent walk bits
+	mcreq	p15, 0, r0, c7, c10, 1		@ flush_pte
 #endif
 	mov	pc, lr
 ENDPROC(cpu_v7_set_pte_ext)