diff mbox

Unnecessary cache-line flush on page table updates ?

Message ID 20110704215507.GH8286@n2100.arm.linux.org.uk (mailing list archive)
State New, archived

Commit Message

Russell King - ARM Linux July 4, 2011, 9:55 p.m. UTC
On Mon, Jul 04, 2011 at 12:13:38PM +0100, Russell King - ARM Linux wrote:
> As far as the BTB goes, I wonder if we can postpone that for user TLB
> ops by setting a TIF_ flag and checking that before returning to userspace.
> That would avoid having to needlessly destroy the cached branch information
> for kernel space while looping over the page tables.  The only other place
> that needs to worry about that is module_alloc() and vmap/vmalloc with
> PROT_KERNEL_EXEC, all of which can be done in flush_cache_vmap().

Actually, we don't need to do BTC invalidate in flush_cache_vmap(),
but we do need to do a dsb+isb.

Firstly, the majority of mappings are created with NX set, so the BTC
can't be involved in anything created there (as we can't execute code
there.)

Secondly, if we're loading code into kernel space to execute, then we
need to ensure I/D coherency via flush_icache_range(), which has to
happen after the mappings have been created, and this already does our
BTC invalidate+dsb+isb.

So, as far as kernel space TLB invalidation goes, my conclusion is that
we do not have to touch the BTC at all there, and we can leave that to
flush_icache_range() (and therefore cpu..coherent_kern_range) to deal
with.

Practically, testing it out on Versatile Express loading/unloading a
few modules shows no ill effects from dropping the BTC invalidates from
the kernel TLB invalidate ops.

8<----------
From: Russell King <rmk+kernel@arm.linux.org.uk>
ARM: btc: avoid invalidating the branch target cache on kernel TLB maintenance

Kernel space needs very little in the way of BTC maintenance as most
mappings which are created and destroyed are non-executable, and so
could never enter the instruction stream.

The case which does warrant BTC maintenance is when a module is loaded.
This creates a new executable mapping, but at that point the pages have
not been initialized with code and data, so at that point they contain
unpredictable information.  Invalidating the BTC at this stage serves
little useful purpose.

Before we execute module code, we call flush_icache_range(), which deals
with the BTC maintenance requirements.  This ensures that we have a BTC
maintenance operation before we execute code via the newly created
mapping.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
---
 arch/arm/include/asm/tlbflush.h |   23 -----------------------
 arch/arm/mm/tlb-v6.S            |    1 -
 arch/arm/mm/tlb-v7.S            |    4 ----
 3 files changed, 0 insertions(+), 28 deletions(-)

Comments

Catalin Marinas July 5, 2011, 9:26 a.m. UTC | #1
On Mon, Jul 04, 2011 at 10:55:07PM +0100, Russell King - ARM Linux wrote:
> On Mon, Jul 04, 2011 at 12:13:38PM +0100, Russell King - ARM Linux wrote:
> > As far as the BTB goes, I wonder if we can postpone that for user TLB
> > ops by setting a TIF_ flag and checking that before returning to userspace.
> > That would avoid having to needlessly destroy the cached branch information
> > for kernel space while looping over the page tables.  The only other place
> > that needs to worry about that is module_alloc() and vmap/vmalloc with
> > PROT_KERNEL_EXEC, all of which can be done in flush_cache_vmap().
> 
> Actually, we don't need to do BTC invalidate in flush_cache_vmap(),
> but we do need to do a dsb+isb.

Why would we need an ISB?

> Firstly, the majority of mappings are created with NX set, so the BTC
> can't be involved in anything created there (as we can't execute code
> there.)

OK.

> Secondly, if we're loading code into kernel space to execute, then we
> need to ensure I/D coherency via flush_icache_range(), which has to
> happen after the mappings have been created, and this already does our
> BTC invalidate+dsb+isb.

OK.

> So, as far as kernel space TLB invalidation goes, my conclusion is that
> we do not have to touch the BTC at all there, and we can leave that to
> flush_icache_range() (and therefore cpu..coherent_kern_range) to deal
> with.
> 
> Practically, testing it out on Versatile Express loading/unloading a
> few modules shows no ill effects from dropping the BTC invalidates from
> the kernel TLB invalidate ops.

AFAIK the branch predictor is transparent on Cortex-A9 and the BTB
maintenance operations are no-ops. You wouldn't notice any issues if
you remove them (you can check ID_MMFR1[31:28]).

> 8<----------
> From: Russell King <rmk+kernel@arm.linux.org.uk>
> ARM: btc: avoid invalidating the branch target cache on kernel TLB maintenance
> 
> Kernel space needs very little in the way of BTC maintenance as most
> mappings which are created and destroyed are non-executable, and so
> could never enter the instruction stream.
> 
> The case which does warrant BTC maintenance is when a module is loaded.
> This creates a new executable mapping, but at that point the pages have
> not been initialized with code and data, so at that point they contain
> unpredictable information.  Invalidating the BTC at this stage serves
> little useful purpose.
> 
> Before we execute module code, we call flush_icache_range(), which deals
> with the BTC maintenance requirements.  This ensures that we have a BTC
> maintenance operation before we execute code via the newly created
> mapping.
> 
> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

The patch looks fine (it would be good if we get some testing on an
ARM11 platform).
Russell King - ARM Linux July 5, 2011, 9:46 a.m. UTC | #2
On Tue, Jul 05, 2011 at 10:26:00AM +0100, Catalin Marinas wrote:
> On Mon, Jul 04, 2011 at 10:55:07PM +0100, Russell King - ARM Linux wrote:
> > On Mon, Jul 04, 2011 at 12:13:38PM +0100, Russell King - ARM Linux wrote:
> > > As far as the BTB goes, I wonder if we can postpone that for user TLB
> > > ops by setting a TIF_ flag and checking that before returning to userspace.
> > > That would avoid having to needlessly destroy the cached branch information
> > > for kernel space while looping over the page tables.  The only other place
> > > that needs to worry about that is module_alloc() and vmap/vmalloc with
> > > PROT_KERNEL_EXEC, all of which can be done in flush_cache_vmap().
> > 
> > Actually, we don't need to do BTC invalidate in flush_cache_vmap(),
> > but we do need to do a dsb+isb.
> 
> Why would we need an ISB?

Because the ARM ARM tells us that we must do a dsb+isb to ensure that
instructions are visible to the instruction stream after a change to
the page tables.

I've since added the dsb+isbs back to the TLB ops because the ARM ARM
is quite explicit that both are required to ensure that explicit
accesses see the effect of the TLB operation.  To me it is rather
perverse that an ISB is required to ensure that this sequence works:

	write page table entry
	clean
	dsb
	flush tlb entry
	dsb
	isb
	read/write new page

but the ARM ARM is quite explicit that this sequence is insufficient:

	write page table entry
	clean
	dsb
	flush tlb entry
	dsb
	read/write new page

to ensure visibility of the new page.

I suspect that the ARM ARM is wrong here - and the ISB is only required
if you're going to be executing from the new page.  Maybe that's already
been fixed?  Is there an update to DDI0406B?
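
[Editor's note: as an illustration only, the first sequence above maps onto
ARMv7 instructions roughly as follows.  The register assignments are
hypothetical, and this is a sketch of the architectural requirement, not
the kernel's actual code.]

```
	@ r0 = address of the PTE, r1 = new PTE value,
	@ r2 = virtual address being remapped
	str	r1, [r0]			@ write page table entry
	mcr	p15, 0, r0, c7, c10, 1		@ DCCMVAC: clean PTE line to PoC
	dsb					@ ensure the clean completes
	mcr	p15, 0, r2, c8, c7, 1		@ TLBIMVA: flush the TLB entry
	dsb					@ ensure the TLB flush completes
	isb					@ required before the access below
	@ ... read/write through the new mapping ...
```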
Russell King - ARM Linux July 5, 2011, 10:48 a.m. UTC | #3
On Tue, Jul 05, 2011 at 10:26:00AM +0100, Catalin Marinas wrote:
> AFAIK the branch predictor is transparent on Cortex-A9 and the BTB
> maintenance operations are no-ops. You wouldn't notice any issues if
> you remove them (you can check ID_MMFR1[31:28]).

It's not transparent:

CPU: MMFR0=0x00100103 MMFR1=0x20000000 MMFR2=0x01230000 MMFR3=0x00102111

0b0010 Branch predictor requires flushing on:
	• enabling or disabling the MMU
	• writing new data to instruction locations
	• writing new mappings to the translation tables
	• any change to the TTBR0, TTBR1, or TTBCR registers without a
	  corresponding change to the FCSE ProcessID or ContextID.
Catalin Marinas July 5, 2011, 1:54 p.m. UTC | #4
On Tue, Jul 05, 2011 at 11:48:58AM +0100, Russell King - ARM Linux wrote:
> On Tue, Jul 05, 2011 at 10:26:00AM +0100, Catalin Marinas wrote:
> > AFAIK the branch predictor is transparent on Cortex-A9 and the BTB
> > maintenance operations are no-ops. You wouldn't notice any issues if
> > you remove them (you can check ID_MMFR1[31:28]).
> 
> It's not transparent:
> 
> CPU: MMFR0=0x00100103 MMFR1=0x20000000 MMFR2=0x01230000 MMFR3=0x00102111
> 
> 0b0010 Branch predictor requires flushing on:
> 	• enabling or disabling the MMU
> 	• writing new data to instruction locations
> 	• writing new mappings to the translation tables
> 	• any change to the TTBR0, TTBR1, or TTBCR registers without a
> 	  corresponding change to the FCSE ProcessID or ContextID.

OK, it makes sense. But what's really confusing is that the A8 has the
same value for ID_MMFR1 (according to the TRM) but the BTB instructions
are no-ops (though they can be enabled by the ACTLR.IBE bit). I suspect
the ID_MMFR1 is wrong in this case.

Anyway, as long as we stick to the ARM ARM requirements we should be OK.
Russell King - ARM Linux July 5, 2011, 2:15 p.m. UTC | #5
On Tue, Jul 05, 2011 at 02:54:24PM +0100, Catalin Marinas wrote:
> On Tue, Jul 05, 2011 at 11:48:58AM +0100, Russell King - ARM Linux wrote:
> > On Tue, Jul 05, 2011 at 10:26:00AM +0100, Catalin Marinas wrote:
> > > AFAIK the branch predictor is transparent on Cortex-A9 and the BTB
> > > maintenance operations are no-ops. You wouldn't notice any issues if
> > > you remove them (you can check ID_MMFR1[31:28]).
> > 
> > It's not transparent:
> > 
> > CPU: MMFR0=0x00100103 MMFR1=0x20000000 MMFR2=0x01230000 MMFR3=0x00102111
> > 
> > 0b0010 Branch predictor requires flushing on:
> > 	• enabling or disabling the MMU
> > 	• writing new data to instruction locations
> > 	• writing new mappings to the translation tables
> > 	• any change to the TTBR0, TTBR1, or TTBCR registers without a
> > 	  corresponding change to the FCSE ProcessID or ContextID.
> 
> OK, it makes sense. But what's really confusing is that the A8 has the
> same value for ID_MMFR1 (according to the TRM) but the BTB instructions
> are no-ops (though they can be enabled by the ACTLR.IBE bit). I suspect
> the ID_MMFR1 is wrong in this case.

Grumble.  This kind of thing worries me.

If the ID registers are wrong, then they can't be relied upon by software
making decisions about what can be skipped.  So how can we be confident
that MMFR3[23:20] != 0 really does mean that we don't need to clean the
cache when writing the page tables, and is not a result of some bug?
Catalin Marinas July 5, 2011, 2:40 p.m. UTC | #6
On Tue, Jul 05, 2011 at 03:15:10PM +0100, Russell King - ARM Linux wrote:
> On Tue, Jul 05, 2011 at 02:54:24PM +0100, Catalin Marinas wrote:
> > On Tue, Jul 05, 2011 at 11:48:58AM +0100, Russell King - ARM Linux wrote:
> > > On Tue, Jul 05, 2011 at 10:26:00AM +0100, Catalin Marinas wrote:
> > > > AFAIK the branch predictor is transparent on Cortex-A9 and the BTB
> > > > maintenance operations are no-ops. You wouldn't notice any issues if
> > > > you remove them (you can check ID_MMFR1[31:28]).
> > > 
> > > It's not transparent:
> > > 
> > > CPU: MMFR0=0x00100103 MMFR1=0x20000000 MMFR2=0x01230000 MMFR3=0x00102111
> > > 
> > > 0b0010 Branch predictor requires flushing on:
> > > 	• enabling or disabling the MMU
> > > 	• writing new data to instruction locations
> > > 	• writing new mappings to the translation tables
> > > 	• any change to the TTBR0, TTBR1, or TTBCR registers without a
> > > 	  corresponding change to the FCSE ProcessID or ContextID.
> > 
> > OK, it makes sense. But what's really confusing is that the A8 has the
> > same value for ID_MMFR1 (according to the TRM) but the BTB instructions
> > are no-ops (though they can be enabled by the ACTLR.IBE bit). I suspect
> > the ID_MMFR1 is wrong in this case.
> 
> Grumble.  This kind of thing worries me.
> 
> If the ID registers are wrong, then they can't be relied upon by software
> making decisions about what can be skipped.  

It could be because the BTB operations are configurable via an ACTLR
bit, but the ID registers are fixed values.

> So how can we be confident
> that MMFR3[23:20] != 0 really does mean that we don't need to clean the
> cache when writing the page tables, and is not a result of some bug?

Well, you can always get other bugs; nothing can be guaranteed. But
when they are found, there are errata workarounds. This would actually
be a very easy issue to detect.
Catalin Marinas July 6, 2011, 3:52 p.m. UTC | #7
On Tue, Jul 05, 2011 at 10:46:52AM +0100, Russell King - ARM Linux wrote:
> On Tue, Jul 05, 2011 at 10:26:00AM +0100, Catalin Marinas wrote:
> > On Mon, Jul 04, 2011 at 10:55:07PM +0100, Russell King - ARM Linux wrote:
> > > On Mon, Jul 04, 2011 at 12:13:38PM +0100, Russell King - ARM Linux wrote:
> > > > As far as the BTB goes, I wonder if we can postpone that for user TLB
> > > > ops by setting a TIF_ flag and checking that before returning to userspace.
> > > > That would avoid having to needlessly destroy the cached branch information
> > > > for kernel space while looping over the page tables.  The only other place
> > > > that needs to worry about that is module_alloc() and vmap/vmalloc with
> > > > PROT_KERNEL_EXEC, all of which can be done in flush_cache_vmap().
> > > 
> > > Actually, we don't need to do BTC invalidate in flush_cache_vmap(),
> > > but we do need to do a dsb+isb.
> > 
> > Why would we need an ISB?
> 
> Because the ARM ARM tells us that we must do a dsb+isb to ensure that
> instructions are visible to the instruction stream after a change to
> the page tables.
> 
> I've since added the dsb+isbs back to the TLB ops because the ARM ARM
> is quite explicit that both are required to ensure that explicit
> accesses see the effect of the TLB operation.  To me it is rather
> perverse that an ISB is required to ensure that this sequence works:
> 
> 	write page table entry
> 	clean
> 	dsb
> 	flush tlb entry
> 	dsb
> 	isb
> 	read/write new page

The same requirement can be found in the latest (internal only) ARM ARM as
well. I think the reason is that some memory access already in the
pipeline (but after the TLB flush) may have sampled the state of the
page table using an old TLB entry, even though the actual memory access
will happen after the DSB.
Russell King - ARM Linux July 6, 2011, 3:55 p.m. UTC | #8
On Wed, Jul 06, 2011 at 04:52:14PM +0100, Catalin Marinas wrote:
> On Tue, Jul 05, 2011 at 10:46:52AM +0100, Russell King - ARM Linux wrote:
> > On Tue, Jul 05, 2011 at 10:26:00AM +0100, Catalin Marinas wrote:
> > > On Mon, Jul 04, 2011 at 10:55:07PM +0100, Russell King - ARM Linux wrote:
> > > > On Mon, Jul 04, 2011 at 12:13:38PM +0100, Russell King - ARM Linux wrote:
> > > > > As far as the BTB goes, I wonder if we can postpone that for user TLB
> > > > > ops by setting a TIF_ flag and checking that before returning to userspace.
> > > > > That would avoid having to needlessly destroy the cached branch information
> > > > > for kernel space while looping over the page tables.  The only other place
> > > > > that needs to worry about that is module_alloc() and vmap/vmalloc with
> > > > > PROT_KERNEL_EXEC, all of which can be done in flush_cache_vmap().
> > > > 
> > > > Actually, we don't need to do BTC invalidate in flush_cache_vmap(),
> > > > but we do need to do a dsb+isb.
> > > 
> > > Why would we need an ISB?
> > 
> > Because the ARM ARM tells us that we must do a dsb+isb to ensure that
> > instructions are visible to the instruction stream after a change to
> > the page tables.
> > 
> > I've since added the dsb+isbs back to the TLB ops because the ARM ARM
> > is quite explicit that both are required to ensure that explicit
> > accesses see the effect of the TLB operation.  To me it is rather
> > perverse that an ISB is required to ensure that this sequence works:
> > 
> > 	write page table entry
> > 	clean
> > 	dsb
> > 	flush tlb entry
> > 	dsb
> > 	isb
> > 	read/write new page
> 
> The same requirement can be found in the latest (internal only) ARM ARM as
> well. I think the reason is that some memory access already in the
> pipeline (but after the TLB flush) may have sampled the state of the
> page table using an old TLB entry, even though the actual memory access
> will happen after the DSB.

It would be useful to have a statement from RG on this.  Could you
pass this to him please?

Thanks.
Catalin Marinas July 6, 2011, 4:15 p.m. UTC | #9
On Wed, Jul 06, 2011 at 04:55:25PM +0100, Russell King - ARM Linux wrote:
> On Wed, Jul 06, 2011 at 04:52:14PM +0100, Catalin Marinas wrote:
> > On Tue, Jul 05, 2011 at 10:46:52AM +0100, Russell King - ARM Linux wrote:
> > > I've since added the dsb+isbs back to the TLB ops because the ARM ARM
> > > is quite explicit that both are required to ensure that explicit
> > > accesses see the effect of the TLB operation.  To me it is rather
> > > perverse that an ISB is required to ensure that this sequence works:
> > > 
> > > 	write page table entry
> > > 	clean
> > > 	dsb
> > > 	flush tlb entry
> > > 	dsb
> > > 	isb
> > > 	read/write new page
> > 
> > The same requirement can be found in the latest (internal only) ARM ARM as
> > well. I think the reason is that some memory access already in the
> > pipeline (but after the TLB flush) may have sampled the state of the
> > page table using an old TLB entry, even though the actual memory access
> > will happen after the DSB.
> 
> It would be useful to have a statement from RG on this.  Could you
> pass this to him please?

Just checked - it is required as per the ARM ARM and there are cores
that rely on the ISB. As I said, it depends on the pipeline
implementation.

Patch

diff --git a/arch/arm/include/asm/tlbflush.h b/arch/arm/include/asm/tlbflush.h
index 9aeddce..3704d03 100644
--- a/arch/arm/include/asm/tlbflush.h
+++ b/arch/arm/include/asm/tlbflush.h
@@ -481,19 +481,6 @@  static inline void local_flush_tlb_kernel_page(unsigned long kaddr)
 		asm("mcr p15, 0, %0, c8, c5, 1" : : "r" (kaddr) : "cc");
 	if (tlb_flag(TLB_V7_UIS_PAGE))
 		asm("mcr p15, 0, %0, c8, c3, 1" : : "r" (kaddr) : "cc");
-
-	if (tlb_flag(TLB_BTB)) {
-		/* flush the branch target cache */
-		asm("mcr p15, 0, %0, c7, c5, 6" : : "r" (zero) : "cc");
-		dsb();
-		isb();
-	}
-	if (tlb_flag(TLB_V7_IS_BTB)) {
-		/* flush the branch target cache */
-		asm("mcr p15, 0, %0, c7, c1, 6" : : "r" (zero) : "cc");
-		dsb();
-		isb();
-	}
 }
 
 /*
diff --git a/arch/arm/mm/tlb-v6.S b/arch/arm/mm/tlb-v6.S
index 73d7d89..cdbfda5 100644
--- a/arch/arm/mm/tlb-v6.S
+++ b/arch/arm/mm/tlb-v6.S
@@ -83,7 +83,6 @@  ENTRY(v6wbi_flush_kern_tlb_range)
 	add	r0, r0, #PAGE_SZ
 	cmp	r0, r1
 	blo	1b
-	mcr	p15, 0, r2, c7, c5, 6		@ flush BTAC/BTB
 	mcr	p15, 0, r2, c7, c10, 4		@ data synchronization barrier
 	mcr	p15, 0, r2, c7, c5, 4		@ prefetch flush
 	mov	pc, lr
diff --git a/arch/arm/mm/tlb-v7.S b/arch/arm/mm/tlb-v7.S
index 53cd5b4..dc84f72 100644
--- a/arch/arm/mm/tlb-v7.S
+++ b/arch/arm/mm/tlb-v7.S
@@ -75,11 +75,7 @@  ENTRY(v7wbi_flush_kern_tlb_range)
 	add	r0, r0, #PAGE_SZ
 	cmp	r0, r1
 	blo	1b
-	mov	r2, #0
-	ALT_SMP(mcr	p15, 0, r2, c7, c1, 6)	@ flush BTAC/BTB Inner Shareable
-	ALT_UP(mcr	p15, 0, r2, c7, c5, 6)	@ flush BTAC/BTB
 	dsb
-	isb
 	mov	pc, lr
 ENDPROC(v7wbi_flush_kern_tlb_range)