arm64: head: Fix cache inconsistency of the identity-mapped region

Message ID 20220415170504.3781878-1-sdonthineni@nvidia.com (mailing list archive)
State New, archived
Series arm64: head: Fix cache inconsistency of the identity-mapped region

Commit Message

Shanker Donthineni April 15, 2022, 5:05 p.m. UTC
Secondary core boot gets stuck due to a data abort while executing the
instruction 'ldr x8, =__secondary_switched'. The RELA value used by this
instruction was updated by the primary boot core in __relocate_kernel(),
but those memory updates are not visible to the secondary CPUs after they
call switch_to_vhe(), which causes the problem.
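
For reference, 'ldr x8, =__secondary_switched' assembles to a PC-relative
load from a literal pool slot, and with CONFIG_RELOCATABLE that slot is
fixed up at boot through an R_AARCH64_RELATIVE entry. A conceptual sketch
of such a fixup loop follows (not the kernel's exact __relocate_kernel
code; the symbol names, register choice and the displacement in x23 are
illustrative assumptions):

	adr_l	x9, __rela_start		// Elf64_Rela table bounds (names assumed)
	adr_l	x10, __rela_end
0:	cmp	x9, x10
	b.hs	1f
	ldp	x11, x12, [x9], #24		// x11 = r_offset, x12 = r_info
	ldr	x13, [x9, #-8]			// x13 = r_addend
	cmp	x12, #1027			// R_AARCH64_RELATIVE?
	b.ne	0b
	add	x13, x13, x23			// x23 = link-to-runtime displacement
	str	x13, [x11, x23]			// patch the affected slot
	b	0b
1: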

The cacheability/shareability attributes of the identity-mapped region
differ between execution in EL1 (MMU enabled) and the short period spent
in the hyp-stub (EL2, MMU disabled). As per the ARM-ARM specification
(DDI0487G_b), this is not allowed.

G5.10.3 Cache maintenance requirement:
 "If the change affects the cacheability attributes of the area of memory,
 including any change between Write-Through and Write-Back attributes,
 software must ensure that any cached copies of affected locations are
 removed from the caches, typically by cleaning and invalidating the
 locations from the levels of cache that might hold copies of the locations
 affected by the attribute change."

Clean+invalidate the identity-mapped region to the PoC before switching
to the VHE world to fix the cache inconsistency.
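
For illustration, a minimal sketch of such a clean+invalidate to the PoC
over [x0, x1), assuming a 64-byte cache line for brevity (the kernel's
dcache_clean_inval_poc helper reads the real line size from CTR_EL0):

	bic	x0, x0, #63		// align base down to the line size
0:	dc	civac, x0		// clean & invalidate by VA to PoC
	add	x0, x0, #64
	cmp	x0, x1
	b.lo	0b
	dsb	sy			// complete the maintenance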

Problem analysis with disassembly (vmlinux):
 1) Both __primary_switch() and enter_vhe() are part of the identity region
 2) RELA entries and enter_vhe() share the same cache line ffff800010970480
 3) Memory ffff800010970484-ffff800010970498 is updated with the EL1 MMU enabled
 4) The CPU fetches instructions of enter_vhe() with the EL2 MMU disabled
   - Non-coherent access causes cache line ffff800010970480 to be dropped
 5) A secondary core executes 'ldr x8, =__secondary_switched'
   - Getting a data abort because of the incorrect value at ffff800010970488

ffff800010970418 <__primary_switch>:
ffff800010970418:  d503245f  bti  c
ffff80001097041c:  aa0003f3  mov  x19, x0
ffff800010970420:  d5381014  mrs  x20, sctlr_el1
ffff800010970424:  90003c81  adrp x1, ffff800011100000 <init_pg_dir>
ffff800010970428:  97ffffc4  bl   ffff800010970338 <__enable_mmu>
ffff80001097042c:  97ffffe8  bl   ffff8000109703cc <__relocate_kernel>
ffff800010970430:  58000308  ldr  x8, ffff800010970490 <__primary_switch+0x78>
ffff800010970434:  90ffb480  adrp x0, ffff800010000000 <_text>
ffff800010970438:  d63f0100  blr  x8
ffff80001097043c:  d5033fdf  isb
ffff800010970440:  d5181014  msr  sctlr_el1, x20
ffff800010970444:  d5033fdf  isb
ffff800010970448:  940f7efe  bl   ffff800010d50040 <__create_page_tables>
ffff80001097044c:  d508871f  tlbi vmalle1
ffff800010970450:  d503379f  dsb  nsh
ffff800010970454:  d5033fdf  isb
ffff800010970458:  d5181013  msr  sctlr_el1, x19
ffff80001097045c:  d5033fdf  isb
ffff800010970460:  d508751f  ic   iallu
ffff800010970464:  d503379f  dsb  nsh
ffff800010970468:  d5033fdf  isb
ffff80001097046c:  97ffffd8  bl   ffff8000109703cc <__relocate_kernel>
ffff800010970470:  58000108  ldr  x8, ffff800010970490 <__primary_switch+0x78>
ffff800010970474:  90ffb480  adrp x0, ffff800010000000 <_text>
ffff800010970478:  d61f0100  br   x8
ffff80001097047c:  00df10c8  .word   0x00df10c8
ffff800010970480:  000dfba8  .word   0x000dfba8
        ...
ffff800010970498:  d51cd041  msr  tpidr_el2, x1
ffff80001097049c:  d503201f  nop

ffff8000109704a0 <enter_vhe>:
ffff8000109704a0:  d508871f  tlbi vmalle1
ffff8000109704a4:  d503379f  dsb  nsh
ffff8000109704a8:  d5033fdf  isb
ffff8000109704ac:  d53d1000  mrs  x0, sctlr_el12
ffff8000109704b0:  d5181000  msr  sctlr_el1, x0
ffff8000109704b4:  d5033fdf  isb
ffff8000109704b8:  d508751f  ic   iallu
ffff8000109704bc:  d503379f  dsb  nsh
ffff8000109704c0:  d5033fdf  isb
ffff8000109704c4:  d2a60a00  mov  x0, #0x30500000
ffff8000109704c8:  f2810000  movk x0, #0x800
ffff8000109704cc:  d51d1000  msr  sctlr_el12, x0
ffff8000109704d0:  aa1f03e0  mov  x0, xzr
ffff8000109704d4:  d69f03e0  eret

ffff800010961850 <mutate_to_vhe>:
ffff800010961850: d53c1001   mrs  x1, sctlr_el2
ffff800010961854: 370001c1   tbnz w1, #0, ffff80001096188c <mutate_to_vhe+0x3c>
ffff800010961858: d5380721   mrs  x1, id_aa64mmfr1_el1
...
ffff80001096190c: aa010000   orr  x0, x0, x1
ffff800010961910: d5184000   msr  spsr_el1, x0
ffff800010961914: 14003ae3   b    ffff8000109704a0 <enter_vhe>

ffff800010970270 <secondary_startup>:
ffff800010970270: d503245f  bti  c
ffff800010970274: 97dab23a  bl   ffff80001001cb5c <switch_to_vhe>
ffff800010970278: 94000049  bl   ffff80001097039c <__cpu_secondary_check52bitva>
ffff80001097027c: 94000145  bl   ffff800010970790 <__cpu_setup>
ffff800010970280: 90001e81  adrp x1, ffff800010d40000 <swapper_pg_dir>
ffff800010970284: 9400002d  bl   ffff800010970338 <__enable_mmu>
ffff800010970288: 58001008  ldr  x8, ffff800010970488 <__primary_switch+0x70>
ffff80001097028c: d61f0100  br   x8

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
---
 arch/arm64/kernel/head.S | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Marc Zyngier April 18, 2022, 9:16 a.m. UTC | #1
Hi Shanker,

On Fri, 15 Apr 2022 18:05:03 +0100,
Shanker Donthineni <sdonthineni@nvidia.com> wrote:
> 
> [...]
>
> Problem analysis with disassembly (vmlinux):
>  1) Both __primary_switch() and enter_vhe() are part of the identity region
>  2) RELA entries and enter_vhe() share the same cache line ffff800010970480
>  3) Memory ffff800010970484-ffff800010970498 is updated with the EL1 MMU enabled
>  4) The CPU fetches instructions of enter_vhe() with the EL2 MMU disabled
>    - Non-coherent access causes cache line ffff800010970480 to be dropped

Non-coherent? You mean non-cacheable, right? At this stage, we only
have a single CPU, so I'm not sure coherency is the problem here. When
you say 'drop', is that an eviction? Replaced by what? By the previous
version of the cache line, containing the stale value?

It is also unclear to me how the instruction fetches directly
influence what happens *on the other CPUs*. Is this line kept at a
level beyond the PoU? Are we talking of a system cache here? It would
really help if you could describe your cache topology.

>  5) A secondary core executes 'ldr x8, =__secondary_switched'
>    - Getting a data abort because of the incorrect value at ffff800010970488

My interpretation of the above is as follows:

- CPU0 performs the RELA update with the MMU on

- A switch to EL2 with the MMU off results in the cache line sitting
  beyond the PoU and containing the RELA update to be replaced with
  the *stale* version (the fetch happening on the i-side).

- CPU1 (with its MMU on) fetches the stale data from the cache

Is this correct?

What is unclear to me is why the eviction occurs. Yes, this is of
course allowed, and the code is wrong for assuming any odd behaviour.
But then, why only clean the idmap? I have the feeling that by this
standard, we should see this a lot more often. Or are we just lucky
that we don't have any other examples of data and instructions sharing
the same cache line and accessed with different cacheability
attributes?

Thanks,

	M.
Shanker Donthineni April 18, 2022, 12:53 p.m. UTC | #2
Thanks, Marc, for your comments.

On 4/18/22 4:16 AM, Marc Zyngier wrote:
> Hi Shanker,
>
> On Fri, 15 Apr 2022 18:05:03 +0100,
> Shanker Donthineni <sdonthineni@nvidia.com> wrote:
>> [...]
>>
>> Problem analysis with disassembly (vmlinux):
>>  1) Both __primary_switch() and enter_vhe() are part of the identity region
>>  2) RELA entries and enter_vhe() share the same cache line ffff800010970480
>>  3) Memory ffff800010970484-ffff800010970498 is updated with the EL1 MMU enabled
>>  4) The CPU fetches instructions of enter_vhe() with the EL2 MMU disabled
>>    - Non-coherent access causes cache line ffff800010970480 to be dropped
> Non-coherent? You mean non-cacheable, right? At this stage, we only
> have a single CPU, so I'm not sure coherency is the problem here. When
> you say 'drop', is that an eviction? Replaced by what? By the previous
> version of the cache line, containing the stale value?
Yes, non-cacheable. The cache line covering the function enter_vhe() was
marked shareable & write-back cacheable as a result of the write to the RELA
entry; the same cache line is then fetched as non-shareable & non-cacheable.
The eviction is not pushed to the PoC and gets dropped because of the
cache-line attribute mismatch.
 
> It is also unclear to me how the instruction fetches directly
> influence what happens *on the other CPUs*. Is this line kept at a
> level beyond the PoU? Are we talking of a system cache here? It would
> really help if you could describe your cache topology.
>
>>  5) A secondary core executes 'ldr x8, =__secondary_switched'
>>    - Getting a data abort because of the incorrect value at ffff800010970488
> My interpretation of the above is as follows:
>
> - CPU0 performs the RELA update with the MMU on
>
> - A switch to EL2 with the MMU off results in the cache line sitting
>   beyond the PoU and containing the RELA update to be replaced with
>   the *stale* version (the fetch happening on the i-side).
>
> - CPU1 (with its MMU on) fetches the stale data from the cache
>
> Is this correct?
Yes, correct.

> What is unclear to me is why the eviction occurs. Yes, this is of
> course allowed, and the code is wrong for assuming any odd behaviour.
> But then,
The eviction happens naturally; I don't know the time or the exact line
of code at which it occurs.
>  why only clean the idmap?
It seems only the EL2 hyp-stub code accesses the RELA-patched idmap memory
with the MMU disabled.

I did two other experiments with which the issue was not reproducible:
 1) CONFIG_RELOCATABLE=n
 2) Changing the line "ldr x8, =__secondary_switched" to the following,
    which computes the address instead of loading a RELA-patched literal:

    adr_l   x9, _text
    adr_l   x8, __secondary_switched
    sub     x8, x8, x9
    mov_q   x9, KIMAGE_VADDR
    add     x8, x8, x9
>  I have the feeling that by this
> standard, we should see this a lot more often. Or are we just lucky
> that we don't have any other examples of data and instructions sharing
> the same cache line and accessed with different cacheability
> attributes?
I think in all other places either the MMU is enabled or the
instructions/data have been pushed to the PoC.

> Thanks,
>
>         M.
Ard Biesheuvel April 20, 2022, 8:05 a.m. UTC | #3
On Fri, 15 Apr 2022 at 19:05, Shanker Donthineni <sdonthineni@nvidia.com> wrote:
>
> Secondary core boot gets stuck due to a data abort while executing the
> instruction 'ldr x8, =__secondary_switched'. The RELA value used by this
> instruction was updated by the primary boot core in __relocate_kernel(),
> but those memory updates are not visible to the secondary CPUs after they
> call switch_to_vhe(), which causes the problem.
>
> [...]
>
>
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 6a98f1a38c29a..b5786163697bb 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -462,6 +462,16 @@ SYM_FUNC_START_LOCAL(__primary_switched)
>         ldp     x29, x30, [sp], #16             // we must enable KASLR, return
>         ret                                     // to __primary_switch()
>  0:
> +#endif
> +#ifdef CONFIG_RELOCATABLE
> +       /*
> +        * Since the RELA entries of the identity-mapped region are updated
> +        * with MMU enabled, clean and invalidate those entries to avoid
> +        * cache inconsistency while accessing with MMU disabled in hyp-stub.
> +        */
> +       adrp    x0, __idmap_text_start
> +       adr_l   x1, __idmap_text_end
> +       bl      dcache_clean_inval_poc
>  #endif
>         bl      switch_to_vhe                   // Prefer VHE if possible
>         ldp     x29, x30, [sp], #16

Thanks for the elaborate report.

I'd prefer to fix this by moving the literal out of the ID map
entirely. Does the below also fix your issue?

diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 6a98f1a38c29..97134d6f78ff 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -639,10 +639,15 @@ SYM_FUNC_START_LOCAL(secondary_startup)
        bl      __cpu_setup                     // initialise processor
        adrp    x1, swapper_pg_dir
        bl      __enable_mmu
-       ldr     x8, =__secondary_switched
+       ldr_l   x8, .L__secondary_switched
        br      x8
 SYM_FUNC_END(secondary_startup)

+       .pushsection ".rodata", "a", %progbits
+.L__secondary_switched:
+       .quad   __secondary_switched
+       .popsection
+
 SYM_FUNC_START_LOCAL(__secondary_switched)
        adr_l   x5, vectors
        msr     vbar_el1, x5
Will Deacon April 20, 2022, 10:08 a.m. UTC | #4
Hi Shanker,

On Mon, Apr 18, 2022 at 07:53:20AM -0500, Shanker R Donthineni wrote:
> On 4/18/22 4:16 AM, Marc Zyngier wrote:
> > External email: Use caution opening links or attachments

Ok.

> > On Fri, 15 Apr 2022 18:05:03 +0100,
> > Shanker Donthineni <sdonthineni@nvidia.com> wrote:
> >> [...]
> >>
> >> Problem analysis with disassembly (vmlinux):
> >>  1) Both __primary_switch() and enter_vhe() are part of the identity region
> >>  2) RELA entries and enter_vhe() share the same cache line ffff800010970480
> >>  3) Memory ffff800010970484-ffff800010970498 is updated with the EL1 MMU enabled
> >>  4) The CPU fetches instructions of enter_vhe() with the EL2 MMU disabled
> >>    - Non-coherent access causes cache line ffff800010970480 to be dropped
> > Non-coherent? You mean non-cacheable, right? At this stage, we only
> > have a single CPU, so I'm not sure coherency is the problem here. When
> > you say 'drop', is that an eviction? Replaced by what? By the previous
> > version of the cache line, containing the stale value?
> Yes, non-cacheable. The cache line covering the function enter_vhe() was
> marked shareable & write-back cacheable as a result of the write to the RELA
> entry; the same cache line is then fetched as non-shareable & non-cacheable.
> The eviction is not pushed to the PoC and gets dropped because of the
> cache-line attribute mismatch.

I'm really struggling to understand this. Why is the instruction fetch
non-shareable? I'm trying to align your observations with the rules about
mismatched aliases in the architecture and I'm yet to satisfy myself that
the CPU is allowed to drop a dirty line on the floor in response to an
unexpected hit.

My mental model (which seems to align with Marc) is something like:


1. The boot CPU fetches the line via a cacheable mapping and dirties it
   in its L1 as part of applying the relocations.

2. The boot CPU then enters EL2 with the MMU off and fetches the same
   line on the I-side. AFAICT, the architecture says this is done with
   outer-shareable, non-cacheable attributes.

3. !!! Somehow the instruction fetch results in the _dirty_ line from (1)
       being discarded !!!

4. A secondary CPU later on fetches the line via a cacheable mapping and
   explodes because the relocation hasn't been applied.


Is that what you're seeing? If so, we really need more insight into what
is going on at step (3) because it feels like it could have a much more
significant impact than the issue we're trying to fix here. How is the line
dropped? Is it due to back invalidation from a shared cache? Is it due to
IDC snooping? Does the line actually stick around on the D-side, but somehow
the I-side is shared between CPUs?

Many questions...

Will
Shanker Donthineni April 20, 2022, 1:15 p.m. UTC | #5
Thanks, Ard,

On 4/20/22 3:05 AM, Ard Biesheuvel wrote:
>
> On Fri, 15 Apr 2022 at 19:05, Shanker Donthineni <sdonthineni@nvidia.com> wrote:
>> [...]
> Thanks for the elaborate report.
>
> I'd prefer to fix this by moving the literal out of the ID map
> entirely. Does the below also fix your issue?
>
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 6a98f1a38c29..97134d6f78ff 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -639,10 +639,15 @@ SYM_FUNC_START_LOCAL(secondary_startup)
>         bl      __cpu_setup                     // initialise processor
>         adrp    x1, swapper_pg_dir
>         bl      __enable_mmu
> -       ldr     x8, =__secondary_switched
> +       ldr_l   x8, .L__secondary_switched
>         br      x8
>  SYM_FUNC_END(secondary_startup)
>
> +       .pushsection ".rodata", "a", %progbits
> +.L__secondary_switched:
> +       .quad   __secondary_switched
> +       .popsection
> +
>  SYM_FUNC_START_LOCAL(__secondary_switched)
>         adr_l   x5, vectors
>         msr     vbar_el1, x5

It should work; I'll test it and share the results with you. I think the
literal for __primary_switched() can also be moved out of the
identity-mapped region so that its RELA entry is not part of the idmap.
Shanker Donthineni April 20, 2022, 3:02 p.m. UTC | #6
Hi Will,

On 4/20/22 5:08 AM, Will Deacon wrote:
>>> Non-coherent? You mean non-cacheable, right? At this stage, we only
>>> have a single CPU, so I'm not sure coherency is the problem here. When
>>> you say 'drop', is that an eviction? Replaced by what? By the previous
>>> version of the cache line, containing the stale value?
>> Yes, non-cacheable. The cache line covering the function enter_vhe() was
>> marked shareable & write-back cacheable as a result of the write to the RELA
>> entry; the same cache line is then fetched as non-shareable & non-cacheable.
>> The eviction is not pushed to the PoC and gets dropped because of the
>> cache-line attribute mismatch.
> I'm really struggling to understand this. Why is the instruction fetch
> non-shareable? I'm trying to align your observations with the rules about
> mismatched aliases in the architecture and I'm yet to satisfy myself that
> the CPU is allowed to drop a dirty line on the floor in response to an
> unexpected hit.
>
> My mental model (which seems to align with Marc) is something like:
>
>
> 1. The boot CPU fetches the line via a cacheable mapping and dirties it
>    in its L1 as part of applying the relocations.

The ARM CPU core sends a ReadNotSharedDirty CHI command to the LLC (last-level
cache). The cache line is marked as checked-out in the LLC, which is used to
keep track of coherency.

> 2. The boot CPU then enters EL2 with the MMU off and fetches the same
>    line on the I-side. AFAICT, the architecture says this is done with
>    outer-shareable, non-cacheable attributes.

The ARM CPU core sends a ReadNoSnoop CHI command when the MMU is disabled.
The cache line marked in step 1) becomes invalid in the LLC. As per the
ARM-ARM specification, a CMO is recommended whenever memory attributes
are changed for a given memory region.

With my understanding, the CPU core must generate coherent accesses for
Shareable+Cacheable memory, but the spec is not clear about OSH+non-cacheable
regions.

Are you expecting "OSH+non-cacheable" must generate coherent accesses?

> 3. !!! Somehow the instruction fetch results in the _dirty_ line from (1)
>        being discarded !!!

The ARM CPU sends a WritebackFull CHI command to the LLC for the cache line
that was marked invalid in step 2). The write command is ignored/dropped.

> 4. A secondary CPU later on fetches the line via a cacheable mapping and
>    explodes because the relocation hasn't been applied.
>
The modified data was already dropped in step 3.

> Is that what you're seeing? If so, we really need more insight into what
> is going on at step (3) because it feels like it could have a much more
> significant impact than the issue we're trying to fix here.

The actual problem starts at step 2, when the CPU executes 'b enter_vhe'.

Will Deacon April 21, 2022, 11:47 a.m. UTC | #7
Hi Shanker,

First of all, thanks for continuing to share so many useful details here.
I'm pretty confident we'll eventually get to the bottom of this.

On Wed, Apr 20, 2022 at 10:02:13AM -0500, Shanker R Donthineni wrote:
> On 4/20/22 5:08 AM, Will Deacon wrote:
> >>> Non-coherent? You mean non-cacheable, right? At this stage, we only
> >>> have a single CPU, so I'm not sure coherency is the problem here. When
> >>> you say 'drop', is that an eviction? Replaced by what? By the previous
> >>> version of the cache line, containing the stale value?
> >> Yes, non-cacheable. The cache line covering the function enter_vhe() was
> >> marked shareable & write-back cacheable as a result of the write to the RELA
> >> entry; the same cache line is then fetched as non-shareable & non-cacheable.
> >> The eviction is not pushed to the PoC and gets dropped because of the
> >> cache-line attribute mismatch.
> > I'm really struggling to understand this. Why is the instruction fetch
> > non-shareable? I'm trying to align your observations with the rules about
> > mismatched aliases in the architecture and I'm yet to satisfy myself that
> > the CPU is allowed to drop a dirty line on the floor in response to an
> > unexpected hit.
> >
> > My mental model (which seems to align with Marc) is something like:
> >
> >
> > 1. The boot CPU fetches the line via a cacheable mapping and dirties it
> >    in its L1 as part of applying the relocations.
> 
> The ARM CPU core sends a ReadNotSharedDirty CHI command to the LLC (last-level
> cache). The cache line is marked as checked-out in the LLC, which is used to
> keep track of coherency.

Oh man, that CHI stuff always sends me into a spin so I'm probably not the
right person to debug it, but let's see...

> > 2. The boot CPU then enters EL2 with the MMU off and fetches the same
> >    line on the I-side. AFAICT, the architecture says this is done with
> >    outer-shareable, non-cacheable attributes.
> 
> The ARM CPU core sends a ReadNoSnoop CHI command when the MMU is disabled.
> The cache line marked in step 1) becomes invalid in the LLC.

When you say "invalid" here, do you mean in the same sense as "cache
invalidation"? Why is that the right thing to do in response to a _read_?

Would the same invalidation happen if the ReadNoSnoop originated from a
different CPU?

> As per the ARM-ARM specification, a CMO is recommended whenever memory
> attributes are changed for a given memory region.

The mismatched attribute section (B2.8) is quite a lot more nuanced than
that, and in particular I don't see it allowing for writes to be lost
in the presence of a mismatched read. Then again, the whole section is
hard to follow and doesn't really call out instruction fetch other than
in a note at the end.

> With my understanding, the CPU core must generate coherent accesses for
> Shareable+Cacheable memory, but the spec is not clear about OSH+non-cacheable
> regions.

Ok, but we don't need the reloc to be visible to the MMU-off fetch in
the case you're describing so coherency between the two aliases is not
required. The problem is that the cacheable data is being lost forever,
so even cacheable accesses can't retrieve it.

> Are you expecting "OSH+non-cacheable" must generate coherent accesses?
> 
> > 3. !!! Somehow the instruction fetch results in the _dirty_ line from (1)
> >        being discarded !!!
> 
> The ARM CPU sends a WritebackFull CHI command to the LLC for the cache line
> that was marked invalid in step 2). The write command is ignored/dropped.

Is this because the LLC is inclusive and doesn't know what to do with the
writeback after invalidating the line earlier on? When else would a
writeback get silently dropped?

Stepping back a bit, my concern with all of this is that it seems as though
non-cacheable instruction fetches that alias with dirty lines on the data
side can lead to data loss. This feels like a significantly broader issue
than the relocation case you've run into, particularly as instruction fetch
with the MMU off is still permitted to fetch speculatively from within a
couple of pages of the PC. Cache maintenance doesn't feel like a viable
solution in general, as we'd need to ensure that no fetch occurs between
the write and the CMO, and this could occur on another CPU, off the back
of an IRQ or even a FIQ (with the fetch being made at a different exception
level).

Given that you have all the low-level details about the problem, it would
probably be worth pulling Arm into this so we can figure out what the right
solution is. I'm happy to be involved with that, if you like.

Will
Shanker Donthineni May 5, 2022, 11:42 a.m. UTC | #8
Hi Will,

On 4/21/22 6:47 AM, Will Deacon wrote:
> Hi Shanker,
>
> First of all, thanks for continuing to share so many useful details here.
> I'm pretty confident we'll eventually get to the bottom of this.
>
> On Wed, Apr 20, 2022 at 10:02:13AM -0500, Shanker R Donthineni wrote:
>> On 4/20/22 5:08 AM, Will Deacon wrote:
>>>>> Non-coherent? You mean non-cacheable, right? At this stage, we only
>>>>> have a single CPU, so I'm not sure coherency is the problem here. When
>>>>> you say 'drop', is that an eviction? Replaced by what? By the previous
>>>>> version of the cache line, containing the stale value?
>>>> Yes, non-cacheable. The cache line covering the function enter_vhe() was
>>>> marked shareable & write-back cacheable as a result of the write to the RELA
>>>> entry; the same cache line is then fetched as non-shareable & non-cacheable.
>>>> The eviction is not pushed to the PoC and gets dropped because of the
>>>> cache-line attribute mismatch.
>>> I'm really struggling to understand this. Why is the instruction fetch
>>> non-shareable? I'm trying to align your observations with the rules about
>>> mismatched aliases in the architecture and I'm yet to satisfy myself that
>>> the CPU is allowed to drop a dirty line on the floor in response to an
>>> unexpected hit.
>>>
>>> My mental model (which seems to align with Marc) is something like:
>>>
>>>
>>> 1. The boot CPU fetches the line via a cacheable mapping and dirties it
>>>    in its L1 as part of applying the relocations.
>> The ARM CPU core sends a ReadNotSharedDirty CHI command to the LLC (last-level
>> cache). The cache line is marked as checked-out in the LLC, which is used to
>> keep track of coherency.
> Oh man, that CHI stuff always sends me into a spin so I'm probably not the
> right person to debug it, but let's see...
>
>>> 2. The boot CPU then enters EL2 with the MMU off and fetches the same
>>>    line on the I-side. AFAICT, the architecture says this is done with
>>>    outer-shareable, non-cacheable attributes.
>> The ARM CPU core sends a ReadNoSnoop CHI command when the MMU is disabled.
>> The cache line marked in step 1) becomes invalid in the LLC.
> When you say "invalid" here, do you mean in the same sense as "cache
> invalidation"? Why is that the right thing to do in response to a _read_?
>
> Would the same invalidation happen if the ReadNoSnoop originated from a
> different CPU?
>
There is no issue if other cores are issuing ReadNoSnoop commands.

>> As per the ARM-ARM specification, a CMO is recommended whenever memory
>> attributes are changed for a given memory region.
> The mismatched attribute section (B2.8) is quite a lot more nuanced than
> that, and in particular I don't see it allowing for writes to be lost
> in the presence of a mismatched read. Then again, the whole section is
> hard to follow and doesn't really call out instruction fetch other than
> in a note at the end.
>
>> With my understanding, the CPU core must generate coherent accesses for
>> Shareable+Cacheable memory, but the spec is not clear about OSH+non-cacheable
>> regions.
> Ok, but we don't need the reloc to be visible to the MMU-off fetch in
> the case you're describing so coherency between the two aliases is not
> required. The problem is that the cacheable data is being lost forever,
> so even cacheable accesses can't retrieve it.
>
>> Are you expecting "OSH+non-cacheable" must generate coherent accesses?
>>
>>> 3. !!! Somehow the instruction fetch results in the _dirty_ line from (1)
>>>        being discarded !!!
>> The ARM CPU sends a WritebackFull CHI command to the LLC for the cache line
>> that was marked invalid in step 2). The write command is ignored/dropped.
> Is this because the LLC is inclusive and doesn't know what to do with the
> writeback after invalidating the line earlier on? When else would a
> writeback get silently dropped?
>
> Stepping back a bit, my concern with all of this is that it seems as though
> non-cacheable instruction fetches that alias with dirty lines on the data
> side can lead to data loss. This feels like a significantly broader issue
> than the relocation case you've run into, particularly as instruction fetch
> with the MMU off is still permitted to fetch speculatively from within a
> couple of pages of the PC. Cache maintenance doesn't feel like a viable
> solution in general, as we'd need to ensure that no fetch occurs between
> the write and the CMO, and this could occur on another CPU, off the back
> of an IRQ or even a FIQ (with the fetch being made at a different exception
> level).
>
> Given that you have all the low-level details about the problem, it would
> probably be worth pulling Arm into this so we can figure out what the right
> solution is. I'm happy to be involved with that, if you like.
>
>
Thanks, Will, for your valuable comments on the dropping of valid cache
lines. I've forwarded your responses to the hardware team and reviewed the
CHI and ARM-ARM specifications one more time. Our CPU team has agreed to
fix this in hardware by converting the ReadNoSnoop command to ReadSnoop
(step 2).

Thanks,
Shanker Donthineni

Patch

diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 6a98f1a38c29a..b5786163697bb 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -462,6 +462,16 @@  SYM_FUNC_START_LOCAL(__primary_switched)
 	ldp	x29, x30, [sp], #16		// we must enable KASLR, return
 	ret					// to __primary_switch()
 0:
+#endif
+#ifdef CONFIG_RELOCATABLE
+	/*
+	 * Since the RELA entries of the identity-mapped region are updated
+	 * with MMU enabled, clean and invalidate those entries to avoid
+	 * cache inconsistency while accessing with MMU disabled in hyp-stub.
+	 */
+	adrp	x0, __idmap_text_start
+	adr_l	x1, __idmap_text_end
+	bl	dcache_clean_inval_poc
 #endif
 	bl	switch_to_vhe			// Prefer VHE if possible
 	ldp	x29, x30, [sp], #16