| Message ID | 20220415170504.3781878-1-sdonthineni@nvidia.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | arm64: head: Fix cache inconsistency of the identity-mapped region |
Hi Shanker,

On Fri, 15 Apr 2022 18:05:03 +0100, Shanker Donthineni <sdonthineni@nvidia.com> wrote:

> The secondary cores' boot is stuck due to a data abort while executing
> the instruction 'ldr x8, =__secondary_switched'. The RELA value of this
> instruction was updated by a primary boot core from __relocate_kernel(),
> but those memory updates are not visible to CPUs after calling
> switch_to_vhe(), causing the problem.
>
> The cacheable/shareable attributes of the identity-mapped regions are
> different while the CPU is executing in EL1 (MMU enabled) and, for a
> short period of time, in the hyp-stub (EL2, MMU disabled). As per the
> Arm ARM specification (DDI0487G_b), this is not allowed.
>
> G5.10.3 Cache maintenance requirement:
> "If the change affects the cacheability attributes of the area of memory,
> including any change between Write-Through and Write-Back attributes,
> software must ensure that any cached copies of affected locations are
> removed from the caches, typically by cleaning and invalidating the
> locations from the levels of cache that might hold copies of the
> locations affected by the attribute change."
>
> Clean+invalidate the identity-mapped region to the PoC before switching
> to the VHE world to fix the cache inconsistency.
>
> Problem analysis with disassembly (vmlinux):
> 1) Both __primary_switch() and enter_vhe() are part of the identity region
> 2) The RELA entries and enter_vhe() share the same cache line, ffff800010970480
> 3) Memory ffff800010970484-ffff800010970498 is updated with the EL1 MMU enabled
> 4) The CPU fetches instructions of enter_vhe() with the EL2 MMU disabled
>    - The non-coherent access causes cache line ffff800010970480 to be dropped

Non-coherent? You mean non-cacheable, right? At this stage, we only have a single CPU, so I'm not sure coherency is the problem here.

When you say 'drop', is that an eviction? Replaced by what? By the previous version of the cache line, containing the stale value?

It is also unclear to me how the instruction fetches directly influence what happens *on the other CPUs*. Is this line kept at a level beyond the PoU? Are we talking of a system cache here? It would really help if you could describe your cache topology.

> 5) A secondary core executes 'ldr x8, __secondary_switched'
>    - It gets a data abort because of the incorrect value at ffff800010970488

My interpretation of the above is as follows:

- CPU0 performs the RELA update with the MMU on.

- A switch to EL2 with the MMU off results in the cache line sitting
  beyond the PoU and containing the RELA update being replaced with
  the *stale* version (the fetch happening on the I-side).

- CPU1 (with its MMU on) fetches the stale data from the cache.

Is this correct?

What is unclear to me is why the eviction occurs. Yes, this is of course allowed, and the code is wrong for assuming any odd behaviour. But then, why only clean the idmap? I have the feeling that by this standard, we should see this a lot more often. Or are we just lucky that we don't have any other examples of data and instructions sharing the same cache line and accessed with different cacheability attributes?

Thanks,

M.
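[Editor's note] Marc's three-step interpretation can be condensed into a toy model. The sketch below is purely illustrative: it models a single cache level in which a non-cacheable fetch that hits a dirty line discards the line without writeback (the behaviour under debate), and it makes no claim about what any real microarchitecture is required to do. All names and addresses are hypothetical.

```python
# Toy model of the suspected failure: a dirty line produced by a cacheable
# write is discarded, not written back, when the same line is later hit by
# a non-cacheable (MMU-off) instruction fetch. Purely illustrative.

poc = {0x970480: "stale RELA"}      # hypothetical point-of-coherency contents
cache = {}                          # line -> (data, dirty)

def cacheable_write(line, data):
    """CPU0 applies a relocation with the MMU on: the line is dirtied."""
    cache[line] = (data, True)

def mmu_off_ifetch(line):
    """CPU0 fetches enter_vhe() at EL2 with the MMU off: the non-cacheable
    access bypasses the cache and, in the reported behaviour, the dirty
    copy is invalidated without being written back to the PoC."""
    cache.pop(line, None)
    return poc[line]

def cacheable_read(line):
    """CPU1 later reads the literal with its MMU on."""
    if line in cache:
        return cache[line][0]
    return poc[line]                # miss: only the stale PoC copy is left

cacheable_write(0x970480, "__secondary_switched")   # __relocate_kernel()
mmu_off_ifetch(0x970480)                            # enter_vhe() fetch
print(cacheable_read(0x970480))                     # -> stale RELA
```

If the MMU-off fetch instead forced a clean (writeback) before invalidating, the final read would observe the relocated value, which is the crux of the disagreement in the rest of the thread.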
Thanks Marc for your comments.

On 4/18/22 4:16 AM, Marc Zyngier wrote:
> External email: Use caution opening links or attachments
>
> Hi Shanker,
>
> On Fri, 15 Apr 2022 18:05:03 +0100, Shanker Donthineni <sdonthineni@nvidia.com> wrote:
>> The secondary cores' boot is stuck due to a data abort while executing
>> the instruction 'ldr x8, =__secondary_switched'. [...]
>>
>> 4) The CPU fetches instructions of enter_vhe() with the EL2 MMU disabled
>>    - The non-coherent access causes cache line ffff800010970480 to be dropped
>
> Non-coherent? You mean non-cacheable, right? At this stage, we only
> have a single CPU, so I'm not sure coherency is the problem here. When
> you say 'drop', is that an eviction? Replaced by what? By the previous
> version of the cache line, containing the stale value?

Yes, non-cacheable. The cache line corresponding to the function enter_vhe() was marked shareable & WB-cacheable as a result of the write to the RELA entry; the same cache line is then fetched as non-shareable & non-cacheable. The eviction is not pushed to the PoC and gets dropped because of the cache-line attribute mismatch.

> It is also unclear to me how the instruction fetches directly
> influence what happens *on the other CPUs*. Is this line kept at a
> level beyond the PoU? Are we talking of a system cache here? It would
> really help if you could describe your cache topology.
>
>> 5) A secondary core executes 'ldr x8, __secondary_switched'
>>    - It gets a data abort because of the incorrect value at ffff800010970488
>
> My interpretation of the above is as follows:
>
> - CPU0 performs the RELA update with the MMU on
>
> - A switch to EL2 with the MMU off results in the cache line sitting
>   beyond the PoU and containing the RELA update to be replaced with
>   the *stale* version (the fetch happening on the i-side).
>
> - CPU1 (with its MMU on) fetches the stale data from the cache
>
> Is this correct?

Yes, correct.

> What is unclear to me is why the eviction occurs. Yes, this is of
> course allowed, and the code is wrong for assuming any odd behaviour.

The eviction is happening naturally; I don't know the time, or the exact line of code, at which the eviction occurs.

> But then, why only clean the idmap?

It seems only the EL2 hyp-stub code accesses the RELA entry of the idmap with the MMU disabled. I did 2 other experiments, and the issue was not reproducible with either:

1) CONFIG_RELOCATABLE=n
2) Change the line "ldr x8, =__secondary_switched" to:

	adr_l	x9, _text
	adr_l	x8, __secondary_switched
	sub	x8, x8, x9
	mov_q	x9, KIMAGE_VADDR
	add	x8, x8, x9

> I have the feeling that by this
> standard, we should see this a lot more often. Or are we just lucky
> that we don't have any other examples of data and instructions sharing
> the same cache line and accessed with different cacheability
> attributes?

I think in all other places either the MMU is enabled or the instructions/data are pushed to the PoC.

> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.
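[Editor's note] Shanker's second experiment replaces the relocated literal with pure PC-relative arithmetic, so nothing in the idmap has to be patched at boot. The address computation performed by the five-instruction sequence can be sketched numerically; all concrete addresses below are hypothetical and chosen only for illustration.

```python
# What the adr_l/sub/mov_q/add sequence computes, with made-up addresses.
KIMAGE_VADDR = 0xFFFF800010000000        # link-time virtual base of the kernel

# PC-relative addresses as observed while running from the identity map
# (hypothetical physical addresses; no literal pool, no RELA entry):
text_pc     = 0x0000000080080000         # adr_l x9, _text
switched_pc = 0x00000000809F04A0         # adr_l x8, __secondary_switched

offset = switched_pc - text_pc           # sub   x8, x8, x9
vaddr  = KIMAGE_VADDR + offset           # mov_q x9, KIMAGE_VADDR; add x8, x8, x9

print(hex(vaddr))                        # -> 0xffff8000109704a0
```

Because every input is either PC-relative or an immediate, the result is correct regardless of whether the RELA updates have reached the PoC, which is consistent with this variant not reproducing the issue.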
On Fri, 15 Apr 2022 at 19:05, Shanker Donthineni <sdonthineni@nvidia.com> wrote:
>
> The secondary cores' boot is stuck due to a data abort while executing
> the instruction 'ldr x8, =__secondary_switched'. The RELA value of this
> instruction was updated by a primary boot core from __relocate_kernel(),
> but those memory updates are not visible to CPUs after calling
> switch_to_vhe(), causing the problem.
>
> The cacheable/shareable attributes of the identity-mapped regions are
> different while the CPU is executing in EL1 (MMU enabled) and, for a
> short period of time, in the hyp-stub (EL2, MMU disabled). As per the
> Arm ARM specification (DDI0487G_b), this is not allowed. [...]
>
> Clean+invalidate the identity-mapped region to the PoC before switching
> to the VHE world to fix the cache inconsistency.
>
> Problem analysis with disassembly (vmlinux):
> 1) Both __primary_switch() and enter_vhe() are part of the identity region
> 2) The RELA entries and enter_vhe() share the same cache line, ffff800010970480
> 3) Memory ffff800010970484-ffff800010970498 is updated with the EL1 MMU enabled
> 4) The CPU fetches instructions of enter_vhe() with the EL2 MMU disabled
>    - The non-coherent access causes cache line ffff800010970480 to be dropped
> 5) A secondary core executes 'ldr x8, __secondary_switched'
>    - It gets a data abort because of the incorrect value at ffff800010970488
>
> ffff800010970418 <__primary_switch>:
> ffff800010970418: d503245f 	bti	c
> ffff80001097041c: aa0003f3 	mov	x19, x0
> ffff800010970420: d5381014 	mrs	x20, sctlr_el1
> ffff800010970424: 90003c81 	adrp	x1, ffff800011100000 <init_pg_dir>
> ffff800010970428: 97ffffc4 	bl	ffff800010970338 <__enable_mmu>
> ffff80001097042c: 97ffffe8 	bl	ffff8000109703cc <__relocate_kernel>
> ffff800010970430: 58000308 	ldr	x8, ffff800010970490 <__primary_switch+0x78>
> ffff800010970434: 90ffb480 	adrp	x0, ffff800010000000 <_text>
> ffff800010970438: d63f0100 	blr	x8
> ffff80001097043c: d5033fdf 	isb
> ffff800010970440: d5181014 	msr	sctlr_el1, x20
> ffff800010970444: d5033fdf 	isb
> ffff800010970448: 940f7efe 	bl	ffff800010d50040 <__create_page_tables>
> ffff80001097044c: d508871f 	tlbi	vmalle1
> ffff800010970450: d503379f 	dsb	nsh
> ffff800010970454: d5033fdf 	isb
> ffff800010970458: d5181013 	msr	sctlr_el1, x19
> ffff80001097045c: d5033fdf 	isb
> ffff800010970460: d508751f 	ic	iallu
> ffff800010970464: d503379f 	dsb	nsh
> ffff800010970468: d5033fdf 	isb
> ffff80001097046c: 97ffffd8 	bl	ffff8000109703cc <__relocate_kernel>
> ffff800010970470: 58000108 	ldr	x8, ffff800010970490 <__primary_switch+0x78>
> ffff800010970474: 90ffb480 	adrp	x0, ffff800010000000 <_text>
> ffff800010970478: d61f0100 	br	x8
> ffff80001097047c: 00df10c8 	.word	0x00df10c8
> ffff800010970480: 000dfba8 	.word	0x000dfba8
> ...
> ffff800010970498: d51cd041 	msr	tpidr_el2, x1
> ffff80001097049c: d503201f 	nop
>
> ffff8000109704a0 <enter_vhe>:
> ffff8000109704a0: d508871f 	tlbi	vmalle1
> ffff8000109704a4: d503379f 	dsb	nsh
> ffff8000109704a8: d5033fdf 	isb
> ffff8000109704ac: d53d1000 	mrs	x0, sctlr_el12
> ffff8000109704b0: d5181000 	msr	sctlr_el1, x0
> ffff8000109704b4: d5033fdf 	isb
> ffff8000109704b8: d508751f 	ic	iallu
> ffff8000109704bc: d503379f 	dsb	nsh
> ffff8000109704c0: d5033fdf 	isb
> ffff8000109704c4: d2a60a00 	mov	x0, #0x30500000
> ffff8000109704c8: f2810000 	movk	x0, #0x800
> ffff8000109704cc: d51d1000 	msr	sctlr_el12, x0
> ffff8000109704d0: aa1f03e0 	mov	x0, xzr
> ffff8000109704d4: d69f03e0 	eret
>
> ffff800010961850 <mutate_to_vhe>:
> ffff800010961850: d53c1001 	mrs	x1, sctlr_el2
> ffff800010961854: 370001c1 	tbnz	w1, #0, ffff80001096188c <mutate_to_vhe+0x3c>
> ffff800010961858: d5380721 	mrs	x1, id_aa64mmfr1_el1
> ...
> ffff80001096190c: aa010000 	orr	x0, x0, x1
> ffff800010961910: d5184000 	msr	spsr_el1, x0
> ffff800010961914: 14003ae3 	b	ffff8000109704a0 <enter_vhe>
>
> ffff800010970270 <secondary_startup>:
> ffff800010970270: d503245f 	bti	c
> ffff800010970274: 97dab23a 	bl	ffff80001001cb5c <switch_to_vhe>
> ffff800010970278: 94000049 	bl	ffff80001097039c <__cpu_secondary_check52bitva>
> ffff80001097027c: 94000145 	bl	ffff800010970790 <__cpu_setup>
> ffff800010970280: 90001e81 	adrp	x1, ffff800010d40000 <swapper_pg_dir>
> ffff800010970284: 9400002d 	bl	ffff800010970338 <__enable_mmu>
> ffff800010970288: 58001008 	ldr	x8, ffff800010970488 <__primary_switch+0x70>
> ffff80001097028c: d61f0100 	br	x8
>
> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
> ---
>  arch/arm64/kernel/head.S | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 6a98f1a38c29a..b5786163697bb 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -462,6 +462,16 @@ SYM_FUNC_START_LOCAL(__primary_switched)
>  	ldp	x29, x30, [sp], #16		// we must enable KASLR, return
>  	ret					// to __primary_switch()
>  0:
> +#endif
> +#ifdef CONFIG_RELOCATABLE
> +	/*
> +	 * Since the RELA entries of the identity-mapped region are updated
> +	 * with MMU enabled, clean and invalidate those entries to avoid
> +	 * cache inconsistency while accessing with MMU disabled in hyp-stub.
> +	 */
> +	adrp	x0, __idmap_text_start
> +	adr_l	x1, __idmap_text_end
> +	bl	dcache_clean_inval_poc
>  #endif
>  	bl	switch_to_vhe			// Prefer VHE if possible
>  	ldp	x29, x30, [sp], #16

Thanks for the elaborate report.

I'd prefer to fix this by moving the literal out of the ID map entirely. Does the below also fix your issue?

diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 6a98f1a38c29..97134d6f78ff 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -639,10 +639,15 @@ SYM_FUNC_START_LOCAL(secondary_startup)
 	bl	__cpu_setup			// initialise processor
 	adrp	x1, swapper_pg_dir
 	bl	__enable_mmu
-	ldr	x8, =__secondary_switched
+	ldr_l	x8, .L__secondary_switched
 	br	x8
 SYM_FUNC_END(secondary_startup)
 
+	.pushsection ".rodata", "a", %progbits
+.L__secondary_switched:
+	.quad	__secondary_switched
+	.popsection
+
 SYM_FUNC_START_LOCAL(__secondary_switched)
 	adr_l	x5, vectors
 	msr	vbar_el1, x5
Hi Shanker,

On Mon, Apr 18, 2022 at 07:53:20AM -0500, Shanker R Donthineni wrote:
> On 4/18/22 4:16 AM, Marc Zyngier wrote:
> > External email: Use caution opening links or attachments

Ok.

> > On Fri, 15 Apr 2022 18:05:03 +0100, Shanker Donthineni <sdonthineni@nvidia.com> wrote:
> >> The secondary cores' boot is stuck due to a data abort while executing
> >> the instruction 'ldr x8, =__secondary_switched'. [...]
> >>
> >> 4) The CPU fetches instructions of enter_vhe() with the EL2 MMU disabled
> >>    - The non-coherent access causes cache line ffff800010970480 to be dropped
> >
> > Non-coherent? You mean non-cacheable, right? At this stage, we only
> > have a single CPU, so I'm not sure coherency is the problem here. When
> > you say 'drop', is that an eviction? Replaced by what? By the previous
> > version of the cache line, containing the stale value?
>
> Yes, non-cacheable. The cache line corresponding to the function enter_vhe()
> was marked shareable & WB-cacheable as a result of the write to the RELA
> entry; the same cache line is then fetched as non-shareable & non-cacheable.
> The eviction is not pushed to the PoC and gets dropped because of the
> cache-line attribute mismatch.

I'm really struggling to understand this. Why is the instruction fetch non-shareable? I'm trying to align your observations with the rules about mismatched aliases in the architecture, and I'm yet to satisfy myself that the CPU is allowed to drop a dirty line on the floor in response to an unexpected hit.

My mental model (which seems to align with Marc's) is something like:

1. The boot CPU fetches the line via a cacheable mapping and dirties it
   in its L1 as part of applying the relocations.

2. The boot CPU then enters EL2 with the MMU off and fetches the same
   line on the I-side. AFAICT, the architecture says this is done with
   outer-shareable, non-cacheable attributes.

3. !!! Somehow the instruction fetch results in the _dirty_ line from (1)
   being discarded !!!

4. A secondary CPU later on fetches the line via a cacheable mapping and
   explodes because the relocation hasn't been applied.

Is that what you're seeing? If so, we really need more insight into what is going on at step (3), because it feels like it could have a much more significant impact than the issue we're trying to fix here. How is the line dropped? Is it due to back-invalidation from a shared cache? Is it due to IDC snooping? Does the line actually stick around on the D-side, but somehow the I-side is shared between CPUs?

Many questions...

Will
Thanks Ard,

On 4/20/22 3:05 AM, Ard Biesheuvel wrote:
> On Fri, 15 Apr 2022 at 19:05, Shanker Donthineni <sdonthineni@nvidia.com> wrote:
>> The secondary cores' boot is stuck due to a data abort while executing
>> the instruction 'ldr x8, =__secondary_switched'. [...]
>>
>> Problem analysis with disassembly (vmlinux): [...]
>>
>> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
>> ---
>>  arch/arm64/kernel/head.S | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>> [...]
>
> Thanks for the elaborate report.
>
> I'd prefer to fix this by moving the literal out of the ID map
> entirely. Does the below also fix your issue?
>
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 6a98f1a38c29..97134d6f78ff 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -639,10 +639,15 @@ SYM_FUNC_START_LOCAL(secondary_startup)
>  	bl	__cpu_setup			// initialise processor
>  	adrp	x1, swapper_pg_dir
>  	bl	__enable_mmu
> -	ldr	x8, =__secondary_switched
> +	ldr_l	x8, .L__secondary_switched
>  	br	x8
>  SYM_FUNC_END(secondary_startup)
>
> +	.pushsection ".rodata", "a", %progbits
> +.L__secondary_switched:
> +	.quad	__secondary_switched
> +	.popsection
> +
>  SYM_FUNC_START_LOCAL(__secondary_switched)
>  	adr_l	x5, vectors
>  	msr	vbar_el1, x5

It should work; I'll test and share the results with you. I think the literal for __primary_switched() can also be moved out of the RELA-patched identity-mapped region.
Hi Will,

On 4/20/22 5:08 AM, Will Deacon wrote:
> I'm really struggling to understand this. Why is the instruction fetch
> non-shareable? I'm trying to align your observations with the rules about
> mismatched aliases in the architecture, and I'm yet to satisfy myself that
> the CPU is allowed to drop a dirty line on the floor in response to an
> unexpected hit.
>
> My mental model (which seems to align with Marc's) is something like:
>
> 1. The boot CPU fetches the line via a cacheable mapping and dirties it
>    in its L1 as part of applying the relocations.

The Arm CPU core sends a ReadNotSharedDirty CHI command to the LLC (last-level cache). This cache line is marked as checked out in the LLC, which is used to keep track of coherency.

> 2. The boot CPU then enters EL2 with the MMU off and fetches the same
>    line on the I-side. AFAICT, the architecture says this is done with
>    outer-shareable, non-cacheable attributes.

The Arm CPU core sends a ReadNoSnoop CHI command when the MMU is disabled. The cache line that was marked in step 1) becomes invalid in the LLC.

As per the Arm ARM specification, a CMO is recommended whenever the memory attributes are changed for a given memory region. With my understanding, the CPU core must generate coherent accesses for shareable+cacheable memory, but the spec is not clear about OSH+non-cacheable regions. Are you expecting that "OSH+non-cacheable" must also generate coherent accesses?

> 3. !!! Somehow the instruction fetch results in the _dirty_ line from (1)
>    being discarded !!!

The Arm CPU sends a WritebackFull CHI command to the LLC for the cache line which was marked invalid in step 2). The write command is ignored/dropped.

> 4. A secondary CPU later on fetches the line via a cacheable mapping and
>    explodes because the relocation hasn't been applied.

The modified data was dropped in step 3.

> Is that what you're seeing? If so, we really need more insight into what
> is going on at step (3), because it feels like it could have a much more
> significant impact than the issue we're trying to fix here.

The actual problem starts from step 2, when the CPU executes 'b enter_vhe'.

> How is the line dropped? Is it due to back-invalidation from a shared
> cache? Is it due to IDC snooping? Does the line actually stick around on
> the D-side, but somehow the I-side is shared between CPUs?
>
> Many questions...
>
> Will
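[Editor's note] The CHI-level sequence described in this reply can be condensed into a toy model. This is a simplified, hypothetical rendering of the reported LLC behaviour (tracked lines, a ReadNoSnoop untracking a line, and a later WritebackFull to an untracked line being dropped); it is not a description of the CHI protocol itself.

```python
# Toy model of the reported LLC behaviour, heavily simplified from CHI.
llc_tracked = set()                 # lines the LLC has checked out to a CPU
poc = {0x480: "stale"}              # point-of-coherency contents

def read_not_shared_dirty(line):
    """Step 1: cacheable read-for-write; the LLC starts tracking the line."""
    llc_tracked.add(line)
    return poc[line]

def read_no_snoop(line):
    """Step 2: MMU-off instruction fetch; the tracked line becomes invalid."""
    llc_tracked.discard(line)
    return poc[line]

def writeback_full(line, data):
    """Step 3: natural eviction of the dirty line from the CPU."""
    if line in llc_tracked:
        poc[line] = data
    # else: dropped -- the LLC no longer recognises the line

read_not_shared_dirty(0x480)        # __relocate_kernel() dirties the line
read_no_snoop(0x480)                # enter_vhe() fetch with the MMU off
writeback_full(0x480, "relocated")  # eviction silently discarded
print(poc[0x480])                   # -> stale: the RELA update is lost
```

In this model the relocated value never reaches the PoC, so a secondary CPU's cacheable read in step 4 can only observe the stale data, matching the reported data abort.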
Hi Shanker,

First of all, thanks for continuing to share so many useful details here. I'm pretty confident we'll eventually get to the bottom of this.

On Wed, Apr 20, 2022 at 10:02:13AM -0500, Shanker R Donthineni wrote:
> On 4/20/22 5:08 AM, Will Deacon wrote:
> > [...]
> > My mental model (which seems to align with Marc's) is something like:
> >
> > 1. The boot CPU fetches the line via a cacheable mapping and dirties it
> >    in its L1 as part of applying the relocations.
>
> The Arm CPU core sends a ReadNotSharedDirty CHI command to the LLC
> (last-level cache). This cache line is marked as checked out in the LLC,
> which is used to keep track of coherency.

Oh man, that CHI stuff always sends me into a spin, so I'm probably not the right person to debug it, but let's see...

> > 2. The boot CPU then enters EL2 with the MMU off and fetches the same
> >    line on the I-side. AFAICT, the architecture says this is done with
> >    outer-shareable, non-cacheable attributes.
>
> The Arm CPU core sends a ReadNoSnoop CHI command when the MMU is disabled.
> The cache line that was marked in step 1) becomes invalid in the LLC.

When you say "invalid" here, do you mean it in the same sense as "cache invalidation"? Why is that the right thing to do in response to a _read_?

Would the same invalidation happen if the ReadNoSnoop originated from a different CPU?

> As per the Arm ARM specification, a CMO is recommended whenever the memory
> attributes are changed for a given memory region.

The mismatched-attributes section (B2.8) is quite a lot more nuanced than that, and in particular I don't see it allowing for writes to be lost in the presence of a mismatched read. Then again, the whole section is hard to follow and doesn't really call out instruction fetch other than in a note at the end.

> With my understanding, the CPU core must generate coherent accesses for
> shareable+cacheable memory, but the spec is not clear about
> OSH+non-cacheable regions.

Ok, but we don't need the reloc to be visible to the MMU-off fetch in the case you're describing, so coherency between the two aliases is not required. The problem is that the cacheable data is being lost forever, so even cacheable accesses can't retrieve it.

> Are you expecting that "OSH+non-cacheable" must generate coherent accesses?
>
> > 3. !!! Somehow the instruction fetch results in the _dirty_ line from (1)
> >    being discarded !!!
>
> The Arm CPU sends a WritebackFull CHI command to the LLC for the cache line
> which was marked invalid in step 2). The write command is ignored/dropped.

Is this because the LLC is inclusive and doesn't know what to do with the writeback after invalidating the line earlier on? When else would a writeback get silently dropped?

Stepping back a bit, my concern with all of this is that it seems as though non-cacheable instruction fetches that alias with dirty lines on the data side can lead to data loss. This feels like a significantly broader issue than the relocation case you've run into, particularly as instruction fetch with the MMU off is still permitted to fetch speculatively from within a couple of pages of the PC.

Cache maintenance doesn't feel like a viable solution in general, as we'd need to ensure that no fetch occurs between the write and the CMO, and this could occur on another CPU, off the back of an IRQ or even a FIQ (with the fetch being made at a different exception level).

Given that you have all the low-level details about the problem, it would probably be worth pulling Arm into this so we can figure out what the right solution is. I'm happy to be involved with that, if you like.

Will
Hi Will,

On 4/21/22 6:47 AM, Will Deacon wrote:
> External email: Use caution opening links or attachments
>
>
> Hi Shanker,
>
> First of all, thanks for continuing to share so many useful details here.
> I'm pretty confident we'll eventually get to the bottom of this.
>
> On Wed, Apr 20, 2022 at 10:02:13AM -0500, Shanker R Donthineni wrote:
>> On 4/20/22 5:08 AM, Will Deacon wrote:
>>>>> Non-coherent? You mean non-cacheable, right? At this stage, we only
>>>>> have a single CPU, so I'm not sure coherency is the problem here. When
>>>>> you say 'drop', is that an eviction? Replaced by what? By the previous
>>>>> version of the cache line, containing the stale value?
>>>> Yes, non-cacheable. The cache line corresponding to function enter_vhe() was
>>>> marked with shareable & WB-cache as a result of write to RELA, the same cache
>>>> line is being fetched with non-shareable & non-cacheable. The eviction is
>>>> not pushed to PoC and got dropped because of cache-line attributes mismatch.
>>> I'm really struggling to understand this. Why is the instruction fetch
>>> non-shareable? I'm trying to align your observations with the rules about
>>> mismatched aliases in the architecture and I'm yet to satisfy myself that
>>> the CPU is allowed to drop a dirty line on the floor in response to an
>>> unexpected hit.
>>>
>>> My mental model (which seems to align with Marc) is something like:
>>>
>>>
>>> 1. The boot CPU fetches the line via a cacheable mapping and dirties it
>>>    in its L1 as part of applying the relocations.
>> ARM-CPU core is sending ReadNotSharedDirty CHI command to LLC (last-level cache).
>> This cache-line is marked as checked-out in LLC, would be used to keep track
>> of coherency.
> Oh man, that CHI stuff always sends me into a spin so I'm probably not the
> right person to debug it, but let's see...
>
>>> 2. The boot CPU then enters EL2 with the MMU off and fetches the same
>>>    line on the I-side. AFAICT, the architecture says this is done with
>>>    outer-shareable, non-cacheable attributes.
>> ARM-CPU core is sending ReadNoSnoop CHI command when MMU disabled. The
>> marked cache-line from the step 1) become invalid in LLC.
> When you say "invalid" here, do you mean in the same sense as "cache
> invalidation"? Why is that the right thing to do in response to a _read_?
>
> Would the same invalidation happen if the ReadNoSnoop originated from a
> different CPU?
>
There is no issue if other cores are issuing ReadNoSnoop commands.

>> As per the ARM-ARM specification, CMO is recommended whenever memory
>> attributes are changed for a given memory region.
> The mismatched attribute section (B2.8) is quite a lot more nuanced than
> that, and in particular I don't see it allowing for writes to be lost
> in the presence of a mismatched read. Then again, the whole section is
> hard to follow and doesn't really call out instruction fetch other than
> in a note at the end.
>
>> With my understanding the CPU core must generate coherent accesses for
>> Shared+Cacheable memory but not clear for OSH+non-cacheable regions
>> in the spec.
> Ok, but we don't need the reloc to be visible to the MMU-off fetch in
> the case you're describing so coherency between the two aliases is not
> required. The problem is that the cacheable data is being lost forever,
> so even cacheable accesses can't retrieve it.
>
>> Are you expecting "OSH+non-cacheable" must generate coherent accesses?
>>
>>> 3. !!! Somehow the instruction fetch results in the _dirty_ line from (1)
>>>    being discarded !!!
>> The ARM-CPU is sending WritebackFull CHI command to LLC for the cache line
>> which was marked as invalid from step 2). The write CMD is ignored/dropped.
> Is this because the LLC is inclusive and doesn't know what to do with the
> writeback after invalidating the line earlier on? When else would a
> writeback get silently dropped?
>
> Stepping back a bit, my concern with all of this is that it seems as though
> non-cacheable instruction fetches that alias with dirty lines on the data
> side can lead to data loss. This feels like a significantly broader issue
> than the relocation case you've run into, particularly as instruction fetch
> with the MMU off is still permitted to fetch speculatively from within a
> couple of pages of the PC. Cache maintenance doesn't feel like a viable
> solution in general, as we'd need to ensure that no fetch occurs between
> the write and the CMO, and this could occur on another CPU, off the back
> of an IRQ or even a FIQ (with the fetch being made at a different exception
> level).
>
> Given that you have all the low-level details about the problem, it would
> probably be worth pulling Arm into this so we can figure out what the right
> solution is. I'm happy to be involved with that, if you like.
>
Thanks Will for your valuable comments on dropping valid cache lines. I've
forwarded your responses to the hardware team and reviewed the CHI and
ARM-ARM specifications one more time. Our CPU team has agreed to fix this
in hardware by converting the ReadNoSnoop command to ReadSnoop (step 2).

Thanks,
Shanker Donthineni
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 6a98f1a38c29a..b5786163697bb 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -462,6 +462,16 @@ SYM_FUNC_START_LOCAL(__primary_switched)
 	ldp	x29, x30, [sp], #16		// we must enable KASLR, return
 	ret					// to __primary_switch()
 0:
+#endif
+#ifdef CONFIG_RELOCATABLE
+	/*
+	 * Since the RELA entries of the identity-mapped region are updated
+	 * with MMU enabled, clean and invalidate those entries to avoid
+	 * cache inconsistency while accessing with MMU disabled in hyp-stub.
+	 */
+	adrp	x0, __idmap_text_start
+	adr_l	x1, __idmap_text_end
+	bl	dcache_clean_inval_poc
 #endif
 	bl	switch_to_vhe			// Prefer VHE if possible
 	ldp	x29, x30, [sp], #16
The secondary cores' boot is stuck due to a data abort while executing the
instruction 'ldr x8, =__secondary_switched'. The RELA value of this
instruction was updated by the primary boot core from __relocate_kernel(),
but those memory updates are not visible to CPUs after calling
switch_to_vhe(), causing the problem.

The cacheable/shareable attributes of the identity-mapped regions are
different while the CPU is executing in EL1 (MMU enabled) and, for a short
period of time, in the hyp-stub (EL2, MMU disabled). As per the ARM-ARM
specification (DDI0487G_b), this is not allowed.

G5.10.3 Cache maintenance requirement:
 "If the change affects the cacheability attributes of the area of memory,
  including any change between Write-Through and Write-Back attributes,
  software must ensure that any cached copies of affected locations are
  removed from the caches, typically by cleaning and invalidating the
  locations from the levels of cache that might hold copies of the locations
  affected by the attribute change."

Clean+invalidate the identity-mapped region to the PoC before switching to
the VHE world to fix the cache inconsistency.
Problem analysis with disassembly (vmlinux):

1) Both __primary_switch() and enter_vhe() are part of the identity region
2) RELA entries and enter_vhe() are sharing the same cache line
   ffff800010970480
3) Memory ffff800010970484-ffff800010970498 is updated with EL1-MMU enabled
4) CPU fetches instructions of enter_vhe() with EL2-MMU disabled
   - Non-cacheable access causing the cache line ffff800010970480 to be
     dropped
5) Secondary core executes 'ldr x8, __secondary_switched'
   - Getting a data abort because of the incorrect value at ffff800010970488

ffff800010970418 <__primary_switch>:
ffff800010970418: d503245f 	bti	c
ffff80001097041c: aa0003f3 	mov	x19, x0
ffff800010970420: d5381014 	mrs	x20, sctlr_el1
ffff800010970424: 90003c81 	adrp	x1, ffff800011100000 <init_pg_dir>
ffff800010970428: 97ffffc4 	bl	ffff800010970338 <__enable_mmu>
ffff80001097042c: 97ffffe8 	bl	ffff8000109703cc <__relocate_kernel>
ffff800010970430: 58000308 	ldr	x8, ffff800010970490 <__primary_switch+0x78>
ffff800010970434: 90ffb480 	adrp	x0, ffff800010000000 <_text>
ffff800010970438: d63f0100 	blr	x8
ffff80001097043c: d5033fdf 	isb
ffff800010970440: d5181014 	msr	sctlr_el1, x20
ffff800010970444: d5033fdf 	isb
ffff800010970448: 940f7efe 	bl	ffff800010d50040 <__create_page_tables>
ffff80001097044c: d508871f 	tlbi	vmalle1
ffff800010970450: d503379f 	dsb	nsh
ffff800010970454: d5033fdf 	isb
ffff800010970458: d5181013 	msr	sctlr_el1, x19
ffff80001097045c: d5033fdf 	isb
ffff800010970460: d508751f 	ic	iallu
ffff800010970464: d503379f 	dsb	nsh
ffff800010970468: d5033fdf 	isb
ffff80001097046c: 97ffffd8 	bl	ffff8000109703cc <__relocate_kernel>
ffff800010970470: 58000108 	ldr	x8, ffff800010970490 <__primary_switch+0x78>
ffff800010970474: 90ffb480 	adrp	x0, ffff800010000000 <_text>
ffff800010970478: d61f0100 	br	x8
ffff80001097047c: 00df10c8 	.word	0x00df10c8
ffff800010970480: 000dfba8 	.word	0x000dfba8
	...
ffff800010970498: d51cd041 	msr	tpidr_el2, x1
ffff80001097049c: d503201f 	nop

ffff8000109704a0 <enter_vhe>:
ffff8000109704a0: d508871f 	tlbi	vmalle1
ffff8000109704a4: d503379f 	dsb	nsh
ffff8000109704a8: d5033fdf 	isb
ffff8000109704ac: d53d1000 	mrs	x0, sctlr_el12
ffff8000109704b0: d5181000 	msr	sctlr_el1, x0
ffff8000109704b4: d5033fdf 	isb
ffff8000109704b8: d508751f 	ic	iallu
ffff8000109704bc: d503379f 	dsb	nsh
ffff8000109704c0: d5033fdf 	isb
ffff8000109704c4: d2a60a00 	mov	x0, #0x30500000
ffff8000109704c8: f2810000 	movk	x0, #0x800
ffff8000109704cc: d51d1000 	msr	sctlr_el12, x0
ffff8000109704d0: aa1f03e0 	mov	x0, xzr
ffff8000109704d4: d69f03e0 	eret

ffff800010961850 <mutate_to_vhe>:
ffff800010961850: d53c1001 	mrs	x1, sctlr_el2
ffff800010961854: 370001c1 	tbnz	w1, #0, ffff80001096188c <mutate_to_vhe+0x3c>
ffff800010961858: d5380721 	mrs	x1, id_aa64mmfr1_el1
	...
ffff80001096190c: aa010000 	orr	x0, x0, x1
ffff800010961910: d5184000 	msr	spsr_el1, x0
ffff800010961914: 14003ae3 	b	ffff8000109704a0 <enter_vhe>

ffff800010970270 <secondary_startup>:
ffff800010970270: d503245f 	bti	c
ffff800010970274: 97dab23a 	bl	ffff80001001cb5c <switch_to_vhe>
ffff800010970278: 94000049 	bl	ffff80001097039c <__cpu_secondary_check52bitva>
ffff80001097027c: 94000145 	bl	ffff800010970790 <__cpu_setup>
ffff800010970280: 90001e81 	adrp	x1, ffff800010d40000 <swapper_pg_dir>
ffff800010970284: 9400002d 	bl	ffff800010970338 <__enable_mmu>
ffff800010970288: 58001008 	ldr	x8, ffff800010970488 <__primary_switch+0x70>
ffff80001097028c: d61f0100 	br	x8

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
---
 arch/arm64/kernel/head.S | 10 ++++++++++
 1 file changed, 10 insertions(+)