Message ID | 20230127035616.508966-1-aik@amd.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [Question,kernel] x86/amd/sev/nmi+vc: Fix stack handling (why is this happening?) | expand |
On Fri, Jan 27, 2023 at 02:56:16PM +1100, Alexey Kardashevskiy wrote: > AMD SEV-ES guest kernels compiled without CONFIG_PARAVIRT crash when > "perf" executes. "./perf record sleep 20" is an example. > > Some debugging revealed this happens when CONFIG_PARAVIRT_XXL is not > defined. The problem seems to be that between DEFINE_IDTENTRY_RAW(exc_nmi) > and actual reading of DB7 (which in turn causes #VC) every function is > inlined Very much on purpose. > and no stack frame is created (?). Silly compilers ;-) I think you can force it to by using inline asm with a rsp dependency or somesuch. > Replacing __always_inline with > noinline in local_db_save() or native_get_debugreg() fixes the problem. It will create others. > Why is this crash happening and how to fix that? I am still reading > the assembly but was hoping for a shortcut here :) Thanks, Welcome to the wonderful shit show that is x86 exceptions :/ I thought sev_es_*() is supposed to fix this. Joerg, any clue? > aik-Standard-PC-i440FX-PIIX-1996 login: [A[ 15.775303] BUG: NMI stack guard page was hit at 0000000003983d50 (stack is 000000007feb1fa4..00000000574369c2) > [ 15.775314] stack guard page: 0000 [#1] PREEMPT SMP NOPTI > [ 15.775316] CPU: 0 PID: 790 Comm: sleep Not tainted 6.2.0-rc4_aik-debugswap_ruby-954bhost #73 > [ 15.775322] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown > [ 15.775323] RIP: 0010:error_entry+0x17/0x140 > [ 15.775326] Code: f8 e9 98 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 41 53 53 <55> 41 54 41 55 41 56 41 57 56 31 f6 31 d2 31 c9 45 31 c0 45 31 c9 > [ 15.775328] RSP: 0000:fffffe2446b2b000 EFLAGS: 00010097 > [ 15.775332] RAX: fffffe2446b2b0a8 RBX: 0000000000000000 RCX: ffffffffb3a00fed > [ 15.775333] RDX: 0000000000000000 RSI: ffffffffb3a00b69 RDI: fffffe2446b2b0a8 > [ 15.775336] RBP: fffffe2446b2b0a8 R08: 0000000000000000 R09: 0000000000000000 > [ 15.775337] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 > [ 15.775338] R13: 0000000000000000 R14: 000000000002dd80 R15: 0000000000000000 > [ 15.775339] FS: 0000000000000000(0000) GS:ffff94b17dc00000(0000) knlGS:ffff94b17dc00000 > [ 15.775340] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 15.775341] CR2: fffffe2446b2aff8 CR3: 00080000167b8000 CR4: 00000000003506f0 > [ 15.775342] Call Trace: > [ 15.775352] <NMI> <snip> > [ 15.775495] ? asm_exc_page_fault+0x22/0x30 > [ 15.775496] ? restore_regs_and_return_to_kernel+0x22/0x22 > [ 15.775497] ? exc_page_fault+0x11/0x120 > [ 15.775499] ? asm_exc_page_fault+0x22/0x30 > [ 15.775500] ? check_preemption_disabled+0x8/0xe0 > [ 15.775502] ? __sev_es_ist_enter+0x13/0x100 > [ 15.775503] ? exc_nmi+0x10e/0x150 > [ 15.775505] ? end_repeat_nmi+0x16/0x67 > [ 15.775506] ? asm_exc_double_fault+0x30/0x30 > [ 15.775507] ? asm_exc_double_fault+0x30/0x30 > [ 15.775508] ? asm_exc_double_fault+0x30/0x30 > [ 15.775509] </NMI> > [ 15.775509] <#VC> > [ 15.775510] ? __show_regs.cold+0x18e/0x23d > [ 15.775511] </#VC> > [ 15.775511] <#DF> > [ 15.775512] ? __die_body.cold+0x1a/0x1f > [ 15.775513] ? die+0x26/0x40 > [ 15.775517] ? handle_stack_overflow+0x44/0x60 > [ 15.775518] ? exc_double_fault+0x14b/0x180 > [ 15.775519] ? asm_exc_double_fault+0x1f/0x30 > [ 15.775520] ? restore_regs_and_return_to_kernel+0x22/0x22 > [ 15.775521] ? asm_exc_page_fault+0x9/0x30 > [ 15.775522] ? error_entry+0x17/0x140 > [ 15.775523] </#DF> > [ 15.775523] WARNING: stack recursion on stack type 6 > [ 15.775524] Modules linked in: msr efivarfs > [ 15.837935] ---[ end trace 0000000000000000 ]--- > > Signed-off-by: Alexey Kardashevskiy <aik@amd.com> > --- > arch/x86/include/asm/debugreg.h | 29 ++++++++++++++++++++ > arch/x86/kernel/nmi.c | 2 +- > 2 files changed, 30 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h > index b049d950612f..396e2ea114dc 100644 > --- a/arch/x86/include/asm/debugreg.h > +++ b/arch/x86/include/asm/debugreg.h > @@ -125,6 +125,35 @@ static __always_inline void local_db_restore(unsigned long dr7) > set_debugreg(dr7, 7); > } > > +/* noinline here inline __always_inline'd native_get_debugreg(int regno) */ > +static noinline unsigned long native_get_debugreg7(void) > +{ > + unsigned long dr7; > + asm("mov %%db7, %0" :"=r" (dr7)); > + return dr7; > +} > + > +static __always_inline unsigned long local_db_save_exc_nmi(void) > +{ > + unsigned long dr7; > + > + if (static_cpu_has(X86_FEATURE_HYPERVISOR) && !hw_breakpoint_active()) > + return 0; > + > + dr7 = native_get_debugreg7(); > + dr7 &= ~0x400; /* architecturally set bit */ > + if (dr7) > + set_debugreg(0, 7); > + /* > + * Ensure the compiler doesn't lower the above statements into > + * the critical section; disabling breakpoints late would not > + * be good. > + */ > + barrier(); > + > + return dr7; > +} This is broken, and building with DEBUG_ENTRY=y would've told you. > + > #ifdef CONFIG_CPU_SUP_AMD > extern void set_dr_addr_mask(unsigned long mask, int dr); > #else > diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c > index cec0bfa3bc04..400b5b6b74f6 100644 > --- a/arch/x86/kernel/nmi.c > +++ b/arch/x86/kernel/nmi.c > @@ -503,7 +503,7 @@ DEFINE_IDTENTRY_RAW(exc_nmi) > */ > sev_es_ist_enter(regs); > > - this_cpu_write(nmi_dr7, local_db_save()); > + this_cpu_write(nmi_dr7, local_db_save_exc_nmi()); > > irq_state = irqentry_nmi_enter(regs); So what I don't get is why sev_es_ist_enter() doesn't cause us to make a stack frame, that has an actual call in it (although admittedly conditional).
On Fri, Jan 27, 2023 at 10:08:14AM +0100, Peter Zijlstra wrote: > Welcome to the wonderful shit show that is x86 exceptions :/ > > I thought sev_es_*() is supposed to fix this. Joerg, any clue? Hmm, no, not yet, the stack-trace doesn't make much sense to me. The sev_es_* function calls in the NMI path are for re-enabling NMI and adjusting the #VC IST stack to allow nested VCs. Alexey, can you try to get a more stable backtrace? For example by building the kernel with frame pointers? Regards, Joerg
On 27/1/23 21:37, Joerg Roedel wrote: > On Fri, Jan 27, 2023 at 10:08:14AM +0100, Peter Zijlstra wrote: >> Welcome to the wonderful shit show that is x86 exceptions :/ >> >> I thought sev_es_*() is supposed to fix this. Joerg, any clue? > > Hmm, no, not yet, the stack-trace doesn't make much sense to me. The > sev_es_* function calls in the NMI path are for re-enabling NMI and > adjusting the #VC IST stack to allow nested VCs. > > Alexey, can you try to get a more stable backtrace? For example by > building the kernel with frame pointers? Do you mean these guys (below)? aik@aik-Standard-PC-i440FX-PIIX-1996:~$ grep FRAME_POINTER /boot/config-$(uname -r) CONFIG_SCHED_OMIT_FRAME_POINTER=y CONFIG_FRAME_POINTER=y CONFIG_UNWINDER_FRAME_POINTER=y Here is the complete output of that VM (200k so not in the email): https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc Note that the original long backtrace appears more than once, I just did not copy all of that in the first email.
On 27/1/23 20:08, Peter Zijlstra wrote: > On Fri, Jan 27, 2023 at 02:56:16PM +1100, Alexey Kardashevskiy wrote: >> AMD SEV-ES guest kernels compiled without CONFIG_PARAVIRT crash when >> "perf" executes. "./perf record sleep 20" is an example. >> >> Some debugging revealed this happens when CONFIG_PARAVIRT_XXL is not >> defined. The problem seems to be that between DEFINE_IDTENTRY_RAW(exc_nmi) >> and actual reading of DB7 (which in turn causes #VC) every function is >> inlined > > Very much on purpose. > >> and no stack frame is created (?). > > Silly compilers ;-) I think you can force it to by using inline asm with > a rsp dependency or somesuch. > >> Replacing __always_inline with >> noinline in local_db_save() or native_get_debugreg() fixes the problem. > > It will create others. > >> Why is this crash happening and how to fix that? I am still reading >> the assembly but was hoping for a shortcut here :) Thanks, > > Welcome to the wonderful shit show that is x86 exceptions :/ > > I thought sev_es_*() is supposed to fix this. Joerg, any clue? > >> aik-Standard-PC-i440FX-PIIX-1996 login: [A[ 15.775303] BUG: NMI stack guard page was hit at 0000000003983d50 (stack is 000000007feb1fa4..00000000574369c2) >> [ 15.775314] stack guard page: 0000 [#1] PREEMPT SMP NOPTI >> [ 15.775316] CPU: 0 PID: 790 Comm: sleep Not tainted 6.2.0-rc4_aik-debugswap_ruby-954bhost #73 >> [ 15.775322] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown >> [ 15.775323] RIP: 0010:error_entry+0x17/0x140 >> [ 15.775326] Code: f8 e9 98 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 41 53 53 <55> 41 54 41 55 41 56 41 57 56 31 f6 31 d2 31 c9 45 31 c0 45 31 c9 >> [ 15.775328] RSP: 0000:fffffe2446b2b000 EFLAGS: 00010097 >> [ 15.775332] RAX: fffffe2446b2b0a8 RBX: 0000000000000000 RCX: ffffffffb3a00fed >> [ 15.775333] RDX: 0000000000000000 RSI: ffffffffb3a00b69 RDI: fffffe2446b2b0a8 >> [ 15.775336] RBP: fffffe2446b2b0a8 R08: 0000000000000000 R09: 0000000000000000 >> [ 15.775337] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 >> [ 15.775338] R13: 0000000000000000 R14: 000000000002dd80 R15: 0000000000000000 >> [ 15.775339] FS: 0000000000000000(0000) GS:ffff94b17dc00000(0000) knlGS:ffff94b17dc00000 >> [ 15.775340] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 15.775341] CR2: fffffe2446b2aff8 CR3: 00080000167b8000 CR4: 00000000003506f0 >> [ 15.775342] Call Trace: >> [ 15.775352] <NMI> > > <snip> > >> [ 15.775495] ? asm_exc_page_fault+0x22/0x30 >> [ 15.775496] ? restore_regs_and_return_to_kernel+0x22/0x22 >> [ 15.775497] ? exc_page_fault+0x11/0x120 >> [ 15.775499] ? asm_exc_page_fault+0x22/0x30 >> [ 15.775500] ? check_preemption_disabled+0x8/0xe0 >> [ 15.775502] ? __sev_es_ist_enter+0x13/0x100 >> [ 15.775503] ? exc_nmi+0x10e/0x150 >> [ 15.775505] ? end_repeat_nmi+0x16/0x67 >> [ 15.775506] ? asm_exc_double_fault+0x30/0x30 >> [ 15.775507] ? asm_exc_double_fault+0x30/0x30 >> [ 15.775508] ? asm_exc_double_fault+0x30/0x30 >> [ 15.775509] </NMI> >> [ 15.775509] <#VC> >> [ 15.775510] ? __show_regs.cold+0x18e/0x23d >> [ 15.775511] </#VC> >> [ 15.775511] <#DF> >> [ 15.775512] ? __die_body.cold+0x1a/0x1f >> [ 15.775513] ? die+0x26/0x40 >> [ 15.775517] ? handle_stack_overflow+0x44/0x60 >> [ 15.775518] ? exc_double_fault+0x14b/0x180 >> [ 15.775519] ? asm_exc_double_fault+0x1f/0x30 >> [ 15.775520] ? restore_regs_and_return_to_kernel+0x22/0x22 >> [ 15.775521] ? asm_exc_page_fault+0x9/0x30 >> [ 15.775522] ? error_entry+0x17/0x140 >> [ 15.775523] </#DF> >> [ 15.775523] WARNING: stack recursion on stack type 6 >> [ 15.775524] Modules linked in: msr efivarfs >> [ 15.837935] ---[ end trace 0000000000000000 ]--- >> >> Signed-off-by: Alexey Kardashevskiy <aik@amd.com> >> --- >> arch/x86/include/asm/debugreg.h | 29 ++++++++++++++++++++ >> arch/x86/kernel/nmi.c | 2 +- >> 2 files changed, 30 insertions(+), 1 deletion(-) >> >> diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h >> index b049d950612f..396e2ea114dc 100644 >> --- a/arch/x86/include/asm/debugreg.h >> +++ b/arch/x86/include/asm/debugreg.h >> @@ -125,6 +125,35 @@ static __always_inline void local_db_restore(unsigned long dr7) >> set_debugreg(dr7, 7); >> } >> >> +/* noinline here inline __always_inline'd native_get_debugreg(int regno) */ >> +static noinline unsigned long native_get_debugreg7(void) >> +{ >> + unsigned long dr7; >> + asm("mov %%db7, %0" :"=r" (dr7)); >> + return dr7; >> +} >> + >> +static __always_inline unsigned long local_db_save_exc_nmi(void) >> +{ >> + unsigned long dr7; >> + >> + if (static_cpu_has(X86_FEATURE_HYPERVISOR) && !hw_breakpoint_active()) >> + return 0; >> + >> + dr7 = native_get_debugreg7(); >> + dr7 &= ~0x400; /* architecturally set bit */ >> + if (dr7) >> + set_debugreg(0, 7); >> + /* >> + * Ensure the compiler doesn't lower the above statements into >> + * the critical section; disabling breakpoints late would not >> + * be good. >> + */ >> + barrier(); >> + >> + return dr7; >> +} > > This is broken, and building with DEBUG_ENTRY=y would've told you. Huh, good to know. Is this it telling me so? vmlinux.o: warning: objtool: exc_nmi+0x73: call to native_get_debugreg7() leaves .noinstr.text section >> + >> #ifdef CONFIG_CPU_SUP_AMD >> extern void set_dr_addr_mask(unsigned long mask, int dr); >> #else >> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c >> index cec0bfa3bc04..400b5b6b74f6 100644 >> --- a/arch/x86/kernel/nmi.c >> +++ b/arch/x86/kernel/nmi.c >> @@ -503,7 +503,7 @@ DEFINE_IDTENTRY_RAW(exc_nmi) >> */ >> sev_es_ist_enter(regs); >> >> - this_cpu_write(nmi_dr7, local_db_save()); >> + this_cpu_write(nmi_dr7, local_db_save_exc_nmi()); >> >> irq_state = irqentry_nmi_enter(regs); > > So what I don't get is why sev_es_ist_enter() doesn't cause us to make a > stack frame, that has an actual call in it (although admittedly > conditional). Is not the frame gone when sev_es_ist_enter() (which does not get inlined as per objdump's "ffffffff81bd4550 <__sev_es_ist_enter>: ") returned?
On Fri, Jan 27, 2023 at 11:13:38PM +1100, Alexey Kardashevskiy wrote: > > This is broken, and building with DEBUG_ENTRY=y would've told you. > > > Huh, good to know. Is this it telling me so? > > vmlinux.o: warning: objtool: exc_nmi+0x73: call to native_get_debugreg7() > leaves .noinstr.text section > Yep. The ramification of all that is that by calling non-noinstr code (double negative, iow, regular instrumented code) is that you can end up in the tracers/*SAN/breakpoints etc.. code -- something we're very much not ready for at this point. > > > + > > > #ifdef CONFIG_CPU_SUP_AMD > > > extern void set_dr_addr_mask(unsigned long mask, int dr); > > > #else > > > diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c > > > index cec0bfa3bc04..400b5b6b74f6 100644 > > > --- a/arch/x86/kernel/nmi.c > > > +++ b/arch/x86/kernel/nmi.c > > > @@ -503,7 +503,7 @@ DEFINE_IDTENTRY_RAW(exc_nmi) > > > */ > > > sev_es_ist_enter(regs); > > > - this_cpu_write(nmi_dr7, local_db_save()); > > > + this_cpu_write(nmi_dr7, local_db_save_exc_nmi()); > > > irq_state = irqentry_nmi_enter(regs); > > > > So what I don't get is why sev_es_ist_enter() doesn't cause us to make a > > stack frame, that has an actual call in it (although admittedly > > conditional). > > Is not the frame gone when sev_es_ist_enter() (which does not get inlined as > per objdump's "ffffffff81bd4550 <__sev_es_ist_enter>: > ") returned? Well, returning would consume the callframe, but the stack setup of the caller should remain. Let me go stare at some asm.
On Fri, Jan 27, 2023 at 10:56:26PM +1100, Alexey Kardashevskiy wrote: > Here is the complete output of that VM (200k so not in the email): > > https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc Thanks. So looking at the code in the traces: Code starting with the faulting instruction =========================================== 0: 65 48 8b 04 25 c0 db mov %gs:0x2dbc0,%rax 7: 02 00 9: 48 8b 80 a8 08 00 00 mov 0x8a8(%rax),%rax 10: 0f 0d 48 70 prefetchw 0x70(%rax) 14: e8 .byte 0xe8 15: 82 .byte 0x82 I think the fault in the page-fault handler happens here: DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault) { unsigned long address = read_cr2(); irqentry_state_t state; prefetchw(¤t->mm->mmap_lock); <--- Here To be precise, it faults while dereferencing current. That means that GS_BASE is likely broken, need to find out why... This at least explains why it page-faults in a loop until the stack overflows and the guard page is hit. Regards, Joerg
On Fri, Jan 27, 2023 at 10:56:26PM +1100, Alexey Kardashevskiy wrote:
> https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc
Okay, I reproduced the problem here and the root cause turned out to be
that the compiler moved the DR7 read instruction before the 5-byte NOP
which becomes the call to sev_es_ist_enter() in SEV-ES guests. This is
guaranteed to cause #VC exception stack recursion if the NMI was
triggered on the #VC stack, and that leads to all kinds of undefined
behavior.
Regards,
Joerg
On 28/1/23 04:25, Joerg Roedel wrote: > On Fri, Jan 27, 2023 at 10:56:26PM +1100, Alexey Kardashevskiy wrote: >> https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc > > Okay, I reproduced the problem here and the root cause turned out to be > that the compiler moved the DR7 read instruction before the 5-byte NOP > which becomes the call to sev_es_ist_enter() in SEV-ES guests. This is > guaranteed to cause #VC exception stack recursion if the NMI was > triggered on the #VC stack, and that leads to all kinds of undefined > behavior. Cool! (out of curiosity) where do you see these NOPs? "objdump -D vmlinux" does not show any, is this after lifepatching? Meanwhile, this seems to be doing the right thing: diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h index b049d950612f..687b15297057 100644 --- a/arch/x86/include/asm/debugreg.h +++ b/arch/x86/include/asm/debugreg.h @@ -39,7 +39,7 @@ static __always_inline unsigned long native_get_debugreg(int regno) asm("mov %%db6, %0" :"=r" (val)); break; case 7: - asm("mov %%db7, %0" :"=r" (val)); + asm volatile ("mov %%db7, %0" :"=r" (val));
On Sat, Jan 28, 2023 at 10:24:56PM +1100, Alexey Kardashevskiy wrote: > (out of curiosity) where do you see these NOPs? "objdump -D vmlinux" does > not show any, is this after lifepatching? Here is the disassembly of exc_nmi of a kernel built from tip/master with CONFIG_PARAVIRT=n: <exc_nmi>: 41 54 push %r12 55 push %rbp 48 89 fd mov %rdi,%rbp 53 push %rbx 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 65 8b 05 69 66 41 7e mov %gs:0x7e416669(%rip),%eax # 3254c <pcpu_hot+0xc> 48 98 cltq 48 0f a3 05 33 00 2b bt %rax,0x12b0033(%rip) # ffffffff82ecbf20 <__cpu_online_mask> 01 0f 83 c9 00 00 00 jae ffffffff81c1bfbc <exc_nmi+0xec> 65 8b 05 f6 41 40 7e mov %gs:0x7e4041f6(%rip),%eax # 200f0 <nmi_state> 85 c0 test %eax,%eax 0f 85 f8 00 00 00 jne ffffffff81c1bffa <exc_nmi+0x12a> 65 c7 05 e3 41 40 7e movl $0x1,%gs:0x7e4041e3(%rip) # 200f0 <nmi_state> 01 00 00 00 0f 20 d0 mov %cr2,%rax 65 48 89 05 d0 41 40 mov %rax,%gs:0x7e4041d0(%rip) # 200e8 <nmi_cr2> 7e 41 0f 21 fc mov %db7,%r12 <-- here is the DR7 read 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) <-- here are the NOPS that become a call to sev_es_ist_enter() in SEV-ES guests The DR7 read will cause a #VC exception, switching to the #VC IST stack. If the NMI was raised while already on the #VC IST stack, this DR7 read will overwrite the previous stack frame and cause stack recursion, with all funny side effects. > diff --git a/arch/x86/include/asm/debugreg.h > b/arch/x86/include/asm/debugreg.h > index b049d950612f..687b15297057 100644 > --- a/arch/x86/include/asm/debugreg.h > +++ b/arch/x86/include/asm/debugreg.h > @@ -39,7 +39,7 @@ static __always_inline unsigned long > native_get_debugreg(int regno) > asm("mov %%db6, %0" :"=r" (val)); > break; > case 7: > - asm("mov %%db7, %0" :"=r" (val)); > + asm volatile ("mov %%db7, %0" :"=r" (val)); Yeah, something like this will be the fix. I am still thinking about the right place to put the volatile to make it explicit to the situation we are encountering here (which is SEV-ES specific). Best would be an explicit barrier in C code between sev_es_ist_enter() and the DR7 read, but all barriers I tried to far only seem to affect memory instructions and had no influence on the DR7 read (which is obviously not considered as a memory read by the compiler). The best place to put the barrier is in the sev_es_ist_enter() inline function, right after the static_call to __sev_es_ist_enter(). Regards, Joerg
On Sat, Jan 28, 2023 at 02:52:23PM +0100, Joerg Roedel wrote: > Yeah, something like this will be the fix. I am still thinking about > the right place to put the volatile to make it explicit to the situation > we are encountering here (which is SEV-ES specific). > > Best would be an explicit barrier in C code between sev_es_ist_enter() > and the DR7 read, but all barriers I tried to far only seem to affect > memory instructions and had no influence on the DR7 read (which is > obviously not considered as a memory read by the compiler). > > The best place to put the barrier is in the sev_es_ist_enter() inline > function, right after the static_call to __sev_es_ist_enter(). Okay, after some investigation I was not able to find a compiler barrier which affects DR7 read ordering. This leaves us with the only solution of directly forbidding DR7 register access re-ordering by adding a volatile to the asm, like you did before. I will send a fix later today. Regards, Joerg
On January 28, 2023 3:24:56 AM PST, Alexey Kardashevskiy <aik@amd.com> wrote: > > >On 28/1/23 04:25, Joerg Roedel wrote: >> On Fri, Jan 27, 2023 at 10:56:26PM +1100, Alexey Kardashevskiy wrote: >>> https://github.com/aik/linux/commit/d0d6bbb58fcd927ddd1f8e9d42ab121920c7eafc >> >> Okay, I reproduced the problem here and the root cause turned out to be >> that the compiler moved the DR7 read instruction before the 5-byte NOP >> which becomes the call to sev_es_ist_enter() in SEV-ES guests. This is >> guaranteed to cause #VC exception stack recursion if the NMI was >> triggered on the #VC stack, and that leads to all kinds of undefined >> behavior. > >Cool! > >(out of curiosity) where do you see these NOPs? "objdump -D vmlinux" does not show any, is this after lifepatching? > >Meanwhile, this seems to be doing the right thing: > > >diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h >index b049d950612f..687b15297057 100644 >--- a/arch/x86/include/asm/debugreg.h >+++ b/arch/x86/include/asm/debugreg.h >@@ -39,7 +39,7 @@ static __always_inline unsigned long native_get_debugreg(int regno) > asm("mov %%db6, %0" :"=r" (val)); > break; > case 7: >- asm("mov %%db7, %0" :"=r" (val)); >+ asm volatile ("mov %%db7, %0" :"=r" (val)); > > > It's somewhat odd to me that reading %dr7 is volatile, but %dr6 is not... %dr6 is the status register! I believe they should all be volatile (the compiler semantics is that volatile operations are always executed exactly once, in strict program order with respect to any other volatile operations); the real question is if there should also be memory clobbers on %dr6 reads and any %dr write.
On Mon, Jan 30, 2023 at 09:30:38AM -0800, H. Peter Anvin wrote: > It's somewhat odd to me that reading %dr7 is volatile, but %dr6 is > not... %dr6 is the status register! Yeah, as a precaution I think we should make all those volatile. Just in case. > I believe they should all be volatile (the compiler semantics is that > volatile operations are always executed exactly once, in strict > program order with respect to any other volatile operations); the real > question is if there should also be memory clobbers on %dr6 reads and > any %dr write. Yes, I think so too. From gcc docs: "6.47.2.1 Volatile ................. ... Note that the compiler can move even 'volatile asm' instructions relative to other code, including across jump instructions." We already have __FORCE_ORDER for exactly things like that.
On Mon, Jan 30, 2023 at 09:30:38AM -0800, H. Peter Anvin wrote: > It's somewhat odd to me that reading %dr7 is volatile, but %dr6 is > not... %dr6 is the status register! The reason is that on SEV-ES only accesses to DR7 will cause #VC exceptions, DR0-DR6 are not intercepted. Regards, Joerg
On Tue, Jan 31, 2023, Joerg Roedel wrote: > On Mon, Jan 30, 2023 at 09:30:38AM -0800, H. Peter Anvin wrote: > > It's somewhat odd to me that reading %dr7 is volatile, but %dr6 is > > not... %dr6 is the status register! > > The reason is that on SEV-ES only accesses to DR7 will cause #VC > exceptions, DR0-DR6 are not intercepted. I don't think that is technically true. A _well-behaved_ hypervisor will not intercept DR0-DR6 accesses for SEV-ES guests, but AFAICT nothing in the SEV-ES architecture enforces that behavior.
On Tue, Jan 31, 2023 at 03:53:39PM +0000, Sean Christopherson wrote: > I don't think that is technically true. A _well-behaved_ hypervisor will not > intercept DR0-DR6 accesses for SEV-ES guests, but AFAICT nothing in the SEV-ES > architecture enforces that behavior. Not from the hardware architecture side, but the GHCB spec does not list NAE events for DR0-DR6 accesses, so a guest is not required to handle them in the VC handler. Linux under SEV-ES will crash if the HV intercepts debug registers, except DR7. Regards, Joerg
On Tue, Jan 31, 2023, Joerg Roedel wrote: > On Tue, Jan 31, 2023 at 03:53:39PM +0000, Sean Christopherson wrote: > > I don't think that is technically true. A _well-behaved_ hypervisor will not > > intercept DR0-DR6 accesses for SEV-ES guests, but AFAICT nothing in the SEV-ES > > architecture enforces that behavior. > > Not from the hardware architecture side, but the GHCB spec does not > list NAE events for DR0-DR6 accesses, so a guest is not required to > handle them in the VC handler. > > Linux under SEV-ES will crash if the HV intercepts debug registers, > except DR7. Right, I'm just objecting to the wording of "DR0-DR6 are not intercepted". E.g. from a security perspective, the kernel shouldn't rely on DR0-DR6 to execute cleanly.
diff --git a/arch/x86/include/asm/debugreg.h b/arch/x86/include/asm/debugreg.h index b049d950612f..396e2ea114dc 100644 --- a/arch/x86/include/asm/debugreg.h +++ b/arch/x86/include/asm/debugreg.h @@ -125,6 +125,35 @@ static __always_inline void local_db_restore(unsigned long dr7) set_debugreg(dr7, 7); } +/* noinline here inline __always_inline'd native_get_debugreg(int regno) */ +static noinline unsigned long native_get_debugreg7(void) +{ + unsigned long dr7; + asm("mov %%db7, %0" :"=r" (dr7)); + return dr7; +} + +static __always_inline unsigned long local_db_save_exc_nmi(void) +{ + unsigned long dr7; + + if (static_cpu_has(X86_FEATURE_HYPERVISOR) && !hw_breakpoint_active()) + return 0; + + dr7 = native_get_debugreg7(); + dr7 &= ~0x400; /* architecturally set bit */ + if (dr7) + set_debugreg(0, 7); + /* + * Ensure the compiler doesn't lower the above statements into + * the critical section; disabling breakpoints late would not + * be good. + */ + barrier(); + + return dr7; +} + #ifdef CONFIG_CPU_SUP_AMD extern void set_dr_addr_mask(unsigned long mask, int dr); #else diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index cec0bfa3bc04..400b5b6b74f6 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -503,7 +503,7 @@ DEFINE_IDTENTRY_RAW(exc_nmi) */ sev_es_ist_enter(regs); - this_cpu_write(nmi_dr7, local_db_save()); + this_cpu_write(nmi_dr7, local_db_save_exc_nmi()); irq_state = irqentry_nmi_enter(regs);
AMD SEV-ES guest kernels compiled without CONFIG_PARAVIRT crash when "perf" executes. "./perf record sleep 20" is an example. Some debugging revealed this happens when CONFIG_PARAVIRT_XXL is not defined. The problem seems to be that between DEFINE_IDTENTRY_RAW(exc_nmi) and actual reading of DB7 (which in turn causes #VC) every function is inlined and no stack frame is created (?). Replacing __always_inline with noinline in local_db_save() or native_get_debugreg() fixes the problem. The crash does not happen with CONFIG_PARAVIRT_XXL as in this case paravirt_get_debugreg() is used and there is an indirect call via PVOP_CALL1(). It has not been noticed as the most configs have CONFIG_XEN enabled which enables CONFIG_PARAVIRT_XXL. This happens with the recent tip/master, here is my test kernel and the config: https://github.com/aik/linux/commits/debug_dr7 Found this while testing DebugSwap (which also fixes the crash as it eliminates DB7 interception == #VC): https://lore.kernel.org/all/20230120031047.628097-1-aik@amd.com Define local_db_save_exc_nmi() to demostrate that the problem better. Why is this crash happening and how to fix that? I am still reading the assembly but was hoping for a shortcut here :) Thanks, aik-Standard-PC-i440FX-PIIX-1996 login: [A[ 15.775303] BUG: NMI stack guard page was hit at 0000000003983d50 (stack is 000000007feb1fa4..00000000574369c2) [ 15.775314] stack guard page: 0000 [#1] PREEMPT SMP NOPTI [ 15.775316] CPU: 0 PID: 790 Comm: sleep Not tainted 6.2.0-rc4_aik-debugswap_ruby-954bhost #73 [ 15.775322] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown [ 15.775323] RIP: 0010:error_entry+0x17/0x140 [ 15.775326] Code: f8 e9 98 fd ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 41 53 53 <55> 41 54 41 55 41 56 41 57 56 31 f6 31 d2 31 c9 45 31 c0 45 31 c9 [ 15.775328] RSP: 0000:fffffe2446b2b000 EFLAGS: 00010097 [ 15.775332] RAX: fffffe2446b2b0a8 RBX: 0000000000000000 RCX: ffffffffb3a00fed [ 15.775333] RDX: 0000000000000000 RSI: ffffffffb3a00b69 RDI: fffffe2446b2b0a8 [ 15.775336] RBP: fffffe2446b2b0a8 R08: 0000000000000000 R09: 0000000000000000 [ 15.775337] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 15.775338] R13: 0000000000000000 R14: 000000000002dd80 R15: 0000000000000000 [ 15.775339] FS: 0000000000000000(0000) GS:ffff94b17dc00000(0000) knlGS:ffff94b17dc00000 [ 15.775340] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 15.775341] CR2: fffffe2446b2aff8 CR3: 00080000167b8000 CR4: 00000000003506f0 [ 15.775342] Call Trace: [ 15.775352] <NMI> [ 15.775355] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775357] ? exc_page_fault+0x11/0x120 [ 15.775360] ? asm_exc_page_fault+0x22/0x30 [ 15.775364] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775365] ? exc_page_fault+0x11/0x120 [ 15.775367] ? asm_exc_page_fault+0x22/0x30 [ 15.775368] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775369] ? exc_page_fault+0x11/0x120 [ 15.775371] ? asm_exc_page_fault+0x22/0x30 [ 15.775372] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775373] ? exc_page_fault+0x11/0x120 [ 15.775374] ? asm_exc_page_fault+0x22/0x30 [ 15.775375] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775376] ? exc_page_fault+0x11/0x120 [ 15.775378] ? asm_exc_page_fault+0x22/0x30 [ 15.775379] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775380] ? exc_page_fault+0x11/0x120 [ 15.775381] ? asm_exc_page_fault+0x22/0x30 [ 15.775382] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775383] ? exc_page_fault+0x11/0x120 [ 15.775384] ? asm_exc_page_fault+0x22/0x30 [ 15.775385] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775386] ? exc_page_fault+0x11/0x120 [ 15.775388] ? asm_exc_page_fault+0x22/0x30 [ 15.775389] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775390] ? exc_page_fault+0x11/0x120 [ 15.775391] ? asm_exc_page_fault+0x22/0x30 [ 15.775392] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775393] ? exc_page_fault+0x11/0x120 [ 15.775395] ? asm_exc_page_fault+0x22/0x30 [ 15.775396] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775397] ? exc_page_fault+0x11/0x120 [ 15.775398] ? asm_exc_page_fault+0x22/0x30 [ 15.775399] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775400] ? exc_page_fault+0x11/0x120 [ 15.775401] ? asm_exc_page_fault+0x22/0x30 [ 15.775403] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775404] ? exc_page_fault+0x11/0x120 [ 15.775405] ? asm_exc_page_fault+0x22/0x30 [ 15.775406] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775407] ? exc_page_fault+0x11/0x120 [ 15.775408] ? asm_exc_page_fault+0x22/0x30 [ 15.775409] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775410] ? exc_page_fault+0x11/0x120 [ 15.775412] ? asm_exc_page_fault+0x22/0x30 [ 15.775413] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775414] ? exc_page_fault+0x11/0x120 [ 15.775415] ? asm_exc_page_fault+0x22/0x30 [ 15.775416] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775420] ? exc_page_fault+0x11/0x120 [ 15.775421] ? asm_exc_page_fault+0x22/0x30 [ 15.775422] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775423] ? exc_page_fault+0x11/0x120 [ 15.775425] ? asm_exc_page_fault+0x22/0x30 [ 15.775426] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775427] ? exc_page_fault+0x11/0x120 [ 15.775431] ? asm_exc_page_fault+0x22/0x30 [ 15.775432] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775433] ? exc_page_fault+0x11/0x120 [ 15.775435] ? asm_exc_page_fault+0x22/0x30 [ 15.775436] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775437] ? exc_page_fault+0x11/0x120 [ 15.775438] ? asm_exc_page_fault+0x22/0x30 [ 15.775439] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775440] ? exc_page_fault+0x11/0x120 [ 15.775441] ? asm_exc_page_fault+0x22/0x30 [ 15.775442] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775443] ? exc_page_fault+0x11/0x120 [ 15.775445] ? asm_exc_page_fault+0x22/0x30 [ 15.775446] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775447] ? exc_page_fault+0x11/0x120 [ 15.775448] ? asm_exc_page_fault+0x22/0x30 [ 15.775449] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775450] ? exc_page_fault+0x11/0x120 [ 15.775454] ? asm_exc_page_fault+0x22/0x30 [ 15.775455] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775456] ? exc_page_fault+0x11/0x120 [ 15.775458] ? asm_exc_page_fault+0x22/0x30 [ 15.775459] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775460] ? exc_page_fault+0x11/0x120 [ 15.775461] ? asm_exc_page_fault+0x22/0x30 [ 15.775462] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775463] ? exc_page_fault+0x11/0x120 [ 15.775465] ? asm_exc_page_fault+0x22/0x30 [ 15.775466] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775467] ? exc_page_fault+0x11/0x120 [ 15.775468] ? asm_exc_page_fault+0x22/0x30 [ 15.775469] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775470] ? exc_page_fault+0x11/0x120 [ 15.775471] ? asm_exc_page_fault+0x22/0x30 [ 15.775472] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775473] ? exc_page_fault+0x11/0x120 [ 15.775475] ? asm_exc_page_fault+0x22/0x30 [ 15.775476] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775477] ? exc_page_fault+0x11/0x120 [ 15.775478] ? asm_exc_page_fault+0x22/0x30 [ 15.775482] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775483] ? exc_page_fault+0x11/0x120 [ 15.775485] ? asm_exc_page_fault+0x22/0x30 [ 15.775486] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775487] ? exc_page_fault+0x11/0x120 [ 15.775488] ? asm_exc_page_fault+0x22/0x30 [ 15.775490] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775491] ? exc_page_fault+0x11/0x120 [ 15.775492] ? asm_exc_page_fault+0x22/0x30 [ 15.775493] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775494] ? exc_page_fault+0x11/0x120 [ 15.775495] ? asm_exc_page_fault+0x22/0x30 [ 15.775496] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775497] ? exc_page_fault+0x11/0x120 [ 15.775499] ? asm_exc_page_fault+0x22/0x30 [ 15.775500] ? check_preemption_disabled+0x8/0xe0 [ 15.775502] ? __sev_es_ist_enter+0x13/0x100 [ 15.775503] ? exc_nmi+0x10e/0x150 [ 15.775505] ? end_repeat_nmi+0x16/0x67 [ 15.775506] ? asm_exc_double_fault+0x30/0x30 [ 15.775507] ? asm_exc_double_fault+0x30/0x30 [ 15.775508] ? asm_exc_double_fault+0x30/0x30 [ 15.775509] </NMI> [ 15.775509] <#VC> [ 15.775510] ? __show_regs.cold+0x18e/0x23d [ 15.775511] </#VC> [ 15.775511] <#DF> [ 15.775512] ? __die_body.cold+0x1a/0x1f [ 15.775513] ? die+0x26/0x40 [ 15.775517] ? handle_stack_overflow+0x44/0x60 [ 15.775518] ? exc_double_fault+0x14b/0x180 [ 15.775519] ? asm_exc_double_fault+0x1f/0x30 [ 15.775520] ? restore_regs_and_return_to_kernel+0x22/0x22 [ 15.775521] ? asm_exc_page_fault+0x9/0x30 [ 15.775522] ? error_entry+0x17/0x140 [ 15.775523] </#DF> [ 15.775523] WARNING: stack recursion on stack type 6 [ 15.775524] Modules linked in: msr efivarfs [ 15.837935] ---[ end trace 0000000000000000 ]--- Signed-off-by: Alexey Kardashevskiy <aik@amd.com> --- arch/x86/include/asm/debugreg.h | 29 ++++++++++++++++++++ arch/x86/kernel/nmi.c | 2 +- 2 files changed, 30 insertions(+), 1 deletion(-)