Message ID | 78b62646-6fd4-e5b3-bc09-783bb017eaaa@suse.com
---|---
State | Superseded
Series | x86emul: (mainly) vendor specific behavior adjustments
On 24/03/2020 16:29, Jan Beulich wrote:
> This is to augment SYSCALL, which has been supported for quite some
> time.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

I've compared this to the in-progress version I have in my XSA-204 follow-on series. I'm afraid the behaviour has far more vendor specific quirks than this.

>
> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
> @@ -5975,6 +5975,60 @@ x86_emulate(
>              goto done;
>          break;
>
> +    case X86EMUL_OPC(0x0f, 0x07): /* sysret */
> +        vcpu_must_have(syscall);
> +        /* Inject #UD if syscall/sysret are disabled. */
> +        fail_if(!ops->read_msr);
> +        if ( (rc = ops->read_msr(MSR_EFER, &msr_val, ctxt)) != X86EMUL_OKAY )
> +            goto done;
> +        generate_exception_if((msr_val & EFER_SCE) == 0, EXC_UD);

(as with the SYSCALL side), no need for the vcpu_must_have(syscall) as well as this check.

> +        generate_exception_if(!amd_like(ctxt) && !mode_64bit(), EXC_UD);
> +        generate_exception_if(!mode_ring0(), EXC_GP, 0);
> +        generate_exception_if(!in_protmode(ctxt, ops), EXC_GP, 0);
> +

The Intel SYSRET vulnerability checks regs->rcx for canonicity here, and raises #GP here.

I see you've got it below, but this is where the Intel pseudocode puts it, before MSR_STAR gets read, and logically it should be grouped with the other exceptions.

> +        if ( (rc = ops->read_msr(MSR_STAR, &msr_val, ctxt)) != X86EMUL_OKAY )
> +            goto done;
> +        sreg.sel = ((msr_val >> 48) + 8) | 3; /* SELECTOR_RPL_MASK */

This would be the logical behaviour...

AMD CPUs |3 into %cs.sel, but don't make an equivalent adjustment for %ss.sel, and simply take MSR_STAR.SYSRET_CS + 8.

If you aren't careful with MSR_STAR, SYSRET will return to userspace with mismatching RPL/DPL, and userspace can really find itself with an %ss with an RPL of 0. (Of course, when you take an interrupt and attempt to IRET back to this context, things fall apart.)

I discovered this entirely by accident in XTF, but it is confirmed by careful reading of the AMD SYSRET pseudocode.

> +        cs.sel = op_bytes == 8 ? sreg.sel + 8 : sreg.sel - 8;
> +
> +        cs.base = sreg.base = 0; /* flat segment */
> +        cs.limit = sreg.limit = ~0u; /* 4GB limit */
> +        cs.attr = 0xcfb; /* G+DB+P+DPL3+S+Code */
> +        sreg.attr = 0xcf3; /* G+DB+P+DPL3+S+Data */

Again, that would be the logical behaviour...

AMD CPUs don't update anything but %ss.sel, and even comment the fact in pseudocode now.

This was discovered by Andy Luto, who found that taking an interrupt (which unconditionally sets %ss to NUL) followed by an opportunistic SYSRET back to 32bit userspace lets userspace see a sane %ss value, but with the attrs still empty, and the stack unusable.

> +
> +#ifdef __x86_64__
> +        if ( mode_64bit() )
> +        {
> +            if ( op_bytes == 8 )
> +            {
> +                cs.attr = 0xafb; /* L+DB+P+DPL3+S+Code */
> +                generate_exception_if(!is_canonical_address(_regs.rcx) &&
> +                                      !amd_like(ctxt), EXC_GP, 0);

Wherever this ends up living, I think it needs calling out with a comment /* CVE-xxx, Intel privilege escalation hole */, as it is a very subtle piece of vendor specific behaviour.

Do we have a Centaur/other CPU to try with? I'd err on the side of going with == Intel rather than !AMD to avoid introducing known vulnerabilities into models which stand half a chance of not being affected.

> +                _regs.rip = _regs.rcx;
> +            }
> +            else
> +                _regs.rip = _regs.ecx;
> +
> +            _regs.eflags = _regs.r11 & ~(X86_EFLAGS_RF | X86_EFLAGS_VM);
> +        }
> +        else
> +#endif
> +        {
> +            _regs.r(ip) = _regs.ecx;
> +            _regs.eflags |= X86_EFLAGS_IF;
> +        }
> +
> +        fail_if(!ops->write_segment);
> +        if ( (rc = ops->write_segment(x86_seg_cs, &cs, ctxt)) != X86EMUL_OKAY ||
> +             (!amd_like(ctxt) &&
> +              (rc = ops->write_segment(x86_seg_ss, &sreg,
> +                                       ctxt)) != X86EMUL_OKAY) )

Oh - here is the AMD behaviour with %ss, but it's not quite correct.

AFAICT, the correct behaviour is to read the old %ss on AMD-like, set flat attributes on Intel, and write back normally, because %ss.sel does get updated.

~Andrew

> +            goto done;
> +
> +        singlestep = _regs.eflags & X86_EFLAGS_TF;
> +        break;
> +
>      case X86EMUL_OPC(0x0f, 0x08): /* invd */
>      case X86EMUL_OPC(0x0f, 0x09): /* wbinvd / wbnoinvd */
>          generate_exception_if(!mode_ring0(), EXC_GP, 0);
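For illustration, a standalone sketch (not Xen code) of the two selector derivations described in the mail above; the amd_like flag, the helper name, and the example MSR_STAR value are assumptions made purely for this sketch:

/* Hypothetical model: how SYSRET's %cs/%ss selectors come out of MSR_STAR
 * under the Intel behaviour (RPL forced to 3 in both) versus the AMD
 * behaviour described above (%ss.sel taken as SYSRET base + 8 verbatim). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct sysret_sels { uint16_t cs, ss; };

static struct sysret_sels sysret_selectors(uint64_t msr_star, bool op64,
                                           bool amd_like)
{
    uint16_t base = (uint16_t)(msr_star >> 48);   /* SYSRET CS/SS base */
    struct sysret_sels s;

    /* %cs: base (+16 for a 64-bit return) with RPL forced to 3 on both. */
    s.cs = (op64 ? base + 16 : base) | 3;

    /* %ss: Intel forces RPL 3; AMD reportedly just takes base + 8,
     * inheriting whatever RPL bits were written into MSR_STAR. */
    s.ss = amd_like ? base + 8 : (base + 8) | 3;

    return s;
}

int main(void)
{
    /* SYSRET selector base 0x20 with the RPL bits left at 0 in the MSR. */
    uint64_t star = (uint64_t)0x20 << 48;
    struct sysret_sels intel = sysret_selectors(star, true, false);
    struct sysret_sels amd = sysret_selectors(star, true, true);

    printf("Intel: cs=%#x ss=%#x\n", (unsigned)intel.cs, (unsigned)intel.ss);
    printf("AMD:   cs=%#x ss=%#x\n", (unsigned)amd.cs, (unsigned)amd.ss);
    return 0;   /* The AMD line shows ss=0x28, i.e. RPL 0, matching the quirk. */
}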
On 25.03.2020 11:00, Andrew Cooper wrote:
> On 24/03/2020 16:29, Jan Beulich wrote:
>> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
>> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
>> @@ -5975,6 +5975,60 @@ x86_emulate(
>>              goto done;
>>          break;
>>
>> +    case X86EMUL_OPC(0x0f, 0x07): /* sysret */
>> +        vcpu_must_have(syscall);
>> +        /* Inject #UD if syscall/sysret are disabled. */
>> +        fail_if(!ops->read_msr);
>> +        if ( (rc = ops->read_msr(MSR_EFER, &msr_val, ctxt)) != X86EMUL_OKAY )
>> +            goto done;
>> +        generate_exception_if((msr_val & EFER_SCE) == 0, EXC_UD);
>
> (as with the SYSCALL side), no need for the vcpu_must_have(syscall) as
> well as this check.

Hmm, yes, we do so elsewhere too, so I'll adjust this there and here.

>> +        generate_exception_if(!amd_like(ctxt) && !mode_64bit(), EXC_UD);
>> +        generate_exception_if(!mode_ring0(), EXC_GP, 0);
>> +        generate_exception_if(!in_protmode(ctxt, ops), EXC_GP, 0);
>> +
>
> The Intel SYSRET vulnerability checks regs->rcx for canonicity here, and
> raises #GP here.
>
> I see you've got it below, but this is where the Intel pseudocode puts
> it, before MSR_STAR gets read, and logically it should be grouped with
> the other exceptions.

I had it here first, then moved it down to avoid yet another mode_64bit() instance. I didn't see why the ordering would matter for the overall result, on the basis that the STAR read ought not to fail under normal circumstances. I'll move it back where it was since you ask for it.

>> +        if ( (rc = ops->read_msr(MSR_STAR, &msr_val, ctxt)) != X86EMUL_OKAY )
>> +            goto done;
>> +        sreg.sel = ((msr_val >> 48) + 8) | 3; /* SELECTOR_RPL_MASK */
>
> This would be the logical behaviour...
>
> AMD CPUs |3 into %cs.sel, but don't make an equivalent adjustment for
> %ss.sel, and simply take MSR_STAR.SYSRET_CS + 8.
>
> If you aren't careful with MSR_STAR, SYSRET will return to userspace
> with mismatching RPL/DPL, and userspace can really find itself with an
> %ss with an RPL of 0. (Of course, when you take an interrupt and
> attempt to IRET back to this context, things fall apart.)
>
> I discovered this entirely by accident in XTF, but it is confirmed by
> careful reading of the AMD SYSRET pseudocode.

I did notice this in their pseudocode, but it looked too wrong to be true. Will change.

>> +        cs.sel = op_bytes == 8 ? sreg.sel + 8 : sreg.sel - 8;
>> +
>> +        cs.base = sreg.base = 0; /* flat segment */
>> +        cs.limit = sreg.limit = ~0u; /* 4GB limit */
>> +        cs.attr = 0xcfb; /* G+DB+P+DPL3+S+Code */
>> +        sreg.attr = 0xcf3; /* G+DB+P+DPL3+S+Data */
>
> Again, that would be the logical behaviour...
>
> AMD CPUs don't update anything but %ss.sel, and even comment the fact
> in pseudocode now.
>
> This was discovered by Andy Luto, who found that taking an interrupt
> (which unconditionally sets %ss to NUL) followed by an opportunistic
> SYSRET back to 32bit userspace lets userspace see a sane %ss value, but
> with the attrs still empty, and the stack unusable.
>
>> +
>> +#ifdef __x86_64__
>> +        if ( mode_64bit() )
>> +        {
>> +            if ( op_bytes == 8 )
>> +            {
>> +                cs.attr = 0xafb; /* L+DB+P+DPL3+S+Code */
>> +                generate_exception_if(!is_canonical_address(_regs.rcx) &&
>> +                                      !amd_like(ctxt), EXC_GP, 0);
>
> Wherever this ends up living, I think it needs calling out with a
> comment /* CVE-xxx, Intel privilege escalation hole */, as it is a very
> subtle piece of vendor specific behaviour.
>
> Do we have a Centaur/other CPU to try with? I'd err on the side of
> going with == Intel rather than !AMD to avoid introducing known
> vulnerabilities into models which stand half a chance of not being affected.

I'd rather not - this exception behavior is spelled out by the SDM, and hence imo pretty likely to be followed by clones. While I do have a VIA box somewhere, it's not stable enough to run for more than a couple of minutes.

>> +                _regs.rip = _regs.rcx;
>> +            }
>> +            else
>> +                _regs.rip = _regs.ecx;
>> +
>> +            _regs.eflags = _regs.r11 & ~(X86_EFLAGS_RF | X86_EFLAGS_VM);
>> +        }
>> +        else
>> +#endif
>> +        {
>> +            _regs.r(ip) = _regs.ecx;
>> +            _regs.eflags |= X86_EFLAGS_IF;
>> +        }
>> +
>> +        fail_if(!ops->write_segment);
>> +        if ( (rc = ops->write_segment(x86_seg_cs, &cs, ctxt)) != X86EMUL_OKAY ||
>> +             (!amd_like(ctxt) &&
>> +              (rc = ops->write_segment(x86_seg_ss, &sreg,
>> +                                       ctxt)) != X86EMUL_OKAY) )
>
> Oh - here is the AMD behaviour with %ss, but it's not quite correct.
>
> AFAICT, the correct behaviour is to read the old %ss on AMD-like, set
> flat attributes on Intel, and write back normally, because %ss.sel does
> get updated.

Oh, of course - I meant to, got distracted, and then forgot. Will fix.

Jan
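A standalone model (again not the actual emulator code, only a sketch under the assumptions stated in its comments) of the %ss write-back the two of you converge on, i.e. always update %ss.sel but only load a flat ring-3 data segment on non-AMD-like CPUs:

/* Sketch: %ss as it would be written back after SYSRET.  The struct layout
 * and the 0xcf3 attribute encoding mirror the patch; treating "AMD-like"
 * as "keep the stale base/limit/attr and skip forcing RPL 3" is the
 * behaviour described in this thread, not something taken from the code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct segreg {
    uint16_t sel;
    uint64_t base;
    uint32_t limit;
    uint16_t attr;
};

static struct segreg sysret_ss(struct segreg old_ss, uint64_t msr_star,
                               bool amd_like)
{
    struct segreg ss = old_ss;              /* AMD-like: stale fields kept */

    ss.sel = (uint16_t)(msr_star >> 48) + 8;
    if ( !amd_like )
    {
        ss.sel |= 3;                        /* RPL forced to 3 */
        ss.base = 0;                        /* flat segment */
        ss.limit = ~0u;                     /* 4GB limit */
        ss.attr = 0xcf3;                    /* G+DB+P+DPL3+S+Data */
    }

    return ss;
}

int main(void)
{
    /* An interrupt may have left %ss with a NUL selector and empty attrs. */
    struct segreg nul_ss = { 0, 0, 0, 0 };
    uint64_t star = (uint64_t)0x20 << 48;

    struct segreg amd = sysret_ss(nul_ss, star, true);
    struct segreg intel = sysret_ss(nul_ss, star, false);

    printf("AMD-like: sel=%#x attr=%#x (stack still unusable)\n",
           (unsigned)amd.sel, (unsigned)amd.attr);
    printf("Intel:    sel=%#x attr=%#x (flat ring-3 data)\n",
           (unsigned)intel.sel, (unsigned)intel.attr);
    return 0;
}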
On 25/03/2020 10:19, Jan Beulich wrote:
> On 25.03.2020 11:00, Andrew Cooper wrote:
>> On 24/03/2020 16:29, Jan Beulich wrote:
>>> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
>>> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
>>> @@ -5975,6 +5975,60 @@ x86_emulate(
>>>              goto done;
>>>          break;
>>>
>>> +    case X86EMUL_OPC(0x0f, 0x07): /* sysret */
>>> +        vcpu_must_have(syscall);
>>> +        /* Inject #UD if syscall/sysret are disabled. */
>>> +        fail_if(!ops->read_msr);
>>> +        if ( (rc = ops->read_msr(MSR_EFER, &msr_val, ctxt)) != X86EMUL_OKAY )
>>> +            goto done;
>>> +        generate_exception_if((msr_val & EFER_SCE) == 0, EXC_UD);
>> (as with the SYSCALL side), no need for the vcpu_must_have(syscall) as
>> well as this check.
> Hmm, yes, we do so elsewhere too, so I'll adjust this there and here.

In theory, the SEP checks for SYSENTER/SYSEXIT could be similarly dropped, once the MSR logic is updated to perform proper availability checks.

>>> +        if ( (rc = ops->read_msr(MSR_STAR, &msr_val, ctxt)) != X86EMUL_OKAY )
>>> +            goto done;
>>> +        sreg.sel = ((msr_val >> 48) + 8) | 3; /* SELECTOR_RPL_MASK */
>> This would be the logical behaviour...
>>
>> AMD CPUs |3 into %cs.sel, but don't make an equivalent adjustment for
>> %ss.sel, and simply take MSR_STAR.SYSRET_CS + 8.
>>
>> If you aren't careful with MSR_STAR, SYSRET will return to userspace
>> with mismatching RPL/DPL, and userspace can really find itself with an
>> %ss with an RPL of 0. (Of course, when you take an interrupt and
>> attempt to IRET back to this context, things fall apart.)
>>
>> I discovered this entirely by accident in XTF, but it is confirmed by
>> careful reading of the AMD SYSRET pseudocode.
> I did notice this in their pseudocode, but it looked too wrong to
> be true. Will change.

The main reason why my 204 follow-on series is still pending is that I never got around to completing an XTF test for all of these corner cases. I'm happy to drop my series to Xen in light of this series of yours, but I'd still like to complete the XTF side of things at some point.

>>> +
>>> +#ifdef __x86_64__
>>> +        if ( mode_64bit() )
>>> +        {
>>> +            if ( op_bytes == 8 )
>>> +            {
>>> +                cs.attr = 0xafb; /* L+DB+P+DPL3+S+Code */
>>> +                generate_exception_if(!is_canonical_address(_regs.rcx) &&
>>> +                                      !amd_like(ctxt), EXC_GP, 0);
>> Wherever this ends up living, I think it needs calling out with a
>> comment /* CVE-xxx, Intel privilege escalation hole */, as it is a very
>> subtle piece of vendor specific behaviour.
>>
>> Do we have a Centaur/other CPU to try with? I'd err on the side of
>> going with == Intel rather than !AMD to avoid introducing known
>> vulnerabilities into models which stand half a chance of not being affected.
> I'd rather not - this exception behavior is spelled out by the
> SDM, and hence imo pretty likely to be followed by clones.

In pseudocode which certainly used to state somewhere "for reference only, and not to be taken as a precise specification of behaviour". (And yes - that statement was still at the beginning of Vol2 when Intel also claimed that "SYSRET was working according to the spec" in the embargo period of XSA-7, because I called them out on it.)

And anyway - it is a part of the AMD64 spec, not the Intel32 spec. A 3rd party implementing it for 64bit support is more likely to go with AMD's writings of how it behaves.

> While I do have a VIA box somewhere, it's not stable enough to
> run for more than a couple of minutes.

Fundamentally, it boils down to this. Intel behaviour leaves a privilege escalation vulnerability available to userspace. Assuming AMD behaviour for unknown parts is the safer course of action, because we don't need to issue an XSA/CVE to fix the emulator when it turns out that we're wrong.

~Andrew
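As background to the "== Intel vs !AMD" question, a standalone version of the canonicity test being argued about; the XSA-7 (CVE-2012-0217) hole is precisely that Intel raises this #GP(0) while still in ring 0, whereas AMD-like CPUs only fault after the switch to CPL3. The 48 implemented virtual-address bits (no LA57) are an assumption of the sketch:

/* Canonical means bits 63:47 all equal bit 47 when 48 VA bits are
 * implemented, i.e. the address lies in the low or high canonical half. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool is_canonical_address(uint64_t addr)
{
    return addr <= 0x00007fffffffffffULL || addr >= 0xffff800000000000ULL;
}

int main(void)
{
    uint64_t good = 0x00007fffffffe000ULL;  /* top of the low half */
    uint64_t bad = 0x0000800000000000ULL;   /* first non-canonical address */

    printf("%#llx canonical: %d\n", (unsigned long long)good,
           is_canonical_address(good));     /* 1 */
    printf("%#llx canonical: %d\n", (unsigned long long)bad,
           is_canonical_address(bad));      /* 0 */
    return 0;
}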
On 25.03.2020 11:00, Andrew Cooper wrote:
> On 24/03/2020 16:29, Jan Beulich wrote:
>> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
>> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
>> @@ -5975,6 +5975,60 @@ x86_emulate(
>>              goto done;
>>          break;
>>
>> +    case X86EMUL_OPC(0x0f, 0x07): /* sysret */
>> +        vcpu_must_have(syscall);
>> +        /* Inject #UD if syscall/sysret are disabled. */
>> +        fail_if(!ops->read_msr);
>> +        if ( (rc = ops->read_msr(MSR_EFER, &msr_val, ctxt)) != X86EMUL_OKAY )
>> +            goto done;
>> +        generate_exception_if((msr_val & EFER_SCE) == 0, EXC_UD);
>
> (as with the SYSCALL side), no need for the vcpu_must_have(syscall) as
> well as this check.

Upon re-reading I'm now confused - are you suggesting to also drop the EFER.SCE check? That's not what you said in reply to 6/7. If so, what's your thinking behind saying so? If I'm to guess, this may go along the lines of you suggesting to drop the explicit CPUID checks from SYSENTER/SYSEXIT as well, but I'm not seeing there either why you would think this way (albeit there it's also a little vague what exact changes you're thinking of at the MSR handling side).

Jan
On 25/03/2020 11:55, Jan Beulich wrote:
> On 25.03.2020 11:00, Andrew Cooper wrote:
>> On 24/03/2020 16:29, Jan Beulich wrote:
>>> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
>>> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
>>> @@ -5975,6 +5975,60 @@ x86_emulate(
>>>              goto done;
>>>          break;
>>>
>>> +    case X86EMUL_OPC(0x0f, 0x07): /* sysret */
>>> +        vcpu_must_have(syscall);
>>> +        /* Inject #UD if syscall/sysret are disabled. */
>>> +        fail_if(!ops->read_msr);
>>> +        if ( (rc = ops->read_msr(MSR_EFER, &msr_val, ctxt)) != X86EMUL_OKAY )
>>> +            goto done;
>>> +        generate_exception_if((msr_val & EFER_SCE) == 0, EXC_UD);
>> (as with the SYSCALL side), no need for the vcpu_must_have(syscall) as
>> well as this check.
> Upon re-reading I'm now confused - are you suggesting to also drop
> the EFER.SCE check?

No. The SCE check is critical and needs to remain.

The exact delta I had put together was:

diff --git a/xen/arch/x86/x86_emulate/x86_emulate.c b/xen/arch/x86/x86_emulate/x86_emulate.c
index c730511ebe..57ce7e00be 100644
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -5883,9 +5883,11 @@ x86_emulate(
 
 #ifdef __XEN__
     case X86EMUL_OPC(0x0f, 0x05): /* syscall */
-        generate_exception_if(!in_protmode(ctxt, ops), EXC_UD);
+        if ( !in_protmode(ctxt, ops) ||
+             ((ctxt->cpuid->x86_vendor & X86_VENDOR_INTEL) && !mode_64bit()) )
+            generate_exception(EXC_UD);
 
-        /* Inject #UD if syscall/sysret are disabled. */
+        /* Inject #UD if SCE is disabled. Subsumes the SYSCALL CPUID check. */
         fail_if(ops->read_msr == NULL);
         if ( (rc = ops->read_msr(MSR_EFER, &msr_val, ctxt)) != X86EMUL_OKAY )
             goto done;

(Looking at the commit date, Mon Dec 19 13:32:11 2016 is quite a long time ago...)

~Andrew
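For the SYSRET side that this patch adds, the same reasoning would let the vcpu_must_have(syscall) line go as well. A toy model of why the EFER.SCE #UD subsumes the CPUID check once the MSR write path performs proper availability checks (all names here are made up for the sketch; nothing is Xen code):

/* If the (modelled) WRMSR path refuses to set EFER.SCE without the SYSCALL
 * CPUID bit, then testing SCE at emulation time already covers both. */
#include <stdbool.h>
#include <stdio.h>

struct vcpu_model {
    bool cpuid_syscall;    /* CPUID SYSCALL/SYSRET feature bit */
    bool efer_sce;         /* EFER.SCE as the guest would see it */
};

static bool set_sce(struct vcpu_model *v, bool val)
{
    if ( val && !v->cpuid_syscall )
        return false;      /* availability check: feature not advertised */
    v->efer_sce = val;
    return true;
}

static bool sysret_allowed(const struct vcpu_model *v)
{
    return v->efer_sce;    /* #UD if clear; no separate CPUID re-check */
}

int main(void)
{
    struct vcpu_model v = { .cpuid_syscall = false };

    printf("set_sce -> %d\n", set_sce(&v, true));          /* 0 */
    printf("sysret_allowed -> %d\n", sysret_allowed(&v));  /* 0 */
    return 0;
}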
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -5975,6 +5975,60 @@ x86_emulate(
             goto done;
         break;
 
+    case X86EMUL_OPC(0x0f, 0x07): /* sysret */
+        vcpu_must_have(syscall);
+        /* Inject #UD if syscall/sysret are disabled. */
+        fail_if(!ops->read_msr);
+        if ( (rc = ops->read_msr(MSR_EFER, &msr_val, ctxt)) != X86EMUL_OKAY )
+            goto done;
+        generate_exception_if((msr_val & EFER_SCE) == 0, EXC_UD);
+        generate_exception_if(!amd_like(ctxt) && !mode_64bit(), EXC_UD);
+        generate_exception_if(!mode_ring0(), EXC_GP, 0);
+        generate_exception_if(!in_protmode(ctxt, ops), EXC_GP, 0);
+
+        if ( (rc = ops->read_msr(MSR_STAR, &msr_val, ctxt)) != X86EMUL_OKAY )
+            goto done;
+
+        sreg.sel = ((msr_val >> 48) + 8) | 3; /* SELECTOR_RPL_MASK */
+        cs.sel = op_bytes == 8 ? sreg.sel + 8 : sreg.sel - 8;
+
+        cs.base = sreg.base = 0; /* flat segment */
+        cs.limit = sreg.limit = ~0u; /* 4GB limit */
+        cs.attr = 0xcfb; /* G+DB+P+DPL3+S+Code */
+        sreg.attr = 0xcf3; /* G+DB+P+DPL3+S+Data */
+
+#ifdef __x86_64__
+        if ( mode_64bit() )
+        {
+            if ( op_bytes == 8 )
+            {
+                cs.attr = 0xafb; /* L+DB+P+DPL3+S+Code */
+                generate_exception_if(!is_canonical_address(_regs.rcx) &&
+                                      !amd_like(ctxt), EXC_GP, 0);
+                _regs.rip = _regs.rcx;
+            }
+            else
+                _regs.rip = _regs.ecx;
+
+            _regs.eflags = _regs.r11 & ~(X86_EFLAGS_RF | X86_EFLAGS_VM);
+        }
+        else
+#endif
+        {
+            _regs.r(ip) = _regs.ecx;
+            _regs.eflags |= X86_EFLAGS_IF;
+        }
+
+        fail_if(!ops->write_segment);
+        if ( (rc = ops->write_segment(x86_seg_cs, &cs, ctxt)) != X86EMUL_OKAY ||
+             (!amd_like(ctxt) &&
+              (rc = ops->write_segment(x86_seg_ss, &sreg,
+                                       ctxt)) != X86EMUL_OKAY) )
+            goto done;
+
+        singlestep = _regs.eflags & X86_EFLAGS_TF;
+        break;
+
     case X86EMUL_OPC(0x0f, 0x08): /* invd */
     case X86EMUL_OPC(0x0f, 0x09): /* wbinvd / wbnoinvd */
         generate_exception_if(!mode_ring0(), EXC_GP, 0);
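For reference when reading the selector computation above, a small standalone helper (not part of the patch; the example STAR value is arbitrary) showing which MSR_STAR fields are consumed: bits 63:48 hold the SYSRET CS/SS selector base and bits 47:32 the SYSCALL CS/SS selector base.

#include <stdint.h>
#include <stdio.h>

#define STAR_SYSRET_SEL(star)   ((uint16_t)((star) >> 48))
#define STAR_SYSCALL_SEL(star)  ((uint16_t)((star) >> 32))

int main(void)
{
    uint64_t star = ((uint64_t)0x23 << 48) | ((uint64_t)0x10 << 32);
    uint16_t base = STAR_SYSRET_SEL(star);

    printf("SYSCALL CS/SS base: %#x\n", (unsigned)STAR_SYSCALL_SEL(star));
    /* The hunk above derives sreg.sel = (base + 8) | 3 and then
     * cs.sel = sreg.sel + 8 (64-bit) or sreg.sel - 8 (compat). */
    printf("SYSRET ss.sel=%#x, 64-bit cs.sel=%#x\n",
           (unsigned)((base + 8) | 3), (unsigned)(((base + 8) | 3) + 8));
    return 0;
}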
This is to augment SYSCALL, which has been supported for quite some time.

Signed-off-by: Jan Beulich <jbeulich@suse.com>