Message ID | 20240930161837.1248144-3-andrew.cooper3@citrix.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | x86/pv: Fixes to guest_io_okay() | expand |
On 30.09.2024 18:18, Andrew Cooper wrote: > RFC: Should we make the boundary check be (port + bytes + 8)? That would be > more correct, but liable to break unsuspecting VMs. Maybe we should just > comment our way out of it. What would the "+ 8" be intended to express? (I take it you mean ... > --- a/xen/arch/x86/pv/emul-priv-op.c > +++ b/xen/arch/x86/pv/emul-priv-op.c > @@ -169,29 +169,26 @@ static intguest_io_okay(unsigned int port, unsigned int bytes, > > if ( (port + bytes) <= v->arch.pv.iobmp_limit ) ... this check, which looks correct to me as is. In particular with the "+ 8" there would appear to be no way to access ports at the very top of the 64k range anymore, as PHYSDEVOP_set_iobitmap handling caps nr_ports at 64k. IOW I think "commenting our way out of it" is the only possible approach.) With or without such a comment added Reviewed-by: Jan Beulich <jbeulich@suse.com> Jan
On 01/10/2024 8:36 am, Jan Beulich wrote: > On 30.09.2024 18:18, Andrew Cooper wrote: >> RFC: Should we make the boundary check be (port + bytes + 8)? That would be >> more correct, but liable to break unsuspecting VMs. Maybe we should just >> comment our way out of it. > What would the "+ 8" be intended to express? (I take it you mean ... > >> --- a/xen/arch/x86/pv/emul-priv-op.c >> +++ b/xen/arch/x86/pv/emul-priv-op.c >> @@ -169,29 +169,26 @@ static intguest_io_okay(unsigned int port, unsigned int bytes, >> >> if ( (port + bytes) <= v->arch.pv.iobmp_limit ) > ... this check, which looks correct to me as is. In particular with the > "+ 8" there would appear to be no way to access ports at the very top of > the 64k range anymore, as PHYSDEVOP_set_iobitmap handling caps nr_ports > at 64k. IOW I think "commenting our way out of it" is the only possible > approach.) This is actually a horrid corner case, different to what I was thinking yesterday. In a real TSS, the CPU doesn't read past the TR limit. Such accesses are turned into a permission failure. But in Xen, guest_io_okay() does read one byte past iobmp_limit. Previously, we'd consume what was there if it was accessible, or declare a permissions failure if it was inaccessible. Now, we'll throw #PF back to the kernel with %cr2 beyond the the limit. It's not OK for Xen to be reading past the given limit; it's not how real CPUs behave. Part of the problem is that, even since it's introduction in 013351bd7 (2006), the public interface was named nr_ports while Xen's internal field was called iobmp_limit. In terms of how it's used these days in PV guests: Linux points the bitmap into it's real TSS, and sets nr_ports to either 0 or 65536 depending on whether the userspace task is permitted to use IO ports or not. NetBSD doesn't seem to use PHYSDEVOP_set_iobitmap at all, and only uses PHYSDEVOP_set_iopl. MiniOS uses neither. I think Xen's internal variable wants renaming to not be a limit, and the comment for PHYSDEVOP_set_iobitmap wants extending to note that, like a real TSS, Xen might read one byte beyond the end of the bitmap. I think this wants doing in a separate patch. Thoughts? > With or without such a comment added > Reviewed-by: Jan Beulich <jbeulich@suse.com> Thanks. ~Andrew
On 01.10.2024 12:41, Andrew Cooper wrote: > I think Xen's internal variable wants renaming to not be a limit, and > the comment for PHYSDEVOP_set_iobitmap wants extending to note that, > like a real TSS, Xen might read one byte beyond the end of the bitmap. > > I think this wants doing in a separate patch. Thoughts? That's probably going to be best. Jan
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c index 978bd6c0775f..b5d184038fa3 100644 --- a/xen/arch/x86/pv/emul-priv-op.c +++ b/xen/arch/x86/pv/emul-priv-op.c @@ -169,29 +169,26 @@ static int guest_io_okay(unsigned int port, unsigned int bytes, if ( (port + bytes) <= v->arch.pv.iobmp_limit ) { - union { uint8_t bytes[2]; uint16_t mask; } x; + const void *__user addr = v->arch.pv.iobmp.p + (port >> 3); + uint16_t mask; + int rc; - /* - * Grab permission bytes from guest space. Inaccessible bytes are - * read as 0xff (no access allowed). - */ + /* Grab permission bytes from guest space. */ if ( user_mode ) toggle_guest_pt(v); - switch ( __copy_from_guest_offset(x.bytes, v->arch.pv.iobmp, - port>>3, 2) ) - { - default: x.bytes[0] = ~0; - /* fallthrough */ - case 1: x.bytes[1] = ~0; - /* fallthrough */ - case 0: break; - } + rc = __copy_from_guest_pv(&mask, addr, 2); if ( user_mode ) toggle_guest_pt(v); - if ( (x.mask & (((1 << bytes) - 1) << (port & 7))) == 0 ) + if ( rc ) + { + x86_emul_pagefault(0, (unsigned long)addr + bytes - rc, ctxt); + return X86EMUL_EXCEPTION; + } + + if ( (mask & (((1 << bytes) - 1) << (port & 7))) == 0 ) return X86EMUL_OKAY; }
The switch statement in guest_io_okay() is a very expensive way of pre-initialising x with ~0, and performing a partial read into it. However, the logic isn't correct either. In a real TSS, the CPU always reads two bytes (like here), and any TSS limit violation turns silently into no-access. But, in-limit accesses trigger #PF as usual. AMD document this property explicitly, and while Intel don't (so far as I can tell), they do behave consistently with AMD. Switch from __copy_from_guest_offset() to __copy_from_guest_pv(), like everything else in this file. This removes code generation setting up copy_from_user_hvm() (in the likely path even), and safety LFENCEs from evaluate_nospec(). Change the logic to raise #PF if __copy_from_guest_pv() fails, rather than disallowing the IO port access. This brings the behaviour better in line with normal x86. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> --- CC: Jan Beulich <JBeulich@suse.com> CC: Roger Pau Monné <roger.pau@citrix.com> RFC: Should we make the boundary check be (port + bytes + 8)? That would be more correct, but liable to break unsuspecting VMs. Maybe we should just comment our way out of it. This needs to be combined with Jan's "x86/PV: simplify (and thus correct) guest accessor functions" to function completely correctly. From XTF testing: This series on its own: --- Xen Test Framework --- Environment: PV 64bit (Long mode 4 levels) Test pv-emul-cr2 Error: %cr2 expected 0x3000, got 0x2fff Test result: ERROR This series plus Jan's fix: --- Xen Test Framework --- Environment: PV 64bit (Long mode 4 levels) Test pv-emul-cr2 Test result: SUCCESS Interestingly, the test also does an `INW` without an output parameter straddling that page boundary, and it does reliably get 0x3000 even without the accessor fix. --- xen/arch/x86/pv/emul-priv-op.c | 27 ++++++++++++--------------- 1 file changed, 12 insertions(+), 15 deletions(-)