From patchwork Fri Jan 31 16:44:23 2020
X-Patchwork-Submitter: Jan Beulich
X-Patchwork-Id: 11360279
From: Jan Beulich
To: "xen-devel@lists.xenproject.org"
Cc: Kevin Tian, Wei Liu, Paul Durrant, George Dunlap, Andrew Cooper,
 Jun Nakajima, Roger Pau Monné
Date: Fri, 31 Jan 2020 17:44:23 +0100
Subject: [Xen-devel] [PATCH v4 4/7] x86/HVM: implement memory read caching
 for insn emulation

Emulation requiring device model assistance uses a form of instruction
re-execution, assuming that the second (and any further) pass takes
exactly the same path. This is a valid assumption as far as use of CPU
registers goes (as those can't change without any other instruction
executing in between), but is wrong for memory accesses. In particular,
it has been observed that Windows might page out buffers underneath an
instruction currently under emulation (hitting between two passes). If
the first pass read a memory operand successfully, any subsequent pass
needs to get to see the exact same value.

Introduce a cache to make sure the above described assumption holds.
This is a very simplistic implementation for now: only exact matches
are satisfied (no overlaps or partial reads); this is sufficient for
the immediate purpose of making re-execution an exact replay. The cache
also won't be used just yet for guest page walks; that'll be the
subject of a subsequent change.
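To illustrate the "exact matches only" lookup described above, here is
a minimal, self-contained sketch. It is simplified from the
hvmemul_read_cache() implementation added further down; the demo_*
names and the fixed-size entry layout are illustrative only and not
part of the patch.

  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>

  /* Illustrative stand-in for the cache entries added to emulate.c. */
  struct demo_ent {
      uint64_t gpa;        /* guest physical address of the cached read */
      unsigned int size;   /* access size in bytes, <= sizeof(data) */
      unsigned long data;  /* cached contents */
  };

  /*
   * Succeed only on an exact (gpa, size) match - no overlap or partial
   * read handling, by design.
   */
  bool demo_read_cache(const struct demo_ent *ents, unsigned int nents,
                       uint64_t gpa, void *buf, unsigned int size)
  {
      unsigned int i;

      for ( i = 0; i < nents; ++i )
          if ( ents[i].gpa == gpa && ents[i].size == size )
          {
              memcpy(buf, &ents[i].data, size);
              return true;
          }

      return false;
  }

  int main(void)
  {
      const struct demo_ent ents[] = { { 0x1000, 4, 0xdeadbeef } };
      uint32_t val;

      /* Exact match (gpa 0x1000, 4 bytes) hits; anything else misses. */
      return demo_read_cache(ents, 1, 0x1000, &val, sizeof(val)) ? 0 : 1;
  }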
While the cache is generally transparent to upper layers, its limited
capacity, combined with its being required for correctness, means that
certain users of hvm_copy_from_guest_*() need to disable caching
temporarily, without invalidating the cache. (A small illustrative
sketch of this disable/restore pattern follows after the diff.) Note
that the adjustments here to hvm_hypercall() and hvm_task_switch() are
benign at this point; they'll become relevant once we start to be able
to emulate the respective insns through the main emulator (and more
changes will then likely be needed for nested code).

As to the actual data page in this scenario, there are a couple of
aspects to take into consideration:

- We must be talking about an insn accessing two locations (two memory
  ones, one of which is MMIO, or a memory and an I/O one).

- If the non-I/O/MMIO side is being read, the re-read (if it occurs at
  all) has its result discarded, by taking the shortcut through the
  first switch()'s STATE_IORESP_READY case in hvmemul_do_io(). Note
  how, among all the re-issue sanity checks there, we avoid comparing
  the actual data.

- If the non-I/O/MMIO side is being written, it is the OS's
  responsibility to avoid actually moving page contents to disk while
  there might still be a write access in flight - this is no different
  in behavior from bare hardware.

- Read-modify-write accesses are, as always, complicated, and while we
  deal with them better nowadays than we did in the past, we're still
  not quite there to guarantee hardware-like behavior in all cases
  anyway. Nothing gets worse through the changes made here, afaict.

In __hvm_copy() also reduce p's scope and change its type to void *.

Signed-off-by: Jan Beulich
---
TBD: In principle the caching here renders unnecessary the one used for
     insn bytes (vio->mmio_insn{,_bytes}). However, to seed the cache
     with the data SVM may have made available, we'd also have to know
     the corresponding GPA. It's not safe, however, to re-walk the page
     tables to find out, as the page tables may have changed in the
     meantime. Therefore I guess we need to keep the duplicate
     functionality for now. A possible solution to this could be to use
     a physical-address-based cache for page table accesses (and,
     looking forward, also e.g. SVM/VMX insn emulation), and a
     linear-address-based one for all other reads.
---
v4: Re-write for cache to become transparent to callers.
v3: Add text about the actual data page to the description.
v2: Re-base.

--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -28,6 +28,17 @@
 #include
 #include
 
+struct hvmemul_cache
+{
+    unsigned int num_ents;
+    unsigned int max_ents;
+    struct {
+        paddr_t gpa:PADDR_BITS;
+        unsigned int size:BITS_PER_LONG - PADDR_BITS;
+        unsigned long data;
+    } ents[];
+};
+
 static void hvmtrace_io_assist(const ioreq_t *p)
 {
     unsigned int size, event;
@@ -1866,12 +1877,17 @@ static int hvmemul_rep_movs(
         rc = HVMTRANS_okay;
     }
     else
+    {
+        unsigned int token = hvmemul_cache_disable(curr);
+
         /*
          * We do a modicum of checking here, just for paranoia's sake and to
          * definitely avoid copying an unitialised buffer into guest address
          * space.
          */
         rc = hvm_copy_from_guest_phys(buf, sgpa, bytes);
+        hvmemul_cache_restore(curr, token);
+    }
 
     if ( rc == HVMTRANS_okay )
         rc = hvm_copy_to_guest_phys(dgpa, buf, bytes, curr);
@@ -2534,6 +2550,19 @@ static int _hvm_emulate_one(struct hvm_e
     struct hvm_vcpu_io *vio = &curr->arch.hvm.hvm_io;
     int rc;
 
+    /*
+     * Enable caching if it's currently disabled, but leave the cache
+     * untouched if it's already enabled, for re-execution to consume
+     * entries populated by an earlier pass.
+     */
+    if ( vio->cache->num_ents > vio->cache->max_ents )
+    {
+        ASSERT(vio->io_req.state == STATE_IOREQ_NONE);
+        vio->cache->num_ents = 0;
+    }
+    else
+        ASSERT(vio->io_req.state == STATE_IORESP_READY);
+
     hvm_emulate_init_per_insn(hvmemul_ctxt, vio->mmio_insn,
                               vio->mmio_insn_bytes);
 
@@ -2547,6 +2576,7 @@ static int _hvm_emulate_one(struct hvm_e
     {
         vio->mmio_cache_count = 0;
         vio->mmio_insn_bytes = 0;
+        hvmemul_cache_disable(curr);
     }
     else
     {
@@ -2838,6 +2868,123 @@ void hvm_dump_emulation_state(const char
                              hvmemul_ctxt->insn_buf);
 }
 
+int hvmemul_cache_init(struct vcpu *v)
+{
+    /*
+     * No insn can access more than 16 independent linear addresses (AVX512F
+     * scatters/gathers being the worst). Each such linear range can span a
+     * page boundary, i.e. may require two page walks. Account for each insn
+     * byte individually.
+     */
+    const unsigned int nents = (CONFIG_PAGING_LEVELS + 1) *
+                               (MAX_INST_LEN + 16 * 2);
+    struct hvmemul_cache *cache = xmalloc_flex_struct(struct hvmemul_cache,
+                                                      ents, nents);
+
+    if ( !cache )
+        return -ENOMEM;
+
+    /* Cache is disabled initially. */
+    cache->num_ents = nents + 1;
+    cache->max_ents = nents;
+
+    v->arch.hvm.hvm_io.cache = cache;
+
+    return 0;
+}
+
+unsigned int hvmemul_cache_disable(struct vcpu *v)
+{
+    struct hvmemul_cache *cache = v->arch.hvm.hvm_io.cache;
+    unsigned int token = cache->num_ents;
+
+    cache->num_ents = cache->max_ents + 1;
+
+    return token;
+}
+
+void hvmemul_cache_restore(struct vcpu *v, unsigned int token)
+{
+    struct hvmemul_cache *cache = v->arch.hvm.hvm_io.cache;
+
+    ASSERT(cache->num_ents > cache->max_ents);
+    cache->num_ents = token;
+}
+
+bool hvmemul_read_cache(const struct vcpu *v, paddr_t gpa,
+                        void *buffer, unsigned int size)
+{
+    const struct hvmemul_cache *cache = v->arch.hvm.hvm_io.cache;
+    unsigned int i;
+
+    /* Cache unavailable? */
+    if ( cache->num_ents > cache->max_ents )
+        return false;
+
+    while ( size > sizeof(cache->ents->data) )
+    {
+        i = gpa & (sizeof(cache->ents->data) - 1)
+            ? -gpa & (sizeof(cache->ents->data) - 1)
+            : sizeof(cache->ents->data);
+        if ( !hvmemul_read_cache(v, gpa, buffer, i) )
+            return false;
+        gpa += i;
+        buffer += i;
+        size -= i;
+    }
+
+    for ( i = 0; i < cache->num_ents; ++i )
+        if ( cache->ents[i].gpa == gpa && cache->ents[i].size == size )
+        {
+            memcpy(buffer, &cache->ents[i].data, size);
+            return true;
+        }
+
+    return false;
+}
+
+void hvmemul_write_cache(const struct vcpu *v, paddr_t gpa,
+                         const void *buffer, unsigned int size)
+{
+    struct hvmemul_cache *cache = v->arch.hvm.hvm_io.cache;
+    unsigned int i;
+
+    /* Cache unavailable? */
+    if ( cache->num_ents > cache->max_ents )
+        return;
+
+    while ( size > sizeof(cache->ents->data) )
+    {
+        i = gpa & (sizeof(cache->ents->data) - 1)
+            ? -gpa & (sizeof(cache->ents->data) - 1)
+            : sizeof(cache->ents->data);
+        hvmemul_write_cache(v, gpa, buffer, i);
+        gpa += i;
+        buffer += i;
+        size -= i;
+    }
+
+    for ( i = 0; i < cache->num_ents; ++i )
+        if ( cache->ents[i].gpa == gpa && cache->ents[i].size == size )
+        {
+            memcpy(&cache->ents[i].data, buffer, size);
+            return;
+        }
+
+    if ( unlikely(i >= cache->max_ents) )
+    {
+        ASSERT_UNREACHABLE();
+        return;
+    }
+
+    cache->ents[i].gpa = gpa;
+    cache->ents[i].size = size;
+
+    memcpy(&cache->ents[i].data, buffer, size);
+
+    cache->num_ents = i + 1;
+}
+
 /*
  * Local variables:
  * mode: C
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -717,6 +717,8 @@ int hvm_domain_initialise(struct domain
 
 /* This function and all its descendants need to be to be idempotent. */
 void hvm_domain_relinquish_resources(struct domain *d)
 {
+    struct vcpu *v;
+
     if ( hvm_funcs.domain_relinquish_resources )
         alternative_vcall(hvm_funcs.domain_relinquish_resources, d);
 
@@ -733,6 +735,9 @@ void hvm_domain_relinquish_resources(str
     rtc_deinit(d);
     pmtimer_deinit(d);
     hpet_deinit(d);
+
+    for_each_vcpu ( d, v )
+        hvmemul_cache_destroy(v);
 }
 
 void hvm_domain_destroy(struct domain *d)
@@ -1538,6 +1543,10 @@ int hvm_vcpu_initialise(struct vcpu *v)
 
     v->arch.hvm.inject_event.vector = HVM_EVENT_VECTOR_UNSET;
 
+    rc = hvmemul_cache_init(v);
+    if ( rc )
+        goto fail4;
+
     rc = setup_compat_arg_xlat(v); /* teardown: free_compat_arg_xlat() */
     if ( rc != 0 )
         goto fail4;
@@ -1573,6 +1582,7 @@ int hvm_vcpu_initialise(struct vcpu *v)
  fail5:
     free_compat_arg_xlat(v);
  fail4:
+    hvmemul_cache_destroy(v);
     hvm_funcs.vcpu_destroy(v);
  fail3:
     vlapic_destroy(v);
@@ -2934,6 +2944,7 @@ void hvm_task_switch(
     unsigned int eflags, new_cpl;
     pagefault_info_t pfinfo;
     int exn_raised, rc;
+    unsigned int token = hvmemul_cache_disable(v);
     struct tss32 tss;
 
     hvm_get_segment_register(v, x86_seg_gdtr, &gdt);
@@ -3141,6 +3152,8 @@ void hvm_task_switch(
  out:
     hvm_unmap_entry(optss_desc);
     hvm_unmap_entry(nptss_desc);
+
+    hvmemul_cache_restore(v, token);
 }
 
 enum hvm_translation_result hvm_translate_get_page(
@@ -3231,7 +3244,6 @@ static enum hvm_translation_result __hvm
     gfn_t gfn;
     struct page_info *page;
     p2m_type_t p2mt;
-    char *p;
     int count, todo = size;
 
     ASSERT(is_hvm_vcpu(v));
@@ -3279,11 +3291,17 @@ static enum hvm_translation_result __hvm
             return HVMTRANS_need_retry;
         }
 
-        p = __map_domain_page(page) + pgoff;
-
-        if ( flags & HVMCOPY_to_guest )
+        if ( (flags & HVMCOPY_to_guest) ||
+             !hvmemul_read_cache(v, gfn_to_gaddr(gfn) | pgoff, buf, count) )
         {
-            if ( p2m_is_discard_write(p2mt) )
+            void *p = __map_domain_page(page) + pgoff;
+
+            if ( !(flags & HVMCOPY_to_guest) )
+            {
+                memcpy(buf, p, count);
+                hvmemul_write_cache(v, gfn_to_gaddr(gfn) | pgoff, buf, count);
+            }
+            else if ( p2m_is_discard_write(p2mt) )
             {
                 static unsigned long lastpage;
 
@@ -3300,13 +3318,9 @@ static enum hvm_translation_result __hvm
                     memset(p, 0, count);
                 paging_mark_pfn_dirty(v->domain, _pfn(gfn_x(gfn)));
             }
-        }
-        else
-        {
-            memcpy(buf, p, count);
-        }
 
-        unmap_domain_page(p);
+            unmap_domain_page(p);
+        }
 
         addr += count;
         if ( buf )
--- a/xen/arch/x86/hvm/hypercall.c
+++ b/xen/arch/x86/hvm/hypercall.c
@@ -22,6 +22,7 @@
 #include
 #include
+#include
 #include
 
 static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
@@ -159,6 +160,7 @@ int hvm_hypercall(struct cpu_user_regs *
     struct domain *currd = curr->domain;
     int mode = hvm_guest_x86_mode(curr);
     unsigned long eax = regs->eax;
+    unsigned int token;
 
     switch ( mode )
     {
@@ -183,7 +185,18 @@ int hvm_hypercall(struct cpu_user_regs *
     }
 
     if ( (eax & 0x80000000) && is_viridian_domain(currd) )
-        return viridian_hypercall(regs);
+    {
+        int ret;
+
+        /* See comment below. */
+        token = hvmemul_cache_disable(curr);
+
+        ret = viridian_hypercall(regs);
+
+        hvmemul_cache_restore(curr, token);
+
+        return ret;
+    }
 
     BUILD_BUG_ON(ARRAY_SIZE(hvm_hypercall_table) >
                  ARRAY_SIZE(hypercall_args_table));
@@ -202,6 +215,12 @@ int hvm_hypercall(struct cpu_user_regs *
         return HVM_HCALL_completed;
     }
 
+    /*
+     * Caching is intended for instruction emulation only. Disable it
+     * for any accesses by hypercall argument copy-in / copy-out.
+     */
+    token = hvmemul_cache_disable(curr);
+
     curr->hcall_preempted = false;
 
     if ( mode == 8 )
@@ -295,6 +314,8 @@ int hvm_hypercall(struct cpu_user_regs *
 #endif
     }
 
+    hvmemul_cache_restore(curr, token);
+
     HVM_DBG_LOG(DBG_LEVEL_HCALL, "hcall%lu -> %lx", eax, regs->rax);
 
     if ( curr->hcall_preempted )
--- a/xen/arch/x86/hvm/intercept.c
+++ b/xen/arch/x86/hvm/intercept.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -163,6 +164,9 @@ int hvm_process_io_intercept(const struc
     {
         if ( p->data_is_ptr )
         {
+            struct vcpu *curr = current;
+            unsigned int token = hvmemul_cache_disable(curr);
+
             data = 0;
             switch ( hvm_copy_from_guest_phys(&data, p->data + step * i,
                                               p->size) )
@@ -179,9 +183,11 @@ int hvm_process_io_intercept(const struc
                 ASSERT_UNREACHABLE();
                 /* fall through */
             default:
-                domain_crash(current->domain);
+                domain_crash(curr->domain);
                 return X86EMUL_UNHANDLEABLE;
             }
+
+            hvmemul_cache_restore(curr, token);
         }
         else
             data = p->data;
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -1040,6 +1040,8 @@ void svm_vmenter_helper(const struct cpu
     struct vcpu *curr = current;
     struct vmcb_struct *vmcb = curr->arch.hvm.svm.vmcb;
 
+    ASSERT(hvmemul_cache_disabled(curr));
+
     svm_asid_handle_vmrun();
 
     if ( unlikely(tb_init_done) )
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -35,6 +35,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -607,6 +608,7 @@ void msix_write_completion(struct vcpu *
     if ( !ctrl_address && snoop_addr &&
          v->arch.hvm.hvm_io.msix_snoop_gpa )
     {
+        unsigned int token = hvmemul_cache_disable(v);
         const struct msi_desc *desc;
         uint32_t data;
 
@@ -621,6 +623,8 @@ void msix_write_completion(struct vcpu *
                                       sizeof(data)) == HVMTRANS_okay &&
              !(data & PCI_MSIX_VECTOR_BITMASK) )
             ctrl_address = snoop_addr;
+
+        hvmemul_cache_restore(v, token);
     }
 
     if ( !ctrl_address )
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -4347,6 +4347,8 @@ bool vmx_vmenter_helper(const struct cpu
     struct hvm_vcpu_asid *p_asid;
     bool_t need_flush;
 
+    ASSERT(hvmemul_cache_disabled(curr));
+
     /* Shadow EPTP can't be updated here because irqs are disabled */
     if ( nestedhvm_vcpu_in_guestmode(curr) && vcpu_nestedhvm(curr).stale_np2m )
         return false;
--- a/xen/include/asm-x86/hvm/emulate.h
+++ b/xen/include/asm-x86/hvm/emulate.h
@@ -13,6 +13,7 @@
 #define __ASM_X86_HVM_EMULATE_H__
 
 #include
+#include
 #include
 #include
 
@@ -96,6 +97,31 @@ int hvmemul_do_pio_buffer(uint16_t port,
                           uint8_t dir,
                           void *buffer);
 
+#ifdef CONFIG_HVM
+int __must_check hvmemul_cache_init(struct vcpu *v);
+static inline void hvmemul_cache_destroy(struct vcpu *v)
+{
+    XFREE(v->arch.hvm.hvm_io.cache);
+}
+bool hvmemul_read_cache(const struct vcpu *, paddr_t gpa,
+                        void *buffer, unsigned int size);
+void hvmemul_write_cache(const struct vcpu *, paddr_t gpa,
+                         const void *buffer, unsigned int size);
+unsigned int hvmemul_cache_disable(struct vcpu *);
+void hvmemul_cache_restore(struct vcpu *, unsigned int token);
+/* For use in ASSERT()s only: */
+static inline bool hvmemul_cache_disabled(struct vcpu *v)
+{
+    return hvmemul_cache_disable(v) == hvmemul_cache_disable(v);
+}
+#else
+static inline bool hvmemul_read_cache(const struct vcpu *v, paddr_t gpa,
+                                      void *buf,
+                                      unsigned int size) { return false; }
+static inline void hvmemul_write_cache(const struct vcpu *v, paddr_t gpa,
+                                       const void *buf, unsigned int size) {}
+#endif
+
 void hvm_dump_emulation_state(const char *loglvl, const char *prefix,
                               struct hvm_emulate_ctxt *hvmemul_ctxt, int rc);
 
--- a/xen/include/asm-x86/hvm/vcpu.h
+++ b/xen/include/asm-x86/hvm/vcpu.h
@@ -77,6 +77,8 @@ struct hvm_vcpu_io {
     /* For retries we shouldn't re-fetch the instruction. */
     unsigned int mmio_insn_bytes;
     unsigned char mmio_insn[16];
 
+    struct hvmemul_cache *cache;
+
     /*
      * For string instruction emulation we need to be able to signal a
      * necessary retry through other than function return codes.
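As referenced in the description above, the hvmemul_cache_disable() /
hvmemul_cache_restore() pair used throughout this patch does not drop
cache contents; it merely records and overrides the enabled state via a
token, so the calls nest and earlier entries become visible again on
restore. Below is a minimal, self-contained sketch of that token
pattern; the demo_* names are illustrative only, the real interfaces
being the ones declared in asm-x86/hvm/emulate.h above.

  #include <assert.h>
  #include <stdio.h>

  /* Illustrative stand-in for struct hvmemul_cache in the patch. */
  struct demo_cache {
      unsigned int num_ents;  /* num_ents > max_ents means "disabled" */
      unsigned int max_ents;
  };

  /* Mirror of hvmemul_cache_disable(): remember old state, force "disabled". */
  unsigned int demo_cache_disable(struct demo_cache *c)
  {
      unsigned int token = c->num_ents;

      c->num_ents = c->max_ents + 1;
      return token;
  }

  /* Mirror of hvmemul_cache_restore(): put the remembered state back. */
  void demo_cache_restore(struct demo_cache *c, unsigned int token)
  {
      assert(c->num_ents > c->max_ents);
      c->num_ents = token;
  }

  int main(void)
  {
      struct demo_cache c = { .num_ents = 3, .max_ents = 8 };  /* enabled */
      unsigned int token = demo_cache_disable(&c);

      /* ... work which must bypass the cache, e.g. argument copy-in/out ... */

      demo_cache_restore(&c, token);
      printf("entries after restore: %u\n", c.num_ents);  /* 3 - preserved */
      return 0;
  }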