Message ID: ff68a73e0cdaf89e56add5c8b6e110df881fede1.1619193043.git.ashish.kalra@amd.com (mailing list archive)
State: New, archived
Series: Add guest support for SEV live migration.
On Fri, Apr 23, 2021 at 03:58:43PM +0000, Ashish Kalra wrote:
> +static inline void notify_page_enc_status_changed(unsigned long pfn,
> +						   int npages, bool enc)
> +{
> +	PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc);
> +}

Now the question is whether something like that is needed for TDX, and,
if so, could it be shared by both.

Sean?

> +void notify_addr_enc_status_changed(unsigned long vaddr, int npages,
> +				    bool enc)

Let that line stick out.

> +{
> +#ifdef CONFIG_PARAVIRT
> +	unsigned long sz = npages << PAGE_SHIFT;
> +	unsigned long vaddr_end = vaddr + sz;
> +
> +	while (vaddr < vaddr_end) {
> +		int psize, pmask, level;
> +		unsigned long pfn;
> +		pte_t *kpte;
> +
> +		kpte = lookup_address(vaddr, &level);
> +		if (!kpte || pte_none(*kpte))
> +			return;

What does this mean exactly? On the first failure to look up the
address, you return? Why not continue so that you can notify about the
remaining pages in [vaddr, vaddr_end)?

Also, what does it mean for the current range if the lookup fails?
Innocuous situation or do you need to signal it with a WARN or so?

> +
> +		pfn = pg_level_to_pfn(level, kpte, NULL);
> +		if (!pfn)
> +			continue;

Same here: if it hits the default case, wouldn't it make sense to
WARN_ONCE or so to catch potential misuse? Or better yet, the WARN_ONCE
should be in pg_level_to_pfn().

> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 16f878c26667..45e65517405a 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -2012,6 +2012,13 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> 	 */
> 	cpa_flush(&cpa, 0);
>
> +	/*
> +	 * Notify hypervisor that a given memory range is mapped encrypted
> +	 * or decrypted. The hypervisor will use this information during the
> +	 * VM migration.
> +	 */
> +	notify_addr_enc_status_changed(addr, numpages, enc);

If you notify about a range then that function should be called

	notify_range_enc_status_changed

or so.
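(Purely illustrative, not the submitted patch: a minimal sketch of the
handling suggested above, i.e. warn and skip instead of silently
returning. Note that any skip must still advance vaddr, or the loop
never terminates:)

	while (vaddr < vaddr_end) {
		unsigned long pfn;
		pte_t *kpte;
		int level;

		kpte = lookup_address(vaddr, &level);
		if (WARN_ONCE(!kpte || pte_none(*kpte),
			      "failed to look up va %#lx\n", vaddr)) {
			vaddr += PAGE_SIZE;	/* skip, keep notifying */
			continue;
		}

		pfn = pg_level_to_pfn(level, kpte, NULL);
		if (WARN_ONCE(!pfn, "bad page level for va %#lx\n", vaddr)) {
			vaddr += PAGE_SIZE;
			continue;
		}

		notify_page_enc_status_changed(pfn,
					       page_level_size(level) >> PAGE_SHIFT,
					       enc);
		vaddr = (vaddr & page_level_mask(level)) + page_level_size(level);
	}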
On Wed, May 12, 2021, Borislav Petkov wrote:
> On Fri, Apr 23, 2021 at 03:58:43PM +0000, Ashish Kalra wrote:
> > +static inline void notify_page_enc_status_changed(unsigned long pfn,
> > +						     int npages, bool enc)
> > +{
> > +	PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc);
> > +}
>
> Now the question is whether something like that is needed for TDX, and,
> if so, could it be shared by both.

Yes, TDX needs this same hook, but "can't" reuse the hypercall verbatim.
Ditto for SEV-SNP. I wanted to squish everything into a single common
hypercall, but that didn't pan out.

The problem is that both TDX and SNP define their own versions of this
so that any guest kernel that complies with the TDX|SNP specification
will run cleanly on a hypervisor that also complies with the spec. This
KVM-specific hook doesn't meet those requirements because non-Linux
guest support will be sketchy at best, and non-KVM hypervisor support
will be non-existent.

The best we can do, short of refusing to support TDX or SNP, is to make
this KVM-specific hypercall compatible with TDX and SNP so that the bulk
of the control logic is identical. The mechanics of actually invoking
the hypercall will differ, but if done right, everything else should be
reusable without modification.

I had an in-depth analysis of this, but it was all off-list. Pasted below.

TDX uses GPRs to communicate with the host, so it can tunnel "legacy"
hypercalls from time zero. SNP could technically do the same (with a
revised GHCB spec), but it'd be butt ugly. And of course trying to go
that route for either TDX or SNP would run into the problem of having to
coordinate the ABI for the "legacy" hypercall across all guests and
hosts. So yeah, trying to remove any of the three (KVM vs. SNP vs. TDX)
interfaces is sadly just wishful thinking.

That being said, I do think we can reuse the KVM-specific hypercall for
TDX and SNP. Both will still need a {TDX,SNP}-specific GHC{I,B} protocol
so that cross-vendor compatibility is guaranteed, but that shouldn't
preclude a guest that is KVM enlightened from switching to the
KVM-specific hypercall once it can do so. More thoughts later on.

> I guess a common structure could be used along the lines of what is in the
> GHCB spec today, but that seems like overkill for SEV/SEV-ES, which will
> only ever really do a single page range at a time (through
> set_memory_encrypted() and set_memory_decrypted()). The reason for the
> expanded form for SEV-SNP is that the OS can (proactively) adjust multiple
> page ranges in advance. Will TDX need to do something similar?

Yes, TDX needs the exact same thing. All three (SEV, SNP, and TDX) have
more or less the exact same hook in the guest (Linux, obviously) kernel.

> If so, the only real common piece in KVM is a function to track what pages
> are shared vs private, which would only require a simple interface.

It's not just KVM, it's also the relevant code in the guest kernel(s)
and other hypervisors. And the MMU side of KVM will likely be able to
share code, e.g. to act on the page size hint.

> So for SEV/SEV-ES, a simpler hypercall interface to specify a single page
> range is really all that is needed, but it must be common across
> hypervisors. I think that was one of Sean's points originally, we don't
> want one hypercall for KVM, one for Hyper-V, one for VMware, one for Xen,
> etc.

For the KVM-defined interface (required for SEV/SEV-ES), I think it
makes sense to make it a superset of the SNP and TDX protocols so that
it _can_ be used in lieu of the SNP/TDX-specific protocol. I don't know
for sure whether or not that will actually yield better code and/or
performance, but it costs us almost nothing and at least gives us the
option of further optimizing the Linux+KVM combination.

It probably shouldn't be a strict superset, as in practice I don't think
the SNP approach of having individual entries when batching multiple
pages will yield the best performance. E.g. the vast majority (maybe
all?) of conversions for a Linux guest will be physically contiguous and
will have the same preferred page size, at which point there will be
less overhead if the guest specifies a massive range as opposed to
having to sanitize and fill a large buffer.

TL;DR: I think the KVM hypercall should be something like this, so that
it can be used for SNP and TDX, and possibly for other purposes, e.g.
for paravirt performance enhancements or something.

8. KVM_HC_MAP_GPA_RANGE
-----------------------
:Architecture: x86
:Status: active
:Purpose: Request KVM to map a GPA range with the specified attributes.

a0: the guest physical address of the start page
a1: the number of (4kb) pages (must be contiguous in GPA space)
a2: attributes

where 'attributes' could be something like:

bits  3:0 - preferred page size encoding 0 = 4kb, 1 = 2mb, 2 = 1gb, etc...
bit     4 - plaintext = 0, encrypted = 1
bits 63:5 - reserved (must be zero)
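(To make the proposed ABI concrete, here is a hypothetical guest-side
encoding; the macro names are invented for illustration, only the bit
layout comes from the sketch above:)

#define KVM_MAP_GPA_RANGE_PAGE_SZ_4K	0
#define KVM_MAP_GPA_RANGE_PAGE_SZ_2M	1
#define KVM_MAP_GPA_RANGE_PAGE_SZ_1G	2
#define KVM_MAP_GPA_RANGE_ENCRYPTED	(1UL << 4)

/* a0 = start GPA, a1 = number of 4k pages, a2 = attributes */
static long kvm_map_gpa_range(unsigned long gpa, unsigned long npages,
			      unsigned long page_sz, bool enc)
{
	unsigned long attrs = (page_sz & 0xf) |
			      (enc ? KVM_MAP_GPA_RANGE_ENCRYPTED : 0);

	return kvm_hypercall3(KVM_HC_MAP_GPA_RANGE, gpa, npages, attrs);
}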
On Wed, May 12, 2021 at 03:51:10PM +0000, Sean Christopherson wrote:
> TL;DR: I think the KVM hypercall should be something like this, so that
> it can be used for SNP and TDX, and possibly for other purposes, e.g.
> for paravirt performance enhancements or something.

Ok, good, I was only making sure this is on people's radar but it
actually is more than that.

I'll let Tom and Jörg comment on the meat of the thing - as always,
thanks for the detailed explanation.

From my !virt guy POV, I like the aspect of sharing stuff as much as
possible and it all makes sense to me but what the hell do I know...

> 8. KVM_HC_MAP_GPA_RANGE
> -----------------------
> :Architecture: x86
> :Status: active
> :Purpose: Request KVM to map a GPA range with the specified attributes.
>
> a0: the guest physical address of the start page
> a1: the number of (4kb) pages (must be contiguous in GPA space)
> a2: attributes
>
> where 'attributes' could be something like:
>
> bits  3:0 - preferred page size encoding 0 = 4kb, 1 = 2mb, 2 = 1gb, etc...
> bit     4 - plaintext = 0, encrypted = 1
> bits 63:5 - reserved (must be zero)

Yah, nice and simple. I like.

Thx.
Hello Boris,

On Wed, May 12, 2021 at 03:15:37PM +0200, Borislav Petkov wrote:
> On Fri, Apr 23, 2021 at 03:58:43PM +0000, Ashish Kalra wrote:
> > +static inline void notify_page_enc_status_changed(unsigned long pfn,
> > +						     int npages, bool enc)
> > +{
> > +	PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc);
> > +}
>
> Now the question is whether something like that is needed for TDX, and,
> if so, could it be shared by both.
>
> Sean?
>
> > +void notify_addr_enc_status_changed(unsigned long vaddr, int npages,
> > +				    bool enc)
>
> Let that line stick out.
>
> > +{
> > +#ifdef CONFIG_PARAVIRT
> > +	unsigned long sz = npages << PAGE_SHIFT;
> > +	unsigned long vaddr_end = vaddr + sz;
> > +
> > +	while (vaddr < vaddr_end) {
> > +		int psize, pmask, level;
> > +		unsigned long pfn;
> > +		pte_t *kpte;
> > +
> > +		kpte = lookup_address(vaddr, &level);
> > +		if (!kpte || pte_none(*kpte))
> > +			return;
>
> What does this mean exactly? On the first failure to look up the
> address, you return? Why not continue so that you can notify about the
> remaining pages in [vaddr, vaddr_end)?

What's the use of notifying about a partial page list? Even a single
incorrect guest page encryption status can crash the guest/migrated
guest.

> Also, what does it mean for the current range if the lookup fails?
> Innocuous situation or do you need to signal it with a WARN or so?

Yes, it makes sense to signal it with a WARN or so.

> > +
> > +		pfn = pg_level_to_pfn(level, kpte, NULL);
> > +		if (!pfn)
> > +			continue;
>
> Same here: if it hits the default case, wouldn't it make sense to
> WARN_ONCE or so to catch potential misuse? Or better yet, the WARN_ONCE
> should be in pg_level_to_pfn().

Yes, it makes sense to add a WARN_ONCE() in pg_level_to_pfn().

> > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> > index 16f878c26667..45e65517405a 100644
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -2012,6 +2012,13 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> > 	 */
> > 	cpa_flush(&cpa, 0);
> >
> > +	/*
> > +	 * Notify hypervisor that a given memory range is mapped encrypted
> > +	 * or decrypted. The hypervisor will use this information during the
> > +	 * VM migration.
> > +	 */
> > +	notify_addr_enc_status_changed(addr, numpages, enc);
>
> If you notify about a range then that function should be called
>
> 	notify_range_enc_status_changed
>
> or so.

Ok.

Thanks,
Ashish

> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
Hello Sean,

On Wed, May 12, 2021 at 03:51:10PM +0000, Sean Christopherson wrote:
> On Wed, May 12, 2021, Borislav Petkov wrote:
> > On Fri, Apr 23, 2021 at 03:58:43PM +0000, Ashish Kalra wrote:
> > > +static inline void notify_page_enc_status_changed(unsigned long pfn,
> > > +						       int npages, bool enc)
> > > +{
> > > +	PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc);
> > > +}
> >
> > Now the question is whether something like that is needed for TDX, and,
> > if so, could it be shared by both.
>
> Yes, TDX needs this same hook, but "can't" reuse the hypercall verbatim.
> Ditto for SEV-SNP. I wanted to squish everything into a single common
> hypercall, but that didn't pan out.
>
> The problem is that both TDX and SNP define their own versions of this
> so that any guest kernel that complies with the TDX|SNP specification
> will run cleanly on a hypervisor that also complies with the spec. This
> KVM-specific hook doesn't meet those requirements because non-Linux
> guest support will be sketchy at best, and non-KVM hypervisor support
> will be non-existent.
>
> The best we can do, short of refusing to support TDX or SNP, is to make
> this KVM-specific hypercall compatible with TDX and SNP so that the bulk
> of the control logic is identical. The mechanics of actually invoking
> the hypercall will differ, but if done right, everything else should be
> reusable without modification.
>
> I had an in-depth analysis of this, but it was all off-list. Pasted below.
>
> TDX uses GPRs to communicate with the host, so it can tunnel "legacy"
> hypercalls from time zero. SNP could technically do the same (with a
> revised GHCB spec), but it'd be butt ugly. And of course trying to go
> that route for either TDX or SNP would run into the problem of having to
> coordinate the ABI for the "legacy" hypercall across all guests and
> hosts. So yeah, trying to remove any of the three (KVM vs. SNP vs. TDX)
> interfaces is sadly just wishful thinking.
>
> That being said, I do think we can reuse the KVM-specific hypercall for
> TDX and SNP. Both will still need a {TDX,SNP}-specific GHC{I,B} protocol
> so that cross-vendor compatibility is guaranteed, but that shouldn't
> preclude a guest that is KVM enlightened from switching to the
> KVM-specific hypercall once it can do so. More thoughts later on.
>
> > I guess a common structure could be used along the lines of what is in the
> > GHCB spec today, but that seems like overkill for SEV/SEV-ES, which will
> > only ever really do a single page range at a time (through
> > set_memory_encrypted() and set_memory_decrypted()). The reason for the
> > expanded form for SEV-SNP is that the OS can (proactively) adjust multiple
> > page ranges in advance. Will TDX need to do something similar?
>
> Yes, TDX needs the exact same thing. All three (SEV, SNP, and TDX) have
> more or less the exact same hook in the guest (Linux, obviously) kernel.
>
> > If so, the only real common piece in KVM is a function to track what pages
> > are shared vs private, which would only require a simple interface.
>
> It's not just KVM, it's also the relevant code in the guest kernel(s)
> and other hypervisors. And the MMU side of KVM will likely be able to
> share code, e.g. to act on the page size hint.
>
> > So for SEV/SEV-ES, a simpler hypercall interface to specify a single page
> > range is really all that is needed, but it must be common across
> > hypervisors. I think that was one of Sean's points originally, we don't
> > want one hypercall for KVM, one for Hyper-V, one for VMware, one for Xen,
> > etc.
>
> For the KVM-defined interface (required for SEV/SEV-ES), I think it
> makes sense to make it a superset of the SNP and TDX protocols so that
> it _can_ be used in lieu of the SNP/TDX-specific protocol. I don't know
> for sure whether or not that will actually yield better code and/or
> performance, but it costs us almost nothing and at least gives us the
> option of further optimizing the Linux+KVM combination.
>
> It probably shouldn't be a strict superset, as in practice I don't think
> the SNP approach of having individual entries when batching multiple
> pages will yield the best performance. E.g. the vast majority (maybe
> all?) of conversions for a Linux guest will be physically contiguous and
> will have the same preferred page size, at which point there will be
> less overhead if the guest specifies a massive range as opposed to
> having to sanitize and fill a large buffer.
>
> TL;DR: I think the KVM hypercall should be something like this, so that
> it can be used for SNP and TDX, and possibly for other purposes, e.g.
> for paravirt performance enhancements or something.
>
> 8. KVM_HC_MAP_GPA_RANGE
> -----------------------
> :Architecture: x86
> :Status: active
> :Purpose: Request KVM to map a GPA range with the specified attributes.
>
> a0: the guest physical address of the start page
> a1: the number of (4kb) pages (must be contiguous in GPA space)
> a2: attributes
>
> where 'attributes' could be something like:
>
> bits  3:0 - preferred page size encoding 0 = 4kb, 1 = 2mb, 2 = 1gb, etc...
> bit     4 - plaintext = 0, encrypted = 1
> bits 63:5 - reserved (must be zero)

Ok. Will modify the page encryption status hypercall to be compatible
with the above defined interface.

Thanks,
Ashish
On 13/05/21 08:57, Ashish Kalra wrote:
>> 8. KVM_HC_MAP_GPA_RANGE
>> -----------------------
>> :Architecture: x86
>> :Status: active
>> :Purpose: Request KVM to map a GPA range with the specified attributes.
>>
>> a0: the guest physical address of the start page
>> a1: the number of (4kb) pages (must be contiguous in GPA space)
>> a2: attributes
>>
>> where 'attributes' could be something like:
>>
>> bits  3:0 - preferred page size encoding 0 = 4kb, 1 = 2mb, 2 = 1gb, etc...
>> bit     4 - plaintext = 0, encrypted = 1
>> bits 63:5 - reserved (must be zero)
>
> Ok. Will modify the page encryption status hypercall to be compatible
> with the above defined interface.

Great, this is the current state of the host-side patch (untested):

From df571861e1d47d81a578b4950c704d01a0ed915e Mon Sep 17 00:00:00 2001
From: Ashish Kalra <ashish.kalra@amd.com>
Date: Thu, 15 Apr 2021 15:57:02 +0000
Subject: [PATCH] KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall

This hypercall is used by the SEV guest to notify a change in the page
encryption status to the hypervisor. The hypercall should be invoked
only when the encryption attribute is changed from encrypted ->
decrypted and vice versa. By default all guest pages are considered
encrypted.

The hypercall exits to userspace to manage the guest shared regions and
integrate with the userspace VMM's migration code.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 7fcb2fd38f42..0d2abcad0565 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6891,3 +6891,22 @@ This capability is always enabled.
 This capability indicates that the KVM virtual PTP service is
 supported in the host. A VMM can check whether the service is
 available to the guest on migration.
+
+8.33 KVM_CAP_EXIT_HYPERCALL
+---------------------------
+
+:Capability: KVM_CAP_EXIT_HYPERCALL
+:Architectures: x86
+:Type: vm
+
+This capability, if enabled, will cause KVM to exit to userspace
+with KVM_EXIT_HYPERCALL exit reason to process some hypercalls.
+
+Calling KVM_CHECK_EXTENSION for this capability will return a bitmask
+of hypercalls that can be configured to exit to userspace.
+Right now, the only such hypercall is KVM_HC_PAGE_ENC_STATUS.
+
+The argument to KVM_ENABLE_CAP is also a bitmask, and must be a subset
+of the result of KVM_CHECK_EXTENSION.  KVM will forward to userspace
+the hypercalls whose corresponding bit is in the argument, and return
+ENOSYS for the others.
diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
index cf62162d4be2..1e0013d3c972 100644
--- a/Documentation/virt/kvm/cpuid.rst
+++ b/Documentation/virt/kvm/cpuid.rst
@@ -96,6 +96,14 @@ KVM_FEATURE_MSI_EXT_DEST_ID        15          guest checks this feature bit
                                                before using extended destination
                                                ID bits in MSI address bits 11-5.

+KVM_FEATURE_HC_PAGE_ENC_STATUS     16          guest checks this feature bit before
+                                               using the page encryption state
+                                               hypercall to notify the page state
+                                               change
+
+KVM_FEATURE_MIGRATION_CONTROL      17          guest checks this feature bit before
+                                               using MSR_KVM_MIGRATION_CONTROL
+
 KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24          host will warn if no guest-side
                                                per-cpu warps are expected in
                                                kvmclock
diff --git a/Documentation/virt/kvm/hypercalls.rst b/Documentation/virt/kvm/hypercalls.rst
index ed4fddd364ea..117ff3b27d3c 100644
--- a/Documentation/virt/kvm/hypercalls.rst
+++ b/Documentation/virt/kvm/hypercalls.rst
@@ -169,3 +169,24 @@ a0: destination APIC ID

 :Usage example: When sending a call-function IPI-many to vCPUs, yield
                 if any of the IPI target vCPUs was preempted.
+
+
+8. KVM_HC_PAGE_ENC_STATUS
+-------------------------
+:Architecture: x86
+:Status: active
+:Purpose: Notify the encryption status changes in guest page table (SEV guest)
+
+a0: the guest physical address of the start page
+a1: the number of pages
+a2: page encryption status
+
+   Where:
+	* 1: Page is encrypted
+	* 0: Page is decrypted
+
+**Implementation note**: this hypercall is implemented in userspace via
+the KVM_CAP_EXIT_HYPERCALL capability. Userspace must enable that capability
+before advertising KVM_FEATURE_HC_PAGE_ENC_STATUS in the guest CPUID.  In
+addition, if the guest supports KVM_FEATURE_MIGRATION_CONTROL, userspace
+must also set up an MSR filter to process writes to MSR_KVM_MIGRATION_CONTROL.
diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
index e37a14c323d2..977936176f36 100644
--- a/Documentation/virt/kvm/msr.rst
+++ b/Documentation/virt/kvm/msr.rst
@@ -376,3 +376,16 @@ data:
 	write '1' to bit 0 of the MSR, this causes the host to re-scan its queue
 	and check if there are more notifications pending. The MSR is available
 	if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
+
+MSR_KVM_MIGRATION_CONTROL:
+	0x4b564d08
+
+data:
+	This MSR is available if KVM_FEATURE_MIGRATION_CONTROL is present in
+	CPUID.  Bit 0 represents whether live migration of the guest is allowed.
+
+	When a guest is started, bit 0 will be 0 if the guest has encrypted
+	memory and 1 if the guest does not have encrypted memory.  If the
+	guest is communicating page encryption status to the host using the
+	``KVM_HC_PAGE_ENC_STATUS`` hypercall, it can set bit 0 in this MSR to
+	allow live migration of the guest.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 55efbacfc244..5b9bc8b3db20 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1067,6 +1067,8 @@ struct kvm_arch {
 	u32 user_space_msr_mask;
 	struct kvm_x86_msr_filter __rcu *msr_filter;

+	u32 hypercall_exit_enabled;
+
 	/* Guest can access the SGX PROVISIONKEY. */
 	bool sgx_provisioning_allowed;

diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 950afebfba88..cff18b8b6dec 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -33,6 +33,8 @@
 #define KVM_FEATURE_PV_SCHED_YIELD	13
 #define KVM_FEATURE_ASYNC_PF_INT	14
 #define KVM_FEATURE_MSI_EXT_DEST_ID	15
+#define KVM_FEATURE_HC_PAGE_ENC_STATUS	16
+#define KVM_FEATURE_MIGRATION_CONTROL	17

 #define KVM_HINTS_REALTIME      0

@@ -54,6 +56,7 @@
 #define MSR_KVM_POLL_CONTROL	0x4b564d05
 #define MSR_KVM_ASYNC_PF_INT	0x4b564d06
 #define MSR_KVM_ASYNC_PF_ACK	0x4b564d07
+#define MSR_KVM_MIGRATION_CONTROL	0x4b564d08

 struct kvm_steal_time {
 	__u64 steal;
@@ -90,6 +93,8 @@ struct kvm_clock_pairing {
 /* MSR_KVM_ASYNC_PF_INT */
 #define KVM_ASYNC_PF_VEC_MASK	GENMASK(7, 0)

+/* MSR_KVM_MIGRATION_CONTROL */
+#define KVM_MIGRATION_READY	(1 << 0)

 /* Operations for KVM_HC_MMU_OP */
 #define KVM_MMU_OP_WRITE_PTE            1
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5bd550eaf683..eab7d50eb4e2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -102,6 +102,8 @@ static u64 __read_mostly efer_reserved_bits = ~((u64)EFER_SCE);

 static u64 __read_mostly cr4_reserved_bits = CR4_RESERVED_BITS;

+#define KVM_EXIT_HYPERCALL_VALID_MASK (1 << KVM_HC_PAGE_ENC_STATUS)
+
 #define KVM_X2APIC_API_VALID_FLAGS (KVM_X2APIC_API_USE_32BIT_IDS | \
                                     KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK)

@@ -3894,6 +3896,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VM_COPY_ENC_CONTEXT_FROM:
 		r = 1;
 		break;
+	case KVM_CAP_EXIT_HYPERCALL:
+		r = KVM_EXIT_HYPERCALL_VALID_MASK;
+		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
 		return KVM_GUESTDBG_VALID_MASK;
 #ifdef CONFIG_KVM_XEN
@@ -5494,6 +5499,14 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		break;
 	}
 #endif
+	case KVM_CAP_EXIT_HYPERCALL:
+		if (cap->args[0] & ~KVM_EXIT_HYPERCALL_VALID_MASK) {
+			r = -EINVAL;
+			break;
+		}
+		kvm->arch.hypercall_exit_enabled = cap->args[0];
+		r = 0;
+		break;
 	case KVM_CAP_VM_COPY_ENC_CONTEXT_FROM:
 		r = -EINVAL;
 		if (kvm_x86_ops.vm_copy_enc_context_from)
@@ -8384,6 +8397,16 @@ static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long dest_id)
 	return;
 }

+static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
+{
+	u64 ret = vcpu->run->hypercall.ret;
+
+	if (!is_64_bit_mode(vcpu))
+		ret = (u32)ret;
+	kvm_rax_write(vcpu, ret);
+	++vcpu->stat.hypercalls;
+	return kvm_skip_emulated_instruction(vcpu);
+}
+
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 {
 	unsigned long nr, a0, a1, a2, a3, ret;
@@ -8449,6 +8472,28 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 		kvm_sched_yield(vcpu, a0);
 		ret = 0;
 		break;
+	case KVM_HC_PAGE_ENC_STATUS: {
+		u64 gpa = a0, npages = a1, enc = a2;
+
+		ret = -KVM_ENOSYS;
+		if (!(vcpu->kvm->arch.hypercall_exit_enabled & (1 << KVM_HC_PAGE_ENC_STATUS)))
+			break;
+
+		if (!PAGE_ALIGNED(gpa) || !npages ||
+		    gpa_to_gfn(gpa) + npages <= gpa_to_gfn(gpa)) {
+			ret = -KVM_EINVAL;
+			break;
+		}
+
+		vcpu->run->exit_reason        = KVM_EXIT_HYPERCALL;
+		vcpu->run->hypercall.nr       = KVM_HC_PAGE_ENC_STATUS;
+		vcpu->run->hypercall.args[0]  = gpa;
+		vcpu->run->hypercall.args[1]  = npages;
+		vcpu->run->hypercall.args[2]  = enc;
+		vcpu->run->hypercall.longmode = op_64_bit;
+		vcpu->arch.complete_userspace_io = complete_hypercall_exit;
+		return 0;
+	}
 	default:
 		ret = -KVM_ENOSYS;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 3fd9a7e9d90c..1fb4fd863324 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1082,6 +1082,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_SGX_ATTRIBUTE 196
 #define KVM_CAP_VM_COPY_ENC_CONTEXT_FROM 197
 #define KVM_CAP_PTP_KVM 198
+#define KVM_CAP_EXIT_HYPERCALL 199

 #ifdef KVM_CAP_IRQ_ROUTING
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 8b86609849b9..847b83b75dc8 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -29,6 +29,7 @@
 #define KVM_HC_CLOCK_PAIRING		9
 #define KVM_HC_SEND_IPI		10
 #define KVM_HC_SCHED_YIELD		11
+#define KVM_HC_PAGE_ENC_STATUS		12

 /*
  * hypercalls use architecture specific
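(For context, a hypothetical userspace counterpart to the patch above;
track_enc_status() is a made-up VMM helper and error handling is
elided:)

	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_EXIT_HYPERCALL,
		.args[0] = 1 << KVM_HC_PAGE_ENC_STATUS,
	};

	/* must be a subset of KVM_CHECK_EXTENSION(KVM_CAP_EXIT_HYPERCALL) */
	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

	/* ... later, in the vCPU run loop ... */
	if (run->exit_reason == KVM_EXIT_HYPERCALL &&
	    run->hypercall.nr == KVM_HC_PAGE_ENC_STATUS) {
		track_enc_status(run->hypercall.args[0],  /* gpa    */
				 run->hypercall.args[1],  /* npages */
				 run->hypercall.args[2]); /* enc    */
		run->hypercall.ret = 0;	/* returned in the guest's RAX */
	}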
On 5/12/21 10:51 AM, Sean Christopherson wrote:
> On Wed, May 12, 2021, Borislav Petkov wrote:
>> On Fri, Apr 23, 2021 at 03:58:43PM +0000, Ashish Kalra wrote:
>>> +static inline void notify_page_enc_status_changed(unsigned long pfn,
>>> +						      int npages, bool enc)
>>> +{
>>> +	PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc);
>>> +}
>>
>> Now the question is whether something like that is needed for TDX, and,
>> if so, could it be shared by both.
>
> Yes, TDX needs this same hook, but "can't" reuse the hypercall verbatim.
> Ditto for SEV-SNP. I wanted to squish everything into a single common
> hypercall, but that didn't pan out.
>
> The problem is that both TDX and SNP define their own versions of this
> so that any guest kernel that complies with the TDX|SNP specification
> will run cleanly on a hypervisor that also complies with the spec. This
> KVM-specific hook doesn't meet those requirements because non-Linux
> guest support will be sketchy at best, and non-KVM hypervisor support
> will be non-existent.
>
> The best we can do, short of refusing to support TDX or SNP, is to make
> this KVM-specific hypercall compatible with TDX and SNP so that the bulk
> of the control logic is identical. The mechanics of actually invoking
> the hypercall will differ, but if done right, everything else should be
> reusable without modification.
>
> I had an in-depth analysis of this, but it was all off-list. Pasted below.
>
> TDX uses GPRs to communicate with the host, so it can tunnel "legacy"
> hypercalls from time zero. SNP could technically do the same (with a
> revised GHCB spec), but it'd be butt ugly. And of course trying to go
> that route for either TDX or SNP would run into the problem of having to
> coordinate the ABI for the "legacy" hypercall across all guests and
> hosts. So yeah, trying to remove any of the three (KVM vs. SNP vs. TDX)
> interfaces is sadly just wishful thinking.
>
> That being said, I do think we can reuse the KVM-specific hypercall for
> TDX and SNP. Both will still need a {TDX,SNP}-specific GHC{I,B} protocol
> so that cross-vendor compatibility is guaranteed, but that shouldn't
> preclude a guest that is KVM enlightened from switching to the
> KVM-specific hypercall once it can do so. More thoughts later on.
>
>> I guess a common structure could be used along the lines of what is in the
>> GHCB spec today, but that seems like overkill for SEV/SEV-ES, which will
>> only ever really do a single page range at a time (through
>> set_memory_encrypted() and set_memory_decrypted()). The reason for the
>> expanded form for SEV-SNP is that the OS can (proactively) adjust multiple
>> page ranges in advance. Will TDX need to do something similar?
>
> Yes, TDX needs the exact same thing. All three (SEV, SNP, and TDX) have
> more or less the exact same hook in the guest (Linux, obviously) kernel.
>
>> If so, the only real common piece in KVM is a function to track what pages
>> are shared vs private, which would only require a simple interface.
>
> It's not just KVM, it's also the relevant code in the guest kernel(s)
> and other hypervisors. And the MMU side of KVM will likely be able to
> share code, e.g. to act on the page size hint.
>
>> So for SEV/SEV-ES, a simpler hypercall interface to specify a single page
>> range is really all that is needed, but it must be common across
>> hypervisors. I think that was one of Sean's points originally, we don't
>> want one hypercall for KVM, one for Hyper-V, one for VMware, one for Xen,
>> etc.
>
> For the KVM-defined interface (required for SEV/SEV-ES), I think it
> makes sense to make it a superset of the SNP and TDX protocols so that
> it _can_ be used in lieu of the SNP/TDX-specific protocol. I don't know
> for sure whether or not that will actually yield better code and/or
> performance, but it costs us almost nothing and at least gives us the
> option of further optimizing the Linux+KVM combination.

Right, for SEV-SNP, as long as we know that the KVM interface is
available, it could be used. But we would have to fall back to the GHCB
specification if it could not be determined.

> It probably shouldn't be a strict superset, as in practice I don't think
> the SNP approach of having individual entries when batching multiple
> pages will yield the best performance. E.g. the vast majority (maybe
> all?) of conversions for a Linux guest will be physically contiguous and
> will have the same preferred page size, at which point there will be
> less overhead if the guest specifies a massive range as opposed to
> having to sanitize and fill a large buffer.

Originally, the plan was to use ranges, but based on feedback during the
GHCB spec review it was updated to its current definition. A concern
from other hypervisor vendors was to be able to return back to the
guest in the middle of the hypercall, e.g. to deliver an interrupt or
such, and then allow the guest to resume the hypercall where it left
off. Not sure if you would want to build that into this hypercall,
since depending on the range, the operation could take a while.

Thanks,
Tom

> TL;DR: I think the KVM hypercall should be something like this, so that
> it can be used for SNP and TDX, and possibly for other purposes, e.g.
> for paravirt performance enhancements or something.
>
> 8. KVM_HC_MAP_GPA_RANGE
> -----------------------
> :Architecture: x86
> :Status: active
> :Purpose: Request KVM to map a GPA range with the specified attributes.
>
> a0: the guest physical address of the start page
> a1: the number of (4kb) pages (must be contiguous in GPA space)
> a2: attributes
>
> where 'attributes' could be something like:
>
> bits  3:0 - preferred page size encoding 0 = 4kb, 1 = 2mb, 2 = 1gb, etc...
> bit     4 - plaintext = 0, encrypted = 1
> bits 63:5 - reserved (must be zero)
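(One hypothetical way to bound per-call latency without spec-level
resumption support is for the guest to split a large conversion into
fixed-size chunks, so the host can deliver interrupts between
hypercalls; the chunk size below is arbitrary:)

#define ENC_STATUS_CHUNK_PAGES	512UL	/* arbitrary */

static void notify_enc_status_chunked(unsigned long gpa,
				      unsigned long npages, bool enc)
{
	while (npages) {
		unsigned long n = min(npages, ENC_STATUS_CHUNK_PAGES);

		kvm_hypercall3(KVM_HC_PAGE_ENC_STATUS, gpa, n, enc);
		gpa += n << PAGE_SHIFT;
		npages -= n;
	}
}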
On Thu, May 13, 2021 at 04:34:41AM +0000, Ashish Kalra wrote:
> What's the use of notifying about a partial page list? Even a single
> incorrect guest page encryption status can crash the guest/migrated
> guest.

Ok, so explain to me how this looks from the user standpoint: she starts
migrating the guest, it fails to look up an address, there's nothing
saying where it failed but the guest crashed.

Do you think this is user-friendly?
On 14/05/21 09:33, Borislav Petkov wrote:
> Ok, so explain to me how this looks from the user standpoint: she starts
> migrating the guest, it fails to look up an address, there's nothing
> saying where it failed but the guest crashed.
>
> Do you think this is user-friendly?

Ok, so explain to me how this looks from the submitter standpoint: he
reads your review of his patch, he acknowledges your point with "Yes, it
makes sense to signal it with a WARN or so", and still is treated as
shit.

Do you think this is friendly?

Paolo
Hello Boris, Paolo,

On Fri, May 14, 2021 at 10:03:18AM +0200, Paolo Bonzini wrote:
> On 14/05/21 09:33, Borislav Petkov wrote:
> > Ok, so explain to me how this looks from the user standpoint: she starts
> > migrating the guest, it fails to look up an address, there's nothing
> > saying where it failed but the guest crashed.
> >
> > Do you think this is user-friendly?
>
> Ok, so explain to me how this looks from the submitter standpoint: he
> reads your review of his patch, he acknowledges your point with "Yes, it
> makes sense to signal it with a WARN or so", and still is treated as
> shit.
>
> Do you think this is friendly?

I absolutely agree with both of your points of view. But what's the
alternative?

Ideally we should fail/stop migration even if a single guest page
encryption status cannot be notified, and that should be the way to
proceed in this case: the guest kernel should notify the source
userspace VMM to block/stop migration.

From a practical side, I do see Qemu's migrate_add_blocker() interface,
but that looks to be a static interface and I also don't think it will
force-stop an ongoing migration. Is there an existing mechanism to
inform the userspace VMM from the kernel about blocking/stopping
migration?

Thanks,
Ashish
On Fri, May 14, 2021 at 10:03:18AM +0200, Paolo Bonzini wrote:
> Ok, so explain to me how this looks from the submitter standpoint: he
> reads your review of his patch, he acknowledges your point with "Yes, it
> makes sense to signal it with a WARN or so", and still is treated as
> shit.

How is me asking about the user experience of it all, treating him like
shit?!

How should I have asked this so that it is not making you think I'm
treating him like shit?

Because treating someone like shit is not in my goals.
On Fri, May 14, 2021 at 11:24:03AM +0200, Borislav Petkov wrote:
> On Fri, May 14, 2021 at 10:03:18AM +0200, Paolo Bonzini wrote:
> > Ok, so explain to me how this looks from the submitter standpoint: he
> > reads your review of his patch, he acknowledges your point with "Yes,
> > it makes sense to signal it with a WARN or so", and still is treated
> > as shit.
>
> How is me asking about the user experience of it all, treating him like
> shit?!
>
> How should I have asked this so that it is not making you think I'm
> treating him like shit?
>
> Because treating someone like shit is not in my goals.

As I mentioned in my previous reply, this has to be treated as a fatal
error from the user point of view, and the kernel needs to inform the
userspace VMM to block/stop migration in response to this failure.

Thanks,
Ashish
On Fri, May 14, 2021 at 09:05:23AM +0000, Ashish Kalra wrote:
> Ideally we should fail/stop migration even if a single guest page
> encryption status cannot be notified, and that should be the way to
> proceed in this case: the guest kernel should notify the source
> userspace VMM to block/stop migration.

Yap, and what I'm trying to point out here is that if the low level
machinery fails for whatever reason and it cannot recover, we should
propagate that error up the chain so that the user is aware as to why
it failed.

WARN is a good first start but in some configurations those splats
don't even get shown as people don't look at dmesg, etc.

And I think it is very important to propagate those errors properly
because there's a *lot* of moving parts involved in a guest migration
and you have encrypted memory which makes debugging this probably even
harder, etc, etc.

I hope this makes more sense.

> From a practical side, I do see Qemu's migrate_add_blocker() interface,
> but that looks to be a static interface and I also don't think it will
> force-stop an ongoing migration. Is there an existing mechanism to
> inform the userspace VMM from the kernel about blocking/stopping
> migration?

Hmm, so __set_memory_enc_dec() which calls
notify_addr_enc_status_changed() is called by the guest, right, when it
starts migrating.

Can an error value from it be propagated up the callchain so it can be
turned into an error message for the guest owner to see?

(I might be way off base here as I have no clue how the whole migration
machinery is kicked into gear...)

Thx.
On 14/05/21 11:05, Ashish Kalra wrote:
> I absolutely agree with both of your points of view. But what's the
> alternative?
>
> Ideally we should fail/stop migration even if a single guest page
> encryption status cannot be notified, and that should be the way to
> proceed in this case: the guest kernel should notify the source
> userspace VMM to block/stop migration.
>
> From a practical side, I do see Qemu's migrate_add_blocker() interface,
> but that looks to be a static interface and I also don't think it will
> force-stop an ongoing migration. Is there an existing mechanism to
> inform the userspace VMM from the kernel about blocking/stopping
> migration?

On the Linux side, all you need to do is WARN and write 0 to the
MIGRATION_CONTROL MSR. QEMU can check the MSR value when migrating the
CPU registers at the end, and fail migration if the MSR value is 0.

Paolo
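(A sketch of that failure path on the guest side, assuming the MSR and
feature bit defined in the host-side patch earlier in the thread:)

	kpte = lookup_address(vaddr, &level);
	if (!kpte || pte_none(*kpte)) {
		WARN_ONCE(1, "failed to look up va %#lx\n", vaddr);
		/* tell the host that page status tracking is unreliable */
		if (kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL))
			wrmsrl(MSR_KVM_MIGRATION_CONTROL, 0);
		return;
	}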
On Fri, May 14, 2021 at 11:34:32AM +0200, Borislav Petkov wrote:
> On Fri, May 14, 2021 at 09:05:23AM +0000, Ashish Kalra wrote:
> > Ideally we should fail/stop migration even if a single guest page
> > encryption status cannot be notified, and that should be the way to
> > proceed in this case: the guest kernel should notify the source
> > userspace VMM to block/stop migration.
>
> Yap, and what I'm trying to point out here is that if the low level
> machinery fails for whatever reason and it cannot recover, we should
> propagate that error up the chain so that the user is aware as to why
> it failed.

I totally agree.

> WARN is a good first start but in some configurations those splats
> don't even get shown as people don't look at dmesg, etc.
>
> And I think it is very important to propagate those errors properly
> because there's a *lot* of moving parts involved in a guest migration
> and you have encrypted memory which makes debugging this probably even
> harder, etc, etc.
>
> I hope this makes more sense.
>
> > From a practical side, I do see Qemu's migrate_add_blocker() interface,
> > but that looks to be a static interface and I also don't think it will
> > force-stop an ongoing migration. Is there an existing mechanism to
> > inform the userspace VMM from the kernel about blocking/stopping
> > migration?
>
> Hmm, so __set_memory_enc_dec() which calls
> notify_addr_enc_status_changed() is called by the guest, right, when it
> starts migrating.

No, actually notify_addr_enc_status_changed() is called whenever a range
of memory is marked as encrypted or decrypted, so it has nothing to do
with migration as such.

This is basically modifying the encryption attributes on the page tables
and correspondingly also making the hypercall to inform the hypervisor
about page encryption status changes. The hypervisor will use this
information during an ongoing or future migration, so this information
is maintained even though migration might never be initiated here.

> Can an error value from it be propagated up the callchain so it can be
> turned into an error message for the guest owner to see?

The error value cannot be propagated up the callchain directly here,
but one possibility is to leverage the hypercall and use Sean's proposed
hypercall interface to notify the host/hypervisor to block/stop any
future/ongoing migration.

Or, as in Paolo's response, writing 0 to the MIGRATION_CONTROL MSR seems
more ideal.

Thanks,
Ashish
On Fri, May 14, 2021 at 10:05:19AM +0000, Ashish Kalra wrote:
> No, actually notify_addr_enc_status_changed() is called whenever a range
> of memory is marked as encrypted or decrypted, so it has nothing to do
> with migration as such.
>
> This is basically modifying the encryption attributes on the page tables
> and correspondingly also making the hypercall to inform the hypervisor
> about page encryption status changes. The hypervisor will use this
> information during an ongoing or future migration, so this information
> is maintained even though migration might never be initiated here.

Doh, of course. This doesn't make it easier.

> The error value cannot be propagated up the callchain directly here,

Yeah, my thinking was way wrong here - sorry about that.

> but one possibility is to leverage the hypercall and use Sean's proposed
> hypercall interface to notify the host/hypervisor to block/stop any
> future/ongoing migration.
>
> Or, as in Paolo's response, writing 0 to the MIGRATION_CONTROL MSR seems
> more ideal.

Ok.

So to sum up: notify_addr_enc_status_changed() should warn, but not
because of migration; rather, regardless of migration, we should tell
the users when updating page encryption attributes fails, as that is
potentially hinting at a bigger problem, so we better make sufficient
noise here.

Thx.
On Fri, May 14, 2021 at 3:05 AM Ashish Kalra <ashish.kalra@amd.com> wrote:
>
> On Fri, May 14, 2021 at 11:34:32AM +0200, Borislav Petkov wrote:
> > On Fri, May 14, 2021 at 09:05:23AM +0000, Ashish Kalra wrote:
> > > Ideally we should fail/stop migration even if a single guest page
> > > encryption status cannot be notified, and that should be the way to
> > > proceed in this case: the guest kernel should notify the source
> > > userspace VMM to block/stop migration.
> >
> > Yap, and what I'm trying to point out here is that if the low level
> > machinery fails for whatever reason and it cannot recover, we should
> > propagate that error up the chain so that the user is aware as to why
> > it failed.
>
> I totally agree.
>
> > WARN is a good first start but in some configurations those splats
> > don't even get shown as people don't look at dmesg, etc.
> >
> > And I think it is very important to propagate those errors properly
> > because there's a *lot* of moving parts involved in a guest migration
> > and you have encrypted memory which makes debugging this probably even
> > harder, etc, etc.
> >
> > I hope this makes more sense.
> >
> > > From a practical side, I do see Qemu's migrate_add_blocker()
> > > interface, but that looks to be a static interface and I also don't
> > > think it will force-stop an ongoing migration. Is there an existing
> > > mechanism to inform the userspace VMM from the kernel about
> > > blocking/stopping migration?
> >
> > Hmm, so __set_memory_enc_dec() which calls
> > notify_addr_enc_status_changed() is called by the guest, right, when
> > it starts migrating.
>
> No, actually notify_addr_enc_status_changed() is called whenever a range
> of memory is marked as encrypted or decrypted, so it has nothing to do
> with migration as such.
>
> This is basically modifying the encryption attributes on the page tables
> and correspondingly also making the hypercall to inform the hypervisor
> about page encryption status changes. The hypervisor will use this
> information during an ongoing or future migration, so this information
> is maintained even though migration might never be initiated here.
>
> > Can an error value from it be propagated up the callchain so it can be
> > turned into an error message for the guest owner to see?
>
> The error value cannot be propagated up the callchain directly here,
> but one possibility is to leverage the hypercall and use Sean's proposed
> hypercall interface to notify the host/hypervisor to block/stop any
> future/ongoing migration.
>
> Or, as in Paolo's response, writing 0 to the MIGRATION_CONTROL MSR seems
> more ideal.
>
> Thanks,
> Ashish

How realistic is this type of failure? If you've gotten this deep, it
seems like something has gone very wrong if the memory you are about
to mark as shared (or encrypted) doesn't exist and isn't mapped. In
particular, is the kernel going to page fault when it tries to
reinitialize the page it's currently changing the C-bit of? From what
I recall, most paths that do either set_memory_encrypted or
set_memory_decrypted memset the region being toggled. Note: dma_pool
doesn't immediately memset, but the VA it's providing to set_decrypted
is freshly fetched from a recently allocated region (something has to
have gone pretty wrong if this is invalid, if I'm not mistaken; no one
would think twice if you wrote to that freshly allocated page).

The reason I mention this is that SEV migration is going to be the
least of your concerns if you are already on a one-way train towards a
kernel oops. I'm not certain I would go so far as to say this should
BUG() instead (I think the page fault on access might be easier to
debug than a BUG here), but I'm pretty skeptical that the kernel is
going to do too well if it doesn't know if its kernel VAs are valid.

If, despite the above, we expect to infrequently-but-not-never disable
migration with no intention of reenabling it, we should signal it
differently than we currently signal migration enablement. Currently,
if you toggle migration from on to off there is an implication that
you are about to reboot, and you are only ephemerally unable to
migrate. Having permanent disablement be indistinguishable from a
really long reboot is a recipe for a really sad long timeout in
userspace.
On Mon, May 17, 2021 at 07:01:09PM -0700, Steve Rutherford wrote:
> On Fri, May 14, 2021 at 3:05 AM Ashish Kalra <ashish.kalra@amd.com> wrote:
> >
> > On Fri, May 14, 2021 at 11:34:32AM +0200, Borislav Petkov wrote:
> > > On Fri, May 14, 2021 at 09:05:23AM +0000, Ashish Kalra wrote:
> > > > Ideally we should fail/stop migration even if a single guest page
> > > > encryption status cannot be notified, and that should be the way
> > > > to proceed in this case: the guest kernel should notify the source
> > > > userspace VMM to block/stop migration.
> > >
> > > Yap, and what I'm trying to point out here is that if the low level
> > > machinery fails for whatever reason and it cannot recover, we should
> > > propagate that error up the chain so that the user is aware as to
> > > why it failed.
> >
> > I totally agree.
> >
> > > WARN is a good first start but in some configurations those splats
> > > don't even get shown as people don't look at dmesg, etc.
> > >
> > > And I think it is very important to propagate those errors properly
> > > because there's a *lot* of moving parts involved in a guest
> > > migration and you have encrypted memory which makes debugging this
> > > probably even harder, etc, etc.
> > >
> > > I hope this makes more sense.
> > >
> > > > From a practical side, I do see Qemu's migrate_add_blocker()
> > > > interface, but that looks to be a static interface and I also
> > > > don't think it will force-stop an ongoing migration. Is there an
> > > > existing mechanism to inform the userspace VMM from the kernel
> > > > about blocking/stopping migration?
> > >
> > > Hmm, so __set_memory_enc_dec() which calls
> > > notify_addr_enc_status_changed() is called by the guest, right,
> > > when it starts migrating.
> >
> > No, actually notify_addr_enc_status_changed() is called whenever a
> > range of memory is marked as encrypted or decrypted, so it has nothing
> > to do with migration as such.
> >
> > This is basically modifying the encryption attributes on the page
> > tables and correspondingly also making the hypercall to inform the
> > hypervisor about page encryption status changes. The hypervisor will
> > use this information during an ongoing or future migration, so this
> > information is maintained even though migration might never be
> > initiated here.
> >
> > > Can an error value from it be propagated up the callchain so it can
> > > be turned into an error message for the guest owner to see?
> >
> > The error value cannot be propagated up the callchain directly here,
> > but one possibility is to leverage the hypercall and use Sean's
> > proposed hypercall interface to notify the host/hypervisor to
> > block/stop any future/ongoing migration.
> >
> > Or, as in Paolo's response, writing 0 to the MIGRATION_CONTROL MSR
> > seems more ideal.
> >
> > Thanks,
> > Ashish
>
> How realistic is this type of failure? If you've gotten this deep, it
> seems like something has gone very wrong if the memory you are about
> to mark as shared (or encrypted) doesn't exist and isn't mapped. In
> particular, is the kernel going to page fault when it tries to
> reinitialize the page it's currently changing the C-bit of? From what
> I recall, most paths that do either set_memory_encrypted or
> set_memory_decrypted memset the region being toggled. Note: dma_pool
> doesn't immediately memset, but the VA it's providing to set_decrypted
> is freshly fetched from a recently allocated region (something has to
> have gone pretty wrong if this is invalid, if I'm not mistaken; no one
> would think twice if you wrote to that freshly allocated page).
>
> The reason I mention this is that SEV migration is going to be the
> least of your concerns if you are already on a one-way train towards a
> kernel oops. I'm not certain I would go so far as to say this should
> BUG() instead (I think the page fault on access might be easier to
> debug than a BUG here), but I'm pretty skeptical that the kernel is
> going to do too well if it doesn't know if its kernel VAs are valid.
>
> If, despite the above, we expect to infrequently-but-not-never disable
> migration with no intention of reenabling it, we should signal it
> differently than we currently signal migration enablement. Currently,
> if you toggle migration from on to off there is an implication that
> you are about to reboot, and you are only ephemerally unable to
> migrate. Having permanent disablement be indistinguishable from a
> really long reboot is a recipe for a really sad long timeout in
> userspace.

Also looking at set_memory_encrypted() and set_memory_decrypted() usage
patterns, the persistent decrypted regions like the DMA pool,
bss_decrypted, etc., will be set up at boot time. Later, the unused
bss_decrypted section will be freed and set_memory_encrypted() called
on the freed memory at the end of kernel boot.

Most of the runtime set_memory_encrypted() and set_memory_decrypted()
calls will be on dynamically allocated DMA buffers via
dma_alloc_coherent() and dma_free_coherent().

For example, a dma_alloc_coherent() request will allocate pages and
then call set_memory_decrypted() against them. When dma_free_coherent()
is called, set_memory_encrypted() is called against the pages about to
be freed, before they are actually freed.

Now these buffers have a very short life and are only used for
immediate I/O and then freed, so they may not be of major concern for
SEV migration?

So disabling migration for failure of address lookup or mapping
failures on such pages will really be overkill. That might be in favor
of Steve's thoughts above of doing a BUG() here instead.

Thanks,
Ashish
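(For reference, the runtime pattern being described, roughly what a
driver does; the encryption-attribute conversion happens inside the DMA
API, not in the driver itself:)

	void *buf;
	dma_addr_t dma_handle;

	/* calls set_memory_decrypted() on the new pages under SEV */
	buf = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);

	/* ... short-lived device I/O through buf/dma_handle ... */

	/* calls set_memory_encrypted() before the pages are freed */
	dma_free_coherent(dev, size, buf, dma_handle);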
On 19/05/21 14:06, Ashish Kalra wrote:
> Now these buffers have a very short life and are only used for
> immediate I/O and then freed, so they may not be of major concern for
> SEV migration?

Well, they are a concern because they do break migration. But it may
be indeed good enough to just have a WARN ("bad things may happen and
you get to keep both pieces") and not disable future migration.

A BUG must always be avoided unless you're sure that something *worse*
will happen in the future, e.g. a BUG is acceptable if you have
detected a use-after-free or a dangling pointer. This is not the case.

Paolo

> So disabling migration for failure of address lookup or mapping
> failures on such pages will really be overkill. That might be in favor
> of Steve's thoughts above of doing a BUG() here instead.
On Wed, May 12, 2021, at 6:15 AM, Borislav Petkov wrote:
> On Fri, Apr 23, 2021 at 03:58:43PM +0000, Ashish Kalra wrote:
> > +static inline void notify_page_enc_status_changed(unsigned long pfn,
> > +						     int npages, bool enc)
> > +{
> > +	PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc);
> > +}
>
> Now the question is whether something like that is needed for TDX, and,
> if so, could it be shared by both.

The TDX MapGPA call can fail, and presumably it will fail if the page
is not sufficiently quiescent from the host's perspective.

It seems like a mistake to me to have a KVM-specific hypercall for this
that cannot cleanly fail.
On Wed, May 19, 2021, Andy Lutomirski wrote:
> On Wed, May 12, 2021, at 6:15 AM, Borislav Petkov wrote:
> > On Fri, Apr 23, 2021 at 03:58:43PM +0000, Ashish Kalra wrote:
> > > +static inline void notify_page_enc_status_changed(unsigned long pfn,
> > > +						       int npages, bool enc)
> > > +{
> > > +	PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc);
> > > +}
> >
> > Now the question is whether something like that is needed for TDX, and,
> > if so, could it be shared by both.
>
> The TDX MapGPA call can fail, and presumably it will fail if the page
> is not sufficiently quiescent from the host's perspective.

Barring a guest bug, e.g. requesting a completely non-existent page,
MapGPA shouldn't fail. The example in the GHCI:

  Invalid operand - for example, the GPA may be already mapped as a
  shared page.

makes no sense to me. An already-mapped page would be an -EBUSY style
error, not an invalid operand, and IIRC, I explicitly lobbied against
allowing the VMM to return "try again" precisely because it's
impossible for the guest to handle in a sane manner. If the physical
page is in a state that requires stalling the vCPU, then the VMM is
supposed to do exactly that, not punt the problem to the guest.

Maybe we should get stronger language into the GHCI?

> It seems like a mistake to me to have a KVM-specific hypercall for this
> that cannot cleanly fail.
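(An illustration of what "cleanly fail" could look like for the KVM
hypercall: a hypothetical guest-side wrapper that surfaces the host's
answer instead of assuming success; not part of the posted series:)

static int kvm_notify_page_enc_status(unsigned long gpa,
				      unsigned long npages, bool enc)
{
	long ret = kvm_hypercall3(KVM_HC_PAGE_ENC_STATUS,
				  gpa, npages, enc);

	/* e.g. -KVM_ENOSYS if userspace never enabled the exit */
	if (WARN_ONCE(ret, "page enc status hypercall failed: %ld\n", ret))
		return -EIO;

	return 0;
}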
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 4abf110e2243..c0ef13e97f5a 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -84,6 +84,12 @@ static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
 	PVOP_VCALL1(mmu.exit_mmap, mm);
 }

+static inline void notify_page_enc_status_changed(unsigned long pfn,
+						  int npages, bool enc)
+{
+	PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc);
+}
+
 #ifdef CONFIG_PARAVIRT_XXL
 static inline void load_sp0(unsigned long sp0)
 {
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index de87087d3bde..7245d08f5500 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -195,6 +195,8 @@ struct pv_mmu_ops {

 	/* Hook for intercepting the destruction of an mm_struct. */
 	void (*exit_mmap)(struct mm_struct *mm);
+	void (*notify_page_enc_status_changed)(unsigned long pfn, int npages,
+					       bool enc);

 #ifdef CONFIG_PARAVIRT_XXL
 	struct paravirt_callee_save read_cr2;
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 4352f08bfbb5..ed9cfe062634 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -83,6 +83,8 @@ int set_pages_rw(struct page *page, int numpages);
 int set_direct_map_invalid_noflush(struct page *page);
 int set_direct_map_default_noflush(struct page *page);
 bool kernel_page_present(struct page *page);
+void notify_addr_enc_status_changed(unsigned long vaddr, int npages,
+				    bool enc);

 extern int kernel_set_to_readonly;

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index c60222ab8ab9..192230247ad7 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -335,6 +335,7 @@ struct paravirt_patch_template pv_ops = {
 			(void (*)(struct mmu_gather *, void *))tlb_remove_page,

 	.mmu.exit_mmap = paravirt_nop,
+	.mmu.notify_page_enc_status_changed = paravirt_nop,

 #ifdef CONFIG_PARAVIRT_XXL
 	.mmu.read_cr2 = __PV_IS_CALLEE_SAVE(native_read_cr2),
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 4b01f7dbaf30..e4b94099645b 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -229,29 +229,74 @@ void __init sev_setup_arch(void)
 	swiotlb_adjust_size(size);
 }

-static void __init __set_clr_pte_enc(pte_t *kpte, int level, bool enc)
+static unsigned long pg_level_to_pfn(int level, pte_t *kpte, pgprot_t *ret_prot)
 {
-	pgprot_t old_prot, new_prot;
-	unsigned long pfn, pa, size;
-	pte_t new_pte;
+	unsigned long pfn = 0;
+	pgprot_t prot;

 	switch (level) {
 	case PG_LEVEL_4K:
 		pfn = pte_pfn(*kpte);
-		old_prot = pte_pgprot(*kpte);
+		prot = pte_pgprot(*kpte);
 		break;
 	case PG_LEVEL_2M:
 		pfn = pmd_pfn(*(pmd_t *)kpte);
-		old_prot = pmd_pgprot(*(pmd_t *)kpte);
+		prot = pmd_pgprot(*(pmd_t *)kpte);
 		break;
 	case PG_LEVEL_1G:
 		pfn = pud_pfn(*(pud_t *)kpte);
-		old_prot = pud_pgprot(*(pud_t *)kpte);
+		prot = pud_pgprot(*(pud_t *)kpte);
 		break;
 	default:
-		return;
+		return 0;
 	}

+	if (ret_prot)
+		*ret_prot = prot;
+
+	return pfn;
+}
+
+void notify_addr_enc_status_changed(unsigned long vaddr, int npages,
+				    bool enc)
+{
+#ifdef CONFIG_PARAVIRT
+	unsigned long sz = npages << PAGE_SHIFT;
+	unsigned long vaddr_end = vaddr + sz;
+
+	while (vaddr < vaddr_end) {
+		int psize, pmask, level;
+		unsigned long pfn;
+		pte_t *kpte;
+
+		kpte = lookup_address(vaddr, &level);
+		if (!kpte || pte_none(*kpte))
+			return;
+
+		pfn = pg_level_to_pfn(level, kpte, NULL);
+		if (!pfn)
+			continue;
+
+		psize = page_level_size(level);
+		pmask = page_level_mask(level);
+
+		notify_page_enc_status_changed(pfn, psize >> PAGE_SHIFT, enc);
+
+		vaddr = (vaddr & pmask) + psize;
+	}
+#endif
+}
+
+static void __init __set_clr_pte_enc(pte_t *kpte, int level, bool enc)
+{
+	pgprot_t old_prot, new_prot;
+	unsigned long pfn, pa, size;
+	pte_t new_pte;
+
+	pfn = pg_level_to_pfn(level, kpte, &old_prot);
+	if (!pfn)
+		return;
+
 	new_prot = old_prot;
 	if (enc)
 		pgprot_val(new_prot) |= _PAGE_ENC;
@@ -286,12 +331,13 @@ static void __init __set_clr_pte_enc(pte_t *kpte, int level, bool enc)
 static int __init early_set_memory_enc_dec(unsigned long vaddr,
 					   unsigned long size, bool enc)
 {
-	unsigned long vaddr_end, vaddr_next;
+	unsigned long vaddr_end, vaddr_next, start;
 	unsigned long psize, pmask;
 	int split_page_size_mask;
 	int level, ret;
 	pte_t *kpte;

+	start = vaddr;
 	vaddr_next = vaddr;
 	vaddr_end = vaddr + size;

@@ -346,6 +392,8 @@ static int __init early_set_memory_enc_dec(unsigned long vaddr,

 	ret = 0;

+	notify_addr_enc_status_changed(start, PAGE_ALIGN(size) >> PAGE_SHIFT,
+				       enc);
 out:
 	__flush_tlb_all();
 	return ret;
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..45e65517405a 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2012,6 +2012,13 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	 */
 	cpa_flush(&cpa, 0);

+	/*
+	 * Notify hypervisor that a given memory range is mapped encrypted
+	 * or decrypted. The hypervisor will use this information during the
+	 * VM migration.
+	 */
+	notify_addr_enc_status_changed(addr, numpages, enc);
+
 	return ret;
 }
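(To illustrate the call path the patch wires up, a hypothetical caller
is shown below, with the resulting chain traced in comments:)

	unsigned long va = (unsigned long)page_address(page);

	set_memory_decrypted(va, 1);
	/*
	 * -> __set_memory_enc_dec(va, 1, false)
	 *    -> cpa_flush(...)
	 *    -> notify_addr_enc_status_changed(va, 1, false)
	 *       -> notify_page_enc_status_changed(pfn, 1, false)
	 *          -> pv_ops.mmu.notify_page_enc_status_changed (hypercall)
	 */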