Message ID: 20210824005248.200037-1-seanjc@google.com (mailing list archive)
State: New
Series: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
On Mon, Aug 23, 2021 at 05:52:48PM -0700, Sean Christopherson wrote: Thanks a lot for sharing these ideas. Lots of questions are inlined below. :) > The goal of this RFC is to try and align KVM, mm, and anyone else with skin in the > game, on an acceptable direction for supporting guest private memory, e.g. for > Intel's TDX. The TDX architectural effectively allows KVM guests to crash the > host if guest private memory is accessible to host userspace, and thus does not What about incorrect/malicious accesses from host kernel? Should the direct mapping also be removed for guest private memory? > play nice with KVM's existing approach of pulling the pfn and mapping level from > the host page tables. > > This is by no means a complete patch; it's a rough sketch of the KVM changes that > would be needed. The kernel side of things is completely omitted from the patch; > the design concept is below. > > There's also fair bit of hand waving on implementation details that shouldn't > fundamentally change the overall ABI, e.g. how the backing store will ensure > there are no mappings when "converting" to guest private. > > Background > ========== > > This is a loose continuation of Kirill's RFC[*] to support TDX guest private > memory by tracking guest memory at the 'struct page' level. This proposal is the > result of several offline discussions that were prompted by Andy Lutomirksi's > concerns with tracking via 'struct page': > > 1. The kernel wouldn't easily be able to enforce a 1:1 page:guest association, > let alone a 1:1 pfn:gfn mapping. May I ask why? Doesn't FOLL_GUEST in Kirill's earlier patch work? Or just because traversing the host PT to get a PFN(for a PageGuest(page)) is too heavy? > > 2. Does not work for memory that isn't backed by 'struct page', e.g. if devices > gain support for exposing encrypted memory regions to guests. Do you mean that a page not backed by 'struct page' might be mapped to other user space? I thought the VM_GUEST flags for the VMA could prevent that(though I may possiblely be wrong). Could you explain more? Thanks! > > 3. Does not help march toward page migration or swap support (though it doesn't > hurt either). > > [*] https://lkml.kernel.org/r/20210416154106.23721-1-kirill.shutemov@linux.intel.com > > Concept > ======= > > Guest private memory must be backed by an "enlightened" file descriptor, where > "enlightened" means the implementing subsystem supports a one-way "conversion" to > guest private memory and provides bi-directional hooks to communicate directly > with KVM. Creating a private fd doesn't necessarily have to be a conversion, e.g. it > could also be a flag provided at file creation, a property of the file system itself, > etc... > > Before a private fd can be mapped into a KVM guest, it must be paired 1:1 with a > KVM guest, i.e. multiple guests cannot share a fd. At pairing, KVM and the fd's > subsystem exchange a set of function pointers to allow KVM to call into the subsystem, > e.g. to translate gfn->pfn, and vice versa to allow the subsystem to call into KVM, > e.g. to invalidate/move/swap a gfn range. So the gfn->pfn translation is done by the fd's subsystem? Again, could you please elaborate how? And each private memory region would need a seperate group of callbacks? > > Mapping a private fd in host userspace is disallowed, i.e. there is never a host > virtual address associated with the fd and thus no userspace page tables pointing > at the private memory. > > Pinning _from KVM_ is not required. 
If the backing store supports page migration > and/or swap, it can query the KVM-provided function pointers to see if KVM supports > the operation. If the operation is not supported (this will be the case initially > in KVM), the backing store is responsible for ensuring correct functionality. > > Unmapping guest memory, e.g. to prevent use-after-free, is handled via a callback > from the backing store to KVM. KVM will employ techniques similar to those it uses > for mmu_notifiers to ensure the guest cannot access freed memory. > > A key point is that, unlike similar failed proposals of the past, e.g. /dev/mktme, > existing backing stores can be englightened, a from-scratch implementations is not > required (though would obviously be possible as well). > > One idea for extending existing backing stores, e.g. HugeTLBFS and tmpfs, is > to add F_SEAL_GUEST, which would convert the entire file to guest private memory > and either fail if the current size is non-zero or truncate the size to zero. Have you discussed memfd_secret(if host direct mapping is also to be removed)? And how does this F_SEAL_GUEST work? > > KVM > === > > Guest private memory is managed as a new address space, i.e. as a different set of > memslots, similar to how KVM has a separate memory view for when a guest vCPU is > executing in virtual SMM. SMM is mutually exclusive with guest private memory. > > The fd (the actual integer) is provided to KVM when a private memslot is added > via KVM_SET_USER_MEMORY_REGION. This is when the aforementioned pairing occurs. My understanding of KVM_SET_USER_MEMORY_REGION is that, this ioctl is to facilitate the binding of HVA and GPA ranges. But if there's no HVAs for a private region at all, why do we need a memslot for it? Besides to keep track of the private GFN ranges, and provide the callbacks, is there any other reason? Another question is: why do we need a whole new address space, instead of one address space accommodating memslot types? > > By default, KVM memslot lookups will be "shared", only specific touchpoints will > be modified to work with private memslots, e.g. guest page faults. All host > accesses to guest memory, e.g. for emulation, will thus look for shared memory > and naturally fail without attempting copy_to/from_user() if the guest attempts Becasue gfn_to_hva() will fail first? > to coerce KVM into access private memory. Note, avoiding copy_to/from_user() and > friends isn't strictly necessary, it's more of a happy side effect. > > A new KVM exit reason, e.g. KVM_EXIT_MEMORY_ERROR, and data struct in vcpu->run > is added to propagate illegal accesses (see above) and implicit conversions Sorry, illegal accesses from VM? Do you actually mean a KVM page fault caused by private access from VM, which implicitly notifies KVM to mark it as private(e.g. by bouncing to Qemu, which then creates a private memory region and ioctls into KVM)? If the answer is yes, how about naming the exit reason as KVM_EXIT_MEMORY_PRIVATE? Meanwhile, is Qemu also supposed to invoke some system call into host kernel before ioctls into KVM? I'm still confused where the kernel callbacks like the gfn_to_pfn() come from(and how they function)... :) > to userspace (see below). Note, the new exit reason + struct can also be to > support several other feature requests in KVM[1][2]. > > The guest may explicitly or implicity request KVM to map a shared/private variant > of a GFN. 
An explicit map request is done via hypercall (out of scope for this > proposal as both TDX and SNP ABIs define such a hypercall). An implicit map request > is triggered simply by the guest accessing the shared/private variant, which KVM > sees as a guest page fault (EPT violation or #NPF). Ideally only explicit requests > would be supported, but neither TDX nor SNP require this in their guest<->host ABIs. Well, I am wondering, should we assume all guest pages as shared or private by default? I mean, if all guest pages are private when the VM is created, maybe the private memslots can be initialized in VM creation time, and be deleted/splited later(e.g. in response to guest sharing hypercalls)? It may simplify the logic, but may also restrict the VM type(e.g. to be TD guest). > > For implicit or explicit mappings, if a memslot is found that fully covers the > requested range (which is a single gfn for implicit mappings), KVM's normal guest > page fault handling works with minimal modification. > > If a memslot is not found, for explicit mappings, KVM will exit to userspace with > the aforementioned dedicated exit reason. For implict _private_ mappings, KVM will > also immediately exit with the same dedicated reason. For implicit shared mappings, > an additional check is required to differentiate between emulated MMIO and an > implicit private->shared conversion[*]. If there is an existing private memslot > for the gfn, KVM will exit to userspace, otherwise KVM will treat the access as an > emulated MMIO access and handle the page fault accordingly. > > [1] https://lkml.kernel.org/r/YKxJLcg/WomPE422@google.com > [2] https://lkml.kernel.org/r/20200617230052.GB27751@linux.intel.com > > Punching Holes > ============== > > The expected userspace memory model is that mapping requests will be handled as > conversions, e.g. on a shared mapping request, first unmap the private gfn range, > then map the shared gfn range. A new KVM ioctl() will likely be needed to allow > userspace to punch a hole in a memslot, as expressing such an operation isn't > possible with KVM_SET_USER_MEMORY_REGION. While userspace could delete the > memslot, then recreate three new memslots, doing so would be destructive to guest > data as unmapping guest private memory (from the EPT/NPT tables) is destructive > to the data for both TDX and SEV-SNP guests. May I ask why? Thanks! > > Pros (vs. struct page) > ====================== > > Easy to enforce 1:1 fd:guest pairing, as well as 1:1 gfn:pfn mapping. > > Userspace page tables are not populated, e.g. reduced memory footprint, lower > probability of making private memory accessible to userspace. > > Provides line of sight to supporting page migration and swap. > > Provides line of sight to mapping MMIO pages into guest private memory. > > Cons (vs. struct page) > ====================== > > Significantly more churn in KVM, e.g. to plumb 'private' through where needed, > support memslot hole punching, etc... > > KVM's MMU gets another method of retrieving host pfn and page size. And the method is provided by host kernel? How does this method work? [...] 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index a272ccbddfa1..771080235b2d 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -2896,6 +2896,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, > if (max_level == PG_LEVEL_4K) > return PG_LEVEL_4K; > > + if (memslot_is_private(slot)) > + return slot->private_ops->pfn_mapping_level(...); > + Oh, any suggestion how host kernel decides the mapping level here? > host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot); > return min(host_level, max_level); > } > @@ -3835,9 +3838,11 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > > static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r) > { > - struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn); > + struct kvm_memory_slot *slot; > bool async; > > + slot = __kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn, fault->private); > + > /* > * Retry the page fault if the gfn hit a memslot that is being deleted > * or moved. This ensures any existing SPTEs for the old memslot will > @@ -3846,8 +3851,19 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, > if (slot && (slot->flags & KVM_MEMSLOT_INVALID)) > goto out_retry; > > + /* > + * Exit to userspace to map the requested private/shared memory region > + * if there is no memslot and (a) the access is private or (b) there is > + * an existing private memslot. Emulated MMIO must be accessed through > + * shared GPAs, thus a memslot miss on a private GPA is always handled > + * as an implicit conversion "request". > + */ For (b), do you mean this fault is for a GFN which marked as private, but now converted to a shared? If true, could we just disallow it if no explict sharing hypercall is triggered? > + if (!slot && > + (fault->private || __kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn, true))) > + goto out_convert; > + > if (!kvm_is_visible_memslot(slot)) { > - /* Don't expose private memslots to L2. */ > + /* Don't expose KVM's internal memslots to L2. */ > if (is_guest_mode(vcpu)) { > fault->pfn = KVM_PFN_NOSLOT; > fault->map_writable = false; > @@ -3890,6 +3906,12 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, > out_retry: > *r = RET_PF_RETRY; > return true; > + > +out_convert: > + vcpu->run->exit_reason = KVM_EXIT_MAP_MEMORY; > + /* TODO: fill vcpu->run with more info. */ > + *r = 0; > + return true; > } B.R. Yu
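For reference, the userspace flow sketched in the RFC above (create an fd, seal it as guest-private, hand it to KVM via a memslot) could look roughly like the following. F_SEAL_GUEST, KVM_MEM_PRIVATE and an fd-carrying memslot are proposals from this thread, not existing kernel ABI, and the constant values below are made up:

/*
 * Rough sketch only: F_SEAL_GUEST, KVM_MEM_PRIVATE and passing an fd in the
 * memslot are proposals from this thread, not existing kernel ABI.  The
 * numeric values are placeholders.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define F_SEAL_GUEST    0x0020          /* placeholder for the proposed seal */
#define KVM_MEM_PRIVATE (1UL << 2)      /* placeholder memslot flag */

static int add_private_memslot(int vm_fd, __u32 slot, __u64 gpa, __u64 size)
{
        int fd = memfd_create("guest-private", MFD_CLOEXEC | MFD_ALLOW_SEALING);

        /*
         * The RFC requires the file to be empty when the seal is applied;
         * the one-way conversion makes the fd unmappable/unreadable forever.
         */
        fcntl(fd, F_ADD_SEALS, F_SEAL_GUEST);
        ftruncate(fd, size);

        struct kvm_userspace_memory_region region = {
                .slot            = slot,
                .flags           = KVM_MEM_PRIVATE,
                .guest_phys_addr = gpa,
                .memory_size     = size,
                /* Stand-in for the proposed fd (+ offset) memslot fields. */
                .userspace_addr  = (__u64)fd,
        };

        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}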
On Tue, Aug 24, 2021, Yu Zhang wrote: > On Mon, Aug 23, 2021 at 05:52:48PM -0700, Sean Christopherson wrote: > > Thanks a lot for sharing these ideas. Lots of questions are inlined below. :) > > > The goal of this RFC is to try and align KVM, mm, and anyone else with skin in the > > game, on an acceptable direction for supporting guest private memory, e.g. for > > Intel's TDX. The TDX architectural effectively allows KVM guests to crash the > > host if guest private memory is accessible to host userspace, and thus does not > > What about incorrect/malicious accesses from host kernel? Should the direct mapping > also be removed for guest private memory? I would say that's out of scope for this RFC as it should not require any coordination between KVM and MM. > > play nice with KVM's existing approach of pulling the pfn and mapping level from > > the host page tables. > > > > This is by no means a complete patch; it's a rough sketch of the KVM changes that > > would be needed. The kernel side of things is completely omitted from the patch; > > the design concept is below. > > > > There's also fair bit of hand waving on implementation details that shouldn't > > fundamentally change the overall ABI, e.g. how the backing store will ensure > > there are no mappings when "converting" to guest private. > > > > Background > > ========== > > > > This is a loose continuation of Kirill's RFC[*] to support TDX guest private > > memory by tracking guest memory at the 'struct page' level. This proposal is the > > result of several offline discussions that were prompted by Andy Lutomirksi's > > concerns with tracking via 'struct page': > > > > 1. The kernel wouldn't easily be able to enforce a 1:1 page:guest association, > > let alone a 1:1 pfn:gfn mapping. > > May I ask why? Doesn't FOLL_GUEST in Kirill's earlier patch work? Or just > because traversing the host PT to get a PFN(for a PageGuest(page)) is too > heavy? To ensure a 1:1 page:guest, 'struct page' would need to track enough information to uniquely identify which guest owns the page. With TDX, one can argue that the kernel can rely on the TDX-Module to do that tracking (I argued this :-)), but for SEV and pure software implementations there is no third party that's tracking who owns what. In other words, without stashing an identifier in 'struct page', the kernel would only know that _a_ guest owns the page, it couldn't identify which guest owns the page. And allocating that much space in 'struct page' would be painfully expensive. 1:1 pfn:gfn is even worse. E.g. if userspace maps the same file as MAP_GUEST at multiple host virtual adddress, then configures KVM's memslots to have multiple regions, one for each alias, the guest will end up with 1:N pfn:gfn mappings. Preventing that would be nigh impossible without ending up with a plethora of races. > > 2. Does not work for memory that isn't backed by 'struct page', e.g. if devices > > gain support for exposing encrypted memory regions to guests. > > Do you mean that a page not backed by 'struct page' might be mapped to other > user space? I thought the VM_GUEST flags for the VMA could prevent that(though > I may possiblely be wrong). Could you explain more? Thanks! It's about enabling, not preventing. If in the future there is some form of memory that is not backed by 'struct page' (CXL memory?), that can also be mapped into a protected KVM guest, relying on 'struct page' to ensure that only the owning KVM guest has access to that memory would not work. > > 3. 
Does not help march toward page migration or swap support (though it doesn't > > hurt either). > > > > [*] https://lkml.kernel.org/r/20210416154106.23721-1-kirill.shutemov@linux.intel.com > > > > Concept > > ======= > > > > Guest private memory must be backed by an "enlightened" file descriptor, where > > "enlightened" means the implementing subsystem supports a one-way "conversion" to > > guest private memory and provides bi-directional hooks to communicate directly > > with KVM. Creating a private fd doesn't necessarily have to be a conversion, e.g. it > > could also be a flag provided at file creation, a property of the file system itself, > > etc... > > > > Before a private fd can be mapped into a KVM guest, it must be paired 1:1 with a > > KVM guest, i.e. multiple guests cannot share a fd. At pairing, KVM and the fd's > > subsystem exchange a set of function pointers to allow KVM to call into the subsystem, > > e.g. to translate gfn->pfn, and vice versa to allow the subsystem to call into KVM, > > e.g. to invalidate/move/swap a gfn range. > > So the gfn->pfn translation is done by the fd's subsystem? Again, could you > please elaborate how? Rats, I meant to capture this in the "patch" and did not. The memslot would hold the base offset into the file instead of a host virtual address. I.e. instead of gfn -> hva -> pfn, it would roughly be gfn -> offset -> pfn. > And each private memory region would need a seperate group of callbacks? Each private memslot would have its own pointer to a virtual function table, but there would only be a single table per backing store implementation. > > Mapping a private fd in host userspace is disallowed, i.e. there is never a host > > virtual address associated with the fd and thus no userspace page tables pointing > > at the private memory. > > > > Pinning _from KVM_ is not required. If the backing store supports page migration > > and/or swap, it can query the KVM-provided function pointers to see if KVM supports > > the operation. If the operation is not supported (this will be the case initially > > in KVM), the backing store is responsible for ensuring correct functionality. > > > > Unmapping guest memory, e.g. to prevent use-after-free, is handled via a callback > > from the backing store to KVM. KVM will employ techniques similar to those it uses > > for mmu_notifiers to ensure the guest cannot access freed memory. > > > > A key point is that, unlike similar failed proposals of the past, e.g. /dev/mktme, > > existing backing stores can be englightened, a from-scratch implementations is not > > required (though would obviously be possible as well). > > > > One idea for extending existing backing stores, e.g. HugeTLBFS and tmpfs, is > > to add F_SEAL_GUEST, which would convert the entire file to guest private memory > > and either fail if the current size is non-zero or truncate the size to zero. > > Have you discussed memfd_secret(if host direct mapping is also to be removed)? No. From a userspace perspective, the two would be mutually exclusive > And how does this F_SEAL_GUEST work? Are you asking about semantics or implementation? If it's implementation, that's firmly in the handwaving part :-) Semantically, once F_SEAL_GUEST is set it can't be removed, i.e. the file is forever "guest private". In a way it's destructive, e.g. the host can't ever read out the memory, but that's really just a reflection of the hardware, e.g. even with SEV, the host can read the memory but it can never decrypt the memory. 
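The gfn -> offset -> pfn lookup and per-backing-store function table described above might look something like this inside KVM. Apart from pfn_mapping_level(), which appears in the RFC diff, every name and field here is a guess:

/*
 * Illustrative sketch of the callback table a backing store could register
 * with KVM at fd/guest pairing time.  Only ->pfn_mapping_level() is named in
 * the RFC diff; the rest is a guess at what gfn -> offset -> pfn could need.
 */
struct private_memory_ops {
        /* Resolve a file offset to a pfn; replaces gfn_to_hva() + GUP. */
        kvm_pfn_t (*get_pfn)(struct file *file, loff_t offset);
        /* Largest mapping level backing @offset (PG_LEVEL_4K/2M/1G). */
        int (*pfn_mapping_level)(struct file *file, loff_t offset);
        /* Drop any reference taken by ->get_pfn(). */
        void (*put_pfn)(kvm_pfn_t pfn);
};

/*
 * gfn -> offset -> pfn for a private memslot.  base_file_offset and
 * private_file are hypothetical memslot fields standing in for the
 * userspace_addr a normal memslot carries.
 */
static kvm_pfn_t private_gfn_to_pfn(struct kvm_memory_slot *slot, gfn_t gfn)
{
        loff_t offset = slot->base_file_offset +
                        ((loff_t)(gfn - slot->base_gfn) << PAGE_SHIFT);

        return slot->private_ops->get_pfn(slot->private_file, offset);
}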
> > > > KVM > > === > > > > Guest private memory is managed as a new address space, i.e. as a different set of > > memslots, similar to how KVM has a separate memory view for when a guest vCPU is > > executing in virtual SMM. SMM is mutually exclusive with guest private memory. > > > > The fd (the actual integer) is provided to KVM when a private memslot is added > > via KVM_SET_USER_MEMORY_REGION. This is when the aforementioned pairing occurs. > > My understanding of KVM_SET_USER_MEMORY_REGION is that, this ioctl is to > facilitate the binding of HVA and GPA ranges. But if there's no HVAs for > a private region at all, why do we need a memslot for it? Besides to keep > track of the private GFN ranges, and provide the callbacks, is there any > other reason? The short answer is that something in KVM needs to translate gfn->pfn. Using just the gfn isn't feasible because there's no anchor (see above regarding the base file offset). And even if the offset weren't needed, KVM still needs a lot of the same metadata, e.g. some day there may be read-only private memsots, dirty logging of private memslots, etc... Implementing something else entirely would also require an absurd amount of code churn as all of KVM revolves around gfn -> slot -> pfn. > Another question is: why do we need a whole new address space, instead of > one address space accommodating memslot types? Either way could be made to work, and there isn't thaaat much code difference. My initial thoughts were to use a flag, but there are some niceties that a separate address space provides (more below). > > By default, KVM memslot lookups will be "shared", only specific touchpoints will > > be modified to work with private memslots, e.g. guest page faults. All host > > accesses to guest memory, e.g. for emulation, will thus look for shared memory > > and naturally fail without attempting copy_to/from_user() if the guest attempts > > Becasue gfn_to_hva() will fail first? Yes. More precisely, gfn_to_hva() will fail because there is no memslot for the current address space (shared). This is one advantage over a flag, e.g. there's no need to check a flag after getting a memslot. It's somewhat superficial as it wouldn't be too difficult to add low level helpers to default to "shared", but there are multiple instances of those types of patterns, so I do hope/think it will yield to cleaner code. > > to coerce KVM into access private memory. Note, avoiding copy_to/from_user() and > > friends isn't strictly necessary, it's more of a happy side effect. > > > > A new KVM exit reason, e.g. KVM_EXIT_MEMORY_ERROR, and data struct in vcpu->run > > is added to propagate illegal accesses (see above) and implicit conversions > > Sorry, illegal accesses from VM? Illegal accesses from the host, e.g. attempting to read/write guest private memory via gfn_to_hva(). > Do you actually mean a KVM page fault caused by private access from VM, which > implicitly notifies KVM to mark it as private(e.g. by bouncing to Qemu, which > then creates a private memory region and ioctls into KVM)? > > If the answer is yes, how about naming the exit reason as KVM_EXIT_MEMORY_PRIVATE? > Meanwhile, is Qemu also supposed to invoke some system call into host kernel > before ioctls into KVM? I'm still confused where the kernel callbacks like > the gfn_to_pfn() come from(and how they function)... :) I would like the exit to be generic so that it can be reused for other, completely unrelated (yet similar) scenarios (see links below). > > to userspace (see below). 
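One way the separate address space discussed above could be wired up, reusing the same mechanism x86 already uses to give SMM its own memslot set. The private address-space index is hypothetical; kvm_arch_vcpu_memslots_id(), __kvm_memslots() and __gfn_to_memslot() are existing KVM internals:

/*
 * Hypothetical third address space for private memslots (0 = regular,
 * 1 = SMM on x86); KVM_ADDRESS_SPACE_NUM would need to be bumped.
 */
#define KVM_ADDRESS_SPACE_PRIVATE       2

static struct kvm_memory_slot *
__kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn, bool private)
{
        int as_id = private ? KVM_ADDRESS_SPACE_PRIVATE :
                              kvm_arch_vcpu_memslots_id(vcpu);

        return __gfn_to_memslot(__kvm_memslots(vcpu->kvm, as_id), gfn);
}

/*
 * Host accesses (emulation, gfn_to_hva(), etc.) only ever look up the
 * shared address spaces, so they miss private memslots and fail naturally,
 * without needing a per-memslot "private" flag check.
 */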
Note, the new exit reason + struct can also be to > > support several other feature requests in KVM[1][2]. > > > > The guest may explicitly or implicity request KVM to map a shared/private variant > > of a GFN. An explicit map request is done via hypercall (out of scope for this > > proposal as both TDX and SNP ABIs define such a hypercall). An implicit map request > > is triggered simply by the guest accessing the shared/private variant, which KVM > > sees as a guest page fault (EPT violation or #NPF). Ideally only explicit requests > > would be supported, but neither TDX nor SNP require this in their guest<->host ABIs. > > Well, I am wondering, should we assume all guest pages as shared or private by > default? I mean, if all guest pages are private when the VM is created, maybe > the private memslots can be initialized in VM creation time, and be deleted/splited > later(e.g. in response to guest sharing hypercalls)? Define "we". I absolutely expect the userspace VMM to assume all guest pages are private by default. But I feel very strongly that KVM should not be involved in the control logic for decided when to convert a given page between shared and private. Thus, deciding the default state is not KVM's responsibility. > It may simplify the logic, but may also restrict the VM type(e.g. to be TD guest). > > > > > For implicit or explicit mappings, if a memslot is found that fully covers the > > requested range (which is a single gfn for implicit mappings), KVM's normal guest > > page fault handling works with minimal modification. > > > > If a memslot is not found, for explicit mappings, KVM will exit to userspace with > > the aforementioned dedicated exit reason. For implict _private_ mappings, KVM will > > also immediately exit with the same dedicated reason. For implicit shared mappings, > > an additional check is required to differentiate between emulated MMIO and an > > implicit private->shared conversion[*]. If there is an existing private memslot > > for the gfn, KVM will exit to userspace, otherwise KVM will treat the access as an > > emulated MMIO access and handle the page fault accordingly. > > > > [1] https://lkml.kernel.org/r/YKxJLcg/WomPE422@google.com > > [2] https://lkml.kernel.org/r/20200617230052.GB27751@linux.intel.com > > > > Punching Holes > > ============== > > > > The expected userspace memory model is that mapping requests will be handled as > > conversions, e.g. on a shared mapping request, first unmap the private gfn range, > > then map the shared gfn range. A new KVM ioctl() will likely be needed to allow > > userspace to punch a hole in a memslot, as expressing such an operation isn't > > possible with KVM_SET_USER_MEMORY_REGION. While userspace could delete the > > memslot, then recreate three new memslots, doing so would be destructive to guest > > data as unmapping guest private memory (from the EPT/NPT tables) is destructive > > to the data for both TDX and SEV-SNP guests. > > May I ask why? Thanks! Hmm, for SNP it might not actually be destructive, I forget the exact flows for unmapping memory. Anyways, for TDX it's most certainly destructive. When mapping private pages into the guest (ignore pre-boot), the guest must accept a page before accessing the page, as a #VE will occur if the page is not accepted. This holds true even if the host unmaps and remaps the same PFN -> GFN. The TDX module also explicitly zeroes the page when installing a new mapping. 
And to prevent use-after-free, KVM must fully unmap a page if its corresponding memslot is deleted, e.g. it can't trust userspace to remap the memory at the exact gfn:pfn combo. Without a new API to punch a hole, deleting private memslots to create two smaller, discontiguous memslots would destroy the data in the two remaining/new memslots. > > Pros (vs. struct page) > > ====================== > > > > Easy to enforce 1:1 fd:guest pairing, as well as 1:1 gfn:pfn mapping. > > > > Userspace page tables are not populated, e.g. reduced memory footprint, lower > > probability of making private memory accessible to userspace. > > > > Provides line of sight to supporting page migration and swap. > > > > Provides line of sight to mapping MMIO pages into guest private memory. > > > > Cons (vs. struct page) > > ====================== > > > > Significantly more churn in KVM, e.g. to plumb 'private' through where needed, > > support memslot hole punching, etc... > > > > KVM's MMU gets another method of retrieving host pfn and page size. > > And the method is provided by host kernel? How does this method work? It's a function callback provided by the backing store. The details of the function would be likely specific to the backing store's implementation, e.g. HugeTLBFS would likely have its own implementation. > [...] > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > > index a272ccbddfa1..771080235b2d 100644 > > --- a/arch/x86/kvm/mmu/mmu.c > > +++ b/arch/x86/kvm/mmu/mmu.c > > @@ -2896,6 +2896,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, > > if (max_level == PG_LEVEL_4K) > > return PG_LEVEL_4K; > > > > + if (memslot_is_private(slot)) > > + return slot->private_ops->pfn_mapping_level(...); > > + > > Oh, any suggestion how host kernel decides the mapping level here? That decision comes from the backing store. E.g. HugeTLBFS would simply return its static/configured size/level. > > host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot); > > return min(host_level, max_level); > > } > > @@ -3835,9 +3838,11 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > > > > static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r) > > { > > - struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn); > > + struct kvm_memory_slot *slot; > > bool async; > > > > + slot = __kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn, fault->private); > > + > > /* > > * Retry the page fault if the gfn hit a memslot that is being deleted > > * or moved. This ensures any existing SPTEs for the old memslot will > > @@ -3846,8 +3851,19 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, > > if (slot && (slot->flags & KVM_MEMSLOT_INVALID)) > > goto out_retry; > > > > + /* > > + * Exit to userspace to map the requested private/shared memory region > > + * if there is no memslot and (a) the access is private or (b) there is > > + * an existing private memslot. Emulated MMIO must be accessed through > > + * shared GPAs, thus a memslot miss on a private GPA is always handled > > + * as an implicit conversion "request". > > + */ > > For (b), do you mean this fault is for a GFN which marked as private, but now > converted to a shared? Yes. > If true, could we just disallow it if no explict sharing hypercall is triggered? Sadly, no. Even if all hardware vendor specs _required_ such behavior, KVM has no recourse but to exit to userspace because there's no way to communicate to the guest that it accessed a "bad" address. 
E.g. KVM can't inject exceptions in TDX, and it can't touch guest register state to "return" an error code. Not to mention that the guest would be in some random flow accessing memory. Happily for KVM, it's userspace's problem^Wdecision.
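Since everything lands in userspace's lap at that point, one possible shape for the "more info" left as a TODO in the RFC's out_convert path, i.e. what a generic memory exit in vcpu->run could carry. The exit number, flag and layout are placeholders, not a defined ABI:

#define KVM_EXIT_MEMORY_ERROR           40              /* placeholder number */
#define KVM_MEMORY_EXIT_FLAG_PRIVATE    (1ULL << 0)     /* access was private */

/* Candidate member for the union in struct kvm_run. */
struct kvm_memory_exit {
        __u64 flags;    /* KVM_MEMORY_EXIT_FLAG_* */
        __u64 gpa;      /* guest physical address that triggered the exit */
        __u64 size;     /* size of the region userspace should (re)map */
};

/* The RFC's "TODO: fill vcpu->run with more info" could then become: */
        vcpu->run->exit_reason  = KVM_EXIT_MEMORY_ERROR;
        vcpu->run->memory.flags = fault->private ? KVM_MEMORY_EXIT_FLAG_PRIVATE : 0;
        vcpu->run->memory.gpa   = fault->gfn << PAGE_SHIFT;
        vcpu->run->memory.size  = PAGE_SIZE;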
On 24.08.21 02:52, Sean Christopherson wrote: > The goal of this RFC is to try and align KVM, mm, and anyone else with skin in the > game, on an acceptable direction for supporting guest private memory, e.g. for > Intel's TDX. The TDX architectural effectively allows KVM guests to crash the > host if guest private memory is accessible to host userspace, and thus does not > play nice with KVM's existing approach of pulling the pfn and mapping level from > the host page tables. > > This is by no means a complete patch; it's a rough sketch of the KVM changes that > would be needed. The kernel side of things is completely omitted from the patch; > the design concept is below. > > There's also fair bit of hand waving on implementation details that shouldn't > fundamentally change the overall ABI, e.g. how the backing store will ensure > there are no mappings when "converting" to guest private. > This is a lot of complexity and rather advanced approaches (not saying they are bad, just that we try to teach the whole stack something completely new). What I think would really help is a list of requirements, such that everybody is aware of what we actually want to achieve. Let me start: GFN: Guest Frame Number EPFN: Encrypted Physical Frame Number 1) An EPFN must not get mapped into more than one VM: it belongs exactly to one VM. It must neither be shared between VMs between processes nor between VMs within a processes. 2) User space (well, and actually the kernel) must never access an EPFN: - If we go for an fd, essentially all operations (read/write) have to fail. - If we have to map an EPFN into user space page tables (e.g., to simplify KVM), we could only allow fake swap entries such that "there is something" but it cannot be accessed and is flagged accordingly. - /proc/kcore and friends have to be careful as well and should not read this memory. So there has to be a way to flag these pages. 3) We need a way to express the GFN<->EPFN mapping and essentially assign an EPFN to a GFN. 4) Once we assigned a EPFN to a GFN, that assignment must not longer change. Further, an EPFN must not get assigned to multiple GFNs. 5) There has to be a way to "replace" encrypted parts by "shared" parts and the other way around. What else? > Background > ========== > > This is a loose continuation of Kirill's RFC[*] to support TDX guest private > memory by tracking guest memory at the 'struct page' level. This proposal is the > result of several offline discussions that were prompted by Andy Lutomirksi's > concerns with tracking via 'struct page': > > 1. The kernel wouldn't easily be able to enforce a 1:1 page:guest association, > let alone a 1:1 pfn:gfn mapping. Well, it could with some help on higher layers. Someone has to do the tracking. Marking EPFNs as EPFNs can actually be very helpful, e.g., allow /proc/kcore to just not touch such pages. If we want to do all the tracking in the struct page is a different story. > > 2. Does not work for memory that isn't backed by 'struct page', e.g. if devices > gain support for exposing encrypted memory regions to guests. Let's keep it simple. If a struct page is right now what we need to properly track it, so be it. If not, good. But let's not make this a requirement right from the start if it's stuff for the far future. > > 3. Does not help march toward page migration or swap support (though it doesn't > hurt either). "Does not help towards world peace, (though it doesn't hurt either)". 
Maybe let's ignore that for now, as it doesn't seem to be required to get something reasonable running. > > [*] https://lkml.kernel.org/r/20210416154106.23721-1-kirill.shutemov@linux.intel.com > > Concept > ======= > > Guest private memory must be backed by an "enlightened" file descriptor, where > "enlightened" means the implementing subsystem supports a one-way "conversion" to > guest private memory and provides bi-directional hooks to communicate directly > with KVM. Creating a private fd doesn't necessarily have to be a conversion, e.g. it > could also be a flag provided at file creation, a property of the file system itself, > etc... Doesn't sound too crazy. Maybe even introducing memfd_encrypted() if extending the other ones turns out too complicated. > > Before a private fd can be mapped into a KVM guest, it must be paired 1:1 with a > KVM guest, i.e. multiple guests cannot share a fd. At pairing, KVM and the fd's > subsystem exchange a set of function pointers to allow KVM to call into the subsystem, > e.g. to translate gfn->pfn, and vice versa to allow the subsystem to call into KVM, > e.g. to invalidate/move/swap a gfn range. > > Mapping a private fd in host userspace is disallowed, i.e. there is never a host > virtual address associated with the fd and thus no userspace page tables pointing > at the private memory. To keep the primary vs. secondary MMU thing working, I think it would actually be nice to go with special swap entries instead; it just keeps most things working as expected. But let's see where we end up. > > Pinning _from KVM_ is not required. If the backing store supports page migration > and/or swap, it can query the KVM-provided function pointers to see if KVM supports > the operation. If the operation is not supported (this will be the case initially > in KVM), the backing store is responsible for ensuring correct functionality. > > Unmapping guest memory, e.g. to prevent use-after-free, is handled via a callback > from the backing store to KVM. KVM will employ techniques similar to those it uses > for mmu_notifiers to ensure the guest cannot access freed memory. > > A key point is that, unlike similar failed proposals of the past, e.g. /dev/mktme, > existing backing stores can be englightened, a from-scratch implementations is not > required (though would obviously be possible as well). Right. But if it's just a bad fit, let's do something new. Just like we did with memfd_secret. > > One idea for extending existing backing stores, e.g. HugeTLBFS and tmpfs, is > to add F_SEAL_GUEST, which would convert the entire file to guest private memory > and either fail if the current size is non-zero or truncate the size to zero. While possible, I actually do have the feeling that we want eventually to have something new, as the semantics are just too different. But let's see. > KVM > === > > Guest private memory is managed as a new address space, i.e. as a different set of > memslots, similar to how KVM has a separate memory view for when a guest vCPU is > executing in virtual SMM. SMM is mutually exclusive with guest private memory. > > The fd (the actual integer) is provided to KVM when a private memslot is added > via KVM_SET_USER_MEMORY_REGION. This is when the aforementioned pairing occurs. > > By default, KVM memslot lookups will be "shared", only specific touchpoints will > be modified to work with private memslots, e.g. guest page faults. All host > accesses to guest memory, e.g. 
for emulation, will thus look for shared memory > and naturally fail without attempting copy_to/from_user() if the guest attempts > to coerce KVM into access private memory. Note, avoiding copy_to/from_user() and > friends isn't strictly necessary, it's more of a happy side effect. > > A new KVM exit reason, e.g. KVM_EXIT_MEMORY_ERROR, and data struct in vcpu->run > is added to propagate illegal accesses (see above) and implicit conversions > to userspace (see below). Note, the new exit reason + struct can also be to > support several other feature requests in KVM[1][2]. > > The guest may explicitly or implicity request KVM to map a shared/private variant > of a GFN. An explicit map request is done via hypercall (out of scope for this > proposal as both TDX and SNP ABIs define such a hypercall). An implicit map request > is triggered simply by the guest accessing the shared/private variant, which KVM > sees as a guest page fault (EPT violation or #NPF). Ideally only explicit requests > would be supported, but neither TDX nor SNP require this in their guest<->host ABIs. > > For implicit or explicit mappings, if a memslot is found that fully covers the > requested range (which is a single gfn for implicit mappings), KVM's normal guest > page fault handling works with minimal modification. > > If a memslot is not found, for explicit mappings, KVM will exit to userspace with > the aforementioned dedicated exit reason. For implict _private_ mappings, KVM will > also immediately exit with the same dedicated reason. For implicit shared mappings, > an additional check is required to differentiate between emulated MMIO and an > implicit private->shared conversion[*]. If there is an existing private memslot > for the gfn, KVM will exit to userspace, otherwise KVM will treat the access as an > emulated MMIO access and handle the page fault accordingly. Do you mean some kind of overlay. "Ordinary" user memory regions overlay "private user memory regions"? So when marking something shared, you'd leave the private user memory region alone and only create a new "ordinary"user memory regions that references shared memory in QEMU (IOW, a different mapping)? Reading below, I think you were not actually thinking about an overlay, but maybe overlays might actually be a nice concept to have instead. > Punching Holes > ============== > > The expected userspace memory model is that mapping requests will be handled as > conversions, e.g. on a shared mapping request, first unmap the private gfn range, > then map the shared gfn range. A new KVM ioctl() will likely be needed to allow > userspace to punch a hole in a memslot, as expressing such an operation isn't > possible with KVM_SET_USER_MEMORY_REGION. While userspace could delete the > memslot, then recreate three new memslots, doing so would be destructive to guest > data as unmapping guest private memory (from the EPT/NPT tables) is destructive > to the data for both TDX and SEV-SNP guests. If you'd treat it like an overlay, you'd not actually be punching holes. You'd only be creating/removing ordinary user memory regions when marking something shared/unshared. > > Pros (vs. struct page) > ====================== > > Easy to enforce 1:1 fd:guest pairing, as well as 1:1 gfn:pfn mapping. > > Userspace page tables are not populated, e.g. reduced memory footprint, lower > probability of making private memory accessible to userspace. Agreed to the first part, although I consider that a secondary concern. The second part, I'm not sure if that is really the case. 
Fake swap entries are just a marker. > > Provides line of sight to supporting page migration and swap. Again, let's leave that out for now. I think that's an kernel internal that will require quite some thought either way. > > Provides line of sight to mapping MMIO pages into guest private memory. That's an interesting thought. Would it work via overlays as well? Can you elaborate? > > Cons (vs. struct page) > ====================== > > Significantly more churn in KVM, e.g. to plumb 'private' through where needed, > support memslot hole punching, etc... > > KVM's MMU gets another method of retrieving host pfn and page size. > > Requires enabling in every backing store that someone wants to support. I think we will only care about anonymous memory eventually with huge/gigantic pages in the next years. Just to what memfd() is already limited. File-backed -- I don't know ... if at all, swapping ... in a couple of years ... > > Because the NUMA APIs work on virtual addresses, new syscalls fmove_pages(), > fbind(), etc... would be required to provide equivalents to existing NUMA > functionality (though those syscalls would likely be useful irrespective of guest > private memory). Right, that's because we don't have a VMA that describes all this. E.g., mbind(). > > Washes (vs. struct page) > ======================== > > A misbehaving guest that triggers a large number of shared memory mappings will > consume a large number of memslots. But, this is likely a wash as similar effect > would happen with VMAs in the struct page approach. Just cap it then to something sane. 32k which we have right now is crazy and only required in very special setups. You can just make QEMU override/set the KVM default. My wild idea after reading everything so far (full of flaws, just want to mention it, maybe it gives some ideas): Introduce memfd_encrypted(). Similar to like memfd_secret() - Most system calls will just fail. - Allow MAP_SHARED only. - Enforce VM_DONTDUMP and skip during fork(). - File size can change exactly once, before any mmap. (IIRC) Different to memfd_secret(), allow mapping each page of the fd exactly one time via mmap() into a single process. The simplest way would be requiring that only the whole file can be mmaped, and that can happen exactly once. memremap() and friends will fail. Splitting the VMA will fail. munmap()/mmap(MAP_FIXED) will fail. You'll have it in a single process mapped ever only and persistent. Unmap will only work when tearing down the MM. Hole punching via fallocate() has to be evaluated (below). You'll end up with a VMA that corresponds to the whole file in a single process only, and that cannot vanish, not even in parts. Define "ordinary" user memory slots as overlay on top of "encrypted" memory slots. Inside KVM, bail out if you encounter such a VMA inside a normal user memory slot. When creating a "encryped" user memory slot, require that the whole VMA is covered at creation time. You know the VMA can't change later. In QEMU, allocate for each RAMBlock a MAP_PRIVATE memfd_encrypted() and a MAP_SHARED memfd(). Make QEMU and other processes always access the MAP_SHARED part only. Initially, create "encrypted" user memory regions in KVM when defining the RAM layout, disallowing changes. Define the MAP_SHARED memfd() overlay user memory regions in KVM depending on the shared/private state. In the actual memory backend, flag all new allocated pages as PG_encrypted. 
Pages can be faulted into a process page table via a new fake swap entry, where KVM can look them up via a special GUP flag, as Kirill suggested. If accessed via user space or another GUP user, SIGBUS instead of converting the special swap entry.

Allow only a single encrypted VM per process ever in KVM for now. All that handshake regarding swapping/migration ... can be handled internally if ever required.

Memory hotplug: should not be an issue. Memory hotunplug would require some thought -- maybe fallocate(PUNCH_HOLE) will do.

Reducing memory consumption: with MADV_DONTNEED/fallocate(punch hole) we can reclaim memory at least within the shared memory part when switching back and forth. Maybe even in the MAP_PRIVATE/memfd_encrypted() part when marking something shared. TBD.
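A sketch of how the overlay model above could look from the VMM side, under the (big) assumptions that memfd_encrypted() exists and that KVM learns to accept an "ordinary" region layered over an encrypted one. The encrypted layer is registered once at setup; marking a range shared then only adds an ordinary memslot backed by the separate MAP_SHARED view. Everything below is a placeholder, not existing ABI:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/falloc.h>
#include <linux/kvm.h>

/*
 * Hypothetical: mark a [gpa, gpa + size) range of an encrypted RAM block as
 * shared by overlaying an ordinary user memory region on top of it.
 * @encrypted_fd comes from the proposed memfd_encrypted(); @shared_hva is
 * the MAP_SHARED memfd view, the only one the VMM ever touches.  Whether
 * the encrypted pages can be reclaimed via fallocate() is still TBD above.
 */
static int mark_range_shared(int vm_fd, __u32 slot, int encrypted_fd,
                             char *shared_hva, __u64 gpa, __u64 offset,
                             __u64 size)
{
        struct kvm_userspace_memory_region shared = {
                .slot            = slot,
                .guest_phys_addr = gpa,
                .memory_size     = size,
                .userspace_addr  = (__u64)(shared_hva + offset),
        };

        /* Reclaim the encrypted pages for this range, per the idea above. */
        fallocate(encrypted_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, size);

        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &shared);
}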
Thanks a lot for your explaination, Sean. Still many questions below. :) On Thu, Aug 26, 2021 at 12:35:21AM +0000, Sean Christopherson wrote: > On Tue, Aug 24, 2021, Yu Zhang wrote: > > On Mon, Aug 23, 2021 at 05:52:48PM -0700, Sean Christopherson wrote: > > > > Thanks a lot for sharing these ideas. Lots of questions are inlined below. :) > > > > > The goal of this RFC is to try and align KVM, mm, and anyone else with skin in the > > > game, on an acceptable direction for supporting guest private memory, e.g. for > > > Intel's TDX. The TDX architectural effectively allows KVM guests to crash the > > > host if guest private memory is accessible to host userspace, and thus does not > > > > What about incorrect/malicious accesses from host kernel? Should the direct mapping > > also be removed for guest private memory? > > I would say that's out of scope for this RFC as it should not require any > coordination between KVM and MM. So Linux MM still has choice to unmap the private memory in direct mapping, right? > > > > play nice with KVM's existing approach of pulling the pfn and mapping level from > > > the host page tables. > > > > > > This is by no means a complete patch; it's a rough sketch of the KVM changes that > > > would be needed. The kernel side of things is completely omitted from the patch; > > > the design concept is below. > > > > > > There's also fair bit of hand waving on implementation details that shouldn't > > > fundamentally change the overall ABI, e.g. how the backing store will ensure > > > there are no mappings when "converting" to guest private. > > > > > > Background > > > ========== > > > > > > This is a loose continuation of Kirill's RFC[*] to support TDX guest private > > > memory by tracking guest memory at the 'struct page' level. This proposal is the > > > result of several offline discussions that were prompted by Andy Lutomirksi's > > > concerns with tracking via 'struct page': > > > > > > 1. The kernel wouldn't easily be able to enforce a 1:1 page:guest association, > > > let alone a 1:1 pfn:gfn mapping. > > > > May I ask why? Doesn't FOLL_GUEST in Kirill's earlier patch work? Or just > > because traversing the host PT to get a PFN(for a PageGuest(page)) is too > > heavy? > > To ensure a 1:1 page:guest, 'struct page' would need to track enough information > to uniquely identify which guest owns the page. With TDX, one can argue that the > kernel can rely on the TDX-Module to do that tracking (I argued this :-)), but > for SEV and pure software implementations there is no third party that's tracking > who owns what. In other words, without stashing an identifier in 'struct page', > the kernel would only know that _a_ guest owns the page, it couldn't identify which > guest owns the page. And allocating that much space in 'struct page' would be > painfully expensive. So it's to make sure a private page can only be assigned to one guest. I thought PAMT in TDX and RMP in SEV-SNP can do this, is this understanding correct? But for pure software implementations, it's a problem indeed. And why would using a specific fd can achieve this? I saw requirement to only bind the fd to one guest, but is this enough to guarantee the 1:1 page:guest binding? > > 1:1 pfn:gfn is even worse. E.g. if userspace maps the same file as MAP_GUEST at > multiple host virtual adddress, then configures KVM's memslots to have multiple > regions, one for each alias, the guest will end up with 1:N pfn:gfn mappings. 
> Preventing that would be nigh impossible without ending up with a plethora of > races. You mean different memslots(in one address space) may contain the same gfn? IIUC, gfn overlap is not allowed: kvm_set_memory_region() will just return -EEXIST. > > > > 2. Does not work for memory that isn't backed by 'struct page', e.g. if devices > > > gain support for exposing encrypted memory regions to guests. > > > > Do you mean that a page not backed by 'struct page' might be mapped to other > > user space? I thought the VM_GUEST flags for the VMA could prevent that(though > > I may possiblely be wrong). Could you explain more? Thanks! > > It's about enabling, not preventing. If in the future there is some form of memory > that is not backed by 'struct page' (CXL memory?), that can also be mapped into > a protected KVM guest, relying on 'struct page' to ensure that only the owning KVM > guest has access to that memory would not work. Oh. Keeping the page owner info in 'struct page' is not a good choice. But still, my quesions are: 1> If TDX module/ SEV-SNP can do this, do we still need to enforce a 1:1 page:guest association in host kernel - if we are not so ambitious to also make this work as a pure software design? :-) 2> Besides, could you please explain how this design achieves the 1:1 association? > > > > 3. Does not help march toward page migration or swap support (though it doesn't > > > hurt either). > > > > > > [*] https://lkml.kernel.org/r/20210416154106.23721-1-kirill.shutemov@linux.intel.com > > > > > > Concept > > > ======= > > > > > > Guest private memory must be backed by an "enlightened" file descriptor, where > > > "enlightened" means the implementing subsystem supports a one-way "conversion" to > > > guest private memory and provides bi-directional hooks to communicate directly > > > with KVM. Creating a private fd doesn't necessarily have to be a conversion, e.g. it > > > could also be a flag provided at file creation, a property of the file system itself, > > > etc... > > > > > > Before a private fd can be mapped into a KVM guest, it must be paired 1:1 with a > > > KVM guest, i.e. multiple guests cannot share a fd. At pairing, KVM and the fd's > > > subsystem exchange a set of function pointers to allow KVM to call into the subsystem, > > > e.g. to translate gfn->pfn, and vice versa to allow the subsystem to call into KVM, > > > e.g. to invalidate/move/swap a gfn range. > > > > So the gfn->pfn translation is done by the fd's subsystem? Again, could you > > please elaborate how? > > Rats, I meant to capture this in the "patch" and did not. The memslot would hold > the base offset into the file instead of a host virtual address. I.e. instead of > gfn -> hva -> pfn, it would roughly be gfn -> offset -> pfn. > And the offset just equals gfn? > > And each private memory region would need a seperate group of callbacks? > > Each private memslot would have its own pointer to a virtual function table, but > there would only be a single table per backing store implementation. > Hmm. So we can have various backing stores for different memslots? And Qemu would inform KVM about this for each memslot? > > > Mapping a private fd in host userspace is disallowed, i.e. there is never a host > > > virtual address associated with the fd and thus no userspace page tables pointing > > > at the private memory. > > > > > > Pinning _from KVM_ is not required. 
If the backing store supports page migration > > > and/or swap, it can query the KVM-provided function pointers to see if KVM supports > > > the operation. If the operation is not supported (this will be the case initially > > > in KVM), the backing store is responsible for ensuring correct functionality. > > > > > > Unmapping guest memory, e.g. to prevent use-after-free, is handled via a callback > > > from the backing store to KVM. KVM will employ techniques similar to those it uses > > > for mmu_notifiers to ensure the guest cannot access freed memory. > > > > > > A key point is that, unlike similar failed proposals of the past, e.g. /dev/mktme, > > > existing backing stores can be englightened, a from-scratch implementations is not > > > required (though would obviously be possible as well). > > > > > > One idea for extending existing backing stores, e.g. HugeTLBFS and tmpfs, is > > > to add F_SEAL_GUEST, which would convert the entire file to guest private memory > > > and either fail if the current size is non-zero or truncate the size to zero. > > > > Have you discussed memfd_secret(if host direct mapping is also to be removed)? > > No. From a userspace perspective, the two would be mutually exclusive > Sorry? By "mutually exclusive", do you mean we can NOT remove both userspace mappings and kernel direct mappings at the same time? > > And how does this F_SEAL_GUEST work? > > Are you asking about semantics or implementation? If it's implementation, that's > firmly in the handwaving part :-) Semantically, once F_SEAL_GUEST is set it can't > be removed, i.e. the file is forever "guest private". In a way it's destructive, > e.g. the host can't ever read out the memory, but that's really just a reflection > of the hardware, e.g. even with SEV, the host can read the memory but it can never > decrypt the memory. > Acctually, I am asking both. :-) E.g., what do we expect of this fd. For now, my understandings are: 1> It's a dedicated fd, and can not be shared between different processes. 2> mmap() shall fail on this fd. 3> With F_SEAL_GUEST, size of this file is set to 0, or need be truncated to 0. But when should this fd be created? What operations do we expect this fd to offer? And how does this fd function as a channel between MM and KVM? > > > > > > KVM > > > === > > > > > > Guest private memory is managed as a new address space, i.e. as a different set of > > > memslots, similar to how KVM has a separate memory view for when a guest vCPU is > > > executing in virtual SMM. SMM is mutually exclusive with guest private memory. > > > > > > The fd (the actual integer) is provided to KVM when a private memslot is added > > > via KVM_SET_USER_MEMORY_REGION. This is when the aforementioned pairing occurs. > > > > My understanding of KVM_SET_USER_MEMORY_REGION is that, this ioctl is to > > facilitate the binding of HVA and GPA ranges. But if there's no HVAs for > > a private region at all, why do we need a memslot for it? Besides to keep > > track of the private GFN ranges, and provide the callbacks, is there any > > other reason? > > The short answer is that something in KVM needs to translate gfn->pfn. Using just > the gfn isn't feasible because there's no anchor (see above regarding the base file > offset). And even if the offset weren't needed, KVM still needs a lot of the same > metadata, e.g. some day there may be read-only private memsots, dirty logging of > private memslots, etc... 
Implementing something else entirely would also require > an absurd amount of code churn as all of KVM revolves around gfn -> slot -> pfn. > Yes. Thanks! > > Another question is: why do we need a whole new address space, instead of > > one address space accommodating memslot types? > > Either way could be made to work, and there isn't thaaat much code difference. My > initial thoughts were to use a flag, but there are some niceties that a separate > address space provides (more below). > > > > By default, KVM memslot lookups will be "shared", only specific touchpoints will > > > be modified to work with private memslots, e.g. guest page faults. All host > > > accesses to guest memory, e.g. for emulation, will thus look for shared memory > > > and naturally fail without attempting copy_to/from_user() if the guest attempts > > > > Becasue gfn_to_hva() will fail first? > > Yes. More precisely, gfn_to_hva() will fail because there is no memslot for the > current address space (shared). This is one advantage over a flag, e.g. there's no > need to check a flag after getting a memslot. It's somewhat superficial as it > wouldn't be too difficult to add low level helpers to default to "shared", but there > are multiple instances of those types of patterns, so I do hope/think it will yield > to cleaner code. > > > > to coerce KVM into access private memory. Note, avoiding copy_to/from_user() and > > > friends isn't strictly necessary, it's more of a happy side effect. > > > > > > A new KVM exit reason, e.g. KVM_EXIT_MEMORY_ERROR, and data struct in vcpu->run > > > is added to propagate illegal accesses (see above) and implicit conversions > > > > Sorry, illegal accesses from VM? > > Illegal accesses from the host, e.g. attempting to read/write guest private memory > via gfn_to_hva(). > Sorry, I do not get it. Without a valid HVA, how would the host perform such illegal accesses? > > Do you actually mean a KVM page fault caused by private access from VM, which > > implicitly notifies KVM to mark it as private(e.g. by bouncing to Qemu, which > > then creates a private memory region and ioctls into KVM)? > > > > If the answer is yes, how about naming the exit reason as KVM_EXIT_MEMORY_PRIVATE? > > Meanwhile, is Qemu also supposed to invoke some system call into host kernel > > before ioctls into KVM? I'm still confused where the kernel callbacks like > > the gfn_to_pfn() come from(and how they function)... :) > > I would like the exit to be generic so that it can be reused for other, completely > unrelated (yet similar) scenarios (see links below). > > > > to userspace (see below). Note, the new exit reason + struct can also be to > > > support several other feature requests in KVM[1][2]. > > > > > > The guest may explicitly or implicity request KVM to map a shared/private variant > > > of a GFN. An explicit map request is done via hypercall (out of scope for this > > > proposal as both TDX and SNP ABIs define such a hypercall). An implicit map request > > > is triggered simply by the guest accessing the shared/private variant, which KVM > > > sees as a guest page fault (EPT violation or #NPF). Ideally only explicit requests > > > would be supported, but neither TDX nor SNP require this in their guest<->host ABIs. > > > > Well, I am wondering, should we assume all guest pages as shared or private by > > default? I mean, if all guest pages are private when the VM is created, maybe > > the private memslots can be initialized in VM creation time, and be deleted/splited > > later(e.g. 
in response to guest sharing hypercalls)? > > Define "we". I absolutely expect the userspace VMM to assume all guest pages are > private by default. But I feel very strongly that KVM should not be involved in > the control logic for decided when to convert a given page between shared and > private. Thus, deciding the default state is not KVM's responsibility. > Could qemu just tell KVM this is a protected VM when creating it? For runtime conversion, KVM can handle hypercalls from guest, and forward to Qemu, just like what Kirill did in previous RFC patches. > > It may simplify the logic, but may also restrict the VM type(e.g. to be TD guest). > > > > > > > > For implicit or explicit mappings, if a memslot is found that fully covers the > > > requested range (which is a single gfn for implicit mappings), KVM's normal guest > > > page fault handling works with minimal modification. > > > > > > If a memslot is not found, for explicit mappings, KVM will exit to userspace with > > > the aforementioned dedicated exit reason. For implict _private_ mappings, KVM will > > > also immediately exit with the same dedicated reason. For implicit shared mappings, > > > an additional check is required to differentiate between emulated MMIO and an > > > implicit private->shared conversion[*]. If there is an existing private memslot > > > for the gfn, KVM will exit to userspace, otherwise KVM will treat the access as an > > > emulated MMIO access and handle the page fault accordingly. > > > > > > [1] https://lkml.kernel.org/r/YKxJLcg/WomPE422@google.com > > > [2] https://lkml.kernel.org/r/20200617230052.GB27751@linux.intel.com > > > > > > Punching Holes > > > ============== > > > > > > The expected userspace memory model is that mapping requests will be handled as > > > conversions, e.g. on a shared mapping request, first unmap the private gfn range, > > > then map the shared gfn range. A new KVM ioctl() will likely be needed to allow > > > userspace to punch a hole in a memslot, as expressing such an operation isn't > > > possible with KVM_SET_USER_MEMORY_REGION. While userspace could delete the > > > memslot, then recreate three new memslots, doing so would be destructive to guest > > > data as unmapping guest private memory (from the EPT/NPT tables) is destructive > > > to the data for both TDX and SEV-SNP guests. > > > > May I ask why? Thanks! > > Hmm, for SNP it might not actually be destructive, I forget the exact flows for > unmapping memory. > > Anyways, for TDX it's most certainly destructive. When mapping private pages into > the guest (ignore pre-boot), the guest must accept a page before accessing the page, > as a #VE will occur if the page is not accepted. This holds true even if the host > unmaps and remaps the same PFN -> GFN. The TDX module also explicitly zeroes the > page when installing a new mapping. > > And to prevent use-after-free, KVM must fully unmap a page if its corresponding > memslot is deleted, e.g. it can't trust userspace to remap the memory at the exact > gfn:pfn combo. > > Without a new API to punch a hole, deleting private memslots to create two smaller, > discontiguous memslots would destroy the data in the two remaining/new memslots. > So that's because deleting a memslot will inevitably zeros guest private pages? > > > Pros (vs. struct page) > > > ====================== > > > > > > Easy to enforce 1:1 fd:guest pairing, as well as 1:1 gfn:pfn mapping. > > > > > > Userspace page tables are not populated, e.g. 
reduced memory footprint, lower > > > probability of making private memory accessible to userspace. > > > > > > Provides line of sight to supporting page migration and swap. > > > > > > Provides line of sight to mapping MMIO pages into guest private memory. > > > > > > Cons (vs. struct page) > > > ====================== > > > > > > Significantly more churn in KVM, e.g. to plumb 'private' through where needed, > > > support memslot hole punching, etc... > > > > > > KVM's MMU gets another method of retrieving host pfn and page size. > > > > And the method is provided by host kernel? How does this method work? > > It's a function callback provided by the backing store. The details of the > function would be likely specific to the backing store's implementation, e.g. > HugeTLBFS would likely have its own implementation. > HugeTLBFS is a nice-to-have, not a must, right? Is there any other backing store we need support first? > > [...] > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > > > index a272ccbddfa1..771080235b2d 100644 > > > --- a/arch/x86/kvm/mmu/mmu.c > > > +++ b/arch/x86/kvm/mmu/mmu.c > > > @@ -2896,6 +2896,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, > > > if (max_level == PG_LEVEL_4K) > > > return PG_LEVEL_4K; > > > > > > + if (memslot_is_private(slot)) > > > + return slot->private_ops->pfn_mapping_level(...); > > > + > > > > Oh, any suggestion how host kernel decides the mapping level here? > > That decision comes from the backing store. E.g. HugeTLBFS would simply return > its static/configured size/level. > > > > host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot); > > > return min(host_level, max_level); > > > } > > > @@ -3835,9 +3838,11 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > > > > > > static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r) > > > { > > > - struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn); > > > + struct kvm_memory_slot *slot; > > > bool async; > > > > > > + slot = __kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn, fault->private); > > > + > > > /* > > > * Retry the page fault if the gfn hit a memslot that is being deleted > > > * or moved. This ensures any existing SPTEs for the old memslot will > > > @@ -3846,8 +3851,19 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, > > > if (slot && (slot->flags & KVM_MEMSLOT_INVALID)) > > > goto out_retry; > > > > > > + /* > > > + * Exit to userspace to map the requested private/shared memory region > > > + * if there is no memslot and (a) the access is private or (b) there is > > > + * an existing private memslot. Emulated MMIO must be accessed through > > > + * shared GPAs, thus a memslot miss on a private GPA is always handled > > > + * as an implicit conversion "request". > > > + */ > > > > For (b), do you mean this fault is for a GFN which marked as private, but now > > converted to a shared? > > Yes. > > > If true, could we just disallow it if no explict sharing hypercall is triggered? > > Sadly, no. Even if all hardware vendor specs _required_ such behavior, KVM has no > recourse but to exit to userspace because there's no way to communicate to the > guest that it accessed a "bad" address. E.g. KVM can't inject exceptions in TDX, > and it can't touch guest register state to "return" an error code, Not to mention > that the guest would be in some random flow accessing memory. 
So, guest can just use a shared GPA(with shared-bit set) directly, without the need to explicitly ask KVM to convert the private one? > > Happily for KVM, it's userspace's problem^Wdecision. > And with KVM_EXIT_MAP_MEMORY returned, qemu should trigger a new ioctl into KVM, which punches holes in KVM private address space's memslots? B.R. Yu
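To make the "channel" question above a bit more concrete, here is a rough sketch of what the pairing between KVM and an enlightened fd could look like. All names below (guest_private_memory_ops, guest_private_memory_notifier and their members) are hypothetical placeholders for illustration, not existing kernel APIs or a concrete proposal from this thread:

#include <linux/fs.h>
#include <linux/kvm_host.h>

/* Ops the backing store hands to KVM when the fd is paired with a VM. */
struct guest_private_memory_ops {
        /* Translate a file offset into a pfn for KVM's page fault handler. */
        int (*get_pfn)(struct file *file, pgoff_t offset, kvm_pfn_t *pfn);
        /* Largest mapping level (4K/2M/1G) the backing store allows here. */
        int (*pfn_mapping_level)(struct file *file, pgoff_t offset);
};

/* Callbacks KVM hands back, so the backing store can invalidate ranges. */
struct guest_private_memory_notifier {
        /* Range is being freed/moved; KVM must unmap it from the guest. */
        void (*invalidate)(struct kvm *kvm, pgoff_t start, pgoff_t end);
};

Under this sketch, F_SEAL_GUEST (or an equivalent creation-time flag) would be the point at which the backing store starts refusing mmap()/read()/write() and becomes eligible for pairing with exactly one VM.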
On 8/26/21 3:15 AM, David Hildenbrand wrote: > On 24.08.21 02:52, Sean Christopherson wrote: >> The goal of this RFC is to try and align KVM, mm, and anyone else with >> skin in the >> game, on an acceptable direction for supporting guest private memory, >> e.g. for >> Intel's TDX. The TDX architectural effectively allows KVM guests to >> crash the >> host if guest private memory is accessible to host userspace, and thus >> does not >> play nice with KVM's existing approach of pulling the pfn and mapping >> level from >> the host page tables. >> >> This is by no means a complete patch; it's a rough sketch of the KVM >> changes that >> would be needed. The kernel side of things is completely omitted from >> the patch; >> the design concept is below. >> >> There's also fair bit of hand waving on implementation details that >> shouldn't >> fundamentally change the overall ABI, e.g. how the backing store will >> ensure >> there are no mappings when "converting" to guest private. >> > > This is a lot of complexity and rather advanced approaches (not saying > they are bad, just that we try to teach the whole stack something > completely new). > > > What I think would really help is a list of requirements, such that > everybody is aware of what we actually want to achieve. Let me start: > > GFN: Guest Frame Number > EPFN: Encrypted Physical Frame Number > > > 1) An EPFN must not get mapped into more than one VM: it belongs exactly > to one VM. It must neither be shared between VMs between processes nor > between VMs within a processes. > > > 2) User space (well, and actually the kernel) must never access an EPFN: > > - If we go for an fd, essentially all operations (read/write) have to > fail. > - If we have to map an EPFN into user space page tables (e.g., to > simplify KVM), we could only allow fake swap entries such that "there > is something" but it cannot be accessed and is flagged accordingly. > - /proc/kcore and friends have to be careful as well and should not read > this memory. So there has to be a way to flag these pages. > > 3) We need a way to express the GFN<->EPFN mapping and essentially > assign an EPFN to a GFN. > > > 4) Once we assigned a EPFN to a GFN, that assignment must not longer > change. Further, an EPFN must not get assigned to multiple GFNs. > > > 5) There has to be a way to "replace" encrypted parts by "shared" parts > and the other way around. > > What else? > > > >> Background >> ========== >> >> This is a loose continuation of Kirill's RFC[*] to support TDX guest >> private >> memory by tracking guest memory at the 'struct page' level. This >> proposal is the >> result of several offline discussions that were prompted by Andy >> Lutomirksi's >> concerns with tracking via 'struct page': >> >> 1. The kernel wouldn't easily be able to enforce a 1:1 page:guest >> association, >> let alone a 1:1 pfn:gfn mapping. > > Well, it could with some help on higher layers. Someone has to do the > tracking. Marking EPFNs as EPFNs can actually be very helpful, e.g., > allow /proc/kcore to just not touch such pages. If we want to do all the > tracking in the struct page is a different story. > >> >> 2. Does not work for memory that isn't backed by 'struct page', >> e.g. if devices >> gain support for exposing encrypted memory regions to guests. > > Let's keep it simple. If a struct page is right now what we need to > properly track it, so be it. If not, good. But let's not make this a > requirement right from the start if it's stuff for the far future. > >> >> 3. 
Does not help march toward page migration or swap support >> (though it doesn't >> hurt either). > > "Does not help towards world peace, (though it doesn't hurt either)". > > Maybe let's ignore that for now, as it doesn't seem to be required to > get something reasonable running. > >> >> [*] >> https://lkml.kernel.org/r/20210416154106.23721-1-kirill.shutemov@linux.intel.com >> >> >> Concept >> ======= >> >> Guest private memory must be backed by an "enlightened" file >> descriptor, where >> "enlightened" means the implementing subsystem supports a one-way >> "conversion" to >> guest private memory and provides bi-directional hooks to communicate >> directly >> with KVM. Creating a private fd doesn't necessarily have to be a >> conversion, e.g. it >> could also be a flag provided at file creation, a property of the file >> system itself, >> etc... > > Doesn't sound too crazy. Maybe even introducing memfd_encrypted() if > extending the other ones turns out too complicated. > >> >> Before a private fd can be mapped into a KVM guest, it must be paired >> 1:1 with a >> KVM guest, i.e. multiple guests cannot share a fd. At pairing, KVM >> and the fd's >> subsystem exchange a set of function pointers to allow KVM to call >> into the subsystem, >> e.g. to translate gfn->pfn, and vice versa to allow the subsystem to >> call into KVM, >> e.g. to invalidate/move/swap a gfn range. >> >> Mapping a private fd in host userspace is disallowed, i.e. there is >> never a host >> virtual address associated with the fd and thus no userspace page >> tables pointing >> at the private memory. > > To keep the primary vs. secondary MMU thing working, I think it would > actually be nice to go with special swap entries instead; it just keeps > most things working as expected. But let's see where we end up. > >> >> Pinning _from KVM_ is not required. If the backing store supports >> page migration >> and/or swap, it can query the KVM-provided function pointers to see if >> KVM supports >> the operation. If the operation is not supported (this will be the >> case initially >> in KVM), the backing store is responsible for ensuring correct >> functionality. >> >> Unmapping guest memory, e.g. to prevent use-after-free, is handled via >> a callback >> from the backing store to KVM. KVM will employ techniques similar to >> those it uses >> for mmu_notifiers to ensure the guest cannot access freed memory. >> >> A key point is that, unlike similar failed proposals of the past, e.g. >> /dev/mktme, >> existing backing stores can be englightened, a from-scratch >> implementations is not >> required (though would obviously be possible as well). > > Right. But if it's just a bad fit, let's do something new. Just like we > did with memfd_secret. > >> >> One idea for extending existing backing stores, e.g. HugeTLBFS and >> tmpfs, is >> to add F_SEAL_GUEST, which would convert the entire file to guest >> private memory >> and either fail if the current size is non-zero or truncate the size >> to zero. > > While possible, I actually do have the feeling that we want eventually > to have something new, as the semantics are just too different. But > let's see. > > >> KVM >> === >> >> Guest private memory is managed as a new address space, i.e. as a >> different set of >> memslots, similar to how KVM has a separate memory view for when a >> guest vCPU is >> executing in virtual SMM. SMM is mutually exclusive with guest >> private memory. 
>> >> The fd (the actual integer) is provided to KVM when a private memslot >> is added >> via KVM_SET_USER_MEMORY_REGION. This is when the aforementioned >> pairing occurs. >> >> By default, KVM memslot lookups will be "shared", only specific >> touchpoints will >> be modified to work with private memslots, e.g. guest page faults. >> All host >> accesses to guest memory, e.g. for emulation, will thus look for >> shared memory >> and naturally fail without attempting copy_to/from_user() if the guest >> attempts >> to coerce KVM into access private memory. Note, avoiding >> copy_to/from_user() and >> friends isn't strictly necessary, it's more of a happy side effect. >> >> A new KVM exit reason, e.g. KVM_EXIT_MEMORY_ERROR, and data struct in >> vcpu->run >> is added to propagate illegal accesses (see above) and implicit >> conversions >> to userspace (see below). Note, the new exit reason + struct can also >> be to >> support several other feature requests in KVM[1][2]. >> >> The guest may explicitly or implicity request KVM to map a >> shared/private variant >> of a GFN. An explicit map request is done via hypercall (out of scope >> for this >> proposal as both TDX and SNP ABIs define such a hypercall). An >> implicit map request >> is triggered simply by the guest accessing the shared/private variant, >> which KVM >> sees as a guest page fault (EPT violation or #NPF). Ideally only >> explicit requests >> would be supported, but neither TDX nor SNP require this in their >> guest<->host ABIs. >> >> For implicit or explicit mappings, if a memslot is found that fully >> covers the >> requested range (which is a single gfn for implicit mappings), KVM's >> normal guest >> page fault handling works with minimal modification. >> >> If a memslot is not found, for explicit mappings, KVM will exit to >> userspace with >> the aforementioned dedicated exit reason. For implict _private_ >> mappings, KVM will >> also immediately exit with the same dedicated reason. For implicit >> shared mappings, >> an additional check is required to differentiate between emulated MMIO >> and an >> implicit private->shared conversion[*]. If there is an existing >> private memslot >> for the gfn, KVM will exit to userspace, otherwise KVM will treat the >> access as an >> emulated MMIO access and handle the page fault accordingly. > > Do you mean some kind of overlay. "Ordinary" user memory regions overlay > "private user memory regions"? So when marking something shared, you'd > leave the private user memory region alone and only create a new > "ordinary"user memory regions that references shared memory in QEMU > (IOW, a different mapping)? > > Reading below, I think you were not actually thinking about an overlay, > but maybe overlays might actually be a nice concept to have instead. > > >> Punching Holes >> ============== >> >> The expected userspace memory model is that mapping requests will be >> handled as >> conversions, e.g. on a shared mapping request, first unmap the private >> gfn range, >> then map the shared gfn range. A new KVM ioctl() will likely be >> needed to allow >> userspace to punch a hole in a memslot, as expressing such an >> operation isn't >> possible with KVM_SET_USER_MEMORY_REGION. While userspace could >> delete the >> memslot, then recreate three new memslots, doing so would be >> destructive to guest >> data as unmapping guest private memory (from the EPT/NPT tables) is >> destructive >> to the data for both TDX and SEV-SNP guests. 
> > If you'd treat it like an overlay, you'd not actually be punching holes. > You'd only be creating/removing ordinary user memory regions when > marking something shared/unshared. > >> >> Pros (vs. struct page) >> ====================== >> >> Easy to enforce 1:1 fd:guest pairing, as well as 1:1 gfn:pfn mapping. >> >> Userspace page tables are not populated, e.g. reduced memory >> footprint, lower >> probability of making private memory accessible to userspace. > > Agreed to the first part, although I consider that a secondary concern. > The second part, I'm not sure if that is really the case. Fake swap > entries are just a marker. > >> >> Provides line of sight to supporting page migration and swap. > > Again, let's leave that out for now. I think that's an kernel internal > that will require quite some thought either way. > >> >> Provides line of sight to mapping MMIO pages into guest private memory. > > That's an interesting thought. Would it work via overlays as well? Can > you elaborate? > >> >> Cons (vs. struct page) >> ====================== >> >> Significantly more churn in KVM, e.g. to plumb 'private' through where >> needed, >> support memslot hole punching, etc... >> >> KVM's MMU gets another method of retrieving host pfn and page size. >> >> Requires enabling in every backing store that someone wants to support. > > I think we will only care about anonymous memory eventually with > huge/gigantic pages in the next years. Just to what memfd() is already > limited. File-backed -- I don't know ... if at all, swapping ... in a > couple of years ... > >> >> Because the NUMA APIs work on virtual addresses, new syscalls >> fmove_pages(), >> fbind(), etc... would be required to provide equivalents to existing NUMA >> functionality (though those syscalls would likely be useful >> irrespective of guest >> private memory). > > Right, that's because we don't have a VMA that describes all this. E.g., > mbind(). > >> >> Washes (vs. struct page) >> ======================== >> >> A misbehaving guest that triggers a large number of shared memory >> mappings will >> consume a large number of memslots. But, this is likely a wash as >> similar effect >> would happen with VMAs in the struct page approach. > > Just cap it then to something sane. 32k which we have right now is crazy > and only required in very special setups. You can just make QEMU > override/set the KVM default. > > > > > > My wild idea after reading everything so far (full of flaws, just want > to mention it, maybe it gives some ideas): > > Introduce memfd_encrypted(). > > Similar to like memfd_secret() > - Most system calls will just fail. > - Allow MAP_SHARED only. > - Enforce VM_DONTDUMP and skip during fork(). This seems like it would work, but integrating it with the hugetlb reserve mechanism might be rather messy. > - File size can change exactly once, before any mmap. (IIRC) Why is this needed? Obviously if the file size can be reduced, then the pages need to get removed safely, but this seems doable if there's a use case. > > Different to memfd_secret(), allow mapping each page of the fd exactly > one time via mmap() into a single process. This doesn't solve the case of multiple memslots pointing at the same address. It also doesn't help with future operations that need to map from a memfd_encrypted() backing page to the GPA that maps it. > You'll end up with a VMA that corresponds to the whole file in a single > process only, and that cannot vanish, not even in parts. 
> > Define "ordinary" user memory slots as overlay on top of "encrypted" > memory slots. Inside KVM, bail out if you encounter such a VMA inside a > normal user memory slot. When creating a "encryped" user memory slot, > require that the whole VMA is covered at creation time. You know the VMA > can't change later. Oof. That's quite a requirement. What's the point of the VMA once all this is done?
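Purely for illustration, the userspace side of the memfd_encrypted()/overlay idea might look roughly as follows. memfd_encrypted() and the KVM_MEM_ENCRYPTED flag are hypothetical; KVM_SET_USER_MEMORY_REGION and struct kvm_userspace_memory_region are the existing API, and error handling is omitted:

#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical pieces, for discussion only; neither exists today. */
int memfd_encrypted(unsigned int flags);
#define KVM_MEM_ENCRYPTED       (1UL << 3)

/* Create the encrypted backing, map it exactly once, register it with KVM. */
static void *add_encrypted_slot(int vm_fd, uint32_t slot, uint64_t gpa, uint64_t size)
{
        int fd = memfd_encrypted(0);
        void *hva;

        ftruncate(fd, size);            /* size can change exactly once */
        hva = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        struct kvm_userspace_memory_region region = {
                .slot            = slot,
                .flags           = KVM_MEM_ENCRYPTED,  /* hypothetical */
                .guest_phys_addr = gpa,
                .memory_size     = size,
                .userspace_addr  = (uintptr_t)hva,
        };
        ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

        /* Shared conversions would later overlay ordinary memslots on top. */
        return hva;
}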
On 26.08.21 19:05, Andy Lutomirski wrote: > On 8/26/21 3:15 AM, David Hildenbrand wrote: >> On 24.08.21 02:52, Sean Christopherson wrote: >>> The goal of this RFC is to try and align KVM, mm, and anyone else with >>> skin in the >>> game, on an acceptable direction for supporting guest private memory, >>> e.g. for >>> Intel's TDX. The TDX architectural effectively allows KVM guests to >>> crash the >>> host if guest private memory is accessible to host userspace, and thus >>> does not >>> play nice with KVM's existing approach of pulling the pfn and mapping >>> level from >>> the host page tables. >>> >>> This is by no means a complete patch; it's a rough sketch of the KVM >>> changes that >>> would be needed. The kernel side of things is completely omitted from >>> the patch; >>> the design concept is below. >>> >>> There's also fair bit of hand waving on implementation details that >>> shouldn't >>> fundamentally change the overall ABI, e.g. how the backing store will >>> ensure >>> there are no mappings when "converting" to guest private. >>> >> >> This is a lot of complexity and rather advanced approaches (not saying >> they are bad, just that we try to teach the whole stack something >> completely new). >> >> >> What I think would really help is a list of requirements, such that >> everybody is aware of what we actually want to achieve. Let me start: >> >> GFN: Guest Frame Number >> EPFN: Encrypted Physical Frame Number >> >> >> 1) An EPFN must not get mapped into more than one VM: it belongs exactly >> to one VM. It must neither be shared between VMs between processes nor >> between VMs within a processes. >> >> >> 2) User space (well, and actually the kernel) must never access an EPFN: >> >> - If we go for an fd, essentially all operations (read/write) have to >> fail. >> - If we have to map an EPFN into user space page tables (e.g., to >> simplify KVM), we could only allow fake swap entries such that "there >> is something" but it cannot be accessed and is flagged accordingly. >> - /proc/kcore and friends have to be careful as well and should not read >> this memory. So there has to be a way to flag these pages. >> >> 3) We need a way to express the GFN<->EPFN mapping and essentially >> assign an EPFN to a GFN. >> >> >> 4) Once we assigned a EPFN to a GFN, that assignment must not longer >> change. Further, an EPFN must not get assigned to multiple GFNs. >> >> >> 5) There has to be a way to "replace" encrypted parts by "shared" parts >> and the other way around. >> >> What else? >> >> >> >>> Background >>> ========== >>> >>> This is a loose continuation of Kirill's RFC[*] to support TDX guest >>> private >>> memory by tracking guest memory at the 'struct page' level. This >>> proposal is the >>> result of several offline discussions that were prompted by Andy >>> Lutomirksi's >>> concerns with tracking via 'struct page': >>> >>> 1. The kernel wouldn't easily be able to enforce a 1:1 page:guest >>> association, >>> let alone a 1:1 pfn:gfn mapping. >> >> Well, it could with some help on higher layers. Someone has to do the >> tracking. Marking EPFNs as EPFNs can actually be very helpful, e.g., >> allow /proc/kcore to just not touch such pages. If we want to do all the >> tracking in the struct page is a different story. >> >>> >>> 2. Does not work for memory that isn't backed by 'struct page', >>> e.g. if devices >>> gain support for exposing encrypted memory regions to guests. >> >> Let's keep it simple. If a struct page is right now what we need to >> properly track it, so be it. 
If not, good. But let's not make this a >> requirement right from the start if it's stuff for the far future. >> >>> >>> 3. Does not help march toward page migration or swap support >>> (though it doesn't >>> hurt either). >> >> "Does not help towards world peace, (though it doesn't hurt either)". >> >> Maybe let's ignore that for now, as it doesn't seem to be required to >> get something reasonable running. >> >>> >>> [*] >>> https://lkml.kernel.org/r/20210416154106.23721-1-kirill.shutemov@linux.intel.com >>> >>> >>> Concept >>> ======= >>> >>> Guest private memory must be backed by an "enlightened" file >>> descriptor, where >>> "enlightened" means the implementing subsystem supports a one-way >>> "conversion" to >>> guest private memory and provides bi-directional hooks to communicate >>> directly >>> with KVM. Creating a private fd doesn't necessarily have to be a >>> conversion, e.g. it >>> could also be a flag provided at file creation, a property of the file >>> system itself, >>> etc... >> >> Doesn't sound too crazy. Maybe even introducing memfd_encrypted() if >> extending the other ones turns out too complicated. >> >>> >>> Before a private fd can be mapped into a KVM guest, it must be paired >>> 1:1 with a >>> KVM guest, i.e. multiple guests cannot share a fd. At pairing, KVM >>> and the fd's >>> subsystem exchange a set of function pointers to allow KVM to call >>> into the subsystem, >>> e.g. to translate gfn->pfn, and vice versa to allow the subsystem to >>> call into KVM, >>> e.g. to invalidate/move/swap a gfn range. >>> >>> Mapping a private fd in host userspace is disallowed, i.e. there is >>> never a host >>> virtual address associated with the fd and thus no userspace page >>> tables pointing >>> at the private memory. >> >> To keep the primary vs. secondary MMU thing working, I think it would >> actually be nice to go with special swap entries instead; it just keeps >> most things working as expected. But let's see where we end up. >> >>> >>> Pinning _from KVM_ is not required. If the backing store supports >>> page migration >>> and/or swap, it can query the KVM-provided function pointers to see if >>> KVM supports >>> the operation. If the operation is not supported (this will be the >>> case initially >>> in KVM), the backing store is responsible for ensuring correct >>> functionality. >>> >>> Unmapping guest memory, e.g. to prevent use-after-free, is handled via >>> a callback >>> from the backing store to KVM. KVM will employ techniques similar to >>> those it uses >>> for mmu_notifiers to ensure the guest cannot access freed memory. >>> >>> A key point is that, unlike similar failed proposals of the past, e.g. >>> /dev/mktme, >>> existing backing stores can be englightened, a from-scratch >>> implementations is not >>> required (though would obviously be possible as well). >> >> Right. But if it's just a bad fit, let's do something new. Just like we >> did with memfd_secret. >> >>> >>> One idea for extending existing backing stores, e.g. HugeTLBFS and >>> tmpfs, is >>> to add F_SEAL_GUEST, which would convert the entire file to guest >>> private memory >>> and either fail if the current size is non-zero or truncate the size >>> to zero. >> >> While possible, I actually do have the feeling that we want eventually >> to have something new, as the semantics are just too different. But >> let's see. >> >> >>> KVM >>> === >>> >>> Guest private memory is managed as a new address space, i.e. 
as a >>> different set of >>> memslots, similar to how KVM has a separate memory view for when a >>> guest vCPU is >>> executing in virtual SMM. SMM is mutually exclusive with guest >>> private memory. >>> >>> The fd (the actual integer) is provided to KVM when a private memslot >>> is added >>> via KVM_SET_USER_MEMORY_REGION. This is when the aforementioned >>> pairing occurs. >>> >>> By default, KVM memslot lookups will be "shared", only specific >>> touchpoints will >>> be modified to work with private memslots, e.g. guest page faults. >>> All host >>> accesses to guest memory, e.g. for emulation, will thus look for >>> shared memory >>> and naturally fail without attempting copy_to/from_user() if the guest >>> attempts >>> to coerce KVM into access private memory. Note, avoiding >>> copy_to/from_user() and >>> friends isn't strictly necessary, it's more of a happy side effect. >>> >>> A new KVM exit reason, e.g. KVM_EXIT_MEMORY_ERROR, and data struct in >>> vcpu->run >>> is added to propagate illegal accesses (see above) and implicit >>> conversions >>> to userspace (see below). Note, the new exit reason + struct can also >>> be to >>> support several other feature requests in KVM[1][2]. >>> >>> The guest may explicitly or implicity request KVM to map a >>> shared/private variant >>> of a GFN. An explicit map request is done via hypercall (out of scope >>> for this >>> proposal as both TDX and SNP ABIs define such a hypercall). An >>> implicit map request >>> is triggered simply by the guest accessing the shared/private variant, >>> which KVM >>> sees as a guest page fault (EPT violation or #NPF). Ideally only >>> explicit requests >>> would be supported, but neither TDX nor SNP require this in their >>> guest<->host ABIs. >>> >>> For implicit or explicit mappings, if a memslot is found that fully >>> covers the >>> requested range (which is a single gfn for implicit mappings), KVM's >>> normal guest >>> page fault handling works with minimal modification. >>> >>> If a memslot is not found, for explicit mappings, KVM will exit to >>> userspace with >>> the aforementioned dedicated exit reason. For implict _private_ >>> mappings, KVM will >>> also immediately exit with the same dedicated reason. For implicit >>> shared mappings, >>> an additional check is required to differentiate between emulated MMIO >>> and an >>> implicit private->shared conversion[*]. If there is an existing >>> private memslot >>> for the gfn, KVM will exit to userspace, otherwise KVM will treat the >>> access as an >>> emulated MMIO access and handle the page fault accordingly. >> >> Do you mean some kind of overlay. "Ordinary" user memory regions overlay >> "private user memory regions"? So when marking something shared, you'd >> leave the private user memory region alone and only create a new >> "ordinary"user memory regions that references shared memory in QEMU >> (IOW, a different mapping)? >> >> Reading below, I think you were not actually thinking about an overlay, >> but maybe overlays might actually be a nice concept to have instead. >> >> >>> Punching Holes >>> ============== >>> >>> The expected userspace memory model is that mapping requests will be >>> handled as >>> conversions, e.g. on a shared mapping request, first unmap the private >>> gfn range, >>> then map the shared gfn range. A new KVM ioctl() will likely be >>> needed to allow >>> userspace to punch a hole in a memslot, as expressing such an >>> operation isn't >>> possible with KVM_SET_USER_MEMORY_REGION. 
While userspace could >>> delete the >>> memslot, then recreate three new memslots, doing so would be >>> destructive to guest >>> data as unmapping guest private memory (from the EPT/NPT tables) is >>> destructive >>> to the data for both TDX and SEV-SNP guests. >> >> If you'd treat it like an overlay, you'd not actually be punching holes. >> You'd only be creating/removing ordinary user memory regions when >> marking something shared/unshared. >> >>> >>> Pros (vs. struct page) >>> ====================== >>> >>> Easy to enforce 1:1 fd:guest pairing, as well as 1:1 gfn:pfn mapping. >>> >>> Userspace page tables are not populated, e.g. reduced memory >>> footprint, lower >>> probability of making private memory accessible to userspace. >> >> Agreed to the first part, although I consider that a secondary concern. >> The second part, I'm not sure if that is really the case. Fake swap >> entries are just a marker. >> >>> >>> Provides line of sight to supporting page migration and swap. >> >> Again, let's leave that out for now. I think that's an kernel internal >> that will require quite some thought either way. >> >>> >>> Provides line of sight to mapping MMIO pages into guest private memory. >> >> That's an interesting thought. Would it work via overlays as well? Can >> you elaborate? >> >>> >>> Cons (vs. struct page) >>> ====================== >>> >>> Significantly more churn in KVM, e.g. to plumb 'private' through where >>> needed, >>> support memslot hole punching, etc... >>> >>> KVM's MMU gets another method of retrieving host pfn and page size. >>> >>> Requires enabling in every backing store that someone wants to support. >> >> I think we will only care about anonymous memory eventually with >> huge/gigantic pages in the next years. Just to what memfd() is already >> limited. File-backed -- I don't know ... if at all, swapping ... in a >> couple of years ... >> >>> >>> Because the NUMA APIs work on virtual addresses, new syscalls >>> fmove_pages(), >>> fbind(), etc... would be required to provide equivalents to existing NUMA >>> functionality (though those syscalls would likely be useful >>> irrespective of guest >>> private memory). >> >> Right, that's because we don't have a VMA that describes all this. E.g., >> mbind(). >> >>> >>> Washes (vs. struct page) >>> ======================== >>> >>> A misbehaving guest that triggers a large number of shared memory >>> mappings will >>> consume a large number of memslots. But, this is likely a wash as >>> similar effect >>> would happen with VMAs in the struct page approach. >> >> Just cap it then to something sane. 32k which we have right now is crazy >> and only required in very special setups. You can just make QEMU >> override/set the KVM default. >> >> >> >> >> >> My wild idea after reading everything so far (full of flaws, just want >> to mention it, maybe it gives some ideas): >> >> Introduce memfd_encrypted(). >> >> Similar to like memfd_secret() >> - Most system calls will just fail. >> - Allow MAP_SHARED only. >> - Enforce VM_DONTDUMP and skip during fork(). > > This seems like it would work, but integrating it with the hugetlb > reserve mechanism might be rather messy. One step at a time. > >> - File size can change exactly once, before any mmap. (IIRC) > > Why is this needed? Obviously if the file size can be reduced, then the > pages need to get removed safely, but this seems doable if there's a use > case. Right, but we usually don't resize memfd either. 
> >> >> Different to memfd_secret(), allow mapping each page of the fd exactly >> one time via mmap() into a single process. > > This doesn't solve the case of multiple memslots pointing at the same > address. It also doesn't help with future operations that need to map > from a memfd_encrypted() backing page to the GPA that maps it. That's trivial to enforce inside KVM when mapping. Encryted user memory regions, just like such VMAs just have to be sticky until we can come up with something different. > >> You'll end up with a VMA that corresponds to the whole file in a single >> process only, and that cannot vanish, not even in parts. >> >> Define "ordinary" user memory slots as overlay on top of "encrypted" >> memory slots. Inside KVM, bail out if you encounter such a VMA inside a >> normal user memory slot. When creating a "encryped" user memory slot, >> require that the whole VMA is covered at creation time. You know the VMA >> can't change later. > > Oof. That's quite a requirement. What's the point of the VMA once all > this is done? You can keep using things like mbind(), madvise(), ... and the GUP code with a special flag might mostly just do what you want. You won't have to reinvent too many wheels on the page fault logic side at least. Just a brain dump. Feel free to refine if you think any of this makes sense.
On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote: > On 24.08.21 02:52, Sean Christopherson wrote: > > The goal of this RFC is to try and align KVM, mm, and anyone else with skin in the > > game, on an acceptable direction for supporting guest private memory, e.g. for > > Intel's TDX. The TDX architectural effectively allows KVM guests to crash the > > host if guest private memory is accessible to host userspace, and thus does not > > play nice with KVM's existing approach of pulling the pfn and mapping level from > > the host page tables. > > > > This is by no means a complete patch; it's a rough sketch of the KVM changes that > > would be needed. The kernel side of things is completely omitted from the patch; > > the design concept is below. > > > > There's also fair bit of hand waving on implementation details that shouldn't > > fundamentally change the overall ABI, e.g. how the backing store will ensure > > there are no mappings when "converting" to guest private. > > > > This is a lot of complexity and rather advanced approaches (not saying they > are bad, just that we try to teach the whole stack something completely > new). > > > What I think would really help is a list of requirements, such that > everybody is aware of what we actually want to achieve. Let me start: > > GFN: Guest Frame Number > EPFN: Encrypted Physical Frame Number > > > 1) An EPFN must not get mapped into more than one VM: it belongs exactly to > one VM. It must neither be shared between VMs between processes nor between > VMs within a processes. > > > 2) User space (well, and actually the kernel) must never access an EPFN: > > - If we go for an fd, essentially all operations (read/write) have to > fail. > - If we have to map an EPFN into user space page tables (e.g., to > simplify KVM), we could only allow fake swap entries such that "there > is something" but it cannot be accessed and is flagged accordingly. > - /proc/kcore and friends have to be careful as well and should not read > this memory. So there has to be a way to flag these pages. > > 3) We need a way to express the GFN<->EPFN mapping and essentially assign an > EPFN to a GFN. > > > 4) Once we assigned a EPFN to a GFN, that assignment must not longer change. > Further, an EPFN must not get assigned to multiple GFNs. > > > 5) There has to be a way to "replace" encrypted parts by "shared" parts > and the other way around. > > What else? Thanks a lot for this summary. A question about the requirement: do we or do we not have plan to support assigned device to the protected VM? If yes. The fd based solution may need change the VFIO interface as well( though the fake swap entry solution need mess with VFIO too). Because: 1> KVM uses VFIO when assigning devices into a VM. 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all guest pages will have to be mapped in host IOMMU page table to host pages, which are pinned during the whole life cycle fo the VM. 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver, in vfio_dma_do_map(). 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA and pin the page. But if we are using fd based solution, not every GPA can have a HVA, thus the current VFIO interface to map and pin the GPA(IOVA) wont work. And I doubt if VFIO can be modified to support this easily. B.R. Yu
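For reference, the existing VFIO type1 interface referred to above is keyed on a host virtual address; vfio_dma_do_map() GUPs and pins the pages behind 'vaddr', which is exactly what an fd-only private region would not have. A simplified userspace view (error handling omitted):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int map_guest_ram_for_dma(int container_fd, void *hva,
                                 uint64_t iova, uint64_t size)
{
        struct vfio_iommu_type1_dma_map dma_map = {
                .argsz = sizeof(dma_map),
                .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                .vaddr = (uintptr_t)hva,  /* GUP + pin happens on this HVA */
                .iova  = iova,            /* guest physical address */
                .size  = size,
        };
        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
}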
On Thu, Aug 26, 2021, at 2:26 PM, David Hildenbrand wrote: > On 26.08.21 19:05, Andy Lutomirski wrote: > > Oof. That's quite a requirement. What's the point of the VMA once all > > this is done? > > You can keep using things like mbind(), madvise(), ... and the GUP code > with a special flag might mostly just do what you want. You won't have > to reinvent too many wheels on the page fault logic side at least. > You can keep calling the functions. The implementations working is a different story: you can't just unmap (pte_numa-style or otherwise) a private guest page to quiesce it, move it with memcpy(), and then fault it back in. In any event, adding fd-based NUMA APIs would be quite nice. Look at the numactl command.
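As a strawman for the fd-based NUMA APIs mentioned above, an fbind() could simply mirror mbind() with an (fd, offset) pair replacing the virtual address. The prototype below is hypothetical and does not exist today:

#include <stddef.h>
#include <sys/types.h>

/*
 * Existing, VA-based (from <numaif.h>):
 *   long mbind(void *addr, unsigned long len, int mode,
 *              const unsigned long *nodemask, unsigned long maxnode,
 *              unsigned int flags);
 *
 * Hypothetical fd-based equivalent for backing stores that are never
 * mapped into host userspace; the (fd, offset) pair replaces 'addr':
 */
long fbind(int fd, off_t offset, size_t len, int mode,
           const unsigned long *nodemask, unsigned long maxnode,
           unsigned int flags);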
On Thu, Aug 26, 2021, David Hildenbrand wrote: > You'll end up with a VMA that corresponds to the whole file in a single > process only, and that cannot vanish, not even in parts. How would userspace tell the kernel to free parts of memory that it doesn't want assigned to the guest, e.g. to free memory that the guest has converted to not-private? > Define "ordinary" user memory slots as overlay on top of "encrypted" memory > slots. Inside KVM, bail out if you encounter such a VMA inside a normal > user memory slot. When creating a "encryped" user memory slot, require that > the whole VMA is covered at creation time. You know the VMA can't change > later. This can work for the basic use cases, but even then I'd strongly prefer not to tie memslot correctness to the VMAs. KVM doesn't truly care what lies behind the virtual address of a memslot, and when it does care, it tends to do poorly, e.g. see the whole PFNMAP snafu. KVM cares about the pfn<->gfn mappings, and that's reflected in the infrastructure. E.g. KVM relies on the mmu_notifiers to handle mprotect()/munmap()/etc... As is, I don't think KVM would get any kind of notification if userpaces unmaps the VMA for a private memslot that does not have any entries in the host page tables. I'm sure it's a solvable problem, e.g. by ensuring at least one page is touched by the backing store, but I don't think the end result would be any prettier than a dedicated API for KVM to consume. Relying on VMAs, and thus the mmu_notifiers, also doesn't provide line of sight to page migration or swap. For those types of operations, KVM currently just reacts to invalidation notifications by zapping guest PTEs, and then gets the new pfn when the guest re-faults on the page. That sequence doesn't work for TDX or SEV-SNP because the trusteday agent needs to do the memcpy() of the page contents, i.e. the host needs to call into KVM for the actual migration. There's also the memory footprint side of things; the fd-based approach avoids having to create host page tables for memory that by definition will never be used by the host.
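For background, the zap-and-refault sequence referred to above is roughly the following. This is a heavily simplified illustration of the pattern, not actual KVM code, and the helper names marked "illustrative" are made up:

/*
 * 1) mm notifies KVM (via mmu_notifier) that an HVA range is changing: KVM
 *    only zaps its SPTEs, it does not copy or move anything itself.
 */
static void on_invalidate(struct kvm *kvm, unsigned long hva_start,
                          unsigned long hva_end)
{
        zap_sptes_for_hva_range(kvm, hva_start, hva_end);       /* illustrative */
}

/*
 * 2) The guest faults on the zapped range and KVM simply resolves the (new)
 *    pfn from the host page tables.  For TDX/SNP this is not enough, because
 *    only the trusted agent can copy the encrypted contents to the new pfn.
 */
static kvm_pfn_t on_refault(struct kvm *kvm, gfn_t gfn)
{
        unsigned long hva = gfn_to_hva(kvm, gfn);               /* existing helper */
        return pfn_from_hva(hva);                               /* illustrative */
}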
On Fri, Aug 27, 2021, Andy Lutomirski wrote: > > On Thu, Aug 26, 2021, at 2:26 PM, David Hildenbrand wrote: > > On 26.08.21 19:05, Andy Lutomirski wrote: > > > > Oof. That's quite a requirement. What's the point of the VMA once all > > > this is done? > > > > You can keep using things like mbind(), madvise(), ... and the GUP code > > with a special flag might mostly just do what you want. You won't have > > to reinvent too many wheels on the page fault logic side at least. Ya, Kirill's RFC more or less proved a special GUP flag would indeed Just Work. However, the KVM page fault side of things would require only a handful of small changes to send private memslots down a different path. Compared to the rest of the enabling, it's quite minor. The counter to that is other KVM architectures would need to learn how to use the new APIs, though I suspect that there will be a fair bit of arch enabling regardless of what route we take. > You can keep calling the functions. The implementations working is a > different story: you can't just unmap (pte_numa-style or otherwise) a private > guest page to quiesce it, move it with memcpy(), and then fault it back in. Ya, I brought this up in my earlier reply. Even the initial implementation (without real NUMA support) would likely be painful, e.g. the KVM TDX RFC/PoC adds dedicated logic in KVM to handle the case where NUMA balancing zaps a _pinned_ page and then KVM fault in the same pfn. It's not thaaat ugly, but it's arguably more invasive to KVM's page fault flows than a new fd-based private memslot scheme.
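A minimal illustration of the "special GUP flag" approach, in the spirit of the PageGuest()/FOLL_GUEST markers from Kirill's RFC (simplified, not the actual patch; the helper below is made up):

static bool gup_may_return_page(struct page *page, unsigned int gup_flags)
{
        /*
         * Guest-private pages are only handed out to callers that explicitly
         * say they are resolving the pfn on behalf of KVM.
         */
        if (PageGuest(page))
                return gup_flags & FOLL_GUEST;
        return true;
}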
On 28.08.21 00:18, Sean Christopherson wrote: > On Thu, Aug 26, 2021, David Hildenbrand wrote: >> You'll end up with a VMA that corresponds to the whole file in a single >> process only, and that cannot vanish, not even in parts. > > How would userspace tell the kernel to free parts of memory that it doesn't want > assigned to the guest, e.g. to free memory that the guest has converted to > not-private? I'd guess one possibility could be fallocate(FALLOC_FL_PUNCH_HOLE). Questions are: when would it actually be allowed to perform such a destructive operation? Do we have to protect from that? How would KVM protect from user space replacing private pages by shared pages in any of the models we discuss? > >> Define "ordinary" user memory slots as overlay on top of "encrypted" memory >> slots. Inside KVM, bail out if you encounter such a VMA inside a normal >> user memory slot. When creating a "encryped" user memory slot, require that >> the whole VMA is covered at creation time. You know the VMA can't change >> later. > > This can work for the basic use cases, but even then I'd strongly prefer not to > tie memslot correctness to the VMAs. KVM doesn't truly care what lies behind > the virtual address of a memslot, and when it does care, it tends to do poorly, > e.g. see the whole PFNMAP snafu. KVM cares about the pfn<->gfn mappings, and > that's reflected in the infrastructure. E.g. KVM relies on the mmu_notifiers > to handle mprotect()/munmap()/etc... Right, and for the existing use cases this worked. But encrypted memory breaks many assumptions we once made ... I have somewhat mixed feelings about pages that are mapped into $WHATEVER page tables but not actually mapped into user space page tables. There is no way to reach these via the rmap. We have something like that already via vfio. And that is fundamentally broken when it comes to mmu notifiers, page pinning, page migration, ... > > As is, I don't think KVM would get any kind of notification if userpaces unmaps > the VMA for a private memslot that does not have any entries in the host page > tables. I'm sure it's a solvable problem, e.g. by ensuring at least one page > is touched by the backing store, but I don't think the end result would be any > prettier than a dedicated API for KVM to consume. > > Relying on VMAs, and thus the mmu_notifiers, also doesn't provide line of sight > to page migration or swap. For those types of operations, KVM currently just > reacts to invalidation notifications by zapping guest PTEs, and then gets the > new pfn when the guest re-faults on the page. That sequence doesn't work for > TDX or SEV-SNP because the trusteday agent needs to do the memcpy() of the page > contents, i.e. the host needs to call into KVM for the actual migration. Right, but I still think this is a kernel internal. You can do such handshake later in the kernel IMHO. But I also already thought: is it really KVM that is to perform the migration or is it the fd-provider that performs the migration? Who says memfd_encrypted() doesn't default to a TDX "backend" on Intel CPUs that just knows how to migrate such a page? I'd love to have some details on how that's supposed to work, and which information we'd need to migrate/swap/... in addition to the EPFN and a new SPFN. > > There's also the memory footprint side of things; the fd-based approach avoids > having to create host page tables for memory that by definition will never be > used by the host. While that is true, that is not a compelling argument IMHO. 
No need to try to be better than the state of the art if just sticking with the state of the art results in something cleaner/better*. Just like we don't have special interfaces to map $WHATEVER into a guest while bypassing user space page tables. * to be shown what actually is cleaner/better. We don't really have prototypes for either.
On 27.08.21 04:31, Yu Zhang wrote: > On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote: >> On 24.08.21 02:52, Sean Christopherson wrote: >>> The goal of this RFC is to try and align KVM, mm, and anyone else with skin in the >>> game, on an acceptable direction for supporting guest private memory, e.g. for >>> Intel's TDX. The TDX architectural effectively allows KVM guests to crash the >>> host if guest private memory is accessible to host userspace, and thus does not >>> play nice with KVM's existing approach of pulling the pfn and mapping level from >>> the host page tables. >>> >>> This is by no means a complete patch; it's a rough sketch of the KVM changes that >>> would be needed. The kernel side of things is completely omitted from the patch; >>> the design concept is below. >>> >>> There's also fair bit of hand waving on implementation details that shouldn't >>> fundamentally change the overall ABI, e.g. how the backing store will ensure >>> there are no mappings when "converting" to guest private. >>> >> >> This is a lot of complexity and rather advanced approaches (not saying they >> are bad, just that we try to teach the whole stack something completely >> new). >> >> >> What I think would really help is a list of requirements, such that >> everybody is aware of what we actually want to achieve. Let me start: >> >> GFN: Guest Frame Number >> EPFN: Encrypted Physical Frame Number >> >> >> 1) An EPFN must not get mapped into more than one VM: it belongs exactly to >> one VM. It must neither be shared between VMs between processes nor between >> VMs within a processes. >> >> >> 2) User space (well, and actually the kernel) must never access an EPFN: >> >> - If we go for an fd, essentially all operations (read/write) have to >> fail. >> - If we have to map an EPFN into user space page tables (e.g., to >> simplify KVM), we could only allow fake swap entries such that "there >> is something" but it cannot be accessed and is flagged accordingly. >> - /proc/kcore and friends have to be careful as well and should not read >> this memory. So there has to be a way to flag these pages. >> >> 3) We need a way to express the GFN<->EPFN mapping and essentially assign an >> EPFN to a GFN. >> >> >> 4) Once we assigned a EPFN to a GFN, that assignment must not longer change. >> Further, an EPFN must not get assigned to multiple GFNs. >> >> >> 5) There has to be a way to "replace" encrypted parts by "shared" parts >> and the other way around. >> >> What else? > > Thanks a lot for this summary. A question about the requirement: do we or > do we not have plan to support assigned device to the protected VM? Good question, I assume that is stuff for the far far future. > > If yes. The fd based solution may need change the VFIO interface as well( > though the fake swap entry solution need mess with VFIO too). Because: > > 1> KVM uses VFIO when assigning devices into a VM. > > 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all > guest pages will have to be mapped in host IOMMU page table to host pages, > which are pinned during the whole life cycle fo the VM. > > 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver, > in vfio_dma_do_map(). > > 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA > and pin the page. > > But if we are using fd based solution, not every GPA can have a HVA, thus > the current VFIO interface to map and pin the GPA(IOVA) wont work. And I > doubt if VFIO can be modified to support this easily. 
I fully agree. Maybe Intel folks have some idea what that's supposed to look like in the future.
On 28.08.21 00:28, Sean Christopherson wrote: > On Fri, Aug 27, 2021, Andy Lutomirski wrote: >> >> On Thu, Aug 26, 2021, at 2:26 PM, David Hildenbrand wrote: >>> On 26.08.21 19:05, Andy Lutomirski wrote: >> >>>> Oof. That's quite a requirement. What's the point of the VMA once all >>>> this is done? >>> >>> You can keep using things like mbind(), madvise(), ... and the GUP code >>> with a special flag might mostly just do what you want. You won't have >>> to reinvent too many wheels on the page fault logic side at least. > > Ya, Kirill's RFC more or less proved a special GUP flag would indeed Just Work. > However, the KVM page fault side of things would require only a handful of small > changes to send private memslots down a different path. Compared to the rest of > the enabling, it's quite minor. > > The counter to that is other KVM architectures would need to learn how to use the > new APIs, though I suspect that there will be a fair bit of arch enabling regardless > of what route we take. > >> You can keep calling the functions. The implementations working is a >> different story: you can't just unmap (pte_numa-style or otherwise) a private >> guest page to quiesce it, move it with memcpy(), and then fault it back in. > > Ya, I brought this up in my earlier reply. Even the initial implementation (without > real NUMA support) would likely be painful, e.g. the KVM TDX RFC/PoC adds dedicated > logic in KVM to handle the case where NUMA balancing zaps a _pinned_ page and then > KVM fault in the same pfn. It's not thaaat ugly, but it's arguably more invasive > to KVM's page fault flows than a new fd-based private memslot scheme. I might have a different mindset, but less code churn doesn't necessarily translate to "better approach". I'm certainly not pushing for what I proposed (it's a rough, broken sketch). I'm much rather trying to come up with alternatives that try solving the same issue, handling the identified requirements. I have a gut feeling that the list of requirements might not be complete yet. For example, I wonder if we have to protect against user space replacing private pages by shared pages or punishing random holes into the encrypted memory fd.
>> Thanks a lot for this summary. A question about the requirement: do >> we or >> do we not have plan to support assigned device to the protected VM? > > Good question, I assume that is stuff for the far far future. It is in principle possible with the current TDX, but not secure. But someone might decide to do it. So it would be good to have basic support at least. -Andi > >
On 31.08.21 22:01, Andi Kleen wrote: > >>> Thanks a lot for this summary. A question about the requirement: do >>> we or >>> do we not have plan to support assigned device to the protected VM? >> >> Good question, I assume that is stuff for the far far future. > > It is in principle possible with the current TDX, but not secure. But > someone might decide to do it. So it would be good to have basic support > at least. Can you elaborate on the "not secure" part? Do you mean, making the device only access "shared" memory, not secure/encrypted/whatsoever?
On 8/31/2021 1:15 PM, David Hildenbrand wrote: > On 31.08.21 22:01, Andi Kleen wrote: >> >>>> Thanks a lot for this summary. A question about the requirement: do >>>> we or >>>> do we not have plan to support assigned device to the protected VM? >>> >>> Good question, I assume that is stuff for the far far future. >> >> It is in principle possible with the current TDX, but not secure. But >> someone might decide to do it. So it would be good to have basic support >> at least. > > Can you elaborate the "not secure" part? Do you mean, making the > device only access "shared" memory, not secure/encrypted/whatsoever? Yes that's right. It can only access shared areas. -Andi
On Tue, Aug 31, 2021, David Hildenbrand wrote: > On 28.08.21 00:28, Sean Christopherson wrote: > > On Fri, Aug 27, 2021, Andy Lutomirski wrote: > > > > > > On Thu, Aug 26, 2021, at 2:26 PM, David Hildenbrand wrote: > > > > On 26.08.21 19:05, Andy Lutomirski wrote: > > > > > > > > Oof. That's quite a requirement. What's the point of the VMA once all > > > > > this is done? > > > > > > > > You can keep using things like mbind(), madvise(), ... and the GUP code > > > > with a special flag might mostly just do what you want. You won't have > > > > to reinvent too many wheels on the page fault logic side at least. > > > > Ya, Kirill's RFC more or less proved a special GUP flag would indeed Just Work. > > However, the KVM page fault side of things would require only a handful of small > > changes to send private memslots down a different path. Compared to the rest of > > the enabling, it's quite minor. > > > > The counter to that is other KVM architectures would need to learn how to use the > > new APIs, though I suspect that there will be a fair bit of arch enabling regardless > > of what route we take. > > > > > You can keep calling the functions. The implementations working is a > > > different story: you can't just unmap (pte_numa-style or otherwise) a private > > > guest page to quiesce it, move it with memcpy(), and then fault it back in. > > > > Ya, I brought this up in my earlier reply. Even the initial implementation (without > > real NUMA support) would likely be painful, e.g. the KVM TDX RFC/PoC adds dedicated > > logic in KVM to handle the case where NUMA balancing zaps a _pinned_ page and then > > KVM fault in the same pfn. It's not thaaat ugly, but it's arguably more invasive > > to KVM's page fault flows than a new fd-based private memslot scheme. > > I might have a different mindset, but less code churn doesn't necessarily > translate to "better approach". I wasn't referring to code churn. By "invasive" I mean number of touchpoints in KVM as well as the nature of the touchpoints. E.g. poking into how KVM uses available bits in its shadow PTEs and adding multiple checks through KVM's page fault handler, versus two callbacks to get the PFN and page size. > I'm certainly not pushing for what I proposed (it's a rough, broken sketch). > I'm much rather trying to come up with alternatives that try solving the > same issue, handling the identified requirements. > > I have a gut feeling that the list of requirements might not be complete > yet. For example, I wonder if we have to protect against user space > replacing private pages by shared pages or punishing random holes into the > encrypted memory fd. Replacing a private page with a shared page for a given GFN is very much a requirement as it's expected behavior for all VMM+guests when converting guest memory between shared and private. Punching holes is a sort of optional requirement. It's a "requirement" in that it's allowed if the backing store supports such a behavior, optional in that support wouldn't be strictly necessary and/or could come with constraints. The expected use case is that host userspace would punch a hole to free unreachable private memory, e.g. after the corresponding GFN(s) is converted to shared, so that it doesn't consume 2x memory for the guest.
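A rough sketch of the userspace side of such a private -> shared conversion, assuming the private fd supports hole punching; steps 1 and 3 are the new/undetermined parts of the proposal, only the fallocate() call in step 2 is existing API:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

static void convert_to_shared(int private_fd, off_t offset, off_t len)
{
        /* 1. New KVM ioctl (TBD): punch a hole in / shrink the private memslot. */

        /* 2. Free the now-unreachable private pages so memory isn't consumed twice. */
        fallocate(private_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, len);

        /* 3. Install or extend an ordinary shared memslot covering the same gfns. */
}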
On Tue, Aug 31, 2021, David Hildenbrand wrote: > On 28.08.21 00:18, Sean Christopherson wrote: > > On Thu, Aug 26, 2021, David Hildenbrand wrote: > > > You'll end up with a VMA that corresponds to the whole file in a single > > > process only, and that cannot vanish, not even in parts. > > > > How would userspace tell the kernel to free parts of memory that it doesn't want > > assigned to the guest, e.g. to free memory that the guest has converted to > > not-private? > > I'd guess one possibility could be fallocate(FALLOC_FL_PUNCH_HOLE). > > Questions are: when would it actually be allowed to perform such a > destructive operation? From the kernel's perspective, userspace is allowed to perform destructive operations at will. It is ultimately the userspace VMM's responsibility to not DoS the guest. > Do we have to protect from that? How would KVM protect from user space > replacing private pages by shared pages in any of the models we discuss? The overarching rule is that KVM needs to guarantee a given pfn is never mapped[*] as both private and shared, where "shared" also incorporates any mapping from the host. Essentially it boils down to the kernel ensuring that a pfn is unmapped before it's converted to/from private, and KVM ensuring that it honors any unmap notifications from the kernel, e.g. via mmu_notifier or via a direct callback as proposed in this RFC. As it pertains to PUNCH_HOLE, the responsibilities are no different than when the backing-store is destroyed; the backing-store needs to notify downstream MMUs (a.k.a. KVM) to unmap the pfn(s) before freeing the associated memory. [*] Whether or not the kernel's direct mapping needs to be removed is debatable, but my argument is that that behavior is not visible to userspace and thus out of scope for this discussion, e.g. zapping/restoring the direct map can be added/removed without impacting the userspace ABI. > > > Define "ordinary" user memory slots as overlay on top of "encrypted" memory > > > slots. Inside KVM, bail out if you encounter such a VMA inside a normal > > > user memory slot. When creating a "encryped" user memory slot, require that > > > the whole VMA is covered at creation time. You know the VMA can't change > > > later. > > > > This can work for the basic use cases, but even then I'd strongly prefer not to > > tie memslot correctness to the VMAs. KVM doesn't truly care what lies behind > > the virtual address of a memslot, and when it does care, it tends to do poorly, > > e.g. see the whole PFNMAP snafu. KVM cares about the pfn<->gfn mappings, and > > that's reflected in the infrastructure. E.g. KVM relies on the mmu_notifiers > > to handle mprotect()/munmap()/etc... > > Right, and for the existing use cases this worked. But encrypted memory > breaks many assumptions we once made ... > > I have somewhat mixed feelings about pages that are mapped into $WHATEVER > page tables but not actually mapped into user space page tables. There is no > way to reach these via the rmap. > > We have something like that already via vfio. And that is fundamentally > broken when it comes to mmu notifiers, page pinning, page migration, ... I'm not super familiar with VFIO internals, but the idea with the fd-based approach is that the backing-store would be in direct communication with KVM and would handle those operations through that direct channel. > > As is, I don't think KVM would get any kind of notification if userpaces unmaps > > the VMA for a private memslot that does not have any entries in the host page > > tables. 
I'm sure it's a solvable problem, e.g. by ensuring at least one page > > is touched by the backing store, but I don't think the end result would be any > > prettier than a dedicated API for KVM to consume. > > > > Relying on VMAs, and thus the mmu_notifiers, also doesn't provide line of sight > > to page migration or swap. For those types of operations, KVM currently just > > reacts to invalidation notifications by zapping guest PTEs, and then gets the > > new pfn when the guest re-faults on the page. That sequence doesn't work for > > TDX or SEV-SNP because the trusteday agent needs to do the memcpy() of the page > > contents, i.e. the host needs to call into KVM for the actual migration. > > Right, but I still think this is a kernel internal. You can do such > handshake later in the kernel IMHO. It is kernel internal, but AFAICT it will be ugly because KVM "needs" to do the migration and that would invert the mmu_notifier API, e.g. instead of "telling" secondary MMUs to invalidate/change mappings, the mm would be "asking" secondary MMUs "can you move this?". More below. > But I also already thought: is it really KVM that is to perform the > migration or is it the fd-provider that performs the migration? Who says > memfd_encrypted() doesn't default to a TDX "backend" on Intel CPUs that just > knows how to migrate such a page? > > I'd love to have some details on how that's supposed to work, and which > information we'd need to migrate/swap/... in addition to the EPFN and a new > SPFN. KVM "needs" to do the migration. On TDX, the migration will be a SEAMCALL, a post-VMXON instruction that transfers control to the TDX-Module, which at minimum needs a per-VM identifier, the gfn, and the page table level. The call into the TDX-Module would also need to take a KVM lock (probably KVM's mmu_lock) to satisfy TDX's concurrency requirement, e.g. to avoid "spurious" errors due to the backing-store attempting to migrate memory that KVM is unmapping due to a memslot change. The per-VM identifier may not apply to SEV-SNP, but I believe everything else holds true.
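To make the shape of that interface a bit more concrete, a sketch of the two directions of callbacks being discussed might look roughly like the following; every name and signature here is invented for illustration, nothing of the sort exists in the kernel today:

/*
 * Illustrative only: invented names that mirror the callbacks discussed
 * above -- pfn/page-size lookup in one direction, invalidation and
 * migration requests in the other.
 */
struct private_memory_backing_ops {
	/* KVM -> backing store: resolve a file offset to a pfn and page size. */
	int (*get_pfn)(struct file *file, pgoff_t offset,
		       unsigned long *pfn, int *order);
	void (*put_pfn)(struct file *file, pgoff_t offset);
};

struct private_memory_kvm_ops {
	/* Backing store -> KVM: unmap before the range is freed or converted. */
	void (*invalidate_range)(struct kvm *kvm, pgoff_t start, pgoff_t end);
	/*
	 * Backing store -> KVM: ask KVM to migrate a private page; on TDX
	 * this would end up in the SEAMCALL that needs the VM, gfn and level.
	 */
	int (*migrate_page)(struct kvm *kvm, gfn_t gfn, unsigned long dst_pfn,
			    int level);
};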
On Tue, Aug 31, 2021 at 01:39:31PM -0700, Andi Kleen wrote: > > On 8/31/2021 1:15 PM, David Hildenbrand wrote: > > On 31.08.21 22:01, Andi Kleen wrote: > > > > > > > > Thanks a lot for this summary. A question about the requirement: do > > > > > we or > > > > > do we not have plan to support assigned device to the protected VM? > > > > > > > > Good question, I assume that is stuff for the far far future. > > > > > > It is in principle possible with the current TDX, but not secure. But > > > someone might decide to do it. So it would be good to have basic support > > > at least. > > > > Can you elaborate the "not secure" part? Do you mean, making the device > > only access "shared" memory, not secure/encrypted/whatsoever? > > > Yes that's right. It can only access shared areas. Thanks, Andy & David. Actually, enabling device assignment needs quite some effort, e.g., to guarantee only shared pages are mapped in the IOMMU page table (using shared GFNs). And the buffer copying inside the TD is still unavoidable, so there is not much performance benefit. Maybe we should just *disable* VFIO devices in TDX first. As to the fd-based private memory, eventually we will have to tolerate its impact on any place where GUP is needed in virtualization. :) B.R. Yu
On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote: > On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote: > Thanks a lot for this summary. A question about the requirement: do we or > do we not have plan to support assigned device to the protected VM? > > If yes. The fd based solution may need change the VFIO interface as well( > though the fake swap entry solution need mess with VFIO too). Because: > > 1> KVM uses VFIO when assigning devices into a VM. > > 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all > guest pages will have to be mapped in host IOMMU page table to host pages, > which are pinned during the whole life cycle fo the VM. > > 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver, > in vfio_dma_do_map(). > > 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA > and pin the page. > > But if we are using fd based solution, not every GPA can have a HVA, thus > the current VFIO interface to map and pin the GPA(IOVA) wont work. And I > doubt if VFIO can be modified to support this easily. > > Do you mean assigning a normal device to a protected VM or a hypothetical protected-MMIO device? If the former, it should work more or less like with a non-protected VM. mmap the VFIO device, set up a memslot, and use it. I'm not sure whether anyone will actually do this, but it should be possible, at least in principle. Maybe someone will want to assign a NIC to a TDX guest. An NVMe device with the understanding that the guest can't trust it wouldn't be entirely crazy either. If the latter, AFAIK there is no spec for how it would work even in principle. Presumably it wouldn't work quite like VFIO -- instead, the kernel could have a protection-virtual-io-fd mechanism, and that fd could be bound to a memslot in whatever way we settle on for binding secure memory to a memslot.
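For the former case, the flow is just today's VFIO plus memslot plumbing. A rough userspace sketch (error handling, the VFIO container/group setup, and the choice of BAR and slot number are all glossed over or assumed):

#include <linux/kvm.h>
#include <linux/vfio.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* Sketch: mmap a (shared, untrusted) VFIO BAR and expose it to the guest
 * via an ordinary memslot.  Nothing here is specific to protected VMs.
 */
static int map_vfio_bar_into_guest(int vm_fd, int device_fd, uint64_t gpa)
{
	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = VFIO_PCI_BAR0_REGION_INDEX,	/* assumed mmap-able BAR */
	};
	struct kvm_userspace_memory_region region;
	void *hva;

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
		return -1;

	hva = mmap(NULL, info.size, PROT_READ | PROT_WRITE, MAP_SHARED,
		   device_fd, info.offset);
	if (hva == MAP_FAILED)
		return -1;

	region = (struct kvm_userspace_memory_region) {
		.slot = 1,			/* arbitrary free slot */
		.guest_phys_addr = gpa,
		.memory_size = info.size,
		.userspace_addr = (uint64_t)(uintptr_t)hva,
	};
	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}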
On Tue, Aug 31, 2021, at 12:07 PM, David Hildenbrand wrote: > On 28.08.21 00:18, Sean Christopherson wrote: > > On Thu, Aug 26, 2021, David Hildenbrand wrote: > >> You'll end up with a VMA that corresponds to the whole file in a single > >> process only, and that cannot vanish, not even in parts. > > > > How would userspace tell the kernel to free parts of memory that it doesn't want > > assigned to the guest, e.g. to free memory that the guest has converted to > > not-private? > > I'd guess one possibility could be fallocate(FALLOC_FL_PUNCH_HOLE). > > Questions are: when would it actually be allowed to perform such a > destructive operation? Do we have to protect from that? How would KVM > protect from user space replacing private pages by shared pages in any > of the models we discuss? > What do you mean? If userspace maliciously replaces a shared page by a private page, then the guest crashes. (The actual meaning here is a bit different on SNP-ES vs TDX. In SNP-ES, a given GPA can be shared, private, or nonexistent. A guest accesses it with a special bit set in the guest page tables to indicate whether it expects shared or private, and the CPU will produce an appropriate error if the bit doesn't match the page. In TDX, there is actually an entirely separate shared vs private address space, and, in theory, a given "GPA" can exist as shared and as private at once. The full guest n-bit GPA plus the shared/private bit is logically an N+1 bit address, and it's possible to map all of it at once, half shared, and half private. In practice, the defined guest->host APIs don't really support that usage.
> From: Andy Lutomirski <luto@kernel.org> > Sent: Wednesday, September 1, 2021 12:53 PM > > On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote: > > On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote: > > > Thanks a lot for this summary. A question about the requirement: do we or > > do we not have plan to support assigned device to the protected VM? > > > > If yes. The fd based solution may need change the VFIO interface as well( > > though the fake swap entry solution need mess with VFIO too). Because: > > > > 1> KVM uses VFIO when assigning devices into a VM. > > > > 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, > all > > guest pages will have to be mapped in host IOMMU page table to host > pages, > > which are pinned during the whole life cycle fo the VM. > > > > 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU > driver, > > in vfio_dma_do_map(). > > > > 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get > the HPA > > and pin the page. > > > > But if we are using fd based solution, not every GPA can have a HVA, thus > > the current VFIO interface to map and pin the GPA(IOVA) wont work. And I > > doubt if VFIO can be modified to support this easily. > > > > > > Do you mean assigning a normal device to a protected VM or a hypothetical > protected-MMIO device? > > If the former, it should work more or less like with a non-protected VM. > mmap the VFIO device, set up a memslot, and use it. I'm not sure whether > anyone will actually do this, but it should be possible, at least in principle. > Maybe someone will want to assign a NIC to a TDX guest. An NVMe device > with the understanding that the guest can't trust it wouldn't be entirely crazy > ether. > > If the latter, AFAIK there is no spec for how it would work even in principle. > Presumably it wouldn't work quite like VFIO -- instead, the kernel could have > a protection-virtual-io-fd mechanism, and that fd could be bound to a > memslot in whatever way we settle on for binding secure memory to a > memslot. FYI the iommu logic in VFIO is being refactored out into an unified /dev/iommu framework [1]. Currently it plans to support the same DMA mapping semantics as what VFIO provides today (HVA-based). in the future it could be extended to support another mapping protocol which accepts fd+offset instead of HVA and then calls the helper function from whatever backing store which can help translate fd+offset to HPA instead of using GUP. Thanks Kevin [1]https://lore.kernel.org/kvm/BN9PR11MB5433B1E4AE5B0480369F97178C189@BN9PR11MB5433.namprd11.prod.outlook.com/
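To sketch what such an fd+offset protocol could look like (entirely hypothetical -- no such ioctl or structure exists; today's type1 uAPI only takes an HVA in vaddr):

/*
 * Hypothetical uAPI sketch only, using <linux/types.h> style types.  The
 * point is merely that the kernel would resolve fd+offset to an HPA via
 * the backing store's helper instead of doing GUP on an HVA.
 */
struct iommu_dma_map_fd {
	__u32	argsz;
	__u32	flags;
	__s32	fd;		/* e.g. the guest-private memory fd */
	__u32	pad;
	__u64	fd_offset;	/* offset within the fd, replacing vaddr */
	__u64	iova;		/* device/guest DMA address */
	__u64	size;
};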
On 01.09.21 06:58, Andy Lutomirski wrote: > On Tue, Aug 31, 2021, at 12:07 PM, David Hildenbrand wrote: >> On 28.08.21 00:18, Sean Christopherson wrote: >>> On Thu, Aug 26, 2021, David Hildenbrand wrote: >>>> You'll end up with a VMA that corresponds to the whole file in a single >>>> process only, and that cannot vanish, not even in parts. >>> >>> How would userspace tell the kernel to free parts of memory that it doesn't want >>> assigned to the guest, e.g. to free memory that the guest has converted to >>> not-private? >> >> I'd guess one possibility could be fallocate(FALLOC_FL_PUNCH_HOLE). >> >> Questions are: when would it actually be allowed to perform such a >> destructive operation? Do we have to protect from that? How would KVM >> protect from user space replacing private pages by shared pages in any >> of the models we discuss? >> > > What do you mean? If userspace maliciously replaces a shared page by a private page, then the guest crashes. Assume we have private pages in a fd and fallocate(FALLOC_FL_PUNCH_HOLE) random pages the guest is still using. If we "only" crash the guest, everything is fine. > > (The actual meaning here is a bit different on SNP-ES vs TDX. In SNP-ES, a given GPA can be shared, private, or nonexistent. A guest accesses it with a special bit set in the guest page tables to indicate whether it expects shared or private, and the CPU will produce an appropriate error if the bit doesn't match the page. Rings a bell, thanks for reminding me. In TDX, there is actually an entirely separate shared vs private address space, and, in theory, a given "GPA" can exist as shared and as private at once. The full guest n-bit GPA plus the shared/private bit is logically an N+1 bit address, and it's possible to map all of it at once, half shared, and half private. In practice, the defined guest->host APIs don't really support that usage. Thanks, that explains a lot.
On 31.08.21 22:45, Sean Christopherson wrote: > On Tue, Aug 31, 2021, David Hildenbrand wrote: >> On 28.08.21 00:28, Sean Christopherson wrote: >>> On Fri, Aug 27, 2021, Andy Lutomirski wrote: >>>> >>>> On Thu, Aug 26, 2021, at 2:26 PM, David Hildenbrand wrote: >>>>> On 26.08.21 19:05, Andy Lutomirski wrote: >>>> >>>>>> Oof. That's quite a requirement. What's the point of the VMA once all >>>>>> this is done? >>>>> >>>>> You can keep using things like mbind(), madvise(), ... and the GUP code >>>>> with a special flag might mostly just do what you want. You won't have >>>>> to reinvent too many wheels on the page fault logic side at least. >>> >>> Ya, Kirill's RFC more or less proved a special GUP flag would indeed Just Work. >>> However, the KVM page fault side of things would require only a handful of small >>> changes to send private memslots down a different path. Compared to the rest of >>> the enabling, it's quite minor. >>> >>> The counter to that is other KVM architectures would need to learn how to use the >>> new APIs, though I suspect that there will be a fair bit of arch enabling regardless >>> of what route we take. >>> >>>> You can keep calling the functions. The implementations working is a >>>> different story: you can't just unmap (pte_numa-style or otherwise) a private >>>> guest page to quiesce it, move it with memcpy(), and then fault it back in. >>> >>> Ya, I brought this up in my earlier reply. Even the initial implementation (without >>> real NUMA support) would likely be painful, e.g. the KVM TDX RFC/PoC adds dedicated >>> logic in KVM to handle the case where NUMA balancing zaps a _pinned_ page and then >>> KVM fault in the same pfn. It's not thaaat ugly, but it's arguably more invasive >>> to KVM's page fault flows than a new fd-based private memslot scheme. >> >> I might have a different mindset, but less code churn doesn't necessarily >> translate to "better approach". > > I wasn't referring to code churn. By "invasive" I mean number of touchpoints in > KVM as well as the nature of the touchpoints. E.g. poking into how KVM uses > available bits in its shadow PTEs and adding multiple checks through KVM's page > fault handler, versus two callbacks to get the PFN and page size. > >> I'm certainly not pushing for what I proposed (it's a rough, broken sketch). >> I'm much rather trying to come up with alternatives that try solving the >> same issue, handling the identified requirements. >> >> I have a gut feeling that the list of requirements might not be complete >> yet. For example, I wonder if we have to protect against user space >> replacing private pages by shared pages or punishing random holes into the >> encrypted memory fd. > > Replacing a private page with a shared page for a given GFN is very much a > requirement as it's expected behavior for all VMM+guests when converting guest > memory between shared and private. > > Punching holes is a sort of optional requirement. It's a "requirement" in that > it's allowed if the backing store supports such a behavior, optional in that > support wouldn't be strictly necessary and/or could come with constraints. The > expected use case is that host userspace would punch a hole to free unreachable > private memory, e.g. after the corresponding GFN(s) is converted to shared, so > that it doesn't consume 2x memory for the guest. > Okay, that matches my understanding then. I was rather thinking about "what happens if we punch a hole where private memory was not converted to shared yet". 
AFAIU, we will simply crash the guest then.
>> Do we have to protect from that? How would KVM protect from user space >> replacing private pages by shared pages in any of the models we discuss? > > The overarching rule is that KVM needs to guarantee a given pfn is never mapped[*] > as both private and shared, where "shared" also incorporates any mapping from the > host. Essentially it boils down to the kernel ensuring that a pfn is unmapped > before it's converted to/from private, and KVM ensuring that it honors any > unmap notifications from the kernel, e.g. via mmu_notifier or via a direct callback > as proposed in this RFC. Okay, so the fallocate(PUNCHHOLE) from user space could trigger the respective unmapping and freeing of backing storage. > > As it pertains to PUNCH_HOLE, the responsibilities are no different than when the > backing-store is destroyed; the backing-store needs to notify downstream MMUs > (a.k.a. KVM) to unmap the pfn(s) before freeing the associated memory. Right. > > [*] Whether or not the kernel's direct mapping needs to be removed is debatable, > but my argument is that that behavior is not visible to userspace and thus > out of scope for this discussion, e.g. zapping/restoring the direct map can > be added/removed without impacting the userspace ABI. Right. Removing it shouldn't also be requited IMHO. There are other ways to teach the kernel to not read/write some online pages (filter /proc/kcore, disable hibernation, strict access checks for /dev/mem ...). > >>>> Define "ordinary" user memory slots as overlay on top of "encrypted" memory >>>> slots. Inside KVM, bail out if you encounter such a VMA inside a normal >>>> user memory slot. When creating a "encryped" user memory slot, require that >>>> the whole VMA is covered at creation time. You know the VMA can't change >>>> later. >>> >>> This can work for the basic use cases, but even then I'd strongly prefer not to >>> tie memslot correctness to the VMAs. KVM doesn't truly care what lies behind >>> the virtual address of a memslot, and when it does care, it tends to do poorly, >>> e.g. see the whole PFNMAP snafu. KVM cares about the pfn<->gfn mappings, and >>> that's reflected in the infrastructure. E.g. KVM relies on the mmu_notifiers >>> to handle mprotect()/munmap()/etc... >> >> Right, and for the existing use cases this worked. But encrypted memory >> breaks many assumptions we once made ... >> >> I have somewhat mixed feelings about pages that are mapped into $WHATEVER >> page tables but not actually mapped into user space page tables. There is no >> way to reach these via the rmap. >> >> We have something like that already via vfio. And that is fundamentally >> broken when it comes to mmu notifiers, page pinning, page migration, ... > > I'm not super familiar with VFIO internals, but the idea with the fd-based > approach is that the backing-store would be in direct communication with KVM and > would handle those operations through that direct channel. Right. The problem I am seeing is that e.g., try_to_unmap() might not be able to actually fully unmap a page, because some non-synchronized KVM MMU still maps a page. It would be great to evaluate how the fd callbacks would fit into the whole picture, including the current rmap. I guess I'm missing the bigger picture how it all fits together on the !KVM side. > >>> As is, I don't think KVM would get any kind of notification if userpaces unmaps >>> the VMA for a private memslot that does not have any entries in the host page >>> tables. I'm sure it's a solvable problem, e.g. 
by ensuring at least one page >>> is touched by the backing store, but I don't think the end result would be any >>> prettier than a dedicated API for KVM to consume. >>> >>> Relying on VMAs, and thus the mmu_notifiers, also doesn't provide line of sight >>> to page migration or swap. For those types of operations, KVM currently just >>> reacts to invalidation notifications by zapping guest PTEs, and then gets the >>> new pfn when the guest re-faults on the page. That sequence doesn't work for >>> TDX or SEV-SNP because the trusteday agent needs to do the memcpy() of the page >>> contents, i.e. the host needs to call into KVM for the actual migration. >> >> Right, but I still think this is a kernel internal. You can do such >> handshake later in the kernel IMHO. > > It is kernel internal, but AFAICT it will be ugly because KVM "needs" to do the > migration and that would invert the mmu_notifer API, e.g. instead of "telling" > secondary MMUs to invalidate/change a mappings, the mm would be "asking" > secondary MMus "can you move this?". More below. In my thinking, the the rmap via mmu notifiers would do the unmapping just as we know it (from primary MMU -> secondary MMU). Once try_to_unmap() succeeded, the fd provider could kick-off the migration via whatever callback. > >> But I also already thought: is it really KVM that is to perform the >> migration or is it the fd-provider that performs the migration? Who says >> memfd_encrypted() doesn't default to a TDX "backend" on Intel CPUs that just >> knows how to migrate such a page? >> >> I'd love to have some details on how that's supposed to work, and which >> information we'd need to migrate/swap/... in addition to the EPFN and a new >> SPFN. > > KVM "needs" to do the migration. On TDX, the migration will be a SEAMCALL, > a post-VMXON instruction that transfers control to the TDX-Module, that at > minimum needs a per-VM identifier, the gfn, and the page table level. The call The per-VM identifier and the GFN would be easy to grab. Page table level, not so sure -- do you mean the general page table depth? Or if it's mapped as 4k vs. 2M ... ? The latter could be answered by the fd provider already I assume. Does the page still have to be mapped into the secondary MMU when performing the migration via TDX? I assume not, which would simplify things a lot. > into the TDX-Module would also need to take a KVM lock (probably KVM's mmu_lock) > to satisfy TDX's concurrency requirement, e.g. to avoid "spurious" errors due to > the backing-store attempting to migrate memory that KVM is unmapping due to a > memslot change. Something like that might be handled by fixing private memory slots similar to in my draft, right? > > The per-VM identifier may not apply to SEV-SNP, but I believe everything else > holds true. Thanks!
On Tue, Aug 31, 2021 at 09:53:27PM -0700, Andy Lutomirski wrote: > > > On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote: > > On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote: > > > Thanks a lot for this summary. A question about the requirement: do we or > > do we not have plan to support assigned device to the protected VM? > > > > If yes. The fd based solution may need change the VFIO interface as well( > > though the fake swap entry solution need mess with VFIO too). Because: > > > > 1> KVM uses VFIO when assigning devices into a VM. > > > > 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all > > guest pages will have to be mapped in host IOMMU page table to host pages, > > which are pinned during the whole life cycle fo the VM. > > > > 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver, > > in vfio_dma_do_map(). > > > > 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA > > and pin the page. > > > > But if we are using fd based solution, not every GPA can have a HVA, thus > > the current VFIO interface to map and pin the GPA(IOVA) wont work. And I > > doubt if VFIO can be modified to support this easily. > > > > > > Do you mean assigning a normal device to a protected VM or a hypothetical protected-MMIO device? > > If the former, it should work more or less like with a non-protected VM. mmap the VFIO device, set up a memslot, and use it. I'm not sure whether anyone will actually do this, but it should be possible, at least in principle. Maybe someone will want to assign a NIC to a TDX guest. An NVMe device with the understanding that the guest can't trust it wouldn't be entirely crazy ether. > > If the latter, AFAIK there is no spec for how it would work even in principle. Presumably it wouldn't work quite like VFIO -- instead, the kernel could have a protection-virtual-io-fd mechanism, and that fd could be bound to a memslot in whatever way we settle on for binding secure memory to a memslot. > Thanks Andy. I was asking the first scenario. Well, I agree it is doable if someone really want some assigned device in TD guest. As Kevin mentioned in his reply, HPA can be generated, by extending VFIO with a new mapping protocol which uses fd+offset, instead of HVA. Another issue is current TDX does not support DMA encryption, and only shared GPA memory shall be mapped in the VT-d. So to support this, KVM may need to work with VFIO to dynamically program host IOPT(IOMMU Page Table) when TD guest notifies a shared GFN range(e.g., with a MAP_GPA TDVMCALL), instead of prepopulating the IOPT at VM creation time, by mapping entire GFN ranges of a guest. So my inclination would be to just disallow using of VFIO device in TDX first, until we have real requirement(with above enabling work finished). B.R. Yu
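For the shared ranges, the dynamic IOPT programming described above could in principle reuse today's type1 uAPI from the VMM whenever the guest converts a range to shared; a rough userspace sketch (treating the GPA as the IOVA and assuming the shared backing still has an HVA):

#include <linux/vfio.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Sketch: map only a newly-shared GPA range for DMA on a MAP_GPA
 * notification, instead of prepopulating the IOPT for the whole guest.
 */
static int map_shared_range_for_dma(int container_fd, uint64_t gpa,
				    uint64_t size, void *hva)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uint64_t)(uintptr_t)hva,	/* HVA of the shared backing */
		.iova  = gpa,				/* GPA used as IOVA */
		.size  = size,
	};

	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}

The reverse direction (a shared range going back to private) would need a matching VFIO_IOMMU_UNMAP_DMA before the pages are converted.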
On Wed, Sep 1, 2021, at 1:09 AM, David Hildenbrand wrote: > >> Do we have to protect from that? How would KVM protect from user space > >> replacing private pages by shared pages in any of the models we discuss? > > > > The overarching rule is that KVM needs to guarantee a given pfn is never mapped[*] > > as both private and shared, where "shared" also incorporates any mapping from the > > host. Essentially it boils down to the kernel ensuring that a pfn is unmapped > > before it's converted to/from private, and KVM ensuring that it honors any > > unmap notifications from the kernel, e.g. via mmu_notifier or via a direct callback > > as proposed in this RFC. > > Okay, so the fallocate(PUNCHHOLE) from user space could trigger the > respective unmapping and freeing of backing storage. > > > > > As it pertains to PUNCH_HOLE, the responsibilities are no different than when the > > backing-store is destroyed; the backing-store needs to notify downstream MMUs > > (a.k.a. KVM) to unmap the pfn(s) before freeing the associated memory. > > Right. > > > > > [*] Whether or not the kernel's direct mapping needs to be removed is debatable, > > but my argument is that that behavior is not visible to userspace and thus > > out of scope for this discussion, e.g. zapping/restoring the direct map can > > be added/removed without impacting the userspace ABI. > > Right. Removing it shouldn't also be requited IMHO. There are other ways > to teach the kernel to not read/write some online pages (filter > /proc/kcore, disable hibernation, strict access checks for /dev/mem ...). > > > > >>>> Define "ordinary" user memory slots as overlay on top of "encrypted" memory > >>>> slots. Inside KVM, bail out if you encounter such a VMA inside a normal > >>>> user memory slot. When creating a "encryped" user memory slot, require that > >>>> the whole VMA is covered at creation time. You know the VMA can't change > >>>> later. > >>> > >>> This can work for the basic use cases, but even then I'd strongly prefer not to > >>> tie memslot correctness to the VMAs. KVM doesn't truly care what lies behind > >>> the virtual address of a memslot, and when it does care, it tends to do poorly, > >>> e.g. see the whole PFNMAP snafu. KVM cares about the pfn<->gfn mappings, and > >>> that's reflected in the infrastructure. E.g. KVM relies on the mmu_notifiers > >>> to handle mprotect()/munmap()/etc... > >> > >> Right, and for the existing use cases this worked. But encrypted memory > >> breaks many assumptions we once made ... > >> > >> I have somewhat mixed feelings about pages that are mapped into $WHATEVER > >> page tables but not actually mapped into user space page tables. There is no > >> way to reach these via the rmap. > >> > >> We have something like that already via vfio. And that is fundamentally > >> broken when it comes to mmu notifiers, page pinning, page migration, ... > > > > I'm not super familiar with VFIO internals, but the idea with the fd-based > > approach is that the backing-store would be in direct communication with KVM and > > would handle those operations through that direct channel. > > Right. The problem I am seeing is that e.g., try_to_unmap() might not be > able to actually fully unmap a page, because some non-synchronized KVM > MMU still maps a page. It would be great to evaluate how the fd > callbacks would fit into the whole picture, including the current rmap. > > I guess I'm missing the bigger picture how it all fits together on the > !KVM side. The big picture is fundamentally a bit nasty. 
Logically (ignoring the implementation details of rmap, mmu_notifier, etc), you can call try_to_unmap and end up with a page that is Just A Bunch Of Bytes (tm). Then you can write it to disk, memcpy to another node, compress it, etc. When it gets faulted back in, you make sure the same bytes end up somewhere and put the PTEs back. With guest-private memory, this doesn't work. Forget about the implementation: you simply can't take a page of private memory, quiesce it so the guest can't access it without faulting, and turn it into Just A Bunch Of Bytes. TDX does not have that capability. (Any system with integrity-protected memory won't without having >PAGE_SIZE bytes or otherwise storing metadata, but TDX can't do this at all.) SEV-ES *can* (I think -- I asked the lead architect), but that's not the same thing as saying it's a good idea. So if you want to migrate a TDX page from one NUMA node to another, you need to do something different (I don't know all the details), you will have to ask Intel to explain how this might work in the future (it wasn't in the public docs last time I looked), but I'm fairly confident that it does not resemble try_to_unmap(). Even on SEV, where a page *can* be transformed into a Just A Bunch Of Bytes, the operation doesn't really look like try_to_unmap(). As I understand it, it's more of: look up the one-and-only guest and GPA at which this page is mapped. tear down the NPT PTE. (SEV, unlike TDX, uses paging structures in normal memory.) Ask the magic SEV mechanism to turn the page into swappable data Go to town. This doesn't really resemble the current rmap + mmu_notifier code, and shoehorning this into mmu_notifier seems like it may be uglier and more code than implementing it more directly in the backing store. If you actually just try_to_unmap() a SEV-ES page (and it worked, which it currently won't), you will have data corruption or cache incoherency. If you want to swap a page on TDX, you can't. Sorry, go directly to jail, do not collect $200. So I think there are literally zero code paths that currently call try_to_unmap() that will actually work like that on TDX. If we run out of memory on a TDX host, we can kill the guest completely and reclaim all of its memory (which probably also involves killing QEMU or whatever other user program is in charge), but that's really our only option. --Andy
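Spelling the SEV-style sequence above out as pseudocode (not a real flow today; sev_make_page_swappable() is invented, and kvm_zap_gfn_range() only stands in for "tear down the NPT PTE"):

/* Pseudocode of the swap-out sequence described above. */
static int swap_out_sev_private_page(struct kvm *kvm, gfn_t gfn,
				     struct page *page)
{
	/* 1. Look up the one-and-only guest/GPA at which this page is
	 *    mapped (assumed to be provided by the backing store).
	 */

	/* 2. Tear down the NPT PTE so any further guest access faults. */
	kvm_zap_gfn_range(kvm, gfn, gfn + 1);

	/* 3. Ask the magic SEV mechanism to turn the page into swappable
	 *    data (invented helper; no such API exists).
	 */
	if (sev_make_page_swappable(kvm, page))
		return -EIO;

	/* 4. The page is now Just A Bunch Of Bytes; write it out as usual. */
	return 0;
}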
On 9/1/21 3:24 AM, Yu Zhang wrote: > On Tue, Aug 31, 2021 at 09:53:27PM -0700, Andy Lutomirski wrote: >> >> >> On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote: >>> On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote: >> >>> Thanks a lot for this summary. A question about the requirement: do we or >>> do we not have plan to support assigned device to the protected VM? >>> >>> If yes. The fd based solution may need change the VFIO interface as well( >>> though the fake swap entry solution need mess with VFIO too). Because: >>> >>> 1> KVM uses VFIO when assigning devices into a VM. >>> >>> 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all >>> guest pages will have to be mapped in host IOMMU page table to host pages, >>> which are pinned during the whole life cycle fo the VM. >>> >>> 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver, >>> in vfio_dma_do_map(). >>> >>> 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA >>> and pin the page. >>> >>> But if we are using fd based solution, not every GPA can have a HVA, thus >>> the current VFIO interface to map and pin the GPA(IOVA) wont work. And I >>> doubt if VFIO can be modified to support this easily. >>> >>> >> >> Do you mean assigning a normal device to a protected VM or a hypothetical protected-MMIO device? >> >> If the former, it should work more or less like with a non-protected VM. mmap the VFIO device, set up a memslot, and use it. I'm not sure whether anyone will actually do this, but it should be possible, at least in principle. Maybe someone will want to assign a NIC to a TDX guest. An NVMe device with the understanding that the guest can't trust it wouldn't be entirely crazy ether. >> >> If the latter, AFAIK there is no spec for how it would work even in principle. Presumably it wouldn't work quite like VFIO -- instead, the kernel could have a protection-virtual-io-fd mechanism, and that fd could be bound to a memslot in whatever way we settle on for binding secure memory to a memslot. >> > > Thanks Andy. I was asking the first scenario. > > Well, I agree it is doable if someone really want some assigned > device in TD guest. As Kevin mentioned in his reply, HPA can be > generated, by extending VFIO with a new mapping protocol which > uses fd+offset, instead of HVA. I'm confused. I don't see why any new code is needed for this at all. Every proposal I've seen for handling TDX memory continues to handle TDX *shared* memory exactly like regular guest memory today. The only differences are that more hole punching will be needed, which will require lightweight memslots (to have many of them), memslots with holes, or mappings backing memslots with holes (which can be done with munmap() on current kernels). So you can literally just mmap a VFIO device and expect it to work, exactly like it does right now. Whether the guest will be willing to use the device will depend on the guest security policy (all kinds of patches about that are flying around), but if the guest tries to use it, it really should just work. > > Another issue is current TDX does not support DMA encryption, and > only shared GPA memory shall be mapped in the VT-d. So to support > this, KVM may need to work with VFIO to dynamically program host > IOPT(IOMMU Page Table) when TD guest notifies a shared GFN range(e.g., > with a MAP_GPA TDVMCALL), instead of prepopulating the IOPT at VM > creation time, by mapping entire GFN ranges of a guest. 
Given that there is no encrypted DMA support, shouldn't the only IOMMU mappings (real host-side IOMMU) that point at guest memory be for non-encrypted DMA? I don't see how this interacts at all. If the guest tries to MapGPA to turn a shared MMIO page into private, the host should fail the hypercall because the operation makes no sense. It is indeed the case that, with a TDX guest, MapGPA shared->private to a page that was previously used for unencrypted DMA will need to avoid having IOPT entries to the new private page, but even that doesn't seem particularly bad. The fd+special memslot proposal for private memory means that shared *backing store* pages never actually transition between shared and private without being completely freed. As far as I can tell, the actual problem you're referring to is: >>> 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all >>> guest pages will have to be mapped in host IOMMU page table to host pages, >>> which are pinned during the whole life cycle fo the VM. In principle, you could actually initialize a TDX guest with all of its memory shared and all of it mapped in the host IOMMU. When a guest turns some pages private, user code could punch a hole in the memslot, allocate private memory at that address, but leave the shared backing store in place and still mapped in the host IOMMU. The result would be that guest-initiated DMA to the previously shared address would actually work but would hit pages that are invisible to the guest. And a whole bunch of memory would be wasted, but the whole system should still work. Improving this may well need VFIO changes, but I think those would be roughly the same changes needed to have VFIO DMA be filtered through a guest-controlled virtual IOMMU and wouldn't have anything to do with private memory per se. --Andy
On 01.09.21 17:54, Andy Lutomirski wrote: > On Wed, Sep 1, 2021, at 1:09 AM, David Hildenbrand wrote: >>>> Do we have to protect from that? How would KVM protect from user space >>>> replacing private pages by shared pages in any of the models we discuss? >>> >>> The overarching rule is that KVM needs to guarantee a given pfn is never mapped[*] >>> as both private and shared, where "shared" also incorporates any mapping from the >>> host. Essentially it boils down to the kernel ensuring that a pfn is unmapped >>> before it's converted to/from private, and KVM ensuring that it honors any >>> unmap notifications from the kernel, e.g. via mmu_notifier or via a direct callback >>> as proposed in this RFC. >> >> Okay, so the fallocate(PUNCHHOLE) from user space could trigger the >> respective unmapping and freeing of backing storage. >> >>> >>> As it pertains to PUNCH_HOLE, the responsibilities are no different than when the >>> backing-store is destroyed; the backing-store needs to notify downstream MMUs >>> (a.k.a. KVM) to unmap the pfn(s) before freeing the associated memory. >> >> Right. >> >>> >>> [*] Whether or not the kernel's direct mapping needs to be removed is debatable, >>> but my argument is that that behavior is not visible to userspace and thus >>> out of scope for this discussion, e.g. zapping/restoring the direct map can >>> be added/removed without impacting the userspace ABI. >> >> Right. Removing it shouldn't also be requited IMHO. There are other ways >> to teach the kernel to not read/write some online pages (filter >> /proc/kcore, disable hibernation, strict access checks for /dev/mem ...). >> >>> >>>>>> Define "ordinary" user memory slots as overlay on top of "encrypted" memory >>>>>> slots. Inside KVM, bail out if you encounter such a VMA inside a normal >>>>>> user memory slot. When creating a "encryped" user memory slot, require that >>>>>> the whole VMA is covered at creation time. You know the VMA can't change >>>>>> later. >>>>> >>>>> This can work for the basic use cases, but even then I'd strongly prefer not to >>>>> tie memslot correctness to the VMAs. KVM doesn't truly care what lies behind >>>>> the virtual address of a memslot, and when it does care, it tends to do poorly, >>>>> e.g. see the whole PFNMAP snafu. KVM cares about the pfn<->gfn mappings, and >>>>> that's reflected in the infrastructure. E.g. KVM relies on the mmu_notifiers >>>>> to handle mprotect()/munmap()/etc... >>>> >>>> Right, and for the existing use cases this worked. But encrypted memory >>>> breaks many assumptions we once made ... >>>> >>>> I have somewhat mixed feelings about pages that are mapped into $WHATEVER >>>> page tables but not actually mapped into user space page tables. There is no >>>> way to reach these via the rmap. >>>> >>>> We have something like that already via vfio. And that is fundamentally >>>> broken when it comes to mmu notifiers, page pinning, page migration, ... >>> >>> I'm not super familiar with VFIO internals, but the idea with the fd-based >>> approach is that the backing-store would be in direct communication with KVM and >>> would handle those operations through that direct channel. >> >> Right. The problem I am seeing is that e.g., try_to_unmap() might not be >> able to actually fully unmap a page, because some non-synchronized KVM >> MMU still maps a page. It would be great to evaluate how the fd >> callbacks would fit into the whole picture, including the current rmap. 
>> >> I guess I'm missing the bigger picture how it all fits together on the >> !KVM side. > > The big picture is fundamentally a bit nasty. Logically (ignoring the implementation details of rmap, mmu_notifier, etc), you can call try_to_unmap and end up with a page that is Just A Bunch Of Bytes (tm). Then you can write it to disk, memcpy to another node, compress it, etc. When it gets faulted back in, you make sure the same bytes end up somewhere and put the PTEs back. > > With guest-private memory, this doesn't work. Forget about the implementation: you simply can't take a page of private memory, quiesce it so the guest can't access it without faulting, and turn it into Just A Bunch Of Bytes. TDX does not have that capability. (Any system with integrity-protected memory won't without having >PAGE_SIZE bytes or otherwise storing metadata, but TDX can't do this at all.) SEV-ES *can* (I think -- I asked the lead architect), but that's not the same thing as saying it's a good idea. > > So if you want to migrate a TDX page from one NUMA node to another, you need to do something different (I don't know all the details), you will have to ask Intel to explain how this might work in the future (it wasn't in the public docs last time I looked), but I'm fairly confident that it does not resemble try_to_unmap(). > > Even on SEV, where a page *can* be transformed into a Just A Bunch Of Bytes, the operation doesn't really look like try_to_unmap(). As I understand it, it's more of: > > look up the one-and-only guest and GPA at which this page is mapped. > tear down the NPT PTE. (SEV, unlike TDX, uses paging structures in normal memory.) > Ask the magic SEV mechanism to turn the page into swappable data > Go to town. > > This doesn't really resemble the current rmap + mmu_notifier code, and shoehorning this into mmu_notifier seems like it may be uglier and more code than implementing it more directly in the backing store. > > If you actually just try_to_unmap() a SEV-ES page (and it worked, which it currently won't), you will have data corruption or cache incoherency. > > If you want to swap a page on TDX, you can't. Sorry, go directly to jail, do not collect $200. > > So I think there are literally zero code paths that currently call try_to_unmap() that will actually work like that on TDX. If we run out of memory on a TDX host, we can kill the guest completely and reclaim all of its memory (which probably also involves killing QEMU or whatever other user program is in charge), but that's really our only option. try_to_unmap() would actually do the right thing I think. It's the users that access page content (migration, swapping) that need additional care to special-case these pages. Completely agree that these would be broken.
On Wed, 2021-09-01 at 08:54 -0700, Andy Lutomirski wrote: [...] > If you want to swap a page on TDX, you can't. Sorry, go directly to > jail, do not collect $200. Actually, even on SEV-ES you can't either. You can read the encrypted page and write it out if you want, but unless you swap it back to the exact same physical memory location, the encryption key won't work. Since we don't guarantee this for swap, I think swap won't actually work for any confidential computing environment. > So I think there are literally zero code paths that currently call > try_to_unmap() that will actually work like that on TDX. If we run > out of memory on a TDX host, we can kill the guest completely and > reclaim all of its memory (which probably also involves killing QEMU > or whatever other user program is in charge), but that's really our > only option. I think our only option for swap is guest co-operation. We're going to have to inflate a balloon or something in the guest and have the guest driver do some type of bounce of the page, where it becomes an unencrypted page in the guest (so the host can read it without the physical address keying of the encryption getting in the way) but actually encrypted with a swap transfer key known only to the guest. I assume we can use the page acceptance infrastructure currently being discussed elsewhere to do swap back in as well ... the host provides the guest with the encrypted swap page and the guest has to decrypt it and place it in encrypted guest memory. That way the swapped memory is securely encrypted, but can be swapped back in. James
On 01.09.21 18:18, James Bottomley wrote: > On Wed, 2021-09-01 at 08:54 -0700, Andy Lutomirski wrote: > [...] >> If you want to swap a page on TDX, you can't. Sorry, go directly to >> jail, do not collect $200. > > Actually, even on SEV-ES you can't either. You can read the encrypted > page and write it out if you want, but unless you swap it back to the > exact same physical memory location, the encryption key won't work. > Since we don't guarantee this for swap, I think swap won't actually > work for any confidential computing environment. > >> So I think there are literally zero code paths that currently call >> try_to_unmap() that will actually work like that on TDX. If we run >> out of memory on a TDX host, we can kill the guest completely and >> reclaim all of its memory (which probably also involves killing QEMU >> or whatever other user program is in charge), but that's really our >> only option. > > I think our only option for swap is guest co-operation. We're going to > have to inflate a balloon or something in the guest and have the guest > driver do some type of bounce of the page, where it becomes an > unencrypted page in the guest (so the host can read it without the > physical address keying of the encryption getting in the way) but > actually encrypted with a swap transfer key known only to the guest. I > assume we can use the page acceptance infrastructure currently being > discussed elsewhere to do swap back in as well ... the host provides > the guest with the encrypted swap page and the guest has to decrypt it > and place it in encrypted guest memory. Ballooning is indeed *the* mechanism to avoid swapping in the hypervisor and much rather let the guest swap. Shame it requires trusting a guest, which we, in general, can't. Not to mention other issues we already do have with ballooning (latency, broken auto-ballooning, over-inflating, ...).
On 01.09.21 18:07, Andy Lutomirski wrote: > On 9/1/21 3:24 AM, Yu Zhang wrote: >> On Tue, Aug 31, 2021 at 09:53:27PM -0700, Andy Lutomirski wrote: >>> >>> >>> On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote: >>>> On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote: >>> >>>> Thanks a lot for this summary. A question about the requirement: do we or >>>> do we not have plan to support assigned device to the protected VM? >>>> >>>> If yes. The fd based solution may need change the VFIO interface as well( >>>> though the fake swap entry solution need mess with VFIO too). Because: >>>> >>>> 1> KVM uses VFIO when assigning devices into a VM. >>>> >>>> 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all >>>> guest pages will have to be mapped in host IOMMU page table to host pages, >>>> which are pinned during the whole life cycle fo the VM. >>>> >>>> 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver, >>>> in vfio_dma_do_map(). >>>> >>>> 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA >>>> and pin the page. >>>> >>>> But if we are using fd based solution, not every GPA can have a HVA, thus >>>> the current VFIO interface to map and pin the GPA(IOVA) wont work. And I >>>> doubt if VFIO can be modified to support this easily. >>>> >>>> >>> >>> Do you mean assigning a normal device to a protected VM or a hypothetical protected-MMIO device? >>> >>> If the former, it should work more or less like with a non-protected VM. mmap the VFIO device, set up a memslot, and use it. I'm not sure whether anyone will actually do this, but it should be possible, at least in principle. Maybe someone will want to assign a NIC to a TDX guest. An NVMe device with the understanding that the guest can't trust it wouldn't be entirely crazy ether. >>> >>> If the latter, AFAIK there is no spec for how it would work even in principle. Presumably it wouldn't work quite like VFIO -- instead, the kernel could have a protection-virtual-io-fd mechanism, and that fd could be bound to a memslot in whatever way we settle on for binding secure memory to a memslot. >>> >> >> Thanks Andy. I was asking the first scenario. >> >> Well, I agree it is doable if someone really want some assigned >> device in TD guest. As Kevin mentioned in his reply, HPA can be >> generated, by extending VFIO with a new mapping protocol which >> uses fd+offset, instead of HVA. > > I'm confused. I don't see why any new code is needed for this at all. > Every proposal I've seen for handling TDX memory continues to handle TDX > *shared* memory exactly like regular guest memory today. The only > differences are that more hole punching will be needed, which will > require lightweight memslots (to have many of them), memslots with > holes, or mappings backing memslots with holes (which can be done with > munmap() on current kernels). > > So you can literally just mmap a VFIO device and expect it to work, > exactly like it does right now. Whether the guest will be willing to > use the device will depend on the guest security policy (all kinds of > patches about that are flying around), but if the guest tries to use it, > it really should just work. ... but if you end up mapping private memory into IOMMU of the device and the device ends up accessing that memory, we're in the same position that the host might get capped, just like access from user space, no? Sure, you can map only the complete duplicate shared-memory region into the IOMMU of the device, that would work. 
Shame vfio mostly always pins all guest memory and you essentially cannot punch holes into the shared memory anymore -- resulting in the worst case in a duplicate memory consumption for your VM. So you'd actually want to map only the *currently* shared pieces into the IOMMU and update the mappings on demand. Having worked on something related, I can only say that 64k individual mappings, and not being able to modify existing mappings except completely deleting them to replace with something new (!atomic), can be quite an issue for bigger VMs.
On Wed, 2021-09-01 at 18:22 +0200, David Hildenbrand wrote: > On 01.09.21 18:18, James Bottomley wrote: > > On Wed, 2021-09-01 at 08:54 -0700, Andy Lutomirski wrote: > > [...] > > > If you want to swap a page on TDX, you can't. Sorry, go directly > > > to jail, do not collect $200. > > > > Actually, even on SEV-ES you can't either. You can read the > > encrypted page and write it out if you want, but unless you swap it > > back to the exact same physical memory location, the encryption key > > won't work. Since we don't guarantee this for swap, I think swap > > won't actually work for any confidential computing environment. > > > > > So I think there are literally zero code paths that currently > > > call try_to_unmap() that will actually work like that on TDX. If > > > we run out of memory on a TDX host, we can kill the guest > > > completely and reclaim all of its memory (which probably also > > > involves killing QEMU or whatever other user program is in > > > charge), but that's really our only option. > > > > I think our only option for swap is guest co-operation. We're > > going to have to inflate a balloon or something in the guest and > > have the guest driver do some type of bounce of the page, where it > > becomes an unencrypted page in the guest (so the host can read it > > without the physical address keying of the encryption getting in > > the way) but actually encrypted with a swap transfer key known only > > to the guest. I assume we can use the page acceptance > > infrastructure currently being discussed elsewhere to do swap back > > in as well ... the host provides the guest with the encrypted swap > > page and the guest has to decrypt it and place it in encrypted > > guest memory. > > Ballooning is indeed *the* mechanism to avoid swapping in the > hypervisor and much rather let the guest swap. Shame it requires > trusting a guest, which we, in general, can't. Not to mention other > issues we already do have with ballooning (latency, broken auto- > ballooning, over-inflating, ...). Well not necessarily, but it depends how clever we want to get. If you look over on the OVMF/edk2 list, there's a proposal to do guest migration via a mirror VM that invokes a co-routine embedded in the OVMF binary: https://patchew.org/EDK2/20210818212048.162626-1-tobin@linux.ibm.com/ This gives us a page encryption mechanism that's provided by the host but accepted via the guest using attestation, meaning we have a mutually trusted piece of code that can use to extract encrypted pages. It does seem it could be enhanced to do swapping for us as well if that's a road we want to go down? James
On 01.09.21 18:31, James Bottomley wrote: > On Wed, 2021-09-01 at 18:22 +0200, David Hildenbrand wrote: >> On 01.09.21 18:18, James Bottomley wrote: >>> On Wed, 2021-09-01 at 08:54 -0700, Andy Lutomirski wrote: >>> [...] >>>> If you want to swap a page on TDX, you can't. Sorry, go directly >>>> to jail, do not collect $200. >>> >>> Actually, even on SEV-ES you can't either. You can read the >>> encrypted page and write it out if you want, but unless you swap it >>> back to the exact same physical memory location, the encryption key >>> won't work. Since we don't guarantee this for swap, I think swap >>> won't actually work for any confidential computing environment. >>> >>>> So I think there are literally zero code paths that currently >>>> call try_to_unmap() that will actually work like that on TDX. If >>>> we run out of memory on a TDX host, we can kill the guest >>>> completely and reclaim all of its memory (which probably also >>>> involves killing QEMU or whatever other user program is in >>>> charge), but that's really our only option. >>> >>> I think our only option for swap is guest co-operation. We're >>> going to have to inflate a balloon or something in the guest and >>> have the guest driver do some type of bounce of the page, where it >>> becomes an unencrypted page in the guest (so the host can read it >>> without the physical address keying of the encryption getting in >>> the way) but actually encrypted with a swap transfer key known only >>> to the guest. I assume we can use the page acceptance >>> infrastructure currently being discussed elsewhere to do swap back >>> in as well ... the host provides the guest with the encrypted swap >>> page and the guest has to decrypt it and place it in encrypted >>> guest memory. >> >> Ballooning is indeed *the* mechanism to avoid swapping in the >> hypervisor and much rather let the guest swap. Shame it requires >> trusting a guest, which we, in general, can't. Not to mention other >> issues we already do have with ballooning (latency, broken auto- >> ballooning, over-inflating, ...). > > > Well not necessarily, but it depends how clever we want to get. If you > look over on the OVMF/edk2 list, there's a proposal to do guest > migration via a mirror VM that invokes a co-routine embedded in the > OVMF binary: Yes, I heard of that. "Interesting" design. > > https://patchew.org/EDK2/20210818212048.162626-1-tobin@linux.ibm.com/ > > This gives us a page encryption mechanism that's provided by the host > but accepted via the guest using attestation, meaning we have a > mutually trusted piece of code that can use to extract encrypted pages. > It does seem it could be enhanced to do swapping for us as well if > that's a road we want to go down? Right, but that's than no longer ballooning, unless I am missing something important. You'd ask the guest to export/import, and you can trust it. But do we want to call something like that out of random kernel context when swapping/writeback, ...? Hard to tell. Feels like it won't win in a beauty contest.
On Wed, 2021-09-01 at 18:37 +0200, David Hildenbrand wrote: > On 01.09.21 18:31, James Bottomley wrote: > > On Wed, 2021-09-01 at 18:22 +0200, David Hildenbrand wrote: > > > On 01.09.21 18:18, James Bottomley wrote: > > > > On Wed, 2021-09-01 at 08:54 -0700, Andy Lutomirski wrote: > > > > [...] > > > > > If you want to swap a page on TDX, you can't. Sorry, go > > > > > directly > > > > > to jail, do not collect $200. > > > > > > > > Actually, even on SEV-ES you can't either. You can read the > > > > encrypted page and write it out if you want, but unless you > > > > swap it back to the exact same physical memory location, the > > > > encryption key won't work. Since we don't guarantee this for > > > > swap, I think swap won't actually work for any confidential > > > > computing environment. > > > > > > > > > So I think there are literally zero code paths that currently > > > > > call try_to_unmap() that will actually work like that on > > > > > TDX. If we run out of memory on a TDX host, we can kill the > > > > > guest completely and reclaim all of its memory (which > > > > > probably also involves killing QEMU or whatever other user > > > > > program is in charge), but that's really our only option. > > > > > > > > I think our only option for swap is guest co-operation. We're > > > > going to have to inflate a balloon or something in the guest > > > > and have the guest driver do some type of bounce of the page, > > > > where it becomes an unencrypted page in the guest (so the host > > > > can read it without the physical address keying of the > > > > encryption getting in the way) but actually encrypted with a > > > > swap transfer key known only to the guest. I assume we can use > > > > the page acceptance infrastructure currently being discussed > > > > elsewhere to do swap back in as well ... the host provides the > > > > guest with the encrypted swap page and the guest has to decrypt > > > > it and place it in encrypted guest memory. > > > > > > Ballooning is indeed *the* mechanism to avoid swapping in the > > > hypervisor and much rather let the guest swap. Shame it requires > > > trusting a guest, which we, in general, can't. Not to mention > > > other issues we already do have with ballooning (latency, broken > > > auto-ballooning, over-inflating, ...). > > > > Well not necessarily, but it depends how clever we want to get. If > > you look over on the OVMF/edk2 list, there's a proposal to do guest > > migration via a mirror VM that invokes a co-routine embedded in the > > OVMF binary: > > Yes, I heard of that. "Interesting" design. Heh, well what other suggestion do you have? The problem is there needs to be code somewhere to perform some operations that's trusted by both the guest and the host. The only element for a confidential VM that has this shared trust is the OVMF firmware, so it seems logical to use it. > > > https://patchew.org/EDK2/20210818212048.162626-1-tobin@linux.ibm.com/ > > > > This gives us a page encryption mechanism that's provided by the > > host but accepted via the guest using attestation, meaning we have > > a mutually trusted piece of code that can use to extract encrypted > > pages. It does seem it could be enhanced to do swapping for us as > > well if that's a road we want to go down? > > Right, but that's than no longer ballooning, unless I am missing > something important. You'd ask the guest to export/import, and you > can trust it. But do we want to call something like that out of > random kernel context when swapping/writeback, ...? Hard to tell. 
> Feels like it won't win in a beauty contest. What I was thinking is that OVMF can emulate devices in this trusted code ... another potential use for it is a trusted vTPM for SEV-SNP so we can do measured boot. To use it we'd give the guest kernel some type of virtual swap driver that attaches to this OVMF device. I suppose by the time we've done this, it really does look like a balloon, but I'd like to think of it more as a paravirt memory controller since it might be used to make a guest more co-operative in a host overcommit situation. That's not to say we *should* do this, merely that it doesn't have to look like a pig with lipstick. James
On Wed, Sep 1, 2021, at 9:18 AM, James Bottomley wrote: > On Wed, 2021-09-01 at 08:54 -0700, Andy Lutomirski wrote: > [...] > > If you want to swap a page on TDX, you can't. Sorry, go directly to > > jail, do not collect $200. > > Actually, even on SEV-ES you can't either. You can read the encrypted > page and write it out if you want, but unless you swap it back to the > exact same physical memory location, the encryption key won't work. > Since we don't guarantee this for swap, I think swap won't actually > work for any confidential computing environment. > > > So I think there are literally zero code paths that currently call > > try_to_unmap() that will actually work like that on TDX. If we run > > out of memory on a TDX host, we can kill the guest completely and > > reclaim all of its memory (which probably also involves killing QEMU > > or whatever other user program is in charge), but that's really our > > only option. > > I think our only option for swap is guest co-operation. We're going to > have to inflate a balloon or something in the guest and have the guest > driver do some type of bounce of the page, where it becomes an > unencrypted page in the guest (so the host can read it without the > physical address keying of the encryption getting in the way) but > actually encrypted with a swap transfer key known only to the guest. I > assume we can use the page acceptance infrastructure currently being > discussed elsewhere to do swap back in as well ... the host provides > the guest with the encrypted swap page and the guest has to decrypt it > and place it in encrypted guest memory. I asked David, and he said the PSP offers a swapping mechanism for SEV-ES. I haven’t read the details, but they should all be public.
>>> Well not necessarily, but it depends how clever we want to get. If >>> you look over on the OVMF/edk2 list, there's a proposal to do guest >>> migration via a mirror VM that invokes a co-routine embedded in the >>> OVMF binary: >> >> Yes, I heard of that. "Interesting" design. > > Heh, well what other suggestion do you have? The problem is there > needs to be code somewhere to perform some operations that's trusted by > both the guest and the host. The only element for a confidential VM > that has this shared trust is the OVMF firmware, so it seems logical to > use it. <offtopic> Let me put it this way: I worked with another architecture that doesn't fault on access of a secure page, but instead automatically exports/encrypts it so it can be swapped. It doesn't send a MCE and kills the host. It doesn't require fancy code in the guest firmware to export a page. The code runs in the ultravisor -- yes, I'm talking about s390x. Now, I am not an expert on all of the glory details of TDX, SEV, ... to say which attack surface they introduced with that design, and if it can't be mitigated. I can only assume that there are real reasons (e.g., supporting an ultravisor is problematic, patents? ;) ) why x86-64 is different. So whenever I see something really complicated to work around such issues, it feels to me like a hardware/platform limitation is making our life hard and forces us to come up with such "interesting" designs. Sure, it's logical in this context, but it feels like "The house doesn't have a door, so I'll have to climb through the window.". It gets the job done but isn't ideally what you'd want to have. If you understand what I am trying to say :) </offtopic> > >> >>> https://patchew.org/EDK2/20210818212048.162626-1-tobin@linux.ibm.com/ >>> >>> This gives us a page encryption mechanism that's provided by the >>> host but accepted via the guest using attestation, meaning we have >>> a mutually trusted piece of code that can use to extract encrypted >>> pages. It does seem it could be enhanced to do swapping for us as >>> well if that's a road we want to go down? >> >> Right, but that's than no longer ballooning, unless I am missing >> something important. You'd ask the guest to export/import, and you >> can trust it. But do we want to call something like that out of >> random kernel context when swapping/writeback, ...? Hard to tell. >> Feels like it won't win in a beauty contest. > > What I was thinking is that OVMF can emulate devices in this trusted > code ... another potential use for it is a trusted vTPM for SEV-SNP so > we can do measured boot. To use it we'd give the guest kernel some > type of virtual swap driver that attaches to this OVMF device. I > suppose by the time we've done this, it really does look like a > balloon, but I'd like to think of it more as a paravirt memory > controller since it might be used to make a guest more co-operative in > a host overcommit situation. > > That's not to say we *should* do this, merely that it doesn't have to > look like a pig with lipstick. It's an interesting approach: it would essentially mean that the OVMF would swap pages out to some virtual device and then essentially "inflate" the pages like a balloon. Still, it doesn't sound like something you want to trigger from actual kernel context when actually swapping in the kernel. It would much rather be something like other balloon implementations: completely controlled by user space. 
So yes, "doesn't look like a pig with lipstick", but still compared to proper in-kernel swapping, looks like a workaround.
On Wed, Sep 1, 2021, at 9:16 AM, David Hildenbrand wrote: > On 01.09.21 17:54, Andy Lutomirski wrote: > > On Wed, Sep 1, 2021, at 1:09 AM, David Hildenbrand wrote: > >>>> Do we have to protect from that? How would KVM protect from user space > >>>> replacing private pages by shared pages in any of the models we discuss? > >>> > >>> The overarching rule is that KVM needs to guarantee a given pfn is never mapped[*] > >>> as both private and shared, where "shared" also incorporates any mapping from the > >>> host. Essentially it boils down to the kernel ensuring that a pfn is unmapped > >>> before it's converted to/from private, and KVM ensuring that it honors any > >>> unmap notifications from the kernel, e.g. via mmu_notifier or via a direct callback > >>> as proposed in this RFC. > >> > >> Okay, so the fallocate(PUNCHHOLE) from user space could trigger the > >> respective unmapping and freeing of backing storage. > >> > >>> > >>> As it pertains to PUNCH_HOLE, the responsibilities are no different than when the > >>> backing-store is destroyed; the backing-store needs to notify downstream MMUs > >>> (a.k.a. KVM) to unmap the pfn(s) before freeing the associated memory. > >> > >> Right. > >> > >>> > >>> [*] Whether or not the kernel's direct mapping needs to be removed is debatable, > >>> but my argument is that that behavior is not visible to userspace and thus > >>> out of scope for this discussion, e.g. zapping/restoring the direct map can > >>> be added/removed without impacting the userspace ABI. > >> > >> Right. Removing it shouldn't also be requited IMHO. There are other ways > >> to teach the kernel to not read/write some online pages (filter > >> /proc/kcore, disable hibernation, strict access checks for /dev/mem ...). > >> > >>> > >>>>>> Define "ordinary" user memory slots as overlay on top of "encrypted" memory > >>>>>> slots. Inside KVM, bail out if you encounter such a VMA inside a normal > >>>>>> user memory slot. When creating a "encryped" user memory slot, require that > >>>>>> the whole VMA is covered at creation time. You know the VMA can't change > >>>>>> later. > >>>>> > >>>>> This can work for the basic use cases, but even then I'd strongly prefer not to > >>>>> tie memslot correctness to the VMAs. KVM doesn't truly care what lies behind > >>>>> the virtual address of a memslot, and when it does care, it tends to do poorly, > >>>>> e.g. see the whole PFNMAP snafu. KVM cares about the pfn<->gfn mappings, and > >>>>> that's reflected in the infrastructure. E.g. KVM relies on the mmu_notifiers > >>>>> to handle mprotect()/munmap()/etc... > >>>> > >>>> Right, and for the existing use cases this worked. But encrypted memory > >>>> breaks many assumptions we once made ... > >>>> > >>>> I have somewhat mixed feelings about pages that are mapped into $WHATEVER > >>>> page tables but not actually mapped into user space page tables. There is no > >>>> way to reach these via the rmap. > >>>> > >>>> We have something like that already via vfio. And that is fundamentally > >>>> broken when it comes to mmu notifiers, page pinning, page migration, ... > >>> > >>> I'm not super familiar with VFIO internals, but the idea with the fd-based > >>> approach is that the backing-store would be in direct communication with KVM and > >>> would handle those operations through that direct channel. > >> > >> Right. The problem I am seeing is that e.g., try_to_unmap() might not be > >> able to actually fully unmap a page, because some non-synchronized KVM > >> MMU still maps a page. 
It would be great to evaluate how the fd > >> callbacks would fit into the whole picture, including the current rmap. > >> > >> I guess I'm missing the bigger picture how it all fits together on the > >> !KVM side. > > > > The big picture is fundamentally a bit nasty. Logically (ignoring the implementation details of rmap, mmu_notifier, etc), you can call try_to_unmap and end up with a page that is Just A Bunch Of Bytes (tm). Then you can write it to disk, memcpy to another node, compress it, etc. When it gets faulted back in, you make sure the same bytes end up somewhere and put the PTEs back. > > > > With guest-private memory, this doesn't work. Forget about the implementation: you simply can't take a page of private memory, quiesce it so the guest can't access it without faulting, and turn it into Just A Bunch Of Bytes. TDX does not have that capability. (Any system with integrity-protected memory won't without having >PAGE_SIZE bytes or otherwise storing metadata, but TDX can't do this at all.) SEV-ES *can* (I think -- I asked the lead architect), but that's not the same thing as saying it's a good idea. > > > > So if you want to migrate a TDX page from one NUMA node to another, you need to do something different (I don't know all the details), you will have to ask Intel to explain how this might work in the future (it wasn't in the public docs last time I looked), but I'm fairly confident that it does not resemble try_to_unmap(). > > > > Even on SEV, where a page *can* be transformed into a Just A Bunch Of Bytes, the operation doesn't really look like try_to_unmap(). As I understand it, it's more of: > > > > look up the one-and-only guest and GPA at which this page is mapped. > > tear down the NPT PTE. (SEV, unlike TDX, uses paging structures in normal memory.) > > Ask the magic SEV mechanism to turn the page into swappable data > > Go to town. > > > > This doesn't really resemble the current rmap + mmu_notifier code, and shoehorning this into mmu_notifier seems like it may be uglier and more code than implementing it more directly in the backing store. > > > > If you actually just try_to_unmap() a SEV-ES page (and it worked, which it currently won't), you will have data corruption or cache incoherency. > > > > If you want to swap a page on TDX, you can't. Sorry, go directly to jail, do not collect $200. > > > > So I think there are literally zero code paths that currently call try_to_unmap() that will actually work like that on TDX. If we run out of memory on a TDX host, we can kill the guest completely and reclaim all of its memory (which probably also involves killing QEMU or whatever other user program is in charge), but that's really our only option. > > try_to_unmap() would actually do the right thing I think. It's the users > that access page content (migration, swapping) that need additional care > to special-case these pages. Completely agree that these would be broken. I suspect that the eventual TDX numa migration flow will involve asking the TD module to relocate the page without unmapping it first. TDX doesn’t really have a nondestructive unmap operation. > > > -- > Thanks, > > David / dhildenb > >
On Wed, 2021-09-01 at 10:08 -0700, Andy Lutomirski wrote: > > On Wed, Sep 1, 2021, at 9:18 AM, James Bottomley wrote: > > On Wed, 2021-09-01 at 08:54 -0700, Andy Lutomirski wrote: > > [...] > > > If you want to swap a page on TDX, you can't. Sorry, go directly > > > to jail, do not collect $200. > > > > Actually, even on SEV-ES you can't either. You can read the > > encrypted page and write it out if you want, but unless you swap it > > back to the exact same physical memory location, the encryption key > > won't work. Since we don't guarantee this for swap, I think swap > > won't actually work for any confidential computing environment. > > > > > So I think there are literally zero code paths that currently > > > call try_to_unmap() that will actually work like that on TDX. If > > > we run out of memory on a TDX host, we can kill the guest > > > completely and reclaim all of its memory (which probably also > > > involves killing QEMU or whatever other user program is in > > > charge), but that's really our only option. > > > > I think our only option for swap is guest co-operation. We're > > going to have to inflate a balloon or something in the guest and > > have the guest driver do some type of bounce of the page, where it > > becomes an unencrypted page in the guest (so the host can read it > > without the physical address keying of the encryption getting in > > the way) but actually encrypted with a swap transfer key known only > > to the guest. I assume we can use the page acceptance > > infrastructure currently being discussed elsewhere to do swap back > > in as well ... the host provides the guest with the encrypted swap > > page and the guest has to decrypt it and place it in encrypted > > guest memory. > > I asked David, and he said the PSP offers a swapping mechanism for > SEV-ES. I haven’t read the details, but they should all be public. Well it does, but it's not useful: we can't use the PSP for bulk encryption, it's too slow. That's why we're having to fuss about fast migration in the first place. In theory the two PSPs can co-operate to migrate a guest but only if you have about a year to wait for it to happen. James
On Wed, Sep 01, 2021, David Hildenbrand wrote: > > > > Well not necessarily, but it depends how clever we want to get. If > > > > you look over on the OVMF/edk2 list, there's a proposal to do guest > > > > migration via a mirror VM that invokes a co-routine embedded in the > > > > OVMF binary: > > > > > > Yes, I heard of that. "Interesting" design. > > > > Heh, well what other suggestion do you have? The problem is there > > needs to be code somewhere to perform some operations that's trusted by > > both the guest and the host. The only element for a confidential VM > > that has this shared trust is the OVMF firmware, so it seems logical to > > use it. > > <offtopic> > > Let me put it this way: I worked with another architecture that doesn't > fault on access of a secure page, but instead automatically exports/encrypts I thought s390 does fault on insecure accesses to secure pages, and it's the kernel's fault handler that "automatically" converts the page? E.g. trap 0x3d -> do_secure_storage_access() -> arch_make_page_accessible(). > it so it can be swapped. It doesn't send a MCE and kills the host. It > doesn't require fancy code in the guest firmware to export a page. > > The code runs in the ultravisor -- yes, I'm talking about s390x. Now, I am > not an expert on all of the glory details of TDX, SEV, ... to say which > attack surface they introduced with that design, and if it can't be > mitigated. I can only assume that there are real reasons (e.g., supporting > an ultravisor is problematic, patents? ;) ) why x86-64 is different. > > So whenever I see something really complicated to work around such issues, > it feels to me like a hardware/platform limitation is making our life hard > and forces us to come up with such "interesting" designs. Oh, 100% agree, the TDX "limitation" of poisoning cache line and leaving a land mine to trip over is absolutely abhorrent. SEV-ES and SEV aren't much better in that they happily corrupt guest memory. I am quite jealous of s390 behavior of simply faulting on the actual access. SEV-SNP does fault on the access and could do something similar to s390, but I'm not totally convinced that's actually desirable as it has ramifications with respect to debugging host and/or guest. But it'd be nice to have the option... > Sure, it's logical in this context, but it feels like "The house doesn't Heh, for some definitions of "logical". > have a door, so I'll have to climb through the window.". It gets the job > done but isn't ideally what you'd want to have. If you understand what I am > trying to say :) > > </offtopic>
On 01.09.21 19:50, Sean Christopherson wrote: > On Wed, Sep 01, 2021, David Hildenbrand wrote: >>>>> Well not necessarily, but it depends how clever we want to get. If >>>>> you look over on the OVMF/edk2 list, there's a proposal to do guest >>>>> migration via a mirror VM that invokes a co-routine embedded in the >>>>> OVMF binary: >>>> >>>> Yes, I heard of that. "Interesting" design. >>> >>> Heh, well what other suggestion do you have? The problem is there >>> needs to be code somewhere to perform some operations that's trusted by >>> both the guest and the host. The only element for a confidential VM >>> that has this shared trust is the OVMF firmware, so it seems logical to >>> use it. >> >> <offtopic> >> >> Let me put it this way: I worked with another architecture that doesn't >> fault on access of a secure page, but instead automatically exports/encrypts > > I thought s390 does fault on insecure accesses to secure pages, and it's the > kernel's fault handler that "automatically" converts the page? E.g. trap 0x3d > -> do_secure_storage_access() -> arch_make_page_accessible(). "automatic" as in "the kernel can do it easily automatically under the hood when accessing such memory", yes that's what I meant :)
On Wed, Sep 1, 2021, at 9:18 AM, James Bottomley wrote: > On Wed, 2021-09-01 at 08:54 -0700, Andy Lutomirski wrote: > [...] > > If you want to swap a page on TDX, you can't. Sorry, go directly to > > jail, do not collect $200. > > Actually, even on SEV-ES you can't either. You can read the encrypted > page and write it out if you want, but unless you swap it back to the > exact same physical memory location, the encryption key won't work. > Since we don't guarantee this for swap, I think swap won't actually > work for any confidential computing environment. > > > So I think there are literally zero code paths that currently call > > try_to_unmap() that will actually work like that on TDX. If we run > > out of memory on a TDX host, we can kill the guest completely and > > reclaim all of its memory (which probably also involves killing QEMU > > or whatever other user program is in charge), but that's really our > > only option. > > I think our only option for swap is guest co-operation. We're going to > have to inflate a balloon or something in the guest and have the guest > driver do some type of bounce of the page, where it becomes an > unencrypted page in the guest (so the host can read it without the > physical address keying of the encryption getting in the way) but > actually encrypted with a swap transfer key known only to the guest. I > assume we can use the page acceptance infrastructure currently being > discussed elsewhere to do swap back in as well ... the host provides > the guest with the encrypted swap page and the guest has to decrypt it > and place it in encrypted guest memory. I think the TD module could, without hardware changes, fairly efficiently re-encrypt guest pages for swap. The TD module has access to the full CPU crypto capabilities, so this wouldn't be slow. It would just require convincing the TD module team to implement such a feature. It would be very, very nice for whatever swap support ends up existing to work without guest cooperation -- even if the guest is cooperative, we don't want to end up waiting for the guest to help when the host is under memory pressure.
On 9/1/21 9:18 AM, James Bottomley wrote: >> So I think there are literally zero code paths that currently call >> try_to_unmap() that will actually work like that on TDX. If we run >> out of memory on a TDX host, we can kill the guest completely and >> reclaim all of its memory (which probably also involves killing QEMU >> or whatever other user program is in charge), but that's really our >> only option. > I think our only option for swap is guest co-operation. Yes, today that's the only choice. Both TDX and SEV-SNP can *theoretically* extend their architectures to enable swap with uncooperative guests. But, frankly, nobody has been asking for it. If you care, please ask Intel and AMD to fix this gap.
On Wed, Sep 01, 2021 at 09:07:59AM -0700, Andy Lutomirski wrote: > On 9/1/21 3:24 AM, Yu Zhang wrote: > > On Tue, Aug 31, 2021 at 09:53:27PM -0700, Andy Lutomirski wrote: > >> > >> > >> On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote: > >>> On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote: > >> > >>> Thanks a lot for this summary. A question about the requirement: do we or > >>> do we not have plan to support assigned device to the protected VM? > >>> > >>> If yes. The fd based solution may need change the VFIO interface as well( > >>> though the fake swap entry solution need mess with VFIO too). Because: > >>> > >>> 1> KVM uses VFIO when assigning devices into a VM. > >>> > >>> 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all > >>> guest pages will have to be mapped in host IOMMU page table to host pages, > >>> which are pinned during the whole life cycle fo the VM. > >>> > >>> 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver, > >>> in vfio_dma_do_map(). > >>> > >>> 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA > >>> and pin the page. > >>> > >>> But if we are using fd based solution, not every GPA can have a HVA, thus > >>> the current VFIO interface to map and pin the GPA(IOVA) wont work. And I > >>> doubt if VFIO can be modified to support this easily. > >>> > >>> > >> > >> Do you mean assigning a normal device to a protected VM or a hypothetical protected-MMIO device? > >> > >> If the former, it should work more or less like with a non-protected VM. mmap the VFIO device, set up a memslot, and use it. I'm not sure whether anyone will actually do this, but it should be possible, at least in principle. Maybe someone will want to assign a NIC to a TDX guest. An NVMe device with the understanding that the guest can't trust it wouldn't be entirely crazy ether. > >> > >> If the latter, AFAIK there is no spec for how it would work even in principle. Presumably it wouldn't work quite like VFIO -- instead, the kernel could have a protection-virtual-io-fd mechanism, and that fd could be bound to a memslot in whatever way we settle on for binding secure memory to a memslot. > >> > > > > Thanks Andy. I was asking the first scenario. > > > > Well, I agree it is doable if someone really want some assigned > > device in TD guest. As Kevin mentioned in his reply, HPA can be > > generated, by extending VFIO with a new mapping protocol which > > uses fd+offset, instead of HVA. > > I'm confused. I don't see why any new code is needed for this at all. > Every proposal I've seen for handling TDX memory continues to handle TDX > *shared* memory exactly like regular guest memory today. The only > differences are that more hole punching will be needed, which will > require lightweight memslots (to have many of them), memslots with > holes, or mappings backing memslots with holes (which can be done with > munmap() on current kernels). Thanks for pointing this out. And yes, for DMAs not capable of encryption( which is the case in current TDX). GUP shall work as it is in VFIO. :) > > So you can literally just mmap a VFIO device and expect it to work, > exactly like it does right now. Whether the guest will be willing to > use the device will depend on the guest security policy (all kinds of > patches about that are flying around), but if the guest tries to use it, > it really should just work. > But I think there's still problem. For now, 1> Qemu mmap()s all GPAs into its HVA space, when the VM is created. 
2> With no idea which part of guest memory shall be shared, VFIO will just set up the IOPT, by mapping whole GPA ranges in IOPT. 3> And those GPAs are actually private ones, with no shared-bit set. Later when guest tries to perform a DMA(using a shared GPA), IO page fault shall happen. > > > > Another issue is current TDX does not support DMA encryption, and > > only shared GPA memory shall be mapped in the VT-d. So to support > > this, KVM may need to work with VFIO to dynamically program host > > IOPT(IOMMU Page Table) when TD guest notifies a shared GFN range(e.g., > > with a MAP_GPA TDVMCALL), instead of prepopulating the IOPT at VM > > creation time, by mapping entire GFN ranges of a guest. > > Given that there is no encrypted DMA support, shouldn't the only IOMMU > mappings (real host-side IOMMU) that point at guest memory be for > non-encrypted DMA? I don't see how this interacts at all. If the guest > tries to MapGPA to turn a shared MMIO page into private, the host should > fail the hypercall because the operation makes no sense. > > It is indeed the case that, with a TDX guest, MapGPA shared->private to > a page that was previously used for unencrypted DMA will need to avoid > having IOPT entries to the new private page, but even that doesn't seem > particularly bad. The fd+special memslot proposal for private memory > means that shared *backing store* pages never actually transition > between shared and private without being completely freed. > > As far as I can tell, the actual problem you're referring to is: > > >>> 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all > >>> guest pages will have to be mapped in host IOMMU page table to host > pages, > >>> which are pinned during the whole life cycle fo the VM. Yes. That's the primary concern. :) > > In principle, you could actually initialize a TDX guest with all of its > memory shared and all of it mapped in the host IOMMU. When a guest > turns some pages private, user code could punch a hole in the memslot, > allocate private memory at that address, but leave the shared backing > store in place and still mapped in the host IOMMU. The result would be > that guest-initiated DMA to the previously shared address would actually > work but would hit pages that are invisible to the guest. And a whole > bunch of memory would be waste, but the whole system should stll work. Do you mean to let VFIO & IOMMU to treat all guest memory as shared first, and then just allocate the private pages in another backing store? I guess that could work, but with the cost of allocating roughly 2x physical pages of the guest RAM size. After all, the shared pages shall be only a small part of guest memory. If device assignment is desired in current TDX. My understanding of the enabling work would be like this: 1> Change qemu to not trigger VFIO_IOMMU_MAP_DMA for the TD, thus no IOPT prepopulated, and no physical page allocated. 2> KVM forwards MapGPA(private -> shared) request to Qemu. 3> Qemu asks VFIO to pin and map the shared GPAs. For private -> shared transitions, the memslot punching, IOPT unmapping, and iotlb flushing are necessary. Possibly new interface between VFIO and KVM is needed. But actually I am not sure if people really want assigned device in current TDX. Bottleneck of the performance should be the copying to/from swiotlb buffers. B.R. Yu
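[Editor's note: as a concrete illustration of step 3> in the enabling sketch above, this is roughly the ioctl Qemu could issue for a newly shared GPA range through the existing VFIO type1 interface. It is a sketch only, assuming the shared GPA is used directly as the IOVA and is still backed by a regular HVA mapping in Qemu; it is not the new fd+offset mapping protocol being discussed.]

    /*
     * Sketch only: map a newly shared GPA range into the host IOMMU via the
     * legacy VFIO type1 interface. The kernel pins the backing pages via GUP
     * as part of the map, so this is the "pin and map on demand" step.
     */
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    static int map_shared_gpa(int container_fd, void *hva,
                              uint64_t gpa, uint64_t size)
    {
            struct vfio_iommu_type1_dma_map map = {
                    .argsz = sizeof(map),
                    .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                    .vaddr = (uintptr_t)hva,  /* shared range still has an HVA */
                    .iova  = gpa,             /* shared GPA acts as the IOVA */
                    .size  = size,
            };

            return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }

The reverse transition would use VFIO_IOMMU_UNMAP_DMA on the same IOVA range, which is where the granularity and atomicity concerns discussed below come in.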
On Wed, Sep 01, 2021 at 06:27:20PM +0200, David Hildenbrand wrote: > On 01.09.21 18:07, Andy Lutomirski wrote: > > On 9/1/21 3:24 AM, Yu Zhang wrote: > > > On Tue, Aug 31, 2021 at 09:53:27PM -0700, Andy Lutomirski wrote: > > > > > > > > > > > > On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote: > > > > > On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote: > > > > > > > > > Thanks a lot for this summary. A question about the requirement: do we or > > > > > do we not have plan to support assigned device to the protected VM? > > > > > > > > > > If yes. The fd based solution may need change the VFIO interface as well( > > > > > though the fake swap entry solution need mess with VFIO too). Because: > > > > > > > > > > 1> KVM uses VFIO when assigning devices into a VM. > > > > > > > > > > 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all > > > > > guest pages will have to be mapped in host IOMMU page table to host pages, > > > > > which are pinned during the whole life cycle fo the VM. > > > > > > > > > > 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver, > > > > > in vfio_dma_do_map(). > > > > > > > > > > 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA > > > > > and pin the page. > > > > > > > > > > But if we are using fd based solution, not every GPA can have a HVA, thus > > > > > the current VFIO interface to map and pin the GPA(IOVA) wont work. And I > > > > > doubt if VFIO can be modified to support this easily. > > > > > > > > > > > > > > > > > > Do you mean assigning a normal device to a protected VM or a hypothetical protected-MMIO device? > > > > > > > > If the former, it should work more or less like with a non-protected VM. mmap the VFIO device, set up a memslot, and use it. I'm not sure whether anyone will actually do this, but it should be possible, at least in principle. Maybe someone will want to assign a NIC to a TDX guest. An NVMe device with the understanding that the guest can't trust it wouldn't be entirely crazy ether. > > > > > > > > If the latter, AFAIK there is no spec for how it would work even in principle. Presumably it wouldn't work quite like VFIO -- instead, the kernel could have a protection-virtual-io-fd mechanism, and that fd could be bound to a memslot in whatever way we settle on for binding secure memory to a memslot. > > > > > > > > > > Thanks Andy. I was asking the first scenario. > > > > > > Well, I agree it is doable if someone really want some assigned > > > device in TD guest. As Kevin mentioned in his reply, HPA can be > > > generated, by extending VFIO with a new mapping protocol which > > > uses fd+offset, instead of HVA. > > > > I'm confused. I don't see why any new code is needed for this at all. > > Every proposal I've seen for handling TDX memory continues to handle TDX > > *shared* memory exactly like regular guest memory today. The only > > differences are that more hole punching will be needed, which will > > require lightweight memslots (to have many of them), memslots with > > holes, or mappings backing memslots with holes (which can be done with > > munmap() on current kernels). > > > > So you can literally just mmap a VFIO device and expect it to work, > > exactly like it does right now. Whether the guest will be willing to > > use the device will depend on the guest security policy (all kinds of > > patches about that are flying around), but if the guest tries to use it, > > it really should just work. > > ... 
but if you end up mapping private memory into IOMMU of the device and > the device ends up accessing that memory, we're in the same position that > the host might get capped, just like access from user space, no? Well, according to the spec: - If the result of the translation results in a physical address with a TD private key ID, then the IOMMU will abort the transaction and report a VT-d DMA remapping failure. - If the GPA in the transaction that is input to the IOMMU is private (SHARED bit is 0), then the IOMMU may abort the transaction and report a VT-d DMA remapping failure. So I guess mapping private GPAs in IOMMU is not that dangerous as mapping into userspace. Though still wrong. > > Sure, you can map only the complete duplicate shared-memory region into the > IOMMU of the device, that would work. Shame vfio mostly always pins all > guest memory and you essentially cannot punch holes into the shared memory > anymore -- resulting in the worst case in a duplicate memory consumption for > your VM. > > So you'd actually want to map only the *currently* shared pieces into the > IOMMU and update the mappings on demand. Having worked on something related, Exactly. On demand mapping and page pinning for shared memory is necessary. > I can only say that 64k individual mappings, and not being able to modify > existing mappings except completely deleting them to replace with something > new (!atomic), can be quite an issue for bigger VMs. Do you mean atomicity in mapping/unmapping can hardly be guaranteed during the shared <-> private transition? May I ask for elaboration? Thanks! B.R. Yu
On 02.09.21 10:34, Yu Zhang wrote: > On Wed, Sep 01, 2021 at 06:27:20PM +0200, David Hildenbrand wrote: >> On 01.09.21 18:07, Andy Lutomirski wrote: >>> On 9/1/21 3:24 AM, Yu Zhang wrote: >>>> On Tue, Aug 31, 2021 at 09:53:27PM -0700, Andy Lutomirski wrote: >>>>> >>>>> >>>>> On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote: >>>>>> On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote: >>>>> >>>>>> Thanks a lot for this summary. A question about the requirement: do we or >>>>>> do we not have plan to support assigned device to the protected VM? >>>>>> >>>>>> If yes. The fd based solution may need change the VFIO interface as well( >>>>>> though the fake swap entry solution need mess with VFIO too). Because: >>>>>> >>>>>> 1> KVM uses VFIO when assigning devices into a VM. >>>>>> >>>>>> 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all >>>>>> guest pages will have to be mapped in host IOMMU page table to host pages, >>>>>> which are pinned during the whole life cycle fo the VM. >>>>>> >>>>>> 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver, >>>>>> in vfio_dma_do_map(). >>>>>> >>>>>> 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA >>>>>> and pin the page. >>>>>> >>>>>> But if we are using fd based solution, not every GPA can have a HVA, thus >>>>>> the current VFIO interface to map and pin the GPA(IOVA) wont work. And I >>>>>> doubt if VFIO can be modified to support this easily. >>>>>> >>>>>> >>>>> >>>>> Do you mean assigning a normal device to a protected VM or a hypothetical protected-MMIO device? >>>>> >>>>> If the former, it should work more or less like with a non-protected VM. mmap the VFIO device, set up a memslot, and use it. I'm not sure whether anyone will actually do this, but it should be possible, at least in principle. Maybe someone will want to assign a NIC to a TDX guest. An NVMe device with the understanding that the guest can't trust it wouldn't be entirely crazy ether. >>>>> >>>>> If the latter, AFAIK there is no spec for how it would work even in principle. Presumably it wouldn't work quite like VFIO -- instead, the kernel could have a protection-virtual-io-fd mechanism, and that fd could be bound to a memslot in whatever way we settle on for binding secure memory to a memslot. >>>>> >>>> >>>> Thanks Andy. I was asking the first scenario. >>>> >>>> Well, I agree it is doable if someone really want some assigned >>>> device in TD guest. As Kevin mentioned in his reply, HPA can be >>>> generated, by extending VFIO with a new mapping protocol which >>>> uses fd+offset, instead of HVA. >>> >>> I'm confused. I don't see why any new code is needed for this at all. >>> Every proposal I've seen for handling TDX memory continues to handle TDX >>> *shared* memory exactly like regular guest memory today. The only >>> differences are that more hole punching will be needed, which will >>> require lightweight memslots (to have many of them), memslots with >>> holes, or mappings backing memslots with holes (which can be done with >>> munmap() on current kernels). >>> >>> So you can literally just mmap a VFIO device and expect it to work, >>> exactly like it does right now. Whether the guest will be willing to >>> use the device will depend on the guest security policy (all kinds of >>> patches about that are flying around), but if the guest tries to use it, >>> it really should just work. >> >> ... 
but if you end up mapping private memory into IOMMU of the device and >> the device ends up accessing that memory, we're in the same position that >> the host might get capped, just like access from user space, no? > > Well, according to the spec: > > - If the result of the translation results in a physical address with a TD > private key ID, then the IOMMU will abort the transaction and report a VT-d > DMA remapping failure. > > - If the GPA in the transaction that is input to the IOMMU is private (SHARED > bit is 0), then the IOMMU may abort the transaction and report a VT-d DMA > remapping failure. > > So I guess mapping private GPAs in IOMMU is not that dangerous as mapping > into userspace. Though still wrong. > >> >> Sure, you can map only the complete duplicate shared-memory region into the >> IOMMU of the device, that would work. Shame vfio mostly always pins all >> guest memory and you essentially cannot punch holes into the shared memory >> anymore -- resulting in the worst case in a duplicate memory consumption for >> your VM. >> >> So you'd actually want to map only the *currently* shared pieces into the >> IOMMU and update the mappings on demand. Having worked on something related, > > Exactly. On demand mapping and page pinning for shared memory is necessary. > >> I can only say that 64k individual mappings, and not being able to modify >> existing mappings except completely deleting them to replace with something >> new (!atomic), can be quite an issue for bigger VMs. > > Do you mean atomicity in mapping/unmapping can hardly be guaranteed during > the shared <-> private transition? May I ask for elaboration? Thanks! If we expect to really only have little shared memory, and expect a VM always has no shared memory when booting up (I think this is the case), I guess this could work. The issue is if the guest e.g., makes contiguous 2 MiB shared and later wants to unshare individual pages/parts. You'll have to DMA map the 2 MiB in page granularity, otherwise you'll have to DMA unmap 2 MiB and DMA remap all still-shared pieces. That is not atomic and can be problematic if the device is accessing some of the shared parts at that time. Consequently that means, that large shared regions can be problematic when mapped, because we'll have to map in page granularity. We have 64k such individual mappings in general. 64k * 4KiB == 256 MiB Not sure if there would be use cases, e.g., with GPGPUs and similar, where you'd want to share a lot of memory with a device ... But these are just my thoughts, maybe I am missing something important.
On Wed, Sep 01, 2021 at 09:07:59AM -0700, Andy Lutomirski wrote: > In principle, you could actually initialize a TDX guest with all of its > memory shared and all of it mapped in the host IOMMU. Not sure how this works in TDX, but in SEV code fetches are always treated as encrypted. So this approach would not work with SEV, not to speak about attestation, which will not work with this approach either :) Regards, Joerg
On Wed, Sep 01, 2021 at 10:08:33AM -0700, Andy Lutomirski wrote: > I asked David, and he said the PSP offers a swapping mechanism for > SEV-ES. I haven’t read the details, but they should all be public. Right, the details are here: https://www.amd.com/system/files/TechDocs/55766_SEV-KM_API_Specification.pdf Namely section 6.25 and 6.26 which describe the SWAP_OUT and SWAP_IN commands. Regards, Joerg
On Thu, Sep 02, 2021 at 10:44:02AM +0200, David Hildenbrand wrote: > On 02.09.21 10:34, Yu Zhang wrote: > > On Wed, Sep 01, 2021 at 06:27:20PM +0200, David Hildenbrand wrote: > > > On 01.09.21 18:07, Andy Lutomirski wrote: > > > > On 9/1/21 3:24 AM, Yu Zhang wrote: > > > > > On Tue, Aug 31, 2021 at 09:53:27PM -0700, Andy Lutomirski wrote: > > > > > > > > > > > > > > > > > > On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote: > > > > > > > On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote: > > > > > > > > > > > > > Thanks a lot for this summary. A question about the requirement: do we or > > > > > > > do we not have plan to support assigned device to the protected VM? > > > > > > > > > > > > > > If yes. The fd based solution may need change the VFIO interface as well( > > > > > > > though the fake swap entry solution need mess with VFIO too). Because: > > > > > > > > > > > > > > 1> KVM uses VFIO when assigning devices into a VM. > > > > > > > > > > > > > > 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all > > > > > > > guest pages will have to be mapped in host IOMMU page table to host pages, > > > > > > > which are pinned during the whole life cycle fo the VM. > > > > > > > > > > > > > > 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver, > > > > > > > in vfio_dma_do_map(). > > > > > > > > > > > > > > 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA > > > > > > > and pin the page. > > > > > > > > > > > > > > But if we are using fd based solution, not every GPA can have a HVA, thus > > > > > > > the current VFIO interface to map and pin the GPA(IOVA) wont work. And I > > > > > > > doubt if VFIO can be modified to support this easily. > > > > > > > > > > > > > > > > > > > > > > > > > > Do you mean assigning a normal device to a protected VM or a hypothetical protected-MMIO device? > > > > > > > > > > > > If the former, it should work more or less like with a non-protected VM. mmap the VFIO device, set up a memslot, and use it. I'm not sure whether anyone will actually do this, but it should be possible, at least in principle. Maybe someone will want to assign a NIC to a TDX guest. An NVMe device with the understanding that the guest can't trust it wouldn't be entirely crazy ether. > > > > > > > > > > > > If the latter, AFAIK there is no spec for how it would work even in principle. Presumably it wouldn't work quite like VFIO -- instead, the kernel could have a protection-virtual-io-fd mechanism, and that fd could be bound to a memslot in whatever way we settle on for binding secure memory to a memslot. > > > > > > > > > > > > > > > > Thanks Andy. I was asking the first scenario. > > > > > > > > > > Well, I agree it is doable if someone really want some assigned > > > > > device in TD guest. As Kevin mentioned in his reply, HPA can be > > > > > generated, by extending VFIO with a new mapping protocol which > > > > > uses fd+offset, instead of HVA. > > > > > > > > I'm confused. I don't see why any new code is needed for this at all. > > > > Every proposal I've seen for handling TDX memory continues to handle TDX > > > > *shared* memory exactly like regular guest memory today. The only > > > > differences are that more hole punching will be needed, which will > > > > require lightweight memslots (to have many of them), memslots with > > > > holes, or mappings backing memslots with holes (which can be done with > > > > munmap() on current kernels). 
> > > > > > > > So you can literally just mmap a VFIO device and expect it to work, > > > > exactly like it does right now. Whether the guest will be willing to > > > > use the device will depend on the guest security policy (all kinds of > > > > patches about that are flying around), but if the guest tries to use it, > > > > it really should just work. > > > > > > ... but if you end up mapping private memory into IOMMU of the device and > > > the device ends up accessing that memory, we're in the same position that > > > the host might get capped, just like access from user space, no? > > > > Well, according to the spec: > > > > - If the result of the translation results in a physical address with a TD > > private key ID, then the IOMMU will abort the transaction and report a VT-d > > DMA remapping failure. > > > > - If the GPA in the transaction that is input to the IOMMU is private (SHARED > > bit is 0), then the IOMMU may abort the transaction and report a VT-d DMA > > remapping failure. > > > > So I guess mapping private GPAs in IOMMU is not that dangerous as mapping > > into userspace. Though still wrong. > > > > > > > > Sure, you can map only the complete duplicate shared-memory region into the > > > IOMMU of the device, that would work. Shame vfio mostly always pins all > > > guest memory and you essentially cannot punch holes into the shared memory > > > anymore -- resulting in the worst case in a duplicate memory consumption for > > > your VM. > > > > > > So you'd actually want to map only the *currently* shared pieces into the > > > IOMMU and update the mappings on demand. Having worked on something related, > > > > Exactly. On demand mapping and page pinning for shared memory is necessary. > > > > > I can only say that 64k individual mappings, and not being able to modify > > > existing mappings except completely deleting them to replace with something > > > new (!atomic), can be quite an issue for bigger VMs. > > > > Do you mean atomicity in mapping/unmapping can hardly be guaranteed during > > the shared <-> private transition? May I ask for elaboration? Thanks! > > If we expect to really only have little shared memory, and expect a VM > always has no shared memory when booting up (I think this is the case), I > guess this could work. > > The issue is if the guest e.g., makes contiguous 2 MiB shared and later > wants to unshare individual pages/parts. > > You'll have to DMA map the 2 MiB in page granularity, otherwise you'll have > to DMA unmap 2 MiB and DMA remap all still-shared pieces. That is not atomic > and can be problematic if the device is accessing some of the shared parts > at that time. > > Consequently that means, that large shared regions can be problematic when > mapped, because we'll have to map in page granularity. We have 64k such > individual mappings in general. > > 64k * 4KiB == 256 MiB > > Not sure if there would be use cases, e.g., with GPGPUs and similar, where > you'd want to share a lot of memory with a device ... > > But these are just my thoughts, maybe I am missing something important. Good point. And thanks! So to guarantee the atomicity, we shall only use 4K mappings for shared pages, thus having to bear with bigger memory footprint and more IOTLB miss penalties. :) B.R. Yu
>> >> In principle, you could actually initialize a TDX guest with all of its >> memory shared and all of it mapped in the host IOMMU. When a guest >> turns some pages private, user code could punch a hole in the memslot, >> allocate private memory at that address, but leave the shared backing >> store in place and still mapped in the host IOMMU. The result would be >> that guest-initiated DMA to the previously shared address would actually >> work but would hit pages that are invisible to the guest. And a whole >> bunch of memory would be waste, but the whole system should stll work. > > Do you mean to let VFIO & IOMMU to treat all guest memory as shared first, > and then just allocate the private pages in another backing store? I guess > that could work, but with the cost of allocating roughly 2x physical pages > of the guest RAM size. After all, the shared pages shall be only a small > part of guest memory. Yes. My point is that I don't think there should be any particular danger in leaving the VFIO code alone as part of TDX enablement. The code ought to *work* even if it will be wildly inefficient. If someone cares to make it work better, they're welcome to do so. --Andy
On 9/2/21 2:27 AM, Joerg Roedel wrote: > On Wed, Sep 01, 2021 at 09:07:59AM -0700, Andy Lutomirski wrote: >> In principle, you could actually initialize a TDX guest with all of its >> memory shared and all of it mapped in the host IOMMU. > > Not sure how this works in TDX, but in SEV code fetches are always > treated as encrypted. So this approach would not work with SEV, not to > speak about attestation, which will not work with this approach either > :) > Oof.
Hi folks,

I try to sketch how the memfd changes would look.

I've added F_SEAL_GUEST. The new seal is only allowed if there are no
pre-existing pages in the fd (i_mapping->nrpages check) and there is no
existing mapping of the file (RB_EMPTY_ROOT(&i_mapping->i_mmap.rb_root) check).

After the seal is set, no read/write/mmap from userspace is allowed.

Although it's not clear how to serialize the read check vs. seal setup: the seal
is protected with inode_lock(), which we don't hold in the read path because it
is expensive. I don't know yet how to get it right. For TDX, it's okay to
allow read as it cannot trigger #MCE. Maybe we can allow it?

Truncate and punch hole are tricky.

We want to allow them to save memory if a substantial range is converted to
shared. Partial truncate and punch hole effectively write zeros to the
partially truncated page and may lead to #MCE. We can reject any partial
truncate/punch requests, but that doesn't help the situation with THPs.

If we truncate to the middle of a THP page, we try to split it into small
pages and proceed as usual for small pages. But the split is allowed to fail.
If that happens we zero part of the THP. I guess we may reject the truncate if
the split fails. It should work fine if we only use it for saving memory.

We need to modify the truncation/punch path to notify KVM that pages are about
to be freed. I think we will register a callback in the memfd on adding the
fd to a KVM memslot that is going to be called for the notification. That means
1:1 between memfd and memslot. I guess it's okay.

Migration is going to always fail on F_SEAL_GUEST for now. It can be modified to
use a callback in the future.

Swapout will also always fail on F_SEAL_GUEST. It seems trivial. Again, it
can be a callback in the future.

For GPA->PFN translation KVM could use vm_ops->fault(). Semantically it is
a good fit, but we don't have any VMAs around and ->mmap is forbidden for
F_SEAL_GUEST. The other option is to call shmem_getpage() directly, but it looks
like a layering violation to me. And it's not available to modules :/

Any comments?
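[Editor's note: to make the seal-setup check described above concrete, here is a minimal sketch of how the F_SEAL_GUEST branch of memfd_add_seals() could look. It is illustrative only; F_SEAL_GUEST is the proposed new seal, and the locking against a concurrent read() is exactly the open question mentioned in the mail, so the error codes and placement are guesses.]

    /*
     * Illustrative only: reject F_SEAL_GUEST if the file already has pages
     * in the page cache or is mapped anywhere. Serialization against the
     * read path is not solved here.
     */
    static int memfd_seal_guest(struct inode *inode, unsigned int *file_seals)
    {
            struct address_space *mapping = inode->i_mapping;

            /* No pages may already exist in the fd ... */
            if (mapping->nrpages)
                    return -EBUSY;

            /* ... and the file must not be mapped by anyone. */
            if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
                    return -EBUSY;

            *file_seals |= F_SEAL_GUEST;
            return 0;
    }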
On Thu, Sep 02, 2021, Andy Lutomirski wrote: > On 9/2/21 2:27 AM, Joerg Roedel wrote: > > On Wed, Sep 01, 2021 at 09:07:59AM -0700, Andy Lutomirski wrote: > >> In principle, you could actually initialize a TDX guest with all of its > >> memory shared and all of it mapped in the host IOMMU. > > > > Not sure how this works in TDX, but in SEV code fetches are always > > treated as encrypted. So this approach would not work with SEV, not to > > speak about attestation, which will not work with this approach either > > :) > > > > Oof. TDX is kinda similar. _All_ accesses are private if paging is disabled because the shared bit is either bit 48 or bit 51 in the GPA, i.e. can't be reached if paging is disabled. The vCPU is hardcoded to start in unpaged protected mode, so at least some amount of guest memory needs to be private. I also could've sworn code fetches from shared memory would #VE, but I can't find anything in the specs that confirm that. I may be conflating TDX with SGX's #GP on a code fetch outside of ELRANGE...
On 9/2/21 11:57 AM, Sean Christopherson wrote: > On Thu, Sep 02, 2021, Andy Lutomirski wrote: >> On 9/2/21 2:27 AM, Joerg Roedel wrote: >>> On Wed, Sep 01, 2021 at 09:07:59AM -0700, Andy Lutomirski wrote: >>>> In principle, you could actually initialize a TDX guest with all of its >>>> memory shared and all of it mapped in the host IOMMU. >>> Not sure how this works in TDX, but in SEV code fetches are always >>> treated as encrypted. So this approach would not work with SEV, not to >>> speak about attestation, which will not work with this approach either >>> :) >>> >> Oof. > TDX is kinda similar. _All_ accesses are private if paging is disabled because > the shared bit is either bit 48 or bit 51 in the GPA, i.e. can't be reached if > paging is disabled. The vCPU is hardcoded to start in unpaged protected mode, > so at least some amount of guest memory needs to be private. That's a rule we should definitely add to our page table checker. Just like how we can look for W+X, we should also look for Shared+X.
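[Editor's note: a rough illustration of the Shared+X rule suggested above, modeled on the existing W+X page table audit in the guest kernel. This is hypothetical: tdx_shared_mask() stands in for whatever helper ends up exposing the GPA shared-bit mask, and the check is only a sketch of the idea, not an existing interface.]

    /*
     * Hypothetical addition to the guest's W+X-style page table checker:
     * flag any present, executable mapping whose physical address has the
     * TDX SHARED bit set. tdx_shared_mask() is assumed to return that bit.
     */
    static bool pte_is_shared_exec(pte_t pte)
    {
            if (!(pte_flags(pte) & _PAGE_PRESENT))
                    return false;

            return !(pte_flags(pte) & _PAGE_NX) &&
                   (pte_val(pte) & tdx_shared_mask());
    }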
On Thu, Sep 02, 2021, Kirill A. Shutemov wrote:
> Hi folks,
>
> I try to sketch how the memfd changes would look.
>
> I've added F_SEAL_GUEST. The new seal is only allowed if there are no
> pre-existing pages in the fd (i_mapping->nrpages check) and there is no
> existing mapping of the file (RB_EMPTY_ROOT(&i_mapping->i_mmap.rb_root) check).
>
> After the seal is set, no read/write/mmap from userspace is allowed.
>
> Although it's not clear how to serialize the read check vs. seal setup: the seal
> is protected with inode_lock(), which we don't hold in the read path because it
> is expensive. I don't know yet how to get it right. For TDX, it's okay to
> allow read as it cannot trigger #MCE. Maybe we can allow it?

Would requiring the size to be '0' at F_SEAL_GUEST time solve that problem?

> Truncate and punch hole are tricky.
>
> We want to allow them to save memory if a substantial range is converted to
> shared. Partial truncate and punch hole effectively write zeros to the
> partially truncated page and may lead to #MCE. We can reject any partial
> truncate/punch requests, but that doesn't help the situation with THPs.
>
> If we truncate to the middle of a THP page, we try to split it into small
> pages and proceed as usual for small pages. But the split is allowed to fail.
> If that happens we zero part of the THP. I guess we may reject the truncate if
> the split fails. It should work fine if we only use it for saving memory.

FWIW, splitting a THP will also require a call into KVM to demote the huge page
to the equivalent small pages.

> We need to modify the truncation/punch path to notify KVM that pages are about
> to be freed. I think we will register a callback in the memfd on adding the
> fd to a KVM memslot that is going to be called for the notification. That means
> 1:1 between memfd and memslot. I guess it's okay.

Hmm, 1:1 memfd to memslot will be problematic as that would prevent punching a
hole in KVM's memslots, e.g. to convert a subset to shared. It would also
disallow backing guest memory with a single memfd that's split across two
memslots for <4gb and >4gb.

But I don't think we need a 1:1 relationship. To keep KVM sane, we can require
each private memslot to be wholly contained in a single memfd; I can't think of
any reason that would be problematic for userspace.

For the callbacks, I believe the rule should be 1:1 between memfd and KVM instance.
That would allow mapping multiple memslots to a single memfd so long as they're
all coming from the same KVM instance.

> Migration is going to always fail on F_SEAL_GUEST for now. It can be modified to
> use a callback in the future.
>
> Swapout will also always fail on F_SEAL_GUEST. It seems trivial. Again, it
> can be a callback in the future.
>
> For GPA->PFN translation KVM could use vm_ops->fault(). Semantically it is
> a good fit, but we don't have any VMAs around and ->mmap is forbidden for
> F_SEAL_GUEST. The other option is to call shmem_getpage() directly, but it looks
> like a layering violation to me. And it's not available to modules :/

My idea for this was to have the memfd:KVM exchange callbacks, i.e. memfd would
have callbacks into KVM, but KVM would also have callbacks into memfd. To avoid
circular refcounts, KVM would hold a reference to the memfd (since it's the
instigator) and KVM would be responsible for unregistering itself before freeing
its reference to the memfd.

The memfd callbacks would be tracked per private memslot, which meshes nicely
with how KVM uses memslots to translate gfn->pfn.
In effect, the ops pointer in the memslots replaces the host virtual address that's used to get the pfn for non-private memslots.

@@ -2428,8 +2453,12 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
 			       bool atomic, bool *async, bool write_fault,
 			       bool *writable, hva_t *hva)
 {
-	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
+	unsigned long addr;
+	if (memslot_is_private(slot))
+		return slot->private_ops->gfn_to_pfn(...);
+
+	addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);

 	if (hva)
 		*hva = addr;

>
> Any comments?
>
> --
>  Kirill A. Shutemov
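A rough illustration of what such per-memslot ops might look like on the KVM side is sketched below; the structure name, its fields, and the memslot_is_private() helper are placeholders invented for this sketch, not an existing KVM API.

/*
 * Sketch only.  The idea is that a private memslot carries ops provided by
 * the backing store (e.g. memfd/shmem) instead of a host virtual address.
 */
struct kvm_private_memslot_ops {
	/* Translate a gfn within the slot to a pfn, taking a lock/pin. */
	kvm_pfn_t (*gfn_to_pfn)(struct kvm_memory_slot *slot, gfn_t gfn);
	/* Release whatever gfn_to_pfn() took once the pfn is installed. */
	void (*put_pfn)(kvm_pfn_t pfn);
};

static inline bool memslot_is_private(const struct kvm_memory_slot *slot)
{
	return slot->private_ops != NULL;	/* assumed new memslot field */
}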
On Thu, Sep 2, 2021, at 12:07 PM, Dave Hansen wrote: > On 9/2/21 11:57 AM, Sean Christopherson wrote: > > On Thu, Sep 02, 2021, Andy Lutomirski wrote: > >> On 9/2/21 2:27 AM, Joerg Roedel wrote: > >>> On Wed, Sep 01, 2021 at 09:07:59AM -0700, Andy Lutomirski wrote: > >>>> In principle, you could actually initialize a TDX guest with all of its > >>>> memory shared and all of it mapped in the host IOMMU. > >>> Not sure how this works in TDX, but in SEV code fetches are always > >>> treated as encrypted. So this approach would not work with SEV, not to > >>> speak about attestation, which will not work with this approach either > >>> :) > >>> > >> Oof. > > TDX is kinda similar. _All_ accesses are private if paging is disabled because > > the shared bit is either bit 48 or bit 51 in the GPA, i.e. can't be reached if > > paging is disabled. The vCPU is hardcoded to start in unpaged protected mode, > > so at least some amount of guest memory needs to be private. > > That's a rule we should definitely add to our page table checker. Just > like how we can look for W+X, we should also look for Shared+X. > The only case I can think of where the TDX vs SEV rule matters is for some mildly crazy user who wants to run user code out of an unencrypted DAX device (or virtio-fs, I guess). We can save that for another year :)
On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson wrote: > Would requiring the size to be '0' at F_SEAL_GUEST time solve that problem? I guess. Maybe we would need a WRITE_ONCE() on set. I donno. I will look closer into locking next. > FWIW, splitting a THP will also require a call into KVM to demote the huge page > to the equivalent small pages. But will it though? Even if the page got split we still don't allow migration so EPT can stay 2M. Or do I miss something? For truncate/punch I disallowed non-page-aligned operations. I ignore failed split. Yes, we zero the truncated parts, but these parts are already removed from KVM, so it should be safe. > > We need to modify truncation/punch path to notify kvm that pages are about > > to be freed. I think we will register callback in the memfd on adding the > > fd to KVM memslot that going to be called for the notification. That means > > 1:1 between memfd and memslot. I guess it's okay. > > Hmm, 1:1 memfd to memslot will be problematic as that would prevent punching a > hole in KVM's memslots, e.g. to convert a subset to shared. It would also > disallow backing guest memory with a single memfd that's split across two > memslots for <4gb and >4gb. > > But I don't think we need a 1:1 relationship. To keep KVM sane, we can require > each private memslot to be wholly contained in a single memfd, I can't think of > any reason that would be problematic for userspace. > > For the callbacks, I believe the rule should be 1:1 between memfd and KVM instance. > That would allow mapping multiple memslots to a single memfd so long as they're > all coming from the same KVM instance. Okay, I've added a guest_owner field that I check on registering a guest. I don't think we want to tie it directly to kvm, so I left it as just a void *. > > > Migration going to always fail on F_SEAL_GUEST for now. Can be modified to > > use a callback in the future. > > > > Swapout will also always fail on F_SEAL_GUEST. It seems trivial. Again, it > > can be a callback in the future. > > > > For GPA->PFN translation KVM could use vm_ops->fault(). Semantically it is > > a good fit, but we don't have any VMAs around and ->mmap is forbidden for > > F_SEAL_GUEST. > > Other option is call shmem_getpage() directly, but it looks like a > > layering violation to me. And it's not available to modules :/ > > My idea for this was to have the memfd:KVM exchange callbacks, i.e. memfd would > have callbacks into KVM, but KVM would also have callbacks into memfd. To avoid > circular refcounts, KVM would hold a reference to the memfd (since it's the > instigator) and KVM would be responsible for unregistering itself before freeing > it's reference to the memfd. Okay, I went this path. This callback exchange looks weird to me, but I suppose it works. Below is how I see it could look. It is only compile-tested and might be broken in many ways. I have not found the right abstraction for shmem_test_guest(). Maybe it has to be a new inode operation? I donno. For initial prototyping we can call it directly. Any comments?
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 8e775ce517bb..6f94fc46a3b1 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -10,6 +10,17 @@ #include <linux/xattr.h> #include <linux/fs_parser.h> +struct guest_ops { + void (*invalidate_page_range)(struct inode *inode, + pgoff_t start, pgoff_t end); +}; + +struct guest_mem_ops { + unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset); + void (*put_unlock_pfn)(unsigned long pfn); + +}; + /* inode in-kernel data */ struct shmem_inode_info { @@ -24,6 +35,8 @@ struct shmem_inode_info { struct simple_xattrs xattrs; /* list of xattrs */ atomic_t stop_eviction; /* hold when working on inode */ struct inode vfs_inode; + void *guest_owner; + const struct guest_ops *guest_ops; }; struct shmem_sb_info { diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 2f86b2ad6d7e..c79bc8572721 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -43,6 +43,7 @@ #define F_SEAL_GROW 0x0004 /* prevent file from growing */ #define F_SEAL_WRITE 0x0008 /* prevent writes */ #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ +#define F_SEAL_GUEST 0x0020 /* (1U << 31) is reserved for signed error codes */ /* diff --git a/mm/memfd.c b/mm/memfd.c index 081dd33e6a61..ccaa4edb21f7 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -134,7 +134,8 @@ static unsigned int *memfd_file_seals_ptr(struct file *file) F_SEAL_SHRINK | \ F_SEAL_GROW | \ F_SEAL_WRITE | \ - F_SEAL_FUTURE_WRITE) + F_SEAL_FUTURE_WRITE | \ + F_SEAL_GUEST) static int memfd_add_seals(struct file *file, unsigned int seals) { @@ -203,10 +204,27 @@ static int memfd_add_seals(struct file *file, unsigned int seals) } } + if (seals & F_SEAL_GUEST) { + i_mmap_lock_read(inode->i_mapping); + + if (!RB_EMPTY_ROOT(&inode->i_mapping->i_mmap.rb_root)) { + error = -EBUSY; + goto unlock; + } + + if (i_size_read(inode)) { + error = -EBUSY; + goto unlock; + } + } + *file_seals |= seals; error = 0; unlock: + if (seals & F_SEAL_GUEST) + i_mmap_unlock_read(inode->i_mapping); + inode_unlock(inode); return error; } diff --git a/mm/shmem.c b/mm/shmem.c index dacda7463d54..761cf4e2152e 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -883,6 +883,17 @@ static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end) return split_huge_page(page) >= 0; } +static void guest_invalidate_page(struct inode *inode, + struct page *page, pgoff_t start, pgoff_t end) +{ + struct shmem_inode_info *info = SHMEM_I(inode); + + start = max(start, page->index); + end = min(end, page->index + HPAGE_PMD_NR) - 1; + + info->guest_ops->invalidate_page_range(inode, start, end); +} + /* * Remove range of pages and swap entries from page cache, and free them. * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate. 
@@ -923,6 +934,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, } index += thp_nr_pages(page) - 1; + guest_invalidate_page(inode, page, start, end); + if (!unfalloc || !PageUptodate(page)) truncate_inode_page(mapping, page); unlock_page(page); @@ -999,6 +1012,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, index--; break; } + + guest_invalidate_page(inode, page, start, end); + VM_BUG_ON_PAGE(PageWriteback(page), page); if (shmem_punch_compound(page, start, end)) truncate_inode_page(mapping, page); @@ -1074,6 +1090,9 @@ static int shmem_setattr(struct user_namespace *mnt_userns, (newsize > oldsize && (info->seals & F_SEAL_GROW))) return -EPERM; + if ((info->seals & F_SEAL_GUEST) && (newsize & ~PAGE_MASK)) + return -EINVAL; + if (newsize != oldsize) { error = shmem_reacct_size(SHMEM_I(inode)->flags, oldsize, newsize); @@ -1348,6 +1367,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) goto redirty; if (!total_swap_pages) goto redirty; + if (info->seals & F_SEAL_GUEST) + goto redirty; /* * Our capabilities prevent regular writeback or sync from ever calling @@ -2274,6 +2295,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma) if (ret) return ret; + if (info->seals & F_SEAL_GUEST) + return -EPERM; + /* arm64 - allow memory tagging on RAM-based files */ vma->vm_flags |= VM_MTE_ALLOWED; @@ -2471,12 +2495,14 @@ shmem_write_begin(struct file *file, struct address_space *mapping, pgoff_t index = pos >> PAGE_SHIFT; /* i_mutex is held by caller */ - if (unlikely(info->seals & (F_SEAL_GROW | - F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))) { + if (unlikely(info->seals & (F_SEAL_GROW | F_SEAL_WRITE | + F_SEAL_FUTURE_WRITE | F_SEAL_GUEST))) { if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) return -EPERM; if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size) return -EPERM; + if (info->seals & F_SEAL_GUEST) + return -EPERM; } return shmem_getpage(inode, index, pagep, SGP_WRITE); @@ -2530,6 +2556,8 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) ssize_t retval = 0; loff_t *ppos = &iocb->ki_pos; + if (SHMEM_I(inode)->seals & F_SEAL_GUEST) + return -EPERM; /* * Might this read be for a stacking filesystem? 
Then when reading * holes of a sparse file, we actually need to allocate those pages, @@ -2675,6 +2703,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, goto out; } + if ((info->seals & F_SEAL_GUEST) && + (offset & ~PAGE_MASK || len & ~PAGE_MASK)) { + error = -EINVAL; + goto out; + } + shmem_falloc.waitq = &shmem_falloc_waitq; shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT; shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT; @@ -3761,6 +3795,20 @@ static void shmem_destroy_inodecache(void) kmem_cache_destroy(shmem_inode_cachep); } +#ifdef CONFIG_MIGRATION +int shmem_migrate_page(struct address_space *mapping, + struct page *newpage, struct page *page, + enum migrate_mode mode) +{ + struct inode *inode = mapping->host; + struct shmem_inode_info *info = SHMEM_I(inode); + + if (info->seals & F_SEAL_GUEST) + return -ENOTSUPP; + return migrate_page(mapping, newpage, page, mode); +} +#endif + const struct address_space_operations shmem_aops = { .writepage = shmem_writepage, .set_page_dirty = __set_page_dirty_no_writeback, @@ -3769,12 +3817,57 @@ const struct address_space_operations shmem_aops = { .write_end = shmem_write_end, #endif #ifdef CONFIG_MIGRATION - .migratepage = migrate_page, + .migratepage = shmem_migrate_page, #endif .error_remove_page = generic_error_remove_page, }; EXPORT_SYMBOL(shmem_aops); +static unsigned long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset) +{ + struct page *page; + int ret; + + ret = shmem_getpage(inode, offset, &page, SGP_WRITE); + if (ret) + return ret; + + return page_to_pfn(page); +} + +static void shmem_put_unlock_pfn(unsigned long pfn) +{ + struct page *page = pfn_to_page(pfn); + + VM_BUG_ON_PAGE(!PageLocked(page), page); + + set_page_dirty(page); + unlock_page(page); + put_page(page); +} + +static const struct guest_mem_ops shmem_guest_ops = { + .get_lock_pfn = shmem_get_lock_pfn, + .put_unlock_pfn = shmem_put_unlock_pfn, +}; + +int shmem_register_guest(struct inode *inode, void *owner, + const struct guest_ops *guest_ops, + const struct guest_mem_ops **guest_mem_ops) +{ + struct shmem_inode_info *info = SHMEM_I(inode); + + if (!owner) + return -EINVAL; + + if (info->guest_owner && info->guest_owner != owner) + return -EPERM; + + info->guest_ops = guest_ops; + *guest_mem_ops = &shmem_guest_ops; + return 0; +} + static const struct file_operations shmem_file_operations = { .mmap = shmem_mmap, .get_unmapped_area = shmem_get_unmapped_area,
On 9/3/21 12:14 PM, Kirill A. Shutemov wrote: > On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson wrote: >> Would requiring the size to be '0' at F_SEAL_GUEST time solve that problem? > > I guess. Maybe we would need a WRITE_ONCE() on set. I donno. I will look > closer into locking next. We can decisively eliminate this sort of failure by making the switch happen at open time instead of after. For a memfd-like API, this would be straightforward. For a filesystem, it would take a bit more thought.
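As a loose illustration of the "switch at open time" idea, the guest-private property could be requested when the memfd is created instead of being sealed on later; the MFD_GUEST flag below is purely hypothetical and not an existing kernel ABI value.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

#ifndef MFD_GUEST
#define MFD_GUEST 0x0008U	/* hypothetical flag, for illustration only */
#endif

/* Create an fd that was never readable, writable or mappable by userspace. */
int create_guest_memfd(size_t size)
{
	int fd = memfd_create("guest-private", MFD_CLOEXEC | MFD_GUEST);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, size) < 0) {	/* the size can still be set afterwards */
		close(fd);
		return -1;
	}
	return fd;
}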
On Thu, Sep 02, 2021 at 04:19:23PM +0800, Yu Zhang wrote: > On Wed, Sep 01, 2021 at 09:07:59AM -0700, Andy Lutomirski wrote: > > On 9/1/21 3:24 AM, Yu Zhang wrote: > > > On Tue, Aug 31, 2021 at 09:53:27PM -0700, Andy Lutomirski wrote: > > >> > > >> > > >> On Thu, Aug 26, 2021, at 7:31 PM, Yu Zhang wrote: > > >>> On Thu, Aug 26, 2021 at 12:15:48PM +0200, David Hildenbrand wrote: > > >> > > >>> Thanks a lot for this summary. A question about the requirement: do we or > > >>> do we not have plan to support assigned device to the protected VM? > > >>> > > >>> If yes. The fd based solution may need change the VFIO interface as well( > > >>> though the fake swap entry solution need mess with VFIO too). Because: > > >>> > > >>> 1> KVM uses VFIO when assigning devices into a VM. > > >>> > > >>> 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all > > >>> guest pages will have to be mapped in host IOMMU page table to host pages, > > >>> which are pinned during the whole life cycle fo the VM. > > >>> > > >>> 3> IOMMU mapping is done during VM creation time by VFIO and IOMMU driver, > > >>> in vfio_dma_do_map(). > > >>> > > >>> 4> However, vfio_dma_do_map() needs the HVA to perform a GUP to get the HPA > > >>> and pin the page. > > >>> > > >>> But if we are using fd based solution, not every GPA can have a HVA, thus > > >>> the current VFIO interface to map and pin the GPA(IOVA) wont work. And I > > >>> doubt if VFIO can be modified to support this easily. > > >>> > > >>> > > >> > > >> Do you mean assigning a normal device to a protected VM or a hypothetical protected-MMIO device? > > >> > > >> If the former, it should work more or less like with a non-protected VM. mmap the VFIO device, set up a memslot, and use it. I'm not sure whether anyone will actually do this, but it should be possible, at least in principle. Maybe someone will want to assign a NIC to a TDX guest. An NVMe device with the understanding that the guest can't trust it wouldn't be entirely crazy ether. > > >> > > >> If the latter, AFAIK there is no spec for how it would work even in principle. Presumably it wouldn't work quite like VFIO -- instead, the kernel could have a protection-virtual-io-fd mechanism, and that fd could be bound to a memslot in whatever way we settle on for binding secure memory to a memslot. > > >> > > > > > > Thanks Andy. I was asking the first scenario. > > > > > > Well, I agree it is doable if someone really want some assigned > > > device in TD guest. As Kevin mentioned in his reply, HPA can be > > > generated, by extending VFIO with a new mapping protocol which > > > uses fd+offset, instead of HVA. > > > > I'm confused. I don't see why any new code is needed for this at all. > > Every proposal I've seen for handling TDX memory continues to handle TDX > > *shared* memory exactly like regular guest memory today. The only > > differences are that more hole punching will be needed, which will > > require lightweight memslots (to have many of them), memslots with > > holes, or mappings backing memslots with holes (which can be done with > > munmap() on current kernels). > > Thanks for pointing this out. And yes, for DMAs not capable of encryption( > which is the case in current TDX). GUP shall work as it is in VFIO. :) > > > > > So you can literally just mmap a VFIO device and expect it to work, > > exactly like it does right now. 
Whether the guest will be willing to > > use the device will depend on the guest security policy (all kinds of > > patches about that are flying around), but if the guest tries to use it, > > it really should just work. > > > > But I think there's still problem. For now, > > 1> Qemu mmap()s all GPAs into its HVA space, when the VM is created. > 2> With no idea which part of guest memory shall be shared, VFIO will just > set up the IOPT, by mapping whole GPA ranges in IOPT. actually, it's not GPA in IOPT. It's IOVA in IOPT, and the IOVA is equal to GPA in this case. > 3> And those GPAs are actually private ones, with no shared-bit set. > in Guest, IOVA is set to GPA (without shared bit) for shared memory in TDX now. > Later when guest tries to perform a DMA(using a shared GPA), IO page fault > shall happen. So, this situation should not happen. > > > > > > Another issue is current TDX does not support DMA encryption, and > > > only shared GPA memory shall be mapped in the VT-d. So to support > > > this, KVM may need to work with VFIO to dynamically program host > > > IOPT(IOMMU Page Table) when TD guest notifies a shared GFN range(e.g., > > > with a MAP_GPA TDVMCALL), instead of prepopulating the IOPT at VM > > > creation time, by mapping entire GFN ranges of a guest. > > > > Given that there is no encrypted DMA support, shouldn't the only IOMMU > > mappings (real host-side IOMMU) that point at guest memory be for > > non-encrypted DMA? I don't see how this interacts at all. If the guest > > tries to MapGPA to turn a shared MMIO page into private, the host should > > fail the hypercall because the operation makes no sense. > > > > It is indeed the case that, with a TDX guest, MapGPA shared->private to > > a page that was previously used for unencrypted DMA will need to avoid > > having IOPT entries to the new private page, but even that doesn't seem > > particularly bad. The fd+special memslot proposal for private memory > > means that shared *backing store* pages never actually transition > > between shared and private without being completely freed. > > > > As far as I can tell, the actual problem you're referring to is: > > > > >>> 2> Not knowing which GPA ranges may be used by the VM as DMA buffer, all > > >>> guest pages will have to be mapped in host IOMMU page table to host > > pages, > > >>> which are pinned during the whole life cycle fo the VM. > > Yes. That's the primary concern. :) > > > > > In principle, you could actually initialize a TDX guest with all of its > > memory shared and all of it mapped in the host IOMMU. When a guest > > turns some pages private, user code could punch a hole in the memslot, > > allocate private memory at that address, but leave the shared backing > > store in place and still mapped in the host IOMMU. The result would be > > that guest-initiated DMA to the previously shared address would actually > > work but would hit pages that are invisible to the guest. And a whole > > bunch of memory would be waste, but the whole system should stll work. > > Do you mean to let VFIO & IOMMU to treat all guest memory as shared first, > and then just allocate the private pages in another backing store? I guess just a curious question. what if this shared->private conversion is on MMIO ranges that are mapped into the device? do you have enough info to get MMIO hpa for the private mapping? Thanks Yan > that could work, but with the cost of allocating roughly 2x physical pages > of the guest RAM size. 
After all, the shared pages shall be only a small > part of guest memory. > > If device assignment is desired in current TDX. My understanding of the > enabling work would be like this: > 1> Change qemu to not trigger VFIO_IOMMU_MAP_DMA for the TD, thus no IOPT > prepopulated, and no physical page allocated. > 2> KVM forwards MapGPA(private -> shared) request to Qemu. > 3> Qemu asks VFIO to pin and map the shared GPAs. > > For private -> shared transitions, the memslot punching, IOPT unmapping, > and iotlb flushing are necessary. Possibly new interface between VFIO and > KVM is needed. > > But actually I am not sure if people really want assigned device in current > TDX. Bottleneck of the performance should be the copying to/from swiotlb > buffers. > > B.R. > Yu >
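To make step 3> above a bit more concrete, a sketch of how a VMM might ask VFIO to map a range the guest has just converted to shared is below. Only the VFIO_IOMMU_MAP_DMA ioctl and its structure are existing ABI; the surrounding policy (when it is called and how the HVA is obtained) is an assumption.

#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Map a range the guest just converted to shared (via MapGPA) into the host
 * IOMMU.  'container' is an open /dev/vfio/vfio fd; 'hva' is the userspace
 * mapping backing the shared range.
 */
static int map_shared_range(int container, void *hva, __u64 iova, __u64 size)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (__u64)(unsigned long)hva,
		.iova  = iova,	/* IOVA == GPA without the shared bit here */
		.size  = size,
	};

	return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
}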
On 9/1/21 11:18 AM, James Bottomley wrote: > On Wed, 2021-09-01 at 08:54 -0700, Andy Lutomirski wrote: > [...] >> If you want to swap a page on TDX, you can't. Sorry, go directly to >> jail, do not collect $200. > > Actually, even on SEV-ES you can't either. You can read the encrypted > page and write it out if you want, but unless you swap it back to the > exact same physical memory location, the encryption key won't work. > Since we don't guarantee this for swap, I think swap won't actually > work for any confidential computing environment. There are SEV/SEV-ES and SEV-SNP APIs to swap a page out and in. There is also an API to copy (for SEV/SEV-ES) or move (for SEV-SNP) a page in memory to help with balancing if needed. These aren't currently implemented in the kernel, but can be. It requires the changes going in for live migration to recognize that a page is encrypted/private and invoke the APIs to transform the page before doing the operation. Thanks, Tom > >> So I think there are literally zero code paths that currently call >> try_to_unmap() that will actually work like that on TDX. If we run >> out of memory on a TDX host, we can kill the guest completely and >> reclaim all of its memory (which probably also involves killing QEMU >> or whatever other user program is in charge), but that's really our >> only option. > > I think our only option for swap is guest co-operation. We're going to > have to inflate a balloon or something in the guest and have the guest > driver do some type of bounce of the page, where it becomes an > unencrypted page in the guest (so the host can read it without the > physical address keying of the encryption getting in the way) but > actually encrypted with a swap transfer key known only to the guest. I > assume we can use the page acceptance infrastructure currently being > discussed elsewhere to do swap back in as well ... the host provides > the guest with the encrypted swap page and the guest has to decrypt it > and place it in encrypted guest memory. > > That way the swapped memory is securely encrypted, but can be swapped > back in. > > James > >
On Fri, Sep 03, 2021 at 12:15:51PM -0700, Andy Lutomirski wrote: > On 9/3/21 12:14 PM, Kirill A. Shutemov wrote: > > On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson wrote: > >> Would requiring the size to be '0' at F_SEAL_GUEST time solve that problem? > > > > I guess. Maybe we would need a WRITE_ONCE() on set. I donno. I will look > > closer into locking next. > > We can decisively eliminate this sort of failure by making the switch > happen at open time instead of after. For a memfd-like API, this would > be straightforward. For a filesystem, it would take a bit more thought. I think it should work fine as long as we check seals after i_size in the read path. See the comment in shmem_file_read_iter(). Below is updated version. I think it should be good enough to start integrate with KVM. I also attach a test-case that consists of kernel patch and userspace program. It demonstrates how it can be integrated into KVM code. One caveat I noticed is that guest_ops::invalidate_page_range() can be called after the owner (struct kvm) has being freed. It happens because memfd can outlive KVM. So the callback has to check if such owner exists, than check that there's a memslot with such inode. I guess it should be okay: we have vm_list we can check owner against. We may consider replace vm_list with something more scalable if number of VMs will get too high. Any comments? diff --git a/include/linux/memfd.h b/include/linux/memfd.h index 4f1600413f91..3005e233140a 100644 --- a/include/linux/memfd.h +++ b/include/linux/memfd.h @@ -4,13 +4,34 @@ #include <linux/file.h> +struct guest_ops { + void (*invalidate_page_range)(struct inode *inode, void *owner, + pgoff_t start, pgoff_t end); +}; + +struct guest_mem_ops { + unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset); + void (*put_unlock_pfn)(unsigned long pfn); + +}; + #ifdef CONFIG_MEMFD_CREATE extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg); + +extern inline int memfd_register_guest(struct inode *inode, void *owner, + const struct guest_ops *guest_ops, + const struct guest_mem_ops **guest_mem_ops); #else static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a) { return -EINVAL; } +static inline int memfd_register_guest(struct inode *inode, void *owner, + const struct guest_ops *guest_ops, + const struct guest_mem_ops **guest_mem_ops) +{ + return -EINVAL; +} #endif #endif /* __LINUX_MEMFD_H */ diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 8e775ce517bb..265d0c13bc5e 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -12,6 +12,9 @@ /* inode in-kernel data */ +struct guest_ops; +struct guest_mem_ops; + struct shmem_inode_info { spinlock_t lock; unsigned int seals; /* shmem seals */ @@ -24,6 +27,8 @@ struct shmem_inode_info { struct simple_xattrs xattrs; /* list of xattrs */ atomic_t stop_eviction; /* hold when working on inode */ struct inode vfs_inode; + void *guest_owner; + const struct guest_ops *guest_ops; }; struct shmem_sb_info { @@ -90,6 +95,10 @@ extern unsigned long shmem_swap_usage(struct vm_area_struct *vma); extern unsigned long shmem_partial_swap_usage(struct address_space *mapping, pgoff_t start, pgoff_t end); +extern int shmem_register_guest(struct inode *inode, void *owner, + const struct guest_ops *guest_ops, + const struct guest_mem_ops **guest_mem_ops); + /* Flag allocation requirements to shmem_getpage */ enum sgp_type { SGP_READ, /* don't exceed i_size, don't allocate page */ diff --git 
a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 2f86b2ad6d7e..c79bc8572721 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -43,6 +43,7 @@ #define F_SEAL_GROW 0x0004 /* prevent file from growing */ #define F_SEAL_WRITE 0x0008 /* prevent writes */ #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ +#define F_SEAL_GUEST 0x0020 /* (1U << 31) is reserved for signed error codes */ /* diff --git a/mm/memfd.c b/mm/memfd.c index 081dd33e6a61..ae43454789f4 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -130,11 +130,24 @@ static unsigned int *memfd_file_seals_ptr(struct file *file) return NULL; } +int memfd_register_guest(struct inode *inode, void *owner, + const struct guest_ops *guest_ops, + const struct guest_mem_ops **guest_mem_ops) +{ + if (shmem_mapping(inode->i_mapping)) { + return shmem_register_guest(inode, owner, + guest_ops, guest_mem_ops); + } + + return -EINVAL; +} + #define F_ALL_SEALS (F_SEAL_SEAL | \ F_SEAL_SHRINK | \ F_SEAL_GROW | \ F_SEAL_WRITE | \ - F_SEAL_FUTURE_WRITE) + F_SEAL_FUTURE_WRITE | \ + F_SEAL_GUEST) static int memfd_add_seals(struct file *file, unsigned int seals) { @@ -203,10 +216,27 @@ static int memfd_add_seals(struct file *file, unsigned int seals) } } + if (seals & F_SEAL_GUEST) { + i_mmap_lock_read(inode->i_mapping); + + if (!RB_EMPTY_ROOT(&inode->i_mapping->i_mmap.rb_root)) { + error = -EBUSY; + goto unlock; + } + + if (i_size_read(inode)) { + error = -EBUSY; + goto unlock; + } + } + *file_seals |= seals; error = 0; unlock: + if (seals & F_SEAL_GUEST) + i_mmap_unlock_read(inode->i_mapping); + inode_unlock(inode); return error; } diff --git a/mm/shmem.c b/mm/shmem.c index dacda7463d54..54c213b7b42a 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -80,6 +80,7 @@ static struct vfsmount *shm_mnt; #include <linux/userfaultfd_k.h> #include <linux/rmap.h> #include <linux/uuid.h> +#include <linux/memfd.h> #include <linux/uaccess.h> @@ -883,6 +884,18 @@ static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end) return split_huge_page(page) >= 0; } +static void guest_invalidate_page(struct inode *inode, + struct page *page, pgoff_t start, pgoff_t end) +{ + struct shmem_inode_info *info = SHMEM_I(inode); + + start = max(start, page->index); + end = min(end, page->index + HPAGE_PMD_NR) - 1; + + info->guest_ops->invalidate_page_range(inode, info->guest_owner, + start, end); +} + /* * Remove range of pages and swap entries from page cache, and free them. * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate. 
@@ -923,6 +936,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, } index += thp_nr_pages(page) - 1; + guest_invalidate_page(inode, page, start, end); + if (!unfalloc || !PageUptodate(page)) truncate_inode_page(mapping, page); unlock_page(page); @@ -999,6 +1014,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, index--; break; } + + guest_invalidate_page(inode, page, start, end); + VM_BUG_ON_PAGE(PageWriteback(page), page); if (shmem_punch_compound(page, start, end)) truncate_inode_page(mapping, page); @@ -1074,6 +1092,9 @@ static int shmem_setattr(struct user_namespace *mnt_userns, (newsize > oldsize && (info->seals & F_SEAL_GROW))) return -EPERM; + if ((info->seals & F_SEAL_GUEST) && (newsize & ~PAGE_MASK)) + return -EINVAL; + if (newsize != oldsize) { error = shmem_reacct_size(SHMEM_I(inode)->flags, oldsize, newsize); @@ -1348,6 +1369,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) goto redirty; if (!total_swap_pages) goto redirty; + if (info->seals & F_SEAL_GUEST) + goto redirty; /* * Our capabilities prevent regular writeback or sync from ever calling @@ -2274,6 +2297,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma) if (ret) return ret; + if (info->seals & F_SEAL_GUEST) + return -EPERM; + /* arm64 - allow memory tagging on RAM-based files */ vma->vm_flags |= VM_MTE_ALLOWED; @@ -2471,12 +2497,14 @@ shmem_write_begin(struct file *file, struct address_space *mapping, pgoff_t index = pos >> PAGE_SHIFT; /* i_mutex is held by caller */ - if (unlikely(info->seals & (F_SEAL_GROW | - F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))) { + if (unlikely(info->seals & (F_SEAL_GROW | F_SEAL_WRITE | + F_SEAL_FUTURE_WRITE | F_SEAL_GUEST))) { if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) return -EPERM; if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size) return -EPERM; + if (info->seals & F_SEAL_GUEST) + return -EPERM; } return shmem_getpage(inode, index, pagep, SGP_WRITE); @@ -2550,6 +2578,20 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) end_index = i_size >> PAGE_SHIFT; if (index > end_index) break; + + /* + * inode_lock protects setting up seals as well as write to + * i_size. Setting F_SEAL_GUEST only allowed with i_size == 0. + * + * Check F_SEAL_GUEST after i_size. It effectively serialize + * read vs. setting F_SEAL_GUEST without taking inode_lock in + * read path. 
+ */ + if (SHMEM_I(inode)->seals & F_SEAL_GUEST) { + error = -EPERM; + break; + } + if (index == end_index) { nr = i_size & ~PAGE_MASK; if (nr <= offset) @@ -2675,6 +2717,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, goto out; } + if ((info->seals & F_SEAL_GUEST) && + (offset & ~PAGE_MASK || len & ~PAGE_MASK)) { + error = -EINVAL; + goto out; + } + shmem_falloc.waitq = &shmem_falloc_waitq; shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT; shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT; @@ -3761,6 +3809,20 @@ static void shmem_destroy_inodecache(void) kmem_cache_destroy(shmem_inode_cachep); } +#ifdef CONFIG_MIGRATION +int shmem_migrate_page(struct address_space *mapping, + struct page *newpage, struct page *page, + enum migrate_mode mode) +{ + struct inode *inode = mapping->host; + struct shmem_inode_info *info = SHMEM_I(inode); + + if (info->seals & F_SEAL_GUEST) + return -ENOTSUPP; + return migrate_page(mapping, newpage, page, mode); +} +#endif + const struct address_space_operations shmem_aops = { .writepage = shmem_writepage, .set_page_dirty = __set_page_dirty_no_writeback, @@ -3769,12 +3831,57 @@ const struct address_space_operations shmem_aops = { .write_end = shmem_write_end, #endif #ifdef CONFIG_MIGRATION - .migratepage = migrate_page, + .migratepage = shmem_migrate_page, #endif .error_remove_page = generic_error_remove_page, }; EXPORT_SYMBOL(shmem_aops); +static unsigned long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset) +{ + struct page *page; + int ret; + + ret = shmem_getpage(inode, offset, &page, SGP_WRITE); + if (ret) + return ret; + + return page_to_pfn(page); +} + +static void shmem_put_unlock_pfn(unsigned long pfn) +{ + struct page *page = pfn_to_page(pfn); + + VM_BUG_ON_PAGE(!PageLocked(page), page); + + set_page_dirty(page); + unlock_page(page); + put_page(page); +} + +static const struct guest_mem_ops shmem_guest_ops = { + .get_lock_pfn = shmem_get_lock_pfn, + .put_unlock_pfn = shmem_put_unlock_pfn, +}; + +int shmem_register_guest(struct inode *inode, void *owner, + const struct guest_ops *guest_ops, + const struct guest_mem_ops **guest_mem_ops) +{ + struct shmem_inode_info *info = SHMEM_I(inode); + + if (!owner) + return -EINVAL; + + if (info->guest_owner && info->guest_owner != owner) + return -EPERM; + + info->guest_ops = guest_ops; + *guest_mem_ops = &shmem_guest_ops; + return 0; +} + static const struct file_operations shmem_file_operations = { .mmap = shmem_mmap, .get_unmapped_area = shmem_get_unmapped_area,
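For orientation, a rough sketch of how the KVM side might register against such a memfd and supply the invalidation callback is below; only memfd_register_guest() and the guest_ops/guest_mem_ops structures come from the patch above, while kvm_zap_private_range() and the binding helper are invented for illustration.

/* Sketch of the KVM side; kvm_zap_private_range() is a placeholder. */
static void kvm_private_fd_invalidate(struct inode *inode, void *owner,
				      pgoff_t start, pgoff_t end)
{
	struct kvm *kvm = owner;

	/*
	 * Zap any private mappings backed by [start, end) of this inode.
	 * Must tolerate the memslot (or even the VM) already being gone,
	 * since the memfd can outlive KVM.
	 */
	kvm_zap_private_range(kvm, inode, start, end);
}

static const struct guest_ops kvm_guest_ops = {
	.invalidate_page_range = kvm_private_fd_invalidate,
};

static int kvm_bind_private_memslot(struct kvm *kvm, struct file *file,
				    const struct guest_mem_ops **mem_ops)
{
	/* KVM pins the file; the memfd only keeps the opaque owner pointer. */
	get_file(file);
	return memfd_register_guest(file_inode(file), kvm,
				    &kvm_guest_ops, mem_ops);
}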
>> diff --git a/mm/memfd.c b/mm/memfd.c >> index 081dd33e6a61..ae43454789f4 100644 >> --- a/mm/memfd.c >> +++ b/mm/memfd.c >> @@ -130,11 +130,24 @@ static unsigned int *memfd_file_seals_ptr(struct file *file) >> return NULL; >> } >> >> +int memfd_register_guest(struct inode *inode, void *owner, >> + const struct guest_ops *guest_ops, >> + const struct guest_mem_ops **guest_mem_ops) >> +{ >> + if (shmem_mapping(inode->i_mapping)) { >> + return shmem_register_guest(inode, owner, >> + guest_ops, guest_mem_ops); >> + } >> + >> + return -EINVAL; >> +} > > Are we stick our design to memfd interface (e.g other memory backing > stores like tmpfs and hugetlbfs will all rely on this memfd interface to > interact with KVM), or this is just the initial implementation for PoC? I don't think we are, it still feels like we are in the early prototype phase (even way before a PoC). I'd be happy to see something "cleaner" so to say -- it still feels kind of hacky to me, especially as there seem to be many pieces of the big puzzle missing so far. Unfortunately, this series hasn't caught the attention of many -MM people so far, maybe because other people miss the big picture as well and are waiting for a complete design proposal. For example, what's unclear to me: we'll be allocating pages with GFP_HIGHUSER_MOVABLE, making them land on MIGRATE_CMA or ZONE_MOVABLE; then we silently turn them unmovable, which breaks these concepts. Who'd migrate these pages away just like when doing long-term pinning, or how is that supposed to work? Also unclear to me is how refcount and mapcount will be handled to prevent swapping, who will actually do some kind of gfn-epfn etc. mapping, how we'll forbid access to this memory e.g., via /proc/kcore or when dumping memory ... and how it would ever work with migration/swapping/rmap (it's clearly future work, but it's been raised that this would be the way to make it work, I don't quite see how it would all come together). <note> Last but not least, I raised to Intel via a different channel that I'd appreciate updated hardware that avoids essentially crashing the hypervisor when writing to encrypted memory from user space. It has the smell of "broken hardware" to it that might just be fixed by a new hardware generation to make it look more similar to other successful implementations of secure/encrypted memory. That might make it much easier to support an initial version of TDX -- instead of having to reinvent the way we map guest memory just now to support hardware that might sort out the root problem later. Having that said, there might be benefits to mapping guest memory differently, but my gut feeling is that it might take quite a long time to get something reasonable working, to settle on a design, and to get it accepted by all involved parties to merge it upstream. Just my 2 cents, I might be all wrong as so often. </note>
On Wed, Sep 15, 2021 at 07:58:57PM +0000, Chao Peng wrote: > On Fri, Sep 10, 2021 at 08:18:11PM +0300, Kirill A. Shutemov wrote: > > On Fri, Sep 03, 2021 at 12:15:51PM -0700, Andy Lutomirski wrote: > > > On 9/3/21 12:14 PM, Kirill A. Shutemov wrote: > > > > On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson wrote: > > > >> Would requiring the size to be '0' at F_SEAL_GUEST time solve that problem? > > > > > > > > I guess. Maybe we would need a WRITE_ONCE() on set. I donno. I will look > > > > closer into locking next. > > > > > > We can decisively eliminate this sort of failure by making the switch > > > happen at open time instead of after. For a memfd-like API, this would > > > be straightforward. For a filesystem, it would take a bit more thought. > > > > I think it should work fine as long as we check seals after i_size in the > > read path. See the comment in shmem_file_read_iter(). > > > > Below is updated version. I think it should be good enough to start > > integrate with KVM. > > > > I also attach a test-case that consists of kernel patch and userspace > > program. It demonstrates how it can be integrated into KVM code. > > > > One caveat I noticed is that guest_ops::invalidate_page_range() can be > > called after the owner (struct kvm) has being freed. It happens because > > memfd can outlive KVM. So the callback has to check if such owner exists, > > than check that there's a memslot with such inode. > > Would introducing memfd_unregister_guest() fix this? I considered this, but it gets complex quickly. At what point does it get called? On KVM memslot destroy? What if multiple KVM slots share the same memfd? Add a refcount into the memfd for how many times the owner registered the memfd? It would leave us in a strange state: the memfd refcounts owners (struct kvm) and the KVM memslot pins the struct file. A weird refcount exchange scheme. I hate it. > > I guess it should be okay: we have vm_list we can check owner against. > > We may consider replace vm_list with something more scalable if number of > > VMs will get too high. > > > > Any comments? > > > > diff --git a/include/linux/memfd.h b/include/linux/memfd.h > > index 4f1600413f91..3005e233140a 100644 > > --- a/include/linux/memfd.h > > +++ b/include/linux/memfd.h > > @@ -4,13 +4,34 @@ > > > > #include <linux/file.h> > > > > +struct guest_ops { > > + void (*invalidate_page_range)(struct inode *inode, void *owner, > > + pgoff_t start, pgoff_t end); > > +}; > > I can see there are two scenarios to invalidate page(s), when punching a > hole or ftruncating to 0, in either cases KVM should already been called > with necessary information from usersapce with memory slot punch hole > syscall or memory slot delete syscall, so wondering this callback is > really needed. So what do you propose? Forbid truncate/punch from userspace and make KVM handle punch hole/truncate from within the kernel? I think it's a layering violation. > > + > > +struct guest_mem_ops { > > + unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset); > > + void (*put_unlock_pfn)(unsigned long pfn); > > Same as above, I’m not clear on which time put_unlock_pfn() would be > called, I’m thinking the page can be put_and_unlock when userspace > punching a hole or ftruncating to 0 on the fd. No. put_unlock_pfn() has to be called after the pfn is in the SEPT. This way we close the race between SEPT population and truncate/punch. get_lock_pfn() would stop truncate until put_unlock_pfn() is called.
> We did miss pfn_mapping_level() callback which is needed for KVM to query > the page size level (e.g. 4K or 2M) that the backing store can support. Okay, makes sense. We can return the information as part of the get_lock_pfn() call. > Are we stick our design to memfd interface (e.g other memory backing > stores like tmpfs and hugetlbfs will all rely on this memfd interface to > interact with KVM), or this is just the initial implementation for PoC? > > If we really want to expose multiple memory backing stores directly to > KVM (as opposed to expose backing stores to memfd and then memfd expose > single interface to KVM), I feel we need a third layer between KVM and > backing stores to eliminate the direct call like this. Backing stores > can register ‘memory fd providers’ and KVM should be able to connect to > the right backing store provider with the fd provided by usersapce under > the help of this third layer. memfd can provide shmem and hugetlbfs. That should be enough for now.
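A sketch of the ordering being described, where the pfn stays locked from get_lock_pfn() until it has been installed in the SEPT, might look as follows; everything except the two guest_mem_ops callbacks (the private slot fields, the install helper, and the mapping-level out-parameter mentioned above) is a placeholder.

static int kvm_map_private_gfn(struct kvm *kvm, struct kvm_memory_slot *slot,
			       gfn_t gfn, const struct guest_mem_ops *ops)
{
	/* slot->private_inode and slot->private_offset are assumed fields */
	pgoff_t offset = (gfn - slot->base_gfn) + slot->private_offset;
	unsigned long pfn;
	int ret;

	/* Blocks truncate/punch of this offset until put_unlock_pfn(). */
	pfn = ops->get_lock_pfn(slot->private_inode, offset);
	if (IS_ERR_VALUE(pfn))
		return (int)pfn;

	/* A page-size/order out-parameter could be returned here as discussed. */
	ret = kvm_install_private_spte(kvm, gfn, pfn);	/* hypothetical helper */

	/* Only drop the lock once the pfn is reachable via the SEPT. */
	ops->put_unlock_pfn(pfn);
	return ret;
}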
On Wed, Sep 15, 2021 at 03:51:25PM +0200, David Hildenbrand wrote: > > > diff --git a/mm/memfd.c b/mm/memfd.c > > > index 081dd33e6a61..ae43454789f4 100644 > > > --- a/mm/memfd.c > > > +++ b/mm/memfd.c > > > @@ -130,11 +130,24 @@ static unsigned int *memfd_file_seals_ptr(struct file *file) > > > return NULL; > > > } > > > +int memfd_register_guest(struct inode *inode, void *owner, > > > + const struct guest_ops *guest_ops, > > > + const struct guest_mem_ops **guest_mem_ops) > > > +{ > > > + if (shmem_mapping(inode->i_mapping)) { > > > + return shmem_register_guest(inode, owner, > > > + guest_ops, guest_mem_ops); > > > + } > > > + > > > + return -EINVAL; > > > +} > > > > Are we stick our design to memfd interface (e.g other memory backing > > stores like tmpfs and hugetlbfs will all rely on this memfd interface to > > interact with KVM), or this is just the initial implementation for PoC? > > I don't think we are, it still feels like we are in the early prototype > phase (even way before a PoC). I'd be happy to see something "cleaner" so to > say -- it still feels kind of hacky to me, especially there seem to be many > pieces of the big puzzle missing so far. Unfortunately, this series hasn't > caught the attention of many -MM people so far, maybe because other people > miss the big picture as well and are waiting for a complete design proposal. > > For example, what's unclear to me: we'll be allocating pages with > GFP_HIGHUSER_MOVABLE, making them land on MIGRATE_CMA or ZONE_MOVABLE; then > we silently turn them unmovable, which breaks these concepts. Who'd migrate > these pages away just like when doing long-term pinning, or how is that > supposed to work? That's fair point. We can fix it by changing mapping->gfp_mask. > Also unclear to me is how refcount and mapcount will be handled to prevent > swapping, refcount and mapcount are unchanged. Pages not pinned per se. Swapping prevented with the change in shmem_writepage(). > who will actually do some kind of gfn-epfn etc. mapping, how we'll > forbid access to this memory e.g., via /proc/kcore or when dumping memory It's not aimed to prevent root to shoot into his leg. Root do root. > ... and how it would ever work with migration/swapping/rmap (it's clearly > future work, but it's been raised that this would be the way to make it > work, I don't quite see how it would all come together). Given that hardware supports it migration and swapping can be implemented by providing new callbacks in guest_ops. Like ->migrate_page would transfer encrypted data between pages and ->swapout would provide encrypted blob that can be put on disk or handled back to ->swapin to bring back to memory.
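For reference, "changing mapping->gfp_mask" could be done much like secretmem does for its inodes; a minimal sketch, assuming it would be applied when the F_SEAL_GUEST seal is set, is:

#include <linux/pagemap.h>

/*
 * Sketch: once the file is sealed for guest use, stop new allocations from
 * landing in ZONE_MOVABLE/MIGRATE_CMA, similar to what mm/secretmem.c does.
 */
static void shmem_seal_guest_gfp(struct inode *inode)
{
	mapping_set_gfp_mask(inode->i_mapping,
			     mapping_gfp_mask(inode->i_mapping) & ~__GFP_MOVABLE);
}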
>> I don't think we are, it still feels like we are in the early prototype >> phase (even way before a PoC). I'd be happy to see something "cleaner" so to >> say -- it still feels kind of hacky to me, especially there seem to be many >> pieces of the big puzzle missing so far. Unfortunately, this series hasn't >> caught the attention of many -MM people so far, maybe because other people >> miss the big picture as well and are waiting for a complete design proposal. >> >> For example, what's unclear to me: we'll be allocating pages with >> GFP_HIGHUSER_MOVABLE, making them land on MIGRATE_CMA or ZONE_MOVABLE; then >> we silently turn them unmovable, which breaks these concepts. Who'd migrate >> these pages away just like when doing long-term pinning, or how is that >> supposed to work? > > That's fair point. We can fix it by changing mapping->gfp_mask. That's essentially what secretmem does when setting up a file. > >> Also unclear to me is how refcount and mapcount will be handled to prevent >> swapping, > > refcount and mapcount are unchanged. Pages not pinned per se. Swapping > prevented with the change in shmem_writepage(). So when mapping into the guest, we'd increment the refcount but not the mapcount I assume? > >> who will actually do some kind of gfn-epfn etc. mapping, how we'll >> forbid access to this memory e.g., via /proc/kcore or when dumping memory > > It's not aimed to prevent root to shoot into his leg. Root do root. IMHO being root is not an excuse to read some random file (actually used in production environments) to result in the machine crashing. Not acceptable for distributions. I'm still missing the whole gfn-epfn 1:1 mapping discussion we identified as requirements. Is that supposed to be done by KVM? How? > >> ... and how it would ever work with migration/swapping/rmap (it's clearly >> future work, but it's been raised that this would be the way to make it >> work, I don't quite see how it would all come together). > > Given that hardware supports it migration and swapping can be implemented > by providing new callbacks in guest_ops. Like ->migrate_page would > transfer encrypted data between pages and ->swapout would provide > encrypted blob that can be put on disk or handled back to ->swapin to > bring back to memory. Again, I'm missing the complete picture. To make swapping decisions vmscan code needs track+handle dirty+reference information. How would we be able to track references? Does the hardware allow for temporary unmapping of encrypted memory and faulting on it? How would page_referenced() continue working? "we can add callbacks" is not a satisfying answer, at least for me. Especially, when it comes to eventual locking problems and races. Maybe saying "migration+swap is not supported" is clearer than "we can add callbacks" and missing some details on the bigger picture. Again, a complete design proposal would be highly valuable, especially to get some more review from other -MM folks. Otherwise there is a high chance that this will be rejected late when trying to upstream and -MM people stumbling over it (we've had some similar thing happening just recently unfortunately ...).
>> >>> who will actually do some kind of gfn-epfn etc. mapping, how we'll >>> forbid access to this memory e.g., via /proc/kcore or when dumping memory >> >> It's not aimed to prevent root to shoot into his leg. Root do root. > > IMHO being root is not an excuse to read some random file (actually used > in production environments) to result in the machine crashing. Not > acceptable for distributions. I just realized that reading encrypted memory should be ok and only writing is an issue, right?
On Fri, Sep 10, 2021 at 08:18:11PM +0300, Kirill A. Shutemov wrote: > On Fri, Sep 03, 2021 at 12:15:51PM -0700, Andy Lutomirski wrote: > > On 9/3/21 12:14 PM, Kirill A. Shutemov wrote: > > > On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson wrote: > > >> Would requiring the size to be '0' at F_SEAL_GUEST time solve that problem? > > > > > > I guess. Maybe we would need a WRITE_ONCE() on set. I donno. I will look > > > closer into locking next. > > > > We can decisively eliminate this sort of failure by making the switch > > happen at open time instead of after. For a memfd-like API, this would > > be straightforward. For a filesystem, it would take a bit more thought. > > I think it should work fine as long as we check seals after i_size in the > read path. See the comment in shmem_file_read_iter(). > > Below is updated version. I think it should be good enough to start > integrate with KVM. > > I also attach a test-case that consists of kernel patch and userspace > program. It demonstrates how it can be integrated into KVM code. > > One caveat I noticed is that guest_ops::invalidate_page_range() can be > called after the owner (struct kvm) has being freed. It happens because > memfd can outlive KVM. So the callback has to check if such owner exists, > than check that there's a memslot with such inode. Would introducing memfd_unregister_guest() fix this? > > I guess it should be okay: we have vm_list we can check owner against. > We may consider replace vm_list with something more scalable if number of > VMs will get too high. > > Any comments? > > diff --git a/include/linux/memfd.h b/include/linux/memfd.h > index 4f1600413f91..3005e233140a 100644 > --- a/include/linux/memfd.h > +++ b/include/linux/memfd.h > @@ -4,13 +4,34 @@ > > #include <linux/file.h> > > +struct guest_ops { > + void (*invalidate_page_range)(struct inode *inode, void *owner, > + pgoff_t start, pgoff_t end); > +}; I can see there are two scenarios to invalidate page(s), when punching a hole or ftruncating to 0, in either cases KVM should already been called with necessary information from usersapce with memory slot punch hole syscall or memory slot delete syscall, so wondering this callback is really needed. > + > +struct guest_mem_ops { > + unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset); > + void (*put_unlock_pfn)(unsigned long pfn); Same as above, I’m not clear on which time put_unlock_pfn() would be called, I’m thinking the page can be put_and_unlock when userspace punching a hole or ftruncating to 0 on the fd. We did miss pfn_mapping_level() callback which is needed for KVM to query the page size level (e.g. 4K or 2M) that the backing store can support. 
> + > +}; > + > #ifdef CONFIG_MEMFD_CREATE > extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg); > + > +extern inline int memfd_register_guest(struct inode *inode, void *owner, > + const struct guest_ops *guest_ops, > + const struct guest_mem_ops **guest_mem_ops); > #else > static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a) > { > return -EINVAL; > } > +static inline int memfd_register_guest(struct inode *inode, void *owner, > + const struct guest_ops *guest_ops, > + const struct guest_mem_ops **guest_mem_ops) > +{ > + return -EINVAL; > +} > #endif > > #endif /* __LINUX_MEMFD_H */ > diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h > index 8e775ce517bb..265d0c13bc5e 100644 > --- a/include/linux/shmem_fs.h > +++ b/include/linux/shmem_fs.h > @@ -12,6 +12,9 @@ > > /* inode in-kernel data */ > > +struct guest_ops; > +struct guest_mem_ops; > + > struct shmem_inode_info { > spinlock_t lock; > unsigned int seals; /* shmem seals */ > @@ -24,6 +27,8 @@ struct shmem_inode_info { > struct simple_xattrs xattrs; /* list of xattrs */ > atomic_t stop_eviction; /* hold when working on inode */ > struct inode vfs_inode; > + void *guest_owner; > + const struct guest_ops *guest_ops; > }; > > struct shmem_sb_info { > @@ -90,6 +95,10 @@ extern unsigned long shmem_swap_usage(struct vm_area_struct *vma); > extern unsigned long shmem_partial_swap_usage(struct address_space *mapping, > pgoff_t start, pgoff_t end); > > +extern int shmem_register_guest(struct inode *inode, void *owner, > + const struct guest_ops *guest_ops, > + const struct guest_mem_ops **guest_mem_ops); > + > /* Flag allocation requirements to shmem_getpage */ > enum sgp_type { > SGP_READ, /* don't exceed i_size, don't allocate page */ > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h > index 2f86b2ad6d7e..c79bc8572721 100644 > --- a/include/uapi/linux/fcntl.h > +++ b/include/uapi/linux/fcntl.h > @@ -43,6 +43,7 @@ > #define F_SEAL_GROW 0x0004 /* prevent file from growing */ > #define F_SEAL_WRITE 0x0008 /* prevent writes */ > #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ > +#define F_SEAL_GUEST 0x0020 > /* (1U << 31) is reserved for signed error codes */ > > /* > diff --git a/mm/memfd.c b/mm/memfd.c > index 081dd33e6a61..ae43454789f4 100644 > --- a/mm/memfd.c > +++ b/mm/memfd.c > @@ -130,11 +130,24 @@ static unsigned int *memfd_file_seals_ptr(struct file *file) > return NULL; > } > > +int memfd_register_guest(struct inode *inode, void *owner, > + const struct guest_ops *guest_ops, > + const struct guest_mem_ops **guest_mem_ops) > +{ > + if (shmem_mapping(inode->i_mapping)) { > + return shmem_register_guest(inode, owner, > + guest_ops, guest_mem_ops); > + } > + > + return -EINVAL; > +} Are we stick our design to memfd interface (e.g other memory backing stores like tmpfs and hugetlbfs will all rely on this memfd interface to interact with KVM), or this is just the initial implementation for PoC? If we really want to expose multiple memory backing stores directly to KVM (as opposed to expose backing stores to memfd and then memfd expose single interface to KVM), I feel we need a third layer between KVM and backing stores to eliminate the direct call like this. Backing stores can register ‘memory fd providers’ and KVM should be able to connect to the right backing store provider with the fd provided by usersapce under the help of this third layer. Thanks, Chao
On Wed, Sep 15, 2021 at 04:59:46PM +0200, David Hildenbrand wrote: > > > I don't think we are, it still feels like we are in the early prototype > > > phase (even way before a PoC). I'd be happy to see something "cleaner" so to > > > say -- it still feels kind of hacky to me, especially there seem to be many > > > pieces of the big puzzle missing so far. Unfortunately, this series hasn't > > > caught the attention of many -MM people so far, maybe because other people > > > miss the big picture as well and are waiting for a complete design proposal. > > > > > > For example, what's unclear to me: we'll be allocating pages with > > > GFP_HIGHUSER_MOVABLE, making them land on MIGRATE_CMA or ZONE_MOVABLE; then > > > we silently turn them unmovable, which breaks these concepts. Who'd migrate > > > these pages away just like when doing long-term pinning, or how is that > > > supposed to work? > > > > That's fair point. We can fix it by changing mapping->gfp_mask. > > That's essentially what secretmem does when setting up a file. > > > > > > Also unclear to me is how refcount and mapcount will be handled to prevent > > > swapping, > > > > refcount and mapcount are unchanged. Pages not pinned per se. Swapping > > prevented with the change in shmem_writepage(). > > So when mapping into the guest, we'd increment the refcount but not the > mapcount I assume? No. The only refcount is held by the page cache. But we inform KVM via a callback before removing the page from the page cache. It is similar to the mmu_notifier scheme KVM uses at the moment. > > > > > > who will actually do some kind of gfn-epfn etc. mapping, how we'll > > > forbid access to this memory e.g., via /proc/kcore or when dumping memory > > > > It's not aimed to prevent root to shoot into his leg. Root do root. > > IMHO being root is not an excuse to read some random file (actually used in > production environments) to result in the machine crashing. Not acceptable > for distributions. Reading does not cause problems. Writing does. > I'm still missing the whole gfn-epfn 1:1 mapping discussion we identified as > requirements. Is that supposed to be done by KVM? How? A KVM memslot that represents a range of GFNs refers to a memfd (and holds a file pin) plus an offset in the file. This info is enough to calculate the offset in the file and find the PFN. The memfd is tied 1:1 to struct kvm, and KVM would make sure that there's only one possible gfn for a given file offset. > > > ... and how it would ever work with migration/swapping/rmap (it's clearly > > > future work, but it's been raised that this would be the way to make it > > > work, I don't quite see how it would all come together). > > > > Given that hardware supports it migration and swapping can be implemented > > by providing new callbacks in guest_ops. Like ->migrate_page would > > transfer encrypted data between pages and ->swapout would provide > > encrypted blob that can be put on disk or handled back to ->swapin to > > bring back to memory. > > Again, I'm missing the complete picture. To make swapping decisions vmscan > code needs track+handle dirty+reference information. How would we be able to > track references? Does the hardware allow for temporary unmapping of > encrypted memory and faulting on it? How would page_referenced() continue > working? "we can add callbacks" is not a satisfying answer, at least for me. > Especially, when it comes to eventual locking problems and races. HW doesn't support swapping yet, so the details would just be speculation. IIUC, there's an accessed bit in the EPT that can be used for tracking.
> Maybe saying "migration+swap is not supported" is clearer than "we can add > callbacks" and missing some details on the bigger picture. > > Again, a complete design proposal would be highly valuable, especially to > get some more review from other -MM folks. Otherwise there is a high chance > that this will be rejected late when trying to upstream and -MM people > stumbling over it (we've had some similar thing happening just recently > unfortunately ...). I only work on the core-mm side of the story. We will definitely need to look at the whole picture again once all the pieces are somewhat ready.
On Wed, Sep 15, 2021 at 05:11:47PM +0300, Kirill A. Shutemov wrote: > On Wed, Sep 15, 2021 at 07:58:57PM +0000, Chao Peng wrote: > > On Fri, Sep 10, 2021 at 08:18:11PM +0300, Kirill A. Shutemov wrote: > > > On Fri, Sep 03, 2021 at 12:15:51PM -0700, Andy Lutomirski wrote: > > > > On 9/3/21 12:14 PM, Kirill A. Shutemov wrote: > > > > > On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson wrote: > > > > >> Would requiring the size to be '0' at F_SEAL_GUEST time solve that problem? > > > > > > > > > > I guess. Maybe we would need a WRITE_ONCE() on set. I donno. I will look > > > > > closer into locking next. > > > > > > > > We can decisively eliminate this sort of failure by making the switch > > > > happen at open time instead of after. For a memfd-like API, this would > > > > be straightforward. For a filesystem, it would take a bit more thought. > > > > > > I think it should work fine as long as we check seals after i_size in the > > > read path. See the comment in shmem_file_read_iter(). > > > > > > Below is updated version. I think it should be good enough to start > > > integrate with KVM. > > > > > > I also attach a test-case that consists of kernel patch and userspace > > > program. It demonstrates how it can be integrated into KVM code. > > > > > > One caveat I noticed is that guest_ops::invalidate_page_range() can be > > > called after the owner (struct kvm) has being freed. It happens because > > > memfd can outlive KVM. So the callback has to check if such owner exists, > > > than check that there's a memslot with such inode. > > > > Would introducing memfd_unregister_guest() fix this? > > I considered this, but it get complex quickly. > > At what point it gets called? On KVM memslot destroy? I meant when the VM gets destroyed. > > What if multiple KVM slot share the same memfd? Add refcount into memfd on > how many times the owner registered the memfd? > > It would leave us in strange state: memfd refcount owners (struct KVM) and > KVM memslot pins the struct file. Weird refcount exchnage program. > > I hate it. But yes, I agree things will get much more complex in practice. > > > > I guess it should be okay: we have vm_list we can check owner against. > > > We may consider replace vm_list with something more scalable if number of > > > VMs will get too high. > > > > > > Any comments? > > > > > > diff --git a/include/linux/memfd.h b/include/linux/memfd.h > > > index 4f1600413f91..3005e233140a 100644 > > > --- a/include/linux/memfd.h > > > +++ b/include/linux/memfd.h > > > @@ -4,13 +4,34 @@ > > > > > > #include <linux/file.h> > > > > > > +struct guest_ops { > > > + void (*invalidate_page_range)(struct inode *inode, void *owner, > > > + pgoff_t start, pgoff_t end); > > > +}; > > > > I can see there are two scenarios to invalidate page(s), when punching a > > hole or ftruncating to 0, in either cases KVM should already been called > > with necessary information from usersapce with memory slot punch hole > > syscall or memory slot delete syscall, so wondering this callback is > > really needed. > > So what you propose? Forbid truncate/punch from userspace and make KVM > handle punch hole/truncate from within kernel? I think it's layering > violation. As far as I understand the flow for punch hole/truncate in this design, there will be two steps for userspace: 1. punch hole/delete the kvm memory slot, and then 2. punch hole/truncate on the memory backing store fd. In concept we can do whatever is needed for invalidation in either step.
If we can do the invalidation in step 1 then we don’t need to bother with this callback. This is what I meant, but I agree the current callback can also work. > > > > + > > > +struct guest_mem_ops { > > > + unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset); > > > + void (*put_unlock_pfn)(unsigned long pfn); > > > > Same as above, I’m not clear on which time put_unlock_pfn() would be > > called, I’m thinking the page can be put_and_unlock when userspace > > punching a hole or ftruncating to 0 on the fd. > > No. put_unlock_pfn() has to be called after the pfn is in SEPT. This way > we close race between SEPT population and truncate/punch. get_lock_pfn() > would stop truncate untile put_unlock_pfn() called. Okay, makes sense. Thanks, Chao
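Spelling out the ordering Kirill describes, a rough fault-path sketch might look like the following. slot->file and slot->guest_mem_ops are assumed memslot fields, install_private_spte() stands in for whatever populates the secure EPT, and the "pfn == 0 means failure" convention of get_lock_pfn() is an assumption, not something the posted snippet defines.

static int map_private_gfn(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
                           gfn_t gfn, pgoff_t index)
{
        unsigned long pfn;
        int r;

        /*
         * Resolve the page-cache index to a pfn and block hole punch /
         * truncate of that range until put_unlock_pfn() is called.
         */
        pfn = slot->guest_mem_ops->get_lock_pfn(file_inode(slot->file), index);
        if (!pfn)
                return -EFAULT;

        /* Populate the secure EPT while the page cannot be truncated away. */
        r = install_private_spte(vcpu, gfn, pfn);

        /*
         * Only now may truncate/punch proceed and trigger
         * guest_ops::invalidate_page_range() for this range.
         */
        slot->guest_mem_ops->put_unlock_pfn(pfn);

        return r;
}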
On 15/09/21 16:11, Kirill A. Shutemov wrote: >> Would introducing memfd_unregister_guest() fix this? > I considered this, but it get complex quickly. > > At what point it gets called? On KVM memslot destroy? > > What if multiple KVM slot share the same memfd? Add refcount into memfd on > how many times the owner registered the memfd? You will always have multiple KVM slots sharing the same memfd, because memslots are SRCU-protected. So there will be multiple generations of memslots around and unregistering must be delayed to after synchronize_srcu (around the call to kvm_arch_commit_memory_region). So KVM could just call memfd_{,un}register_guest as many times as it calls fdget/fput. Looking at your test device, it would be like the following pseudo-patch: case GUEST_MEM_REGISTER: { struct fd memfd = fdget(arg); memfd_file = memfd.file; return memfd_register_guest(memfd_file->f_inode, file, &guest_ops, &guest_mem_ops); } case GUEST_MEM_UNREGISTER: { if (!memfd_file) return -EINVAL; + memfd_unregister_guest(memfd_file->f_inode, file); fput(memfd_file); memfd_file = NULL; guest_mem_ops = NULL; return 0; and shmem_unregister_guest would be something like struct shmem_inode_info *info = SHMEM_I(inode); if (WARN_ON_ONCE(info->guest_owner != owner)) return; if (--info->guest_usage_count) return; info->guest_owner = NULL; info->guest_ops = NULL; Paolo > It would leave us in strange state: memfd refcount owners (struct KVM) and > KVM memslot pins the struct file. Weird refcount exchnage program.
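The registration side is not spelled out above. Under the same assumptions as Paolo's sketch (guest_owner, guest_ops and guest_usage_count fields in shmem_inode_info, plus a static shmem_guest_mem_ops table; all names are guesses rather than taken from the posted patch), shmem_register_guest() would presumably be the mirror image:

static int shmem_register_guest(struct inode *inode, void *owner,
                                const struct guest_ops *guest_ops,
                                const struct guest_mem_ops **guest_mem_ops)
{
        struct shmem_inode_info *info = SHMEM_I(inode);

        if (!owner)
                return -EINVAL;

        /* A guest memfd may only ever be paired with a single owner (VM). */
        if (info->guest_owner && info->guest_owner != owner)
                return -EBUSY;

        info->guest_owner = owner;
        info->guest_ops = guest_ops;
        info->guest_usage_count++;

        *guest_mem_ops = &shmem_guest_mem_ops;
        return 0;
}

Each GUEST_MEM_REGISTER/GUEST_MEM_UNREGISTER pair (or each memslot generation) would then simply bump and drop guest_usage_count, matching the fdget()/fput() lifetime Paolo describes.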
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 20daaf67a5bf..5ab6e5e9f38b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1810,10 +1810,11 @@ enum {
 #define HF_SMM_INSIDE_NMI_MASK (1 << 7)
 
 #define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
-#define KVM_ADDRESS_SPACE_NUM 2
+#define KVM_ADDRESS_SPACE_NUM 3
 
-#define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
-#define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
+#define kvm_arch_vcpu_memslots_id(vcpu, private) \
+        (((vcpu)->arch.hflags & HF_SMM_MASK) ? 1 : (!!private) << 1)
+#define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm | (role).private << 1)
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index a6c327f8ad9e..600bf108741d 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -53,6 +53,10 @@
 /* Architectural interrupt line count. */
 #define KVM_NR_INTERRUPTS 256
 
+#define KVM_DEFAULT_ADDRESS_SPACE 0
+#define KVM_SMM_ADDRESS_SPACE 1
+#define KVM_PRIVATE_ADDRESS_SPACE 2
+
 struct kvm_memory_alias {
         __u32 slot;  /* this has a different namespace than memory slots */
         __u32 flags;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index ce369a533800..fc620eda27fd 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -127,6 +127,9 @@ struct kvm_page_fault {
         const bool rsvd;
         const bool user;
 
+        /* Guest private, derived from error_code (SNP) or gfn (TDX). */
+        const bool private;
+
         /* Derived from mmu and global state. */
         const bool is_tdp;
         const bool nx_huge_page_workaround_enabled;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a272ccbddfa1..771080235b2d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2896,6 +2896,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
         if (max_level == PG_LEVEL_4K)
                 return PG_LEVEL_4K;
 
+        if (memslot_is_private(slot))
+                return slot->private_ops->pfn_mapping_level(...);
+
         host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
         return min(host_level, max_level);
 }
@@ -3835,9 +3838,11 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
                             int *r)
 {
-        struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
+        struct kvm_memory_slot *slot;
         bool async;
 
+        slot = __kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn, fault->private);
+
         /*
          * Retry the page fault if the gfn hit a memslot that is being deleted
          * or moved. This ensures any existing SPTEs for the old memslot will
@@ -3846,8 +3851,19 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
         if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
                 goto out_retry;
 
+        /*
+         * Exit to userspace to map the requested private/shared memory region
+         * if there is no memslot and (a) the access is private or (b) there is
+         * an existing private memslot. Emulated MMIO must be accessed through
+         * shared GPAs, thus a memslot miss on a private GPA is always handled
+         * as an implicit conversion "request".
+         */
+        if (!slot &&
+            (fault->private || __kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn, true)))
+                goto out_convert;
+
         if (!kvm_is_visible_memslot(slot)) {
-                /* Don't expose private memslots to L2. */
+                /* Don't expose KVM's internal memslots to L2. */
                 if (is_guest_mode(vcpu)) {
                         fault->pfn = KVM_PFN_NOSLOT;
                         fault->map_writable = false;
@@ -3890,6 +3906,12 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 out_retry:
         *r = RET_PF_RETRY;
         return true;
+
+out_convert:
+        vcpu->run->exit_reason = KVM_EXIT_MAP_MEMORY;
+        /* TODO: fill vcpu->run with more info. */
+        *r = 0;
+        return true;
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d447b21cdd73..21ff766f98d0 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -434,6 +434,7 @@ struct kvm_memory_slot {
         u32 flags;
         short id;
         u16 as_id;
+        struct guest_private_memory_ops *private_ops;
 };
 
 static inline bool kvm_slot_dirty_track_enabled(struct kvm_memory_slot *slot)
@@ -514,7 +515,7 @@ struct kvm_irq_routing_table {
 #define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_PRIVATE_MEM_SLOTS)
 
 #ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
-static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
+static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu, bool private)
 {
         return 0;
 }
@@ -785,13 +786,19 @@ static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
         return __kvm_memslots(kvm, 0);
 }
 
-static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
+static inline struct kvm_memslots *__kvm_vcpu_memslots(struct kvm_vcpu *vcpu,
+                                                        bool private)
 {
-        int as_id = kvm_arch_vcpu_memslots_id(vcpu);
+        int as_id = kvm_arch_vcpu_memslots_id(vcpu, private);
 
         return __kvm_memslots(vcpu->kvm, as_id);
 }
 
+static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
+{
+        return __kvm_vcpu_memslots(vcpu, false);
+}
+
 static inline
 struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
 {
@@ -807,6 +814,15 @@ struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
         return slot;
 }
 
+static inline bool memslot_is_private(struct kvm_memory_slot *slot)
+{
+#ifdef KVM_PRIVATE_ADDRESS_SPACE
+        return slot && slot->as_id == KVM_PRIVATE_ADDRESS_SPACE;
+#else
+        return false;
+#endif
+}
+
 /*
  * KVM_SET_USER_MEMORY_REGION ioctl allows the following operations:
  * - create a new memory slot
@@ -946,6 +962,8 @@ void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot, g
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
 
 struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
+struct kvm_memory_slot *__kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu,
+                                                  gfn_t gfn, bool private);
 struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn);
 kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
 kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a067410ebea5..bd01cff9aeff 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -101,6 +101,9 @@ struct kvm_userspace_memory_region {
         __u64 guest_phys_addr;
         __u64 memory_size; /* bytes */
         __u64 userspace_addr; /* start of the userspace allocated memory */
+#ifdef KVM_PRIVATE_ADDRESS_SPACE
+        __u32 fd; /* valid iff memslot is guest private memory */
+#endif
 };
 
 /*
@@ -269,6 +272,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_AP_RESET_HOLD 32
 #define KVM_EXIT_X86_BUS_LOCK 33
 #define KVM_EXIT_XEN 34
+#define KVM_EXIT_MEMORY_ERROR 35
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3e67c93ca403..8747a39d4173 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1583,7 +1583,19 @@ static int kvm_set_memslot(struct kvm *kvm,
                 kvm_copy_memslots(slots, __kvm_memslots(kvm, as_id));
         }
 
+#ifdef KVM_PRIVATE_ADDRESS_SPACE
+        if (change == KVM_MR_CREATE && as_id == KVM_PRIVATE_ADDRESS_SPACE) {
+                r = kvm_register_private_memslot(kvm, mem, old, new);
+                if (r)
+                        goto out_slots;
+        }
+#endif
+
         r = kvm_arch_prepare_memory_region(kvm, new, mem, change);
+#ifdef KVM_PRIVATE_ADDRESS_SPACE
+        if ((r || change == KVM_MR_DELETE) && as_id == KVM_PRIVATE_ADDRESS_SPACE)
+                kvm_unregister_private_memslot(kvm, mem, old, new);
+#endif
         if (r)
                 goto out_slots;
 
@@ -1706,6 +1718,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
                 new.dirty_bitmap = NULL;
                 memset(&new.arch, 0, sizeof(new.arch));
         } else { /* Modify an existing slot. */
+#ifdef KVM_PRIVATE_ADDRESS_SPACE
+                /* Private memslots are immutable, they can only be deleted. */
+                if (as_id == KVM_PRIVATE_ADDRESS_SPACE)
+                        return -EINVAL;
+#endif
+
                 if ((new.userspace_addr != old.userspace_addr) ||
                     (new.npages != old.npages) ||
                     ((new.flags ^ old.flags) & KVM_MEM_READONLY))
@@ -2059,9 +2077,10 @@ struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(gfn_to_memslot);
 
-struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn)
+struct kvm_memory_slot *__kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu,
+                                                  gfn_t gfn, bool private)
 {
-        struct kvm_memslots *slots = kvm_vcpu_memslots(vcpu);
+        struct kvm_memslots *slots = __kvm_vcpu_memslots(vcpu, private);
         struct kvm_memory_slot *slot;
         int slot_index;
 
@@ -2082,6 +2101,12 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn)
 
         return NULL;
 }
+EXPORT_SYMBOL_GPL(__kvm_vcpu_gfn_to_memslot);
+
+struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+        return __kvm_vcpu_gfn_to_memslot(vcpu, gfn, false);
+}
 EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
 
 bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
@@ -2428,8 +2453,12 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
                                bool atomic, bool *async, bool write_fault,
                                bool *writable, hva_t *hva)
 {
-        unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
+        unsigned long addr;
 
+        if (memslot_is_private(slot))
+                return slot->private_ops->gfn_to_pfn(...);
+
+        addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
         if (hva)
                 *hva = addr;
 
@@ -2624,8 +2653,7 @@ EXPORT_SYMBOL_GPL(kvm_map_gfn);
 
 int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
 {
-        return __kvm_map_gfn(kvm_vcpu_memslots(vcpu), gfn, map,
-                             NULL, false);
+        return __kvm_map_gfn(kvm_vcpu_memslots(vcpu), gfn, map, NULL, false);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_map);
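For completeness, a hypothetical userspace flow against this sketched ABI could look like the snippet below. Nothing here is upstream: the F_SEAL_GUEST value, the fd field in struct kvm_userspace_memory_region, and KVM_PRIVATE_ADDRESS_SPACE all come from the proposals in this thread, and the seal-before-resize ordering follows the "size must be 0 at F_SEAL_GUEST time" suggestion quoted earlier.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/kvm.h>

#define KVM_PRIVATE_ADDRESS_SPACE 2     /* from the sketch above */
#define F_SEAL_GUEST 0x0020             /* placeholder value, not upstream */

static int create_private_memslot(int vm_fd, __u32 slot, __u64 gpa, __u64 size)
{
        struct kvm_userspace_memory_region region = {
                /* The address space id lives in the upper 16 bits of "slot". */
                .slot = (KVM_PRIVATE_ADDRESS_SPACE << 16) | slot,
                .guest_phys_addr = gpa,
                .memory_size = size,
                .userspace_addr = 0,    /* no host mapping for private memory */
        };
        int memfd;

        memfd = memfd_create("guest-private", MFD_CLOEXEC | MFD_ALLOW_SEALING);
        if (memfd < 0)
                return -1;

        /* Seal while the file is still empty, then size it. */
        if (fcntl(memfd, F_ADD_SEALS, F_SEAL_GUEST) || ftruncate(memfd, size))
                return -1;

        /* region.fd is the field proposed in the uapi hunk above. */
        region.fd = memfd;

        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}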