Message ID | 11a923a314accf36a82aac4b676310a4802f5c75.1612777752.git.kai.huang@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM SGX virtualization support | expand |
On Mon, Feb 08, 2021 at 11:54:09PM +1300, Kai Huang wrote: > From: Sean Christopherson <sean.j.christopherson@intel.com> > > Add a misc device /dev/sgx_vepc to allow userspace to allocate "raw" EPC > without an associated enclave. The intended and only known use case for > raw EPC allocation is to expose EPC to a KVM guest, hence the 'vepc' > moniker, virt.{c,h} files and X86_SGX_KVM Kconfig. This commit message does give existential background for having vEPC. I.e. everything below this paragraph is "good enough" to make the case for SGX subsystem controlled vEPC. However, it does not give any existential background for /dev/sgx_vpec. Even with differing internals you could just as well make the whole thing as subfunction of /dev/sgx_enclave. It's perfectly doable. It does not really matter how much the same internals are used (e.g. sgx_encl). Without that clearly documented, it would be unwise to merge this. /Jarkko
On Tue, Feb 09, 2021 at 11:18:13PM +0200, Jarkko Sakkinen wrote: > On Mon, Feb 08, 2021 at 11:54:09PM +1300, Kai Huang wrote: > > From: Sean Christopherson <sean.j.christopherson@intel.com> > > > > Add a misc device /dev/sgx_vepc to allow userspace to allocate "raw" EPC > > without an associated enclave. The intended and only known use case for > > raw EPC allocation is to expose EPC to a KVM guest, hence the 'vepc' > > moniker, virt.{c,h} files and X86_SGX_KVM Kconfig. > > This commit message does give existential background for having vEPC. > I.e. everything below this paragraph is "good enough" to make the case > for SGX subsystem controlled vEPC. > > However, it does not give any existential background for /dev/sgx_vpec. > Even with differing internals you could just as well make the whole > thing as subfunction of /dev/sgx_enclave. It's perfectly doable. It > does not really matter how much the same internals are used (e.g. > sgx_encl). > > Without that clearly documented, it would be unwise to merge this. E.g. - Have ioctl() to turn opened fd as vEPC. - If FLC is disabled, you could only use the fd for creating vEPC. Quite easy stuff to implement. /Jarkko
On 2/9/21 1:19 PM, Jarkko Sakkinen wrote: >> Without that clearly documented, it would be unwise to merge this. > E.g. > > - Have ioctl() to turn opened fd as vEPC. > - If FLC is disabled, you could only use the fd for creating vEPC. > > Quite easy stuff to implement. The most important question to me is not how close vEPC is today, but how close it will be in the future. It's basically the age old question of: do we make one syscall that does two things or two syscalls? Is there a _compelling_ reason to change direction? How much code would we save?
On Tue, 2021-02-09 at 13:36 -0800, Dave Hansen wrote: > On 2/9/21 1:19 PM, Jarkko Sakkinen wrote: > > > Without that clearly documented, it would be unwise to merge this. > > E.g. > > > > - Have ioctl() to turn opened fd as vEPC. > > - If FLC is disabled, you could only use the fd for creating vEPC. > > > > Quite easy stuff to implement. > > The most important question to me is not how close vEPC is today, but > how close it will be in the future. It's basically the age old question > of: do we make one syscall that does two things or two syscalls? > > Is there a _compelling_ reason to change direction? How much code would > we save? Basically we need to defer 'sgx_encl' related code from open to after mmap(). For instance, We need to defer 'sgx_encl' allocation from open to SGX_IOC_ENCLAVE_CREATE. And due to this change, we also need to move some members out of 'sgx_encl', and use them as common for enclave and vEPC. The 'struct xarray page_array' would be a good example. The code we can save, from my first glance, is just misc_register("/dev/sgx_vepc") related, maybe plus some mmap() related code. The logic to handle both host enclave and vEPC still needs to be there. To me the major concern is /dev/sgx_enclave, by its name, implies it is associated with host enclave. Adding IOCTL to *convert* it to vEPC is just mixing things up, and is ugly. If we really want to do this, IMHO we need at least change /dev/sgx_enclave to /dev/sgx_epc, for instance, to imply the fd we opened and mmap()'d just represents some raw EPC. However this is changing to existing userspace ABI. Sean, What's your opinion? Did I miss anything?
On Wed, Feb 10, 2021, Kai Huang wrote: > On Tue, 2021-02-09 at 13:36 -0800, Dave Hansen wrote: > > On 2/9/21 1:19 PM, Jarkko Sakkinen wrote: > > > > Without that clearly documented, it would be unwise to merge this. > > > E.g. > > > > > > - Have ioctl() to turn opened fd as vEPC. > > > - If FLC is disabled, you could only use the fd for creating vEPC. > > > > > > Quite easy stuff to implement. ... > What's your opinion? Did I miss anything? Frankly, I think trying to smush them together would be a complete trainwreck. The vast majority of flows would need to go down completely different paths, so you'd end up with code like this: diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c index f2eac41bb4ff..5128043c7871 100644 --- a/arch/x86/kernel/cpu/sgx/driver.c +++ b/arch/x86/kernel/cpu/sgx/driver.c @@ -46,6 +46,9 @@ static int sgx_release(struct inode *inode, struct file *file) struct sgx_encl *encl = file->private_data; struct sgx_encl_mm *encl_mm; + if (encl->not_an_enclave) + return sgx_virt_epc_release(encl); + /* * Drain the remaining mm_list entries. At this point the list contains * entries for processes, which have closed the enclave file but have @@ -83,6 +86,9 @@ static int sgx_mmap(struct file *file, struct vm_area_struct *vma) struct sgx_encl *encl = file->private_data; int ret; + if (encl->not_an_enclave) + return sgx_virt_epc_mmap(encl, vma); + ret = sgx_encl_may_map(encl, vma->vm_start, vma->vm_end, vma->vm_flags); if (ret) return ret; @@ -104,6 +110,11 @@ static unsigned long sgx_get_unmapped_area(struct file *file, unsigned long pgoff, unsigned long flags) { + struct sgx_encl *encl = file->private_data; + + if (encl->not_an_enclave) + return sgx_virt_epc_mmap(encl, addr, len, pgoff, flags); + if ((flags & MAP_TYPE) == MAP_PRIVATE) return -EINVAL; I suspect it would also be tricky to avoid introducing races, since anything that is different for virtual EPC would have a dependency on the ioctl() being called. This would also prevent making /dev/sgx_enclave root-only while allowing users access to /dev/sgx_vepc. Forcing admins to use LSMs to do the same is silly. For the few flows that can share code, just split out the common bits to helpers.
On Wed, 10 Feb 2021 08:52:25 -0800 Sean Christopherson wrote: > On Wed, Feb 10, 2021, Kai Huang wrote: > > On Tue, 2021-02-09 at 13:36 -0800, Dave Hansen wrote: > > > On 2/9/21 1:19 PM, Jarkko Sakkinen wrote: > > > > > Without that clearly documented, it would be unwise to merge this. > > > > E.g. > > > > > > > > - Have ioctl() to turn opened fd as vEPC. > > > > - If FLC is disabled, you could only use the fd for creating vEPC. > > > > > > > > Quite easy stuff to implement. > > ... > > > What's your opinion? Did I miss anything? > > Frankly, I think trying to smush them together would be a complete trainwreck. > > The vast majority of flows would need to go down completely different paths, so > you'd end up with code like this: > > diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c > index f2eac41bb4ff..5128043c7871 100644 > --- a/arch/x86/kernel/cpu/sgx/driver.c > +++ b/arch/x86/kernel/cpu/sgx/driver.c > @@ -46,6 +46,9 @@ static int sgx_release(struct inode *inode, struct file *file) > struct sgx_encl *encl = file->private_data; > struct sgx_encl_mm *encl_mm; > > + if (encl->not_an_enclave) > + return sgx_virt_epc_release(encl); > + > /* > * Drain the remaining mm_list entries. At this point the list contains > * entries for processes, which have closed the enclave file but have > @@ -83,6 +86,9 @@ static int sgx_mmap(struct file *file, struct vm_area_struct *vma) > struct sgx_encl *encl = file->private_data; > int ret; > > + if (encl->not_an_enclave) > + return sgx_virt_epc_mmap(encl, vma); > + > ret = sgx_encl_may_map(encl, vma->vm_start, vma->vm_end, vma->vm_flags); > if (ret) > return ret; > @@ -104,6 +110,11 @@ static unsigned long sgx_get_unmapped_area(struct file *file, > unsigned long pgoff, > unsigned long flags) > { > + struct sgx_encl *encl = file->private_data; > + > + if (encl->not_an_enclave) > + return sgx_virt_epc_mmap(encl, addr, len, pgoff, flags); > + > if ((flags & MAP_TYPE) == MAP_PRIVATE) > return -EINVAL; > > I suspect it would also be tricky to avoid introducing races, since anything that > is different for virtual EPC would have a dependency on the ioctl() being called. > > This would also prevent making /dev/sgx_enclave root-only while allowing users > access to /dev/sgx_vepc. Forcing admins to use LSMs to do the same is silly. Agreed. This is really a good point. Two different device nodes allows different permission control. Thanks. > > For the few flows that can share code, just split out the common bits to helpers.
On Tue, Feb 09, 2021 at 01:36:06PM -0800, Dave Hansen wrote: > On 2/9/21 1:19 PM, Jarkko Sakkinen wrote: > >> Without that clearly documented, it would be unwise to merge this. > > E.g. > > > > - Have ioctl() to turn opened fd as vEPC. > > - If FLC is disabled, you could only use the fd for creating vEPC. > > > > Quite easy stuff to implement. > > The most important question to me is not how close vEPC is today, but > how close it will be in the future. It's basically the age old question > of: do we make one syscall that does two things or two syscalls? > > Is there a _compelling_ reason to change direction? How much code would > we save? I'm not really concerned about saving any code, at all. I'm more concerned of adding unnecessary bells and whistles to uapi. IMHO when new uapi is adding it has the burden of ensuring that it is relevant, and necessary. For me it like now to be frank that I'm sure, which one is the right way to go, i.e. I'm not positioned to any side. But being unsure neither is not a great position to ack anything. /Jarkko
On Wed, Feb 10, 2021 at 01:20:32PM +1300, Kai Huang wrote: > On Tue, 2021-02-09 at 13:36 -0800, Dave Hansen wrote: > > On 2/9/21 1:19 PM, Jarkko Sakkinen wrote: > > > > Without that clearly documented, it would be unwise to merge this. > > > E.g. > > > > > > - Have ioctl() to turn opened fd as vEPC. > > > - If FLC is disabled, you could only use the fd for creating vEPC. > > > > > > Quite easy stuff to implement. > > > > The most important question to me is not how close vEPC is today, but > > how close it will be in the future. It's basically the age old question > > of: do we make one syscall that does two things or two syscalls? > > > > Is there a _compelling_ reason to change direction? How much code would > > we save? > > Basically we need to defer 'sgx_encl' related code from open to after mmap(). For > instance, We need to defer 'sgx_encl' allocation from open to SGX_IOC_ENCLAVE_CREATE. > And due to this change, we also need to move some members out of 'sgx_encl', and use > them as common for enclave and vEPC. The 'struct xarray page_array' would be a good > example. > > The code we can save, from my first glance, is just misc_register("/dev/sgx_vepc") > related, maybe plus some mmap() related code. The logic to handle both host enclave > and vEPC still needs to be there. > > To me the major concern is /dev/sgx_enclave, by its name, implies it is associated > with host enclave. Adding IOCTL to *convert* it to vEPC is just mixing things up, and > is ugly. If we really want to do this, IMHO we need at least change /dev/sgx_enclave > to /dev/sgx_epc, for instance, to imply the fd we opened and mmap()'d just represents > some raw EPC. However this is changing to existing userspace ABI. > > Sean, > > What's your opinion? Did I miss anything? I think this mostly makes sense to me. And also, it gives an opportunity to have add some granularity to access control (worth of mentioning in the commit message). Thanks. /Jarkko
On Wed, Feb 10, 2021 at 08:52:25AM -0800, Sean Christopherson wrote: > On Wed, Feb 10, 2021, Kai Huang wrote: > > On Tue, 2021-02-09 at 13:36 -0800, Dave Hansen wrote: > > > On 2/9/21 1:19 PM, Jarkko Sakkinen wrote: > > > > > Without that clearly documented, it would be unwise to merge this. > > > > E.g. > > > > > > > > - Have ioctl() to turn opened fd as vEPC. > > > > - If FLC is disabled, you could only use the fd for creating vEPC. > > > > > > > > Quite easy stuff to implement. > > ... > > > What's your opinion? Did I miss anything? > > Frankly, I think trying to smush them together would be a complete trainwreck. > > The vast majority of flows would need to go down completely different paths, so > you'd end up with code like this: > > diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c > index f2eac41bb4ff..5128043c7871 100644 > --- a/arch/x86/kernel/cpu/sgx/driver.c > +++ b/arch/x86/kernel/cpu/sgx/driver.c > @@ -46,6 +46,9 @@ static int sgx_release(struct inode *inode, struct file *file) > struct sgx_encl *encl = file->private_data; > struct sgx_encl_mm *encl_mm; > > + if (encl->not_an_enclave) > + return sgx_virt_epc_release(encl); > + > /* > * Drain the remaining mm_list entries. At this point the list contains > * entries for processes, which have closed the enclave file but have > @@ -83,6 +86,9 @@ static int sgx_mmap(struct file *file, struct vm_area_struct *vma) > struct sgx_encl *encl = file->private_data; > int ret; > > + if (encl->not_an_enclave) > + return sgx_virt_epc_mmap(encl, vma); > + > ret = sgx_encl_may_map(encl, vma->vm_start, vma->vm_end, vma->vm_flags); > if (ret) > return ret; > @@ -104,6 +110,11 @@ static unsigned long sgx_get_unmapped_area(struct file *file, > unsigned long pgoff, > unsigned long flags) > { > + struct sgx_encl *encl = file->private_data; > + > + if (encl->not_an_enclave) > + return sgx_virt_epc_mmap(encl, addr, len, pgoff, flags); > + > if ((flags & MAP_TYPE) == MAP_PRIVATE) > return -EINVAL; > > I suspect it would also be tricky to avoid introducing races, since anything that > is different for virtual EPC would have a dependency on the ioctl() being called. > > This would also prevent making /dev/sgx_enclave root-only while allowing users > access to /dev/sgx_vepc. Forcing admins to use LSMs to do the same is silly. > > For the few flows that can share code, just split out the common bits to helpers. I'm cool with keeping the device. This is just my opinion that even "obvious" should be documented when it comes to uapi. I.e. no matter how stupid and simple reasons are to add a new device file, please just write it down to commit message. /Jarkko
On Fri, 2021-02-12 at 14:17 +0200, Jarkko Sakkinen wrote: > On Wed, Feb 10, 2021 at 08:52:25AM -0800, Sean Christopherson wrote: > > On Wed, Feb 10, 2021, Kai Huang wrote: > > > On Tue, 2021-02-09 at 13:36 -0800, Dave Hansen wrote: > > > > On 2/9/21 1:19 PM, Jarkko Sakkinen wrote: > > > > > > Without that clearly documented, it would be unwise to merge this. > > > > > E.g. > > > > > > > > > > - Have ioctl() to turn opened fd as vEPC. > > > > > - If FLC is disabled, you could only use the fd for creating vEPC. > > > > > > > > > > Quite easy stuff to implement. > > > > ... > > > > > What's your opinion? Did I miss anything? > > > > Frankly, I think trying to smush them together would be a complete trainwreck. > > > > The vast majority of flows would need to go down completely different paths, so > > you'd end up with code like this: > > > > diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c > > index f2eac41bb4ff..5128043c7871 100644 > > --- a/arch/x86/kernel/cpu/sgx/driver.c > > +++ b/arch/x86/kernel/cpu/sgx/driver.c > > @@ -46,6 +46,9 @@ static int sgx_release(struct inode *inode, struct file *file) > > struct sgx_encl *encl = file->private_data; > > struct sgx_encl_mm *encl_mm; > > > > > > + if (encl->not_an_enclave) > > + return sgx_virt_epc_release(encl); > > + > > /* > > * Drain the remaining mm_list entries. At this point the list contains > > * entries for processes, which have closed the enclave file but have > > @@ -83,6 +86,9 @@ static int sgx_mmap(struct file *file, struct vm_area_struct *vma) > > struct sgx_encl *encl = file->private_data; > > int ret; > > > > > > + if (encl->not_an_enclave) > > + return sgx_virt_epc_mmap(encl, vma); > > + > > ret = sgx_encl_may_map(encl, vma->vm_start, vma->vm_end, vma->vm_flags); > > if (ret) > > return ret; > > @@ -104,6 +110,11 @@ static unsigned long sgx_get_unmapped_area(struct file *file, > > unsigned long pgoff, > > unsigned long flags) > > { > > + struct sgx_encl *encl = file->private_data; > > + > > + if (encl->not_an_enclave) > > + return sgx_virt_epc_mmap(encl, addr, len, pgoff, flags); > > + > > if ((flags & MAP_TYPE) == MAP_PRIVATE) > > return -EINVAL; > > > > I suspect it would also be tricky to avoid introducing races, since anything that > > is different for virtual EPC would have a dependency on the ioctl() being called. > > > > This would also prevent making /dev/sgx_enclave root-only while allowing users > > access to /dev/sgx_vepc. Forcing admins to use LSMs to do the same is silly. > > > > For the few flows that can share code, just split out the common bits to helpers. > > I'm cool with keeping the device. This is just my opinion that even > "obvious" should be documented when it comes to uapi. I.e. no matter > how stupid and simple reasons are to add a new device file, please > just write it down to commit message. > > /Jarkko Yes reasonable. I added some description to the commit message. I have already sent out the v5. Please take a look and see whether it is OK for you. Thanks!
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 21f851179ff0..ccb35d14c297 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1951,6 +1951,18 @@ config X86_SGX If unsure, say N. +config X86_SGX_KVM + bool "Software Guard eXtensions (SGX) Virtualization" + depends on X86_SGX && KVM_INTEL + help + + Enables KVM guests to create SGX enclaves. + + This includes support to expose "raw" unreclaimable enclave memory to + guests via a device node, e.g. /dev/sgx_vepc. + + If unsure, say N. + config EFI bool "EFI runtime service support" depends on ACPI diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile index 91d3dc784a29..9c1656779b2a 100644 --- a/arch/x86/kernel/cpu/sgx/Makefile +++ b/arch/x86/kernel/cpu/sgx/Makefile @@ -3,3 +3,4 @@ obj-y += \ encl.o \ ioctl.o \ main.o +obj-$(CONFIG_X86_SGX_KVM) += virt.o diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h index 5fa42d143feb..1bff93be7bf4 100644 --- a/arch/x86/kernel/cpu/sgx/sgx.h +++ b/arch/x86/kernel/cpu/sgx/sgx.h @@ -83,4 +83,13 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page); int sgx_unmark_page_reclaimable(struct sgx_epc_page *page); struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim); +#ifdef CONFIG_X86_SGX_KVM +int __init sgx_vepc_init(void); +#else +static inline int __init sgx_vepc_init(void) +{ + return -ENODEV; +} +#endif + #endif /* _X86_SGX_H */ diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c new file mode 100644 index 000000000000..47542140f8c1 --- /dev/null +++ b/arch/x86/kernel/cpu/sgx/virt.c @@ -0,0 +1,259 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Device driver to expose SGX enclave memory to KVM guests. + * + * Copyright(c) 2021 Intel Corporation. + */ + +#include <linux/miscdevice.h> +#include <linux/mm.h> +#include <linux/mman.h> +#include <linux/sched/mm.h> +#include <linux/sched/signal.h> +#include <linux/slab.h> +#include <linux/xarray.h> +#include <asm/sgx.h> +#include <uapi/asm/sgx.h> + +#include "encls.h" +#include "sgx.h" + +struct sgx_vepc { + struct xarray page_array; + struct mutex lock; +}; + +/* + * Temporary SECS pages that cannot be EREMOVE'd due to having child in other + * virtual EPC instances, and the lock to protect it. + */ +static struct mutex zombie_secs_pages_lock; +static struct list_head zombie_secs_pages; + +static int __sgx_vepc_fault(struct sgx_vepc *vepc, + struct vm_area_struct *vma, unsigned long addr) +{ + struct sgx_epc_page *epc_page; + unsigned long index, pfn; + int ret; + + WARN_ON(!mutex_is_locked(&vepc->lock)); + + /* Calculate index of EPC page in virtual EPC's page_array */ + index = vma->vm_pgoff + PFN_DOWN(addr - vma->vm_start); + + epc_page = xa_load(&vepc->page_array, index); + if (epc_page) + return 0; + + epc_page = sgx_alloc_epc_page(vepc, false); + if (IS_ERR(epc_page)) + return PTR_ERR(epc_page); + + ret = xa_err(xa_store(&vepc->page_array, index, epc_page, GFP_KERNEL)); + if (ret) + goto err_free; + + pfn = PFN_DOWN(sgx_get_epc_phys_addr(epc_page)); + + ret = vmf_insert_pfn(vma, addr, pfn); + if (ret != VM_FAULT_NOPAGE) { + ret = -EFAULT; + goto err_delete; + } + + return 0; + +err_delete: + xa_erase(&vepc->page_array, index); +err_free: + sgx_free_epc_page(epc_page); + return ret; +} + +static vm_fault_t sgx_vepc_fault(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + struct sgx_vepc *vepc = vma->vm_private_data; + int ret; + + mutex_lock(&vepc->lock); + ret = __sgx_vepc_fault(vepc, vma, vmf->address); + mutex_unlock(&vepc->lock); + + if (!ret) + return VM_FAULT_NOPAGE; + + if (ret == -EBUSY && (vmf->flags & FAULT_FLAG_ALLOW_RETRY)) { + mmap_read_unlock(vma->vm_mm); + return VM_FAULT_RETRY; + } + + return VM_FAULT_SIGBUS; +} + +const struct vm_operations_struct sgx_vepc_vm_ops = { + .fault = sgx_vepc_fault, +}; + +static int sgx_vepc_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct sgx_vepc *vepc = file->private_data; + + if (!(vma->vm_flags & VM_SHARED)) + return -EINVAL; + + vma->vm_ops = &sgx_vepc_vm_ops; + /* Don't copy VMA in fork() */ + vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTDUMP | VM_DONTCOPY; + vma->vm_private_data = vepc; + + return 0; +} + +static int sgx_vepc_free_page(struct sgx_epc_page *epc_page) +{ + int ret; + + /* + * Take a previously guest-owned EPC page and return it to the + * general EPC page pool. + * + * Guests can not be trusted to have left this page in a good + * state, so run EREMOVE on the page unconditionally. In the + * case that a guest properly EREMOVE'd this page, a superfluous + * EREMOVE is harmless. + */ + ret = __eremove(sgx_get_epc_virt_addr(epc_page)); + if (ret) { + /* + * Only SGX_CHILD_PRESENT is expected, which is because of + * EREMOVE'ing an SECS still with child, in which case it can + * be handled by EREMOVE'ing the SECS again after all pages in + * virtual EPC have been EREMOVE'd. See comments in below in + * sgx_vepc_release(). + * + * The user of virtual EPC (KVM) needs to guarantee there's no + * logical processor is still running in the enclave in guest, + * otherwise EREMOVE will get SGX_ENCLAVE_ACT which cannot be + * handled here. + */ + WARN_ONCE(ret != SGX_CHILD_PRESENT, + "EREMOVE (EPC page 0x%lx): unexpected error: %d\n", + sgx_get_epc_phys_addr(epc_page), ret); + return ret; + } + + sgx_free_epc_page(epc_page); + return 0; +} + +static int sgx_vepc_release(struct inode *inode, struct file *file) +{ + struct sgx_vepc *vepc = file->private_data; + struct sgx_epc_page *epc_page, *tmp, *entry; + unsigned long index; + + LIST_HEAD(secs_pages); + + xa_for_each(&vepc->page_array, index, entry) { + /* + * Remove all normal, child pages. sgx_vepc_free_page() + * will fail if EREMOVE fails, but this is OK and expected on + * SECS pages. Those can only be EREMOVE'd *after* all their + * child pages. Retries below will clean them up. + */ + if (sgx_vepc_free_page(entry)) + continue; + + xa_erase(&vepc->page_array, index); + } + + /* + * Retry EREMOVE'ing pages. This will clean up any SECS pages that + * only had children in this 'epc' area. + */ + xa_for_each(&vepc->page_array, index, entry) { + epc_page = entry; + /* + * An EREMOVE failure here means that the SECS page still + * has children. But, since all children in this 'sgx_vepc' + * have been removed, the SECS page must have a child on + * another instance. + */ + if (sgx_vepc_free_page(epc_page)) + list_add_tail(&epc_page->list, &secs_pages); + + xa_erase(&vepc->page_array, index); + } + + /* + * SECS pages are "pinned" by child pages, an unpinned once all + * children have been EREMOVE'd. A child page in this instance + * may have pinned an SECS page encountered in an earlier release(), + * creating a zombie. Since some children were EREMOVE'd above, + * try to EREMOVE all zombies in the hopes that one was unpinned. + */ + mutex_lock(&zombie_secs_pages_lock); + list_for_each_entry_safe(epc_page, tmp, &zombie_secs_pages, list) { + /* + * Speculatively remove the page from the list of zombies, + * if the page is successfully EREMOVE it will be added to + * the list of free pages. If EREMOVE fails, throw the page + * on the local list, which will be spliced on at the end. + */ + list_del(&epc_page->list); + + if (sgx_vepc_free_page(epc_page)) + list_add_tail(&epc_page->list, &secs_pages); + } + + if (!list_empty(&secs_pages)) + list_splice_tail(&secs_pages, &zombie_secs_pages); + mutex_unlock(&zombie_secs_pages_lock); + + kfree(vepc); + + return 0; +} + +static int sgx_vepc_open(struct inode *inode, struct file *file) +{ + struct sgx_vepc *vepc; + + vepc = kzalloc(sizeof(struct sgx_vepc), GFP_KERNEL); + if (!vepc) + return -ENOMEM; + mutex_init(&vepc->lock); + xa_init(&vepc->page_array); + + file->private_data = vepc; + + return 0; +} + +static const struct file_operations sgx_vepc_fops = { + .owner = THIS_MODULE, + .open = sgx_vepc_open, + .release = sgx_vepc_release, + .mmap = sgx_vepc_mmap, +}; + +static struct miscdevice sgx_vepc_dev = { + .minor = MISC_DYNAMIC_MINOR, + .name = "sgx_vepc", + .nodename = "sgx_vepc", + .fops = &sgx_vepc_fops, +}; + +int __init sgx_vepc_init(void) +{ + /* SGX virtualization requires KVM to work */ + if (!boot_cpu_has(X86_FEATURE_VMX)) + return -ENODEV; + + INIT_LIST_HEAD(&zombie_secs_pages); + mutex_init(&zombie_secs_pages_lock); + + return misc_register(&sgx_vepc_dev); +}