Message ID | 20210913131153.1202354-2-pbonzini@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | x86: sgx_vepc: implement ioctl to EREMOVE all pages | expand |
On 9/13/21 6:11 AM, Paolo Bonzini wrote: > Windows expects all pages to be in uninitialized state on startup. > In order to implement this, we will need a ioctl that performs > EREMOVE on all pages mapped by a /dev/sgx_vepc file descriptor: > other possibilities, such as closing and reopening the device, > are racy. Hi Paolo, How does this end up happening in the first place? All enclave pages should start out on 'sgx_dirty_page_list' and ksgxd sanitizes them with EREMOVE before making them available. That should cover EREMOVE after reboots while SGX pages are initialized, including kexec(). sgx_vepc_free_page() should do the same for pages that a guest not not clean up properly. sgx_encl_free_epc_page() does an EREMOVE after a normal enclave has used a page. Those are the only three cases that I can think of. So, it sounds like one of those is buggy, or there's another unexpected path out there. Ultimately, I think it would be really handy if we could do this EREMOVE implicitly and without any new ABI.
On 13/09/21 16:05, Dave Hansen wrote: > On 9/13/21 6:11 AM, Paolo Bonzini wrote: >> Windows expects all pages to be in uninitialized state on startup. >> In order to implement this, we will need a ioctl that performs >> EREMOVE on all pages mapped by a /dev/sgx_vepc file descriptor: >> other possibilities, such as closing and reopening the device, >> are racy. > > Hi Paolo, > > How does this end up happening in the first place? > > All enclave pages should start out on 'sgx_dirty_page_list' and > ksgxd sanitizes them with EREMOVE before making them available. That > should cover EREMOVE after reboots while SGX pages are initialized, > including kexec(). By "Windows startup" I mean even after guest reboot. Because another process could sneak in and steal your EPC pages between a close() and an open(), I'd like to have a way to EREMOVE the pages while keeping them assigned to the specific vEPC instance, i.e. *without* going through sgx_vepc_free_page(). Thanks, Paolo > sgx_vepc_free_page() should do the same for pages that a guest not not > clean up properly. > > sgx_encl_free_epc_page() does an EREMOVE after a normal enclave has used > a page. > > Those are the only three cases that I can think of. So, it sounds like > one of those is buggy, or there's another unexpected path out there. > Ultimately, I think it would be really handy if we could do this EREMOVE > implicitly and without any new ABI. >
On 9/13/21 7:24 AM, Paolo Bonzini wrote: >> How does this end up happening in the first place? >> >> All enclave pages should start out on 'sgx_dirty_page_list' and >> ksgxd sanitizes them with EREMOVE before making them available. That >> should cover EREMOVE after reboots while SGX pages are initialized, >> including kexec(). > > By "Windows startup" I mean even after guest reboot. Because another > process could sneak in and steal your EPC pages between a close() and an > open(), I'd like to have a way to EREMOVE the pages while keeping them > assigned to the specific vEPC instance, i.e. *without* going through > sgx_vepc_free_page(). Oh, so you want fresh EPC state for the guest, but you're concerned that the previous guest might have left them in a bad state. The current method of getting a new vepc instance (which guarantees fresh state) has some other downsides. Can't another process steal pages via sgxd and reclaim at any time? What's the extra concern here about going through a close()/open() cycle? Performance?
On 13/09/21 16:55, Dave Hansen wrote: >> By "Windows startup" I mean even after guest reboot. Because another >> process could sneak in and steal your EPC pages between a close() and an >> open(), I'd like to have a way to EREMOVE the pages while keeping them >> assigned to the specific vEPC instance, i.e.*without* going through >> sgx_vepc_free_page(). > Oh, so you want fresh EPC state for the guest, but you're concerned that > the previous guest might have left them in a bad state. The current > method of getting a new vepc instance (which guarantees fresh state) has > some other downsides. > > Can't another process steal pages via sgxd and reclaim at any time? vEPC pages never call sgx_mark_page_reclaimable, don't they? > What's the extra concern here about going through a close()/open() > cycle? Performance? Apart from reclaiming, /dev/sgx_vepc might disappear between the first open() and subsequent ones. Paolo
On 9/13/21 8:14 AM, Paolo Bonzini wrote: > On 13/09/21 16:55, Dave Hansen wrote: >>> By "Windows startup" I mean even after guest reboot. Because another >>> process could sneak in and steal your EPC pages between a close() and an >>> open(), I'd like to have a way to EREMOVE the pages while keeping them >>> assigned to the specific vEPC instance, i.e.*without* going through >>> sgx_vepc_free_page(). >> Oh, so you want fresh EPC state for the guest, but you're concerned that >> the previous guest might have left them in a bad state. The current >> method of getting a new vepc instance (which guarantees fresh state) has >> some other downsides. >> >> Can't another process steal pages via sgxd and reclaim at any time? > > vEPC pages never call sgx_mark_page_reclaimable, don't they? Oh, I was just looking that they were on the SGX LRU. You might be right. But, we certainly don't want the fact that they are unreclaimable today to be part of the ABI. It's more of a bug than a feature. >> What's the extra concern here about going through a close()/open() >> cycle? Performance? > > Apart from reclaiming, /dev/sgx_vepc might disappear between the first > open() and subsequent ones. Aside from it being rm'd, I don't think that's possible.
On 13/09/21 17:29, Dave Hansen wrote: > On 9/13/21 8:14 AM, Paolo Bonzini wrote: >> On 13/09/21 16:55, Dave Hansen wrote: >>>> By "Windows startup" I mean even after guest reboot. Because another >>>> process could sneak in and steal your EPC pages between a close() and an >>>> open(), I'd like to have a way to EREMOVE the pages while keeping them >>>> assigned to the specific vEPC instance, i.e.*without* going through >>>> sgx_vepc_free_page(). >>> Oh, so you want fresh EPC state for the guest, but you're concerned that >>> the previous guest might have left them in a bad state. The current >>> method of getting a new vepc instance (which guarantees fresh state) has >>> some other downsides. >>> >>> Can't another process steal pages via sgxd and reclaim at any time? >> >> vEPC pages never call sgx_mark_page_reclaimable, don't they? > > Oh, I was just looking that they were on the SGX LRU. You might be right. > But, we certainly don't want the fact that they are unreclaimable today > to be part of the ABI. It's more of a bug than a feature. Sure, that's fine. >>> What's the extra concern here about going through a close()/open() >>> cycle? Performance? >> >> Apart from reclaiming, /dev/sgx_vepc might disappear between the first >> open() and subsequent ones. > > Aside from it being rm'd, I don't think that's possible. > Being rm'd would be a possibility in principle, and it would be ugly for it to cause issues on running VMs. Also I'd like for it to be able to pass /dev/sgx_vepc in via a file descriptor, and run QEMU in a chroot or a mount namespace. Alternatively, with seccomp it may be possible to sandbox a running QEMU process in such a way that open() is forbidden at runtime (all hotplug is done via file descriptor passing); it is not yet possible, but it is a goal. Paolo
On 9/13/21 11:35 AM, Paolo Bonzini wrote: >>> Apart from reclaiming, /dev/sgx_vepc might disappear between the first >>> open() and subsequent ones. >> >> Aside from it being rm'd, I don't think that's possible. >> > > Being rm'd would be a possibility in principle, and it would be ugly for > it to cause issues on running VMs. Also I'd like for it to be able to > pass /dev/sgx_vepc in via a file descriptor, and run QEMU in a chroot or > a mount namespace. Alternatively, with seccomp it may be possible to > sandbox a running QEMU process in such a way that open() is forbidden at > runtime (all hotplug is done via file descriptor passing); it is not yet > possible, but it is a goal. OK, so maybe another way of saying this: For bare-metal SGX on real hardware, the hardware provides guarantees SGX state at reboot. For instance, all pages start out uninitialized. The vepc driver provides a similar guarantee today for freshly-opened vepc instances. But, vepc users have a problem: they might want to run an OS that expects to be booted with clean, fully uninitialized SGX state, just as it would be on bare-metal. Right now, the only way to get that is to create a new vepc instance. That might not be possible in all configurations, like if the permission to open(/dev/sgx_vepc) has been lost since the VM was first booted. Windows has these expectations about booting with fully uninitialized state. There are also a number of environments where QEMU is sandboxed or drops permissions in a way that prevent free and open access to /dev/sgx_vepc. So good so far?
On Mon, 2021-09-13 at 09:11 -0400, Paolo Bonzini wrote: > Windows expects all pages to be in uninitialized state on startup. > In order to implement this, we will need a ioctl that performs > EREMOVE on all pages mapped by a /dev/sgx_vepc file descriptor: > other possibilities, such as closing and reopening the device, > are racy. So what makes it racy, and what do mean by racy in this case? I.e. you have to do open() and mmap(), and munmap() + close() for removal. Depending on situation that is racy or not... And is "Windows" more precisely a "Windows guest running in Linux QEMU host"? It's ambiguous.. /Jarkko
On Mon, 2021-09-13 at 16:24 +0200, Paolo Bonzini wrote: > On 13/09/21 16:05, Dave Hansen wrote: > > On 9/13/21 6:11 AM, Paolo Bonzini wrote: > > > Windows expects all pages to be in uninitialized state on startup. > > > In order to implement this, we will need a ioctl that performs > > > EREMOVE on all pages mapped by a /dev/sgx_vepc file descriptor: > > > other possibilities, such as closing and reopening the device, > > > are racy. > > > > Hi Paolo, > > > > How does this end up happening in the first place? > > > > All enclave pages should start out on 'sgx_dirty_page_list' and > > ksgxd sanitizes them with EREMOVE before making them available. That > > should cover EREMOVE after reboots while SGX pages are initialized, > > including kexec(). > > By "Windows startup" I mean even after guest reboot. Because another > process could sneak in and steal your EPC pages between a close() and an > open(), I'd like to have a way to EREMOVE the pages while keeping them > assigned to the specific vEPC instance, i.e. *without* going through > sgx_vepc_free_page(). Isn't "other process in and steal your EPC pages" more like sysadmin problem, rather than software? I'm lacking of understanding what would be the collateral damage in the end. /Jarkko
On Mon, 2021-09-13 at 07:55 -0700, Dave Hansen wrote: > On 9/13/21 7:24 AM, Paolo Bonzini wrote: > > > How does this end up happening in the first place? > > > > > > All enclave pages should start out on 'sgx_dirty_page_list' and > > > ksgxd sanitizes them with EREMOVE before making them available. That > > > should cover EREMOVE after reboots while SGX pages are initialized, > > > including kexec(). > > > > By "Windows startup" I mean even after guest reboot. Because another > > process could sneak in and steal your EPC pages between a close() and an > > open(), I'd like to have a way to EREMOVE the pages while keeping them > > assigned to the specific vEPC instance, i.e. *without* going through > > sgx_vepc_free_page(). > > Oh, so you want fresh EPC state for the guest, but you're concerned that > the previous guest might have left them in a bad state. The current > method of getting a new vepc instance (which guarantees fresh state) has > some other downsides. > > Can't another process steal pages via sgxd and reclaim at any time? > What's the extra concern here about going through a close()/open() > cycle? Performance? sgxd does not steal anything from vepc regions. They are not part of the reclaiming process. /Jarkko
On Mon, 2021-09-13 at 17:14 +0200, Paolo Bonzini wrote: > On 13/09/21 16:55, Dave Hansen wrote: > > > By "Windows startup" I mean even after guest reboot. Because another > > > process could sneak in and steal your EPC pages between a close() and an > > > open(), I'd like to have a way to EREMOVE the pages while keeping them > > > assigned to the specific vEPC instance, i.e.*without* going through > > > sgx_vepc_free_page(). > > Oh, so you want fresh EPC state for the guest, but you're concerned that > > the previous guest might have left them in a bad state. The current > > method of getting a new vepc instance (which guarantees fresh state) has > > some other downsides. > > > > Can't another process steal pages via sgxd and reclaim at any time? > > vEPC pages never call sgx_mark_page_reclaimable, don't they? > > > What's the extra concern here about going through a close()/open() > > cycle? Performance? > > Apart from reclaiming, /dev/sgx_vepc might disappear between the first > open() and subsequent ones. If /dev/sgx_vepc dissapears, why is it a problem *for the software*, and not a sysadmin problem? I think that this is what the whole patch is lacking, why are we talking about a software problem... /Jarkko
On Mon, 2021-09-13 at 20:35 +0200, Paolo Bonzini wrote: > On 13/09/21 17:29, Dave Hansen wrote: > > On 9/13/21 8:14 AM, Paolo Bonzini wrote: > > > On 13/09/21 16:55, Dave Hansen wrote: > > > > > By "Windows startup" I mean even after guest reboot. Because another > > > > > process could sneak in and steal your EPC pages between a close() and an > > > > > open(), I'd like to have a way to EREMOVE the pages while keeping them > > > > > assigned to the specific vEPC instance, i.e.*without* going through > > > > > sgx_vepc_free_page(). > > > > Oh, so you want fresh EPC state for the guest, but you're concerned that > > > > the previous guest might have left them in a bad state. The current > > > > method of getting a new vepc instance (which guarantees fresh state) has > > > > some other downsides. > > > > > > > > Can't another process steal pages via sgxd and reclaim at any time? > > > > > > vEPC pages never call sgx_mark_page_reclaimable, don't they? > > > > Oh, I was just looking that they were on the SGX LRU. You might be right. > > But, we certainly don't want the fact that they are unreclaimable today > > to be part of the ABI. It's more of a bug than a feature. > > Sure, that's fine. > > > > > What's the extra concern here about going through a close()/open() > > > > cycle? Performance? > > > > > > Apart from reclaiming, /dev/sgx_vepc might disappear between the first > > > open() and subsequent ones. > > > > Aside from it being rm'd, I don't think that's possible. > > > > Being rm'd would be a possibility in principle, and it would be ugly for > it to cause issues on running VMs. Also I'd like for it to be able to > pass /dev/sgx_vepc in via a file descriptor, and run QEMU in a chroot or > a mount namespace. Alternatively, with seccomp it may be possible to > sandbox a running QEMU process in such a way that open() is forbidden at > runtime (all hotplug is done via file descriptor passing); it is not yet > possible, but it is a goal. AFAIK, as long you have open files for a device, they work as expected. /Jarkko
On Mon, 2021-09-13 at 12:25 -0700, Dave Hansen wrote: > On 9/13/21 11:35 AM, Paolo Bonzini wrote: > > > > Apart from reclaiming, /dev/sgx_vepc might disappear between the first > > > > open() and subsequent ones. > > > > > > Aside from it being rm'd, I don't think that's possible. > > > > > > > Being rm'd would be a possibility in principle, and it would be ugly for > > it to cause issues on running VMs. Also I'd like for it to be able to > > pass /dev/sgx_vepc in via a file descriptor, and run QEMU in a chroot or > > a mount namespace. Alternatively, with seccomp it may be possible to > > sandbox a running QEMU process in such a way that open() is forbidden at > > runtime (all hotplug is done via file descriptor passing); it is not yet > > possible, but it is a goal. > > OK, so maybe another way of saying this: > > For bare-metal SGX on real hardware, the hardware provides guarantees > SGX state at reboot. For instance, all pages start out uninitialized. > The vepc driver provides a similar guarantee today for freshly-opened > vepc instances. > > But, vepc users have a problem: they might want to run an OS that > expects to be booted with clean, fully uninitialized SGX state, just as > it would be on bare-metal. Right now, the only way to get that is to > create a new vepc instance. That might not be possible in all > configurations, like if the permission to open(/dev/sgx_vepc) has been > lost since the VM was first booted. So you maintain your systems in a way that this does not happen? /Jarkko
On 13/09/21 23:13, Jarkko Sakkinen wrote: >> Apart from reclaiming, /dev/sgx_vepc might disappear between the first >> open() and subsequent ones. > > If /dev/sgx_vepc disappears, why is it a problem *for the software*, and > not a sysadmin problem? Rather than disappearing, it could be that a program first gets all the resources it needs before it gets malicious input, and then enter a restrictive sandbox. In this case open() could be completely forbidden. I will improve the documentation and changelogs when I post the non-RFC version; that could have been done better, sorry. Paolo
On Tue, 2021-09-14 at 07:36 +0200, Paolo Bonzini wrote: > On 13/09/21 23:13, Jarkko Sakkinen wrote: > > > Apart from reclaiming, /dev/sgx_vepc might disappear between the first > > > open() and subsequent ones. > > > > If /dev/sgx_vepc disappears, why is it a problem *for the software*, and > > not a sysadmin problem? > > Rather than disappearing, it could be that a program first gets all the > resources it needs before it gets malicious input, and then enter a > restrictive sandbox. In this case open() could be completely forbidden. > > I will improve the documentation and changelogs when I post the non-RFC > version; that could have been done better, sorry. > No worries, just trying to get bottom of the actual issue. /Jarkko
diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c index 64511c4a5200..59b9c13121cd 100644 --- a/arch/x86/kernel/cpu/sgx/virt.c +++ b/arch/x86/kernel/cpu/sgx/virt.c @@ -111,7 +111,7 @@ static int sgx_vepc_mmap(struct file *file, struct vm_area_struct *vma) return 0; } -static int sgx_vepc_free_page(struct sgx_epc_page *epc_page) +static int sgx_vepc_remove_page(struct sgx_epc_page *epc_page) { int ret; @@ -140,11 +140,17 @@ static int sgx_vepc_free_page(struct sgx_epc_page *epc_page) */ WARN_ONCE(ret != SGX_CHILD_PRESENT, EREMOVE_ERROR_MESSAGE, ret, ret); - return ret; } + return ret; +} - sgx_free_epc_page(epc_page); +static int sgx_vepc_free_page(struct sgx_epc_page *epc_page) +{ + int ret = sgx_vepc_remove_page(epc_page); + if (ret) + return ret; + sgx_free_epc_page(epc_page); return 0; }
Windows expects all pages to be in uninitialized state on startup. In order to implement this, we will need a ioctl that performs EREMOVE on all pages mapped by a /dev/sgx_vepc file descriptor: other possibilities, such as closing and reopening the device, are racy. Start the implementation by pulling the EREMOVE into a separate function. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- arch/x86/kernel/cpu/sgx/virt.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-)